Performance gains by using prime numbers for update interval(?)
  • Markus
    Member
    • Jan 2006
    • 39

    #1

    Performance gains by using prime numbers for update interval(?)

    Hi everybody!

    Today I developed a _theory_ about how to hopefully achieve performance gains by selecting prime numbers for the update interval ('delay') of monitored items. Please let me know what you think as so far this is just a theory! I don't have a large enough Zabbix installation to try it out in practice...

    Background: The update interval determines how often an item is checked by the Zabbix server (ignoring active checks here!). Commonly the update interval value is set to multiples of 30 seconds, e.g. 30, 60, 120 or 300 seconds. This effectively means that frequently many checks have to be executed at the same time. For instance, if you had 5 checks at a 30-second interval and 5 checks at a 60-second interval, all 10 checks are executed every 60 seconds. Furthermore there are 'lulls' of 30 seconds where not much happens. Please correct me if I am wrong here!!!

    My idea is to spread out the checks so that they are still performed at regular intervals but not too many at the same time. In order to achieve this, one has to choose intervals for the individual checks so that their Least Common Multiple (LCM) is as large as possible. In the example above the 10 checks (5 x 30 secs and 5 x 60 secs) share an LCM of 60 (2*30 = 1*60), so every 60 seconds all checks are run. A better choice would be to use the following intervals for the ten checks, each of which is a prime number: 23, 29, 31, 37, 41, 47, 53, 59, 61, 67. Since the LCM of two distinct primes is their product, the earliest that any two of the ten checks coincide is after 667 (= 23*29) seconds; otherwise they are nicely spread out over time.
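
    A quick way to test this without a large installation is to simulate both schemes and count how many checks land on the same second. The sketch below is purely illustrative (plain Python, not Zabbix code) and assumes every check first fires one full interval after t = 0:

    ```python
    from collections import Counter

    def concurrency(intervals, horizon=3600):
        """Count checks per second; return (worst-case concurrency, busy seconds)."""
        hits = Counter()
        for delay in intervals:
            for t in range(delay, horizon, delay):
                hits[t] += 1
        return max(hits.values()), len(hits)

    classic = [30] * 5 + [60] * 5                       # 5 checks @ 30 s, 5 @ 60 s
    primes  = [23, 29, 31, 37, 41, 47, 53, 59, 61, 67]  # one prime interval per check

    print(concurrency(classic))  # every 60 s all 10 checks fire together
    print(concurrency(primes))   # at most 2 checks coincide within the first hour
    ```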

    That's the theory anyway. Am I completely wrong here? I wonder whether anybody is willing to try this out on a real Zabbix server. If you prove me wrong - fine, no problem.
    It would be nice for everybody if I were right, though.

    Markus
  • cameronsto
    Senior Member
    • Oct 2005
    • 148

    #2
    This is an interesting theory. I might be able to try this out later today. This also might explain why one of my hosts is having issues responding to SNMP requests. I see the requests come through in the snmpd logs, but the zabbix_server log says that it times out waiting for a response. I'm wondering if snmpd is able to respond to the first few requests but isn't able to keep up with the frequent requests.

    Anyway, I'll let you know what I find.

    -cameron

    • edeus
      Senior Member
      • Aug 2005
      • 120

      #3
      Now if you could only come up with a solution to break down numbers into their prime components quickly

      • Markus
        Member
        • Jan 2006
        • 39

        #4
        Performance gains by using prime numbers for update interval(?)

        Originally posted by edeus
        Now if you could only come up with a solution to break down numbers into their prime components quickly
        Well, I am not trying to break public-key cryptography here. If you just need a list of prime numbers, look here..

        • elkor
          Senior Member
          • Jul 2005
          • 299

          #5
          Well, essentially you are correct.

          The issue, however, is procedural and arises when you take maintenance of a large environment into account (particularly if you are not the one who maintains the environment after deployment).

          Explaining this to a number of admins or operators and asking them to maintain a rotating prime-number delay scheme is already making me bang my head on a desk. Many people simply don't have that level of attention to detail.

          In addition, this scheme is simple enough to understand with 10 items for a given host. What if each host has 30 items? You would potentially need to reuse primes in order to keep your checks within your tolerance window. AND, chances are there would probably be a lot of concurrency between hosts.

          All that being said, and returning to my original statement, you're right... it would reduce concurrency... just maybe not enough to deal with the complexity of it.

          • Markus
            Member
            • Jan 2006
            • 39

            #6
            Performance gains by using prime numbers for update interval(?)

            You are of course right that one would have to re-use prime numbers and I fully agree that the practical implications are an issue here. For most people it probably is not a real issue anyway since Zabbix seems pretty fast and hardware is relatively cheap nowadays. Maybe one day I will have responsibility for a big enough Zabbix installation so I can try out my theory in practice. On the other hand it might simply be cheaper to spend a few (Australian) dollars on bigger hardware instead.

            Markus

            • edeus
              Senior Member
              • Aug 2005
              • 120

              #7
              I have a Pentium 866 doing my monitoring. This copes with about 25 servers @ 30 items each.

              Would be interesting to see how extensively they use Zabbix. Post your stats!

              • Alexei
                Founder, CEO
                Zabbix Certified Trainer
                Zabbix Certified Specialist
                Zabbix Certified Professional
                • Sep 2004
                • 5654

                #8
                I like that idea in general. However, I do not see how this can be implemented for a large number of items.

                A better approach could be an advanced calculation of the initial timestamp for the next refresh in order to minimize spikes. Currently the next check is guaranteed to satisfy (timestamp mod refresh_time == 0).
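
                In other words, under the current rule every item with the same refresh_time fires at the same timestamps, which is exactly what produces the spikes. A trivial illustration (plain Python, not Zabbix code):

                ```python
                refresh_time = 300
                # All items sharing refresh_time = 300 are scheduled whenever
                # timestamp % refresh_time == 0, so they all fire together.
                spike_times = [t for t in range(1, 1801) if t % refresh_time == 0]
                print(spike_times)  # [300, 600, 900, 1200, 1500, 1800]
                ```
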
                Alexei Vladishev
                Creator of Zabbix, Product manager
                New York | Tokyo | Riga
                My Twitter

                • SAT QPass
                  Member
                  • Oct 2005
                  • 61

                  #9
                  What about introducing a bit of randomness into the timestamp? Rather than being a perfectly fixed interval, it becomes a guideline (say a tolerance of +/-5%) or some other suitable variance, which would allow for better spike management. You could achieve this simply by applying RAND to the check interval with a multiplier. You could even make the tolerance a db/config value so that those who must/want perfect intervals could have them.
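
                  A minimal sketch of that idea, with a configurable tolerance (plain Python, made-up function name, not actual Zabbix code):

                  ```python
                  import random

                  def jittered_delay(delay, tolerance=0.05):
                      """Scale the nominal delay by a random factor within +/- tolerance,
                      e.g. a 300 s interval with 5% tolerance becomes roughly 285..315 s."""
                      factor = 1 + random.uniform(-tolerance, tolerance)
                      return int(round(delay * factor))

                  # Ten items nominally on a 300-second interval no longer line up exactly:
                  print([jittered_delay(300) for _ in range(10)])
                  ```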

                  What do you think?

                  • edeus
                    Senior Member
                    • Aug 2005
                    • 120

                    #10
                    By using these odd values, would this cause a lot of load for graphing?

                    Perhaps not, but I am not sure how the graphing is coded. The variable +/- is a good idea though.

                    • Markus
                      Member
                      • Jan 2006
                      • 39

                      #11
                      This is similar to what I am currently planning to implement for a potentially large scale deployment of Zabbix.

                      In fact it's going to be a mix of the 'randomness' and 'prime numbers' methods suggested in this thread. For example, instead of using a 300-second interval (5 min) for all items I thought of randomly picking values out of a set including 271, 277, 281, 283, 293, 307, 311, 313, 317 (ca. 4:30 min - 5:30 min).
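
                      For illustration, the selection could be as simple as the following sketch (hypothetical item keys, plain Python, not Zabbix code):

                      ```python
                      import random

                      # Primes roughly between 4.5 and 5.3 minutes, as listed above.
                      PRIMES_NEAR_300 = [271, 277, 281, 283, 293, 307, 311, 313, 317]

                      items = ["cpu.load", "disk.free", "net.if.in"]  # hypothetical item keys
                      delays = {item: random.choice(PRIMES_NEAR_300) for item in items}
                      print(delays)
                      ```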

                      Sometime later today I will try to analyse whether the 'random' method mentioned by SAT QPass or the one I described above produces fewer concurrent checks.

                      Edeus might have a valid point though by wondering whether any such scheme might actually have a negative impact on performance because it might clash with the way Zabbix is designed. For instance, how would Zabbix (and the database backend) cope if it had to run a few checks every second instead of lots of them every 30 seconds? Furthermore, maybe we are wasting our time here on something which would bring only small improvements while other changes, e.g. active checks instead of passive ones, might give much better performance improvements.

                      I hope that sometime later in February I will have the chance to apply all this to a real Zabbix installation. I will post the results...

                      Markus

                      • Alexei
                        Founder, CEO
                        Zabbix Certified Trainer
                        Zabbix Certified Specialist
                        Zabbix Certified Professional
                        • Sep 2004
                        • 5654

                        #12
                        Thanks for the ideas. I'm going to implement the following schema:

                        1. When an item is created nextcheck (next time to refresh) will be set to itemid%3600
                        2. Next refresh time will be calculated as, simplified, (itemid%3600+N*delay)

                        This will ensure some randomness and will help to eliminate possible peaks.
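
                        A minimal sketch of that schema exactly as stated above (plain Python; the real Zabbix implementation may of course differ):

                        ```python
                        def nextcheck(itemid, delay, n):
                            """n-th scheduled check time under the simplified schema:
                            the per-item offset itemid % 3600 staggers items over an hour,
                            and each subsequent check lands delay seconds later (n = 0, 1, 2, ...)."""
                            offset = itemid % 3600           # per-item offset in seconds (0..3599)
                            return offset + n * delay

                        # Two items with the same 300-second delay no longer fire together:
                        print([nextcheck(101, 300, n) for n in range(3)])   # [101, 401, 701]
                        print([nextcheck(2045, 300, n) for n in range(3)])  # [2045, 2345, 2645]
                        ```
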
                        Alexei Vladishev
                        Creator of Zabbix, Product manager
                        New York | Tokyo | Riga
                        My Twitter

                        • Markus
                          Member
                          • Jan 2006
                          • 39

                          #13
                          Just a quick follow-up: the 'random' method suggested a few posts earlier in this thread is possibly the way to go. I did some quick analysis and it distributes the tests much better over time than my 'prime numbers' concept.

                          I concede that my initial idea was defeated, but I still win by getting a better Zabbix anyway.

                          Markus

                          • SAT QPass
                            Member
                            • Oct 2005
                            • 61

                            #14
                            Originally posted by Markus
                            In fact it's going to be a mix of the 'randomness' and 'prime numbers' methods suggested in this thread. For example, instead of using a 300-second interval (5 min) for all items I thought of randomly picking values out of a set including 271, 277, 281, 283, 293, 307, 311, 313, 317 (ca. 4:30 min - 5:30 min).
                            I think statistically you will see a more even distribution if you use a mechanism similar to the one I suggested. Ideally it would also be configurable to allow for more or less tolerance, which would probably be directly linked to the size of the monitored pool.

                            Originally posted by Markus
                            Edeus might have a valid point though by wondering whether any such scheme might actually have a negative impact on performance because it might clash with the way Zabbix is designed. For instance, how would Zabbix (and the database backend) cope if it had to run a few checks every second instead of lots of them every 30 seconds? Furthermore, maybe we are wasting our time here on something which would bring only small improvements while other changes, e.g. active checks instead of passive ones, might give much better performance improvements.

                            I will post the results...
                            I think that is a very valid point; how well Zabbix handles continuous tests vs. spikes of tests would definitely require a bit of analysis. Obviously, as you correctly observed, active checks transfer the load to the clients; however, I am personally striving to minimize the impact on the clients as much as possible (and just put a robust machine on the Zabbix side).

                            I myself am running Zabbix against several hundred hosts with nearly 60 tests per node (that is just the baseline). Some hosts have far more.
                            Last edited by SAT QPass; 14-02-2006, 23:52. Reason: Fixed some BB code blocks.

                            • SAT QPass
                              Member
                              • Oct 2005
                              • 61

                              #15
                              Originally posted by Alexei
                              Thanks for the ideas. I'm going to implement the following schema:

                              1. When an item is created nextcheck (next time to refresh) will be set to itemid%3600
                              2. Next refresh time will be calculated as, simplified, (itemid%3600+N*delay)

                              This will ensure some randomness and will help to eliminate possible peaks.
                              I am curious how this will work... I am not a C programmer, so forgive me if I am mistaken, but would not the value of itemid%3600 equal itemid every time unless itemid > 3600? And what is the value of N? And my next question: is this in whole seconds or ms? I currently have itemids of >100 and I am not even done listing out my complete test template.

                              Assuming the above is true, as itemid increases, distribution increases proportionally.

                              Again, I am not a programmer, and I may have this all turned around.
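
                              For reference, a tiny worked example of the modulo arithmetic in question (plain Python): for an itemid below 3600 the offset is indeed the itemid itself; above 3600 it wraps around.

                              ```python
                              for itemid in (100, 3599, 3600, 4000, 100001):
                                  print(itemid, "->", itemid % 3600)
                              # 100 -> 100, 3599 -> 3599, 3600 -> 0, 4000 -> 400, 100001 -> 2801
                              ```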
