
**raycast** · 16-03-2006, 15:26

Prime number based scheduling

Using prime numbers for scheduling sounds good at first, but is actually even worse. Instead, the probing should be queued, and all probes should share a common denominator (e.g. 1 minute, 5 minutes).
Then Zabbix needs to spread out the probes over the available interval.

If you do prime number based scheduling, sooner or later all probes will come due at the same moment (at every common multiple of their intervals). So most of them will time out once in a while!
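
To make that concrete, here is a small sketch. The three intervals (7, 11 and 13 seconds) are made up for illustration, and I assume all probes start together at t=0; it only shows when they all line up again.

```python
from math import lcm  # available since Python 3.9


def coincidence_times(intervals, horizon):
    """Times in [1, horizon] at which every probe is due at once,
    assuming all probes started together at t=0."""
    step = lcm(*intervals)
    return list(range(step, horizon + 1, step))


# Hypothetical prime update intervals, in seconds.
intervals = [7, 11, 13]

print(lcm(*intervals))                     # 1001
print(coincidence_times(intervals, 3600))  # [1001, 2002, 3003]
```

So primes only push the pile-up further out; they do not remove it, and pairs of probes still collide much more often than that.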

Instead, if they share a common denominator, you have (or will eventually end up with) a fixed pattern, i.e. in second 1 ping the web servers, in second 2 ping the mail servers, in second 3 ping the routers, in second 4 test http, in second 5 test smtp, ...

Given that all your probes have an ID number, bit-reverse it (to avoid placing sequentially added probes next to each other) and then use common-interval/(2^32)*reversed-seqnr as scheduling offset.

Then probe 0 will run at t+0, probe 1 at t+1/2, probe 2 at t+1/4, probe 3 at t+3/4, probe 4 at t+1/8, probe 5 at t+5/8, probe 6 at t+3/8, probe 7 at t+7/8, probe 8 at t+1/16, probe 9 at t+9/16 and so on (all as fractions of the common interval), you get the idea.
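
A minimal sketch of that offset calculation, assuming 32-bit probe IDs. The function names `bit_reverse32` and `schedule_offset` are my own and not anything from Zabbix; `Fraction` is used only so the offsets print exactly.

```python
from fractions import Fraction


def bit_reverse32(n):
    """Reverse the bit order of a 32-bit unsigned integer."""
    result = 0
    for _ in range(32):
        result = (result << 1) | (n & 1)
        n >>= 1
    return result


def schedule_offset(probe_id, common_interval):
    """common-interval / 2^32 * reversed-seqnr, as an exact fraction."""
    return Fraction(bit_reverse32(probe_id), 2**32) * common_interval


# Offsets as fractions of the common interval for the first ten probes.
for probe_id in range(10):
    print(probe_id, schedule_offset(probe_id, 1))
```

With a common interval of 60 seconds, `schedule_offset(1, 60)` gives 30 and `schedule_offset(2, 60)` gives 15.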

This scheme basically ensures (if you have a useful common interval) that no two probes are ever run at the same time. If you have n probes, common-interval/(2^(floor(log2(n))+1)) should be the main scheduling interval you need (i.e. common-interval/N, with N the next power of 2 which is larger than n).
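
As a rough sketch of the slot arithmetic: `slot_count` and `slot_index` are names I made up, and the 60 second common interval and n=5 are just example values. Reversing the ID within log2(N) bits gives the same position as the 32-bit reversal divided down to N slots.

```python
def slot_count(n):
    """Next power of two strictly larger than n, i.e. 2^(floor(log2(n))+1)."""
    return 1 << n.bit_length()


def slot_index(probe_id, slots):
    """Bit-reverse probe_id within log2(slots) bits to get its slot number."""
    bits = slots.bit_length() - 1
    index = 0
    for _ in range(bits):
        index = (index << 1) | (probe_id & 1)
        probe_id >>= 1
    return index


n = 5                 # number of probes (example value)
common_interval = 60  # seconds (example value)

slots = slot_count(n)                            # 8
tick = common_interval / slots                   # 7.5 seconds per slot
used = [slot_index(i, slots) for i in range(n)]  # [0, 4, 2, 6, 1]

print(slots, tick, used)
```

Every probe lands in its own slot, and the occupied slots are spread over the whole interval instead of bunching up at the start.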

Note that this scheme also avoids running all ping checks at the same time, all http checks at the same time, all $very-expensive checks at the same time. Instead it should distribute them evenly (at least when they were inserted in sequence into the database, or you have some other ordered numbers for them).

This approach is used for linear hashing, btw.

A different scheme would basically just run the checks whenever they come due, but shift them slightly back when they time out, i.e. use check_finished + interval for the next check. When a check times out, this gap will be larger than when it succeeds. If you have high timeouts this will however decrease the test density a lot; and if timeout == interval it won't help anything.
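
A tiny simulation of that second scheme, with invented durations (a check normally takes 1 second, and one of them times out after 30 seconds). `next_check_times` is a hypothetical helper, not a Zabbix function.

```python
def next_check_times(start, interval, durations):
    """Start time of each check when the next run is scheduled
    at check_finished + interval."""
    times = []
    t = start
    for duration in durations:
        times.append(t)
        t = t + duration + interval  # check_finished + interval
    return times


# 60 s interval; the third check hits a 30 s timeout.
print(next_check_times(0, 60, [1, 1, 30, 1]))  # [0, 61, 122, 212]
```

The gap after the timed-out check is 90 seconds instead of 61, which is the loss of test density mentioned above.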

