Hello,
I've been doing a lot of config tweaking (trying to find the minimum number of pollers that still gives acceptable performance). I woke up this morning to the following graph:

I fired up
Code:
watch -n 0.2 ps -fu zabbix
and noticed that the pollers were taking 2-3 seconds to grab values, versus an average of 0.00000-0.25 seconds. That also corresponds to the red poller busy line in the graph. Over the course of several hours, the amount of time they are busy increases by 20% or more.
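For anyone wanting to reproduce that poller busy line, it can also be graphed with Zabbix's internal check items; this is just a sketch assuming the standard internal check syntax, so adjust the process type/mode for your version:
Code:
zabbix[process,poller,avg,busy]
zabbix[process,unreachable poller,avg,busy]
Graphing both side by side should show the same creep in busy time that the ps output does.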
I've increased the number of starting pollers to 20 (from 15) and will let it run for 24 hours to see what happens. Currently my monitoring stats are as follows:
Number of hosts (monitored) 212
Number of items (monitored) 29246
Number of triggers (enabled) 6386
Required server performance, new values per second 414.25
We have some network issues with a data center in another country, so I also start 15 unreachable pollers to take over when the starting pollers time out.
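For reference, this is roughly what the relevant part of my zabbix_server.conf looks like now (a sketch; StartPollers and StartPollersUnreachable are the values quoted above, the rest I've left alone):
Code:
# regular pollers, bumped from 15 to 20
StartPollers=20
# pollers dedicated to hosts that have stopped responding
StartPollersUnreachable=15
# Timeout left at the default (3 seconds, I believe)
# Timeout=3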
If I understand the process correctly, when a poller times out and the host is unreachable, the unreachable poller takes over trying to make the connection and the starting poller is freed up to continue getting values. This decreases the likelihood of all the pollers getting hung up waiting for replies, which in turn decreases false positives. If I'm wrong, please let me know; that is what I'm basing my tweaking on.
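As far as I can tell, the parameters below are the ones that govern that takeover behaviour; the values shown are the stock defaults from zabbix_server.conf rather than anything I've tuned, so treat them as assumptions:
Code:
# seconds of failed checks before a host is treated as unreachable
UnreachablePeriod=45
# how often to retry a host while it is in the unreachable period
UnreachableDelay=15
# how often to check an unavailable host to see if it has come back
UnavailableDelay=60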
Also, has anyone seen the above behavior on their setup? Any thoughts on why a poller grabbing 150 values after a restart takes under 0.5s but a few hours later it takes 2-3 seconds? Is there a bottleneck somewhere I should look into?
Thanks for any comments and/or suggestions!
(Also, the first few hours on that graph were from some tweaking that I was doing. Hence the values being all over the place until around 12pm.)