Getting SNMP values slows down over time

  • kargh
    Junior Member
    • Feb 2014
    • 21

    #1

    Hello,

    I've been doing a lot of config tweaking (trying to get the minimum number of pollers vs performance). I woke up this morning to the following graph:
    [Attached image: zabbix-busyprocess-small.jpg]
    I fired up
    Code:
    watch -n 0.2 ps -fu zabbix
    and noticed that the pollers were taking 2-3 seconds to grab values, versus the usual 0.0-0.25 seconds. That also corresponds to the red poller busy line in the graph. Over the course of several hours, the time they spend busy increases by 20% or more.
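To put numbers on the slowdown instead of eyeballing `watch`, the same `ps` output could be parsed and averaged. This is only a minimal Python sketch: the process-title format in the comment is an assumption (the post does not show the actual `ps` lines), and `parse_pollers`, `sample_once`, and `summarize` are hypothetical helper names.

```python
import re
import subprocess

# Assumed process-title format (not shown in the post), for example:
#   /usr/sbin/zabbix_server: poller #3 [got 150 values in 2.41 sec, getting values]
# Anchoring on ": poller #" is meant to skip "unreachable poller" and similar lines.
POLLER_RE = re.compile(
    r": poller #(?P<num>\d+) \[got (?P<values>\d+) values in (?P<secs>[0-9.]+) sec"
)

def parse_pollers(ps_output):
    """Return (poller_number, values, seconds) tuples found in ps output."""
    results = []
    for line in ps_output.splitlines():
        match = POLLER_RE.search(line)
        if match:
            results.append(
                (int(match["num"]), int(match["values"]), float(match["secs"]))
            )
    return results

def sample_once():
    """Run the same ps command as above and parse its output."""
    out = subprocess.run(
        ["ps", "-fu", "zabbix"], capture_output=True, text=True
    ).stdout
    return parse_pollers(out)

def summarize(samples):
    """Average seconds per poll cycle across parsed pollers, or None if empty."""
    if not samples:
        return None
    return sum(secs for _, _, secs in samples) / len(samples)
```

Calling `summarize(sample_once())` every few seconds and logging the result with a timestamp would show whether the per-cycle time creeps up gradually or jumps at a particular moment.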

    I've increased the number of pollers started to 20 (from 15) and will let it run for 24 hours to see what happens. My current monitoring stats are:
    Number of hosts (monitored): 212
    Number of items (monitored): 29246
    Number of triggers (enabled): 6386
    Required server performance, new values per second: 414.25

    We have some network issues with a data center in another country, so I also start 15 unreachable pollers to take over when the regular pollers time out.

    If I understand the process correctly, when a poller times out and the host is unreachable, an unreachable poller takes over trying to make the connection and the regular poller is freed up to continue getting values. This decreases the likelihood of all the pollers getting hung up waiting for replies, which in turn reduces false positives. If I'm wrong, please let me know. That is what I'm basing my tweaking on.

    Also, has anyone seen the above behavior on their setup? Any thoughts on why a poller grabbing 150 values takes under 0.5 seconds right after a restart but 2-3 seconds a few hours later? Is there a bottleneck somewhere I should look into?
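For scale, here is a rough back-of-the-envelope check in Python using only the figures quoted in this post. It assumes, purely for illustration, that every monitored item is collected by a regular poller (which may not hold if some items are trapper or active-agent items), and it treats 0.5 s and 3 s as the fast and slow cases for a 150-value batch.

```python
items = 29246    # monitored items, from the stats above
nvps = 414.25    # required new values per second, from the stats above
pollers = 20     # pollers after the increase from 15

# Average time between updates of a single item.
avg_interval = items / nvps

# Values per second each poller must sustain.
# Assumption: all items are handled by regular pollers.
required_per_poller = nvps / pollers

def poller_rate(values, seconds):
    """Values per second a poller achieves for one batch."""
    return values / seconds

fast = poller_rate(150, 0.5)  # shortly after a restart
slow = poller_rate(150, 3.0)  # a few hours later, worst case quoted

print(f"average item interval: {avg_interval:.1f} s")                    # 70.6 s
print(f"required per poller:   {required_per_poller:.2f} values/s")      # 20.71
print(f"observed per poller:   {fast:.0f} values/s fast, {slow:.0f} values/s slow")  # 300, 50
```

Under those assumptions even the slow case (50 values/s) is above the roughly 21 values/s each poller needs, so the slowdown shows up as a rising busy percentage rather than as missed values. That suggests looking at what each request is waiting on instead of simply adding pollers.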

    Thanks for any comments and/or suggestions!

    (Also, the first few hours on that graph were from some tweaking that I was doing. Hence the values being all over the place until around 12pm.)