Managing queue size with distributed monitoring

  • abjornson
    Member
    • Oct 2013
    • 34

    #1

    Managing queue size with distributed monitoring

    Hello,

    I've seen a lot on the topic of performance tuning, but less on tuning distributed Zabbix proxies. Any help with this problem, which has been plaguing me, is greatly appreciated.

    I have a proxy (2.2.8) in Kigali, Rwanda monitoring ~200 hosts with 75 items per host. My Zabbix server is with Amazon in EU west (also 2.2.8). Internet connectivity between the two is very good, though the latency is certainly higher than if the two were sitting right next to each other.

    Mostly, things are good. However, I do sometimes get gaps in my bandwidth graphs. I had been getting a lot of

    Code:
    2404:20150122:034606.303 SNMP agent item "If.32.ifOutErrors.["2"]" on host "AXK-CONCEPT PLUS" failed: first network error, wait for 15 seconds
    2778:20150122:034621.913 resuming SNMP agent checks on host "AXK-CONCEPT PLUS": connection restored
    for nodes with good connectivity and good, responsive snmpwalks. However, when I snmpwalk with -r 0 (to turn off retries) I do sometimes get timeouts. My monitoring is for wireless network devices, and there are times when a single SNMP UDP request will be lost. For reference, my Zabbix proxy has timeout=30, and my pollers, unreachable pollers, etc. are nowhere near busy (see image).



    I believe my gaps and errors are tied to the recent changes in 2.2.3 where SNMP retries were eliminated. When I upgraded from 2.2.7 to 2.2.8 (which allows one SNMP retry), the issue got slightly better. (I do wish I could add multiple retries, but I gather I can't do that without recompiling Zabbix?)
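To make the effect of retries concrete, here is a minimal sketch of the arithmetic. It assumes each SNMP request/response exchange is lost independently with the same probability, which real wireless links only approximate. The function name and the 2% loss figure are illustrative assumptions, not measurements from this setup.

```python
def poll_failure_probability(loss: float, retries: int) -> float:
    """Probability that the first request and every retry all go unanswered.

    loss    -- assumed probability that one request/response exchange is lost
    retries -- number of retries after the first attempt
    """
    if not 0.0 <= loss <= 1.0:
        raise ValueError("loss must be between 0 and 1")
    if retries < 0:
        raise ValueError("retries must be non-negative")
    # Each attempt is treated as independent, so the failures multiply.
    return loss ** (retries + 1)


if __name__ == "__main__":
    # Hypothetical 2% exchange loss, compared across retry counts.
    for retries in (0, 1, 3):
        print(retries, poll_failure_probability(0.02, retries))
```

Under these assumptions, going from zero retries to one retry cuts the chance of a failed poll from `loss` to `loss` squared, which is consistent with the improvement seen after moving to a version that allows one retry.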

    I also know that the devices I monitor don't support bulk SNMP gets. I discovered that I could disable bulk requests for the whole proxy by setting EnableSNMPBulkRequests=0. This made a huge difference on my proxy, greatly reducing its queue size. (See graph: bulk requests were disabled at 5 pm; note the change.)
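With bulk requests disabled, each item costs one SNMP round trip. A rough sizing sketch for that case follows. The host and item counts come from the post above; the 60-second update interval and the 50 ms round-trip time are assumptions chosen only for illustration, and the function name is hypothetical.

```python
import math


def required_pollers(hosts: int, items_per_host: int,
                     interval_s: float, seconds_per_request: float) -> int:
    """Lower bound on pollers needed when every item is its own request.

    Ignores timeouts, retries, and processing overhead, so treat the
    result as a floor rather than a recommendation.
    """
    requests_per_second = hosts * items_per_host / interval_s
    # Each poller handles one request at a time, so the number of busy
    # pollers equals request rate multiplied by time per request.
    return math.ceil(requests_per_second * seconds_per_request)


if __name__ == "__main__":
    # 200 hosts x 75 items (from the post); interval and RTT are assumed.
    print(required_pollers(200, 75, interval_s=60, seconds_per_request=0.05))
```

The point of the sketch is only that the request rate scales with the item count once bulk gets are off, so occasional timeouts (at up to 30 seconds each) can occupy pollers far longer than the normal round trip suggests.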



    However, the change made the queue on my Zabbix *server* balloon. (See graph below: bulk requests were disabled at 5 pm; note the increase in server queue.) I'm assuming this is because SNMP data is now being sent individually to the server instead of in bulk? What can I do to reduce the queue size on my server and get it back to normal?



    Any suggestions? Thanks so much!
  • ingus.vilnis
    Senior Member
    Zabbix Certified Trainer
    Zabbix Certified Specialist
    Zabbix Certified Professional
    • Mar 2014
    • 908

    #2
    Hello,

    That is a very interesting graph you have there, and I am also curious what causes the queue to flip from proxy to server.

    Assuming that you have read the topics on performance tuning, I can tell you that there is no difference whether you do it for the server only or for proxies as well. The same conditions apply to both.

    First thing to do is to check all performance graphs on both server and proxy, particularly "Zabbix internal process busy" and "Zabbix cache usage". Do you see any spikes there?

    Then check the zabbix_server.conf and zabbix_proxy.conf files and adjust the settings according to the graphs. You may post the files here as well if you like.

    Next, go to Administration -> Queue and check all views there (Overview, Overview by proxy, and Details can be selected from the top right dropdown). There you will find more clues about what is going on. The "Details" view in particular shows the exact hosts and how long the items have been delayed.

    What catches my eye is the number of items in the server queue. Your proxy queue dropped to about 20 items on average after the EnableSNMPBulkRequests change, while your server queue jumped to 7000. It looks like a lost network segment or something similar. This needs to be investigated.

    Another very important aspect is the database on both server and proxy. I am not saying that it is causing problems in your particular case, but don't forget to check and tune that as well.

    What database are you using on both servers? What configuration parameters?

    And the last thing: logs. Server and proxy log files may contain lots of useful information. Maybe you can spot something more than network errors there.
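As one way to start on the logs, the sketch below tallies "first network error" entries per host, assuming the proxy log lines follow the format quoted in the first post. The regular expression, function name, and sample data are illustrative; log formats can differ between Zabbix versions.

```python
import re
from collections import Counter

# Matches lines such as:
#   2404:20150122:034606.303 SNMP agent item "..." on host "NAME" failed: first network error, ...
FIRST_ERROR = re.compile(
    r'^\s*\d+:\d{8}:\d{6}\.\d{3} .* on host "([^"]+)" failed: first network error'
)


def count_first_network_errors(lines):
    """Return a Counter mapping host name to number of first-network-error lines."""
    counts = Counter()
    for line in lines:
        match = FIRST_ERROR.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample taken from the log excerpt in the first post.
SAMPLE = [
    '2404:20150122:034606.303 SNMP agent item "If.32.ifOutErrors.["2"]" on host '
    '"AXK-CONCEPT PLUS" failed: first network error, wait for 15 seconds',
    '2778:20150122:034621.913 resuming SNMP agent checks on host '
    '"AXK-CONCEPT PLUS": connection restored',
]

if __name__ == "__main__":
    # For a real run, pass an open log file instead of SAMPLE.
    for host, n in count_first_network_errors(SAMPLE).most_common():
        print(host, n)
```

Sorting hosts by error count this way can show whether the failures cluster on a few devices or are spread evenly, which helps separate a per-device problem from a network-wide one.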

    That is all I can tell you for now. Maybe it can lead you onto the right path.

    Best Regards,
    Ingus
