Solving the alert: Zabbix unreachable poller processes more than 75% busy

  • HYPERMAN
    Junior Member
    • Sep 2018
    • 4

    #1

    Hello,

    For a long time we were plagued by the alert: Zabbix unreachable poller processes more than 75% busy

    There is a lot of info about this message on the net, but none of it really helped me. My main problem was finding out what exactly the unreachable pollers were doing. So I thought I'd share what I've discovered, even if it might not be 100% correct. I am still a Zabbix newbie, so feel free to correct me where necessary or suggest a better methodology.

    STEP 1: Cleaning up unreachable items
    • Go to Configuration > Hosts and click the 'Items' link of any host.
    • Open the filter and reset all fields to empty/all. IMPORTANT: This includes the 'Host' field, which was filled in by the link you just clicked.
    • Change State from 'all' to 'Not supported'. This will cause Status to change to 'Enabled'.
    Applying the filter lists all items that cannot be polled. Unfortunately, it also includes items from disabled hosts. I disabled any item that had no chance of becoming available.
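
    The same list can probably also be pulled from the Zabbix API instead of the GUI filter. This is a minimal sketch, not part of the steps above: it only builds the JSON-RPC request for item.get, filtering on state 1 (not supported) and using the monitored flag so that items on disabled hosts are left out. The token, the URL and the function name are hypothetical, and passing the token in the "auth" field is an assumption that fits older Zabbix versions (newer ones expect an Authorization header instead).

```shell
# Build an item.get request that returns not-supported items.
# $1 = API token (hypothetical; obtain it via user.login or the frontend).
unsupported_items_payload() {
    # state 1 = not supported; monitored=true keeps only enabled items
    # that belong to monitored (enabled) hosts.
    printf '{"jsonrpc":"2.0","method":"item.get","params":{"output":["itemid","name","key_","error"],"selectHosts":["host"],"filter":{"state":1},"monitored":true},"auth":"%s","id":1}\n' "$1"
}

# Usage (URL is a placeholder, adjust to your frontend):
# unsupported_items_payload "$TOKEN" |
#     curl -s -H 'Content-Type: application/json-rpc' -d @- \
#          https://zabbix.example.com/api_jsonrpc.php
```

    The "error" field in the output shows why each item is not supported, which helps decide whether it has any chance of becoming available again.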

    STEP 2: Cleaning up unreachable hosts.
    • Go to Configuration > Hosts again.
    • Look at the 'Availability' column with the red/green indicators for ZBX|SNMP|JMX|IPMI.
    • Everything red takes up capacity from an unreachable poller.
    Again, I disabled any host that would never come up again.

    STEP 3: Finding out what the unreachable pollers are doing.

    This is what led me to discover step 2.
    • Open a Linux terminal and run something like ps axu|grep -i unreachable
    • Note the unreachable pollers that are slow, e.g. I had some reporting 1 item in 60 seconds. Note the PID of that individual poller process, not of the main zabbix_server process.
    • Use strace to find out what that poller is doing, e.g. strace -p 1234
    • I got some I/O on an IP address (bingo) and a select on fd 0 with a timeout of 30 seconds.
    • For the fd number, run something like ls -hal /proc/1234/fd/0 (this is for PID 1234 and FD 0). You can now see which file/socket/... is causing the slowdown.
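
    The steps above can be sketched as two small shell helpers. This is only a sketch under assumptions: the function names are made up, and the parsing relies on the poller status text looking like "[got 1 values in 60.012 sec, ...]" in the ps output, which may differ between Zabbix versions.

```shell
# Print "PID seconds" for every unreachable poller whose last cycle took at
# least the given number of seconds (default 10). Expects lines in the
# format of `ps -eo pid,args` on stdin.
slow_unreachable_pollers() {
    # The [ ] in the pattern stops this awk command from matching its own
    # entry in the ps output.
    awk -v limit="${1:-10}" '
        /unreachable[ ]poller/ {
            for (i = 2; i + 2 <= NF; i++) {
                if ($i == "in" && $(i + 2) ~ /^sec/ && $(i + 1) + 0 >= limit + 0) {
                    print $1, $(i + 1)
                    break
                }
            }
        }'
}

# Show what a file descriptor of a process points to (file, socket, pipe).
# $1 = PID, $2 = descriptor number as seen in the strace output.
fd_target() {
    ls -l "/proc/$1/fd/$2"
}

# Usage:
# ps -eo pid,args | slow_unreachable_pollers 10
# strace -p 1234        # watch which descriptor the poller waits on
# fd_target 1234 0
```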
    This also yielded an interesting fact:

    In /etc/zabbix/zabbix_server.conf there was a line Timeout=30. It turns out that some of our items do, in rare circumstances, need 30 seconds to check, so we cannot lower this value. But it also meant that every unreachable SNMP host took 30 seconds to check, and there were a lot of these. It would be nice to be able to tune this setting specifically for the unreachable pollers.
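
    To get a feeling for how much Timeout matters here, a back-of-the-envelope estimate helps. The sketch below uses a deliberately simplified model that is an assumption, not something confirmed above: every unavailable host is retried once per UnavailableDelay seconds, each retry blocks one poller for the full Timeout, and the work is spread evenly over StartPollersUnreachable processes. The host and poller counts in the example calls are hypothetical.

```shell
# Rough busy percentage of the unreachable pollers.
# $1 = number of unavailable hosts
# $2 = Timeout (seconds a failing check blocks a poller)
# $3 = StartPollersUnreachable (number of unreachable poller processes)
# $4 = UnavailableDelay (seconds between retries of an unavailable host)
unreachable_busy_pct() {
    awk -v hosts="$1" -v timeout="$2" -v pollers="$3" -v delay="$4" \
        'BEGIN { printf "%.0f\n", 100 * hosts * timeout / (pollers * delay) }'
}

unreachable_busy_pct 4 30 2 60   # prints 100
unreachable_busy_pct 4 3 2 60    # prints 10
```

    In this model, four dead hosts are enough to saturate two unreachable pollers at Timeout=30, while the same hosts would cost only about 10% at Timeout=3. A result above 100 means the pollers cannot keep up at all. If Timeout has to stay high, removing dead hosts (step 2) or raising StartPollersUnreachable are the remaining levers.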
  • mauriciomsr20
    Junior Member
    • Sep 2018
    • 2

    #2
    This is explained near the end of this video by Dmitry Lambert (Zabbix team):
