Hello,
For a long time we were plagued by the alert: "Zabbix unreachable poller processes more than 75% busy".
There is a lot of information about this message on the net, but none of it really helped me. My main problem was finding out what exactly the unreachable pollers were doing. So I thought I'd share what I've discovered, even if it might not be 100% correct. I am still a Zabbix newbie, so feel free to correct me where necessary or to suggest a better methodology.
STEP 1: Cleaning up unreachable items
- Go to Configuration > Hosts and click any 'Items' link.
- Open the filter and clear all fields back to empty/all. IMPORTANT: this includes the 'Host' field that was just filled in for you.
- Change 'State' from 'all' to 'Not supported' (this automatically sets 'Status' to 'Enabled'). You now have a list of every enabled item that is not supported; fix or disable these so the pollers stop wasting time on them.
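If you prefer the command line, the same list can be pulled from the Zabbix API. A minimal sketch — the URL and AUTH_TOKEN are placeholders, and it assumes your API version supports filtering `item.get` on `state` (1 = not supported):

```shell
# Hypothetical frontend URL and auth token -- replace with your own.
API_URL="http://zabbix.example.com/api_jsonrpc.php"
PAYLOAD='{"jsonrpc":"2.0","method":"item.get","params":{"filter":{"state":1},"output":["itemid","name","error"]},"auth":"AUTH_TOKEN","id":1}'
# Show the request we would send; uncomment the curl line to run it for real.
echo "$PAYLOAD"
# curl -s -H 'Content-Type: application/json' -d "$PAYLOAD" "$API_URL"
```

The 'error' field in the output tells you why each item became unsupported, which is handy when deciding whether to fix or disable it.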
STEP 2: Cleaning up unreachable hosts.
- Go again to Configuration > Hosts
- Look at the 'Availability' column, with red/green LEDs for ZBX|SNMP|JMX|IPMI.
- Everything red takes up capacity from an unreachable poller.
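Each red ZBX icon can be cross-checked from the shell before digging further. A rough sketch, assuming bash (for its /dev/tcp pseudo-device) and the coreutils `timeout` command are available; the host is just an example and 10050 is the default Zabbix agent port:

```shell
# Probe a TCP port roughly the way a poller would try to reach it.
check_port() {
    host=$1; port=$2
    if timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "$host:$port open"
    else
        echo "$host:$port closed/filtered - an unreachable poller will keep retrying it"
    fi
}
check_port 127.0.0.1 10050   # default Zabbix agent port
```

Anything that reports closed/filtered is a host you should either fix or disable in Zabbix, because every retry occupies an unreachable poller for up to Timeout seconds.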
STEP 3: Finding out what the unreachable pollers are doing.
This is what led me to discover step 2.
- Open a Linux terminal and run something like ps aux | grep -i unreachable
- Note the unreachable pollers that are slow. E.g. I had some saying 1 item in 60 seconds. Note the PID (of the individual poller process, not of the main zabbix_server process).
- Use strace to find out what that process is doing, e.g. strace -p 1234
- I got some I/O on an IP address (bingo) and a select on fd 0 with a timeout of 30 seconds.
- For the fd number, run something like ls -hal /proc/1234/fd/0 (this is for PID 1234 and FD 0). You can now see what file/socket/... is causing the slowdown.
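The fd-inspection step can be sketched like this; with a real poller you would substitute the PID noted from ps/strace (e.g. 1234), but here we inspect our own shell so the commands run anywhere:

```shell
# Stand-in for the slow poller's PID -- replace $$ with the PID from ps.
PID=$$
# Show what each fd points at; for a stuck poller, the fd from the
# select() call usually resolves to the socket it is waiting on.
ls -l /proc/"$PID"/fd
```

For sockets this prints something like socket:[12345]; that inode number can then be matched against ss or netstat output to get the remote IP.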
In /etc/zabbix/zabbix_server.conf there was a line Timeout=30. It turns out some of our items do, in rare circumstances, need 30 seconds to complete a check, so we cannot lower this value. But it also meant every unreachable SNMP host took 30 seconds to check, and there were a lot of these. It would be nice to be able to tune this setting specifically for the unreachable pollers.
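Timeout itself cannot be set per poller type, but zabbix_server.conf does have a few knobs that only affect how unreachable hosts are handled. A sketch — the values below are illustrative examples, not recommendations:

```
# /etc/zabbix/zabbix_server.conf (fragment)
Timeout=30                  # global check timeout - ours has to stay at 30
StartPollersUnreachable=10  # more dedicated pollers for unreachable hosts
UnreachableDelay=15         # seconds between successive checks of an unreachable host
UnreachablePeriod=45        # after this many seconds the host is treated as unavailable
UnavailableDelay=60         # how often an unavailable host is rechecked
```

Raising StartPollersUnreachable does not make the slow checks faster, but it spreads them over more processes, which is what the "more than 75% busy" alert is really measuring.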