I've just upgraded our Zabbix infrastructure from the last 2.4 release to the current 3.2.6, and it seems there's something heavily messed up with the server->proxy->agent communication since then.
I noticed this because our Zabbix server suddenly raised the alert "More than 100 items without data in the last 10 minutes", so I started investigating the queue.
There are about a thousand items in the queue, all configured to be polled via proxies in their respective local networks.
I've set debugging to the highest level and have been going through the logs on the server, proxy and agents and inspecting traffic with tcpdump, but none of it adds up.
The proxy log has many lines like "Zabbix agent item "GetClockDrift" on host "MY-SVR01" failed: first network error, wait for 15 seconds". After each of these, the proxy->agent communication dies completely for some time, even though the hosts are on the same network, ping and other communications work fine, and the command executes flawlessly when run manually with zabbix_get.
The item is set to check every 300 seconds, but I still see it being attempted every ~18 seconds.
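To put a number on the retry spacing rather than eyeballing it, a minimal sketch like the following could be run against the proxy log. It assumes each log line starts with "PID:YYYYMMDD:HHMMSS.mmm" followed by the message quoted above. The function name retry_intervals and the timestamps in the sample lines are made up for illustration and are not taken from the real logs.

```python
import re
from datetime import datetime

# Assumed proxy log layout: "PID:YYYYMMDD:HHMMSS.mmm message".
# Only the "first network error" message quoted in the post is matched.
LINE_RE = re.compile(
    r'^\s*\d+:(?P<date>\d{8}):(?P<time>\d{6}\.\d{3})\s+'
    r'Zabbix agent item "(?P<item>[^"]+)" on host "(?P<host>[^"]+)" failed: '
    r'first network error'
)


def retry_intervals(lines, item, host):
    """Return the seconds between consecutive matching error lines."""
    stamps = []
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group("item") == item and m.group("host") == host:
            stamps.append(
                datetime.strptime(
                    m.group("date") + " " + m.group("time"),
                    "%Y%m%d %H%M%S.%f",
                )
            )
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


# Hypothetical sample lines; PID and timestamps are invented for the example.
MSG = ('Zabbix agent item "GetClockDrift" on host "MY-SVR01" failed: '
       'first network error, wait for 15 seconds')
sample = [
    " 4321:20170101:120000.000 " + MSG,
    " 4321:20170101:120018.000 " + MSG,
    " 4321:20170101:120036.500 " + MSG,
]

print(retry_intervals(sample, "GetClockDrift", "MY-SVR01"))  # [18.0, 18.5]
```

For a real log, the lines could be read with open(path) and passed in the same way. If the timestamp layout differs from the assumption above, the regular expression would need adjusting.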
In the agent log, the request appears to be handled perfectly:
EXECUTE_STR() command:'cscript.exe //B "c:\Program Files\Zabbix\Bin\GetClockDrift.vbs"' len:8 cmd_result:'-0.11778'
Sending back [-0.11778]
Other items that show up in the queue, like "Get Free Space on C:" or any of the other disk-space-related queries, never seem to actually be submitted to the agent, or at least do not show up in the agent log at all.
When I disable the "offending" item, the "first network error" messages go away, but the queued checks still don't get resolved. As soon as I enable that item again, the network error messages return.
I'm also seeing a whole ton of error messages like "PDH_CALC_NEGATIVE_DENOMINATOR error occurred in counterpath '\Processor(31)\% Processor Time'. Value ignored", but they all seem to concern CPU info, which doesn't show up in the missing-data queue at all, so they probably don't matter much.
So much has gone wrong since the update that I'm at a loss about where to even start debugging, but since this is a rather large monitoring infrastructure, I can't just tear everything down and start from scratch.
If anyone can provide any insight on drilling down on these issues, please do so.
Best regards
~woo