**batchenr** · 29-05-2017, 13:06

Originally posted by woo

I've just upgraded our Zabbix infrastructure from the last 2.4 release to the current 3.2.6, and it seems there's something heavily messed up with the server->proxy->agent communication since then.
I noticed that because our Zabbix server suddenly had an alert "More than 100 items without data in the last 10 minutes", and so I started investigating the queue.
There are about a thousand items in the queue, all configured to poll via proxies in the respective local networks.
I've set debugging to the highest level and have been going through logs on the server, proxy and agents and inspecting traffic with tcpdump, and it all just doesn't add up.

The proxy log has many lines like "Zabbix agent item "GetClockDrift" on host "MY-SVR01" failed: first network error, wait for 15 seconds", after which the Proxy->Agent communication completely dies for some time, even though the hosts are on the same network, ping and other communications work fine, and the command executes flawlessly when run manually with zabbix_get.
The item is set to check every 300 seconds, but I still see it being attempted every ~18 seconds.
On the agent log, it appears to get handled perfectly:
EXECUTE_STR() command:'cscript.exe //B "c:\Program Files\Zabbix\Bin\GetClockDrift.vbs"' len:8 cmd_result:'-0.11778'
Sending back [-0.11778]
Other items that show up in the queue, like Get Free Space on C: or any of the other disk-space related queries, never seem to actually be submitted to the agent, or at least do not show up in the agent log at all.
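
To put numbers on the "every ~18 seconds" observation, the proxy log can be tallied per host and item. The following is a minimal Python sketch, not part of the original post. It assumes each log line starts with `<pid>:<YYYYMMDD>:<HHMMSS.mmm>` followed by the message quoted above; the function name `network_error_gaps` is made up for illustration.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed line layout: "<pid>:<YYYYMMDD>:<HHMMSS.mmm> <message>".
# The message part is taken from the proxy log line quoted in the post.
LINE_RE = re.compile(
    r'^\s*\d+:(?P<date>\d{8}):(?P<time>\d{6}\.\d{3})\s+'
    r'Zabbix agent item "(?P<item>.+?)" on host "(?P<host>.+?)" failed: '
    r'first network error'
)


def network_error_gaps(lines):
    """Return {(host, item): [seconds between consecutive errors]}."""
    seen = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not a "first network error" line
        stamp = datetime.strptime(m["date"] + m["time"], "%Y%m%d%H%M%S.%f")
        seen[(m["host"], m["item"])].append(stamp)

    gaps = {}
    for key, stamps in seen.items():
        stamps.sort()
        gaps[key] = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
    return gaps
```

Feeding it the lines of the proxy log (for example `network_error_gaps(open(path))`, where `path` is wherever your proxy writes its log) gives the spacing between consecutive errors per item, which can then be compared against the configured 300-second interval.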

When I disable the "offending" item, the "first network error" messages go away, but the queued checks still don't get resolved. As soon as I enable that item again, the network error messages return.

I'm also seeing a whole ton of error messages "PDH_CALC_NEGATIVE_DENOMINATOR error occurred in counterpath '\Processor(31)\% Processor Time'. Value ignored", but they all seem to concern CPU info, which doesn't show up on the missing-data queue at all, so it seems to not matter much.

There's so much going wrong since the update, that I'm at a loss about where even to start debugging.. but since this is rather a large monitoring infrastructure, I can't just tear everything down and start from scratch.
If anyone can provide any insight on drilling down on these issues, please do so.

Best regards
~woo

First of all, I hope you made some backups and a rollback plan.

I upgraded from 3.0 to 3.2.6 and had to make changes across the whole database.

See here: https://www.zabbix.com/forum/showthread.php?t=58036

There are certain recommendations to follow before you upgrade.
Also, did you update the Zabbix agent on these machines as well? Maybe that will help.

**woo** · 07-06-2017, 13:31

The database is already in UTF8 mode as far as I can see.
The Zabbix proxies and agents were also updated to 3.2.x.

I've been searching and experimenting and stabbing in the dark, and nothing comes from it. I have no idea where my proxied checks get lost.

The next thing I'm going to try is to dig through the sources for where this "failed: first network error" message is generated, and trace back what condition could trigger it erroneously, because there definitely is no network error at all. I assume the 15 seconds of service interruption that follows this error is what actually causes so many checks to not be run.
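
One way to narrow this down before reading the sources is to time the check itself, on the assumption (mine, not confirmed anywhere in this thread) that a slow check could be reported the same way as a network failure. The sketch below is Python; `time_runs` and `slower_than` are hypothetical helper names, and the limit to compare against is whatever `Timeout` is set to in the proxy configuration.

```python
import subprocess
import time


def time_runs(argv, runs=5):
    """Run the command `runs` times; return a list of (seconds, returncode)."""
    results = []
    for _ in range(runs):
        start = time.monotonic()
        proc = subprocess.run(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        results.append((time.monotonic() - start, proc.returncode))
    return results


def slower_than(results, limit_seconds):
    """Return only the runs that took longer than limit_seconds."""
    return [r for r in results if r[0] > limit_seconds]
```

For example, run `results = time_runs(["zabbix_get", "-s", "MY-SVR01", "-k", "GetClockDrift"])` from the proxy host, then `slower_than(results, 3.0)` if the timeout is 3 seconds. If any run shows up there, the timeout is worth a closer look; if none does, this theory can be dropped.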

proxy communication messed up since upgrade
