Problem with Zabbix 7 agents active checks with Zabbix HA deployment.
-
Update for this morning. The two hosts I have targeting the containers directly have stopped communicating with the primary Zabbix node since the last agent restart. Same behavior as when they were targeting the load balancer: they start attempting to communicate with the stand-by node and keep doing so without ever sending traffic back to the primary. So it doesn't appear to be a load balancer issue. Unfortunately, increasing the logging verbosity didn't help because I misunderstood the way Zabbix does 'log rotation'. It only keeps one file as .old, so the increased log volume caused that file to be overwritten far too frequently and I could not go back to the time the issue started.
What I do see in the Wireshark capture on the agent is a reset event at the time the issue starts, and the agent then flips to talking to the stand-by node after that. All traffic to the primary ceases.
I'm restarting the agent on these hosts and increasing the log file size to the 1 GB maximum, in hopes it retains enough entries to show what was logged the next time the agent drops.
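For reference, the agent side boils down to a few lines in the agent config. The hostnames below are placeholders; the semicolon-separated form in ServerActive is the documented way to list HA cluster nodes for active checks, and 1024 MB is the upper limit Zabbix accepts for LogFileSize:

    # zabbix_agent2.conf (the same parameters exist in zabbix_agentd.conf);
    # hostnames are placeholders for the two HA nodes.
    # Semicolons mark an HA failover list; commas would mean independent
    # servers polled in parallel, which is not what we want here.
    ServerActive=zbx-node1.example.internal;zbx-node2.example.internal
    # 1024 MB is the maximum; the single .old file is why history was lost before.
    LogFileSize=1024
    # 4 = debugging (default is 3)
    DebugLevel=4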
-
Is there something in particular I should be looking for in the server debug logs? After setting the level to 4 as suggested, there's far too much output for me to parse effectively via CloudWatch, because it limits the number of results to a maximum of 10,000. Even after filtering out the proxy messages, I can hit the 10,000 limit with as little as seven seconds' worth of log entries, so trying to find something that happened in a 5-minute span is incredibly tedious.
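One thing I may try next is pulling the whole window with the AWS CLI instead of the console, since filter-log-events pages through all the events rather than stopping at 10,000 results. The log group name and timestamps below are placeholders (times are epoch milliseconds), and the grep just drops anything mentioning proxy:

    aws logs filter-log-events \
        --log-group-name /ecs/zabbix-server \
        --start-time 1720000000000 \
        --end-time 1720000300000 \
        --query 'events[].[message]' \
        --output text \
      | grep -v proxy > zabbix-server-5min.log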
-
Oh, sorry that I wasn't clear: normally (without increasing the logging level) Zabbix server logs the proxy configuration messages every 10 seconds, yes, generating "lots" of logs. (There is a ZBXNEXT for changing this.) But excluding those lines, there shouldn't be much else, so you should be able to see whether anything else is logged during the times some agents decide to change their behavior: things like incomplete agent/proxy connections or cluster switchover messages, which would give you hints about what was happening.
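For example, a CloudWatch Logs Insights query along these lines should leave only the non-proxy lines for a narrow window (the /proxy/ pattern is only a guess at a substring matching those configuration-sync messages; adjust it to whatever your log lines actually contain):

    fields @timestamp, @message
    | filter @message not like /proxy/
    | sort @timestamp asc
    | limit 10000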
Markku
-
After much troubleshooting and no clear answers from the server logs, I disabled HA mode and went back to a single Zabbix server container to see if the agents would still randomly drop their connections. One week in so far, and not a single issue with active checks failing. We might simply rely on Fargate to handle a Zabbix server container failure by replacing the task with a new one, if that's what it takes to keep the monitoring solution stable.
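For anyone who finds this later, disabling HA came down to running the server in standalone mode, which in Zabbix terms means leaving HANodeName unset. The environment variable names below are the ones the official Zabbix Docker images map to those parameters; double-check them against your image version:

    # zabbix_server.conf: the server runs in standalone mode when HANodeName
    # is left empty, so disabling HA was just removing these two settings
    # (or the ZBX_HANODENAME / ZBX_NODEADDRESS environment variables in the
    # Fargate task definition, per the official Zabbix Docker images):
    #
    #   HANodeName=zbx-ha-node-1
    #   NodeAddress=10.0.1.10:10051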