Problem with Zabbix 7 agent active checks in a Zabbix HA deployment

  • Markku
    Senior Member
    Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
    • Sep 2018
    • 1781

    #16
    Originally posted by jhboricua
    I have two of them pointed directly to the Fargate container task's IP addresses, bypassing the Network Load Balancer. Should be fine as long as the containers are not stopped.
    Curious: Why do you need the NLBs there in the first place if you can reach the containers directly anyway? Is it because the container IP can change at any time, so it's not possible to keep an up-to-date DNS name for the container?

    Markku


    • Markku
      Senior Member
      Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
      • Sep 2018
      • 1781

      #17
      FWIW, yesterday I set up a Zabbix HA cluster with one Linux agent2 just to test this; no issues yet (but I don't have any LBs or other middleboxes here).
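
      For reference, the agent side of the test is just the standard HA setup from the documentation: all cluster nodes listed in ServerActive, separated by semicolons. A minimal sketch (the node addresses and hostname are placeholders):

      # /etc/zabbix/zabbix_agent2.conf (sketch; hypothetical node addresses)
      # With a Zabbix HA cluster, list every node in ServerActive separated by
      # semicolons; the agent sends active check data to whichever node is active.
      ServerActive=zbx-node1.example.com;zbx-node2.example.com
      Hostname=ha-test-host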

      Markku


      • jhboricua
        Senior Member
        • Dec 2021
        • 113

        #18
        Originally posted by Markku

        Curious: Why do you need the NLBs there in the first place if you can reach the containers directly anyway? Is it because the container IP can change at any time, so it's not possible to keep an up-to-date DNS name for the container?

        Markku
        Yes, this is why.


        • jhboricua
          Senior Member
          • Dec 2021
          • 113

          #19
          Update for this morning. The two hosts I have targeting the containers directly have stopped communicating with the primary Zabbix node since the last agent restart. Same behavior as when they were targeting the load balancer: they start attempting to communicate with the stand-by node and continue to do so without ever sending traffic back to the primary. So it doesn't appear to be a load balancer issue. Unfortunately, increasing the logging verbosity didn't help, because I misunderstood the way Zabbix does 'log rotation'. It only keeps one file as .old, so the increased log volume caused it to be overwritten far too frequently, and I could not go back to the time the issue started.

          What I do see in the Wireshark capture on the agent is that there's a reset event at the time the issue starts, and the agent then flips to talking to the stand-by node. All traffic to the primary ceases.
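
          For anyone wanting to watch for the same thing, a capture filter along these lines shows only the resets (a sketch, assuming the default trapper port 10051; adjust to your setup):

          # Show only TCP RST packets on the Zabbix server port (default 10051)
          tcpdump -i any 'tcp port 10051 and (tcp[tcpflags] & tcp-rst) != 0'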

          I'm restarting the agent on these hosts and increasing the log file size to the 1 GB maximum, in hopes it retains enough entries to see what the logs show the next time the agent drops.
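
          In config terms that's roughly the following (a sketch; per the documentation LogFileSize is given in MB with 1024 as the maximum, and only a single .old rotation is kept):

          # /etc/zabbix/zabbix_agent2.conf (sketch)
          LogFileSize=1024   # maximum allowed value, in MB
          DebugLevel=4       # debug verbosity, to catch the moment the agent flips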


          • Markku
            Senior Member
              Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
            • Sep 2018
            • 1781

            #20
            A reminder that one key is still the server-side logs: what they say at the same time the agent ceases to connect to the active node.

            Markku


            • jhboricua
              Senior Member
              • Dec 2021
              • 113

              #21
              Yes, I plan on updating the log level on the backend side today. I might not be able to come back with another update until the end of this month due to planned time off starting tomorrow.

              Is the Zabbix cluster you deployed also container-based?


              • jhboricua
                Senior Member
                • Dec 2021
                • 113

                #22
                Is there something in particular I should be looking at in the server debug logs? After setting the level to 4 as suggested, there's just far too much output for me to parse effectively via CloudWatch, because it limits the number of results to a maximum of 10,000. Even after filtering out the proxy messages, I can hit the 10,000 limit with as little as seven seconds' worth of log entries, so trying to find something that happened in a 5-minute span is incredibly tedious.
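
                In case it helps anyone else, this is the shape of the CloudWatch Logs Insights query I'm using to cut the noise (a sketch; the proxy-message pattern is a placeholder for whatever your server actually logs):

                fields @timestamp, @message
                | filter @message not like /configuration data to proxy/
                | sort @timestamp asc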


                • Markku
                  Senior Member
                  Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
                  • Sep 2018
                  • 1781

                  #23
                  Originally posted by jhboricua
                  Is the Zabbix cluster you deployed also container-based?
                  No, because I don't need the infrastructure complexity that containers would add; I'm interested in Zabbix performance. So I'm using VMs for this.

                  Markku


                  • Markku
                    Senior Member
                    Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
                    • Sep 2018
                    • 1781

                    #24
                    Originally posted by jhboricua
                    Is there something in particular I should be looking at in the server debug logs? After setting the level to 4 as suggested, there's just far too much output for me to parse effectively via CloudWatch, because it limits the number of results to a maximum of 10,000. Even after filtering out the proxy messages, I can hit the 10,000 limit with as little as seven seconds' worth of log entries, so trying to find something that happened in a 5-minute span is incredibly tedious.
                    Oh, sorry that I wasn't clear: normally (without increasing the logging level) Zabbix server logs the proxy configuration messages every 10 seconds, yes, generating "lots" of logs. (There is a ZBXNEXT for changing this.) But excluding those log lines, there shouldn't be much more, so you should be able to see whether anything else is logged during the times some agents decide to change their behavior: incomplete agent/proxy connections, or cluster switchover messages, giving you hints about what was happening.
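
                    If you have shell access to a node, filtering the raw server log can be easier than the log service; something like this (a sketch, the exact proxy-message wording can vary between versions):

                    # Drop the periodic proxy configuration lines, keep the rest
                    grep -v 'configuration data to proxy' /var/log/zabbix/zabbix_server.log

                    # The cluster's own view of nodes and switchovers
                    zabbix_server -R ha_status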

                    Markku


                    • jhboricua
                      Senior Member
                      • Dec 2021
                      • 113

                      #25
                      After much troubleshooting and no clear answers from the server logs, I disabled HA mode and went back to a single Zabbix server container to see if the agents would still randomly drop their connections. One week in so far, and not a single issue with active checks failing. We might simply rely on Fargate to handle a Zabbix server container failure by replacing the task with a new one, if that's what it takes to keep the monitoring solution stable.
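
                      For reference, going back to standalone is essentially just this on the server side (a sketch; per the documentation, HA mode is active exactly when HANodeName is set, and the node name below is hypothetical):

                      # zabbix_server.conf (sketch)
                      # Commenting out HANodeName starts the server in standalone mode.
                      # HANodeName=zbx-node1

                      # Optional cleanup: remove the retired node from the cluster
                      # (node name is hypothetical)
                      zabbix_server -R ha_remove_node=zbx-node2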

