We're in the process of upgrading our environment to Zabbix 7 and have set up a containerized HA deployment of Zabbix 7.0.2 in our non-prod environment. Our current production deployment is on Zabbix 5.0. We picked a subset of our Linux and Windows hosts and pointed them at the new 7.0 non-prod deployment so that we can start tuning templates. Their agents were upgraded to the 7.0.2 Agent2 package. On these agents we configured the Server and ServerActive parameters as follows:
Code:
Server=zbx-node-01.uat.zabbix.mydomain.com,zbx-node-02.uat.zabbix.mydomain.com
ServerActive=zbx-node-01.uat.zabbix.mydomain.com;zbx-node-02.uat.zabbix.mydomain.com

Notice we are using a semicolon between the two values in the ServerActive parameter, since this is a Zabbix native HA deployment.
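
For context, here is a trimmed-down sketch of how those parameters sit in zabbix_agent2.conf alongside the other active-check settings. The Hostname is the one from the log below; the RefreshActiveChecks and HeartbeatFrequency lines are illustrative only, not values we are claiming to have tuned:
Code:
# zabbix_agent2.conf (sketch; Server/ServerActive as shown above)
# Must match the host name configured in the Zabbix frontend
Hostname=svw005syseng700.mydomain.com
# How often the agent requests its list of active checks from the server (seconds)
RefreshActiveChecks=5
# How often the agent sends heartbeat messages for active check availability (seconds)
HeartbeatFrequency=60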
On the 5.0 deployment we do passive checks. For the 7.0 deployment we would like to use active checks to reduce the load on the Zabbix servers. We are running into a problem: after a while, which could be hours or could be days, the active agents stop sending data and we get alarms for 'Active checks are not available'.
In the agent logs we see this when the problem starts happening:
Code:
2024/08/19 11:08:47.198891 [103] active check configuration update from host [svw005syseng700.mydomain.com] started to fail
2024/08/19 11:09:05.197837 [103] cannot receive data from [zbx-node-02.uat.zabbix.mydomain.com:10051]: Cannot read message: 'read tcp 10.134.37.126:60222->10.132.148.205:10051: wsarecv: An existing connection was forcibly closed by the remote host.'
2024/08/19 11:09:05.197837 [103] active check configuration update from host [svw005syseng700.mydomain.com] started to fail
2024/08/19 11:09:11.198846 [103] cannot receive data from [zbx-node-02.uat.zabbix.mydomain.com:10051]: Cannot read message: 'read tcp 10.134.37.126:60232->10.132.148.205:10051: wsarecv: An existing connection was forcibly closed by the remote host.'
2024/08/19 11:09:11.198846 [103] active check configuration update from host [svw005syseng700.mydomain.com] started to fail
2024/08/19 11:09:17.197849 [103] cannot receive data from [zbx-node-02.uat.zabbix.mydomain.com:10051]: Cannot read message: 'read tcp 10.134.37.126:60244->10.132.148.205:10051: wsarecv: An existing connection was forcibly closed by the remote host.'
2024/08/19 11:09:17.197849 [103] active check configuration update from host [svw005syseng700.mydomain.com] started to fail
2024/08/19 11:09:18.197859 [103] sending of heartbeat message to [zbx-node-02.uat.zabbix.mydomain.com:10051] is working again
2024/08/19 11:09:23.197859 [103] cannot receive data from [zbx-node-02.uat.zabbix.mydomain.com:10051]: Cannot read message: 'read tcp 10.134.37.126:60255->10.132.148.205:10051: wsarecv: An existing connection was forcibly closed by the remote host.'
2024/08/19 11:09:23.197859 [103] active check configuration update from host [svw005syseng700.mydomain.com] started to fail
2024/08/19 11:09:29.197862 [103] connection closed
2024/08/19 11:09:29.197862 [103] connection closed
For some reason these failed agents stop communicating with the active node and then just continuously attempt to check in with the standby node, which of course tells them to take a hike. The only way to recover is to restart the agent service. I ran a Wireshark capture and the Zabbix agent generates no traffic to the active node, only to the standby. As soon as I restart the agent service, it starts sending traffic to the active node again.

What could possibly be triggering this behavior? So far, it seems to be happening only with Windows agents.
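
For reference, these are the kind of sanity checks that can be run from both sides while an agent is stuck; the container name and the Windows service name below are assumptions for our environment, so adjust as needed:
Code:
# Server side: report the HA cluster status (active/standby per node);
# output is written to the server log. "zabbix-server" is a placeholder container name.
docker exec zabbix-server zabbix_server -R ha_status

# From an affected Windows agent (PowerShell): confirm port 10051 is reachable on both nodes
Test-NetConnection zbx-node-01.uat.zabbix.mydomain.com -Port 10051
Test-NetConnection zbx-node-02.uat.zabbix.mydomain.com -Port 10051

# Restarting the agent service is what recovers it for now
# (service name assumed to be the default "Zabbix Agent 2")
Restart-Service "Zabbix Agent 2"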
