We're in the process of upgrading our environment to Zabbix 7 and have set up a containerized HA deployment of Zabbix 7.0.2 in our non-prod environment. Our current production deployment is on Zabbix 5.0. We picked a subset of our Linux and Windows hosts and pointed them at the new 7.0 non-prod deployment so that we can start tuning templates. Their agents were upgraded to the 7.0.2 Agent2 package. On these agents we configured the Server and ServerActive parameters as follows:
Code:
Server=zbx-node-01.uat.zabbix.mydomain.com,zbx-node-02.uat.zabbix.mydomain.com
ServerActive=zbx-node-01.uat.zabbix.mydomain.com;zbx-node-02.uat.zabbix.mydomain.com

Notice we are using a semicolon between the two values in the ServerActive parameter, since this is a Zabbix native HA deployment.
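
For context, here is a trimmed-down sketch of how those parameters sit in zabbix_agent2.conf alongside the other active-check settings. The Hostname is the one from the log below; the RefreshActiveChecks and HeartbeatFrequency lines are illustrative only, not values we are claiming to have tuned:
Code:
# zabbix_agent2.conf (sketch; Server/ServerActive as shown above)
# Must match the host name configured in the Zabbix frontend
Hostname=svw005syseng700.mydomain.com
# How often the agent requests its list of active checks from the server (seconds)
RefreshActiveChecks=5
# How often the agent sends heartbeat messages for active check availability (seconds)
HeartbeatFrequency=60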
On the 5.0 deployment we do passive checks. For the 7.0 deployment we would like to use active checks to reduce the load on the Zabbix servers. We are running into a problem: after a while, which could be hours or could be days, the active agents stop sending data and we get alarms for 'Active checks are not available'.
In the agent logs we see this when the problem starts happening:
Code:
2024/08/19 11:08:47.198891 [103] active check configuration update from host [svw005syseng700.mydomain.com] started to fail
2024/08/19 11:09:05.197837 [103] cannot receive data from [zbx-node-02.uat.zabbix.mydomain.com:10051]: Cannot read message: 'read tcp 10.134.37.126:60222->10.132.148.205:10051: wsarecv: An existing connection was forcibly closed by the remote host.'
2024/08/19 11:09:05.197837 [103] active check configuration update from host [svw005syseng700.mydomain.com] started to fail
2024/08/19 11:09:11.198846 [103] cannot receive data from [zbx-node-02.uat.zabbix.mydomain.com:10051]: Cannot read message: 'read tcp 10.134.37.126:60232->10.132.148.205:10051: wsarecv: An existing connection was forcibly closed by the remote host.'
2024/08/19 11:09:11.198846 [103] active check configuration update from host [svw005syseng700.mydomain.com] started to fail
2024/08/19 11:09:17.197849 [103] cannot receive data from [zbx-node-02.uat.zabbix.mydomain.com:10051]: Cannot read message: 'read tcp 10.134.37.126:60244->10.132.148.205:10051: wsarecv: An existing connection was forcibly closed by the remote host.'
2024/08/19 11:09:17.197849 [103] active check configuration update from host [svw005syseng700.mydomain.com] started to fail
2024/08/19 11:09:18.197859 [103] sending of heartbeat message to [zbx-node-02.uat.zabbix.mydomain.com:10051] is working again
2024/08/19 11:09:23.197859 [103] cannot receive data from [zbx-node-02.uat.zabbix.mydomain.com:10051]: Cannot read message: 'read tcp 10.134.37.126:60255->10.132.148.205:10051: wsarecv: An existing connection was forcibly closed by the remote host.'
2024/08/19 11:09:23.197859 [103] active check configuration update from host [svw005syseng700.mydomain.com] started to fail
2024/08/19 11:09:29.197862 [103] connection closed
2024/08/19 11:09:29.197862 [103] connection closed
For some reason these failed agents stop communicating with the active node and then just continuously attempt to check in with the standby node, which of course tells them to take a hike. The only way to recover is to restart the agent service. I ran a Wireshark capture and the Zabbix agent generates no traffic to the active node, only to the standby. As soon as I restart the agent service, it starts sending traffic to the active node again.

What could possibly be triggering this behavior? So far, it seems to be happening only with Windows agents.
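
For reference, these are the kind of sanity checks that can be run from both sides while an agent is stuck; the container name and the Windows service name below are assumptions for our environment, so adjust as needed:
Code:
# Server side: report the HA cluster status (active/standby per node);
# output is written to the server log. "zabbix-server" is a placeholder container name.
docker exec zabbix-server zabbix_server -R ha_status

# From an affected Windows agent (PowerShell): confirm port 10051 is reachable on both nodes
Test-NetConnection zbx-node-01.uat.zabbix.mydomain.com -Port 10051
Test-NetConnection zbx-node-02.uat.zabbix.mydomain.com -Port 10051

# Restarting the agent service is what recovers it for now
# (service name assumed to be the default "Zabbix Agent 2")
Restart-Service "Zabbix Agent 2"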
