We recently upgraded our Zabbix server to version 7.0.18. Since then, we have repeatedly hit the same problem: after the server runs fine for a few days, alerts are generated for all clients because they stop delivering updates. The Zabbix client logs on every host show this:
5583:20251003:025848.279 active check data upload to [zabbix.company.com:10051] started to fail ([connect] cannot connect to [[zabbix.company.com]:10051]: [4] Interrupted system call)
5583:20251003:030118.338 active check data upload to [zabbix.company.com:10051] is working again
Even the Zabbix client on the Zabbix server itself logged this:
893612:20251003:025810.835 Unable to connect to [zabbix.company.com]:10051 [cannot connect to [[zabbix.company.com]:10051]: connection timed out]
893612:20251003:025810.835 Unable to send heartbeat message to [zabbix.company.com]:10051 [cannot connect to [[zabbix.company.com]:10051]: connection timed out]
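To line these outages up against server-side events, it may help to turn the log timestamps into outage windows. Below is a minimal sketch in Python, assuming the log prefix is `PID:YYYYMMDD:HHMMSS.mmm` as in the excerpts above and that the messages "started to fail" and "is working again" mark the start and end of an outage. The function names are made up for illustration and are not part of Zabbix.

```python
import re
from datetime import datetime

# Assumed log prefix, taken from the excerpts above: PID:YYYYMMDD:HHMMSS.mmm
LINE_RE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6}\.\d{3})\s+(.*)$")


def parse_line(line):
    """Return (pid, timestamp, message), or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    pid, date, clock, message = match.groups()
    timestamp = datetime.strptime(date + clock, "%Y%m%d%H%M%S.%f")
    return int(pid), timestamp, message


def outage_windows(lines):
    """Pair 'started to fail' with the next 'is working again' per PID.

    Returns a list of (start, end, duration_in_seconds) tuples.
    """
    open_since = {}
    windows = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        pid, timestamp, message = parsed
        if "started to fail" in message:
            # Keep the earliest failure time if it is logged more than once.
            open_since.setdefault(pid, timestamp)
        elif "is working again" in message and pid in open_since:
            start = open_since.pop(pid)
            windows.append((start, timestamp, (timestamp - start).total_seconds()))
    return windows


if __name__ == "__main__":
    sample = [
        "5583:20251003:025848.279 active check data upload to "
        "[zabbix.company.com:10051] started to fail ([connect] cannot connect to "
        "[[zabbix.company.com]:10051]: [4] Interrupted system call)",
        "5583:20251003:030118.338 active check data upload to "
        "[zabbix.company.com:10051] is working again",
    ]
    for start, end, seconds in outage_windows(sample):
        print(f"{start} -> {end} ({seconds:.3f} s)")
```

Running this over the logs from several hosts should show whether all clients fail in the same window, which would point at the server rather than at individual network paths.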
On the Zabbix server we verified that the zabbix_server process was still running and still listening on TCP port 10051. We did note that "netstat" showed several thousand TCP connections in CLOSE_WAIT state for port 10051. A tcpdump showed data arriving at port 10051, and at least some of it got a response from the Zabbix server, although I assume not all clients did, given the errors they logged. A restart of the zabbix_server program immediately resolves the problem for a few days.
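CLOSE_WAIT means the remote side has closed the connection while the local process has not yet closed its own socket, so it is worth tracking how that count grows over the days before the failure. Here is a rough sketch in Python that tallies TCP states for one local port. It assumes a Linux host where `/proc/net/tcp` and `/proc/net/tcp6` are readable, and the function names are illustrative only.

```python
from collections import Counter
from pathlib import Path

# State codes as used in the "st" column of /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(proc_lines, port):
    """Count sockets per TCP state whose local port equals `port`."""
    counts = Counter()
    for line in proc_lines:
        fields = line.split()
        if len(fields) < 4 or ":" not in fields[1]:
            continue  # header or malformed line
        try:
            # local_address is HEXIP:HEXPORT
            local_port = int(fields[1].rsplit(":", 1)[1], 16)
        except ValueError:
            continue
        if local_port == port:
            counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts


def read_proc_tcp():
    """Read IPv4 and IPv6 socket tables if present (Linux only, an assumption)."""
    lines = []
    for name in ("/proc/net/tcp", "/proc/net/tcp6"):
        path = Path(name)
        if path.exists():
            lines.extend(path.read_text().splitlines())
    return lines


if __name__ == "__main__":
    # 10051 is the server port from the post above.
    for state, number in count_states(read_proc_tcp(), 10051).most_common():
        print(f"{state:12} {number}")
```

Run from cron every few minutes and logged with a timestamp, this should show whether CLOSE_WAIT climbs steadily or jumps right before the clients start failing.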
Can someone please point us in the right direction on how to debug this issue? Thank you in advance.
There can be many reasons, starting with network issues or missing items (admittedly, that is probably only relevant for trapper items, not agent items). It seems you do have a connection ({"response":"success"}), but the data does not get processed.