I have an interesting problem on Zabbix server and agent 1.4.6.
I have two servers that randomly set themselves to Unreachable and stop recording new values.
If I restart the agent, it collects for a few hours and then goes unreachable again. I can see the zabbix_agentd process running; however, I cannot telnet to the default agent port from the Zabbix server.
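For anyone who wants to repeat the telnet check from a script, here is a minimal sketch of what I mean. It assumes the agent listens on the stock port 10050. The function name `agent_port_open` is just something I made up for illustration, not part of Zabbix.

```python
import socket


def agent_port_open(host, port=10050, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    10050 is the default Zabbix agent listen port; change it if the
    agent is configured differently.
    """
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return False
```

A True result only means the TCP handshake completed. It does not prove the agent actually answers item requests, so it is a rough equivalent of the telnet test and nothing more.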
I have a third server with similar symptoms, but there I am still able to telnet to the agent and pull values. These servers are on the same subnet as the Zabbix server, so there are no firewalls in between that would prevent it from collecting data or communicating over the network. The server log shows that a generic "network error" occurred, but I am unsure what this means or why the host cannot recover.
Running netstat on the monitored servers, I can see an extremely high number of connections back to the Zabbix server in the CLOSE_WAIT state. One of them has 3 ESTABLISHED connections and 195 connections in the CLOSE_WAIT state to the Zabbix server.
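In case it helps anyone compare numbers, this is roughly how the per-state counts can be tallied. It is only a sketch and assumes Windows-style `netstat -an` output, where each TCP row reads Proto, Local Address, Foreign Address, State. The function name `count_states` and the addresses in the usage example are made up for illustration.

```python
from collections import Counter


def count_states(netstat_output, remote_ip):
    """Count TCP connection states for rows whose foreign address is remote_ip.

    Assumes Windows-style columns: Proto, Local Address, Foreign Address, State.
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, blank lines, and non-TCP rows (UDP rows have no state).
        if len(fields) < 4 or fields[0].upper() != "TCP":
            continue
        # Strip the port: "10.0.0.5:43210" -> "10.0.0.5"
        foreign_host = fields[2].rsplit(":", 1)[0]
        if foreign_host == remote_ip:
            counts[fields[3]] += 1
    return counts


# Hypothetical sample; 10.0.0.5 stands in for the Zabbix server.
sample = """
  Proto  Local Address          Foreign Address        State
  TCP    10.0.0.20:10050        10.0.0.5:43210         CLOSE_WAIT
  TCP    10.0.0.20:10050        10.0.0.5:43211         CLOSE_WAIT
  TCP    10.0.0.20:10050        10.0.0.5:43212         ESTABLISHED
  TCP    10.0.0.20:3389         10.0.0.9:50100         ESTABLISHED
"""
print(count_states(sample, "10.0.0.5"))
```

CLOSE_WAIT means the remote side has already closed its end and the local process has not yet closed its own socket, so a count that keeps growing on the agent host suggests the agent is not closing connections after the server hangs up.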
I am monitoring roughly 50 servers that are nearly identical (hardware and software) to the ones with the problem. Only these 3 are affected.
Log:
29015:20090515:003521 Host [Server1]: first network error, wait for 15 seconds
29015:20090515:003521 Parameter [perf_counter["\Memory\Page Faults/sec"]] will be checked after 300 seconds on host [Server1]
29026:20090515:005706 Host [Server1]: first network error, wait for 15 seconds
29026:20090515:005706 Parameter [perf_counter[\System\File Write Bytes/sec]] will be checked after 300 seconds on host [Server1]
29015:20090515:040009 Host [Server1]: first network error, wait for 15 seconds
29057:20090515:040036 Host [Server1]: another network error, wait for 15 seconds
29057:20090515:040101 Host [Server1] will be checked after 60 seconds
Any ideas?