It was all working fine until two days ago. Now many Windows servers are being reported as unreachable for more than 5 minutes. They recover after an hour or two, and then about two hours later they are reported as unreachable again. We did not make any changes.
I'm able to telnet and ping without any issue while a host is reported as unreachable.
I tried restarting the Zabbix agent and the Zabbix server as well.
When I restart the Zabbix agent, the host sometimes recovers (although it still fails again later) and sometimes it does not.
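The manual telnet test could also be scripted so that it runs repeatedly from the Zabbix server while a host is flagged as unreachable. The following is only a minimal sketch: the `probe` helper is a hypothetical name, and port 10050 is an assumption (it is the Zabbix agent's default listen port, but the agent configuration is not shown here).

```python
import socket
import time


def probe(host, port=10050, timeout=3.0):
    """Try one TCP connect; return (ok, seconds_elapsed).

    Port 10050 is assumed (Zabbix agent default); adjust if ListenPort differs.
    """
    start = time.monotonic()
    try:
        # create_connection raises OSError on refusal, timeout, or routing failure
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        return False, time.monotonic() - start
```

Calling `probe("HOST-192")` in a loop and logging the elapsed time would show whether connects are slow rather than failing outright. A successful connect only proves the TCP handshake works; it does not show that the agent returns item values in time.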
Something looks wrong according to the server log:
bc@zabbix_server:/etc/zabbix$ tail -f /var/log/zabbix/zabbix_server.log
snmp_build: unknown failuresnmp_build: unknown failure 994:20140313:053931.839 enabling SNMP checks on host [HOST-092]: host became available
snmp_build: unknown failuresnmp_build: unknown failure 993:20140313:054235.345 resuming Zabbix agent checks on host [HOST-002]: connection restored
986:20140313:054347.246 Zabbix agent item [net.if.out[WAN Miniport (PPPOE)]] on host [HOST003] failed: first network error, wait for 15 seconds
984:20140313:054501.894 Zabbix agent item [perf_counter[\2\18]] on host [HOST-192] failed: first network error, wait for 15 seconds
988:20140313:054517.968 Zabbix agent item [system.cpu.load[,avg5]] on host [HOST-040] failed: first network error, wait for 15 seconds
994:20140313:055214.461 resuming Zabbix agent checks on host [HOST003]: connection restored
993:20140313:055220.382 resuming Zabbix agent checks on host [HOST-040]: connection restored
991:20140313:055239.542 Zabbix agent item [vfs.fs.size[C:,pfree]] on host [HOST-125] failed: first network error, wait for 15 seconds
985:20140313:055258.236 Zabbix agent item [net.if.out[WAN Miniport (PPTP)]] on host [HOST-002] failed: first network error, wait for 15 seconds
snmp_build: unknown failure 985:20140313:055722.742 SNMP item [sysContact] on host [HOST-092] failed: first network error, wait for 15 seconds
993:20140313:055942.402 resuming Zabbix agent checks on host [HOST-192]: connection restored
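To see which hosts flap most often, the server log lines above could be tallied per host. This is a rough sketch that relies only on the message wording visible in the excerpt ("first network error", "connection restored", "host became available"); the function name `tally_events` is hypothetical.

```python
import re
from collections import Counter

# Captures the host name from "... on host [HOST-192] ..." as seen in the log excerpt.
HOST_RE = re.compile(r"on host \[([^\]]+)\]")


def tally_events(lines):
    """Count failure and recovery messages per host in zabbix_server.log lines."""
    failures = Counter()
    recoveries = Counter()
    for line in lines:
        match = HOST_RE.search(line)
        if not match:
            continue  # line does not mention a host
        host = match.group(1)
        if "first network error" in line:
            failures[host] += 1
        elif "connection restored" in line or "host became available" in line:
            recoveries[host] += 1
    return failures, recoveries
```

It could be run against the whole file, for example `with open("/var/log/zabbix/zabbix_server.log") as f: failures, recoveries = tally_events(f)`, and the counters compared to check whether the same hosts fail at regular intervals.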
And here is the agent log:
This agent log is from HOST-192. As you can see, I restarted the agent hours ago and it was working fine, but now the host is unreachable again.
1448:20140122:111621.540 Starting Zabbix Agent [HOST-192]. Zabbix 2.0.6 (revision 35155).
2064:20140122:111621.587 agent #0 started [collector]
2068:20140122:111621.587 agent #1 started[listener]
2072:20140122:111621.587 agent #2 started[listener]
2080:20140122:111621.587 agent #4 started [active checks]
2076:20140122:111621.587 agent #3 started[listener]
2080:20140122:111642.632 active check configuration update from [10.48.12.163:10051] started to fail (cannot connect to [[10.48.12.163]:10051]: [0x0000274C] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)
1600:20140312:221226.684 Zabbix Agent shutdown requested
2080:20140312:221227.043 zabbix_agentd active check stopped
2064:20140312:221227.636 zabbix_agentd collector stopped
1600:20140312:221227.714 Zabbix Agent stopped. Zabbix 2.0.6 (revision 35155).
2864:20140312:221232.191 Starting Zabbix Agent [HOST-192]. Zabbix 2.0.6 (revision 35155).
5548:20140312:221232.191 agent #0 started [collector]
5252:20140312:221232.191 agent #1 started[listener]
752:20140312:221232.191 agent #2 started[listener]
3928:20140312:221232.191 agent #3 started[listener]
3932:20140312:221232.191 agent #4 started [active checks]
3932:20140312:221253.202 active check configuration update from [10.48.12.163:10051] started to fail (cannot connect to [[10.48.12.163]:10051]: [0x0000274C] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)
This agent log is from HOST-002, where I did not restart the agent. The host keeps recovering on its own and then becoming unreachable again.
1924:20140122:114556.752 Starting Zabbix Agent [HOST-002]. Zabbix 2.0.6 (revision 35155).
1928:20140122:114556.767 agent #0 started [collector]
1932:20140122:114556.767 agent #1 started[listener]
1936:20140122:114556.767 agent #2 started[listener]
1940:20140122:114556.767 agent #3 started[listener]
1944:20140122:114556.767 agent #4 started [active checks]
1944:20140122:114617.780 active check configuration update from [10.48.12.163:10051] started to fail (cannot connect to [[10.48.12.163]:10051]: [0x0000274C] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)
1900:20140122:115400.850 Zabbix Agent shutdown requested
1944:20140122:115401.209 zabbix_agentd active check stopped
1928:20140122:115401.240 zabbix_agentd collector stopped
1900:20140122:115401.864 Zabbix Agent stopped. Zabbix 2.0.6 (revision 35155).
1908:20140122:121024.890 Starting Zabbix Agent [HOST-002]. Zabbix 2.0.6 (revision 35155).
1912:20140122:121024.906 agent #0 started [collector]
1916:20140122:121024.906 agent #1 started[listener]
1920:20140122:121024.906 agent #2 started[listener]
1924:20140122:121024.906 agent #3 started[listener]
1928:20140122:121024.906 agent #4 started [active checks]
1928:20140122:121045.950 active check configuration update from [10.48.12.163:10051] started to fail (cannot connect to [[10.48.12.163]:10051]: [0x0000274C] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)