Hi
We've been using Zabbix for over a year now in an MSP style, with remote agents doing active checks on Windows servers. Over the weekend, for an as-yet unknown reason, almost half of our agents suddenly stopped communicating with our server. Looking at the last time each one contacted the Zabbix server, the outage seems to coincide (mostly) with a reboot of the monitored machine.
Last Tuesday (7 days ago) we migrated to a new public IP address but, on the surface, this doesn't seem to be the issue: the monitored servers can /all/ perform successful nslookup, tracert, ping and telnet hostname:10051 tests against both our new IP address and the hostname of our Zabbix server. I've also flushed DNS and ARP caches, double-checked agent configs and port forward rules, rebooted servers, and restarted the Zabbix server, our router and our firewall.
The Zabbix server log shows nothing relevant (as you'd expect) and the only relevant entries in the agent logs (level 4) are these lines:
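For anyone wanting to repeat the nslookup and telnet checks on many servers at once, here is a minimal sketch in Python. It assumes Python 3 is available on the monitored machine; `check_tcp` is just a name made up for this example, and the hostname in the usage line is the placeholder from the logs. It only proves that the name resolves and that the TCP handshake completes, nothing more.

```python
import socket

def check_tcp(host, port=10051, timeout=5.0):
    """Resolve host and try a TCP connect. Returns (ips, ok, error_text)."""
    try:
        infos = socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return [], False, "DNS lookup failed: %s" % exc
    # Collect the distinct IPv4 addresses the name resolves to.
    ips = sorted({info[4][0] for info in infos})
    try:
        # Equivalent to the telnet test: handshake only, no data exchanged.
        with socket.create_connection((host, port), timeout=timeout):
            return ips, True, None
    except OSError as exc:
        return ips, False, "connect failed: %s" % exc

if __name__ == "__main__":
    print(check_tcp("zabbix.mydomain.com"))
```

Comparing the resolved addresses from a working site and a non-working site would at least confirm that both are hitting the new public IP.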
18660:20121030:103752.280 Get active checks error: ZBX_TCP_READ() failed [A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.]
18660:20121030:103752.280 In process_active_checks('zabbix.mydomain.com',10051)
18660:20121030:103752.280 End of process_active_checks()
18660:20121030:103752.280 In get_min_nextcheck()
18660:20121030:103752.280 In send_buffer() host:'zabbix.mydomain.com' port:10051 values:0/100
18660:20121030:103752.280 End of send_buffer():SUCCEED
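Since the error is on the read rather than the connect, a telnet test may not be enough: it shows the handshake works but not that a reply ever comes back. Below is a rough sketch of sending the same kind of request the agent sends and waiting for the answer. It assumes the usual Zabbix framing (the bytes "ZBXD", a 0x01 byte, an 8-byte little-endian length, then JSON) and that the server replies in the same framing; the function names are made up for this example, and the host value must match the Hostname set in the agent config.

```python
import json
import socket
import struct

HEADER = b"ZBXD\x01"  # assumed protocol signature plus version byte

def build_request(agent_hostname):
    """Frame an 'active checks' request for the given agent hostname."""
    payload = json.dumps({"request": "active checks", "host": agent_hostname}).encode("utf-8")
    return HEADER + struct.pack("<Q", len(payload)) + payload

def recv_exact(sock, n):
    """Read exactly n bytes or raise if the peer closes early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("connection closed after %d of %d bytes" % (len(buf), n))
        buf += chunk
    return buf

def request_active_checks(server, agent_hostname, port=10051, timeout=10.0):
    """Send the request and return the decoded JSON reply.

    A socket timeout raised here, after a successful connect, would match
    the ZBX_TCP_READ() failure seen in the agent log.
    """
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(build_request(agent_hostname))
        header = recv_exact(sock, 13)
        if header[:5] != HEADER:
            raise ValueError("unexpected reply header: %r" % header[:5])
        (length,) = struct.unpack("<Q", header[5:13])
        return json.loads(recv_exact(sock, length).decode("utf-8"))
```

If this times out from an affected server but succeeds from a working one, the problem is somewhere on the return path (NAT, firewall state, MTU) rather than in the agent itself.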
Our Zabbix server is v1.8.4 on CentOS 5, behind a firewall, with tcp 10051 forwarded to it.
Our agents are v1.8.5.
All our remote agents use a DNS hostname to contact the server (and all resolve to the correct IP)
All our remote agents use active checks.
This is only affecting around half our remote agents.
There is no obvious common factor between the non-communicating agents: there are different firewalls/routers at each site and different Windows server versions, and not all agents on a given site are affected.
I realise this isn't a networking help forum, but if anyone can offer some advice on how to troubleshoot the Zabbix agent in more detail, or knows of a recent Windows update which interferes with agent v1.8.5, or has anything else relevant, it would be much appreciated.
Thanks in advance
Mark