
**tim.mooney** · 24-04-2020, 19:54

At my workplace, batches of false-positive pings have historically happened because I didn't have network topology dependencies set up correctly. If there's a network device (like a router) between your Zabbix server and all your hosts, and that device goes offline without being correctly listed as a dependency for your agent.ping trigger, then all your hosts will alert. This has happened in my environment because our network team has been upgrading components of our network, so the device(s) between the datacenter where the Zabbix server is and the hosts it monitors have changed over time. If I'm not aware of a change in topology and don't update the dependencies, then a network outage or upgrade can generate a lot of false positives for hosts, when it really should only generate one alert for the network device.
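In Zabbix, dependencies are defined between triggers, and the API exposes a `trigger.adddependencies` method for setting them. As a rough sketch only, the request body for making a host's agent.ping trigger depend on the router's trigger might be built like this. The trigger IDs are hypothetical placeholders, and authentication is deliberately left out because how the API token is passed depends on your Zabbix version.

```python
import json


def dependency_request(trigger_id, depends_on_trigger_id, request_id=1):
    """Build a JSON-RPC body for the Zabbix trigger.adddependencies method.

    Authentication is intentionally omitted; add it the way your
    Zabbix version expects before sending the request.
    """
    return {
        "jsonrpc": "2.0",
        "method": "trigger.adddependencies",
        "params": {
            # The trigger that should be suppressed (e.g. a host's agent.ping trigger).
            "triggerid": str(trigger_id),
            # The trigger it depends on (e.g. the router's availability trigger).
            "dependsOnTriggerid": str(depends_on_trigger_id),
        },
        "id": request_id,
    }


# Hypothetical IDs, for illustration only.
body = dependency_request(14092, 13565)
payload = json.dumps(body)
```

With a dependency like this in place, the host trigger should stay quiet while the router's trigger is in a problem state, so an outage produces one alert instead of one per host.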

In your situation, you checked connectivity and that didn't appear to be the problem. But did you check quickly enough to be certain that there hadn't been a network outage of roughly 5 minutes that corrected itself, perhaps right before you started looking?

If it truly was the server, there should be something in the system logs (not necessarily the zabbix-server.log). For example, a failing disk that's causing I/O timeouts can cause very long hangs on a system. The I/O stall should eventually get logged to the system's in-memory kernel log buffer (see the dmesg command) or, if /var is on a disk volume that is acting normally, written to one of the logs there. On a Red Hat-like system, I would check /var/log/messages for anything suspicious around the time of the alerts.
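To narrow a large log down to the period around the alerts, something like the following could help. It is a minimal sketch that assumes the traditional syslog timestamp format (e.g. `Apr 24 19:52:10`) at the start of each line. The function name `lines_near` and the 10-minute default window are my own choices, not anything Zabbix provides.

```python
from datetime import datetime, timedelta


def lines_near(lines, alert_time, window_minutes=10):
    """Return log lines stamped within window_minutes of alert_time."""
    window = timedelta(minutes=window_minutes)
    matches = []
    for line in lines:
        try:
            # Traditional syslog timestamps omit the year, so borrow it
            # from alert_time (an assumption that breaks across New Year).
            stamp = datetime.strptime(
                f"{alert_time.year} {line[:15]}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            continue  # skip lines without a leading syslog timestamp
        if abs(stamp - alert_time) <= window:
            matches.append(line.rstrip("\n"))
    return matches


# Made-up sample lines, for illustration only.
sample = [
    "Apr 24 19:40:00 host1 kernel: sample message A\n",
    "Apr 24 19:52:10 host1 kernel: sample message B\n",
    "a line without a timestamp\n",
    "Apr 24 21:00:00 host1 kernel: sample message C\n",
]
nearby = lines_near(sample, datetime(2020, 4, 24, 19, 54))
```

For a real check you would pass an open file instead of the sample list, e.g. `lines_near(open("/var/log/messages"), alert_time)`, which usually needs root privileges to read.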

You may also want to use any "lights out" management interface (e.g. IPMI, Dell's iDRAC, HP's iLO, or whatever) to query the hardware logs and see if any of them report problems.


agent.ping dropout on all hosts

