agent.ping dropout on all hosts

  • extremeeryan
    Junior Member
    • Jul 2019
    • 7

    #1

    Twice in the past three months, all hosts in our organization have reported "Radio Silence from Agent" at once and sent a flurry of false positives. The "Radio Silence from Agent" trigger is configured as follows:
    Code:
    {host:agent.ping.nodata(300)}=1
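
As a rough illustration of what the nodata(300) condition in that trigger amounts to: it depends on when the server last stored a value for the item, not on whether the agent itself is healthy. The sketch below is only a model of that idea. The function name and arguments are made up for illustration, and the real Zabbix server evaluates this internally on its own schedule.

```python
import time
from typing import Optional


def nodata(last_value_ts: Optional[float], period_s: int,
           now: Optional[float] = None) -> int:
    """Return 1 if no value was stored within the last period_s seconds, else 0.

    last_value_ts is the (hypothetical) Unix time at which the server last
    stored a value for the item; None means no value was ever stored.
    """
    if now is None:
        now = time.time()
    if last_value_ts is None:
        return 1
    return 1 if (now - last_value_ts) > period_s else 0
```

Under this model, if the server stops storing values for every item at the same moment, every host crosses the 300-second threshold together, which would match the symptom of all hosts alerting at once.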
    Sanity checking on the servers revealed no problems with the agent, and the fact that it happened on all of our hosts at once seems to indicate a server-side problem. The first time it happened we did some basic troubleshooting and confirmed the following:
    • The zabbix server was still able to reach hosts via ping and ssh and communicate with port 10050.
    • zabbix_get was still able to retrieve the return value for agent items
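
The port 10050 check above can be repeated quickly across many hosts the next time this happens. The following is a minimal sketch, assuming plain TCP reachability is what you want to confirm. The function names and the 3-second timeout are illustrative choices, and it does not speak the agent protocol, so zabbix_get is still the tool for confirming that item values actually come back.

```python
import socket
from typing import Dict, Iterable


def agent_port_open(host: str, port: int = 10050, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_hosts(hosts: Iterable[str], port: int = 10050) -> Dict[str, bool]:
    """Check each host in turn and return a host -> reachable mapping."""
    return {host: agent_port_open(host, port) for host in hosts}
```

Running check_hosts() from the Zabbix server itself, with the list of affected hostnames, and saving the result with a timestamp gives you something concrete to compare against the time the alerts started.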
    The Zabbix server log stopped logging at the time of the crash. The zabbix-server service was still running without showing any obvious errors. CPU and memory usage was not higher than normal. Looking at the graphs and latest data, all data stopped being logged right at the time that the crash occurred.

    At this point we were out of ideas and needed to restore functionality to avoid more downtime, so we restarted the zabbix-server service. All of the triggers resolved, host data started populating again, and logging resumed.

    Hoping to find insight into what may have gone wrong and to learn of any helpful debugging strategies in case this happens again.
  • tim.mooney
    Senior Member
    • Dec 2012
    • 1427

    #2
    At my workplace, when we've experienced a batch of false-positive pings, it has historically been because I haven't had network topology dependencies set up correctly. If there's a network device between your Zabbix server and all your hosts (like a router), and that device goes offline without being correctly listed as a dependency for your agent.ping item, then all your hosts will alert. This has happened in my environment because our network team has been upgrading components of our network, so the device(s) between the datacenter where the Zabbix server is and the monitored hosts have changed over time. If I'm not aware of the change and don't update the topology dependencies, then a network outage or upgrade can generate a lot of false positives for hosts, when it really should only generate one alert for the network device.
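
To make the dependency idea concrete, here is a small sketch of the suppression logic: a node only alerts if nothing upstream of it is also down. This is not Zabbix code or its API (in Zabbix this is configured through trigger dependencies); the function and the host names in the usage note are hypothetical.

```python
from typing import Dict, List, Set


def alerts_to_send(down: Set[str], depends_on: Dict[str, str]) -> List[str]:
    """Return the down nodes that have no down node anywhere upstream.

    depends_on maps a node to the single device it sits behind, for example
    {"web1": "router1"}. Nodes without an entry have no upstream dependency.
    """
    roots = []
    for node in sorted(down):
        parent = depends_on.get(node)
        seen = set()          # guards against accidental dependency loops
        suppressed = False
        while parent is not None and parent not in seen:
            if parent in down:
                suppressed = True
                break
            seen.add(parent)
            parent = depends_on.get(parent)
        if not suppressed:
            roots.append(node)
    return roots
```

With depends_on = {"web1": "router1", "web2": "router1"} and all three nodes down, only "router1" is returned. If the map is stale and the hosts now sit behind a different device, nothing is suppressed and every host alerts, which is the failure mode described above.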

    In your situation, you checked connectivity and that didn't appear to be the problem. But did you check it quickly enough to be certain that there hadn't been a network outage of roughly 5 minutes that corrected itself, perhaps right before you started looking?

    If it truly was the server, there should be something in the system logs (not necessarily the zabbix-server.log). For example, a failing disk that's causing I/O timeouts can cause very long hangs on a system. The I/O stall should eventually get logged to either the system's in-memory kernel log buffer (see the dmesg command) or (if /var is on a disk volume that is acting normally) it should eventually get written to one of the logs there. On a RedHat-like system, I would check /var/log/messages for anything suspicious around the time of the alerts.
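
If you want to narrow down /var/log/messages to the period around the alerts, something like the following sketch could help. It assumes traditional syslog timestamps (for example "Jul  9 03:12:45") at the start of each line, and the keyword list is only my guess at what is worth flagging; adjust both for your system.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, List

# Assumed keyword list; extend it with whatever looks relevant on your system.
SUSPICIOUS = re.compile(
    r"I/O error|blocked for more than|timed out|Out of memory",
    re.IGNORECASE,
)


def suspicious_lines(lines: Iterable[str], alert_time: datetime,
                     window: timedelta = timedelta(minutes=15)) -> List[str]:
    """Return lines within +/- window of alert_time that match SUSPICIOUS.

    Traditional syslog timestamps carry no year, so the year of alert_time
    is assumed. Lines whose first 15 characters do not parse are skipped.
    """
    hits = []
    for line in lines:
        try:
            ts = datetime.strptime(
                f"{alert_time.year} {line[:15]}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            continue
        if abs(ts - alert_time) <= window and SUSPICIOUS.search(line):
            hits.append(line.rstrip("\n"))
    return hits
```

You would pass it an open file, e.g. suspicious_lines(open("/var/log/messages", errors="replace"), alert_time), where alert_time is the time the first "Radio Silence from Agent" alert fired. The same function works on saved dmesg output only if that output uses the same timestamp format.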

    You may also want to use any "lights out" management interface (e.g. IPMI, Dell's iDRAC, HP's iLO, or whatever your hardware provides) to query the hardware logs and see whether they report any problems.
