Twice in the past three months, all hosts in our organization have reported "Radio Silence from Agent" at once and sent a flurry of false positives. The "Radio Silence from Agent" trigger is configured as follows:
Code:
{host:agent.ping.nodata(300)}=1
Sanity checking on the servers revealed no problems with the agent, and the fact that it happened on all of our hosts at once suggests a server-side problem. The first time it happened we did some basic troubleshooting and confirmed the following:
- The Zabbix server was still able to reach hosts via ping and SSH and to communicate with port 10050.
- zabbix_get was still able to retrieve the return value for agent items.
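For anyone who wants to repeat the port 10050 check across many hosts at once, here is a minimal Python sketch. The function names and the 3-second timeout are assumptions chosen for illustration, not part of Zabbix. It only confirms that the agent port accepts a TCP connection; it does not speak the agent protocol, so it is not a substitute for zabbix_get.

```python
import socket

AGENT_PORT = 10050  # agent port mentioned in the checks above


def agent_port_open(host: str, port: int = AGENT_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and connects; the context
        # manager closes the socket again as soon as the check is done.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def unreachable_hosts(hosts, port: int = AGENT_PORT, timeout: float = 3.0):
    """Return the hosts whose agent port did not accept a connection."""
    return [h for h in hosts if not agent_port_open(h, port, timeout)]
```

Run from the Zabbix server itself, a call such as unreachable_hosts(["app01.example.com", "db01.example.com"]) (hypothetical host names) returns an empty list when every agent port is reachable, which matches what we saw and points away from the network and the agents.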
At this point we were out of ideas and needed to restore functionality to avoid more downtime, so we restarted the zabbix-server service. All of the triggers resolved, host data started populating again, and logging resumed.
We are hoping to find insight into what may have gone wrong and to learn of any helpful debugging strategies in case this happens again.