After too many troubles this week, I have determined that using the Zabbix agent to check for host/service availability is a mistake.
First, a server went down, and agent.ping completely failed to alert anybody about it. The item just stopped gathering data: the last value was at about 3am, and the trigger didn't fire at all. So I changed it to a simple check using icmpping.
Last night the same server went into rescue mode, at about 3am again. In rescue mode, everything goes down except the network and SSH. So the "Server is down!" trigger didn't fire, but all the service checks for HTTP and the like went silent, just like agent.ping did. The server was effectively down while still technically up, and nobody knew about it until it was too late to prevent trouble. I have now changed as many checks as possible to simple checks.
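To illustrate the difference, here is a rough Python sketch of what a check made from the outside does. It is not how Zabbix implements its simple checks; the function name `tcp_service_up` and the default timeout are made up for the example. The point is only that a probe run from the monitoring side always produces an answer, 1 or 0, even when nothing on the monitored host is able to report.

```python
import socket


def tcp_service_up(host: str, port: int, timeout: float = 3.0) -> int:
    """Probe a TCP service from the outside.

    Returns 1 if a connection to host:port can be opened within the
    timeout, and 0 otherwise. Hypothetical helper for illustration only.
    """
    try:
        # create_connection raises OSError (refused, unreachable, timed out)
        # when the service cannot be reached.
        with socket.create_connection((host, port), timeout=timeout):
            return 1
    except OSError:
        return 0
```

With something like this, a host in rescue mode would give 0 for port 80 while still answering on port 22, which is exactly the "technically up, effectively down" situation described above.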
In sum, agent.ping is useless for availability alerts, because items using it will completely fail to alert you when the agent stops pinging. Be advised and use simple checks for that, especially when the service is vital.
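For anyone wondering why the trigger stays quiet: as far as I can tell from the Zabbix documentation, agent.ping returns 1 when the agent is available and returns nothing at all otherwise, so there is never a 0 for a trigger to react to. The Python sketch below is only a model of that logic, not Zabbix code; both function names and the 300-second threshold are invented for the example.

```python
from typing import Optional


def value_trigger(last_value: Optional[int]) -> bool:
    """Fires only if a value of 0 was actually received."""
    return last_value == 0


def staleness_trigger(last_seen: float, now: float, max_age: float = 300.0) -> bool:
    """Fires if no value has arrived for more than max_age seconds."""
    return (now - last_seen) > max_age


# The agent reported 1 at t=0 and then went silent; it is now one hour later.
last_value, last_seen, now = 1, 0.0, 3600.0

silent_failure_missed = not value_trigger(last_value)   # stays quiet
silent_failure_caught = staleness_trigger(last_seen, now)  # fires
```

If you must keep an agent item for some reason, Zabbix has a nodata() trigger function meant for this "no values for N seconds" case, but a simple check avoids the problem altogether.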