Which configuration parameter, "Timeout" or "UnreachablePeriod", triggers the "no ping" condition that raises this alert?
We get one or two "Zabbix Agent unreachable" alerts per day on random servers. During each episode, the Zabbix server logs a line like this every two minutes:
19636:20140219:154735.205 cannot send list of active checks to [10.0.4.172]: host [blahblah] not monitored
No other log entries are produced for the host. The host is up but busy, perhaps with high CPU utilization; we can't tell, because Zabbix collects no data during this period, which can last up to 30 minutes. For some reason, the "unreachable" alert and the "OK" recovery arrive at the same time, and only once Zabbix is able to poll the host again, so it's hard to catch the server "in the act" with top or ps.
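Since the alert and the recovery arrive together, one way to at least pin down when each episode started and ended is to group the repeated "not monitored" lines from the server log. Below is a rough Python sketch, assuming the log line format shown above (PID:YYYYMMDD:HHMMSS.mmm message) and a hypothetical 5-minute gap to separate one episode from the next; the function names are made up for illustration and are not part of Zabbix.

```python
import re
import sys
from datetime import datetime, timedelta

# Assumed Zabbix log layout: "PID:YYYYMMDD:HHMMSS.mmm message"
LOG_LINE = re.compile(r"^(\d+):(\d{8}):(\d{6})\.(\d{3}) (.*)$")
# Matches the "host ... not monitored" message quoted in this post
NOT_MONITORED = re.compile(
    r"cannot send list of active checks to \[([^\]]+)\]: host \[([^\]]+)\] not monitored"
)


def parse_line(line):
    """Return (timestamp, message) for a log line, or None if it doesn't match."""
    m = LOG_LINE.match(line.strip())
    if not m:
        return None
    _pid, date, clock, millis, message = m.groups()
    ts = datetime.strptime(date + clock, "%Y%m%d%H%M%S") + timedelta(milliseconds=int(millis))
    return ts, message


def outage_windows(lines, max_gap=timedelta(minutes=5)):
    """Group 'not monitored' messages per host into (host, first_seen, last_seen).

    Messages less than max_gap apart (a hypothetical threshold) count as one episode.
    """
    seen = {}  # host name -> list of timestamps
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, message = parsed
        m = NOT_MONITORED.search(message)
        if m:
            seen.setdefault(m.group(2), []).append(ts)

    windows = []
    for host, stamps in seen.items():
        stamps.sort()
        start = prev = stamps[0]
        for ts in stamps[1:]:
            if ts - prev > max_gap:  # gap too large: close the current episode
                windows.append((host, start, prev))
                start = ts
            prev = ts
        windows.append((host, start, prev))
    return windows


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python outage_windows.py <path to zabbix_server.log>
    with open(sys.argv[1]) as f:
        for host, start, end in outage_windows(f):
            print(f"{host}: {start} -> {end} ({end - start})")
```

With the episode times in hand, a cron job running top or ps in batch mode on a suspect server could then be lined up against those windows after the fact.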
Both Timeout and UnreachablePeriod are at their defaults: Timeout=3 and UnreachablePeriod=45. I'm thinking of raising Timeout to 6 and UnreachablePeriod to 120. Does that sound like a good approach?
Thanks
w