Hello All!
In short:
Let's say I monitor XXX hosts outside of my office.
I cut the internet cable and wait for 15 minutes.
On the Zabbix dashboard I see only 1 problem (as it should be), for my failed "Internet check" (this is thanks to the dependencies I've already set), BUT
after I restore the internet connection I get a lot of emails with "ZABBIX: PROBLEM: Zabbix agent on XXXX was unavailable for 5 mins"
and immediately another email with "ZABBIX: OK: Zabbix agent on XXXX is unreachable for 5 minutes".
I've spent a month trying different configurations, tests, etc., but with no success in stopping that post-outage email notification storm!
I was using Zabbix 3.2, then upgraded to 4.0.1 following the official instructions, and the issue is the same.
I use passive checks and follow the recommendation for pyramid-style check intervals along the dependency chain, e.g.:
"LEVEL0 Item - Internet check -- 30s" -> "LEVEL1 Item -- 1m" -> "LEVEL2 Items -- 2m" etc ....
Is this by design in Zabbix, or can I somehow stop these emails, which should not be sent at all?
I also have some Icinga and Nagios experience, but I can't remember ever having such an issue with a post-outage email storm.
Partial excerpt of the Zabbix server log right after the cable cut:
" 938:20181113:084312.326 executing housekeeper
938:20181113:084437.084 housekeeper [deleted 54545 hist/trends, 0 items/triggers, 33 events, 0 problems, 0 sessions, 0 alarms, 0 audit items in 84.757251 sec, idle for 1 hour(s)]
958:20181113:091519.432 Zabbix agent item "proc.num[,,run]" on host "XXXX" failed: first network error, wait for 15 seconds
960:20181113:091520.033 Zabbix agent item "system.cpu.intr" on host "XXXX" failed: first network error, wait for 15 seconds
961:20181113:091520.066 Zabbix agent item "service.info[Power,state]" on host "XXXX" failed: first network error, wait for 15 seconds
957:20181113:091524.078 Zabbix agent item "agent.ping" on host "XXXX" failed: first network error, wait for 15 seconds
962:20181113:091619.351 temporarily disabling Zabbix agent checks on host "XXXX1": host unavailable
962:20181113:091623.402 temporarily disabling Zabbix agent checks on host "XXXX2": host unavailable
962:20181113:091627.461 temporarily disabling Zabbix agent checks on host "XXXX3": host unavailable
962:20181113:091631.524 temporarily disabling Zabbix agent checks on host "XXXX4": host unavailable
935:20181113:091645.083 failed to send email: Timeout was reached: Connection timed out after 40000 milliseconds
936:20181113:091725.134 failed to send email: Timeout was reached: Connection timed out after 40001 milliseconds
937:20181113:091805.184 failed to send email: Timeout was reached: Connection timed out after 40001 milliseconds
935:20181113:091845.225 failed to send email: Timeout was reached: Connection timed out after 40000 milliseconds
962:20181113:091928.024 enabling Zabbix agent checks on host "XXXX1": host became available
962:20181113:091931.118 enabling Zabbix agent checks on host "XXXX2": host became available
962:20181113:091935.175 enabling Zabbix agent checks on host "XXXX3": host became available
962:20181113:091939.280 enabling Zabbix agent checks on host "XXXX4": host became available"
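To put numbers on how long each host stayed disabled, here is a minimal Python sketch that pairs the "temporarily disabling" and "enabling" lines from the excerpt above. It assumes the log line layout shown in the excerpt (PID:YYYYMMDD:HHMMSS.mmm followed by the message); the function name outage_windows and the regular expression are only my illustration and not part of Zabbix.

```python
import re
from datetime import datetime

# Assumed layout, taken from the excerpt above:
#   PID:YYYYMMDD:HHMMSS.mmm <message>
LINE_RE = re.compile(
    r'^\s*(?P<pid>\d+):(?P<date>\d{8}):(?P<time>\d{6}\.\d{3})\s+'
    r'(?P<action>temporarily disabling|enabling) '
    r'Zabbix agent checks on host "(?P<host>[^"]+)"'
)


def outage_windows(lines):
    """Return (host, start, end, seconds) for each disable/enable pair."""
    down = {}      # host -> time its checks were disabled
    windows = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip housekeeper, email and item-level lines
        ts = datetime.strptime(m["date"] + m["time"], "%Y%m%d%H%M%S.%f")
        host = m["host"]
        if m["action"] == "temporarily disabling":
            down.setdefault(host, ts)
        elif host in down:
            start = down.pop(host)
            windows.append((host, start, ts, (ts - start).total_seconds()))
    return windows


# Two hosts copied from the excerpt above.
SAMPLE = [
    '962:20181113:091619.351 temporarily disabling Zabbix agent checks on host "XXXX1": host unavailable',
    '962:20181113:091623.402 temporarily disabling Zabbix agent checks on host "XXXX2": host unavailable',
    '935:20181113:091645.083 failed to send email: Timeout was reached: Connection timed out after 40000 milliseconds',
    '962:20181113:091928.024 enabling Zabbix agent checks on host "XXXX1": host became available',
    '962:20181113:091931.118 enabling Zabbix agent checks on host "XXXX2": host became available',
]

if __name__ == "__main__":
    for host, start, end, secs in outage_windows(SAMPLE):
        print(f"{host}: unavailable for {secs:.3f} s ({start.time()} -> {end.time()})")
```

In practice the same function could be fed the lines of the real server log file instead of SAMPLE, which would show whether each host was marked unavailable for longer than the 5-minute trigger window.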
In my opinion, the log should show something like "INET is down ... wait for xx secs" and that's it.
And once the connection is restored, it should just log something like "Connection is OK now".
Help, ideas, suggestions are welcome!
Thanks in advance!
--
Valentin
