**tim.mooney** · 13-03-2024, 00:51

How often do you have Zabbix ping the router?

What is the trigger expression for the router, for detecting a problem? Is there a recovery expression?

How often do you collect the items for your hardware that is on the other side of the router?

**Quickeyed** · 20-03-2024, 10:42

Running Zabbix 7

I ping the router every 1m
Trigger expression for the router is: max(/router/icmpping,#3)=0
No recovery expression specified

The hardware behind the router is also pinged every 1m
Trigger expression for the hardware behind the router is the same as for the router, but different host names.

**Quickeyed** · 04-04-2024, 08:59

Anyone have any ideas?

**tim.mooney** · 04-04-2024, 20:03

Originally posted by Quickeyed

Running Zabbix 7

I ping the router every 1m
Trigger expression for the router is: max(/router/icmpping,#3)=0
No recovery expression specified

The hardware behind the router is also pinged every 1m
Trigger expression for the hardware behind the router is the same as for the router, but different host names.

Adding a recovery expression of

Code:

min(/router/icmpping,#2)=1

to your router trigger will probably fix the problem. It will cause the router to stay in the problem state slightly longer: instead of becoming "OK" after the first successful ping, the trigger won't transition to the OK state until it has had 2 consecutive successful pings. That should give the equipment behind the router time to be re-checked, so its own trigger expression is already false by the time the router trigger it depends on recovers.
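To make the effect of the recovery expression concrete, here is a minimal Python sketch that replays a series of ping values against the problem expression from this thread, with and without the suggested recovery expression. It is a toy model, not Zabbix code: the function name `evaluate` and the parameter `recovery_n` are made up for illustration, and it assumes a trigger in the problem state only returns to OK when the problem expression is false and the recovery expression (if any) is true.

```python
def evaluate(pings, recovery_n=None):
    """Replay icmpping values (1 = reply, 0 = no reply) and return the
    trigger state after each value.

    Problem expression:  max of the last 3 values = 0
    Recovery expression: min of the last recovery_n values = 1
                         (no recovery expression when recovery_n is None)
    """
    history = []
    state = "OK"
    states = []
    for value in pings:
        history.append(value)
        problem = max(history[-3:]) == 0
        if state == "OK":
            if problem:
                state = "PROBLEM"
        else:
            # Assumption: recovery needs the problem expression to be false
            # AND the recovery expression (when one is set) to be true.
            recovered = recovery_n is None or min(history[-recovery_n:]) == 1
            if not problem and recovered:
                state = "OK"
        states.append(state)
    return states


pings = [1, 0, 0, 0, 0, 1, 1, 1]
print(evaluate(pings))                # recovers on the first successful ping
print(evaluate(pings, recovery_n=2))  # recovers one check later
```

With `recovery_n=2` the trigger stays in the problem state for one extra check, which is the "slightly longer" mentioned above.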

The problem is a timing issue:

1. Router is fine (ping checks are returning 1).
2. Hardware behind the router is fine (ping checks are returning 1).
3. Router fails its first ping check (value 0). Not yet a problem, because max(/router/icmpping,#3) is still 1.
4. Hardware behind the router fails its first ping check because the router is offline. Not yet a problem, because max(/hardware/icmpping,#3) is still 1.
5. Router fails 2nd ping check, still not a problem.
6. Hardware behind the router fails 2nd ping check, still not a problem.
7. Router fails 3rd ping check. The trigger expression is now true, so a problem event is generated for the router.
8. Hardware behind the router fails 3rd ping check. A problem event IS NOT generated, because the router trigger it depends on is in the problem state.
9. Some amount of time passes.
10. Router comes back online: the router ping check returns 1, the trigger expression is now false, so the router is no longer in the problem state.
11. BEFORE Zabbix re-checks the hardware behind the router, it recalculates whether any triggers are true. The hardware's last 3 values are still 0, so its trigger expression is true AND its dependency is no longer in the problem state, so Zabbix generates a problem event for this hardware.
12. Zabbix pings the hardware behind the router. The ping check is now successful, so max(/hardware/icmpping,#3) is 1 and the trigger expression is false, so Zabbix generates an OK event for the hardware.
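The sequence above can be replayed with a small Python sketch. This is only a toy model of the ordering described in this post, not how the Zabbix server is actually implemented: the `Trigger` class, the `simulate` function and the fixed "router value, re-evaluate triggers, hardware value" order within each minute are all assumptions made for illustration.

```python
class Trigger:
    """Toy trigger: problem when the last 3 values are all 0."""

    def __init__(self, name, recovery_n=None, parent=None):
        self.name = name
        self.recovery_n = recovery_n  # N in min(...,#N)=1, or None
        self.parent = parent          # trigger this one depends on
        self.history = []
        self.state = "OK"

    def evaluate(self):
        """Return "PROBLEM" or "OK" if an event is generated, else None."""
        if self.parent is not None and self.parent.state == "PROBLEM":
            return None  # dependency is in the problem state: no change
        if not self.history:
            return None
        problem = max(self.history[-3:]) == 0
        if self.state == "OK" and problem:
            self.state = "PROBLEM"
            return "PROBLEM"
        if self.state == "PROBLEM" and not problem:
            n = self.recovery_n
            if n is None or min(self.history[-n:]) == 1:
                self.state = "OK"
                return "OK"
        return None


def simulate(router_pings, hardware_pings, router_recovery_n=None):
    router = Trigger("router", recovery_n=router_recovery_n)
    hardware = Trigger("hardware", parent=router)
    events = []
    for minute, (r, h) in enumerate(zip(router_pings, hardware_pings)):
        router.history.append(r)
        # Both triggers are recalculated; hardware still has stale data.
        for trig in (router, hardware):
            event = trig.evaluate()
            if event:
                events.append((minute, trig.name, event))
        hardware.history.append(h)
        event = hardware.evaluate()
        if event:
            events.append((minute, "hardware", event))
    return events


pings = [1, 0, 0, 0, 0, 1, 1]
print(simulate(pings, pings))                       # spurious hardware events
print(simulate(pings, pings, router_recovery_n=2))  # router events only
```

Without a recovery expression the model produces a hardware problem event and a hardware OK event in the same minute the router recovers. With the 2-ping recovery expression on the router, only the router's own problem and OK events remain.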

The problem happens because your cadence for checking the router is exactly the same as your cadence for checking the equipment behind it, so there's a brief window between the router becoming OK again and Zabbix checking the hardware behind the router again. During this window the router is OK but the hardware hasn't been re-checked yet, so its trigger expression is still true, and that's what is causing the alerts you're receiving.

**cyber** · 05-04-2024, 08:09

And you can never actually know which one is checked first, your router or the hosts behind it (multiple pinger processes, queues etc.). So yes, as described very clearly above, make sure your router trigger stays "open" for longer than the triggers for the hosts. Checking the router more often than the hosts behind it also helps; in that case router issues are discovered earlier.
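As a rough illustration of the "check the router more often" point, the sketch below estimates how long after an outage starts a max(...,#3)=0 trigger can fire for a given check interval. The 30s router interval is a hypothetical value, the function name `detection_window` is made up, and the model assumes values arrive exactly on schedule and that the outage hits router and hosts at the same moment (no ping timeouts, no queue delays).

```python
def detection_window(interval_s, failures_needed=3):
    """Return (earliest, latest) seconds from the start of an outage until
    the failures_needed-th consecutive failed check has been collected.

    The first failed check lands somewhere in [0, interval_s) after the
    outage begins, and each further check follows interval_s later, so
    `latest` is an upper bound that is never quite reached.
    """
    earliest = (failures_needed - 1) * interval_s
    latest = failures_needed * interval_s
    return earliest, latest


router = detection_window(30)    # hypothetical 30s interval for the router
hardware = detection_window(60)  # 1m interval, as in this thread
print("router trigger fires within", router, "seconds")
print("hardware trigger fires within", hardware, "seconds")
print("router always detected first:", router[1] <= hardware[0])
```

Under these assumptions the router trigger is already in the problem state before any host trigger could fire, so the dependency is in place when it is needed. It does not remove the recovery race on its own, which is why the longer-lived router trigger is still worth having.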

Trigger dependencies
