Trigger dependencies

This topic has been answered.
  • Quickeyed
    Junior Member
    • Feb 2024
    • 5

    #1

    Trigger dependencies

    I monitor a bunch of hardware that's all dependent on a router. So I have every device's trigger for icmp ping unavailable dependent on the icmp ping unavailable trigger of the router.
    This all works. But as soon as the router comes back online, all the other triggers still fire, causing a flood of alerts.

    How can I prevent that from happening? When the router is offline, nothing else should trigger and neither should they trigger when the router is back online.
  • Answer selected by Quickeyed at 05-04-2024, 09:38.
    tim.mooney
    Senior Member
    • Dec 2012
    • 1427

    Originally posted by Quickeyed
    Running Zabbix 7

    I ping the router every 1m
    Trigger expression for the router is: max(/router/icmpping,#3)=0
    No recovery expression specified

    The hardware behind the router is also pinged every 1m
    Trigger expression for the hardware behind the router is the same as for the router, but different host names.
    Adding a recovery expression of

    Code:
    min(/router/icmpping,#2)=1
    to your router will probably fix the problem. It will cause the router to stay in the problem state slightly longer. Instead of becoming "OK" after the first successful ping, it won't transition to the OK state until it has had 2 successful pings. That should give the equipment behind the router time to also transition to the OK state before the dependency does.

    The problem is a timing issue:
    1. router is fine (ping checks are returning 1)
    2. hardware behind the router is fine (ping checks are returning 1)
    3. router fails (first ping check is 0), not yet a problem because max(/router/icmpping, #3) is still 1
    4. hardware behind the router fails first ping check because the router is offline, not yet a problem because max(/hardware/icmpping, #3) is still 1
    5. router fails 2nd ping check, still not a problem
    6. hardware behind the router fails 2nd ping check, still not a problem
    7. router fails 3rd ping check. The trigger expression is now true, so a problem event is generated for the router
    8. hardware behind the router fails 3rd ping check. A problem event IS NOT generated because the router dependency is in error.
    9. some amount of time passes.
    10. router comes back online, router ping check returns 1, trigger expression is now false, so router is no longer in error
    11. BEFORE Zabbix re-checks the hardware behind the router, it recalculates whether any triggers are true. The hardware's trigger is still true (its last three pings all failed) AND its dependency is no longer in the problem state, so Zabbix generates a problem event for this hardware.
    12. Zabbix pings the hardware behind the router. The ping check is now successful, so max(/hardware/icmpping, #3) is 1 and the trigger expression is false, so Zabbix generates an OK event for the hardware behind the router.
    The problem happens because your cadence for checking the router is exactly the same as your cadence for checking the equipment behind it, and there's a brief window between when the router becomes OK again and before Zabbix checks the hardware behind the router again. It's this window where the router is now OK but the hardware hasn't been re-checked and switched back to OK that's causing the alerts you're receiving.
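
To make the sequence above concrete, here is a minimal Python sketch that replays steps 1 to 12 minute by minute. It is a simplified model of the ordering described in this post, not Zabbix's actual internals: the function names (`problem_expr`, `recovery_expr`, `simulate`) and the sample ping series are made up for illustration, and it assumes the router value always arrives just before the hardware value.

```python
def problem_expr(history):
    """Model of max(/host/icmpping,#3)=0: the last three pings all failed."""
    return len(history) >= 3 and max(history[-3:]) == 0


def recovery_expr(history, count):
    """Model of min(/host/icmpping,#count)=1: the last `count` pings all succeeded."""
    return len(history) >= count and min(history[-count:]) == 1


def simulate(router_pings, hardware_pings, recovery_count=None):
    """Return the (minute, event) list for the hardware behind the router.

    recovery_count=None models a router trigger without a recovery expression;
    recovery_count=2 models a recovery expression of min(/router/icmpping,#2)=1.
    """
    router_hist, hw_hist = [], []
    router_problem = False
    hw_problem = False
    events = []

    def evaluate_hardware(minute):
        nonlocal hw_problem
        if router_problem:
            return  # dependency in problem state: the hardware trigger is held back
        firing = problem_expr(hw_hist)
        if firing != hw_problem:
            hw_problem = firing
            events.append((minute, "PROBLEM" if firing else "OK"))

    for minute, (r, h) in enumerate(zip(router_pings, hardware_pings)):
        router_hist.append(r)
        if problem_expr(router_hist):
            router_problem = True
        elif recovery_count is None or recovery_expr(router_hist, recovery_count):
            router_problem = False
        evaluate_hardware(minute)  # step 11: evaluated against the OLD hardware pings
        hw_hist.append(h)
        evaluate_hardware(minute)  # step 12: evaluated after the new hardware ping
    return events


pings = [1, 1, 0, 0, 0, 0, 1, 1, 1]  # four failed minutes, then recovery
print(simulate(pings, pings))                    # [(6, 'PROBLEM'), (6, 'OK')]
print(simulate(pings, pings, recovery_count=2))  # []
```

In this model the spurious PROBLEM/OK pair appears in the minute the router recovers, and it disappears once the router needs two good pings to recover, because by then the hardware's last three pings are no longer all failures.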


    • Quickeyed
      Quickeyed commented
      Thank you so much
  • tim.mooney
    Senior Member
    • Dec 2012
    • 1427

    #2
    How often do you have Zabbix ping the router?

    What is the trigger expression for the router, for detecting a problem? Is there a recovery expression?

    How often do you collect the items for your hardware that is on the other side of the router?


    • Quickeyed
      Junior Member
      • Feb 2024
      • 5

      #3
      Running Zabbix 7

      I ping the router every 1m
      Trigger expression for the router is: max(/router/icmpping,#3)=0
      No recovery expression specified

      The hardware behind the router is also pinged every 1m
      Trigger expression for the hardware behind the router is the same as for the router, but different host names.


      • Quickeyed
        Junior Member
        • Feb 2024
        • 5

        #4
        Anyone have any ideas?


        • cyber
          Senior Member
          Zabbix Certified Specialist, Zabbix Certified Professional
          • Dec 2006
          • 4806

          #6
          And you can never actually know which one is checked first, your router or the hosts behind it (multiple pinger processes, queues, etc.). So yes, as described very clearly above, make sure your router trigger stays "open" for a longer time than the trigger for the hosts. Checking the router more often than the hosts behind it also helps; in that case router issues are discovered earlier.
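
As a rough illustration of why a shorter router interval helps, the sketch below estimates how long after an outage starts a "last N pings all failed" trigger can first fire. The 30s router interval is a hypothetical value picked for the example (the thread uses 1m for everything), `detection_window` is a made-up helper, and the arithmetic ignores ping timeouts and processing delays.

```python
def detection_window(interval_s, failed_checks):
    """Bounds, in seconds after the outage starts, for when a trigger like
    max(/host/icmpping,#failed_checks)=0 can first become true.

    Earliest: the first failing check lands right as the outage starts.
    Latest (exclusive): the first failing check lands almost a full interval later.
    """
    earliest = (failed_checks - 1) * interval_s
    latest = failed_checks * interval_s
    return earliest, latest


router = detection_window(30, 3)  # hypothetical: router pinged every 30s
hosts = detection_window(60, 3)   # hosts pinged every 1m, as in the thread

print(router)  # (60, 90)
print(hosts)   # (120, 180)

# Under these assumptions the router trigger always fires before any host trigger can.
print(router[1] <= hosts[0])  # True
```

With equal intervals the two windows are identical, so which trigger fires first depends on scheduling order, which is the uncertainty described above.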
