Dependency Detection Doesn't Always Work?

  • kberrien
    Member
    • Mar 2007
    • 43

    #1

    Dependency Detection Doesn't Always Work?

    We've run Zabbix for many years, and recently upgraded from v1.8 to the v5 branch. We mostly use ping up/down status.

    I half expected to see improvements to the dependency function after moving to version 5. But when a major storm caused a power outage in our environment after the move to v5, I found that not all of my dependency configuration worked.

    To provide BASIC dependency operation, I set devices within a building as dependent on the building core switch. During this recent outage, when some buildings lost power and UPS exhausted their batteries a handful of devices alerted even though they were dependent, and other devices did not alert.

    Does anyone know how the dependency checks actually work, in a way that might explain the behavior I'm seeing? I would imagine that if a test comes up negative (i.e., no ping), the software would IMMEDIATELY check/recheck the status (in this case, ping) of the device it depends on before triggering an alarm. That, I would think, would ensure no trigger fires while the parent device is offline.

    Or do checks run in some sort of processing order, so that when a device tests negative, its trigger is processed based on the last test of the device it depends on (whether that test came earlier or later in the processing order)? Then, depending on the processing order and failure timing, some dependent devices would trigger because the last test of their parent device was not yet in a failed state, even though the parent did enter a failure state during that same processing loop.
  • tim.mooney
    Senior Member
    • Dec 2012
    • 1427

    #2
    Originally posted by kberrien
    To provide BASIC dependency operation, I set devices within a building as dependent on the building core switch. During this recent outage, when some buildings lost power and UPS exhausted their batteries a handful of devices alerted even though they were dependent, and other devices did not alert.

    Does anyone know how the dependency checks actually work, in a way that might explain the behavior I'm seeing? I would imagine that if a test comes up negative (i.e., no ping), the software would IMMEDIATELY check/recheck the status (in this case, ping) of the device it depends on before triggering an alarm. That, I would think, would ensure no trigger fires while the parent device is offline.

    Or do checks run in some sort of processing order, so that when a device tests negative, its trigger is processed based on the last test of the device it depends on (whether that test came earlier or later in the processing order)? Then, depending on the processing order and failure timing, some dependent devices would trigger because the last test of their parent device was not yet in a failed state, even though the parent did enter a failure state during that same processing loop.

    As far as I can tell, it's the 2nd scenario you described, and there is no "recheck the dependencies before treating this as a problem".

    It's essentially a race condition. If the dependency is checked first and a problem event is detected, then you're OK. Otherwise, not.
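    To make the race concrete, here's a toy model (not Zabbix's actual implementation; host names and the single dependency are made up) showing how the same outage can or cannot be suppressed depending purely on evaluation order:

```python
def evaluate(order, down):
    """Toy model of dependency suppression under an evaluation order.

    order: list of host names, whose triggers are evaluated one at a time
    down:  set of hosts that are actually unreachable
    Returns the set of hosts that raise an alert.
    Assumed dependency: 'edge-switch' depends on 'core-switch'.
    """
    problem = set()   # triggers already recorded in PROBLEM state
    alerts = set()
    for host in order:
        if host not in down:
            continue
        # Dependency suppression only sees PROBLEM states recorded so far.
        if host == "edge-switch" and "core-switch" in problem:
            problem.add(host)   # state still recorded, but alert suppressed
            continue
        problem.add(host)
        alerts.add(host)
    return alerts

down = {"core-switch", "edge-switch"}

# Parent evaluated first: the dependency suppresses the edge alert.
print(evaluate(["core-switch", "edge-switch"], down))

# Parent evaluated last: the edge device alerts even though it is dependent,
# because the core switch was not yet in PROBLEM state when the edge
# trigger was processed.
print(evaluate(["edge-switch", "core-switch"], down))
```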

    What you can do to reduce alerts from this type of thing is to check your dependencies as frequently as possible and check the other devices less frequently. Alternately, you can use triggers or escalations so that either the problem isn't raised right away (for example, 2 or 3 consecutive checks of a device need to fail before it's treated as a problem), or it's detected right away but an alert doesn't go out immediately (alerting happens on later escalation steps).

    These are both just workarounds, though. Having a way to force an immediate recheck of all dependencies would be very nice, but with multiple dependencies in a complicated network, doing that recheck can itself become complicated.

    • Clontarf[X]
      Member
      • Jan 2017
      • 80

      #3
      I am a supporter of the "more than one failed check before triggering" workaround, and it's good practice anyway to avoid trigger flapping.

      • kberrien
        Ok, thanks. Glad I've confirmed my suspicions about the application's behavior. I'm not sure how to change my rules to require two failures before alarming, and editing that for 600 hosts doesn't sound like fun...

      • Clontarf[X]
        Well, to solve the "updating hundreds of hosts" problem: your item should be in a Template, so all you have to do is change your trigger definition in the Template.

        Find the templated trigger and change the condition from last() (or whatever you might have) to something like last(3m) or whatever your criteria is. Voila, all your hundreds of hosts now have fixed triggers.
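        For reference, in the pre-5.4 trigger expression syntax that the v5.0 branch uses, a "several consecutive failures" condition on an ICMP ping item would look something like the following (the template name and item key here are assumptions; substitute your own):

```
# Fires only when the last 3 ping checks have all failed:
{Template ICMP Ping:icmpping.max(#3)}=0

# Or: fires only when no ping has succeeded in the last 3 minutes:
{Template ICMP Ping:icmpping.max(3m)}=0
```

        Using max() over several values (rather than last() on a single value) is what makes the trigger wait out a single missed ping, since max() is 0 only if every value in the window is 0.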
