We've run Zabbix for many years, and recently upgraded from v1.8 to the v5 branch. We mostly use ping up/down status.
I half expected dependency handling to work better after moving to version 5. But during a major storm power outage that hit our environment after the move to v5, I found that not all of my dependency configuration worked.
To provide BASIC dependency operation, I set the devices within each building as dependent on that building's core switch. During this recent outage, when some buildings lost power and the UPSes exhausted their batteries, a handful of devices alerted even though they were configured as dependent, while other devices did not alert.
Does anyone know how the dependency checks actually work, in a way that might explain the behavior I saw? I would imagine that if a test comes up negative (i.e., no ping), the software would IMMEDIATELY check or recheck the status (in this case, ping) of the device it depends on before triggering an alarm. That, I would think, would ensure no trigger fires when the device it depends on is offline.
Or do checks run in some sort of processing order, so that when a device tests negative, its trigger is processed against the last test result for the device it depends on (whether that test ran earlier or later in the processing order)? In that case, depending on the processing order and the timing of the failure, some dependent devices would trigger because the last test of the device they depend on had NOT yet failed, even though that device did in fact enter a failure state during the same processing loop.
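To make that second guess concrete, here is a toy Python sketch of the race I have in mind. It is only a model of my hypothesis, not how Zabbix is actually implemented. The host names (`core`, `ap1`, `ap2`), the `evaluate` function, and the single-pass polling loop are all made up for illustration.

```python
def evaluate(poll_order, reachable, parent_of):
    """Hypothetical model: poll each host once, in order.

    A host that fails its ping raises an alert unless the LAST KNOWN
    state of the host it depends on is already "down".
    """
    # Everything was up at the previous polling pass (assumption).
    last_known = {host: "up" for host in poll_order}
    alerts = []
    for host in poll_order:
        state = "up" if reachable[host] else "down"
        last_known[host] = state
        if state == "down":
            parent = parent_of.get(host)
            # No parent, or parent not yet seen as down: alert fires.
            if parent is None or last_known[parent] == "up":
                alerts.append(host)
    return alerts


# Both access points depend on the building core switch.
parent_of = {"ap1": "core", "ap2": "core"}
# The whole building has lost power.
reachable = {"core": False, "ap1": False, "ap2": False}

# ap1 happens to be polled before the core switch, ap2 after it.
unlucky = evaluate(["ap1", "core", "ap2"], reachable, parent_of)
# The core switch is polled first.
lucky = evaluate(["core", "ap1", "ap2"], reachable, parent_of)

print(unlucky)  # ['ap1', 'core']
print(lucky)    # ['core']
```

In this model `ap1` alerts only because it was tested before the core switch was seen as down, while `ap2` stays quiet. That mix of alerting and silent dependent devices is the pattern I saw during the outage.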