I am working with a network that has a lot of brief outages, and am trying to make the alerts from triggers more meaningful. That means adding more hysteresis to triggers, and in particular evaluating some over longer periods of time to separate serious outages from brief flickers.
The two related issues I am having are with the simple checks:
1) Ping loss reports 100% when the node is down
2) Ping response reports 0 seconds when the node is down.
Now the first is of course correct, but it is also redundant, and misleading to include in an average when looking for loss problems -- down is down, and averaging 100% loss into longer time periods speaks to no connection rather than to a poor connection. I would rather it report no data.
But the second one is just silly -- recording a numeric zero misleadingly drags any average down. For example, four samples at 80 ms plus one outage sample recorded as 0 average out to 64 ms, which looks like an improvement. It should also report no data.
Fundamentally I want trigger expressions on these items, such as .avg(20m), to be meaningful, and certainly the 0 isn't; I would argue the 100% loss is not either (not when one is separately reporting complete outages).
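To make that concrete (router1 is a hypothetical host, written in the old {host:key.function} trigger syntax):

{router1:icmppingsec.avg(20m)}>0.2

If the node was down for half of that window, the recorded zeros drag the average below the threshold exactly when the connection is at its worst, so the trigger never fires.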
But I do not see any way to make that happen short of doing the ping in an external check, which is what I'm leaning toward: do the ping as an external check that reports up/down, then have it use zabbix_sender to record (or not) items for loss and response only if the node is not down.
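Something like this rough sketch is what I'm picturing (untested; it assumes fping and zabbix_sender are on the path, and net.loss/net.rtt are hypothetical item keys -- they would have to be items of type "Zabbix trapper" for zabbix_sender to feed them):

#!/usr/bin/env python3
# Rough sketch: external check that prints up/down, and pushes loss/rtt
# via zabbix_sender only when the node answers at all.
import re
import subprocess
import sys

ZABBIX_SERVER = "zabbix.example.com"  # assumption: your server or proxy
PING_COUNT = 5

def ping(host):
    """Run fping in count mode; return (loss_pct, avg_ms), avg_ms None if down."""
    # fping -c prints a per-host summary on stderr, e.g.
    # "host : xmt/rcv/%loss = 5/5/0%, min/avg/max = 0.51/0.63/0.89"
    out = subprocess.run(["fping", "-c", str(PING_COUNT), "-q", host],
                         capture_output=True, text=True).stderr
    m_loss = re.search(r"/(\d+)%", out)
    if not m_loss:
        return 100, None  # could not even parse a summary: treat as down
    m_avg = re.search(r"min/avg/max = [\d.]+/([\d.]+)/", out)
    return int(m_loss.group(1)), (float(m_avg.group(1)) if m_avg else None)

def send(host, key, value):
    subprocess.run(["zabbix_sender", "-z", ZABBIX_SERVER, "-s", host,
                    "-k", key, "-o", str(value)], check=True)

if __name__ == "__main__":
    host = sys.argv[1]
    loss, avg = ping(host)
    up = 0 if loss == 100 else 1
    print(up)  # the external check item itself just reports up/down
    if up:
        # record loss/response only while reachable, so .avg() on these
        # items never sees the outage at all
        send(host, "net.loss", loss)
        send(host, "net.rtt", avg)  # note: fping's avg is in milliseconds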
But switching pings to an external check is probably a pretty big performance hit, so I am wondering whether anyone knows of another solution?
Again... I know about trigger dependencies, and do not see how they solve the problem, as the concern is the data collected and the aggregate functions run on it.
PS. I found one of these filed as an open change suggestion from 2012, with no real conclusion in the discussion here.