To reduce false positives from blips and spikes, I'm trying to work out the best way to implement triple checks on most of our triggers and notifications.
I've thought of two ways to do this, and I'm just looking for feedback/input on how others do the same thing. Keep in mind I've done monitoring in the past but am new to Zabbix, so while I get the concepts, I may not know the right terminology or what a good way to do things in Zabbix is.
a) Have triggers logged and displayed on the console, but not send out e-mail notifications unless we get three consecutive trigger failures. The trigger log would then show us when things became a problem but wouldn’t spam our e-mail. Admittedly I'm not sure how to do this. I initially thought of escalations with their timing lined up with the item checks, but I'm not sure this is the wisest way, since there's no way to be sure the times stay in sync if someone changes the item timing or the escalation timing.
OR
b) Not have anything triggered on the console or logged as a failure unless we’ve failed three times in a row. A report would then have to be written to look for small blips or false positives along the way against the individual items. Thinking a complex trigger expression checking that the item has had the same "issue" using .last(0), .last(1) and .last(2) to check the last three values that were recorded.
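To make the difference between the two options concrete, here is a rough Python sketch of the logic I have in mind. This is not Zabbix syntax, just the behaviour I'm after; the threshold of 90, the sample values, and all the function names are made up for illustration.

```python
from collections import deque

THRESHOLD = 90   # hypothetical "bad" level for an item value
REQUIRED = 3     # number of consecutive failed checks before acting


def option_a(values, threshold=THRESHOLD, required=REQUIRED):
    """Option (a): log every failed check, but only notify once a
    failure streak reaches `required` consecutive checks."""
    logged, notified = [], []
    streak = 0
    for i, value in enumerate(values):
        if value > threshold:
            streak += 1
            logged.append(i)          # always visible in the log/console
            if streak == required:
                notified.append(i)    # e-mail goes out once per streak
        else:
            streak = 0                # any good check resets the streak
    return logged, notified


def option_b(values, threshold=THRESHOLD, required=REQUIRED):
    """Option (b): the trigger is in a problem state only while the
    last `required` recorded values all exceed the threshold."""
    recent = deque(maxlen=required)   # sliding window of the newest checks
    states = []
    for value in values:
        recent.append(value > threshold)
        states.append(len(recent) == required and all(recent))
    return states


samples = [10, 95, 20, 96, 97, 98, 99, 30]
logged, notified = option_a(samples)
states = option_b(samples)
print("a) logged at checks:", logged, "notified at checks:", notified)
print("b) problem state per check:", states)
```

With these sample values, option (a) still records the single blip at the second check but only notifies on the third failure in a row, while option (b) never shows that blip at all, which is why it would need a separate report against the raw item values.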