Currently we're on Zabbix 4.4.7, but we'll be moving to 5.0 LTS or later soon, so if there's improved functionality in the area I'm asking about in one of the 5.x series, don't hesitate to point it out.
When you have multiple separate triggers for increasing seriousness of a problem, is there a way (perhaps global event correlation?) to cancel or avoid sending a recovery operation message when a lower severity event ends and a higher severity problem begins?
As an example, I'm using a simple ping check item for a network link:
Code:
Name : % ping packet loss
Type : Simple check
Key : icmppingloss[,,200,32,1200]
Host interface : 10.12.95.161:10050
Type of Information : Numeric (float)
Then I have multiple separate triggers to identify varying levels of packet loss and treat them as different severities:
Trigger #1:
Code:
Name : {HOST.NAME}: partial ping packet loss
Severity : Information
Expression : {10.12.95.161:icmppingloss[,,200,32,1200].min(#2)}>0.0 and {10.12.95.161:icmppingloss[,,200,32,1200].max(#2)}<20.0
OK event generation : Expression
PROBLEM event generation mode: Single
OK event closes : All problems
Trigger #2:
Code:
Name : {HOST.NAME}: major ping packet loss
Severity : Warning
Expression : {10.12.95.161:icmppingloss[,,200,32,1200].min(#2)}>=20.0 and {10.12.95.161:icmppingloss[,,200,32,1200].max(#2)}<90.0
OK event generation : Expression
PROBLEM event generation mode: Single
OK event closes : All problems
Trigger #3:
Code:
Name : {HOST.NAME}: link down
Severity : High
Expression : {10.12.95.161:icmppingloss[,,200,32,1200].min(#2)}>=90.0
OK event generation : Expression
PROBLEM event generation mode: Single
OK event closes : All problems
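To make the three bands easier to reason about, here is a rough Python sketch of how I read the expressions above. This is not Zabbix configuration; the function name `active_triggers` is made up for illustration, and it assumes min(#2) and max(#2) are simply the minimum and maximum of the last two collected values.

```python
def active_triggers(last_two):
    """Return the names of the triggers whose expressions hold for the
    last two packet loss values (in percent). Hypothetical helper that
    mirrors the three trigger expressions; not part of Zabbix."""
    lowest, highest = min(last_two), max(last_two)
    fired = []
    # Trigger #1: min(#2) > 0.0 and max(#2) < 20.0
    if lowest > 0.0 and highest < 20.0:
        fired.append("partial ping packet loss")
    # Trigger #2: min(#2) >= 20.0 and max(#2) < 90.0
    if lowest >= 20.0 and highest < 90.0:
        fired.append("major ping packet loss")
    # Trigger #3: min(#2) >= 90.0
    if lowest >= 90.0:
        fired.append("link down")
    return fired


for sample in ([10.0, 10.0], [10.0, 70.0], [70.0, 70.0], [100.0, 100.0]):
    print(sample, active_triggers(sample))
```

Under that reading the bands never overlap, and a transition pair such as 10% followed by 70% matches none of the three expressions, so the lower trigger goes OK before the higher one fires.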
Now imagine that a problem starts out with 10% packet loss, so the first trigger fires and generates a problem of severity=Information.
Later, the problem gets worse and we have 70% packet loss.
Later still, the network device is offline and we have 100% packet loss.
I don't want a recovery message to go out for the "partial packet loss" when the "major ping packet loss" trigger starts to fire, and I don't want a recovery message to go out for the "major ping packet loss" when the "link down" trigger fires.
Now I realize that I could use a recovery expression with all three triggers that doesn't generate an OK event until packet loss = 0.0, but that just causes all 3 problems to generate their 3 recovery messages at the same time.
What I want is for the first problem event to go away without a recovery message if we transition to the "major ping packet loss" event. If, however, the problem never gets worse, and packet loss returns to 0.0% without ever getting above 20%, then the information-level recovery message would go out.
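In case it helps, here is a small Python sketch of the notification behavior I'm after. It only models the desired outcome and says nothing about how Zabbix would achieve it. The names `LEVELS` and `wanted_messages` are hypothetical, and the handling of a problem that improves from a higher band to a lower one is my own assumption, since I only described escalation and full recovery above.

```python
# Severity bands, ordered by seriousness; 0 means no problem is open.
LEVELS = {
    1: "partial ping packet loss",
    2: "major ping packet loss",
    3: "link down",
}


def wanted_messages(levels):
    """Return the (kind, trigger name) notifications I would like to
    receive for a sequence of severity levels over time.
    Hypothetical model of the desired behavior; not a Zabbix API."""
    out = []
    prev = 0
    for cur in levels:
        if cur == prev:
            continue  # nothing changed, nothing to send
        if prev != 0 and cur < prev:
            # The open problem cleared or improved: send its recovery.
            # (Improving to a lower band is an assumption on my part.)
            out.append(("RECOVERY", LEVELS[prev]))
        # If cur > prev > 0 this is an escalation: the lower problem
        # closes silently, with no recovery message.
        if cur != 0:
            out.append(("PROBLEM", LEVELS[cur]))
        prev = cur
    return out


# Escalate through all three bands, then recover completely.
print(wanted_messages([1, 2, 3, 0]))
# Never gets worse than the information level, then recovers.
print(wanted_messages([1, 0]))
```

In the first sequence only one recovery message goes out (for "link down"); in the second, the information-level recovery goes out as usual.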
I've never used global event correlation. For a feature that is fairly complicated and comes with some dire warnings about getting it wrong, the documentation has very little in the way of examples. However, from what I can tell, using global event correlation to close a lower-level problem when a higher-level problem takes over would still cause the recovery message to fire for the lower-level problem. If that's not a correct understanding, please correct me!
This same thing could apply to any number of other types of items (like % used on a filesystem or load average on an important server) where there is an increasing seriousness of the problem and you want a more serious problem event to cancel or override a less-serious problem event.
Does anyone have suggestions for how to accomplish this?
Thanks,
Tim