avoid multiple recovery messages from different triggers for the same item

  • tim.mooney
    Senior Member
    • Dec 2012
    • 1427

    #1

    avoid multiple recovery messages from different triggers for the same item

    Currently we're on Zabbix 4.4.7, but we'll be moving to 5.0 LTS or later soon, so if there's improved functionality in the area I'm asking about in one of the 5.x series, don't hesitate to point it out.

    When you have multiple separate triggers for increasing seriousness of a problem, is there a way (perhaps global event correlation?) to cancel or avoid sending a recovery operation message when a lower severity event ends and a higher severity problem begins?

    As an example, I'm using a simple ping check item for a network link:

    Code:
    Name               : % ping packet loss
    Type               : Simple check
    Key                : icmppingloss[,,200,32,1200]
    Host interface     : 10.12.95.161:10050
    Type of Information: Numeric (float)


    Then I have multiple separate triggers to identify varying levels of packet loss and treat them as different severities:

    Trigger #1:
    Code:
    Name                         : {HOST.NAME}: partial ping packet loss
    Severity                     : Information
    Expression                   :  {10.12.95.161:icmppingloss[,,200,32,1200].min(#2)}>0.0 and {10.12.95.161:icmppingloss[,,200,32,1200].max(#2)}<20.0
    OK event generation          : Expression
    PROBLEM event generation mode: Single
    OK event closes              : All problems


    Trigger #2:
    Code:
    Name                         : {HOST.NAME}: major ping packet loss
    Severity                     : Warning
    Expression                   :  {10.12.95.161:icmppingloss[,,200,32,1200].min(#2)}>=20.0 and {10.12.95.161:icmppingloss[,,200,32,1200].max(#2)}<90.0
    OK event generation          : Expression
    PROBLEM event generation mode: Single
    OK event closes              : All problems


    Trigger #3:
    Code:
    Name                         : {HOST.NAME}: link down
    Severity                     : High
    Expression                   :  {10.12.95.161:icmppingloss[,,200,32,1200].min(#2)}>=90.0
    OK event generation          : Expression
    PROBLEM event generation mode: Single
    OK event closes              : All problems
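To make the interaction between the three expressions concrete, here is a minimal Python sketch (not Zabbix code) that evaluates them against the two most recent loss values, which is what `min(#2)` and `max(#2)` look at. The function name `active_severity` is hypothetical, and the thresholds are copied from the trigger definitions above.

```python
from typing import Optional, Sequence


def active_severity(last_two: Sequence[float]) -> Optional[str]:
    """Return the severity of the trigger whose expression is true.

    `last_two` holds the two most recent packet-loss percentages.
    Returns None when none of the three expressions is true.
    """
    lowest, highest = min(last_two), max(last_two)
    if lowest >= 90.0:                      # Trigger #3: link down
        return "High"
    if lowest >= 20.0 and highest < 90.0:   # Trigger #2: major packet loss
        return "Warning"
    if lowest > 0.0 and highest < 20.0:     # Trigger #1: partial packet loss
        return "Information"
    return None


# Loss rising from 10% to 70%, one pair of values per check.
for pair in ([10.0, 10.0], [10.0, 70.0], [70.0, 70.0]):
    print(pair, active_severity(pair))
```

With these expressions at most one trigger is in a PROBLEM state at a time. When loss crosses a threshold, the pair `[10.0, 70.0]` matches none of them, so under this reading Trigger #1 goes to OK one check before Trigger #2 fires, assuming the next value stays around 70%.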


    Now imagine that a problem starts out with 10% packet loss, so the first trigger fires and generates a problem of severity=Information.

    Later, the problem gets worse and we have 70% packet loss.

    Later still, the network device is offline and we have 100% packet loss.

    I don't want a recovery message to go out for the "partial packet loss" when the "major ping packet loss" trigger starts to fire, and I don't want a recovery message to go out for the "major ping packet loss" when the "link down" trigger fires.

    Now I realize that I could use a recovery expression with all three triggers that doesn't generate an OK event until packet loss = 0.0, but that just causes all 3 problems to generate their 3 recovery messages at the same time.

    What I want is for the first problem event to go away without a recovery message if we transition to the "major ping packet loss" event. If, however, the problem never gets worse, and packet loss returns to 0.0% without ever getting above 20%, then the information-level recovery message would go out.
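The desired behaviour can be written down as a small state machine. This is only a sketch of the policy being asked for, not something Zabbix is known to provide. The function name `plan_messages` and the message strings are hypothetical, and one detail is an assumption that the post does not cover: a drop from a higher severity to a lower one sends nothing.

```python
from typing import List, Optional

SEVERITY_RANK = {"Information": 1, "Warning": 2, "High": 3}


def plan_messages(severities: List[Optional[str]]) -> List[str]:
    """List the messages to send for a sequence of observed severities.

    Each entry is the severity seen at one check; None means packet
    loss is back to 0.0%. Escalations send a PROBLEM message only, and
    a single RECOVERY message goes out for the highest severity reached.
    """
    messages: List[str] = []
    peak: Optional[str] = None  # highest severity notified in this incident
    for severity in severities:
        if severity is None:
            if peak is not None:
                messages.append(f"RECOVERY {peak}")
                peak = None
        elif peak is None or SEVERITY_RANK[severity] > SEVERITY_RANK[peak]:
            # New incident or escalation: no recovery for the lower level.
            messages.append(f"PROBLEM {severity}")
            peak = severity
        # Assumption: de-escalation (e.g. High -> Warning) is silent.
    return messages


print(plan_messages(["Information", "Warning", "High", None]))
print(plan_messages(["Information", None]))
```

In the first call the Information and Warning problems disappear without recovery messages; in the second, where the problem never gets worse, the Information-level recovery is sent.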

    I've never used global event correlation, and for something fairly complicated that comes with some dire warnings about getting it wrong, there are very few examples in the documentation. However, from what I can tell, using global event correlation to close a lower-level problem when a higher-level problem takes over would still cause the recovery message to fire for the lower-level problem. If that's not a correct understanding, please correct me!

    This same thing could apply to any number of other types of items (like % used on a filesystem or load average on an important server) where there is an increasing seriousness of the problem and you want a more serious problem event to cancel or override a less-serious problem event.

    Anyone have any suggestions for how to accomplish what I'm trying to accomplish?

    Thanks,

    Tim
  • johndoe2374
    Member
    • Aug 2021
    • 80

    #2
    Hello. I haven't tried it myself yet, but it looks like you could try trigger dependencies:


    Do you use different actions for each of these triggers? Maybe you could experiment with the action step duration: try increasing it, and stop the action for suppressed problems.


    • tim.mooney
      Senior Member
      • Dec 2012
      • 1427

      #3
      Hey John, I appreciate the response.

      We make widespread use of trigger dependencies in my environment already, both for network topology dependencies and for host dependencies. Trigger dependencies in Zabbix aren't perfect, but they can greatly reduce the number of spurious alerts that are generated.

      However, I don't see how trigger dependencies will help in the case I'm asking about. If I made Trigger #3 dependent on Trigger #2, and Trigger #2 dependent on Trigger #1, then if the problem started out small, so that Trigger #1 went into a PROBLEM state, Triggers #2 and #3 would never fire. That's not what I want; if the problem gets more serious, I do want the "higher-level" severity to show in the GUI and to be used for alerting. I just don't want the recovery messages from the lower-severity instances of the problem.

      If I instead put the dependencies in reverse, so that Trigger #1 is dependent on Trigger #2 and Trigger #2 is dependent on Trigger #3, it doesn't change the problem I've outlined. If the problem starts small, we'll still get the Information-level problem detected. If it escalates to the mid-level severity, then the trigger dependency (Trigger #1 depending on Trigger #2) would kick in, but because Trigger #1 was already in a PROBLEM state, it wouldn't matter. When the problem was resolved, the OK notification would still be sent for the Information-level PROBLEM that was originally detected.

      If you see a mistake in my thinking there, or a way to make dependencies work for this, please do correct me. I'm just not seeing how they would solve this issue, though.

      As far as the second part of your question, we don't use different actions for these triggers; we have a few actions that handle everything.

      Thanks,

      Tim


      • johndoe2374
        Member
        • Aug 2021
        • 80

        #4
        The logic behind trigger actions is that they run right after some event happens (usually when a trigger goes into a problem state, but there are other conditions too, such as when the problem is suppressed). It's as simple as that; you can't undo it. I think the only ways to stop a trigger action and its steps from being executed are for the trigger to change its state to "OK", for the problem to be suppressed, or to use the API.

        So my idea was to set up a different action for each trigger severity and delay the steps that send the notification: say, a default step duration of 5 minutes, with the first notification step set to 2-2, so the action does nothing for the first 5 minutes. Then, if a higher-severity trigger becomes a problem during those first 5 minutes, it suppresses the first problem via some mechanism (I think you'll have to dive into the API for that), so the first action never sends a notification at all. The higher-severity action also waits for some time before sending its own notification, and so on. Of course I could be wrong too, but I don't think you'll be able to reach your goal any other way.
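The timing of this delayed-step idea can be sketched in a few lines of Python. This is a simplified model, not Zabbix code: it assumes step N of an action starts (N - 1) step durations after the problem begins, and that the notification step is skipped if the problem has been resolved or suppressed by then. The function name `notification_sent` and its parameters are hypothetical.

```python
from typing import Optional


def notification_sent(
    problem_start: int,
    ended_or_suppressed_at: Optional[int],
    step_duration: int = 300,
    notify_step: int = 2,
) -> bool:
    """Return True if the delayed notification step would still run.

    Times are in seconds. `ended_or_suppressed_at` is None when the
    problem is still open and not suppressed.
    """
    # Step 1 runs immediately; each later step waits one step duration.
    send_time = problem_start + (notify_step - 1) * step_duration
    if ended_or_suppressed_at is None:
        return True
    # Assumption: a problem that ends exactly at send_time is not notified.
    return ended_or_suppressed_at > send_time


# Information-level problem superseded after 2 minutes: nothing is sent.
print(notification_sent(0, 120))
# Problem still open after 5 minutes: the notification goes out.
print(notification_sent(0, None))
```

The trade-off is that every notification, including the first one for a real outage, is delayed by the step duration.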
