Actions - Only Send Recovery Message After Problem Message is Sent

  • th3axman
    Junior Member
    • Mar 2020
    • 10

    #1

    I apologize if this has been fully addressed in another post, but I've spent a significant amount of time searching the Zabbix forums, the Zabbix documentation, and even the book Zabbix 4 Network Monitoring - Third Edition for answers. Unfortunately, I haven't found anything that really answers my questions and, at best, the information I have found is conflicting.

    Disclaimer: We're in the process of setting up Zabbix for the first time in our environment so we're rather new to the system.

    Server and Agent versions:
    • Zabbix Server Version: 4.0.11
    • Zabbix Agent Version: 4.0.12

    We've successfully delayed sending problem alert messages by 10 minutes using escalation in our Action. Our intention is to account for situations where an issue arises and then resolves itself within the 10-minute window, so that we only receive alerts for things we deem urgent. Unfortunately, Zabbix still sends recovery messages during that window even though no problem messages were sent.

    We would like to only send a recovery message after a problem message is actually sent. I understand that Recovery Operations does not have an escalation option, but it doesn't make much sense to send recovery messages when associated problem messages were never sent.

    I've tried using functions like min(10m) in our trigger expression, but the trigger flips to a problem state well before 10 minutes have passed, and problem and recovery messages are sent immediately. It's possible our trigger expressions aren't set up correctly and/or our testing methodology is flawed.

    I'm intentionally leaving details out to keep this post from being a million pages long, but I'm happy to provide additional details if needed.

    Any assistance is greatly appreciated!

    Thank you,

    Jason
  • tim.mooney
    Senior Member
    • Dec 2012
    • 1427

    #2
    I'm not using escalations much at my site, except in a couple of special cases to notify an additional level of people if a problem hasn't been acknowledged or fixed after a couple of hours. I've thought about switching more of our monitoring to use them, very similar to what you're trying to do, but the issue you've just brought up makes me glad I haven't done that yet.

    Originally posted by th3axman
    I've tried using functions like min(10m) in our trigger expression, but the trigger flips to a problem state well before 10 minutes have passed, and problem and recovery messages are sent immediately. It's possible our trigger expressions aren't set up correctly and/or our testing methodology is flawed.
    What I did to accomplish a similar thing at our site was make heavy use of last(), with a # argument for how many check periods we wanted to wait before treating a value as a problem.

    For example, if you check zabbix.agent.ping every minute, but you only want to be alerted if a box can't be reached for 10 consecutive minutes, then you can use last(#10) as part of the trigger, to effectively say "the last 10 consecutive readings need to be bad before this trigger treats it as a problem". If you check zabbix.agent.ping every 5 minutes, then you would use last(#2), etc.
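One caveat worth checking against the documentation for your version: in Zabbix 4.0, last(#N) returns only the Nth most recent value rather than examining all N of them, so a trigger built on last(#10) alone compares a single reading. The Python sketch below is a toy model, not Zabbix code; the helper names nth_most_recent and all_recent_bad are made up for illustration. It contrasts that single-value behavior with a check over all of the last N readings, which is closer to what functions such as max(#N) or count(#N,...) look at.

```python
from typing import Sequence


def nth_most_recent(history: Sequence[int], n: int) -> int:
    """Toy stand-in for last(#n): return only the n-th most recent value."""
    return history[-n]


def all_recent_bad(history: Sequence[int], n: int, good_value: int = 1) -> bool:
    """True only if every one of the n most recent values differs from good_value."""
    recent = history[-n:]
    return len(recent) == n and all(v != good_value for v in recent)


# Oldest reading first; 1 = good, 0 = bad (assumed encoding for this sketch).
# Only the third most recent reading is bad.
history = [1, 1, 1, 0, 1, 1]

# Looks at one old reading only, so it reports a problem.
single_value_check = nth_most_recent(history, 3) != 1

# Looks at all three recent readings, so it does not report a problem.
consecutive_check = all_recent_bad(history, 3)
```

In this toy run the single-value check fires on one stale bad reading, while the check over all three readings stays quiet until every one of them is bad.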

    The big downside to doing this is that we've essentially designed the triggers to ignore a problem for a period of time. It's why I was considering switching us to using escalations more, because I could have Zabbix detect problems more quickly (so they would show in the dashboard, and we would notice more intermittent problems) even if alerting doesn't start immediately.

    Using this method, there's even a downside for the recovery message. If you've written your trigger so that it's "something.last(#2) <> a_good_value", then the trigger won't transition back to the OK state until the last 2 bad values have transitioned out, so it takes longer for the trigger to go back to a good state. This can be avoided by using carefully written recovery operation checks, but in some situations it can be tricky to get the hysteresis right and avoid flapping.
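The interplay between a slow problem condition and a faster recovery condition can be sketched in Python. This is a toy model, assuming the recovery check is a separate condition that clears the problem as soon as one good value arrives; the class name ToyTrigger and its parameters are hypothetical, and this is not how Zabbix evaluates triggers internally.

```python
class ToyTrigger:
    """Enter the problem state after N consecutive bad values, leave on one good value."""

    def __init__(self, bad_needed: int = 2, good_value: int = 1) -> None:
        self.bad_needed = bad_needed
        self.good_value = good_value
        self.history: list = []
        self.in_problem = False

    def feed(self, value: int) -> bool:
        """Record one reading and return whether the trigger is in the problem state."""
        self.history.append(value)
        recent = self.history[-self.bad_needed:]
        if not self.in_problem:
            # Problem condition: all of the last bad_needed readings are bad.
            if len(recent) == self.bad_needed and all(
                v != self.good_value for v in recent
            ):
                self.in_problem = True
        elif value == self.good_value:
            # Recovery condition: a single good reading is enough.
            self.in_problem = False
        return self.in_problem


trigger = ToyTrigger()
# 1 = good, 0 = bad (assumed encoding for this sketch).
states = [trigger.feed(v) for v in [1, 0, 0, 0, 1, 0, 1]]
```

With this input the trigger only enters the problem state on the second consecutive bad reading, clears on the first good one, and ignores the isolated bad reading that follows.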

    If you find an elegant solution that allows you to continue to use escalations like you want to, please do follow up and post about it. I'm sure many people would find it beneficial.



    • th3axman
      th3axman commented
      I've found Dmitry Lambert's YouTube videos on Zabbix to be quite helpful. You can get to his YouTube page via this link: https://www.youtube.com/channel/UCUQ...CjwQZQGznTkvrQ

      In the video "Monitoring Triggers Explained in ZABBIX", he talks about how you have to be very careful about using last(). That's a good overall video to watch. Honestly, all of his videos are very good.
  • Christopher Gray
    Junior Member
    • Jan 2021
    • 2

    #3
    Has anyone figured out a best practice for this scenario? I'm getting recovery notifications sent to sleeping admins when the initial problem notification hasn't been sent yet (because the problem had not exceeded the initial step duration). Is the best approach to use an event tag of some sort in an action condition, or something else? Thanks.


    • cergoc85
      Junior Member
      • Apr 2021
      • 1

      #4
      We are using Zabbix 5.0.9, and for us, setting the recovery operation to "Notify all involved" seems to prevent recovery messages from being sent unless a problem message was sent first.
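As a rough mental model of why this would help, assume "Notify all involved" addresses the recovery message only to users who were already sent a problem message for the same event. If the delayed escalation step has not fired yet, nobody is involved and nothing goes out. The Python sketch below is a toy simulation with hypothetical names (ToyEvent, problem_tick, recover), not Zabbix's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEvent:
    """One problem event with a delayed first notification step."""

    delay_minutes: int = 10
    notified: set = field(default_factory=set)

    def problem_tick(self, minutes_in_problem: int, recipients: set) -> set:
        """Send the problem message only once the delay has elapsed."""
        if minutes_in_problem >= self.delay_minutes:
            new = recipients - self.notified
            self.notified |= new
            return new
        return set()

    def recover(self) -> set:
        """Recovery goes only to users who already got a problem message."""
        return set(self.notified)


# Problem clears after 3 minutes: no problem message was sent, so no recovery.
short_blip = ToyEvent()
short_blip.problem_tick(3, {"admin"})
short_recovery = short_blip.recover()

# Problem lasts 12 minutes: the problem message went out, so recovery follows.
long_outage = ToyEvent()
long_outage.problem_tick(12, {"admin"})
long_recovery = long_outage.recover()
```

Under that assumption, the short blip produces an empty recipient set for the recovery message, while the longer outage notifies the same admin who received the problem message.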



      • mouse
        mouse commented
        cergoc85, thanks, that works for me as well.