Suppress recovery notifications during maintenance

  • cdecarlo
    Junior Member
    • Nov 2019
    • 23

    #1

    Suppress recovery notifications during maintenance

    Hello,

    Is there a good way to suppress recovery alerts during a maintenance period? I'm running Zabbix 5.4 on Ubuntu Linux.

    I originally had the trigger action defined with the condition "Problem is NOT suppressed", and that did prevent any notification messages from being sent when triggers fired during a maintenance window.

    However, it also meant that anything still in a problem state when the maintenance period expired never sent any notification messages.

    My operations definition does have "Pause operations for suppressed problems" checked.

    What I would like to do is the following.

    1. If the host is not in maintenance then all trigger notifications should fire normally.
    2. If the host is in maintenance mode then no alerts should fire for as long as it stays in maintenance (meaning no problem or recovery alerts).
    3. Once a host comes out of maintenance mode, if a problem condition still exists, all triggers should fire and notifications should be sent normally.
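
The three requirements above can be read as a small decision rule. What follows is a minimal sketch in Python, purely for illustration: the function name, the event names, and the parameters are hypothetical and are not part of Zabbix or its API.

```python
def should_notify(event: str, in_maintenance: bool,
                  problem_still_active: bool = True) -> bool:
    """Return True if a notification should go out under the three rules above.

    `event` is one of "problem", "recovery", or "maintenance_end"
    (hypothetical labels used only in this sketch).
    """
    if event == "maintenance_end":
        # Rule 3: on leaving maintenance, notify only for problems that still exist.
        return problem_still_active
    if event in ("problem", "recovery"):
        # Rule 1: outside maintenance, notify normally.
        # Rule 2: inside maintenance, send neither problem nor recovery alerts.
        return not in_maintenance
    raise ValueError(f"unknown event: {event!r}")
```

For example, `should_notify("recovery", in_maintenance=True)` is `False`, which is the case that the stock behavior described in this thread does not honor.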

    I've just added a "Time Period not in..." condition (matching my maintenance window), but that seems to defeat the purpose of defining a maintenance mode, and I am worried that requirement 3 above will no longer work.

    It makes no sense to me that, when I have a maintenance mode defined and know that hundreds of items are going to be in a "problem" state, I receive hundreds of recovery notifications when they recover while still in maintenance mode.

    Is there any way to accomplish what I'm looking for here?

    Thanks,

    Cliff

  • Answer selected by cdecarlo at 07-12-2021, 23:23: ISiroshtan's post #7 below.

    • ISiroshtan
      Senior Member
      • Nov 2019
      • 324

      #2
      I was recently looking at a discussion of recovery messages in this thread.

      The use case is completely different, but the core question is still: "how to not send recovery if problem notification was not sent". It boiled down to using the "Notify all involved" option, which is briefly mentioned in the documentation. This option would send a recovery notification to a user only if that user also received the problem notification.

      As far as I could understand from your description, it might be exactly what you are looking for.

      Please note: I have never used this option myself and cannot provide much info about its practical usage.

      Hope it helps!


      • cdecarlo
        Junior Member
        • Nov 2019
        • 23

        #3
        Thanks, but unfortunately I already had the recovery operations set to "Notify all involved" and that still sends out recovery notifications during a maintenance window (even though nobody was notified because the problem notification was suppressed due to maintenance).

        From the docs on maintenance mode: "Note that problem recovery and update operations are not suppressed during maintenance, only escalations." That is exactly what I would like to be able to turn off as well. I would think that a maintenance period should suppress all notifications (with maybe the exception of an update that was manually entered by a user).


        • cdecarlo
          Junior Member
          • Nov 2019
          • 23

          #4
          What is also strange is that we had this scenario last night. The host in question was in maintenance from 02:00 AM until 07:00 AM.

          This is the sequence of events (recovery option was set to "Notify all involved")

          12/07/2021 03:51:11 AM - Problem Created
          12/07/2021 06:10:57 AM - Problem Resolved

          12/07/2021 06:11:02 AM - Problem email sent
          12/07/2021 06:11:05 AM - Recovery email sent

          All of this happened during the maintenance window. So I received the problem notification via email 5 seconds after the item recovered and then the recovery notification email 3 seconds after that.
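
The gaps quoted above can be checked directly from the four timestamps. This is a small sketch using Python's standard `datetime` module; the dictionary keys are labels invented for this example, and reading the dates as month/day/year is an assumption (it does not change the gaps, because all four events fall on the same date).

```python
from datetime import datetime

# Timestamps copied from the post. The month/day order is an assumption.
FMT = "%m/%d/%Y %I:%M:%S %p"

timeline = {
    "problem_created": "12/07/2021 03:51:11 AM",
    "problem_resolved": "12/07/2021 06:10:57 AM",
    "problem_email": "12/07/2021 06:11:02 AM",
    "recovery_email": "12/07/2021 06:11:05 AM",
}
parsed = {name: datetime.strptime(ts, FMT) for name, ts in timeline.items()}


def gap_seconds(earlier: str, later: str) -> int:
    """Seconds elapsed between two named events in the timeline."""
    return int((parsed[later] - parsed[earlier]).total_seconds())


print(gap_seconds("problem_resolved", "problem_email"))  # 5
print(gap_seconds("problem_email", "recovery_email"))    # 3
```

Both emails were therefore sent within eight seconds of the recovery, long after the problem itself was created.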

          This is generating a ton of spam email notifications.

          The condition of "Problem is not suppressed" blocked all of these email notifications during maintenance, but would never send notifications on anything that was still in a problem state after the maintenance period expired.


          • ISiroshtan
            Senior Member
            • Nov 2019
            • 324

            #5
            Hmmmm, reading the documentation on "Pause operations for suppressed problems", it says: "Mark this checkbox to delay the start of operations for the duration of a maintenance period. When operations are started, after the maintenance, all operations are performed including those for the events during the maintenance". So it means all operations will still be executed, just after the maintenance period (MP). And if your action sends the notification immediately, it would behave as you describe.

            Now what makes me curious: do the action step timers keep running while the MP is ongoing, or are they paused as well? If they are paused, you could try changing the action to add a small delay before sending the notification, let's say 1 minute. This way, after the MP, Zabbix should be able to close problems that no longer exist before the notifications are sent out.

            Following up on that idea (if it works, that is), you can create 2 different actions:
            One with "Problem is suppressed" = True, where you add said 60-second delay before sending the notification; this one would be used for MP notifications.
            The other with "Problem is suppressed" = False, where notifications are sent immediately; this one would be for normal trigger operation.

            It really needs to be tested. My lab Zabbix server doesn't have notifications set up, so I cannot test it quickly.


            • cdecarlo
              Junior Member
              • Nov 2019
              • 23

              #6
              O.K. Good thought on the delayed start to an operation. I did some testing with one of our test servers that is monitored by Zabbix and this is what I've found.

              I created a new trigger action configuration (with the only condition being that the host equals a certain host).
              This trigger action has the operation delayed for 1 minute (default operation step duration set to 1m and the only operation step is configured as step 2 which causes a 1 minute delay). The recovery/update options are set to "Notify all involved"
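
The delay described here follows from how escalation steps are spaced: step 1 runs immediately and each later step waits one step duration after the previous one. Below is a minimal sketch of that arithmetic, assuming every step uses the default step duration; the function name is hypothetical and this is not Zabbix code.

```python
from datetime import timedelta


def delay_before_step(step: int, step_duration: timedelta) -> timedelta:
    """Time between the start of operations and the given step.

    Assumes all steps share the same (default) duration.
    """
    if step < 1:
        raise ValueError("steps are numbered from 1")
    return (step - 1) * step_duration


one_minute = timedelta(minutes=1)
print(delay_before_step(1, one_minute))  # 0:00:00
print(delay_before_step(2, one_minute))  # 0:01:00
```

With a 1m step duration, an operation placed at step 2 therefore starts one minute after the action begins.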

              I left the existing trigger action as configured (without a "Problem is not suppressed" condition, and with the operation starting at step 1 for immediate execution).

              Both trigger actions have the "Pause operations for suppressed problems" checked.

              I put the host in maintenance and brought down one of the monitored items on that host.

              Neither trigger action caused any notifications to be sent when the item went down.

              I then re-started the monitored item. This caused the existing trigger action to send out a problem email followed by a recovered email a few seconds later. The new trigger definition did not send any notification emails (I assume because the problems were gone by the time the 1 minute delay passed).

              I then brought the item back down, waited a couple of minutes and then took the host out of maintenance mode.

              Both the new and existing trigger actions then fired the problem notification, with the new trigger action email coming in 1 minute after the existing one (as expected from the delayed start).

              If I update the existing trigger action to include the condition of "Problem is not suppressed" then the alert notification never gets sent by that trigger action when the host comes out of maintenance and the item is still down (which is NOT what I want to have happen).

              It looks like the solution is to start the operations of my trigger actions at step 2 after a 1 minute delay (and not worry about problem suppression). I cannot create two different trigger actions, because then I would be receiving duplicate notification email messages when items are not in maintenance.

              However, that will have the unfortunate side effect of delaying all notification emails by at least 1 minute after a problem has been detected when everything is being monitored normally (i.e. no maintenance period in effect).

              Unless somebody has a better idea?


              • ISiroshtan
                Senior Member
                • Nov 2019
                • 324

                #7
                Offtopic: should not read forum at 10 p.m., doing weird things... like editing this message 3 times

                So, back to the idea of 2 actions.

                Quoting cdecarlo: "If I update the existing trigger action to include the condition of "Problem is not suppressed" then the alert notification never gets sent by that trigger action when the host comes out of maintenance and the item is still down (which is NOT what I want to have happen)."
                That is what you actually want, because you will have a 2nd action, which is exactly the same but has the condition "Problem IS suppressed" and that extra 1-minute step where Zabbix does nothing.

                This way you will have no duplicates, because one action only sends notifications for triggers in normal operation mode and the 2nd one only for problems raised during maintenance mode. As a result, notifications for triggers in normal operation mode are not delayed by one minute.
                Last edited by ISiroshtan; 07-12-2021, 22:26.


                • cdecarlo
                  Junior Member
                  • Nov 2019
                  • 23

                  #8
                  Thank you! You are correct. I tried that exact scenario (one action with "Problem not suppressed" with no delay and one with "Problem is suppressed" with a 1 minute delay).

                  That configuration of trigger actions behaves exactly as desired.

                  In "normal" operating mode, the trigger action with the "Not suppressed" condition fires the notification immediately. In "maintenance", that trigger action does not fire at all (for either problem or recovery).
                  In "normal" operating mode, the trigger action with the "Suppressed" condition and the delayed operation does not fire notifications. In "maintenance", that trigger action is delayed until the maintenance mode is over, and then notification emails are sent.
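
The behavior of the two actions can be summarized in a toy model. This is only a sketch of the scenarios tested in this thread, not Zabbix's actual escalation logic: the class, function, and action names are hypothetical, times are plain minute counts, and only problem notifications are modeled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Problem:
    start: int                      # minute the problem was created
    recovery: Optional[int] = None  # minute it recovered, None if still active


def problem_notifications(problem: Problem,
                          maintenance: Optional[Tuple[int, int]] = None,
                          delay: int = 1) -> List[Tuple[str, int]]:
    """Return (action_name, minute) pairs for problem notifications.

    `maintenance` is a (start, end) window in minutes, or None.
    `delay` is the extra step before the suppressed action sends anything.
    """
    suppressed = (maintenance is not None
                  and maintenance[0] <= problem.start < maintenance[1])
    if not suppressed:
        # Only the "Problem is not suppressed" action matches: send at once.
        return [("not_suppressed_action", problem.start)]
    # Only the "Problem is suppressed" action matches. Its operations are
    # paused until maintenance ends, then wait out the delay step.
    send_at = maintenance[1] + delay
    still_active = problem.recovery is None or problem.recovery > send_at
    return [("suppressed_action", send_at)] if still_active else []


window = (120, 420)  # 02:00 to 07:00, expressed as minutes after midnight
print(problem_notifications(Problem(10)))                       # [('not_suppressed_action', 10)]
print(problem_notifications(Problem(130, recovery=250), window))  # []
print(problem_notifications(Problem(130), window))              # [('suppressed_action', 421)]
```

Because each problem matches exactly one of the two conditions, no problem is reported twice, and problems outside maintenance are not delayed.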

                  Thank you for your help!
