How can I delay alerts until business hours?

  • c.h.
    Junior Member
    • Dec 2021
    • 29

    #1

    How can I delay alerts until business hours?

    When I worked at AWS, there were two kinds of alerts that would page people:
    • Page immediately (Emergency, Severity 2 tickets)
    • Page during business hours (Less-urgent, Severity 2.5 tickets)
    I've set Zabbix up to only send Slack messages about a problem during business hours, which is good.
    However, when business hours start (9am), Zabbix doesn't send a Slack message about those problems that occurred during the night.
    A further refinement would be to only alert if the trigger is still triggered.

    How would I set this up?

    One workaround is to have Zabbix call a script, and let the script handle the business hour logic, adding delayed alerts to a queue. Link

    I'm not too keen on that, because it feels like Zabbix has all the necessary pieces to make it work without writing more code.
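
    For reference, the script-and-queue workaround mentioned above could look roughly like the sketch below. This is only an illustration under stated assumptions: the business hours (Monday to Friday, 09:00 to 17:00), the `urgent` field, and the `send` and `still_in_problem` callables are all hypothetical and not part of Zabbix.

```python
from datetime import datetime, time

# Assumed business hours: Monday-Friday, 09:00-17:00 (hypothetical values).
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def in_business_hours(now: datetime) -> bool:
    """True on Monday-Friday between BUSINESS_START and BUSINESS_END."""
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


def handle_alert(alert: dict, now: datetime, queue: list, send) -> None:
    """Send urgent alerts at once; hold the rest until business hours."""
    if alert["urgent"] or in_business_hours(now):
        send(alert)
    else:
        queue.append(alert)


def flush_queue(queue: list, now: datetime, still_in_problem, send) -> None:
    """During business hours, send held alerts whose problem is still active."""
    if not in_business_hours(now):
        return
    for alert in queue:
        if still_in_problem(alert):
            send(alert)
    queue.clear()
```

    Something like `flush_queue` would have to run periodically (for example from cron), which is exactly the extra moving part this approach adds.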
    Last edited by c.h.; 07-04-2022, 00:58.
  • tim.mooney
    Senior Member
    • Dec 2012
    • 1427

    #2
    We tried to do something very similar to this by using an additional condition of "Time period in ..." at the Action level. We wanted to do it at the action level so that Zabbix still detected problems after business hours and on the weekend; it just didn't take any action on those problems until business hours. That way, if someone happened to be looking at the dashboard after hours or on the weekend, they could see that there was a problem, even if Zabbix wasn't paging about it (yet).

    For the action configuration that I tried, I thought I was being super clever because I set it up to use a macro, like this:

    Code:
    Time period in   {$ERROR_ACTION_PERIOD}
    and then I set the global value for "{$ERROR_ACTION_PERIOD}" to "1-7,00:00-24:00", so that the default would be alerting any time of the day or night, but I could override the macro at the host level so that particular host wouldn't alert on weekends or after business hours.

    As I said, I thought it was clever and would work really well for us.

    Unfortunately, for actions, the "Time period in" isn't referencing the current time, it references the event time. The result is that if I had a host configured with a macro that said don't alert after business hours, and the problem happened after business hours, then it would just never alert about that problem. For this to work, we would need the Zabbix developers to add a separate condition for "Current Time in" (and probably rename "Time period in" to be "Event Time in", so it's more clear).
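
    The difference between checking the event time and checking the current time can be sketched in a few lines of Python. This is not Zabbix code: `in_time_period` is a hypothetical helper that parses the same period syntax as the macro above, and the "1-5,09:00-17:00" override and the timestamps are made-up examples.

```python
from datetime import datetime


def in_time_period(period: str, when: datetime) -> bool:
    """Check a datetime against a Zabbix-style period such as '1-5,09:00-17:00'.

    Days run from 1 (Monday) to 7 (Sunday); several periods may be joined by ';'.
    """
    minutes = when.hour * 60 + when.minute
    for part in period.split(";"):
        days, hours = part.split(",")
        first_day, _, last_day = days.partition("-")
        last_day = last_day or first_day  # a single day such as '3'
        start, end = hours.split("-")
        start_h, start_m = map(int, start.split(":"))
        end_h, end_m = map(int, end.split(":"))
        if (int(first_day) <= when.isoweekday() <= int(last_day)
                and start_h * 60 + start_m <= minutes < end_h * 60 + end_m):
            return True
    return False


period = "1-5,09:00-17:00"                 # hypothetical host-level override
event_time = datetime(2022, 4, 6, 2, 0)    # problem starts Wednesday 02:00
current_time = datetime(2022, 4, 6, 9, 5)  # Wednesday 09:05, business hours

# Behaviour described in this post: the event time is tested, so no match.
matches_on_event_time = in_time_period(period, event_time)
# What a hypothetical "Current time in" condition would test instead.
matches_on_current_time = in_time_period(period, current_time)
```

    With these example values the first check is false and the second is true, which is why a problem that starts at night is never alerted on under the existing condition.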

    What you could try is using some of the time and date functions at the trigger level. That gets messy though because you have to do it for every (or most) trigger for a particular host. It also didn't match our requirement to detect a problem when it happens, but just not alert us about it until we were back in business hours.

    I've often thought that there might be a way to do this using tags, but I've never dug in any further.

    If you come up with something that works well for you, please post about what you did. It could be really beneficial to a lot of sites.

    • c.h.
      Junior Member
      • Dec 2021
      • 29

      #3
      From what you wrote, I had an idea: what about having two triggers, one that fires as soon as the problem occurs but doesn't page anybody, and another that uses the "Time period in..." condition to actually page someone at the appropriate time?

      It's not pretty, but could work!

      Also, I've seen where you can have one trigger "Depend on" another trigger; if that other trigger is active, then this one is temporarily disabled. We could have a trigger that only depends on the time of day/day of week, called "Outside of business hours", and have all the other triggers depend on that trigger!

      I just looked for a function that returns the status of another trigger, but couldn't find one. That way we could have a hierarchy (tree) of triggers, with only the top-most trigger actually paging someone. When it pages you, you can look at it to see which 'child' trigger fired, and then the child of that child, and so on, until you get to the root cause. When creating a trigger, you'd want to make sure that there aren't any loops, or else it'll get stuck in the alarming state.

      We could also have suppressing triggers by saying "not" in front of the trigger, which is what "Dependencies" are right now.

      • tim.mooney
        Senior Member
        • Dec 2012
        • 1427

        #4
        Originally posted by c.h.
        From what you wrote, I had an idea: what about having two triggers, one that fires as soon as the problem occurs but doesn't page anybody, and another that uses the "Time period in..." condition to actually page someone at the appropriate time?

        It's not pretty, but could work!
        Triggers alone don't determine how or when someone is alerted. Triggers are just the threshold logic for detecting "there is a problem" or "there is not a problem". Once a problem is detected (or a problem is no longer present) it's the actions that determine when and whether someone is alerted or some other action is taken.

        As I mentioned in my original response, the time and date functions at the trigger level can be used to cause a trigger to not fire until a certain time or date, and that may work well for your needs. It wasn't what we wanted for our environment, because we wanted immediate detection but delayed notification.

        Originally posted by c.h.
        Also, I've seen where you can have one trigger "Depend on" another trigger; if that other trigger is active, then this one is temporarily disabled. We could have a trigger that only depends on the time of day/day of week, called "Outside of business hours", and have all the other triggers depend on that trigger!
        Yes, trigger dependencies are very useful for avoiding false alerts when the real problem is e.g. a network switch or backbone link. We use trigger dependencies for all of our network topology "hops" between datacenters and other monitored systems. We also use trigger dependencies between high level host monitors (disk space, application is running, etc.) and the low level host monitor (zabbix agent ping, or just ping for some devices). That way, if a host is down, we don't get paged once for every trigger present on the system, we just get paged for the low level "host is unreachable" trigger.

        • c.h.
          Junior Member
          • Dec 2021
          • 29

          #5
          Originally posted by tim.mooney
          Triggers alone don't determine how or when someone is alerted. Triggers are just the threshold logic for detecting "there is a problem" or "there is not a problem". Once a problem is detected (or a problem is no longer present) it's the actions that determine when and whether someone is alerted or some other action is taken.

          As I mentioned in my original response, the time and date functions at the trigger level can be used to cause a trigger to not fire until a certain time or date, and that may work well for your needs. It wasn't what we wanted for our environment, because we wanted immediate detection but delayed notification.
          I don't fully understand. I thought having two triggers watching the same item (one trigger including a time function that will delay it to a certain time) would allow Zabbix to have an immediate action (the plain trigger) and delayed action (the trigger with the time function). By adding different tags to the triggers, and specifying different actions for the tags, the delayed one could send a notification, while the immediate one is silent. If that wouldn't work, I don't know why.

          • tim.mooney
            Senior Member
            • Dec 2012
            • 1427

            #6
            Originally posted by c.h.
            I don't fully understand. I thought having two triggers watching the same item (one trigger including a time function that will delay it to a certain time) would allow Zabbix to have an immediate action (the plain trigger) and delayed action (the trigger with the time function). By adding different tags to the triggers, and specifying different actions for the tags, the delayed one could send a notification, while the immediate one is silent. If that wouldn't work, I don't know why.
            Yes, that may work. As you've noted, it requires adjustments to your Actions, and a way for the action to detect whether the event should generate an alert or not. As you've suggested, tags may work for that.

            My point was just that triggers alone don't determine whether someone is paged or not. It's the triggers and actions together that would customize paging to the level you're talking about.

            • cyber
              Senior Member
              Zabbix Certified Specialist, Zabbix Certified Professional
              • Dec 2006
              • 4807

              #7
              Set it to maintenance for non-work time? Maintenance with data collection. While in maintenance, you still get data and triggers, but IIRC escalations are held back until maintenance expires...


              During a maintenance "with data collection" triggers are processed as usual and events are created when required. However, problem escalations are paused for hosts/triggers in maintenance, if the "Pause operations for suppressed problems" option is checked in action configuration.

              • c.h.
                Junior Member
                • Dec 2021
                • 29

                #8
                I like it! That's exactly the desired behaviour, but for some reason the concept didn't "click" that it described maintenance windows. The fact that it can be applied to triggers and not just hosts is even better!

                • c.h.
                  Junior Member
                  • Dec 2021
                  • 29

                  #9
                  Alas, it appears that maintenance windows are only granular down to the host level, not the trigger level. The configuration only lets you specify hosts and host groups, with optional filtering based on tags, and not triggers/services, even though the documentation says:
                  You can define maintenance periods for host groups, hosts and specific triggers/services in Zabbix.
                  -- https://www.zabbix.com/documentation...%20in%20zabbix.

                  Example:
                  • Add a tag to a trigger (either in a template trigger or a host trigger), for example: 'testtag: foo'
                  • Create a maintenance window that includes all discovered hosts that have a tag named 'testtag' that contains 'foo' (note: tag names are case-sensitive, but their values aren't)
                  You'd expect that only the triggers with that tag would be in maintenance mode or, failing that, only those selected hosts that have a trigger with a matching tag. However, Zabbix 5.0.20 appears to ignore the tag filter and puts all discovered hosts (in this example) into maintenance.

                  Hey, that makes me think of a new feature:
                  • when creating or viewing a maintenance period, have a 'list matching hosts' button on the "Hosts and host groups" tab.
                  • Then you'd be able to see whether you're selecting the right hosts or not.
                  • Or the tab could show how many hosts are selected, if listing them is too hard.
                  Last edited by c.h.; 20-04-2022, 21:57.

                  • DenisBY
                    Member
                    • Jul 2006
                    • 44

                    #10
                    Hi,

                    Did you find a solution? We need exactly the same.

                    • c.h.
                      Junior Member
                      • Dec 2021
                      • 29

                      #11
                      The only solution I could think of was adding date and time functions to the trigger expression: https://www.zabbix.com/documentation...functions/time

                      ..... and (dayofweek()<6 and time()>=090000 and time()<=170000)

                      This example should match Monday-Friday, 9am to 5pm (untested).

                      If an item were created of type Zabbix internal (e.g. zabbix[boottime]), set to a 1 minute interval, and then a trigger with an expression containing just the inverse of the time window, then it could be used as a dependency for other triggers, suppressing them outside the time window. Again, untested.
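
                      Since the expression above is untested, here is a small Python sketch that mirrors its logic so the window can be checked by hand. It assumes that dayofweek() returns 1 (Monday) to 7 (Sunday) and that time() returns the time of day as an HHMMSS number; the function names are hypothetical stand-ins, not Zabbix APIs.

```python
from datetime import datetime


def zbx_dayofweek(now: datetime) -> int:
    """Stand-in for Zabbix dayofweek(): 1 (Monday) to 7 (Sunday)."""
    return now.isoweekday()


def zbx_time(now: datetime) -> int:
    """Stand-in for Zabbix time(): the time of day as an HHMMSS integer."""
    return now.hour * 10000 + now.minute * 100 + now.second


def in_window(now: datetime) -> bool:
    """Mirror of: dayofweek()<6 and time()>=090000 and time()<=170000"""
    return zbx_dayofweek(now) < 6 and 90000 <= zbx_time(now) <= 170000


def outside_business_hours(now: datetime) -> bool:
    """Inverse of the window: the condition for the suppressing trigger."""
    return not in_window(now)
```

                      Under those assumptions the window is inclusive at both ends: 09:00:00 and 17:00:00 on a weekday fall inside it, while 17:00:01 and any time on Saturday or Sunday fall outside.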

                      • guille.rodriguez
                        Senior Member
                        • Jun 2022
                        • 114

                        #12
                        Why not have an option for maintenance that collects data but fires no triggers? I want to get all the info, but if a host is in maintenance, I don't want triggers to fire.

                        • mosconi.trt1
                          mosconi.trt1 commented
                          Maintenance prevents the notifications, but doesn't send all the held ones as soon as the period ends.

                          It would be nice if the action had an option like "start until period".
                      • c.h.
                        Junior Member
                        • Dec 2021
                        • 29

                        #13
                        The problem (for me) was that putting a host in maintenance is an all-or-nothing operation. If root's mailbox is over 1MB in size, that can wait until Monday, but if the disk is full, I want to know now.

                        Setting some triggers to depend on a time-based trigger still records the data; it's just that they won't create alarms until the time-based trigger is no longer in alarm.

                        Dependencies were created so that if a router goes down and suddenly Zabbix can't reach dozens or hundreds of hosts, you only get the alert about the router, rather than alerts for all of the hosts that were cut off. That router alert had better be marked 'High' or 'Disaster', but at least you can see the root cause easily.
