Escalations Explained (RFC)

  • geno
    Junior Member
    • Nov 2008
    • 29

    #1

    Escalations Explained (RFC)

    Hi Zabbix experts,

    I've taken some time and experimented with the escalations part of an action. Below I will explain how I found it to work and I hope to induce some feedback, comments and discussion around this.
    ----
    Let’s start at Configuration of Actions.
    Example 1:



    For this example I’m setting up an action that will alert me when the SSH server is down on the Zabbix server.

    Notice here I’ve enabled escalations and I’ve set the escalation period to 60 seconds (at 1). I’ve also enabled recovery message (at 2, I will go into more detail about this later on).

    I’ve defined the default message for when in problem state, and also for when recovered.

    On the right-hand side I’ve clicked ‘New’, which brought up the ‘Edit operation’ box. In here we have steps from, to and period (at A, B, C). The From and To steps indicate the instants at which the message should be sent:

    Step 1 – the instant at which Zabbix encounters the problem. Example: 14:00
    Step 2 – step 1 + Period (seconds) [see 1]. Example: 14:01
    Step 3 – step 2 + Period (seconds) [see 1]. Example: 14:02
    Etc.

    If you define From Step 1 to Step 3: it will alert three times, 60 seconds apart (because that is the period set at 1), ie at 14:00, at 14:01 and at 14:02.
    If you define From Step 1 to Step 1: it will alert once, at the instant the problem was detected.
    If you define From Step 3 to Step 3: it will alert once, at 14:02.
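
The step arithmetic above can be sketched in a few lines of Python. This is only a model of the behaviour I observed, not Zabbix code; the function name `notification_times` and its parameters are made up for illustration, and it assumes every step uses the same period.

```python
from datetime import datetime, timedelta


def notification_times(problem_time, period_sec, from_step, to_step):
    """Return the instants at which an operation covering from_step..to_step fires.

    Step 1 is the instant the problem is detected; each later step is one
    period after the previous one (hypothetical model of the behaviour above).
    """
    return [
        problem_time + timedelta(seconds=period_sec * (step - 1))
        for step in range(from_step, to_step + 1)
    ]


start = datetime(2009, 1, 6, 14, 0)  # arbitrary example date, problem at 14:00
for first, last in [(1, 3), (1, 1), (3, 3)]:
    times = notification_times(start, 60, first, last)
    print(f"steps {first}-{last}:", [t.strftime("%H:%M") for t in times])
```

With a 60-second period this reproduces the three cases listed above: steps 1 to 3, step 1 only, and step 3 only.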

    Example 2:



    In this example, when the “SSH server is down on Zabbix Server” trigger is set off (for example at 14:00) the following will occur:

    14:00 – Step 1 to 1 – at the instant of the problem and only then: ‘monitor’ receives an email with the default message

    14:01 – Step 2 to 2 – Period is 60 sec, thus 60 sec after the instant of the problem and only then: user ‘guest’ receives the msg

    14:02 – Step 3 – Period is 60 sec, thus 60 sec after step 2 – group “AppDev 1st Standby” receives the msg
    14:03 – Step 4 – Period is 60 sec, thus 60 sec after step 3 – group “AppDev 1st Standby” receives the msg
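
As a rough sketch, the timeline of Example 2 can be derived from a list of operations. The tuple layout and the name `escalation_schedule` are illustrative only, and I am assuming the group operation is a single operation defined from step 3 to step 4.

```python
from datetime import datetime, timedelta

# (from_step, to_step, recipient) as in Example 2; the layout is illustrative only.
OPERATIONS = [
    (1, 1, "monitor"),
    (2, 2, "guest"),
    (3, 4, "group AppDev 1st Standby"),  # assumed to be one operation, steps 3-4
]


def escalation_schedule(problem_time, period_sec, operations):
    """Return (time, step, recipient) tuples, sorted by step."""
    schedule = []
    for from_step, to_step, recipient in operations:
        for step in range(from_step, to_step + 1):
            when = problem_time + timedelta(seconds=period_sec * (step - 1))
            schedule.append((when, step, recipient))
    return sorted(schedule, key=lambda entry: entry[1])


for when, step, recipient in escalation_schedule(datetime(2009, 1, 6, 14, 0), 60, OPERATIONS):
    print(f"{when:%H:%M} step {step}: {recipient}")
```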

    PERIOD



    The value that you define in the Period box (see A in the img) replaces the value defined as your default period (see B in the img).
    In other words, the value defined on the left (at B) is your default delay between escalations, and the value defined at A is the new value that overrides the default.

    Example 3:



    In this example, when the “SSH server is down on Zabbix Server” trigger is set off (for example at 14:00) the following will occur:

    14:00 – Step 1 to 1 – at the instant of the problem and only then: ‘monitor’ receives an email with the default message. At this step we also replace the default delay of 60sec with 120sec.

    14:02 – Step 2 to 2 – Because the delay was changed to 120sec, step 2 only occurs 120sec after the instant of the problem; it sends a msg to ‘monitor’ and replaces the delay with 180sec.

    14:05 – Step 3 – Period (delay) is 180 sec, thus 180 sec after step 2 – send msg to ‘monitor’ and change delay to 240sec
    14:09 – Step 4 – Period (delay) is 240 sec, thus 240 sec after step 3 – send msg to ‘monitor’ and change delay back to default (which is 60sec)

    That means that if you were to define a step 5, it would occur 60sec after step 4, because the period/delay was set back to the default at step 4.
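
Here is a minimal sketch of how the per-step period changes the timeline of Example 3. It models what I observed rather than how Zabbix implements it; `step_times` and the dictionary layout are hypothetical, and a period of 0 is taken to mean "use the default".

```python
from datetime import datetime, timedelta

DEFAULT_PERIOD = 60  # seconds, the action's default escalation period

# Period entered on each step's operation; 0 means "use the default".
STEP_PERIODS = {1: 120, 2: 180, 3: 240, 4: 0}


def step_times(problem_time, step_periods, default_period, last_step):
    """Return {step: time}, where each step's period is the delay until the next step."""
    times = {}
    current = problem_time
    for step in range(1, last_step + 1):
        times[step] = current
        delay = step_periods.get(step, 0) or default_period
        current += timedelta(seconds=delay)
    return times


for step, when in step_times(datetime(2009, 1, 6, 14, 0), STEP_PERIODS, DEFAULT_PERIOD, 5).items():
    print(f"step {step}: {when:%H:%M}")
```

The key point is that the period set on a step controls the wait until the next step, not the wait before the step itself.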

    RECOVERY MESSAGES

    I’m not sure if I understand these correctly, or if I understand the purpose of these correctly, but this is the behaviour I’ve seen.


    Example 4:

    Let's take example 3 above and look at how we receive recovery messages for that configuration.
    Let's assume that the SSH server came back up on the Zabbix server and that the Zabbix trigger for it went to OK at 15:00.

    15:00 – Step 1 – The instant the problem has recovered is step 1 – send recovery msg to ‘monitor’
    15:02 – Step 2 – The delay was 120sec, so 120 sec AFTER THE RECOVERY – send recovery msg to ‘monitor’
    15:05 – Step 3 – The delay was 180sec, so 180 sec after step 2 (5 min AFTER RECOVERY) – send recovery msg to ‘monitor’
    15:09 – Step 4 – The delay was 240sec, so 240 sec after step 3 (9 MIN AFTER THE RECOVERY!) – send recovery msg to ‘monitor’

    Recovery messages work in exactly the same way that problem messages do. Thus in a real-world scenario you might send a message at step 1 (the instant) to your 1st standby. You might decide to escalate to the manager if the problem isn’t fixed within an hour. BUT this means that your manager will only get the recovery message 1 hour after the problem was fixed (and the standby person will seem to have taken much longer than he really did).
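
To make the manager scenario concrete, here is a small sketch of the behaviour as I observed it, where a recovery message is delayed by the same offset as the problem message for that step. The function name is made up, and a uniform 3600-second period is assumed.

```python
from datetime import datetime, timedelta


def recovery_message_time(recovery_time, period_sec, step):
    """Observed behaviour: the recovery message for `step` is delayed like the problem message."""
    return recovery_time + timedelta(seconds=period_sec * (step - 1))


recovered = datetime(2009, 1, 6, 15, 0)  # arbitrary example: trigger goes to OK at 15:00
standby = recovery_message_time(recovered, 3600, 1)  # 1st standby is on step 1
manager = recovery_message_time(recovered, 3600, 2)  # manager is on step 2
print(f"1st standby told at {standby:%H:%M}, manager told at {manager:%H:%M}")
```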

    It would make more sense if the recovery message is sent to (a) all and only the recipients who received the problem message (b) at the instant the problem recovered.

    --
    Can the experts please tell me if this is how you also experience escalations to work?

    --
    There is another way to do this, it isn’t perfect either, which I will cover in a reply to this post.
  • Aly
    ZABBIX developer
    • May 2007
    • 1126

    #2
    I think you need to add a condition like: Trigger value = "PROBLEM"
    Also, which version are you using for your tests?
    Zabbix | ex GUI developer


    • geno
      Junior Member
      • Nov 2008
      • 29

      #3
      Another img

      Here's an image that explains the timeline of events as described in the "Recovery Messages" section of my previous post...



      Ideally, all the recovery messages (it could be just one) would be sent at the instant the fix was detected, ie 16:00


      • geno
        Junior Member
        • Nov 2008
        • 29

        #4
        Using "Trigger Value = OK"

        In the 1st post at the end I mentioned another way of doing the escalations; Aly pointed out to use “Trigger value = PROBLEM”. I will try to explain that here and point out the problems with that configuration.

        With this setup, you configure the same actions as before except that you DO NOT tick ‘enable escalations’ and you DO NOT tick ‘recovery message’. You will then end up with something that looks like this:

        1.0 - Instant Configuration


        Img 1.0

        In Img 1.0 - This action will trigger at the instant the problem is detected, ie “Trigger Value = PROBLEM”. It has a few other conditions to define what to trigger on, and it will send a message to 1st standby email and cell and it will email 2nd standby.

        2.0 – Escalation 30 min Configuration


        Img 2.0

        This is set up with ‘enable escalations’ ticked, and the period set at 1800sec (30 min). The step to notify on is 2, which means not at the instant (step 1) but at step 2, which is 30 min from the instant of the problem; that is also why trigger value is set to “PROBLEM”.

        3.0– Escalation 60 min Configuration


        Img 3.0

        This is set up with ‘enable escalations’ ticked, and the period set at 3600sec (60 min). The step to notify on is 2, which means not at the instant (step 1) but at step 2, which is 60 min from the instant of the problem; that is also why trigger value is set to “PROBLEM”.

        4.0 – Recovery messages


        Img 4.0

        At the instant it is detected that the problem was fixed (therefore ‘enable escalations’ is not ticked) and “Trigger Value = OK”, send a message to all relevant people.

        --
        This works fine. Except for the following:

        Let’s say the problem was detected and the notification was sent at the instant (let’s say 15:00). 1st standby then fixes the problem within 5 minutes, ie the 30 min and 60 min escalation notifications don’t occur. Then Zabbix detects that the problem was fixed but, because of this setup, sends recovery messages to all three groups.

        That is because you cannot determine who received messages and who didn’t.
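
The drawback can be shown with a small model of the four separate actions. This is a simplification for illustration: the names and data layout are hypothetical, and each problem action is reduced to a single recipient.

```python
# Hypothetical model of the four actions; the minute offsets mirror the post.
PROBLEM_ACTIONS = [
    (0, "1st standby"),   # instant action
    (30, "2nd standby"),  # escalation action, step 2 with an 1800 s period
    (60, "manager"),      # escalation action, step 2 with a 3600 s period
]
# The recovery action has no way to know who was notified, so it lists everyone.
RECOVERY_RECIPIENTS = ["1st standby", "2nd standby", "manager"]


def notified_of_problem(minutes_until_fixed):
    """Who received a problem message, for a problem that lasted more than 0 minutes."""
    return [who for offset, who in PROBLEM_ACTIONS if offset < minutes_until_fixed]


def unneeded_recovery_messages(minutes_until_fixed):
    """Recipients of the recovery message who never heard about the problem."""
    notified = notified_of_problem(minutes_until_fixed)
    return [who for who in RECOVERY_RECIPIENTS if who not in notified]


print(unneeded_recovery_messages(5))  # problem fixed after 5 minutes
```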

        A POSSIBLE SOLUTION:
        If you knew how long the problem existed (in minutes) you could change the recovery action; for example:
        If [trigger value = OK] and [event.age < 30] -> send recovery msg to 1st standby only
        If [trigger value = OK] and [30 <= event.age < 60] -> send recovery msg to 1st and 2nd standby only
        If [trigger value = OK] and [event.age >= 60] -> send recovery msg to everyone, including the 1 HR standby (manager)

        You can do this because you know that if the problem existed for less than 30 minutes, ONLY 1st standby got the error notification and only he needs to receive that recovery notification (you don’t want to waste time of the 2nd standby or get the manager worried for something that was fixed quickly).

        The same goes if the problem lasted less than 60 minutes: only 1st and 2nd standby…
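
The age-based idea can be sketched as follows. Note that `event_age_min` only stands in for the hypothetical [event.age] condition above; I am not claiming such a condition exists in Zabbix. The sketch also assumes the recovery should go to exactly those who received a problem message, and that a problem lasting exactly 30 or 60 minutes counts as escalated.

```python
def recovery_recipients(event_age_min):
    """Pick recovery recipients from how long the problem lasted, in minutes."""
    recipients = ["1st standby"]          # always notified at the instant of the problem
    if event_age_min >= 30:
        recipients.append("2nd standby")  # the 30-minute escalation had fired
    if event_age_min >= 60:
        recipients.append("manager")      # the 60-minute escalation had fired
    return recipients


for age in (5, 45, 90):
    print(age, "min ->", recovery_recipients(age))
```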

        --
        Am I missing the plot completely? Am I making this too complicated?

        Thanks


        • geno
          Junior Member
          • Nov 2008
          • 29

          #5
          Eureka!!

          Eureka!!

          Aly, you have said what many have said before. And maybe I wasn't listening properly...

          I’m not sure whether I’ve just been blind or stupid or if my intuition lacks some deeper insight or what (?) but I’ve found the solution. Looking at it, it seems almost simple, except for the “Trigger value = PROBLEM” part, which just doesn’t seem to fit completely. It also makes me ask: what’s the purpose of allowing the user to tick “Recovery message” if he doesn’t add “Trigger value = PROBLEM”? I ask this because, if you remove “Trigger value = PROBLEM” from the scenario below, you can still tick “Recovery message” but it doesn’t work, ie, it doesn't send a recovery message…??? (I'll check again)

          Solution


          1) Enable Escalations
          2) Set the period between escalations (300sec in this example)
          3) Set default subject and message
          4) Enable ‘Recovery Message’
          5) Set recovery subject and message. I like to put ‘RECOVERY’ in the subject then it’s nice and clear.
          6) Set the conditions and then add “Trigger value = PROBLEM”. [YOU MUST ADD THIS ONE]
          7) Set the operations for each step. In this case a different person will be emailed every 300sec; once jthut has been mailed, nothing more will happen. [NOTE: You can modify the default delay between escalations, see the first post by me in this thread.]

          What this will cause I will explain with examples:

          Example 1

          15:00 – Zabbix trigger picks up problem condition on host ‘farfaraway’. It will immediately email ‘lskywalker’

          For this example, let’s say Mr ‘lskywalker’ fixes the problem within a minute (our hero!)
          Assume this item is checked every 60 seconds, then at around:

          +- 15:03 – Zabbix trigger picks up the problem was fixed on host ‘farfaraway’. It will immediately email ‘lskywalker’. (IMPORTANT: It has only notified ‘lskywalker’ of the problem, thus will only notify ‘lskywalker’ that the problem was fixed.)

          Example 2
          (adaptation of Example 1)

          15:00 – Zabbix trigger picks up problem condition on host ‘farfaraway’. It will immediately email ‘lskywalker’

          For this example, let’s say Mr ‘lskywalker’ was playing with his light-saber and doesn’t notice the email. So:

          15:05 – Zabbix escalates (300sec later) and sends for step 2 a message to Mr ‘yoda’

          Mr ‘yoda’ is of course working hard and fixes the problem immediately (fixes the problem he can!) so at around:

          +- 15:07 - Zabbix trigger picks up the problem was fixed on host ‘farfaraway’. It will immediately email ‘lskywalker’ and it will immediately email ‘yoda’. (IMPORTANT: They were the only ones notified of the problem and should be, and are, the only ones notified of the recovery.)
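
The two examples boil down to one rule: the recovery message goes to everyone whose step had already fired. Here is a rough sketch of that rule; it is not Zabbix code, the names are made up, and the third step's recipient is a placeholder because the examples only pin down steps 1 and 2.

```python
from datetime import datetime, timedelta

# One recipient per step, escalation period 300 s. Only 'lskywalker' (step 1) and
# 'yoda' (step 2) come from the examples; the third entry is a placeholder.
STEPS = ["lskywalker", "yoda", "step-3 recipient (placeholder)"]
PERIOD = 300


def recovery_recipients(problem_time, recovery_time, steps=STEPS, period=PERIOD):
    """Return the users whose step had already fired when the trigger went back to OK."""
    notified = []
    for index, user in enumerate(steps):
        fired_at = problem_time + timedelta(seconds=period * index)
        if fired_at <= recovery_time:
            notified.append(user)
    return notified


problem = datetime(2009, 1, 6, 15, 0)  # arbitrary example date, problem at 15:00
print(recovery_recipients(problem, datetime(2009, 1, 6, 15, 3)))  # Example 1
print(recovery_recipients(problem, datetime(2009, 1, 6, 15, 7)))  # Example 2
```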

          --
          The above example in a timeline


          --

          You get the idea. This is important, because as you’ve seen in my previous posts, you don’t want to send out unnecessary notifications (I was having problems with recovery messages being sent to people that didn’t even need to know the problem occurred, such as the 3rd escalation person, who might just be your manager/boss). Especially if it’s a SMS/Cellphone Text message. ESPECIALLY if you sometimes have problems at 3am in the morning. ESPECIALLY if the morning is xmas morning and ‘dvader’ (your boss) is sleeping…

          --

          Any comments would be appreciated
          Last edited by geno; 06-01-2009, 18:00. Reason: Pictures explains better...


          • geno
            Junior Member
            • Nov 2008
            • 29

            #6
            Double checked

            I double checked

            Okay, so for some reason having Trigger Value = PROBLEM is what makes the difference between these two scenarios:


            With 'Trigger Value = PROBLEM'

            --


            WITHOUT 'Trigger Value = PROBLEM'

            --
            I think that my confusion is justified; it doesn't really make sense. These options don't come intuitively, so by playing around with the Actions configuration you won't necessarily just 'get it'.

            At least I've got it figured out now, and hopefully this post will clear it up for other people too.


            • Aly
              ZABBIX developer
              • May 2007
              • 1126

              #7
              I'm glad you have figured it out. Let the force be with U
              Zabbix | ex GUI developer


              • hml
                Junior Member
                • Apr 2008
                • 5

                #8
                Zabbix 1.6.2 Period value in steps

                Hi All,

                I have just installed 1.6.2 in a test environment. I have managed to find out how the escalations are working and I have to say it is a big improvement from 1.4.6 actions.

                However I have a question about the period value in the steps screen. I have tried putting different values in each step but cannot see any impact to the time line of escalations/notifications for the problem or recovery.

                Is the period value in a step configuration supposed to change the timeline of escalations?

                Regards,

                hml


                • geno
                  Junior Member
                  • Nov 2008
                  • 29

                  #9
                  I have not upgraded to 1.6.2, so I cannot give you a proper answer for that version. However, if you look at example 3 in the first post of this thread, you will see what the period value does... I'm not sure how to explain it any more clearly than that.


                  • hml
                    Junior Member
                    • Apr 2008
                    • 5

                    #10
                    Escalations in 1.6.2

                    Hi Geno,

                    I have run some more tests in which I set a default period that differs enough from the periods in the steps, and I also used SMS for more accurate timing.

                    This confirmed that escalations work as expected. My misunderstanding was about what period in the step configuration was.

                    It would have been a bit easier to understand if, for example, instead of "Period [0-Default]" the label said "Run next step in: ....[0=Default]", and if the period on the action screen were called "Default Period".

                    The other factor that contributes to the confusion is that the values in the Delay column are calculated using only the default value and are not adjusted when it is overridden in the step configuration.
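
To illustrate the mismatch, here is a sketch comparing a delay computed from the default period only with the delay that results when per-step periods are honoured. The first function is just my assumed model of what the Delay column shows; the numbers are the ones from Example 3 in the first post.

```python
DEFAULT_PERIOD = 60
STEP_PERIODS = [120, 180, 240, 0]  # per-step overrides from Example 3; 0 = default


def displayed_delays(step_count, default_period):
    """Delay per step if only the default period is used (assumed model of the Delay column)."""
    return [default_period * i for i in range(step_count)]


def actual_delays(step_periods, default_period):
    """Delay per step when each step's own period (or the default for 0) is honoured."""
    delays, elapsed = [], 0
    for period in step_periods:
        delays.append(elapsed)
        elapsed += period or default_period
    return delays


print("shown :", displayed_delays(len(STEP_PERIODS), DEFAULT_PERIOD))
print("actual:", actual_delays(STEP_PERIODS, DEFAULT_PERIOD))
```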

                    To summarise, escalations work OK.

                    hml


                    • geno
                      Junior Member
                      • Nov 2008
                      • 29

                      #11
                      You are correct. I was just reporting on how I found the escalations to work, not what is most understandable

                      You could suggest your improved interface to the developers


                      • consultorpc
                        Junior Member
                        • Apr 2008
                        • 16

                        #12
                        Hello,

                        Can you please help me configure an action that needs to be repeated every 30 minutes? Can this be done with a single operation in an action with escalations? I have already posted a thread: http://www.zabbix.com/forum/showthread.php?t=12045 , which gives more details about the question.

                        Thanks

                        consultorpc


                        • Justin Freeman
                          Junior Member
                          • Jan 2009
                          • 18

                          #13
                          Please add these examples to the Zabbix Manual

                          Please add these examples to the Zabbix Manual. They are excellent and helped me understand how to use actions more effectively.

                          Thanks to everyone in the thread for their efforts


                          • consultorpc
                            Junior Member
                            • Apr 2008
                            • 16

                            #14
                            Justin,

                            Since you now have a better understanding of how to configure this, please tell me how I can make an action that is repeated every 30 minutes.

                            Thanks

                            consultorpc


                            • bernard
                              Member
                              • Oct 2008
                              • 54

                              #15
                              Best post about escalation

                              Hi Geno,

                              This is the best post about escalation !!! It should always be on top or in the manual.

                              Thank you,
                              bernard
