Double event notifications, Hysteresis

  • tokind
    Member
    • May 2007
    • 47

    #1

    Double event notifications, Hysteresis

    I have a simple icmpping availability watch on the routers for remote sites. I set up triggers like:

    Code:
     {POP7Router:icmpping.last(0)}=0
    and an action that filters on the routers group, title like "timeout". I get notices alright, two per event. So I figured I was getting one for when the timeout occurred, and a second one when the ping again succeeded (restored).

    So I figured I would try a set of triggers and try to send a different message for timeout and restored.

    Code:
    Timeout:
     {POP7Router:icmpping.prev(0)}=1&{POP7Router:icmpping.last(0)}=0
    
    Restored:
     {POP7Router:icmpping.prev(0)}=0&{POP7Router:icmpping.last(0)}=1
    Well, now I get four messages, two each for timeout and restored. I looked at the Zabbix 1.4 Manual, section 4.11.4 Hysteresis, and tried to use

    Code:
     {TRIGGER.VALUE}=1&{POP7Router:icmpping.last(0)}=0 
    
     {TRIGGER(14170).VALUE}=1&{POP7Router:icmpping.last(0)}=0 
    
     {14170.VALUE}=1&{POP7Router:icmpping.last(0)}=0 
     
     {Connection Timeout POP7.VALUE}=1&{POP7Router:icmpping.last(0)}=0
    and get "Expression [{TRIGGER.VALUE}] does not match to [server:key.func(param)]"

    This should be simple, but I do not quite get it. I want one page when icmpping times out, and another when it resumes.
  • tokind
    Member
    • May 2007
    • 47

    #2
    Is this not happening to anyone else out there?

    Code:
     Connection Timeout ElPaso
    	{ElpasoRouter:icmpping.prev(0)}=1&{ElpasoRouter:icmpping.last(0)}=0	
            Warning      Enabled
    Gets me events:
    Code:
    2008.Mar.19 02:10:16	Connection Timeout ElPaso	OFF 	Warning
    2008.Mar.19 02:09:11	Connection Timeout ElPaso	ON 	Warning
    Gets me Actions:
    Code:
    	 
    2008.Mar.19 02:10:16 	Email 	sent 	[email protected] 	
      Subject: Connection Timeout ElPaso
      2008.03.19-02:10:16:Connection Timeout ElPaso
     
    2008.Mar.19 02:09:11 	Email 	sent 	[email protected] 	
      Subject: Connection Timeout ElPaso
      2008.03.19-02:09:11:Connection Timeout ElPaso
    Since I also have an action for connection restored, I get four pages (two timeout, two restored) each time the link bounces. The actions appear to be triggered on ANY state change, even though I wrote the trigger to be specific to "1 to 0".

    Is this a bug, or am I missing something obvious? Boss man does not want four pages at 2 AM. Two pages, the latter indicating that the link was restored, will suffice.

    • Hichhiker
      Member
      • Nov 2004
      • 45

      #3
      Originally posted by tokind
      Is this a bug, or am I missing something obvious? Boss man does not want four pages at 2 AM. Two pages, the latter indicating that the link was restored, will suffice.
      You are missing something, although I am not sure how obvious it is until you wrap your head around it. Alerts in Zabbix do NOT exactly signify an event has occurred - rather that trigger STATE has changed.

      Each of your triggers always exists with some sort of state. The state of the trigger is the value of the expression you defined when you created it (let's use "{POP7Router:icmpping.last(0)}=0" as an example). If something has changed and a trigger which was OFF (or FALSE) changes to ON (TRUE), you get an alert. When that SAME trigger is evaluated and the state changes to OFF (FALSE) ({POP7Router:icmpping.last(0)}=0 is now FALSE), you get another alert. So if you want to be notified when the link goes down and when the link goes back up, you only need a single trigger, and the STATE of it in the alert signifies what actually occurred.
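
A minimal Python sketch of the behaviour described above, not Zabbix code: it evaluates a level-style expression (last value equals 0) over a made-up series of icmpping samples and records an event each time the result changes. The function name `trigger_events`, the sample data, and the assumption that the trigger starts OFF are all illustrative.

```python
def trigger_events(samples, expression):
    """Return (sample index, 'ON'/'OFF') for every change in the trigger's state."""
    events = []
    state = False  # assumption: the trigger starts in the OFF (FALSE) state
    for i, value in enumerate(samples):
        new_state = expression(value)
        if new_state != state:
            # An alert is tied to the state change, not to the sample itself.
            events.append((i, "ON" if new_state else "OFF"))
            state = new_state
    return events


# icmpping reports 1 when the ping succeeds and 0 when it fails.
samples = [1, 1, 0, 0, 0, 1, 1]

# Equivalent in spirit to {POP7Router:icmpping.last(0)}=0
events = trigger_events(samples, lambda last: last == 0)
print(events)
```

One outage produces exactly two events from the single trigger: ON when the ping first fails and OFF when it first succeeds again.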

      I really wish the state value map could be specified per trigger, or at least switch the confusing ON/OFF to less confusing TRUE/FALSE.

      -HH

      • tokind
        Member
        • May 2007
        • 47

        #4
        I see your point, and this is what I suspected. I see this as a bug. My triggers are SPECIFIC regarding the state change that I wish to detect. 1 to 0 indicates a failure, 0 to 1 indicates restored. The logic I wrote is disregarded, as each trigger will trip on ANY state change.

        I will see this as a bug until such time as I find a way to deliver a clear message about the nature of the state change in the action. It also seems to me that a timeout or failure would be classified as Warning or Critical, whereas restoration of a service would be Informational. In fact I have a third trigger for Down which reports that the connection is still down, every 10 minutes. The trigger works in an indicator on my map, causing a yellow indicator to turn to red, labeled DOWN, but never results in an action.

        I will be looking at the use of a dependency in the triggers to try to limit them to one page per specific state change. Just off the top of my head, this seems unlikely to work in a logical manner.

        Thank you for confirming my suspicions.

        • Hichhiker
          Member
          • Nov 2004
          • 45

          #5
          Originally posted by tokind
          I see your point, and this is what I suspected. I see this as a bug. My triggers are SPECIFIC regarding the state change that I wish to detect. 1 to 0 indicates a failure, 0 to 1 indicates restored. The logic I wrote is disregarded, as each trigger will trip on ANY state change.
          I think you are missing the fact that TRIGGER and what you call STATE are one and the same, which makes the above statement nonsensical. A trigger is ON or OFF. The trigger tripping IS the state change from false to true, and it "untriggering" is the state change from true to false.
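
To make the four-pages symptom concrete, here is a rough Python simulation (an illustration only, not how Zabbix is implemented). It treats the two edge-style expressions from the earlier posts as ordinary triggers whose truth value is re-evaluated on every sample. The helper `state_changes` and the sample series are assumptions.

```python
def state_changes(samples, expression):
    """Return (sample index, 'ON'/'OFF') whenever expression(prev, last) changes value."""
    changes = []
    state = False  # assumption: both triggers start OFF
    for i in range(1, len(samples)):
        new_state = expression(samples[i - 1], samples[i])
        if new_state != state:
            changes.append((i, "ON" if new_state else "OFF"))
            state = new_state
    return changes


# One link bounce: up, up, down, down, up, up
samples = [1, 1, 0, 0, 1, 1]

# prev=1 & last=0 is true for a single sample, then false again
timeout = state_changes(samples, lambda prev, last: prev == 1 and last == 0)
# prev=0 & last=1 is likewise true for a single sample
restored = state_changes(samples, lambda prev, last: prev == 0 and last == 1)

print("timeout trigger: ", timeout)
print("restored trigger:", restored)
print("total events:    ", len(timeout) + len(restored))
```

Each edge expression holds for only one sample, so each trigger goes ON and then OFF one check later. With an action firing on every state change, that is two messages per trigger and four per bounce, which matches the ON/OFF pair roughly a minute apart in the event log posted earlier.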

          Originally posted by tokind
          I will see this as a bug until such time as I find a way to deliver a clear message about the nature of the state change in the action.
          ON/OFF may be unclear to anyone unfamiliar with the system, but it's a simple enough concept for most to grasp when explained. You just need to write your notification text more clearly.

          Originally posted by tokind
          It also seems to me that a timeout or failure would be classified as Warning or Critical, whereas restoration of a service would be Informational.
          I think that would be wrong for most cases. If your alarm calls you to say that your house is on fire, and you rush home just to find out someone overcooked bacon and tripped a smoke detector for a sec, and it failed to let you know that your house is not really on fire, you'll be pissed. Generally ON and OFF messages should be the same priority.

          Originally posted by tokind
          In fact I have a third trigger for Down which reports that the connection is still down, every 10 minutes. The trigger works in an indicator on my map, causing a yellow indicator to turn to red, labeled DOWN, but never results in an action.
          For one, the trigger should not be configured as DOWN - it should be configured as connection state. You are making this a lot more complicated than it should be. Just set your trigger to "{POP7Router:icmpping.last(0)}=0"
          It will be true if the ping failed and false if it was successful. You can then set up an Action to send you a notification each time the trigger changes.


          Originally posted by tokind
          I will be looking at the use of a dependency in the triggers to try to limit them to one page per specific state change. Just off the top of my head, this seems unlikely to work in a logical manner.

          Thank you for confirming my suspicions.
          I think you are lost. You are trying to create a very complicated workaround to make it do exactly what it does normally without any workarounds. Take a step back and re-read the docs.

          -HH

          • tokind
            Member
            • May 2007
            • 47

            #6
            I am often guilty of thinking too hard. I just want clearer messages than "Connection Timeout ON" and "Connection Timeout OFF" for the two other people on the page list. Such statements are clear to me only because I administer the server and configure the triggers. The recipients of the message are not concerned about the state of the trigger; they want to know the state of the router. So "Connection Timeout" and "Connection Restored" would give them a clearer understanding of the current status of the connection.

            IMHO, I am not lost. I assumed a level of selectivity that is apparently not supported. I'll happily agree not to call this a bug if you agree not to think of me as a fool.

            • Hichhiker
              Member
              • Nov 2004
              • 45

              #7
              Originally posted by tokind
              I am often guilty of thinking too hard. I just want clearer messages than "Connection Timeout ON" and "Connection Timeout OFF" for the two other people on the page list. Such statements are clear to me only because I administer the server and configure the triggers. The recipients of the message are not concerned about the state of the trigger; they want to know the state of the router. So "Connection Timeout" and "Connection Restored" would give them a clearer understanding of the current status of the connection.

              IMHO, I am not lost. I assumed a level of selectivity that is apparently not supported. I'll happily agree not to call this a bug if you agree not to think of me as a fool.
              Heh, I did not by any means mean to imply that you are a fool, just that you appear to be making fundamentally incorrect assumptions about things and that taking a step back might give you a better perspective.

              Based on what you just said, what you probably ought to be looking at is actions instead of triggers. Triggers maintain the state of things in the system - like whether the router is up or down - while actions do things like send emails for notification. In the action you can define the format of the message - for example, instead of the default "Router Down: ON" you can change it to say "'Router Down' event status is now ON" - which actually makes much more sense. I also like to add the time/date to the message, so that when I get paged out of order it makes more sense.
              For example, I often do:
              Subject: {TRIGGER.NAME} is now {STATUS}
              Body:
              Trigger: {TRIGGER.NAME}
              State changed to {STATUS}
              Time: {TIME}
              Date: {DATE}
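
As a rough illustration of what such a template expands to, the Python sketch below substitutes sample values for the macros. Zabbix performs this expansion itself when the action fires; `expand_macros` is a hypothetical helper, and the values are taken from the event log posted earlier in the thread.

```python
def expand_macros(template, values):
    """Replace each {MACRO} in the template with its value (plain text substitution)."""
    for macro, value in values.items():
        template = template.replace("{" + macro + "}", value)
    return template


# Sample values; the date and time formats mirror the earlier email output.
values = {
    "TRIGGER.NAME": "Connection Timeout ElPaso",
    "STATUS": "ON",
    "TIME": "02:09:11",
    "DATE": "2008.03.19",
}

subject = expand_macros("{TRIGGER.NAME} is now {STATUS}", values)
body = expand_macros(
    "Trigger: {TRIGGER.NAME}\n"
    "State changed to {STATUS}\n"
    "Time: {TIME}\n"
    "Date: {DATE}",
    values,
)

print(subject)
print(body)
```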


              Of course, if you want to get fancier, you can create separate actions for the on and off states via action filters, and even create separate actions for different triggers - however, be careful of having multiple actions match an event, as you may end up with multiple notifications. Not to mention that maintaining all those actions may become a much bigger job than explaining how to read the messages to your two other people.

              -HH

              • tokind
                Member
                • May 2007
                • 47

                #8
                Right! Actions may be configured to do what I wanted to do. I took a wrong turn early on and, with your suggestions, got actions configured to deliver the desired messages.

                Single Trigger
                I configured a single trigger for a simplecheck Item

                {Branch7Router:icmpping.last(0)}=0

                I named this trigger "Network Connection Branch 7"

                I then configured TWO Actions. They deliver a clear message and status based on the trigger state!

                To do this, I use the condition of Trigger Value. The "Timeout" warning action looks like this:

                Code:
                Action type = send message 	
                Source 	= Trigger
                Conditions 	
                  Host group = "Routers"
                  Trigger description like "Connection"
                  Trigger value = "ON"
                Send message to = User 	
                Subject = {TRIGGER.NAME} Timeout	
                Message = {TRIGGER.NAME} Timeout at {DATE}-{TIME}
                The "Restored" action looks like this:

                Code:
                Action type = send message 	
                Source 	= Trigger
                Conditions 	
                  Host group = "Routers"
                  Trigger description like "Connection"
                  Trigger value = "OFF"
                Send message to = User 	
                Subject = {TRIGGER.NAME} Restored	
                Message = {TRIGGER.NAME} Restored at {DATE}-{TIME}
                Since there is now only a single trigger, I only get one message per state change of the trigger. The ACTION filters the state change to the desired message: "Timeout" or "Restored".
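
The filtering logic of the two actions can be sketched in Python as follows. This is only a model of the idea, not Zabbix internals: the `Action` class and `messages_for` function are hypothetical, and the host group condition is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Action:
    name_contains: str   # models: Trigger description like "..."
    trigger_value: str   # models: Trigger value = "ON" / "OFF"
    subject: str         # message subject template


ACTIONS = [
    Action("Connection", "ON", "{TRIGGER.NAME} Timeout"),
    Action("Connection", "OFF", "{TRIGGER.NAME} Restored"),
]


def messages_for(trigger_name, trigger_value):
    """Return the subjects of all actions whose conditions match this state change."""
    return [
        action.subject.replace("{TRIGGER.NAME}", trigger_name)
        for action in ACTIONS
        if action.name_contains in trigger_name
        and action.trigger_value == trigger_value
    ]


down = messages_for("Network Connection Branch 7", "ON")
up = messages_for("Network Connection Branch 7", "OFF")
print(down)
print(up)
```

Because the two actions differ only in the trigger value they match, each state change of the single trigger selects exactly one of them.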
