Ad Widget

Collapse

Problems with escalator in 1.6.4? Sending extra emails

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • dennisfeiock
    Junior Member
    • Jun 2006
    • 15

    #1

    Problems with escalator in 1.6.4? Sending extra emails

    I set up the following Action:
    --ACTION--
    Name: Generic Email
    Event Source: Trigger
    Enable Escalations: checked
    Period 3600 seconds
    Subject: blah
    Message: blah
    Recovery Message: checked
    Recovery Subject: blah
    Recovery Message: blah
    Status: Enabled

    --CONDITIONS--
    Trigger Value = Problem

    --ACTION OPERATIONS--
    Step: From 1 to 0
    Period: 0
    Operation Type: Send message
    Send Message to: User group, Some group (7 members)
    Default message: checked
    Conditions: Event acknowledged = Not Ack

    Sometimes when the recovery message gets sent, rather than sending 1 recovery message to each of the people in the user group, something goes wrong and it sends a lot extra. The one case I am looking at sent 195 emails across 7 people (27.8 each).

    In the debugging logs, I see the process_escalations function start. For some reason, it gets mixed up and inserts the extra messages into the actions table. I'm not too great with C, so I'm having a hard time figuring out where the problem lies.

    This is an intermittent issue which only appears to happen for the recovery messages (that I've seen so far). I would guess this occurs roughly 10% of the time or so. After the proccess_escalations completed, it said it took over full second to process everything. Normally the times are around .002 seconds

    A section of the debug log (up to a second message going to person1 for the same event) is attached. I had to rename it to a .patch file to be able to attach it.

    Even after removing the escalation (and making it a normal action), I had one time where this problem presented itself. As a result, I unchecked the escalation part, and removed the condition. It happened pretty close to when I had only disabled the escalation, so I'm not sure if this problem is limited down to being the trigger status = problem part or not.

    Thanks for any help,
    DF
    Attached Files
  • dennisfeiock
    Junior Member
    • Jun 2006
    • 15

    #2
    I did a little further testing and found out a bit more about this problem.

    I created a test item, trigger, and action. I tested the action successfully as a non-escalation action twice. I then turned the escalation on, and when the recovery message was sent, I got three messages. I'm beginning to wonder if the logic for the recovery message looks at the number of "OK" events for the trigger in question, rather than just the latest one.

    This is my test info:
    ITEM
    Name: simple_test
    Key: testing (UserParameter=testing,echo 1)

    TRIGGER
    Name: test_trigger
    Expression: {fputil:testing.last(0)}#1
    Event Generation: Normal

    ACTION:
    Name: Test Email
    Escalations: enabled
    Recovery Message: enabled
    Action Conditions: Trigger = "test_trigger" and Trigger value = "PROBLEM"
    Action Operations: From 1 to 0, period 0, Send message, single user: me

    - DF

    Comment

    • dennisfeiock
      Junior Member
      • Jun 2006
      • 15

      #3
      Well, I managed to find the problem in the logs. I guess I'll have to try to decipher the code to see what is going on.

      As I suspected, the reason Zabbix spams out the recovery message is that its looking at old events for some reason. These are the log entries:
      Code:
       27355:20090730:104638 In process_escalations()
       27355:20090730:104638 Query [select escalationid,actionid,triggerid,eventid,r_eventid,esc_step,status from escalations where status in (0,1) and nextcheck<=1248965198 and escalationid between 000000000000000 and 099999999999999]
       27355:20090730:104638 Query [begin;]
       27355:20090730:104638 Query [select description,status from triggers where triggerid=13260]
       27355:20090730:104638 Query [select i.description from items i,functions f,triggers t where t.triggerid=13260 and f.triggerid=t.triggerid and i.itemid=f.itemid and i.status=1]
       27355:20090730:104638 Query [select h.host from hosts h,items i,functions f,triggers t where t.triggerid=13260 and t.triggerid=f.triggerid and f.itemid=i.itemid and i.hostid=h.hostid and h.status=1]
       27355:20090730:104638 Query [select actionid,eventsource,esc_period,r_shortdata,r_longdata,recovery_msg,status,name from actions where actionid=9]
       27355:20090730:104638 Query [select eventid,source,object,objectid,clock,value,acknowledged from events where eventid=7911]
       27355:20090730:104638 Query [select description,priority,comments,url,type from triggers where triggerid=13260]
       27355:20090730:104638 In substitute_macros(data:"{TRIGGER.SEVERITY}-{STATUS}: {TRIGGER.NAME}")
       27355:20090730:104638 In substitute_simple_macros (data:"{TRIGGER.SEVERITY}-{STATUS}: {TRIGGER.NAME}")
       27355:20090730:104638 In substitute_simple_macros (data:"No reply from Zabbix agent on {HOSTNAME}")
       27355:20090730:104638 Query [select expression from triggers where triggerid=13260]
       27355:20090730:104638 Query [select distinct h.host from triggers t,functions f,items i,hosts h where t.triggerid=13260 and t.triggerid=f.triggerid and i.itemid=f.itemid and h.hostid=i.hostid]
       27355:20090730:104638 End substitute_simple_macros (result:No reply from Zabbix agent on fpprod2)
       27355:20090730:104638 End substitute_simple_macros (result:Average-OK: No reply from Zabbix agent on fpprod2)
       27355:20090730:104638 End substitute_macros(result:Average-OK: No reply from Zabbix agent on fpprod2)
      ##here is where it builds the email body and calls substitute_simple_macros
       27355:20090730:104638 Query [select distinct userid from alerts where actionid=9 and eventid=4953 and alerttype=0]
       27355:20090730:104638 In add_message_alert()
      ##After this it inserts the alerts to the database which get processed later
      ##When it comes around again (event 5016 the second time), the logs begin with the "Query [begin;]" line.
      These emails are supposed to be for the recovery of eventid 7911. However, it is selecting userids for event 4953. After processing those and sending mail, it then does the same thing for event 5016. Rinse and repeat for a total of 69 times before it stopped.

      I'm not sure where its dreaming up 4953, 5016, etc as being the eventid, since none of the queries beforehand would return that event in any way.

      - DF
      Last edited by dennisfeiock; 30-07-2009, 18:55.

      Comment

      • dennisfeiock
        Junior Member
        • Jun 2006
        • 15

        #4
        Well, I think I have narrowed the problem down to ecalator.c, line 833. This is the start of the process_escalations function. The following query is ran:
        Code:
        	result = DBselect("select escalationid,actionid,triggerid,eventid,r_eventid,esc_step,status"
        			" from escalations where status in (%d,%d) and nextcheck<=%d" DB_NODE,
        			ESCALATION_STATUS_ACTIVE,
        			ESCALATION_STATUS_RECOVERY,
        			now,
        			DBnode_local("escalationid"));
        Since the begin; query is at the start of each repeated message, I believe this query is returning multiple results for some reason. I am not sure how the escalations table is supposed to look, but on mine I have:
        765 rows
        all of which have a status of 2
        some of them have esc_step of 1 or 2, most are at step 0
        some have r_eventid's specified, most are at 0
        42 distinct values for triggerid
        765 distinct values for eventid
        3 distinct values for actionid

        Earlier in the logs, I see the following entry:
        Code:
         27333:20090730:104636 Query [update escalations set r_eventid=7911,status=1,nextcheck=0 where actionid=9 and triggerid=13260]
        Since I appear to have these extra escalation entries lying around, when this runs it would update a lot of extra rows to have a status of 1. Then when the query for process_escalations runs, it would result in a bunch of spam.

        Can someone that has escalations with recovery messages working properly see whats in their escalations table? Should I just delete all the rows from the escalations table? Will doing so have an adverse affect on how zabbix operates?

        Thanks,
        DF

        Comment

        • dennisfeiock
          Junior Member
          • Jun 2006
          • 15

          #5
          While I have been working on this, I've left my escalation turned off and have just a normal action with a recovery message. When a PROBLEM trigger was thrown, an email was sent. This caused the escalations table to increase to 766 rows. Shortly afterwards, the trigger changed to OK and the recovery message was sent. This caused the escalations table to increase to 767 rows.

          When looking at the escalation table, the last two entries are for the same trigger id, but with two separate events. Seems to me that this is what caused all these extra entries to get in there.

          So, if I turned escalations back on again, and this same trigger went to PROBLEM, then back to OK. The sql to set the recovery event id and change the status to 1 to have a mail sent would flip these two entries to send email as well. Thus, the process_recovery function would generate emails for three recovery events instead of just one.

          Comment

          Working...