Ad Widget

**dennisfeiock** · 28-07-2009, 21:42

I did a little further testing and found out a bit more about this problem.

I created a test item, trigger, and action. I tested the action successfully as a non-escalation action twice. I then turned the escalation on, and when the recovery message was sent, I got three messages. I'm beginning to wonder if the logic for the recovery message looks at the number of "OK" events for the trigger in question, rather than just the latest one.

This is my test info:
ITEM
Name: simple_test
Key: testing (UserParameter=testing,echo 1)

TRIGGER
Name: test_trigger
Expression: {fputil:testing.last(0)}#1
Event Generation: Normal

ACTION:
Name: Test Email
Escalations: enabled
Recovery Message: enabled
Action Conditions: Trigger = "test_trigger" and Trigger value = "PROBLEM"
Action Operations: From 1 to 0, period 0, Send message, single user: me

- DF

**dennisfeiock** · 30-07-2009, 18:43

Well, I managed to find the problem in the logs. I guess I'll have to try to decipher the code to see what is going on.

As I suspected, the reason Zabbix spams out the recovery message is that its looking at old events for some reason. These are the log entries:

Code:

 27355:20090730:104638 In process_escalations()
 27355:20090730:104638 Query [select escalationid,actionid,triggerid,eventid,r_eventid,esc_step,status from escalations where status in (0,1) and nextcheck<=1248965198 and escalationid between 000000000000000 and 099999999999999]
 27355:20090730:104638 Query [begin;]
 27355:20090730:104638 Query [select description,status from triggers where triggerid=13260]
 27355:20090730:104638 Query [select i.description from items i,functions f,triggers t where t.triggerid=13260 and f.triggerid=t.triggerid and i.itemid=f.itemid and i.status=1]
 27355:20090730:104638 Query [select h.host from hosts h,items i,functions f,triggers t where t.triggerid=13260 and t.triggerid=f.triggerid and f.itemid=i.itemid and i.hostid=h.hostid and h.status=1]
 27355:20090730:104638 Query [select actionid,eventsource,esc_period,r_shortdata,r_longdata,recovery_msg,status,name from actions where actionid=9]
 27355:20090730:104638 Query [select eventid,source,object,objectid,clock,value,acknowledged from events where eventid=7911]
 27355:20090730:104638 Query [select description,priority,comments,url,type from triggers where triggerid=13260]
 27355:20090730:104638 In substitute_macros(data:"{TRIGGER.SEVERITY}-{STATUS}: {TRIGGER.NAME}")
 27355:20090730:104638 In substitute_simple_macros (data:"{TRIGGER.SEVERITY}-{STATUS}: {TRIGGER.NAME}")
 27355:20090730:104638 In substitute_simple_macros (data:"No reply from Zabbix agent on {HOSTNAME}")
 27355:20090730:104638 Query [select expression from triggers where triggerid=13260]
 27355:20090730:104638 Query [select distinct h.host from triggers t,functions f,items i,hosts h where t.triggerid=13260 and t.triggerid=f.triggerid and i.itemid=f.itemid and h.hostid=i.hostid]
 27355:20090730:104638 End substitute_simple_macros (result:No reply from Zabbix agent on fpprod2)
 27355:20090730:104638 End substitute_simple_macros (result:Average-OK: No reply from Zabbix agent on fpprod2)
 27355:20090730:104638 End substitute_macros(result:Average-OK: No reply from Zabbix agent on fpprod2)
##here is where it builds the email body and calls substitute_simple_macros
 27355:20090730:104638 Query [select distinct userid from alerts where actionid=9 and eventid=4953 and alerttype=0]
 27355:20090730:104638 In add_message_alert()
##After this it inserts the alerts to the database which get processed later
##When it comes around again (event 5016 the second time), the logs begin with the "Query [begin;]" line.

These emails are supposed to be for the recovery of eventid 7911. However, it is selecting userids for event 4953. After processing those and sending mail, it then does the same thing for event 5016. Rinse and repeat for a total of 69 times before it stopped.

I'm not sure where its dreaming up 4953, 5016, etc as being the eventid, since none of the queries beforehand would return that event in any way.

- DF

**dennisfeiock** · 31-07-2009, 15:35

Well, I think I have narrowed the problem down to ecalator.c, line 833. This is the start of the process_escalations function. The following query is ran:

Code:

	result = DBselect("select escalationid,actionid,triggerid,eventid,r_eventid,esc_step,status"
			" from escalations where status in (%d,%d) and nextcheck<=%d" DB_NODE,
			ESCALATION_STATUS_ACTIVE,
			ESCALATION_STATUS_RECOVERY,
			now,
			DBnode_local("escalationid"));

Since the begin; query is at the start of each repeated message, I believe this query is returning multiple results for some reason. I am not sure how the escalations table is supposed to look, but on mine I have:
765 rows
all of which have a status of 2
some of them have esc_step of 1 or 2, most are at step 0
some have r_eventid's specified, most are at 0
42 distinct values for triggerid
765 distinct values for eventid
3 distinct values for actionid

Earlier in the logs, I see the following entry:

Code:

 27333:20090730:104636 Query [update escalations set r_eventid=7911,status=1,nextcheck=0 where actionid=9 and triggerid=13260]

Since I appear to have these extra escalation entries lying around, when this runs it would update a lot of extra rows to have a status of 1. Then when the query for process_escalations runs, it would result in a bunch of spam.

Can someone that has escalations with recovery messages working properly see whats in their escalations table? Should I just delete all the rows from the escalations table? Will doing so have an adverse affect on how zabbix operates?

Thanks,
DF

**dennisfeiock** · 31-07-2009, 15:59

While I have been working on this, I've left my escalation turned off and have just a normal action with a recovery message. When a PROBLEM trigger was thrown, an email was sent. This caused the escalations table to increase to 766 rows. Shortly afterwards, the trigger changed to OK and the recovery message was sent. This caused the escalations table to increase to 767 rows.

When looking at the escalation table, the last two entries are for the same trigger id, but with two separate events. Seems to me that this is what caused all these extra entries to get in there.

So, if I turned escalations back on again, and this same trigger went to PROBLEM, then back to OK. The sql to set the recovery event id and change the status to 1 to have a mail sent would flip these two entries to send email as well. Thus, the process_recovery function would generate emails for three recovery events instead of just one.

Ad Widget

Problems with escalator in 1.6.4? Sending extra emails

Problems with escalator in 1.6.4? Sending extra emails

Comment

Comment

Comment

Comment