Hi everyone!
First, I'd like to talk about configuring actions in general, and how confusing it is.
Second, about notifications after a maintenance period expires.
o) Configuring Actions.
In the current Zabbix version, 2.0, we have a 'Trigger value' condition that can be set to OK, PROBLEM, or both. We also have a 'Recovery message' checkbox.
At first glance I (and probably some of you) would say that the 'Recovery message' checkbox is a redundant option, and that it works like a 'Trigger value = OK' condition (or the other way around).
Beware. Usually it does, but there is one difference: when you use escalations, only 'Trigger value = PROBLEM' + 'Recovery msg checkbox' works as expected.
Surprised? I know it's documented, but it drives me crazy wondering why the developers did that. Why make two things that do the same, but not quite the same? %)
So we might think: if only this configuration works as expected, we could abolish the 'Trigger value' condition altogether! Why do we need two things that do the same?
In other words, set the 'Trigger value' condition permanently to 'PROBLEM' and hide it from the user. Then, if the user wants to receive OK notifications, let him tick the 'Recovery msg checkbox'. So there would be no 'Trigger value' condition anymore.
But I can imagine a situation where we want to run a 'Remote command' (instead of 'Send message') when the trigger switches to OK. Or we might want only the OK notification (quite strange), or something else.
Another disadvantage of removing the 'Trigger value' condition: it becomes very difficult to find out in the Actions list (Configuration -> Actions) which actions are configured to send OK notifications.
For example, I have used Zabbix in production since 1.4 (now 1.8) and have ~70 actions (you have more, I know)... And I can still figure out at first glance which ones have an OK condition.
So I tried every variant of configuring an action. Here is the table with the tests (see first page). As you can see, there are so many possible configurations! It's quite confusing.
Without escalation every variant works fine, but with escalation, in most cases the OK message is sent only after the step duration. Only 'Trigger value = PROBLEM' + 'Recovery msg checkbox' works properly.
o) Notifications after or during maintenance period.
How do notifications work in Zabbix 2.0 when we use a maintenance period? We just figured out that only 'Trigger value = PROBLEM' + 'Recovery msg checkbox' works well.
I tried this variant and the "'Trigger value = PROBLEM' and 'Trigger value = OK'" variant. Here is another table (see previous link, second page).
The main point is that neither of these configurations works well.
Let's consider a situation where you have one trigger on some host. You have configured an action that notifies you when this trigger switches to OK and to PROBLEM.
Then you have configured a 'maintenance period' with data collection for this host.
Initially the trigger is in the OK state.
The maintenance period starts.
Then, during the maintenance period, the trigger switches to the PROBLEM state.
It won't send a notification, right? Right, and that's good behavior.
But then the maintenance period ends and our trigger is still in the PROBLEM state, i.e. trigger_state_before_mp != trigger_state_after_mp.
And in this situation Zabbix won't notify you. Is that good behavior?
I think not. I think it's a very annoying bug/feature. I think it should skip the notification only if trigger_state_before_mp == trigger_state_after_mp, no matter whether the trigger switched during maintenance or not.
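To make the rule I'm asking for concrete, here is a tiny Python sketch. It is only an illustration of the proposed rule, not Zabbix code and not how Zabbix currently behaves: the function name should_notify_after_maintenance and the OK/PROBLEM string constants are made up for this example; only trigger_state_before_mp and trigger_state_after_mp come from the description above.

```python
# Hypothetical state labels, used only for this illustration.
OK = "OK"
PROBLEM = "PROBLEM"


def should_notify_after_maintenance(trigger_state_before_mp, trigger_state_after_mp):
    """Proposed rule (an assumption, not current Zabbix behaviour).

    Notify when the maintenance period ends only if the trigger state
    differs from what it was before maintenance started. What happened
    in between (any switching during maintenance) is ignored.
    """
    return trigger_state_before_mp != trigger_state_after_mp


# The scenario above: OK before maintenance, still PROBLEM afterwards.
print(should_notify_after_maintenance(OK, PROBLEM))
# The trigger flapped during maintenance but ended where it started.
print(should_notify_after_maintenance(OK, OK))
```

So a trigger that goes OK -> PROBLEM -> OK inside the maintenance window stays silent, while one that is left in PROBLEM when maintenance ends produces a notification.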
You might argue that this behaviour is actually OK, because Zabbix does send notifications when there is a trigger status change.
Not really. See generate_events(), which is called from update_maintenance_hosts().
In the 1.8 branch, if you use 'Trigger value = PROBLEM' and 'Trigger value = OK' instead of the 'Recovery message' checkbox, it works as I just described (but you have to forget about escalations).
Here is a ticket to back up my statement: ZBXNEXT-894.
I posted a small patch in this ticket. The patch makes this behavior much more useful.
It would be much more useful if Zabbix sent you such notifications in the described case. Do you agree? <--- This is the main message!
This is the main reason why I still use the 1.8 branch in production.
Because you can't switch off escalations in 2.0. They are hidden from the user and always switched on.
I've made a small patch for the 2.0 branch. Find it in the suggestions below!
A real-life example: I work in an IT department (~30 admins) with a variety of hardware and servers. I'm the only one who supports our monitoring system.
One of these admins created a maintenance period (MP) for host1. When the MP started, the admin updated software on host1 and then rebooted it. When host1 came back up, one of the services that has a trigger didn't come up.
Trigger Service1 switched to PROBLEM: no notification. Then maintenance ended, and again no notification. The admin went home without knowing that something was wrong.
o) Finally. As a result:
- Two things that do the same, but not quite the same ('Trigger value = OK' and 'Recovery msg checkbox'). A bit counter-intuitive.
- Poor behavior of notifications with maintenance.
My suggestions:
1. Make the 'Recovery msg checkbox' something like an alias for the 'Trigger value = OK' condition, and keep the checkbox just for overriding the recovery message. Or at least make it visible in the action list, but the first option is much preferred!
2. Stop adding 'Trigger value = PROBLEM' by default. Leave just 'Maintenance status not in "maintenance"'. If we don't set any 'Trigger value' condition, then any trigger value will match the event (i.e. both OK and PROBLEM). That's the general logic of how any condition type works.
3. Make 'Trigger value = OK' work with escalations in a smart way, i.e. make Zabbix take the operation type into account.
If 'operation type'=='send message', then apply the escalation step duration only when the trigger goes to PROBLEM, and ignore the step duration on OK. If 'operation type'=='remote command', then always apply the step duration.
Or just introduce an 'Ignore step duration on OK' checkbox in the action configuration, which would be set by default.
4. Implement the suggested behaviour of notifications with maintenance. See the patch suggested in the ticket (if it works with the above fixes). I mean: notify only if trigger_state_before_mp != trigger_state_after_mp.
Here is a patch for the 2.0 branch. Check it out.
http://pastebin.com/Ux61kZvb
5. Revert trunk svn commit r32342 to bring back the good and correct comment for the generate_events() function.
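To illustrate the first variant of suggestion 3, here is a rough Python sketch of the decision I have in mind. It is not Zabbix code: the function name applies_step_duration and the plain strings for operation type and trigger value are hypothetical and only mirror the wording of the suggestion.

```python
def applies_step_duration(operation_type, trigger_value):
    """Sketch of suggestion 3 (an assumption, not existing Zabbix logic).

    Returns True if the escalation step duration should be honoured
    before running the operation, False if it should run immediately.
    """
    if operation_type == "remote command":
        # Remote commands always wait for the step duration.
        return True
    if operation_type == "send message":
        # Messages wait only on PROBLEM; OK messages go out at once.
        return trigger_value == "PROBLEM"
    raise ValueError("unknown operation type: %r" % (operation_type,))


for op in ("send message", "remote command"):
    for value in ("PROBLEM", "OK"):
        print(op, value, applies_step_duration(op, value))
```

The 'Ignore step duration on OK' checkbox variant would amount to the same check with the "send message" branch switched on or off by the user.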
Feel free to discuss!
P.S. Sorry for my English! It was so hard to explain in a foreign language!
UPD. Suggestions have been reworked (3.0).
UPD 2. A patch for 2.0 has been added!