Hi Zabbix experts,
I've taken some time to experiment with the escalations part of an action. Below I explain how I found it to work, and I hope to prompt some feedback, comments and discussion around this.
----
Let’s start at Configuration of Actions.
Example 1:

For this example I’m setting up an action that will alert me when the SSH server is down on the Zabbix server.
Notice here I’ve enabled escalations and set the escalation period to 60 seconds (at 1). I’ve also enabled the recovery message (at 2; I will go into more detail about this later on).
I’ve defined the default message for when in problem state, and also for when recovered.
On the right-hand side I clicked New, which brought up the ‘Edit operation’ box. In here we have the From and To steps and the Period (at A, B, C). The From and To steps indicate the instants at which the message should be sent:
Step 1 – the instant at which Zabbix encounters the problem. Example: 14:00
Step 2 – Period seconds [see 1] after step 1. Example: 14:01
Step 3 – Period seconds [see 1] after step 2. Example: 14:02
Etc.
If you define From Step 1 to Step 3: it will alert three times, once every 60 seconds (because that is the period set at 1), i.e. at 14:00, 14:01 and 14:02.
If you define From Step 1 to Step 1: it will alert once, at the instant the problem was detected.
If you define From Step 3 to Step 3: it will alert once, at 14:02.
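To make the step arithmetic above concrete, here is a minimal Python sketch of the behaviour as I observed it. It is only a model of the timing, not Zabbix code: the function names and the date are made up for illustration, and it assumes every step uses the action's default period.

```python
from datetime import datetime, timedelta

def step_times(problem_start, default_period_s, last_step):
    """Map each step number to the time it fires (assumes the default period for every step)."""
    return {
        step: problem_start + timedelta(seconds=default_period_s * (step - 1))
        for step in range(1, last_step + 1)
    }

def alerts(problem_start, default_period_s, step_from, step_to):
    """Times (HH:MM) at which an operation defined From step_from To step_to sends a message."""
    times = step_times(problem_start, default_period_s, step_to)
    return [times[s].strftime("%H:%M") for s in range(step_from, step_to + 1)]

start = datetime(2024, 1, 1, 14, 0)  # arbitrary date, problem detected at 14:00
print(alerts(start, 60, 1, 3))  # ['14:00', '14:01', '14:02']
print(alerts(start, 60, 1, 1))  # ['14:00']
print(alerts(start, 60, 3, 3))  # ['14:02']
```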
Example 2:

In this example, when the “SSH server is down on Zabbix Server” trigger is set off (for example at 14:00) the following will occur:
14:00 – Step 1 to 1 – at the instant of the problem, and only then: ‘monitor’ receives an email with the default message
14:01 – Step 2 to 2 – Period is 60 sec, thus 60 sec after the instant of the problem, and only then: user ‘guest’ receives the msg
14:02 – Step 3 – Period is 60 sec, thus 60 sec after step 2: group “AppDev 1st Standby” receives the msg
14:03 – Step 4 – Period is 60 sec, thus 60 sec after step 3: group “AppDev 1st Standby” receives the msg again
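The timeline above can be sketched as a small Python model. This is only my reading of the behaviour, not how Zabbix implements it: `OPERATIONS` and `timeline` are hypothetical names, and I am assuming the group operation was defined From step 3 To step 4.

```python
from datetime import datetime, timedelta

# (from_step, to_step, recipient); the 3-to-4 range for the group is an assumption
OPERATIONS = [
    (1, 1, "user 'monitor'"),
    (2, 2, "user 'guest'"),
    (3, 4, "group 'AppDev 1st Standby'"),
]

def timeline(problem_start, default_period_s, operations):
    """List (time, step, recipient) for every message, using the default period for each step."""
    last_step = max(step_to for _, step_to, _ in operations)
    rows = []
    for step in range(1, last_step + 1):
        when = problem_start + timedelta(seconds=default_period_s * (step - 1))
        for step_from, step_to, recipient in operations:
            if step_from <= step <= step_to:
                rows.append((when.strftime("%H:%M"), step, recipient))
    return rows

for when, step, recipient in timeline(datetime(2024, 1, 1, 14, 0), 60, OPERATIONS):
    print(f"{when} step {step} -> {recipient}")
```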
PERIOD

The value that you define in the Period box (see A in the img) replaces the value defined as your default period (see B in the img).
In other words, the value defined on the left (at B) is your default delay between escalations, and the value defined at A is the new value to replace the default.
Example 3:

In this example, when the “SSH server is down on Zabbix Server” trigger is set off (for example at 14:00) the following will occur:
14:00 – Step 1 to 1 – at the instant of the problem, and only then: ‘monitor’ receives an email with the default message. At this step we also replace the default delay of 60 sec with 120 sec.
14:02 – Step 2 to 2 – Because the delay was changed to 120 sec, step 2 only occurs 120 sec after the instant of the problem. It sends a msg to ‘monitor’ and replaces the delay with 180 sec.
14:05 – Step 3 – Period (delay) is 180 sec, thus 180 sec after step 2 – sends a msg to ‘monitor’ and changes the delay to 240 sec
14:09 – Step 4 – Period (delay) is 240 sec, thus 240 sec after step 3 – sends a msg to ‘monitor’ and changes the delay back to the default (which is 60 sec)
That means that if you were to define a step 5, it would occur 60 sec after step 4 (at 14:10), because the period/delay was set back to the default at step 4.
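The Example 3 timings can be reproduced with a short Python sketch. Again, this only models what I observed, under the assumption that the period set on a step is the delay until the next step and that a step with no period set uses the default. The names `step_times` and `period_overrides` are mine, not Zabbix's.

```python
from datetime import datetime, timedelta

def step_times(problem_start, default_period_s, period_overrides, last_step):
    """Map each step to its firing time; the period on step N is the delay before step N+1."""
    times = {}
    when = problem_start
    for step in range(1, last_step + 1):
        times[step] = when
        # Steps without an override fall back to the action's default period.
        delay = period_overrides.get(step, default_period_s)
        when += timedelta(seconds=delay)
    return times

overrides = {1: 120, 2: 180, 3: 240}  # step 4 has no override, so it uses the default
times = step_times(datetime(2024, 1, 1, 14, 0), 60, overrides, last_step=5)
print([t.strftime("%H:%M") for t in times.values()])
# ['14:00', '14:02', '14:05', '14:09', '14:10']
```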
RECOVERY MESSAGES
I’m not sure if I understand these correctly, or if I understand the purpose of these correctly, but this is the behaviour I’ve seen.
Example 4:
Let’s take example 3 above and look at how we receive recovery messages for that configuration.
Let’s assume that the SSH server came back up on the Zabbix server and that the Zabbix trigger for it went to OK at 15:00.
15:00 – Step 1 – The instant the problem has recovered is step 1 – send recovery msg to ‘monitor’
15:02 – Step 2 – The delay was 120sec, so 120 sec AFTER THE RECOVERY – send recovery msg to ‘monitor’
15:05 – Step 3 – The delay was 180sec, so 180 sec after step 2 (5 min AFTER RECOVERY) – send recovery msg to ‘monitor’
15:09 – Step 4 – The delay was 240sec, so 240 sec after step 3 (9 MIN AFTER THE RECOVERY!) – send recovery msg to ‘monitor’
Recovery messages work in exactly the same way that problem messages do. Thus, in a real-world scenario you might send a message at step 1 (the instant of the problem) to your 1st standby, and you might decide to escalate to the manager if the problem isn’t fixed within an hour. BUT this means that your manager will only get the recovery message 1 hour after the problem was fixed (and the standby person will seem to have taken much longer than he really did).
It would make more sense if the recovery message were sent (a) to all and only the recipients who received the problem message, and (b) at the instant the problem recovered.
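As a rough illustration of the behaviour I am proposing (not what Zabbix does today, as far as I can tell), the Python sketch below works out who had already received a problem message when the trigger recovered. Those are the people who should get the recovery message straight away. All names are hypothetical, times are in seconds since the problem started, and I treat a step scheduled at the same instant as the recovery as not sent.

```python
def notified_before(operations, period_overrides, default_period_s, recovered_after_s):
    """Recipients whose problem step fired strictly before the recovery."""
    last_step = max(step_to for _, step_to, _ in operations)
    recipients = []
    offset = 0  # seconds since the problem was detected
    for step in range(1, last_step + 1):
        if offset >= recovered_after_s:
            break  # this step and all later ones never fired
        for step_from, step_to, recipient in operations:
            if step_from <= step <= step_to and recipient not in recipients:
                recipients.append(recipient)
        offset += period_overrides.get(step, default_period_s)
    return recipients

# Step 1 goes to the 1st standby; step 2 (one hour later) goes to the manager.
ops = [(1, 1, "1st standby"), (2, 2, "manager")]
print(notified_before(ops, {1: 3600}, 60, recovered_after_s=1800))  # ['1st standby']
print(notified_before(ops, {1: 3600}, 60, recovered_after_s=5400))  # ['1st standby', 'manager']
```

In the first case the manager never heard about the problem, so under this proposal the manager would not get a recovery message either.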
--
Can the experts please tell me whether this is also how you see escalations working?
--
There is another way to do this (it isn’t perfect either), which I will cover in a reply to this post.