Delaying notifications

What?

This article shows you how to wait a certain amount of time before sending out notifications. (This is similar to what you may know as Nagios “soft states”, where a problem does not raise an alert instantly.)

Why?

Reason 1: ignore transient problems

Some problems occur only once, and you may not want to be informed about them if they are gone by the next check. (Our UPS, for example, has the annoying habit of sometimes claiming that someone stole its battery or that the temperature in the computer room is volcano-like. In such cases of buggy equipment, though, it may be better to use trigger conditions that avoid firing instantly.)

Reason 2: give Zabbix enough time to evaluate trigger dependencies

Dependencies may not be calculated correctly if you alert at the first appearance of a problem. Imagine you have a remote office with a router and a server, and you lose the connection to the router. If you have set proper trigger dependencies (“server→router”), you should get a notification about the router outage only, and not that the server (also) became unavailable. But imagine that Zabbix is currently checking the server and finds that it is unreachable. It will then instantly send out notifications to the admins that the server went away. A few seconds later Zabbix checks the router and sees that it is down, too. It computes the dependencies and understands that the trigger of “server” can be ignored and that “router” is the component causing the problem. But the notification about the server outage has already gone out. If Zabbix had been given more time, it would have sent out only the notification about the “router”.
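
The race described above can be illustrated with a small toy model. This is only a sketch and not how Zabbix works internally: the function name `alerts`, the detection times and the trigger names are assumptions made up for illustration. A trigger is suppressed if the trigger it depends on is already known to be in a problem state at the moment the notification would be sent.

```python
def alerts(detected_at, depends_on, delay):
    """Return the triggers that would cause a notification.

    detected_at: dict mapping trigger name -> second at which Zabbix
                 noticed the problem (hypothetical values).
    depends_on:  dict mapping trigger name -> trigger it depends on.
    delay:       seconds to wait after detection before notifying.
    """
    sent = []
    # Walk through the triggers in the order in which they were detected.
    for trigger, detected in sorted(detected_at.items(), key=lambda kv: kv[1]):
        notify_time = detected + delay
        parent = depends_on.get(trigger)
        # The dependency only helps if the parent problem is already
        # known at the moment the notification is due.
        parent_known_down = (
            parent in detected_at and detected_at[parent] <= notify_time
        )
        if not parent_known_down:
            sent.append(trigger)
    return sent


# The server is checked first; the router check follows 5 seconds later.
detected = {"server": 0, "router": 5}
dependencies = {"server": "router"}

print(alerts(detected, dependencies, delay=0))    # ['server', 'router']
print(alerts(detected, dependencies, delay=120))  # ['router']
```

Without a delay both notifications go out. With a 120 second delay the router outage is known by the time the server notification is due, so only the router is reported.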

How?

The key to delaying notifications is using escalations. You likely have an action that informs somebody in case a trigger fires. Take a look at how I set this up:

Escalations have been described and discussed in the forum. As a quick introduction to how escalations work: they are used if the first user does not react to a notification. After a defined amount of time the next user (e.g. their boss) gets informed. You can even repeat notifications if you fear that the first one got lost.

In the top right panel you define a plan of what to do at what time after a problem arises. Every “period” seconds (120 in this example) the next step is run. So step 1 starts at second 0 (the instant the trigger fires), step 2 runs 120 seconds later, step 3 at second 240, step 4 at second 360, and so on. (The default period is defined on the left!)

So if you define that some action should happen at seconds 0, 120 and 240 then you would define the steps 1-3. That is the meaning of “From” (1) and “To” (3).
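
The step arithmetic can be written down as a short sketch. It assumes that every step uses the same default period (120 seconds in this example); the function names `step_start` and `range_start_times` are made up for illustration and are not part of Zabbix.

```python
def step_start(step, period=120):
    """Seconds after the trigger fires at which the given step runs."""
    if step < 1:
        raise ValueError("escalation steps are numbered from 1")
    # Step 1 runs immediately, each further step one period later.
    return (step - 1) * period


def range_start_times(first, last, period=120):
    """Start times of all steps in the range "From" (first) to "To" (last)."""
    return [step_start(step, period) for step in range(first, last + 1)]


print(range_start_times(1, 3))  # [0, 120, 240]
print(step_start(4))            # 360
```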

What I do here is nothing at all at step 1. So the trigger fires but nobody gets notified. You would still see the problem appear on the dashboard, though. If the problem persists after 120 seconds, step 2 of the escalation plan is executed, which informs a certain user group.

An extra goodie is the condition on the lower right. It means that this action is only run if the event has not been acknowledged yet. If anyone acknowledges the problem within the first 120 seconds then the condition would not match and nobody would get a notification.
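
To make the combined effect of the empty step 1 and the “not acknowledged” condition concrete, here is a minimal simulation. It is a simplified model under stated assumptions, not Zabbix code: the function `notification_times` and its parameters are hypothetical, a problem that ends exactly when a step is due counts as resolved, and an acknowledgement made at or before a step blocks that step.

```python
def notification_times(problem_duration, ack_at=None, period=120,
                       first_step=2, last_step=2):
    """Return the seconds (after the trigger fired) at which a
    notification would be sent.

    problem_duration: how long the problem lasts, in seconds.
    ack_at:           second at which somebody acknowledged the event,
                      or None if nobody did.
    first_step/last_step: the "From"/"To" range of the notifying
                      operation; step 1 is left empty on purpose.
    """
    sent = []
    for step in range(first_step, last_step + 1):
        due = (step - 1) * period
        if due >= problem_duration:
            break  # the problem went away before this step was due
        if ack_at is not None and ack_at <= due:
            continue  # condition "not acknowledged" no longer matches
        sent.append(due)
    return sent


print(notification_times(60))              # []    transient problem
print(notification_times(300))             # [120] persistent problem
print(notification_times(300, ack_at=90))  # []    acknowledged in time
```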

 
howto/config/alerts/delaying_notifications.txt · Last modified: 2009/10/21 18:15 by Signum
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 3.0 Unported