4 Best practices for triggers
Triggers are a powerful tool, but they can also create undesired alert noise. To get more real signals and less noise, follow these tips:
- Desensitize triggers. Instead of alerting on the latest value (too high/too low), analyze the average for an extended period by using functions like avg, min, and max.
- Consider using the percentile function (set it to 95% or 5%) in triggers if you want to avoid alerts on random spikes and drops.
- Use hysteresis to avoid trigger flapping, that is, frequent changes of trigger state (OK ↔ Problem). A continuum for the problem state can be defined by adding a recovery expression (a separate condition for problem resolution).
- Use trigger dependencies to avoid alerts that are not related to the root cause.
- Use trigger severity to alert on the more serious problems only.
- Define maintenance windows.
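The effect of the first two tips can be illustrated outside of the monitoring system. The following is a minimal Python sketch, not trigger syntax: the sample values, the threshold of 80, and the `percentile` helper (a simple nearest-rank calculation, which may differ from how the built-in percentile function computes its result) are all assumptions made for illustration.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile (illustrative helper, not the built-in function)."""
    ordered = sorted(values)
    # Smallest rank such that at least pct% of the samples are at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical samples collected over one evaluation window, ending in a random spike.
samples = [40, 42, 41, 43, 40, 41, 42, 40, 41, 43,
           42, 41, 40, 42, 41, 43, 40, 41, 42, 99]
threshold = 80  # hypothetical alert threshold

fires_on_last = samples[-1] > threshold             # reacts to the single spike
fires_on_avg = mean(samples) > threshold            # the spike is averaged out
fires_on_p95 = percentile(samples, 95) > threshold  # the top 5% of values is ignored

print(fires_on_last, fires_on_avg, fires_on_p95)
```

In this sketch only the check on the latest value fires. The averaged and percentile-based checks stay quiet because a single outlier does not dominate the window.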
Hysteresis
Sometimes an interval is needed between problem and recovery states, rather than a simple threshold. For example, if we want to define a trigger that reports a problem when server room temperature goes above 20°C and we want it to stay in the problem state until the temperature drops below 15°C, a simple trigger threshold at 20°C will not be enough.
Instead, we need to define a trigger expression for the problem event first (temperature above 20°C). Then we need to define an additional recovery condition (temperature below 15°C). This is done by defining a Recovery expression when configuring a trigger.
In this case, problem recovery takes place in two steps:
- First, the problem expression (temperature above 20°C) has to evaluate to FALSE
- Second, the recovery expression (temperature below 15°C) has to evaluate to TRUE
The recovery expression is evaluated only after the problem expression has evaluated to FALSE. The recovery expression being TRUE alone does not resolve a problem if the problem expression is still TRUE!
Example
Temperature in server room is too high.
Problem expression:
last(/server/temp)>20
Recovery expression:
last(/server/temp)<=15
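To see how these two expressions interact over time, the two-step recovery can be simulated. The following Python sketch is an illustration under stated assumptions, not the actual trigger processing logic: the `evaluate_trigger` function, the state names, and the temperature readings are hypothetical, and the two lambdas stand in for the problem and recovery expressions above.

```python
def evaluate_trigger(values, problem, recovery):
    """Return the simulated trigger state ("OK" or "PROBLEM") after each value."""
    state = "OK"
    states = []
    for value in values:
        if state == "OK":
            if problem(value):
                state = "PROBLEM"
        else:
            # Two-step recovery: the problem expression must be FALSE
            # and the recovery expression must be TRUE.
            if not problem(value) and recovery(value):
                state = "OK"
        states.append(state)
    return states


# Hypothetical temperature readings hovering around the 20°C threshold.
temps = [18, 21, 19, 21, 16, 15, 19, 21]

# Problem: temp > 20, recovery: temp <= 15 (as in the example above).
with_hysteresis = evaluate_trigger(temps, lambda t: t > 20, lambda t: t <= 15)

# Simple threshold: recovers as soon as the problem expression is FALSE.
simple = evaluate_trigger(temps, lambda t: t > 20, lambda t: True)

print(with_hysteresis)
print(simple)
```

With these readings, the simple threshold changes state five times, while the version with a recovery expression changes state three times: it stays in the problem state through the readings of 19 and 16 and recovers only at 15.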
It is unproductive to use the {TRIGGER.VALUE} macro in a recovery expression because this expression is only evaluated when the trigger is in the "Problem" state. Consequently, {TRIGGER.VALUE} will always resolve to "1" (which indicates a "Problem" state) while evaluating the expression.