4 Best practices

Triggers are a powerful tool, but they can also create unwanted alert noise. To see more real signals and less noise, follow these tips:

Desensitize triggers. Instead of alerting on the latest value (too high or too low), evaluate values over an extended period by using functions like avg, min, and max.
Consider using the percentile function (set to 95% or 5%) in triggers to avoid alerts on random spikes and drops.
Use recovery expressions to avoid frequent changes of the trigger state (OK ↔︎ Problem), also known as trigger flapping. A recovery expression lets you define a separate condition for problem resolution. See hysteresis.
Use trigger dependencies to avoid alerts that are not related to the root cause.
Use trigger severity to alert on the more serious problems only.
Define maintenance windows.
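
To illustrate the first tip, here is a minimal Python sketch (not Zabbix code) that compares a latest-value check with a windowed-average check. The sample readings, the 5-sample window, and the helper names fires_on_last and fires_on_avg are assumptions made for this illustration only.

```python
from statistics import mean

THRESHOLD = 20  # alert when the checked value is above this
WINDOW = 5      # number of most recent samples to average (assumed value)

# Hypothetical readings: one short spike versus a sustained rise.
spiky = [18, 19, 18, 24, 19, 18, 19, 18]
sustained = [18, 19, 22, 23, 24, 23, 24, 25]


def fires_on_last(values, threshold):
    """Indexes at which a check on the latest value alone would alert."""
    return [i for i, v in enumerate(values) if v > threshold]


def fires_on_avg(values, threshold, window):
    """Indexes at which a check on the average of recent samples would alert."""
    fired = []
    for i in range(len(values)):
        recent = values[max(0, i - window + 1): i + 1]
        if mean(recent) > threshold:
            fired.append(i)
    return fired


print(fires_on_last(spiky, THRESHOLD))             # [3]
print(fires_on_avg(spiky, THRESHOLD, WINDOW))      # []
print(fires_on_avg(sustained, THRESHOLD, WINDOW))  # [3, 4, 5, 6, 7]
```

In this sketch the single spike triggers the latest-value check but not the averaged one, while the sustained rise still triggers the averaged check, one sample later than a latest-value check would.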

Hysteresis

Sometimes an interval is needed between problem and recovery states, rather than a simple threshold. For example, if we want to define a trigger that reports a problem when server room temperature goes above 20°C and we want it to stay in the problem state until the temperature drops below 15°C, a simple trigger threshold at 20°C will not be enough.

Instead, we need to define a trigger expression for the problem event first (temperature above 20°C). Then we need to define an additional recovery condition (temperature below 15°C). This is done by defining a Recovery expression when configuring a trigger.

In this case, problem recovery takes place in two steps:

First, the problem expression (temperature above 20°C) has to evaluate to FALSE.
Second, the recovery expression (temperature below 15°C) has to evaluate to TRUE.

The recovery expression is evaluated only after the problem expression has evaluated to FALSE. If the problem expression is still TRUE, a TRUE recovery expression alone does not resolve the problem.
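
The two-step recovery can be sketched as a small state machine. This is an illustrative Python model only, not how Zabbix is implemented. The function names next_state and run and the sample readings are assumptions, and the thresholds follow the prose above (problem above 20°C, recovery below 15°C).

```python
PROBLEM_ABOVE = 20  # problem expression: temperature > 20
RECOVER_BELOW = 15  # recovery expression: temperature < 15


def next_state(in_problem, temperature):
    """Return the trigger state (True = Problem) after one new reading."""
    problem_expr = temperature > PROBLEM_ABOVE
    recovery_expr = temperature < RECOVER_BELOW
    if not in_problem:
        # In the OK state only the problem expression matters.
        return problem_expr
    # In the Problem state: recover only when the problem expression is
    # FALSE and the recovery expression is TRUE.
    if not problem_expr and recovery_expr:
        return False
    return True


def run(readings):
    """Feed readings through the trigger and collect the state after each."""
    state = False
    history = []
    for temperature in readings:
        state = next_state(state, temperature)
        history.append(state)
    return history


print(run([18, 21, 19, 16, 21, 14, 17]))
# [False, True, True, True, True, False, False]
```

The readings 19 and 16 fall between the two thresholds, so the modeled trigger stays in the Problem state instead of flapping. It returns to OK only at 14.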

Example

Temperature in server room is too high.

Problem expression:

last(/server/temp)>20

Recovery expression:

last(/server/temp)<=15

It is unproductive to use the {TRIGGER.VALUE} macro in a recovery expression because this expression is only evaluated when the trigger is in the "Problem" state. Consequently, {TRIGGER.VALUE} will always resolve to "1" (which indicates a "Problem" state) while evaluating the expression.
