Triggers and problems

Configuration

Naming

Trigger names must be prefixed with the LLD object they belong to.

To keep trigger names short, do not use the {HOST.NAME} macro in them. The host name is already available in the host column.

Avoid using {ITEM.LASTVALUE} in trigger name

Don’t use the {ITEM.LASTVALUE1-9} macros directly in trigger names. These macros are expanded to values at the time the problem name is generated.

Use them in the operational data field (available since Zabbix 4.4) instead.

Explain the threshold in event name

Consider explaining why the trigger fired (the threshold) in parentheses.

Use the event name field for it (supported since Zabbix 5.2), to keep the trigger name short. The event name, if defined, will be used for generating the problem name.

For example:

  • Trigger name: CPU load is too high
  • Event name: CPU load is too high (over 1.5)

Other examples for event names:

Good:

  • Temperature is too high (over 35 C for 5m)
  • MySQL: Refused connections (max_connections limit reached)

Bad:

  • Temperature is too high (now: 40)
  • MySQL: Refused connections

Trigger description

Use this field to:

  • Describe the problem in more detail. Do not just copy the text from the trigger name.
  • Explain why it is important to check this.
  • Describe the probable root cause of the problem, if possible, and which actions should be taken.
  • Provide a reference to the documentation, if any.

Expressions

Trigger expressions should be reasonably flap-resistant: rather than relying on the last value only, they should check the last 5 or 10 minutes. On the other hand, do not make expressions overly complex. For example, do not use trigger hysteresis unless it adds significant value.

Prefer user macros in trigger expressions so that thresholds can be tuned.

Good: last(/TEMPLATE_NAME/temperature)>{$TEMP.MAX.WARN}
Bad: last(/TEMPLATE_NAME/temperature)>30

Use newlines and spaces to make long trigger expressions more human-readable.

Using time and data suffixes in triggers

Always use time suffixes (1m, 5m, 1d...) and size suffixes (1K, 1M, 1G) in trigger expressions, problem names, trigger descriptions, and operational data to improve readability. Remember that you can use them in user macros, too.

Good:

  • avg(/TEMPLATE_NAME/temperature,10m)>{$TEMP.MAX.WARN}
  • avg(/TEMPLATE_NAME/memory.free,10m)<{$MEM_FREE.WARN} where {$MEM_FREE.WARN} = 100M

Bad:

  • avg(/TEMPLATE_NAME/temperature,600)>{$TEMP.MAX.WARN}
  • avg(/TEMPLATE_NAME/memory.free,600)<{$MEM_FREE.WARN} where {$MEM_FREE.WARN} = 104857600
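
To show why the suffixed forms are easier to read, the following Python sketch expands a suffixed value into the raw number it stands for. The helper name `expand_suffix` is hypothetical, and the multiplier tables are an assumption that covers only the suffixes mentioned in this section (time in seconds, sizes as powers of 1024).

```python
import re

# Assumed multipliers: time suffixes in seconds, size suffixes as powers of 1024.
TIME_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}
SIZE_UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}
MULTIPLIERS = {**TIME_UNITS, **SIZE_UNITS}

_VALUE = re.compile(r"(\d+)([smhdwKMGT]?)")


def expand_suffix(value: str) -> int:
    """Convert a value such as '10m' or '100M' into a plain integer."""
    match = _VALUE.fullmatch(value)
    if match is None:
        raise ValueError(f"unsupported value: {value!r}")
    number, unit = match.groups()
    # A value without a suffix is returned unchanged.
    return int(number) * MULTIPLIERS.get(unit, 1)


print(expand_suffix("10m"))   # 600
print(expand_suffix("100M"))  # 104857600
```

The two printed numbers are exactly the raw values from the "Bad" examples, which is the point: 10m and 100M carry the same meaning but can be read at a glance.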
Severity

Triggers created in the templates are mapped to the standard Zabbix severity scale. Choose the severity assigned to a trigger with the following in mind. The expected reaction types and times are given as examples only and will not always apply.

Severity: Not classified
Description: Not used under normal circumstances.

Severity: Info
Description: An event that is not an alarm at all. This information might be helpful in the future for retrospective analysis or for auditing.
Examples: s/n changed, user logged in, etc.
Expected reaction: None.

Severity: Warning
Description: A minor alarm that could lead to a more serious problem if left unattended.
Examples: Disk space is low but there is still some room.
Expected reaction: React during working hours; no notification is expected.

Severity: Average
Description:
  • Performance alarms: an alarm that indicates serious performance problems or key service degradation.
  • Fault alarms: partial resource failure, or warnings that might lead to a complete device fault if left unattended.
Examples: CPU utilization is high, Low memory, High device temperature, Disk health failure in the disk array, Website is slow.
Expected reaction: React during working hours; create an issue ticket if the problem persists for hours.

Severity: High
Description:
  • Performance alarms: a key service is not available.
  • Fault alarms: the device is not functioning or not available.
Examples: No ICMP PING, Website is down.
Expected reaction: React outside working hours with a page if services are affected. Otherwise, react with a ticket during working hours.

Severity: Disaster
Description: Reserved for alarms indicating blackouts, disasters, global business service faults. There should be no triggers with disaster level severity in resource templates.
Examples: Riga DC is down, Level core network is down, >50% of users cannot purchase anything from our website.
Expected reaction: Always react by paging the responsible person.

Trigger tags

Use tags to logically group triggers using the recommended tagging model.

Tag: scope
Values:
  • performance
  • availability - a monitoring target or its part may become unavailable
  • capacity - a monitored resource may be exhausted
  • notice
  • security
  • compliance - reserved for user-defined templates
Description: Specifies the type of a problem. Including at least one tag is mandatory; multiple tags are allowed.

For example, the trigger High memory utilization might contain the following tags:

scope: capacity; scope: performance
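
The tagging rule can be checked mechanically. The Python sketch below is only an illustration: the function name `check_scope_tags` and the representation of tags as (name, value) pairs are assumptions, while the allowed values are the ones listed above.

```python
# Allowed values for the "scope" tag, as listed in this section.
ALLOWED_SCOPES = {
    "performance", "availability", "capacity",
    "notice", "security", "compliance",
}


def check_scope_tags(tags):
    """Return a list of problems; an empty list means the tags follow the model.

    `tags` is a list of (name, value) pairs, e.g. [("scope", "capacity")].
    """
    problems = []
    scopes = [value for name, value in tags if name == "scope"]
    if not scopes:
        problems.append("missing mandatory 'scope' tag")
    for value in scopes:
        if value not in ALLOWED_SCOPES:
            problems.append(f"unknown scope value: {value}")
    return problems


# The "High memory utilization" example carries two scope tags.
print(check_scope_tags([("scope", "capacity"), ("scope", "performance")]))  # []
```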
Trigger macros

For macros used in trigger expressions (thresholds), use this form:

{$[<NAMESPACE>.]<METRIC_NAME>[.MAX|.MIN][.OK|.WARN|.CRIT]}

Use MAX or MIN when you need to indicate whether it is a high or a low threshold.

Good:

  • {$MYSQL.REPLICATION_LAG.MAX.WARN}
  • {$TEMP.MAX.WARN:"{#SENSOR}"}
  • {$SERVICE.STATUS.CRIT}
  • {$IF.ERRORS.MAX.WARN}
  • {$DISK.STATUS.OK}
  • {$DISK.STATUS.WARN}
  • {$DISK.STATUS.CRIT}
  • {$MEM_UTIL.MAX.WARN}
  • {$MEM_UTIL.MAX.CRIT}

Bad:

  • {$DISK_OK_STATUS}
  • {$MEMORY_UTIL_MAX}
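
A naming check along these lines can be sketched in Python. The function name `follows_threshold_form` is hypothetical, and the rule that a threshold macro must end in at least one dot-separated suffix segment (MAX/MIN and/or OK/WARN/CRIT) is an assumption drawn from the examples above, since the form itself marks every suffix as optional.

```python
import re

BOUNDS = {"MAX", "MIN"}
LEVELS = {"OK", "WARN", "CRIT"}
SEGMENT = re.compile(r"[A-Z0-9_]+")


def follows_threshold_form(macro: str) -> bool:
    """Check a user macro against the dot-separated threshold naming form."""
    if not (macro.startswith("{$") and macro.endswith("}")):
        return False
    # Drop the optional context, e.g. :"{#SENSOR}".
    name = macro[2:-1].split(":", 1)[0]
    segments = name.split(".")
    if not all(SEGMENT.fullmatch(segment) for segment in segments):
        return False
    has_suffix = False
    if segments and segments[-1] in LEVELS:
        segments.pop()
        has_suffix = True
    if segments and segments[-1] in BOUNDS:
        segments.pop()
        has_suffix = True
    # Assumption: at least one suffix and at least one name segment are required.
    return has_suffix and len(segments) >= 1


print(follows_threshold_form('{$TEMP.MAX.WARN:"{#SENSOR}"}'))  # True
print(follows_threshold_form("{$MEMORY_UTIL_MAX}"))            # False
```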

Use trigger snippets

Check the following trigger snippets library and consider reusing configuration to avoid reinventing the wheel.

Case: Something has just been restarted

Trigger: <resource> has just been restarted (uptime < 10m)

Applicable for: Uptime counters for a device, host, or running software/service
Name: <resource> has been restarted
Event name: <resource> has been restarted (uptime < 10m)
Description: <resource> uptime is less than 10 minutes
Expression: last(/TEMPLATE_NAME/METRIC)<10m
Recovery expression: -
Recovery mode: -
Manual close: Yes
Severity: Warning for the host, Info for all others
Depends on: -

Case: Any master item + preprocessing in dependent items

Trigger: Master item is not responding

<resource>: Failed to get items (no data for 30m)

Applicable for: Any type of items used for bulk data collection
Expression: nodata(/TEMPLATE_NAME/temperature,30m)=1
Recovery expression: -
Recovery mode: -
Manual close: Yes
Severity: Warning
Depends on: If present: <Proc> is not running
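
The idea behind the nodata() check can be sketched in Python. This is a simplified model and not Zabbix's implementation: the function name, the use of plain Unix timestamps, and the exact window boundaries are assumptions.

```python
def nodata(timestamps, now, period):
    """Return 1 if no value arrived within `period` seconds before `now`, else 0.

    `timestamps` holds the arrival times (in seconds) of the collected values.
    """
    received = any(now - period < ts <= now for ts in timestamps)
    return 0 if received else 1


THIRTY_MINUTES = 30 * 60

# The last value arrived 1600 seconds ago: inside the 30m window, no problem.
print(nodata([100, 400], now=2000, period=THIRTY_MINUTES))  # 0
# The last value arrived 1900 seconds ago: outside the window, problem.
print(nodata([100], now=2000, period=THIRTY_MINUTES))       # 1
```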

Case: HTTP item + regex preprocessing in dependent items

Trigger: HTTP item is not responding

Applicable for: HTTP items that provide output for subsequent regex preprocessing. Use the ‘Headers and Body’ mode in the item.

Case: <VALUE> is too high (over X)/ is too low (under X) for slow to change values

For slow-changing values (e.g. temperature), use max() for high thresholds and min() for low thresholds to get an immediate response with a delayed (confirmed) recovery.

Trigger: <VALUE> is too high (over X)

Applicable for: High temperature (slow to change)
Expression: max(/TEMPLATE_NAME/METRIC,5m) > X

Trigger: <VALUE> is too low (under X)

Applicable for: Low temperature (slow to change)
Expression: min(/TEMPLATE_NAME/METRIC,5m) < X
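
The "immediate response, delayed recovery" behavior of max() can be seen in a small Python simulation. It is a sketch under stated assumptions: one sample per minute, so a 5m window is modeled as the last 5 samples, with X = 35 and made-up temperature readings.

```python
def max_trigger_states(samples, window, threshold):
    """Evaluate max(last `window` samples) > threshold after each new sample."""
    states = []
    for i in range(len(samples)):
        recent = samples[max(0, i - window + 1): i + 1]
        states.append(max(recent) > threshold)
    return states


# Hypothetical readings, one per minute; only the third one exceeds 35.
temperature = [30, 31, 36, 34, 33, 32, 31, 30]
print(max_trigger_states(temperature, window=5, threshold=35))
# [False, False, True, True, True, True, True, False]
```

The trigger fires on the very sample that crosses the threshold and recovers only once that sample has left the window.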

Case: <VALUE> is too high (over X for 5m)/ is too low (under X for 5m) for quick-to-change and jumpy values

For jumpy values, use min() (for high thresholds) and max() (for low thresholds) to make triggers more tolerant of spikes and noise.

Trigger: <VALUE> is too high (over X for 5m)

Applicable for: CPU utilization (jumpy), signal strength (jumpy), network utilization
Expression: min(/TEMPLATE_NAME/METRIC,5m) > X

Trigger: <VALUE> is too low (under X for 5m)

Applicable for: CPU utilization (jumpy), signal strength (jumpy), network utilization
Expression: max(/TEMPLATE_NAME/METRIC,5m) < X
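
To illustrate the tolerance to spikes, the following Python sketch counts how often a trigger would change state on a noisy series when it looks at the last value only versus min() over a window. The data, the threshold of 90, and the one-sample-per-minute window model are assumptions made for the example.

```python
def evaluate(samples, window, threshold, func):
    """Apply func to the last `window` samples and compare with the threshold."""
    return [
        func(samples[max(0, i - window + 1): i + 1]) > threshold
        for i in range(len(samples))
    ]


def count_state_changes(states):
    """Count transitions between OK and PROBLEM."""
    return sum(1 for a, b in zip(states, states[1:]) if a != b)


# Hypothetical CPU utilization: short spikes first, then a sustained load.
cpu = [50, 95, 60, 92, 55, 91, 93, 94, 96, 95, 92, 60]

last_states = evaluate(cpu, window=1, threshold=90, func=lambda w: w[-1])
min_states = evaluate(cpu, window=5, threshold=90, func=min)

print(count_state_changes(last_states))  # 6
print(count_state_changes(min_states))   # 2
```

With min() the short spikes are ignored and the trigger fires only once the value has stayed above the threshold for the whole window.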

Case: Serial number has changed on the device

Trigger: Serial numbers controls

Applicable for: Serial number items
Name: <resource> has been replaced
Event name: <resource> has been replaced (new serial number received)
Description: <resource> serial number has changed. Acknowledge to close.
Expression: last(/TEMPLATE_NAME/METRIC)<>last(/TEMPLATE_NAME/METRIC,#2) and length(last(/TEMPLATE_NAME/METRIC))>0
Recovery expression: -
Recovery mode: None
Manual close: Yes
Severity: Info
Depends on: -
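
The logic of this expression is easy to restate in Python. The sketch below is illustrative only; `has_changed` is a hypothetical name and the history is assumed to be a list of strings with the newest value last.

```python
def has_changed(history):
    """True if the latest value differs from the previous one and is not empty."""
    if len(history) < 2:
        return False
    latest, previous = history[-1], history[-2]
    return latest != previous and len(latest) > 0


print(has_changed(["SN1", "SN2"]))  # True: a new serial number was received
print(has_changed(["SN1", "SN1"]))  # False: nothing changed
print(has_changed(["SN1", ""]))     # False: an empty value is not a replacement
```

The length check keeps the trigger quiet when the item returns an empty string, which would otherwise look like a change.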

Case: Software version has changed on the device

Trigger: Version controls

Applicable for: Software version items
Name: <resource> version has changed
Event name: <resource> version has changed (new version: {ITEM.VALUE})
Description: <resource> version has changed. Acknowledge to close.
Expression: last(/TEMPLATE_NAME/METRIC)<>last(/TEMPLATE_NAME/METRIC,#2) and length(last(/TEMPLATE_NAME/METRIC))>0
Recovery expression: -
Recovery mode: None
Manual close: Yes
Severity: Info
Depends on: -

Case: Control how much disk space is left

Trigger: Filesystem space is critically low with timeleft with context macro

{$VFS.FS.PUSED.MAX.CRIT:"__RESOURCE__"} = 90

Applicable for: Filesystems
Name: Disk space is critically low
Event name: Disk space is critically low (used > {$VFS.FS.PUSED.MAX.CRIT:"__RESOURCE__"})
Description: Space used: {ITEM.VALUE3} of {ITEM.VALUE2} ({ITEM.VALUE1}), time left till full: < 24h.

Two conditions should match. First, space utilization should be above {$VFS.FS.PUSED.MAX.CRIT:"__RESOURCE__"}.

Second, one of the following should be true:
- The disk free space is less than 5G.
- The disk will be full in less than 24 hours.
Expression:
last(/TEMPLATE_NAME/vfs.fs.size[{#FSNAME},pused])>{$VFS.FS.PUSED.MAX.CRIT:"{#FSNAME}"} and
(
  (last(/TEMPLATE_NAME/vfs.fs.size[{#FSNAME},total])-last(/TEMPLATE_NAME/vfs.fs.size[{#FSNAME},used]))<{$VFS.FS.FREE.MIN.CRIT:"{#FSNAME}"} or
  timeleft(/TEMPLATE_NAME/vfs.fs.size[{#FSNAME},pused],1h,100)<1d
)
Recovery expression: -
Recovery mode: None
Manual close: Yes
Severity: Average
Depends on: -
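
The description asks for the utilization condition and one of the two secondary conditions, so the "or" part has to be grouped. The Python sketch below restates that logic and shows what goes wrong without the grouping; the function names and the sample numbers are assumptions made for illustration.

```python
DAY = 86400      # seconds
G = 1024 ** 3    # bytes


def conditions(pused, total, used, pused_max, free_min, seconds_left):
    """Return the three conditions named in the trigger description."""
    above_threshold = pused > pused_max
    low_free_space = (total - used) < free_min
    fills_soon = seconds_left < DAY
    return above_threshold, low_free_space, fills_soon


def grouped(a, b, c):
    # Utilization is high AND (free space is low OR the disk fills soon).
    return a and (b or c)


def ungrouped(a, b, c):
    # Without parentheses, "and" binds tighter than "or": (a and b) or c.
    return a and b or c


# Hypothetical disk: 40% used, plenty of free space, but a steep growth trend.
sample = conditions(pused=40, total=1000 * G, used=400 * G,
                    pused_max=90, free_min=5 * G, seconds_left=12 * 3600)

print(grouped(*sample))    # False: utilization is below the threshold
print(ungrouped(*sample))  # True: the forecast alone fires the trigger
```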

Trigger: Filesystem space is low with timeleft with context macro

{$VFS.FS.PUSED.MAX.WARN:"__RESOURCE__"} = 80

Applicable for: Filesystems
Name: Disk space is low
Event name: Disk space is low (used > {$VFS.FS.PUSED.MAX.WARN:"__RESOURCE__"})
Description: Space used: {ITEM.VALUE3} of {ITEM.VALUE2} ({ITEM.VALUE1}), time left till full: < 24h.

Two conditions should match. First, space utilization should be above {$VFS.FS.PUSED.MAX.WARN:"__RESOURCE__"}.

Second, one of the following should be true:
- The disk free space is less than 10G.
- The disk will be full in less than 24 hours.
Expression:
last(/TEMPLATE_NAME/vfs.fs.size[{#FSNAME},pused])>{$VFS.FS.PUSED.MAX.WARN:"{#FSNAME}"} and
(
  (last(/TEMPLATE_NAME/vfs.fs.size[{#FSNAME},total])-last(/TEMPLATE_NAME/vfs.fs.size[{#FSNAME},used]))<{$VFS.FS.FREE.MIN.WARN:"{#FSNAME}"} or
  timeleft(/TEMPLATE_NAME/vfs.fs.size[{#FSNAME},pused],1h,100)<1d
)
Recovery expression: -
Recovery mode: None
Manual close: Yes
Severity: Warning
Depends on: Disk space is critically low
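
The timeleft(...,1h,100)<1d condition estimates when the used percentage will reach 100 based on the last hour of data. The Python sketch below shows the general idea with a least-squares straight line; it is an illustration under that assumption and not Zabbix's actual implementation, and the function name and sample points are hypothetical.

```python
def time_left(points, threshold):
    """Seconds from the last sample until a fitted line reaches `threshold`.

    `points` is a list of (timestamp_seconds, value) pairs.
    Returns None when the value is not growing or there is too little data.
    """
    n = len(points)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in points) / var_t
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    t_hit = (threshold - intercept) / slope
    return max(0.0, t_hit - points[-1][0])


# Hypothetical last hour: usage grows by 1 percentage point every 30 minutes.
history = [(0, 90.0), (1800, 91.0), (3600, 92.0)]
seconds = time_left(history, threshold=100)
print(round(seconds / 3600, 2), "hours left")
print(seconds < 86400)  # the "< 1d" condition from the expression
```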