2022 Zabbix China Summit

2 Trigger expression

Overview

The expressions used in triggers are very flexible. You can use them to create complex logical tests regarding monitored statistics.

A simple expression uses a function that is applied to the item with some parameters. The function returns a result that is compared to the threshold, using an operator and a constant.

The syntax of a simple useful expression is function(/host/key,parameter)<operator><constant>.

For example:

  min(/Zabbix server/net.if.in[eth0,bytes],5m)>100K

will trigger if the number of received bytes during the last five minutes was always over 100 kilobytes.
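
The evaluation of this example can be sketched in Python. This is only an illustration under stated assumptions, not how Zabbix server is implemented: the helper names and the sample data are hypothetical, and 100K is taken to mean 100 * 1024.

```python
def min_over_period(samples, now, period_seconds):
    """Return the smallest value collected during the last period_seconds.

    samples is a list of (unix_timestamp, value) pairs; None is returned
    when the period contains no data.
    """
    window = [value for ts, value in samples if now - period_seconds < ts <= now]
    return min(window) if window else None


def trigger_fires(samples, now):
    # Mirrors min(/Zabbix server/net.if.in[eth0,bytes],5m)>100K
    threshold = 100 * 1024  # 100K, assuming K = 1024
    lowest = min_over_period(samples, now, 5 * 60)
    return lowest is not None and lowest > threshold


# Hypothetical samples, one per minute, all above the threshold.
now = 1_700_000_300
busy = [(now - 60 * i, 150_000 + 1_000 * i) for i in range(5)]
# The same data, but with one sample that dips below the threshold.
dipped = busy[:2] + [(now - 120, 50_000)] + busy[3:]

print(trigger_fires(busy, now))    # True
print(trigger_fires(dipped, now))  # False
```

Because min() is used, a single low sample within the five minutes is enough to keep the expression FALSE.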

While the syntax is exactly the same, from the functional point of view there are two types of trigger expressions:

  • problem expression - defines the conditions of the problem
  • recovery expression (optional) - defines additional conditions of the problem resolution

When defining a problem expression alone, this expression will be used both as the problem threshold and the problem recovery threshold. As soon as the problem expression evaluates to TRUE, there is a problem. As soon as the problem expression evaluates to FALSE, the problem is resolved.

When defining both problem expression and the supplemental recovery expression, problem resolution becomes more complex: not only the problem expression has to be FALSE, but also the recovery expression has to be TRUE. This is useful to create hysteresis and avoid trigger flapping.

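Functions

Trigger functions allow referencing the collected values, the current time and other factors.

A complete list of supported functions is available.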

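Function parameters

Most numeric functions accept a number of seconds as a parameter.

You may use the prefix # to specify that a parameter has a different meaning:

| Function call | Meaning |
|---|---|
| sum(600) | Sum of all values within the last 600 seconds |
| sum(#5) | Sum of the last 5 values |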

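The function last uses a different meaning for values prefixed with #: it selects the N-th previous value. Given the values 3, 7, 2, 6, 5 (from most recent to least recent), last(#2) returns 7 and last(#5) returns 5.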
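
The meaning of the # prefix can be illustrated with a short Python sketch. The helper names are hypothetical; the list holds the values from the paragraph above, ordered from most recent to least recent.

```python
def sum_last_n(values, n):
    """Counterpart of sum(#n): add up the n most recent values."""
    return sum(values[:n])


def last_nth(values, n):
    """Counterpart of last(#n): the n-th most recent value, counting from 1."""
    return values[n - 1]


values = [3, 7, 2, 6, 5]  # most recent first

print(last_nth(values, 2))    # 7
print(last_nth(values, 5))    # 5
print(sum_last_n(values, 5))  # 23
```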

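Several functions support an additional second parameter, time_shift. This parameter allows referencing data from a period of time in the past. For example, avg(1h,1d) returns the average value for one hour, one day ago.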

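You may use the supported unit symbols in trigger expressions, for example '5m' (minutes) instead of '300' seconds or '1d' (day) instead of '86400' seconds. '1K' stands for '1024' bytes.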
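
A minimal Python sketch of how such suffixes map to plain numbers is shown below. The function name and the multiplier table are assumptions made for illustration: only the suffixes that appear in this document are listed, and G is assumed to be 1024 cubed by analogy with K.

```python
# Multipliers for the suffixes used in this document (assumed set, not exhaustive).
SUFFIX_MULTIPLIERS = {
    "m": 60,         # minutes -> seconds
    "h": 3600,       # hours -> seconds
    "d": 86400,      # days -> seconds
    "K": 1024,       # kilobytes -> bytes
    "G": 1024 ** 3,  # gigabytes -> bytes (assumption)
}


def parse_with_suffix(text):
    """Convert a value such as '5m' or '1K' to a plain number."""
    if text and text[-1] in SUFFIX_MULTIPLIERS:
        return float(text[:-1]) * SUFFIX_MULTIPLIERS[text[-1]]
    return float(text)


print(parse_with_suffix("5m"))   # 300.0
print(parse_with_suffix("1d"))   # 86400.0
print(parse_with_suffix("1K"))   # 1024.0
print(parse_with_suffix("300"))  # 300.0
```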

Operators

The following operators are supported for triggers (in descending priority of execution):

| Priority | Operator | Definition | Notes for unknown values |
|---|---|---|---|
| 1 | - | Unary minus | -Unknown → Unknown |
| 2 | not | Logical NOT | not Unknown → Unknown |
| 3 | * | Multiplication | 0 * Unknown → Unknown (yes, Unknown, not 0 - to not lose Unknown in arithmetic operations); 1.2 * Unknown → Unknown |
| 3 | / | Division | Unknown / 0 → error; Unknown / 1.2 → Unknown; 0.0 / Unknown → Unknown |
| 4 | + | Arithmetical plus | 1.2 + Unknown → Unknown |
| 4 | - | Arithmetical minus | 1.2 - Unknown → Unknown |
| 5 | < | Less than. The operator is defined as: A<B ⇔ (A<B-0.000001) | 1.2 < Unknown → Unknown |
| 5 | <= | Less than or equal to. The operator is defined as: A<=B ⇔ (A≤B+0.000001) | Unknown <= Unknown → Unknown |
| 5 | > | More than. The operator is defined as: A>B ⇔ (A>B+0.000001) | |
| 5 | >= | More than or equal to. The operator is defined as: A>=B ⇔ (A≥B-0.000001) | |
| 6 | = | Is equal. The operator is defined as: A=B ⇔ (A≥B-0.000001) and (A≤B+0.000001) | |
| 6 | <> | Not equal. The operator is defined as: A<>B ⇔ (A<B-0.000001) or (A>B+0.000001) | |
| 7 | and | Logical AND | 0 and Unknown → 0; 1 and Unknown → Unknown; Unknown and Unknown → Unknown |
| 8 | or | Logical OR | 1 or Unknown → 1; 0 or Unknown → Unknown; Unknown or Unknown → Unknown |

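The not, and and or operators are case-sensitive and must be in lowercase. They also must be surrounded by spaces or parentheses.

All operators, except unary - and not, have left-to-right associativity. Unary - and not are non-associative (meaning -(-1) and not (not 1) should be used instead of --1 and not not 1).

Evaluation result:

  • <, <=, >, >=, =, <> operators yield '1' in the trigger expression if the specified relation is true and '0' if it is false. If at least one operand is Unknown, the result is Unknown;
  • and for known operands yields '1' if both of its operands compare unequal to '0'; otherwise, it yields '0'; for unknown operands and yields '0' only if one operand compares equal to '0'; otherwise, it yields 'Unknown';
  • or for known operands yields '1' if either of its operands compares unequal to '0'; otherwise, it yields '0'; for unknown operands or yields '1' only if one operand compares unequal to '0'; otherwise, it yields 'Unknown';
  • The result of the logical negation operator not for a known operand is '0' if the value of its operand compares unequal to '0', and '1' if the value of its operand compares equal to '0'. For an unknown operand not yields 'Unknown'.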

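Value caching

Values required for trigger evaluation are cached by Zabbix server. Because of this, trigger evaluation causes a higher database load for some time after the server restarts. The value cache is not cleared when item history values are removed (either manually or by the housekeeper), so the server will use the cached values until they are older than the time periods defined in trigger functions or until the server is restarted.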

Examples of triggers

Operators

The following operators are supported for triggers (in descending priority of execution):

| Priority | Operator | Definition | Notes for unknown values | Force cast operand to float ¹ |
|---|---|---|---|---|
| 1 | - | Unary minus | -Unknown → Unknown | Yes |
| 2 | not | Logical NOT | not Unknown → Unknown | Yes |
| 3 | * | Multiplication | 0 * Unknown → Unknown (yes, Unknown, not 0 - to not lose Unknown in arithmetic operations); 1.2 * Unknown → Unknown | Yes |
| 3 | / | Division | Unknown / 0 → error; Unknown / 1.2 → Unknown; 0.0 / Unknown → Unknown | Yes |
| 4 | + | Arithmetical plus | 1.2 + Unknown → Unknown | Yes |
| 4 | - | Arithmetical minus | 1.2 - Unknown → Unknown | Yes |
| 5 | < | Less than. The operator is defined as: A<B ⇔ (A<B-0.000001) | 1.2 < Unknown → Unknown | Yes |
| 5 | <= | Less than or equal to. The operator is defined as: A<=B ⇔ (A≤B+0.000001) | Unknown <= Unknown → Unknown | Yes |
| 5 | > | More than. The operator is defined as: A>B ⇔ (A>B+0.000001) | | Yes |
| 5 | >= | More than or equal to. The operator is defined as: A>=B ⇔ (A≥B-0.000001) | | Yes |
| 6 | = | Is equal. The operator is defined as: A=B ⇔ (A≥B-0.000001) and (A≤B+0.000001) | | No ¹ |
| 6 | <> | Not equal. The operator is defined as: A<>B ⇔ (A<B-0.000001) or (A>B+0.000001) | | No ¹ |
| 7 | and | Logical AND | 0 and Unknown → 0; 1 and Unknown → Unknown; Unknown and Unknown → Unknown | Yes |
| 8 | or | Logical OR | 1 or Unknown → 1; 0 or Unknown → Unknown; Unknown or Unknown → Unknown | Yes |

¹ String operand is still cast to numeric if:

  • another operand is numeric
  • operator other than = or <> is used on an operand

(If the cast fails - numeric operand is cast to a string operand and both operands get compared as strings.)

not, and and or operators are case-sensitive and must be in lowercase. They also must be surrounded by spaces or parentheses.

All operators, except unary - and not, have left-to-right associativity. Unary - and not are non-associative (meaning -(-1) and not (not 1) should be used instead of --1 and not not 1).

Evaluation result:

  • <, <=, >, >=, =, <> operators shall yield '1' in the trigger expression if the specified relation is true and '0' if it is false. If at least one operand is Unknown the result is Unknown;
  • and for known operands shall yield '1' if both of its operands compare unequal to '0'; otherwise, it yields '0'; for unknown operands and yields '0' only if one operand compares equal to '0'; otherwise, it yields 'Unknown';
  • or for known operands shall yield '1' if either of its operands compare unequal to '0'; otherwise, it yields '0'; for unknown operands or yields '1' only if one operand compares unequal to '0'; otherwise, it yields 'Unknown';
  • The result of the logical negation operator not for a known operand is '0' if the value of its operand compares unequal to '0'; '1' if the value of its operand compares equal to '0'. For unknown operand not yields 'Unknown'.
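
The comparison definitions and the handling of Unknown described above can be sketched in Python. This is an illustration of the rules as stated in this section, not Zabbix's own code; the function names are hypothetical and None is used as a stand-in for Unknown.

```python
EPSILON = 0.000001
UNKNOWN = None  # stand-in for the Unknown value


def less_than(a, b):
    """A<B is true only if A < B - 0.000001."""
    if a is UNKNOWN or b is UNKNOWN:
        return UNKNOWN
    return 1 if a < b - EPSILON else 0


def equal(a, b):
    """A=B is true if A lies within 0.000001 of B."""
    if a is UNKNOWN or b is UNKNOWN:
        return UNKNOWN
    return 1 if (a >= b - EPSILON and a <= b + EPSILON) else 0


def logical_and(a, b):
    # A known 0 on either side decides the result, even next to Unknown.
    if (a is not UNKNOWN and a == 0) or (b is not UNKNOWN and b == 0):
        return 0
    if a is UNKNOWN or b is UNKNOWN:
        return UNKNOWN
    return 1


def logical_or(a, b):
    # A known non-zero value on either side decides the result.
    if (a is not UNKNOWN and a != 0) or (b is not UNKNOWN and b != 0):
        return 1
    if a is UNKNOWN or b is UNKNOWN:
        return UNKNOWN
    return 0


def logical_not(a):
    if a is UNKNOWN:
        return UNKNOWN
    return 0 if a != 0 else 1


print(equal(0.1 + 0.2, 0.3))      # 1
print(less_than(1.0, 1.0000001))  # 0
print(logical_and(0, UNKNOWN))    # 0
print(logical_and(1, UNKNOWN))    # None
print(logical_or(1, UNKNOWN))     # 1
print(logical_or(0, UNKNOWN))     # None
```

The tolerance explains why two values that differ by less than 0.000001 are treated as equal and why neither is considered less than the other.
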
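Example 2

www.zabbix.com is overloaded

{www.zabbix.com:system.cpu.load[all,avg1].last()}>5 or {www.zabbix.com:system.cpu.load[all,avg1].min(10m)}>2

The expression is true when either the current processor load is more than 5 or the minimum processor load over the last 10 minutes is more than 2.

Example 3

/etc/passwd has been changed

Use of function diff:

{www.zabbix.com:vfs.file.cksum[/etc/passwd].diff()}=1

The expression is true when the checksum of /etc/passwd differs from the most recent previous value.

Similar expressions can be used to monitor changes in important files, such as /etc/passwd, /etc/inetd.conf, /kernel, etc.

Example 4

Someone is downloading a large file from the Internet

Use of function min:

{www.zabbix.com:net.if.in[eth0,bytes].min(5m)}>100K

The expression is true when the number of bytes received on eth0 stays above 100 KB throughout the last 5 minutes.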

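Example 5

Both nodes of a clustered SMTP server are down. Note the use of two different hosts in one expression:

{smtp1.zabbix.com:net.tcp.service[smtp].last()}=0 and {smtp2.zabbix.com:net.tcp.service[smtp].last()}=0

The expression is true when the SMTP servers on both smtp1.zabbix.com and smtp2.zabbix.com are down.

Example 6

Zabbix agent needs to be upgraded

Use of function str():

{zabbix.zabbix.com:agent.version.str("beta8")}=1

The expression is true if Zabbix agent has version beta8 (presumably 1.0beta8).

Example 7

Server is unreachable

{zabbix.zabbix.com:icmpping.count(30m,0)}>5

The expression is true if the host "zabbix.zabbix.com" is unreachable more than 5 times in the last 30 minutes.

Example 8

No heartbeats within the last 3 minutes

Use of function nodata():

{zabbix.zabbix.com:tick.nodata(3m)}=1

To make use of this trigger, 'tick' must be defined as a Zabbix trapper item. The host should periodically send data for this item using zabbix_sender.

If no data is received within 180 seconds, the trigger value becomes PROBLEM.

Note that 'nodata' can be used for any item type.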

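Example 9

CPU load at night time

Use of function time():

{zabbix:system.cpu.load[all,avg1].min(5m)}>2 and {zabbix:system.cpu.load[all,avg1].time()}>000000 and {zabbix:system.cpu.load[all,avg1].time()}<060000

The trigger may change its status to true only at night time (00:00-06:00).

Example 10

Check if client local time is in sync with Zabbix server time

Use of function fuzzytime():

{MySQL_DB:system.localtime.fuzzytime(10)}=0

The trigger changes to the problem state when the local time on the MySQL_DB server and the Zabbix server time differ by more than 10 seconds.

Example 11

Comparing the average load today with the average load at the same time yesterday (using the second time_shift parameter).

{server:system.cpu.load.avg(1h)}/{server:system.cpu.load.avg(1h,1d)}>2

The trigger fires if the average load of the last hour exceeds the average load of the same hour yesterday more than two times.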

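Example 12

Using the value of another item to get a trigger threshold:

{Template PfSense:hrStorageFree[{#SNMPVALUE}].last()}<{Template PfSense:hrStorageSize[{#SNMPVALUE}].last()}*0.1

The trigger fires if the free storage drops below 10 percent.

Example 13

Using the evaluation result to get the number of conditions over a threshold:

({server1:system.cpu.load[all,avg1].last()}>5) + ({server2:system.cpu.load[all,avg1].last()}>5) + ({server3:system.cpu.load[all,avg1].last()}>5)>=2

The trigger fires if at least two of the comparisons in the expression are true, that is, at least two of the servers have a load over 5.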

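Hysteresis

Sometimes an interval is needed between the OK and PROBLEM states, rather than a simple threshold. For example, we would like to define a trigger that becomes PROBLEM when the server room temperature goes above 20C and stays in that state until the temperature drops below 15C.

In order to do this, we first define the trigger expression for the problem event. Then we select 'Recovery expression' for OK event generation and enter a recovery expression for the OK event.

Note that the recovery expression is evaluated only when the problem event is resolved first. It is not possible to resolve a problem by the recovery expression if the problem condition still persists.

Example 1

Temperature in server room is too high.

Problem expression:

{server:temp.last()}>20

Recovery expression:

{server:temp.last()}<=15

Example 2

Free disk space is too low.

Problem expression: it is less than 10GB for last 5 minutes

{server:vfs.fs.size[/,free].max(5m)}<10G

Recovery expression: it is more than 40GB for last 10 minutes

{server:vfs.fs.size[/,free].min(10m)}>40G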

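Expressions with unsupported items and unknown values

Versions before Zabbix 3.2 are very strict about unsupported items in a trigger expression. Any unsupported item in the expression immediately renders the trigger value Unknown.

Since Zabbix 3.2 there is a more flexible approach to unsupported items by admitting unknown values into expression evaluation:

  • For some functions, their values are not affected by whether an item is supported or unsupported. Such functions are evaluated even if they refer to unsupported items. See the list of functions and unsupported items.
  • Logical expressions with OR and AND can be evaluated to known values in two cases regardless of unknown operands:
    • "1 or unsupported_item_function1 or unsupported_item_function2 or ..." can be evaluated to '1' (True),
    • "0 and unsupported_item_function1 and unsupported_item_function2 and ..." can be evaluated to '0' (False).
      Zabbix tries to evaluate logical expressions taking unsupported items as Unknown values. In the two cases mentioned above a known value will be produced; in other cases the trigger value will be Unknown.
  • If a function evaluation for a supported item results in an error, the function value is Unknown and it takes part in further expression evaluation.

As noted above, unknown values may "disappear" in logical expressions. In arithmetic expressions unknown values always lead to the result Unknown (except division by 0).

If a trigger expression with several unsupported items evaluates to Unknown, the error message in the frontend refers to the last unsupported item.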

Example 15

Comparing two string values - operands are:

  • a function that returns a string
  • a combination of macros and strings

Problem: detect changes in the DNS query

The item key is:

net.dns.record[8.8.8.8,{$WEBSITE_NAME},{$DNS_RESOURCE_RECORD_TYPE},2,1]

with macros defined as

{$WEBSITE_NAME} = example.com
{$DNS_RESOURCE_RECORD_TYPE} = MX

and normally returns:

example.com           MX       0 mail.example.com

So our trigger expression to detect if the DNS query result deviated from the expected result is:

last(/Zabbix server/net.dns.record[8.8.8.8,{$WEBSITE_NAME},{$DNS_RESOURCE_RECORD_TYPE},2,1])<>"{$WEBSITE_NAME}           {$DNS_RESOURCE_RECORD_TYPE}       0 mail.{$WEBSITE_NAME}"
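
The string comparison in this expression can be sketched in Python as follows. The sketch assumes that the user macros are simply replaced by their values before the two strings are compared; the helper names and the changed DNS answer are hypothetical, and the column spacing is built from explicit space counts so that both strings use the same layout.

```python
macros = {
    "{$WEBSITE_NAME}": "example.com",
    "{$DNS_RESOURCE_RECORD_TYPE}": "MX",
}


def expand_macros(template, macros):
    """Replace every macro name in the template with its value."""
    for name, value in macros.items():
        template = template.replace(name, value)
    return template


# Second operand of the expression, with the macros still unexpanded.
template = ("{$WEBSITE_NAME}" + " " * 11 + "{$DNS_RESOURCE_RECORD_TYPE}"
            + " " * 7 + "0 mail.{$WEBSITE_NAME}")
expected = expand_macros(template, macros)


def trigger_fires(last_value):
    # Mirrors last(...)<>"..." : true when the strings differ.
    return last_value != expected


normal = "example.com" + " " * 11 + "MX" + " " * 7 + "0 mail.example.com"
changed = "example.com" + " " * 11 + "MX" + " " * 7 + "0 mail2.example.com"  # hypothetical

print(trigger_fires(normal))   # False
print(trigger_fires(changed))  # True
```

Since the operands are compared as strings, any difference in whitespace between the item value and the constant also makes the comparison true.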

Notice the quotes around the second operand.

Example 16

Comparing two string values - operands are:

  • a function that returns a string
  • a string constant with special characters \ and "

Problem: detect if the /tmp/hello file content is equal to:

\" //hello ?\"

Option 1) write the string directly

last(/Zabbix server/vfs.file.contents[/tmp/hello])="\\\" //hello ?\\\""

Notice how \ and " characters are escaped when the string gets compared directly.

Option 2) use a macro

{$HELLO_MACRO} = \" //hello ?\"

in the expression:

last(/Zabbix server/vfs.file.contents[/tmp/hello])={$HELLO_MACRO}
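
The escaping used in Option 1 can be reproduced with a small Python helper. The function name is hypothetical; the rule it applies (escape each backslash and each double quote with a backslash, then wrap the result in double quotes) is inferred from the example above.

```python
def quote_for_expression(raw):
    """Escape backslashes first, then double quotes, and wrap in double quotes."""
    escaped = raw.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


content = r'\" //hello ?\"'  # expected content of /tmp/hello

print(quote_for_expression(content))  # "\\\" //hello ?\\\""
```

Backslashes are escaped before quotes so that the backslashes added for the quotes are not escaped a second time.
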
Example 17

Comparing long-term periods.

Problem: Load of Exchange server increased by more than 10% last month

trendavg(/Exchange/system.cpu.load,1M:now/M)>1.1*trendavg(/Exchange/system.cpu.load,1M:now/M-1M)

You may also use the Event name field in trigger configuration to build a meaningful alert message, for example to receive something like

"Load of Exchange server increased by 24% in July (0.69) comparing to June (0.56)"

the event name must be defined as:

Load of {HOST.HOST} server increased by {{?100*trendavg(//system.cpu.load,1M:now/M)/trendavg(//system.cpu.load,1M:now/M-1M)}.fmtnum(0)}% in {{TIME}.fmttime(%B,-1M)} ({{?trendavg(//system.cpu.load,1M:now/M)}.fmtnum(2)}) comparing to {{TIME}.fmttime(%B,-2M)} ({{?trendavg(//system.cpu.load,1M:now/M-1M)}.fmtnum(2)})

It is also useful to allow manual closing in trigger configuration for this kind of problem.

Hysteresis

Sometimes an interval is needed between problem and recovery states, rather than a simple threshold. For example, if we want to define a trigger that reports a problem when server room temperature goes above 20°C and we want it to stay in the problem state until the temperature drops below 15°C, a simple trigger threshold at 20°C will not be enough.

Instead, we need to define a trigger expression for the problem event first (temperature above 20°C). Then we need to define an additional recovery condition (temperature below 15°C). This is done by defining an additional Recovery expression parameter when defining a trigger.

In this case, problem recovery will take place in two steps:

  • First, the problem expression (temperature above 20°C) will have to evaluate to FALSE
  • Second, the recovery expression (temperature below 15°C) will have to evaluate to TRUE

The recovery expression will be evaluated only when the problem event is resolved first.

The recovery expression being TRUE alone does not resolve a problem if the problem expression is still TRUE!
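
The two-step recovery can be sketched as a small state machine in Python. The function name and the temperature readings are hypothetical; the thresholds follow the server room example, with the recovery condition written as <=15 to match Example 1 below.

```python
def next_state(in_problem, temperature):
    """Return True if the trigger is in the problem state after this reading."""
    problem_expr = temperature > 20    # problem expression
    recovery_expr = temperature <= 15  # recovery expression
    if not in_problem:
        return problem_expr
    # Already in problem: recover only if the problem expression is FALSE
    # and the recovery expression is TRUE.
    return not (not problem_expr and recovery_expr)


readings = [18, 21, 19, 16, 15, 17, 22]
state = False
history = []
for temperature in readings:
    state = next_state(state, temperature)
    history.append(state)

print(history)  # [False, True, True, True, False, False, True]
```

The readings 19 and 16 are below the problem threshold, but the problem stays open until the temperature reaches 15, which is what prevents flapping around a single threshold.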

Example 1

Temperature in server room is too high.

Problem expression:

last(/server/temp)>20

Recovery expression:

last(/server/temp)<=15
Example 2

Free disk space is too low.

Problem expression: it is less than 10GB for last 5 minutes

max(/server/vfs.fs.size[/,free],5m)<10G

Recovery expression: it is more than 40GB for last 10 minutes

min(/server/vfs.fs.size[/,free],10m)>40G

Expressions with unsupported items and unknown values

Versions before Zabbix 3.2 are very strict about unsupported items in a trigger expression. Any unsupported item in the expression immediately renders the trigger value Unknown.

Since Zabbix 3.2 there is a more flexible approach to unsupported items by admitting unknown values into expression evaluation:

  • For the nodata() function, the values are not affected by whether an item is supported or unsupported. The function is evaluated even if it refers to an unsupported item.
  • Logical expressions with OR and AND can be evaluated to known values in two cases regardless of unknown operands:
    • "1 or Unsupported_item1.some_function() or Unsupported_item2.some_function() or ..." can be evaluated to '1' (True),
    • "0 and Unsupported_item1.some_function() and Unsupported_item2.some_function() and ..." can be evaluated to '0' (False).
      Zabbix tries to evaluate logical expressions taking unsupported items as Unknown values. In the two cases mentioned above a known value will be produced; in other cases trigger value will be Unknown.
  • If a function evaluation for supported item results in error, the function value is Unknown and it takes part in further expression evaluation.

Note that unknown values may "disappear" only in logical expressions as described above. In arithmetic expressions unknown values always lead to result Unknown (except division by 0).
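
The arithmetic rules can be sketched in Python in the same spirit. This is an illustration of the behaviour listed in the operators table, not Zabbix's implementation; the function names are hypothetical, None stands in for Unknown, and a Python exception stands in for the evaluation error.

```python
UNKNOWN = None  # stand-in for the Unknown value


def multiply(a, b):
    # Unknown is never dropped, so even 0 * Unknown stays Unknown.
    if a is UNKNOWN or b is UNKNOWN:
        return UNKNOWN
    return a * b


def divide(a, b):
    # Division by a known 0 is an error even when the dividend is Unknown.
    if b is not UNKNOWN and b == 0:
        raise ZeroDivisionError("division by zero")
    if a is UNKNOWN or b is UNKNOWN:
        return UNKNOWN
    return a / b


print(multiply(0, UNKNOWN))  # None
print(divide(UNKNOWN, 1.2))  # None
print(divide(0.0, UNKNOWN))  # None
```

Unlike "0 and Unknown", which yields a known 0, the arithmetic operators never turn an Unknown operand into a known result.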

If a trigger expression with several unsupported items evaluates to Unknown the error message in the frontend refers to the last unsupported item evaluated.