Zabbix China Summit 2022

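1 Trigger-based event correlation

Overview

Trigger-based event correlation allows correlating the separate problems reported by one trigger.

Normally, an OK event in Zabbix closes all problem events created by one trigger, but some cases call for a more granular approach. For example, when monitoring a log file you may want to discover certain problems in the log and close them individually rather than all together.

This is the case with triggers that have the Multiple option of Problem event generation mode enabled in trigger configuration. Such triggers are normally used for log monitoring, trap processing, etc.

In other words, the same trigger can create separate events identified by event tags. Problem events can therefore be identified one by one and closed separately based on that identification.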

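How it works

In log monitoring you may encounter lines similar to these:

Line1: Application 1 stopped
Line2: Application 2 stopped
Line3: Application 1 was restarted
Line4: Application 2 was restarted

The idea of event correlation is to match the problem event from Line1 to the resolution from Line3, and the problem event from Line2 to the resolution from Line4, and to close these problems one by one:

Line1: Application 1 stopped
Line3: Application 1 was restarted #problem from Line1 closed
Line2: Application 2 stopped
Line4: Application 2 was restarted #problem from Line2 closed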

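To do this, the related events need to be tagged, for example, as "Application 1" and "Application 2". The tag value can be extracted by applying a regular expression to the log line. Then, when events are created, they are tagged "Application 1" and "Application 2" respectively, and each problem can be matched to its resolution.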
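The matching described above can be illustrated with a small simulation. This is only a sketch of the idea, not how Zabbix implements it: the `correlate` function, the `TAG_VALUE` pattern and the "stopped"/"restarted" checks are assumptions made for this example.

```python
import re

# Log lines mirroring the example above (Line1..Line4).
LOG_LINES = [
    "Application 1 stopped",
    "Application 2 stopped",
    "Application 1 was restarted",
    "Application 2 was restarted",
]

# Assumed regular expression used to extract the tag value from a line.
TAG_VALUE = re.compile(r"Application \d+")


def correlate(lines):
    """Pair each recovery line with the open problems sharing its tag value."""
    open_problems = []  # (problem line number, tag value)
    closed = []         # (problem line number, recovery line number, tag value)
    for number, line in enumerate(lines, start=1):
        match = TAG_VALUE.search(line)
        if match is None:
            continue  # lines without a tag value are ignored in this sketch
        tag = match.group(0)
        if line.endswith("stopped"):
            open_problems.append((number, tag))
        elif line.endswith("restarted"):
            # Close only the problems whose tag value matches the recovery.
            closed += [(p, number, t) for p, t in open_problems if t == tag]
            open_problems = [(p, t) for p, t in open_problems if t != tag]
    return closed, open_problems


closed, still_open = correlate(LOG_LINES)
for problem, recovery, tag in closed:
    print(f"Line{problem} closed by Line{recovery} ({tag})")
```

Each recovery closes only the problem carrying the same tag value, so the two applications are resolved independently.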

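Configuration

Event correlation is configured in the trigger configuration form:

All mandatory input fields are marked with a red asterisk.

  • Select the Multiple option for Problem event generation mode
  • Select All problems if tag values match for OK event closes
  • Enter the name of the tag for event matching
  • Configure the tags to extract tag values from log lines

If configured successfully, you will be able to see problem events tagged by application and matched to their resolution in Monitoring → Problems.

Because misconfiguration is possible, where similar event tags may be created for unrelated problems, please review the cases outlined below: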

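  • When two applications write error and recovery messages to the same log file, a user may use separate regular expressions in the tag values to extract the application names, say "application A" and "application B", from the {ITEM.VALUE} macro (e.g. when the message formats differ). However, this may not work as planned if there is no match to the regular expressions. Non-matching regular expressions yield empty tag values, and a single empty tag value in both the problem and the OK event is enough to correlate them. So a recovery message from "application A" may accidentally close an error message from "application B".
  • Actual tags and tag values only become visible when a trigger fires. If the regular expression used is invalid, it is replaced with the default string "UNKNOWN". If the initial problem event with the "UNKNOWN" tag value is missed, subsequent OK events with the same "UNKNOWN" tag value may appear and close problem events that they should not have closed.
  • If a user uses the {ITEM.VALUE} macro without macro functions as the tag value, the 255-character limitation applies. When log messages are long and the first 255 characters are non-specific, this may also result in similar event tags for unrelated problems.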
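The first case can be reproduced with a simplified model. The sketch below assumes two hypothetical message formats ("[alpha] ..." for application A and "beta: ..." for application B) and treats two events as correlated when they share at least one tag name/value pair, as the text describes; the function names are illustrative only.

```python
import re

# Hypothetical message formats for two applications sharing one log file.
PATTERN_A = re.compile(r"^\[(\w+)\]")  # e.g. "[alpha] service stopped"
PATTERN_B = re.compile(r"^(\w+):")     # e.g. "beta: service stopped"


def extract(pattern, line):
    """Return the captured name, or an empty string when there is no match."""
    match = pattern.search(line)
    return match.group(1) if match else ""


def event_tags(line):
    """Two Application tags per event, one per regular expression."""
    return {("Application", extract(p, line)) for p in (PATTERN_A, PATTERN_B)}


def correlates(problem_line, ok_line, ignore_empty=False):
    """Events correlate when they share at least one tag name/value pair."""
    shared = event_tags(problem_line) & event_tags(ok_line)
    if ignore_empty:
        shared = {tag for tag in shared if tag[1] != ""}
    return bool(shared)


problem = "beta: service stopped"   # error from application B
recovery = "[alpha] service started"  # recovery from application A

print(correlates(problem, recovery))
print(correlates(problem, recovery, ignore_empty=True))
```

The two events share only the empty tag value, yet in this model that is enough for the recovery from one application to close the problem from the other. The `ignore_empty` flag is not a Zabbix option; it only shows that the accidental match comes from the empty value.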
Item

To begin with, you may want to set up an item that monitors a log file, for example:

log[/var/log/syslog]

With the item set up, wait a minute for the configuration changes to be picked up and then go to Latest data to make sure that the item has started collecting data.

Trigger

With the item working, you need to configure the trigger. It is important to decide which entries in the log file are worth paying attention to. For example, the following trigger expression searches for a string like 'Stopping' to signal potential problems:

find(/My host/log[/var/log/syslog],,"regexp","Stopping")=1 

To make sure that each line containing the string "Stopping" is considered a problem, also set the Problem event generation mode in trigger configuration to 'Multiple'.

Then define a recovery expression. The following recovery expression will resolve all problems if a log line is found containing the string "Starting":

find(/My host/log[/var/log/syslog],,"regexp","Starting")=1 

Since that is not the desired behavior, it is important to make sure that only the corresponding root problems are closed, not all problems. That's where tagging can help.

Problems and resolutions can be matched by specifying a tag in the trigger configuration. The following settings have to be made:

  • Problem event generation mode: Multiple
  • OK event closes: All problems if tag values match
  • Enter the name of the tag for event matching
  • Configure the tags to extract tag values from log lines

If configured successfully, you will be able to see problem events tagged by application and matched to their resolution in Monitoring → Problems.
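The difference between a recovery that closes everything and one that closes only matching problems can be sketched as follows. The log lines, the service names and the `SERVICE` pattern are hypothetical, and `replay` is a simplified model rather than Zabbix behavior in full detail.

```python
import re

# Assumed log format: "Stopping <service>" / "Starting <service>".
SERVICE = re.compile(r"(?:Stopping|Starting) (\S+)")

# Hypothetical log lines for two services.
LINES = ["Stopping nginx", "Stopping mysql", "Starting nginx"]


def replay(lines, match_tags):
    """Return the tag values of problems still open after all lines."""
    open_problems = []
    for line in lines:
        match = SERVICE.search(line)
        tag = match.group(1) if match else ""
        if "Stopping" in line:
            # Multiple mode: every matching line opens its own problem.
            open_problems.append(tag)
        elif "Starting" in line:
            # Without tag matching a recovery closes every open problem;
            # with it, only problems carrying the same tag value are closed.
            open_problems = [t for t in open_problems if match_tags and t != tag]
    return open_problems


print(replay(LINES, match_tags=False))
print(replay(LINES, match_tags=True))
```

Without tag matching, the single "Starting nginx" line leaves no open problems; with tag matching, the mysql problem stays open until its own recovery arrives.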

Because misconfiguration is possible, when similar event tags may be created for unrelated problems, please review the cases outlined below!

  • With two applications writing error and recovery messages to the same log file a user may decide to use two Application tags in the same trigger with different tag values by using separate regular expressions in the tag values to extract the names of, say, application A and application B from the {ITEM.VALUE} macro (e.g. when the message formats differ). However, this may not work as planned if there is no match to the regular expressions. Non-matching regexps will yield empty tag values and a single empty tag value in both problem and OK events is enough to correlate them. So a recovery message from application A may accidentally close an error message from application B.
  • Actual tags and tag values only become visible when a trigger fires. If the regular expression used is invalid, it is silently replaced with an *UNKNOWN* string. If the initial problem event with an *UNKNOWN* tag value is missed, there may appear subsequent OK events with the same *UNKNOWN* tag value that may close problem events which they shouldn't have closed.
  • If a user uses the {ITEM.VALUE} macro without macro functions as the tag value, the 255-character limitation applies. When log messages are long and the first 255 characters are non-specific, this may also result in similar event tags for unrelated problems.
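The last case is easy to reproduce. The sketch below assumes that a tag value is simply cut to its first 255 characters, as the text states; the messages and the `tag_value` helper are invented for illustration.

```python
TAG_VALUE_LIMIT = 255  # limit stated in the text above


def tag_value(item_value):
    """Model of a tag value taken from the raw item value and cut at the limit."""
    return item_value[:TAG_VALUE_LIMIT]


# Hypothetical long messages that differ only after a non-specific prefix.
prefix = "x" * 300
message_a = prefix + " application A failed"
message_b = prefix + " application B recovered"

# Both unrelated messages end up with the same tag value.
print(tag_value(message_a) == tag_value(message_b))
```

Extracting only the specific part of the message into the tag value, rather than using the whole line, avoids this collision.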