Zabbix China Summit 2022

14 Unreachable/unavailable host settings

Overview

Several configuration parameters define how Zabbix server should behave when an agent check (Zabbix, SNMP, IPMI, JMX) fails and a host becomes unreachable.

Unreachable host

A host is treated as unreachable after a failed check (network error, timeout) by Zabbix, SNMP, IPMI or JMX agents. Note that Zabbix agent active checks do not influence host availability in any way.

From that moment, UnreachableDelay defines how often the host is rechecked while it remains unreachable. Each recheck uses one of the host's items (including LLD rules) and is performed by unreachable pollers (or by IPMI pollers for IPMI checks). By default, the next check takes place after 15 seconds.

In the Zabbix server log unreachability is indicated by messages like these:

Zabbix agent item "system.cpu.load[percpu,avg1]" on host "New host" failed: first network error, wait for 15 seconds
Zabbix agent item "system.cpu.load[percpu,avg15]" on host "New host" failed: another network error, wait for 15 seconds

Note that the message indicates both the exact item that failed and its item type (Zabbix agent).
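The fields mentioned above can be pulled out of such a message with a regular expression. The following is a minimal sketch in Python, not part of Zabbix: the pattern is derived only from the two sample messages, the function name `parse_network_error` is hypothetical, and the line is assumed to contain just the message text (any prefix added by the logger would have to be stripped first).

```python
import re

# Pattern built from the two sample messages in this section only.
NETWORK_ERROR = re.compile(
    r'(?P<item_type>.+?) item "(?P<key>.+?)" on host "(?P<host>.+?)" failed: '
    r'(?P<occurrence>first|another) network error, '
    r'wait for (?P<wait>\d+) seconds'
)


def parse_network_error(line):
    """Return the fields of an unreachability message, or None if no match."""
    match = NETWORK_ERROR.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["wait"] = int(fields["wait"])  # seconds until the next recheck
    return fields


sample = ('Zabbix agent item "system.cpu.load[percpu,avg1]" on host '
          '"New host" failed: first network error, wait for 15 seconds')
print(parse_network_error(sample))
```

For the sample message this yields the item type `Zabbix agent`, the key `system.cpu.load[percpu,avg1]`, the host `New host`, the occurrence `first` and a wait of 15 seconds.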

The Timeout parameter also affects how soon a host is rechecked during unreachability, because a check that times out uses up its Timeout before UnreachableDelay starts counting. If Timeout is 20 seconds and UnreachableDelay is 30 seconds, the next check will take place 50 seconds after the first attempt.
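The arithmetic in this example can be written out as a short Python sketch. It is a simplified model rather than Zabbix code: the function name is hypothetical, and the failed attempt is assumed to use its whole Timeout before the delay begins.

```python
def seconds_until_next_check(timeout, unreachable_delay):
    """Gap between the start of a timed-out check and the next recheck.

    Simplified model: the failed attempt is assumed to run for the full
    Timeout, after which UnreachableDelay elapses.
    """
    return timeout + unreachable_delay


# Values from the example above: Timeout 20 s, UnreachableDelay 30 s.
print(seconds_until_next_check(timeout=20, unreachable_delay=30))  # 50
```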

The UnreachablePeriod parameter defines how long the unreachability period lasts in total. By default UnreachablePeriod is 45 seconds. UnreachablePeriod should be several times larger than UnreachableDelay, so that a host is rechecked more than once before it becomes unavailable.
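To see why the period should be several times the delay, the recheck schedule can be approximated in Python. This is only a rough sketch under stated assumptions, not the scheduler Zabbix actually uses: the function `recheck_offsets` is hypothetical, every failed attempt is assumed to take the full `timeout`, and a recheck is counted only if it starts strictly before the period ends.

```python
def recheck_offsets(unreachable_delay, unreachable_period, timeout=0):
    """Offsets in seconds (from the first failure) at which rechecks start
    before the unreachability period runs out, in a simplified model."""
    step = timeout + unreachable_delay
    if step <= 0:
        raise ValueError("timeout + unreachable_delay must be positive")
    offsets = []
    t = step
    while t < unreachable_period:  # boundary handling is an assumption
        offsets.append(t)
        t += step
    return offsets


print(recheck_offsets(15, 45))              # [15, 30]
print(recheck_offsets(30, 45, timeout=20))  # []
```

With the defaults quoted above (15 and 45 seconds) the sketch gives two rechecks. With a Timeout of 20 seconds and an UnreachableDelay of 30 seconds against the same 45-second period it gives none, which is the situation the sizing advice is meant to avoid.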

If the unreachable host reappears, the monitoring returns to normal automatically:

resuming Zabbix agent checks on host "New host": connection restored

Unavailable host

If the host has not reappeared by the time UnreachablePeriod ends, the host is treated as unavailable.

In the server log it is indicated by messages like these:

temporarily disabling Zabbix agent checks on host "New host": host unavailable

In the frontend, the host availability icon for the respective interface changes from green (or gray) to red. On mouseover, a tooltip with the error description is displayed.

The UnavailableDelay parameter defines how often a host is checked during host unavailability.

By default it is 60 seconds, so "temporarily disabling" in the log message above means that checks are disabled for one minute.

When the connection to the host is restored, the monitoring returns to normal automatically, too:

enabling Zabbix agent checks on host "New host": host became available
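The messages quoted in this section trace a host's path from unreachable to unavailable and back. As an illustration, the Python sketch below follows that path from a list of log messages. It relies only on the message texts shown in this section; the state names and the function `state_history` are labels invented for the example, not Zabbix identifiers.

```python
import re

HOST = re.compile(r'on host "(?P<host>.+?)"')

# (fragment of a message quoted in this section, resulting state label)
RULES = (
    ("network error, wait for", "unreachable"),
    (": connection restored", "available"),
    (": host unavailable", "unavailable"),
    (": host became available", "available"),
)


def state_history(lines):
    """Return (host, state) pairs for every line that matches a rule."""
    history = []
    for line in lines:
        host_match = HOST.search(line)
        if host_match is None:
            continue  # not one of the messages handled here
        for fragment, state in RULES:
            if fragment in line:
                history.append((host_match.group("host"), state))
                break
    return history


log = [
    'Zabbix agent item "system.cpu.load[percpu,avg1]" on host "New host" '
    'failed: first network error, wait for 15 seconds',
    'temporarily disabling Zabbix agent checks on host "New host": '
    'host unavailable',
    'enabling Zabbix agent checks on host "New host": host became available',
]
for host, state in state_history(log):
    print(host, state)
```

Run on the three sample messages, the sketch reports "New host" as unreachable, then unavailable, then available again.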