Prometheus


Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/os/linux_prom


Template OS Linux by Prom

Overview

For Zabbix version: 4.4
This template collects Linux metrics from node_exporter 0.18 and above. Support for older node_exporter versions is provided as 'best effort'.

This template was tested on:

  • node_exporter, version 0.17.0
  • node_exporter, version 0.18.1

Setup

Please refer to the node_exporter docs. Use node_exporter v0.18.0 or above.

Zabbix configuration

No specific Zabbix configuration is required.

Macros used

Name | Description | Default
{$CPU.UTIL.CRIT}

-

90
{$IF.ERRORS.WARN}

-

2
{$IF.UTIL.MAX}

-

90
{$IFCONTROL}

-

1
{$KERNEL.MAXFILES.MIN}

-

256
{$LOAD_AVG_PER_CPU.MAX.WARN}

Load per CPU considered sustainable. Tune if needed.

1.5
{$MEMORY.AVAILABLE.MIN}

-

20M
{$MEMORY.UTIL.MAX}

-

90
{$NET.IF.IFALIAS.MATCHES}

-

^.*$
{$NET.IF.IFALIAS.NOT_MATCHES}

-

CHANGE_IF_NEEDED
{$NET.IF.IFNAME.MATCHES}

-

^.*$
{$NET.IF.IFNAME.NOT_MATCHES}

Filter out loopbacks, nulls, docker veth links and docker0 bridge by default

`(^Software Loopback Interface
{$NET.IF.IFOPERSTATUS.MATCHES}

-

^.*$
{$NET.IF.IFOPERSTATUS.NOT_MATCHES}

Ignore notPresent(7)

^7$
{$NODE_EXPORTER_PORT}

TCP Port node_exporter is listening on.

9100
{$SWAP.PFREE.MIN.WARN}

-

50
{$SYSTEM.FUZZYTIME.MAX}

-

60
{$VFS.DEV.DEVNAME.MATCHES}

This macro is used in block devices discovery. Can be overridden on the host or linked template level

.+
{$VFS.DEV.DEVNAME.NOT_MATCHES}

This macro is used in block devices discovery. Can be overridden on the host or linked template level

`^(loop[0-9]*
{$VFS.DEV.READ.AWAIT.WARN}

Disk read average response time (in ms) before the trigger would fire

20
{$VFS.DEV.WRITE.AWAIT.WARN}

Disk write average response time (in ms) before the trigger would fire

20
{$VFS.FS.FSDEVICE.MATCHES}

This macro is used in filesystems discovery. Can be overridden on the host or linked template level

^.+$
{$VFS.FS.FSDEVICE.NOT_MATCHES}

This macro is used in filesystems discovery. Can be overridden on the host or linked template level

^\s$
{$VFS.FS.FSNAME.MATCHES}

This macro is used in filesystems discovery. Can be overridden on the host or linked template level

.+
{$VFS.FS.FSNAME.NOT_MATCHES}

This macro is used in filesystems discovery. Can be overridden on the host or linked template level

`^(/dev
{$VFS.FS.FSTYPE.MATCHES}

This macro is used in filesystems discovery. Can be overridden on the host or linked template level

`^(btrfs
{$VFS.FS.FSTYPE.NOT_MATCHES}

This macro is used in filesystems discovery. Can be overridden on the host or linked template level

^\s$
{$VFS.FS.INODE.PFREE.MIN.CRIT}

-

10
{$VFS.FS.INODE.PFREE.MIN.WARN}

-

20
{$VFS.FS.PUSED.MAX.CRIT}

-

90
{$VFS.FS.PUSED.MAX.WARN}

-

80

Template links

There are no template links in this template.

Discovery rules

Name | Description | Type | Key and additional info
Network interface discovery

Discovery of network interfaces. Requires node_exporter v0.18 and up.

DEPENDENTnet.if.discovery[node_exporter]

Preprocessing:

- PROMETHEUS_TO_JSON: {__name__=~"^node_network_info$"}

Filter:

AND

- A: {#IFNAME} MATCHES_REGEX {$NET.IF.IFNAME.MATCHES}

- B: {#IFNAME} NOT_MATCHES_REGEX {$NET.IF.IFNAME.NOT_MATCHES}

- C: {#IFALIAS} MATCHES_REGEX {$NET.IF.IFALIAS.MATCHES}

- D: {#IFALIAS} NOT_MATCHES_REGEX {$NET.IF.IFALIAS.NOT_MATCHES}

- E: {#IFOPERSTATUS} MATCHES_REGEX {$NET.IF.IFOPERSTATUS.MATCHES}

- F: {#IFOPERSTATUS} NOT_MATCHES_REGEX {$NET.IF.IFOPERSTATUS.NOT_MATCHES}

Mounted filesystem discovery

Discovery of file systems of different types.

DEPENDENTvfs.fs.discovery[node_exporter]

Preprocessing:

- PROMETHEUS_TO_JSON: {__name__=~"^node_filesystem_size(?:_bytes)?$", mountpoint=~".+"}

Filter:

AND

- A: {#FSTYPE} MATCHES_REGEX {$VFS.FS.FSTYPE.MATCHES}

- B: {#FSTYPE} NOT_MATCHES_REGEX {$VFS.FS.FSTYPE.NOT_MATCHES}

- C: {#FSNAME} MATCHES_REGEX {$VFS.FS.FSNAME.MATCHES}

- D: {#FSNAME} NOT_MATCHES_REGEX {$VFS.FS.FSNAME.NOT_MATCHES}

- E: {#FSDEVICE} MATCHES_REGEX {$VFS.FS.FSDEVICE.MATCHES}

- F: {#FSDEVICE} NOT_MATCHES_REGEX {$VFS.FS.FSDEVICE.NOT_MATCHES}

Block devices discovery

-

DEPENDENTvfs.dev.discovery[node_exporter]

Preprocessing:

- PROMETHEUS_TO_JSON: node_disk_io_now{device=~".+"}

Filter:

AND

- A: {#DEVNAME} MATCHES_REGEX {$VFS.DEV.DEVNAME.MATCHES}

- B: {#DEVNAME} NOT_MATCHES_REGEX {$VFS.DEV.DEVNAME.NOT_MATCHES}
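
The discovery filters above combine their conditions with AND: an entity is kept only when every MATCHES_REGEX condition matches and every NOT_MATCHES_REGEX condition does not. The following is a minimal sketch of that logic, not Zabbix's own implementation. The function name `passesFilter`, the sample entities, and the `^lo$` exclusion pattern are hypothetical, and JavaScript regular expressions stand in for the regex engine Zabbix uses.

```javascript
// Sketch only: evaluates an AND filter over LLD macro values.
// passesFilter and the condition objects are illustrative, not Zabbix APIs.
function passesFilter(entity, conditions) {
  return conditions.every(function (cond) {
    var raw = entity[cond.macro];
    var value = raw === undefined ? '' : String(raw);
    var matched = new RegExp(cond.regex).test(value);
    return cond.operator === 'MATCHES_REGEX' ? matched : !matched;
  });
}

var conditions = [
  { macro: '{#IFNAME}', operator: 'MATCHES_REGEX', regex: '^.*$' },
  // Hypothetical exclusion pattern, not the template default.
  { macro: '{#IFNAME}', operator: 'NOT_MATCHES_REGEX', regex: '^lo$' },
  { macro: '{#IFOPERSTATUS}', operator: 'MATCHES_REGEX', regex: '^.*$' },
  { macro: '{#IFOPERSTATUS}', operator: 'NOT_MATCHES_REGEX', regex: '^7$' }
];

// Invented sample entities.
var eth0 = { '{#IFNAME}': 'eth0', '{#IFOPERSTATUS}': '1' };
var loopback = { '{#IFNAME}': 'lo', '{#IFOPERSTATUS}': '1' };
var absent = { '{#IFNAME}': 'eth1', '{#IFOPERSTATUS}': '7' };

console.log(passesFilter(eth0, conditions));
console.log(passesFilter(loopback, conditions));
console.log(passesFilter(absent, conditions));
```

In this sketch only `eth0` survives: the loopback entry fails the hypothetical name exclusion, and the third entry fails the `^7$` operational-status exclusion.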

Items collected

Group | Name | Description | Type | Key and additional info
CPULoad average (1m avg)

-

DEPENDENTsystem.cpu.load.avg1[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: node_load1

CPULoad average (5m avg)

-

DEPENDENTsystem.cpu.load.avg5[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: node_load5

CPULoad average (15m avg)

-

DEPENDENTsystem.cpu.load.avg15[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: node_load15

CPUNumber of CPUs

-

DEPENDENTsystem.cpu.num[node_exporter]

Preprocessing:

- PROMETHEUS_TO_JSON: {__name__=~"^node_cpu(?:_seconds_total)?$",cpu=~".+",mode="idle"}

- JAVASCRIPT:
  //count the number of cores
  return JSON.parse(value).length

CPUCPU utilization

CPU utilization in %

DEPENDENTsystem.cpu.util[node_exporter]

Preprocessing:

- JAVASCRIPT:
  //Calculate utilization
  return (100 - value)

CPUCPU idle time

The time the CPU has spent doing nothing.

DEPENDENTsystem.cpu.idle[node_exporter]

Preprocessing:

- PROMETHEUS_TO_JSON: {__name__=~"^node_cpu(?:_seconds_total)?$",cpu=~".+",mode="idle"}

- JAVASCRIPT:
  //calculates average, all cpu utilization
  var valueArr = JSON.parse(value);
  return valueArr.reduce(function(acc,obj){ return acc + parseFloat(obj['value']) },0)/valueArr.length;

- CHANGE_PER_SECOND

- MULTIPLIER: 100
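
The preprocessing chain for the CPU time items can be traced by hand. The sketch below assumes two polls 60 seconds apart with invented per-core idle counters, averages them as the JAVASCRIPT step does, then applies CHANGE_PER_SECOND and MULTIPLIER: 100. It also assumes that the CPU utilization item receives the idle percentage as its input. The helper names `averageAcrossCores` and `changePerSecond` are hypothetical.

```javascript
// Sketch only: mirrors the documented steps for system.cpu.idle.
// All sample values are invented for illustration.

// JAVASCRIPT step: average the per-core counters from PROMETHEUS_TO_JSON output.
function averageAcrossCores(json) {
  var valueArr = JSON.parse(json);
  return valueArr.reduce(function (acc, obj) {
    return acc + parseFloat(obj.value);
  }, 0) / valueArr.length;
}

// CHANGE_PER_SECOND step: counter delta divided by elapsed seconds.
function changePerSecond(prevValue, prevTime, currValue, currTime) {
  return (currValue - prevValue) / (currTime - prevTime);
}

var firstPoll = JSON.stringify([
  { name: 'node_cpu_seconds_total', value: '100', labels: { cpu: '0', mode: 'idle' } },
  { name: 'node_cpu_seconds_total', value: '200', labels: { cpu: '1', mode: 'idle' } }
]);
var secondPoll = JSON.stringify([
  { name: 'node_cpu_seconds_total', value: '154', labels: { cpu: '0', mode: 'idle' } },
  { name: 'node_cpu_seconds_total', value: '254', labels: { cpu: '1', mode: 'idle' } }
]);

var idleRate = changePerSecond(
  averageAcrossCores(firstPoll), 0,
  averageAcrossCores(secondPoll), 60
);
var idlePercent = idleRate * 100;      // MULTIPLIER: 100
var utilPercent = 100 - idlePercent;   // assumed input of system.cpu.util

console.log(idlePercent, utilPercent);
```

With these numbers each core was idle for 54 of the 60 seconds, so the idle share is about 90% and utilization about 10%.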

CPUCPU system time

The time the CPU has spent running the kernel and its processes.

DEPENDENTsystem.cpu.system[node_exporter]

Preprocessing:

- PROMETHEUS_TO_JSON: {__name__=~"^node_cpu(?:_seconds_total)?$",cpu=~".+",mode="system"}

- JAVASCRIPT:
  //calculates average, all cpu utilization
  var valueArr = JSON.parse(value);
  return valueArr.reduce(function(acc,obj){ return acc + parseFloat(obj['value']) },0)/valueArr.length;

- CHANGE_PER_SECOND

- MULTIPLIER: 100

CPUCPU user time

The time the CPU has spent running users' processes that are not niced.

DEPENDENTsystem.cpu.user[node_exporter]

Preprocessing:

- PROMETHEUS_TO_JSON: {__name__=~"^node_cpu(?:_seconds_total)?$",cpu=~".+",mode="user"}

- JAVASCRIPT:
  //calculates average, all cpu utilization
  var valueArr = JSON.parse(value);
  return valueArr.reduce(function(acc,obj){ return acc + parseFloat(obj['value']) },0)/valueArr.length;

- CHANGE_PER_SECOND

- MULTIPLIER: 100

CPUCPU steal time

The amount of CPU 'stolen' from this virtual machine by the hypervisor for other tasks (such as running another virtual machine).

DEPENDENTsystem.cpu.steal[node_exporter]

Preprocessing:

- PROMETHEUS_TO_JSON: {__name__=~"^node_cpu(?:_seconds_total)?$",cpu=~".+",mode="steal"}

- JAVASCRIPT:
  //calculates average, all cpu utilization
  var valueArr = JSON.parse(value);
  return valueArr.reduce(function(acc,obj){ return acc + parseFloat(obj['value']) },0)/valueArr.length;

- CHANGE_PER_SECOND

- MULTIPLIER: 100

CPUCPU softirq time

The amount of time the CPU has been servicing software interrupts.

DEPENDENTsystem.cpu.softirq[node_exporter]

Preprocessing:

- PROMETHEUS_TO_JSON: {__name__=~"^node_cpu(?:_seconds_total)?$",cpu=~".+",mode="softirq"}

- JAVASCRIPT:
  //calculates average, all cpu utilization
  var valueArr = JSON.parse(value);
  return valueArr.reduce(function(acc,obj){ return acc + parseFloat(obj['value']) },0)/valueArr.length;

- CHANGE_PER_SECOND

- MULTIPLIER: 100

CPUCPU nice time

The time the CPU has spent running users' processes that have been niced.

DEPENDENTsystem.cpu.nice[node_exporter]

Preprocessing:

- PROMETHEUS_TO_JSON: {__name__=~"^node_cpu(?:_seconds_total)?$",cpu=~".+",mode="nice"}

- JAVASCRIPT:
  //calculates average, all cpu utilization
  var valueArr = JSON.parse(value);
  return valueArr.reduce(function(acc,obj){ return acc + parseFloat(obj['value']) },0)/valueArr.length;

- CHANGE_PER_SECOND

- MULTIPLIER: 100

CPUCPU iowait time

Amount of time the CPU has been waiting for I/O to complete.

DEPENDENTsystem.cpu.iowait[node_exporter]

Preprocessing:

- PROMETHEUS_TO_JSON: {__name__=~"^node_cpu(?:_seconds_total)?$",cpu=~".+",mode="iowait"}

- JAVASCRIPT:
  //calculates average, all cpu utilization
  var valueArr = JSON.parse(value);
  return valueArr.reduce(function(acc,obj){ return acc + parseFloat(obj['value']) },0)/valueArr.length;

- CHANGE_PER_SECOND

- MULTIPLIER: 100

CPUCPU interrupt time

The amount of time the CPU has been servicing hardware interrupts.

DEPENDENTsystem.cpu.interrupt[node_exporter]

Preprocessing:

- PROMETHEUS_TO_JSON: {__name__=~"^node_cpu(?:_seconds_total)?$",cpu=~".+",mode="irq"}

- JAVASCRIPT:
  //calculates average, all cpu utilization
  var valueArr = JSON.parse(value);
  return valueArr.reduce(function(acc,obj){ return acc + parseFloat(obj['value']) },0)/valueArr.length;

- CHANGE_PER_SECOND

- MULTIPLIER: 100

CPUCPU guest time

Guest time (time spent running a virtual CPU for a guest operating system)

DEPENDENTsystem.cpu.guest[node_exporter]

Preprocessing:

- PROMETHEUS_TO_JSON: `{__name__=~"^node_cpu(?:_guest_seconds_total)?$",cpu=~".+",mode=~"^(?:user

CPUCPU guest nice time

Time spent running a niced guest (virtual CPU for guest operating systems under the control of the Linux kernel)

DEPENDENTsystem.cpu.guest_nice[node_exporter]

Preprocessing:

- PROMETHEUS_TO_JSON: `{__name__=~"^node_cpu(?:_guest_seconds_total)?$",cpu=~".+",mode=~"^(?:nice

CPUInterrupts per second

-

DEPENDENTsystem.cpu.intr[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: {__name__=~"node_intr"}

- CHANGE_PER_SECOND

CPUContext switches per second

-

DEPENDENTsystem.cpu.switches[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: {__name__=~"node_context_switches"}

- CHANGE_PER_SECOND

GeneralSystem boot time

-

DEPENDENTsystem.boottime[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: {__name__=~"^node_boot_time(?:_seconds)?$"}

GeneralSystem local time

-

DEPENDENTsystem.localtime[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: {__name__=~"^node_time(?:_seconds)?$"}

GeneralSystem name

System host name.

DEPENDENTsystem.name[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: node_uname_info nodename

- DISCARD_UNCHANGED_HEARTBEAT: 1d

GeneralSystem description

Labeled system information as provided by the uname system call.

DEPENDENTsystem.descr[node_exporter]

Preprocessing:

- PROMETHEUS_TO_JSON: node_uname_info

- JAVASCRIPT: var info = JSON.parse(value)[0]; return info.labels.sysname+' version: '+info.labels.release+' '+info.labels.version

- DISCARD_UNCHANGED_HEARTBEAT: 1d

GeneralMaximum number of open file descriptors

It can be increased by using the sysctl utility or by modifying the file /etc/sysctl.conf.

DEPENDENTkernel.maxfiles[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: node_filefd_maximum

- DISCARD_UNCHANGED_HEARTBEAT: 1d

GeneralNumber of open file descriptors

-

DEPENDENTfd.open[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: node_filefd_allocated

InventoryOperating system

-

DEPENDENTsystem.sw.os[node_exporter]

Preprocessing:

- DISCARD_UNCHANGED_HEARTBEAT: 1d

InventoryOperating system architecture

-

DEPENDENTsystem.sw.arch[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: node_uname_info machine

- DISCARD_UNCHANGED_HEARTBEAT: 1d

MemoryMemory utilization

Memory used percentage is calculated as (total-available)/total*100

CALCULATEDvm.memory.util[node_exporter]

Expression:

(last("vm.memory.total[node_exporter]")-last("vm.memory.available[node_exporter]"))/last("vm.memory.total[node_exporter]")*100
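
As a quick check of the expression above, the following sketch applies (total-available)/total*100 to invented values. The function name `memoryUtilization` is hypothetical and the byte counts are examples only.

```javascript
// Sketch only: the memory utilization formula with sample numbers.
function memoryUtilization(totalBytes, availableBytes) {
  return (totalBytes - availableBytes) / totalBytes * 100;
}

var GIB = 1024 * 1024 * 1024;
// Invented host: 8 GiB total, 2 GiB available.
var utilization = memoryUtilization(8 * GIB, 2 * GIB);
console.log(utilization);
```

With 2 GiB of 8 GiB available, utilization is 75%, which is below the default {$MEMORY.UTIL.MAX} of 90.
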
MemoryTotal memory

Total memory in Bytes

DEPENDENTvm.memory.total[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: {__name__=~"node_memory_MemTotal"}

MemoryAvailable memory

Available memory. On Linux, available = free + buffers + cache. On other platforms the calculation may vary. See also: https://www.zabbix.com/documentation/current/manual/appendix/items/vm.memory.size_params

DEPENDENTvm.memory.available[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: {__name__=~"node_memory_MemAvailable"}

MemoryTotal swap space

-

DEPENDENTsystem.swap.total[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: {__name__=~"node_memory_SwapTotal"}

MemoryFree swap space

-

DEPENDENTsystem.swap.free[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: {__name__=~"node_memory_SwapFree"}

MemoryFree swap space in %

-

CALCULATEDsystem.swap.pfree[node_exporter]

Expression:

last("system.swap.free[node_exporter]")/last("system.swap.total[node_exporter]")*100
Monitoring_agentVersion of node_exporter running

-

DEPENDENTagent.version[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: node_exporter_build_info version

- DISCARD_UNCHANGED_HEARTBEAT: 1d

Network_interfacesInterface { #IFNAME}({ #IFALIAS}): Bits receivedDEPENDENTnet.if.in[node_exporter,"{ #IFNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: node_network_receive_bytes_total{device="{ #IFNAME}"}

- CHANGE_PER_SECOND

- MULTIPLIER: 8

Network_interfacesInterface { #IFNAME}({ #IFALIAS}): Bits sentDEPENDENTnet.if.out[node_exporter,"{ #IFNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: node_network_transmit_bytes_total{device="{ #IFNAME}"}

- CHANGE_PER_SECOND

- MULTIPLIER: 8

Network_interfacesInterface { #IFNAME}({ #IFALIAS}): Outbound packets with errorsDEPENDENTnet.if.out.errors[node_exporter"{ #IFNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: node_network_transmit_errs_total{device="{ #IFNAME}"}

- CHANGE_PER_SECOND

Network_interfacesInterface { #IFNAME}({ #IFALIAS}): Inbound packets with errorsDEPENDENTnet.if.in.errors[node_exporter,"{ #IFNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: node_network_receive_errs_total{device="{ #IFNAME}"}

- CHANGE_PER_SECOND

Network_interfacesInterface { #IFNAME}({ #IFALIAS}): Inbound packets discardedDEPENDENTnet.if.in.discards[node_exporter,"{ #IFNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: node_network_receive_drop_total{device="{ #IFNAME}"}

- CHANGE_PER_SECOND

Network_interfacesInterface { #IFNAME}({ #IFALIAS}): Outbound packets discardedDEPENDENTnet.if.out.discards[node_exporter,"{ #IFNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: node_network_transmit_drop_total{device="{ #IFNAME}"}

- CHANGE_PER_SECOND

Network_interfacesInterface { #IFNAME}({ #IFALIAS}): Speed

Sets value to 0 if metric is missing in node_exporter output.

DEPENDENTnet.if.speed[node_exporter,"{ #IFNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: node_network_speed_bytes{device="{ #IFNAME}"}

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- MULTIPLIER: 8

Network_interfacesInterface { #IFNAME}({ #IFALIAS}): Interface type

node_network_protocol_type protocol_type value of /sys/class/net/.

DEPENDENTnet.if.type[node_exporter,"{ #IFNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: node_network_protocol_type{device="{ #IFNAME}"}

Network_interfacesInterface { #IFNAME}({ #IFALIAS}): Operational status

Indicates the interface RFC2863 operational state as a string.

Possible values are: "unknown", "notpresent", "down", "lowerlayerdown", "testing", "dormant", "up".

Reference: https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-class-net

DEPENDENTnet.if.status[node_exporter,"{ #IFNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: node_network_info{device="{ #IFNAME}"} operstate

- JAVASCRIPT: var newvalue; switch(value) { case "up": newvalue = 1; break; case "down": newvalue = 2; break; case "testing": newvalue = 4; break; case "unknown": newvalue = 5; break; case "dormant": newvalue = 6; break; case "notPresent": newvalue = 7; break; default: newvalue = "Problem parsing interface operstate in JS"; } return newvalue;
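
The mapping step above can be read as a lookup from the operstate string to a numeric code. The sketch below is a variant, not the template's code: it keeps the same codes but lowercases the input first, on the assumption that the exporter reports lowercase strings such as "notpresent". The names `OPERSTATE_CODES` and `operstateToCode` are hypothetical.

```javascript
// Sketch only: table-driven variant of the operstate mapping.
// Codes are copied from the documented switch statement.
var OPERSTATE_CODES = {
  up: 1,
  down: 2,
  testing: 4,
  unknown: 5,
  dormant: 6,
  notpresent: 7
};

function operstateToCode(operstate) {
  // Assumption: compare case-insensitively so "notPresent" and "notpresent" agree.
  var key = String(operstate).toLowerCase();
  if (Object.prototype.hasOwnProperty.call(OPERSTATE_CODES, key)) {
    return OPERSTATE_CODES[key];
  }
  return 'Problem parsing interface operstate in JS';
}

console.log(operstateToCode('up'));
console.log(operstateToCode('notpresent'));
console.log(operstateToCode('lowerlayerdown'));
```

States without a code in the table, such as "lowerlayerdown", fall through to the error string, as in the documented default branch.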

StatusSystem uptime

-

DEPENDENTsystem.uptime[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: {__name__=~"^node_boot_time(?:_seconds)?$"}

- JAVASCRIPT:
  //use boottime to calculate uptime
  return (Math.floor(Date.now()/1000)-Number(value));

Storage{ #FSNAME}: Free space

-

DEPENDENTvfs.fs.free[node_exporter,"{ #FSNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: {__name__=~"^node_filesystem_avail(?:_bytes)?$", mountpoint="{ #FSNAME}"}

Storage{ #FSNAME}: Total space

Total space in Bytes

DEPENDENTvfs.fs.total[node_exporter,"{ #FSNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: {__name__=~"^node_filesystem_size(?:_bytes)?$", mountpoint="{ #FSNAME}"}

Storage{ #FSNAME}: Used space

Used storage in Bytes

CALCULATEDvfs.fs.used[node_exporter,"{ #FSNAME}"]

Expression:

(last("vfs.fs.total[node_exporter,\"{ #FSNAME}\"]")-last("vfs.fs.free[node_exporter,\"{ #FSNAME}\"]"))
Storage{ #FSNAME}: Space utilization

Space utilization in % for { #FSNAME}

CALCULATEDvfs.fs.pused[node_exporter,"{ #FSNAME}"]

Expression:

(last("vfs.fs.used[node_exporter,\"{ #FSNAME}\"]")/last("vfs.fs.total[node_exporter,\"{ #FSNAME}\"]"))*100
Storage{ #FSNAME}: Free inodes in %

-

DEPENDENTvfs.fs.inode.pfree[node_exporter,"{ #FSNAME}"]

Preprocessing:

- PROMETHEUS_TO_JSON: {__name__=~"node_filesystem_files.*",mountpoint="{ #FSNAME}"}

- JAVASCRIPT:
  //count vfs.fs.inode.pfree
  var inode_free;
  var inode_total;
  JSON.parse(value).forEach(function(metric) {
    if (metric['name'] == 'node_filesystem_files'){
      inode_total = metric['value'];
    } else if (metric['name'] == 'node_filesystem_files_free'){
      inode_free = metric['value'];
    }
  });
  return (inode_free/inode_total)*100;

Storage{ #DEVNAME}: Disk read rate

r/s. The number (after merges) of read requests completed per second for the device.

DEPENDENTvfs.dev.read.rate[node_exporter,"{ #DEVNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: node_disk_reads_completed_total{device="{ #DEVNAME}"}

- CHANGE_PER_SECOND

Storage{ #DEVNAME}: Disk write rate

w/s. The number (after merges) of write requests completed per second for the device.

DEPENDENTvfs.dev.write.rate[node_exporter,"{ #DEVNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: node_disk_writes_completed_total{device="{ #DEVNAME}"}

- CHANGE_PER_SECOND

Storage{ #DEVNAME}: Disk read request avg waiting time (r_await)

This formula contains two boolean expressions that evaluate to 1 or 0. They set the calculated metric to zero when the read rate is zero and avoid a division-by-zero exception.

CALCULATEDvfs.dev.read.await[node_exporter,"{ #DEVNAME}"]

Expression:

(last("vfs.dev.read.time.rate[node_exporter,\"{ #DEVNAME}\"]")/(last("vfs.dev.read.rate[node_exporter,\"{ #DEVNAME}\"]")+(last("vfs.dev.read.rate[node_exporter,\"{ #DEVNAME}\"]")=0)))*1000*(last("vfs.dev.read.rate[node_exporter,\"{ #DEVNAME}\"]") > 0)
Storage{ #DEVNAME}: Disk write request avg waiting time (w_await)

This formula contains two boolean expressions that evaluate to 1 or 0. They set the calculated metric to zero when the write rate is zero and avoid a division-by-zero exception.

CALCULATEDvfs.dev.write.await[node_exporter,"{ #DEVNAME}"]

Expression:

(last("vfs.dev.write.time.rate[node_exporter,\"{ #DEVNAME}\"]")/(last("vfs.dev.write.rate[node_exporter,\"{ #DEVNAME}\"]")+(last("vfs.dev.write.rate[node_exporter,\"{ #DEVNAME}\"]")=0)))*1000*(last("vfs.dev.write.rate[node_exporter,\"{ #DEVNAME}\"]") > 0)
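
The two boolean guards in the await expressions can be illustrated as follows. This is a sketch of the arithmetic only, with a hypothetical function name `awaitMs` and invented sample rates. `timeRate` stands for the read or write time rate (seconds per second) and `requestRate` for completed requests per second.

```javascript
// Sketch only: r_await / w_await arithmetic with the two boolean guards.
function awaitMs(timeRate, requestRate) {
  var isZero = requestRate === 0 ? 1 : 0;    // (rate = 0) evaluates to 1 or 0
  var isPositive = requestRate > 0 ? 1 : 0;  // (rate > 0) evaluates to 1 or 0
  // Adding isZero keeps the divisor non-zero; multiplying by isPositive
  // forces the result to 0 when no requests completed.
  return (timeRate / (requestRate + isZero)) * 1000 * isPositive;
}

console.log(awaitMs(0.5, 100)); // busy device: 0.5 s of I/O time over 100 requests/s
console.log(awaitMs(0, 0));     // idle device
```

For the busy sample the result is about 5 ms per request. For the idle sample the divisor becomes 1 and the final factor is 0, so the result is 0 rather than a division error.
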
Storage{ #DEVNAME}: Disk average queue size (avgqu-sz)

-

DEPENDENTvfs.dev.queue_size[node_exporter,"{ #DEVNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: node_disk_io_time_weighted_seconds_total{device="{ #DEVNAME}"}

- CHANGE_PER_SECOND

Storage{ #DEVNAME}: Disk utilization

-

DEPENDENTvfs.dev.util[node_exporter,"{ #DEVNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: node_disk_io_time_seconds_total{device="{ #DEVNAME}"}

- CHANGE_PER_SECOND

- MULTIPLIER: 100

Zabbix_raw_itemsGet node_exporter metrics

-

HTTP_AGENTnode_exporter.get
Zabbix_raw_items{ #DEVNAME}: Disk read time (rate)

Rate of total read time counter. Used in r_await calculation

DEPENDENTvfs.dev.read.time.rate[node_exporter,"{ #DEVNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: node_disk_read_time_seconds_total{device="{ #DEVNAME}"}

- CHANGE_PER_SECOND

Zabbix_raw_items{ #DEVNAME}: Disk write time (rate)

Rate of total write time counter. Used in w_await calculation

DEPENDENTvfs.dev.write.time.rate[node_exporter,"{ #DEVNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: node_disk_write_time_seconds_total{device="{ #DEVNAME}"}

- CHANGE_PER_SECOND

Triggers

Name | Description | Expression | Severity | Dependencies and additional info
Load average is too high (per CPU load over {$LOAD_AVG_PER_CPU.MAX.WARN} for 5m)

Per CPU load average is too high. Your system may be slow to respond.

{TEMPLATE_NAME:system.cpu.load.avg1[node_exporter].min(5m)}/{Template OS Linux by Prom:system.cpu.num[node_exporter].last()}>{$LOAD_AVG_PER_CPU.MAX.WARN} and {Template OS Linux by Prom:system.cpu.load.avg5[node_exporter].last()}>0 and {Template OS Linux by Prom:system.cpu.load.avg15[node_exporter].last()}>0
AVERAGE
High CPU utilization (over {$CPU.UTIL.CRIT}% for 5m)

-

{TEMPLATE_NAME:system.cpu.util[node_exporter].min(5m)}>{$CPU.UTIL.CRIT}
WARNING

Depends on:

- Load average is too high (per CPU load over {$LOAD_AVG_PER_CPU.MAX.WARN} for 5m)

System time is out of sync (diff with Zabbix server > {$SYSTEM.FUZZYTIME.MAX}s)

-

{TEMPLATE_NAME:system.localtime[node_exporter].fuzzytime({$SYSTEM.FUZZYTIME.MAX})}=0
WARNING

Manual close: YES

System name has changed (new name: {ITEM.VALUE})

System name has changed. Ack to close.

{TEMPLATE_NAME:system.name[node_exporter].diff()}=1 and {TEMPLATE_NAME:system.name[node_exporter].strlen()}>0
INFO

Manual close: YES

Configured max number of open filedescriptors is too low (< {$KERNEL.MAXFILES.MIN})

-

{TEMPLATE_NAME:kernel.maxfiles[node_exporter].last()}<{$KERNEL.MAXFILES.MIN}
INFO

Depends on:

- Running out of file descriptors (less than < 20% free)

Running out of file descriptors (less than < 20% free)

-

{TEMPLATE_NAME:fd.open[node_exporter].last()}/{Template OS Linux by Prom:kernel.maxfiles[node_exporter].last()}*100>80
WARNING
Operating system description has changed

Operating system description has changed. Possible reasons: the system has been updated or replaced. Ack to close.

{TEMPLATE_NAME:system.sw.os[node_exporter].diff()}=1 and {TEMPLATE_NAME:system.sw.os[node_exporter].strlen()}>0
INFO

Manual close: YES

Depends on:

- System name has changed (new name: {ITEM.VALUE})

High memory utilization ( >{$MEMORY.UTIL.MAX}% for 5m)

-

{TEMPLATE_NAME:vm.memory.util[node_exporter].min(5m)}>{$MEMORY.UTIL.MAX}
AVERAGE

Depends on:

- Lack of available memory ( < {$MEMORY.AVAILABLE.MIN} of {ITEM.VALUE2})

Lack of available memory ( < {$MEMORY.AVAILABLE.MIN} of {ITEM.VALUE2})

-

{TEMPLATE_NAME:vm.memory.available[node_exporter].min(5m)}<{$MEMORY.AVAILABLE.MIN} and {Template OS Linux by Prom:vm.memory.total[node_exporter].last()}>0
AVERAGE
High swap space usage ( less than {$SWAP.PFREE.MIN.WARN}% free)

This trigger is ignored if no swap is configured.

{TEMPLATE_NAME:system.swap.pfree[node_exporter].min(5m)}<{$SWAP.PFREE.MIN.WARN} and {Template OS Linux by Prom:system.swap.total[node_exporter].last()}>0
WARNING

Depends on:

- High memory utilization ( >{$MEMORY.UTIL.MAX}% for 5m)

- Lack of available memory ( < {$MEMORY.AVAILABLE.MIN} of {ITEM.VALUE2})

Interface { #IFNAME}({ #IFALIAS}): High bandwidth usage ( > {$IF.UTIL.MAX:"{ #IFNAME}"}% )

-

({TEMPLATE_NAME:net.if.in[node_exporter,"{ #IFNAME}"].avg(15m)}>({$IF.UTIL.MAX:"{ #IFNAME}"}/100)*{Template OS Linux by Prom:net.if.speed[node_exporter,"{ #IFNAME}"].last()} or {Template OS Linux by Prom:net.if.out[node_exporter,"{ #IFNAME}"].avg(15m)}>({$IF.UTIL.MAX:"{ #IFNAME}"}/100)*{Template OS Linux by Prom:net.if.speed[node_exporter,"{ #IFNAME}"].last()}) and {Template OS Linux by Prom:net.if.speed[node_exporter,"{ #IFNAME}"].last()}>0

Recovery expression:

{TEMPLATE_NAME:net.if.in[node_exporter,"{ #IFNAME}"].avg(15m)}<(({$IF.UTIL.MAX:"{ #IFNAME}"}-3)/100)*{Template OS Linux by Prom:net.if.speed[node_exporter,"{ #IFNAME}"].last()} and {Template OS Linux by Prom:net.if.out[node_exporter,"{ #IFNAME}"].avg(15m)}<(({$IF.UTIL.MAX:"{ #IFNAME}"}-3)/100)*{Template OS Linux by Prom:net.if.speed[node_exporter,"{ #IFNAME}"].last()}
WARNING

Manual close: YES

Depends on:

- Interface { #IFNAME}({ #IFALIAS}): Link down
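
The bandwidth trigger above pairs a problem expression with a recovery expression that sits 3 percentage points lower, which gives it hysteresis. The sketch below walks through that behaviour under one stated assumption: a trigger already in the problem state returns to OK only when the problem expression is false and the recovery expression is true. The function names and sample traffic figures are hypothetical.

```javascript
// Sketch only: problem/recovery evaluation for the bandwidth trigger.
// inBps/outBps stand for the 15-minute averages, speedBps for net.if.speed.
function problemExpr(inBps, outBps, speedBps, maxPct) {
  var limit = (maxPct / 100) * speedBps;
  return (inBps > limit || outBps > limit) && speedBps > 0;
}

function recoveryExpr(inBps, outBps, speedBps, maxPct) {
  var limit = ((maxPct - 3) / 100) * speedBps;
  return inBps < limit && outBps < limit;
}

// Assumption: leaving the problem state needs problem=false AND recovery=true.
function nextState(inProblem, inBps, outBps, speedBps, maxPct) {
  var problem = problemExpr(inBps, outBps, speedBps, maxPct);
  if (!inProblem) {
    return problem;
  }
  if (!problem && recoveryExpr(inBps, outBps, speedBps, maxPct)) {
    return false;
  }
  return true;
}

var speed = 1e9;   // invented 1 Gbit/s link
var maxPct = 90;   // default {$IF.UTIL.MAX}

var fired = nextState(false, 9.5e8, 1e8, speed, maxPct);      // 95% inbound
var stillOpen = nextState(fired, 8.8e8, 1e8, speed, maxPct);  // 88%: inside the 87-90% band
var closed = nextState(stillOpen, 8.0e8, 1e8, speed, maxPct); // 80%: below the recovery limit

console.log(fired, stillOpen, closed);
```

The middle sample shows the point of the lower recovery threshold: traffic hovering just under 90% does not close and reopen the problem on every evaluation.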

Interface { #IFNAME}({ #IFALIAS}): High error rate ( > {$IF.ERRORS.WARN:"{ #IFNAME}"} for 5m)

Recovers when below 80% of {$IF.ERRORS.WARN:"{ #IFNAME}"} threshold

{TEMPLATE_NAME:net.if.in.errors[node_exporter,"{ #IFNAME}"].min(5m)}>{$IF.ERRORS.WARN:"{ #IFNAME}"} or {Template OS Linux by Prom:net.if.out.errors[node_exporter"{ #IFNAME}"].min(5m)}>{$IF.ERRORS.WARN:"{ #IFNAME}"}

Recovery expression:

{TEMPLATE_NAME:net.if.in.errors[node_exporter,"{ #IFNAME}"].max(5m)}<{$IF.ERRORS.WARN:"{ #IFNAME}"}*0.8 and {Template OS Linux by Prom:net.if.out.errors[node_exporter"{ #IFNAME}"].max(5m)}<{$IF.ERRORS.WARN:"{ #IFNAME}"}*0.8
WARNING

Manual close: YES

Depends on:

- Interface { #IFNAME}({ #IFALIAS}): Link down

Interface { #IFNAME}({ #IFALIAS}): Ethernet has changed to lower speed than it was before

This Ethernet connection has transitioned down from its known maximum speed. This might be a sign of autonegotiation issues. Ack to close.

{TEMPLATE_NAME:net.if.speed[node_exporter,"{ #IFNAME}"].change()}<0 and {TEMPLATE_NAME:net.if.speed[node_exporter,"{ #IFNAME}"].last()}>0 and ( {Template OS Linux by Prom:net.if.type[node_exporter,"{ #IFNAME}"].last()}=6 or {Template OS Linux by Prom:net.if.type[node_exporter,"{ #IFNAME}"].last()}=7 or {Template OS Linux by Prom:net.if.type[node_exporter,"{ #IFNAME}"].last()}=11 or {Template OS Linux by Prom:net.if.type[node_exporter,"{ #IFNAME}"].last()}=62 or {Template OS Linux by Prom:net.if.type[node_exporter,"{ #IFNAME}"].last()}=69 or {Template OS Linux by Prom:net.if.type[node_exporter,"{ #IFNAME}"].last()}=117 ) and ({Template OS Linux by Prom:net.if.status[node_exporter,"{ #IFNAME}"].last()}<>2)

Recovery expression:

({TEMPLATE_NAME:net.if.speed[node_exporter,"{ #IFNAME}"].change()}>0 and {TEMPLATE_NAME:net.if.speed[node_exporter,"{ #IFNAME}"].prev()}>0) or ({Template OS Linux by Prom:net.if.status[node_exporter,"{ #IFNAME}"].last()}=2)
INFO

Manual close: YES

Depends on:

- Interface { #IFNAME}({ #IFALIAS}): Link down

Interface { #IFNAME}({ #IFALIAS}): Ethernet has changed to lower speed than it was before

This Ethernet connection has transitioned down from its known maximum speed. This might be a sign of autonegotiation issues. Ack to close.

{TEMPLATE_NAME:net.if.type[node_exporter,"{ #IFNAME}"].change()}<0 and {TEMPLATE_NAME:net.if.type[node_exporter,"{ #IFNAME}"].last()}>0 and ({Template OS Linux by Prom:net.if.type[node_exporter,"{ #IFNAME}"].last()}=6 or {Template OS Linux by Prom:net.if.type[node_exporter,"{ #IFNAME}"].last()}=1) and ({Template OS Linux by Prom:net.if.status[node_exporter,"{ #IFNAME}"].last()}<>2)

Recovery expression:

({TEMPLATE_NAME:net.if.type[node_exporter,"{ #IFNAME}"].change()}>0 and {TEMPLATE_NAME:net.if.type[node_exporter,"{ #IFNAME}"].prev()}>0) or ({Template OS Linux by Prom:net.if.status[node_exporter,"{ #IFNAME}"].last()}=2)
INFO

Manual close: YES

Depends on:

- Interface { #IFNAME}({ #IFALIAS}): Link down

Interface { #IFNAME}({ #IFALIAS}): Link down

This trigger expression works as follows:

1. It can be triggered if the operational status is down.

2. {$IFCONTROL:"{#IFNAME}"}=1 - the user can redefine the context macro to 0, which marks this interface as not important. No new trigger will fire if this interface goes down.

3. {TEMPLATE_NAME:METRIC.diff()}=1 - the trigger fires only if the operational status was up(1) sometime before (so it does not fire for interfaces that are permanently off).

WARNING: if closed manually, the trigger won't fire again on the next poll because of .diff.

{$IFCONTROL:"{ #IFNAME}"}=1 and ({TEMPLATE_NAME:net.if.status[node_exporter,"{ #IFNAME}"].last()}=2 and {TEMPLATE_NAME:net.if.status[node_exporter,"{ #IFNAME}"].diff()}=1)

Recovery expression:

{TEMPLATE_NAME:net.if.status[node_exporter,"{ #IFNAME}"].last()}<>2
AVERAGE

Manual close: YES

{HOST.NAME} has been restarted (uptime < 10m)

The device uptime is less than 10 minutes

{TEMPLATE_NAME:system.uptime[node_exporter].last()}<10m
WARNING

Manual close: YES

{ #FSNAME}: Disk space is critically low (used > {$VFS.FS.PUSED.MAX.CRIT:"{ #FSNAME}"}%)

Two conditions should match: First, space utilization should be above {$VFS.FS.PUSED.MAX.CRIT:"{ #FSNAME}"}.

Second condition should be one of the following:

- The disk free space is less than 5G.

- The disk will be full in less than 24 hours.

{TEMPLATE_NAME:vfs.fs.pused[node_exporter,"{#FSNAME}"].last()}>{$VFS.FS.PUSED.MAX.CRIT:"{#FSNAME}"} and (({Template OS Linux by Prom:vfs.fs.total[node_exporter,"{#FSNAME}"].last()}-{Template OS Linux by Prom:vfs.fs.used[node_exporter,"{#FSNAME}"].last()})<5G or {TEMPLATE_NAME:vfs.fs.pused[node_exporter,"{#FSNAME}"].timeleft(1h,,100)}<1d)
AVERAGE

Manual close: YES

{ #FSNAME}: Disk space is low (used > {$VFS.FS.PUSED.MAX.WARN:"{ #FSNAME}"}%)

Two conditions should match: First, space utilization should be above {$VFS.FS.PUSED.MAX.WARN:"{#FSNAME}"}.

Second condition should be one of the following:

- The disk free space is less than 10G.

- The disk will be full in less than 24 hours.

{TEMPLATE_NAME:vfs.fs.pused[node_exporter,"{#FSNAME}"].last()}>{$VFS.FS.PUSED.MAX.WARN:"{#FSNAME}"} and (({Template OS Linux by Prom:vfs.fs.total[node_exporter,"{#FSNAME}"].last()}-{Template OS Linux by Prom:vfs.fs.used[node_exporter,"{#FSNAME}"].last()})<10G or {TEMPLATE_NAME:vfs.fs.pused[node_exporter,"{#FSNAME}"].timeleft(1h,,100)}<1d)
WARNING

Manual close: YES

Depends on:

- { #FSNAME}: Disk space is critically low (used > {$VFS.FS.PUSED.MAX.CRIT:"{ #FSNAME}"}%)
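
The two disk space triggers above share one shape: a utilization threshold that must be exceeded, combined with either low absolute free space or a short predicted time to full. The sketch below expresses that condition with hypothetical names. The predicted time to full is passed in as a plain number of seconds rather than computed, and the G suffix is taken to mean 1024^3 bytes.

```javascript
// Sketch only: the shared condition of the two disk space triggers.
var GIB = 1024 * 1024 * 1024;
var ONE_DAY_SECONDS = 86400;

function diskSpaceProblem(pused, totalBytes, usedBytes, timeLeftSeconds, maxPusedPct, minFreeBytes) {
  var lowFree = (totalBytes - usedBytes) < minFreeBytes;
  var fullSoon = timeLeftSeconds < ONE_DAY_SECONDS;
  return pused > maxPusedPct && (lowFree || fullSoon);
}

// Invented large volume: 85% used, 150 GiB still free, ten days until full.
var largeVolume = diskSpaceProblem(85, 1000 * GIB, 850 * GIB, 10 * ONE_DAY_SECONDS, 80, 10 * GIB);
// Invented small volume: 95% used, only 5 GiB free.
var smallVolume = diskSpaceProblem(95, 100 * GIB, 95 * GIB, 10 * ONE_DAY_SECONDS, 80, 10 * GIB);

console.log(largeVolume, smallVolume);
```

The large volume exceeds the 80% threshold but stays quiet because it has ample free space and no imminent fill, while the small volume raises the problem.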

{ #FSNAME}: Running out of free inodes (free < {$VFS.FS.INODE.PFREE.MIN.CRIT:"{ #FSNAME}"}%)

It may become impossible to write to disk if there are no index nodes left.

As symptoms, 'No space left on device' or 'Disk is full' errors may be seen even though free space is available.

{TEMPLATE_NAME:vfs.fs.inode.pfree[node_exporter,"{#FSNAME}"].min(5m)}<{$VFS.FS.INODE.PFREE.MIN.CRIT:"{#FSNAME}"}
AVERAGE
{ #FSNAME}: Running out of free inodes (free < {$VFS.FS.INODE.PFREE.MIN.WARN:"{ #FSNAME}"}%)

It may become impossible to write to disk if there are no index nodes left.

As symptoms, 'No space left on device' or 'Disk is full' errors may be seen even though free space is available.

{TEMPLATE_NAME:vfs.fs.inode.pfree[node_exporter,"{#FSNAME}"].min(5m)}<{$VFS.FS.INODE.PFREE.MIN.WARN:"{#FSNAME}"}
WARNING

Depends on:

- { #FSNAME}: Running out of free inodes (free < {$VFS.FS.INODE.PFREE.MIN.CRIT:"{ #FSNAME}"}%)

{ #DEVNAME}: Disk read/write request response are too high (read > {$VFS.DEV.READ.AWAIT.WARN:"{ #DEVNAME}"} ms for 15m or write > {$VFS.DEV.WRITE.AWAIT.WARN:"{ #DEVNAME}"} ms for 15m)

This trigger might indicate disk { #DEVNAME} saturation.

{TEMPLATE_NAME:vfs.dev.read.await[node_exporter,"{#DEVNAME}"].min(15m)} > {$VFS.DEV.READ.AWAIT.WARN:"{#DEVNAME}"} or {Template OS Linux by Prom:vfs.dev.write.await[node_exporter,"{#DEVNAME}"].min(15m)} > {$VFS.DEV.WRITE.AWAIT.WARN:"{#DEVNAME}"}
WARNING

Manual close: YES

node_exporter is not available (or no data for 30m)

Failed to fetch system metrics from node_exporter in time.

{TEMPLATE_NAME:node_exporter.get.nodata(30m)}=1
WARNING

Manual close: YES

Feedback

Please report any issues with the template at https://support.zabbix.com

You can also provide feedback, discuss the template or ask for help with it at ZABBIX forums.

Known Issues

  • Description: node_exporter v0.16.0 renamed many metrics. With node_exporter < 0.16, this template does not support the 'guest' and 'guest_nice' CPU utilization metrics or disk IO metrics. Other metrics are provided as 'best effort'.
    See https://github.com/prometheus/node_exporter/releases/tag/v0.16.0 for details.

    • Version: below 0.16.0
  • Description: metric node_network_info with label 'device' cannot be found, so network discovery is not possible.

    • Version: below 0.18
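
The 'best effort' support for older exporters relies on name patterns with an optional suffix, such as `^node_cpu(?:_seconds_total)?$`, which accept both the old and the renamed metric. The sketch below tests two patterns copied from the item definitions against sample metric names. JavaScript regular expressions are used here as a stand-in for the matching Zabbix performs, which is an assumption.

```javascript
// Sketch only: optional-suffix patterns that match metric names
// from before and after the v0.16.0 rename.
var cpuMetric = /^node_cpu(?:_seconds_total)?$/;
var fsSizeMetric = /^node_filesystem_size(?:_bytes)?$/;

var samples = [
  'node_cpu',                     // old-style name
  'node_cpu_seconds_total',       // renamed metric
  'node_cpu_guest_seconds_total', // a different metric, must not match cpuMetric
  'node_filesystem_size',
  'node_filesystem_size_bytes'
];

samples.forEach(function (name) {
  console.log(name, cpuMetric.test(name), fsSizeMetric.test(name));
});
```

The anchors `^` and `$` matter: without them the CPU pattern would also accept the guest metric, which the template handles with a separate item.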

References

https://github.com/prometheus/node_exporter
