Prometheus

An open-source monitoring system with a dimensional data model, flexible query language, efficient time series database and modern alerting approach.

Available solutions




Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/os/linux_prom


Template OS Linux by Prom

Overview

For Zabbix version: 4.4
This template collects Linux metrics from node_exporter 0.18 and above. Support for older node_exporter versions is provided as 'best effort'.

This template was tested on:

  • node_exporter, version 0.17.0
  • node_exporter, version 0.18.1

Setup

Please refer to the node_exporter docs. Use node_exporter v0.18.0 or above.

Zabbix configuration

No specific Zabbix configuration is required.

Macros used

Name Description Default
{$CPU.UTIL.CRIT}

-

90
{$IF.ERRORS.WARN}

-

2
{$IF.UTIL.MAX}

-

90
{$IFCONTROL}

-

1
{$KERNEL.MAXFILES.MIN}

-

256
{$LOAD_AVG_PER_CPU.MAX.WARN}

Load per CPU considered sustainable. Tune if needed.

1.5
{$MEMORY.AVAILABLE.MIN}

-

20M
{$MEMORY.UTIL.MAX}

-

90
{$NET.IF.IFALIAS.MATCHES}

-

^.*$
{$NET.IF.IFALIAS.NOT_MATCHES}

-

CHANGE_IF_NEEDED
{$NET.IF.IFNAME.MATCHES}

-

^.*$
{$NET.IF.IFNAME.NOT_MATCHES}

Filter out loopbacks, nulls, docker veth links and docker0 bridge by default

(^Software Loopback Interface|^NULL[0-9.]*$|^[Ll]o[0-9.]*$|^[Ss]ystem$|^Nu[0-9.]*$|^veth[0-9a-z]+$|docker[0-9]+|br-[a-z0-9]{12})
{$NET.IF.IFOPERSTATUS.MATCHES}

-

^.*$
{$NET.IF.IFOPERSTATUS.NOT_MATCHES}

Ignore notPresent(7)

^7$
{$NODE_EXPORTER_PORT}

TCP Port node_exporter is listening on.

9100
{$SWAP.PFREE.MIN.WARN}

-

50
{$SYSTEM.FUZZYTIME.MAX}

-

60
{$VFS.DEV.DEVNAME.MATCHES}

This macro is used in block devices discovery. Can be overridden on the host or linked template level

.+
{$VFS.DEV.DEVNAME.NOT_MATCHES}

This macro is used in block devices discovery. Can be overridden on the host or linked template level

^(loop[0-9]*|sd[a-z][0-9]+|nbd[0-9]+|sr[0-9]+|fd[0-9]+|dm-[0-9]+|ram[0-9]+|ploop[a-z0-9]+|md[0-9]*|hcp[0-9]*|zram[0-9]*)
{$VFS.DEV.READ.AWAIT.WARN}

Disk read average response time (in ms) before the trigger would fire

20
{$VFS.DEV.WRITE.AWAIT.WARN}

Disk write average response time (in ms) before the trigger would fire

20
{$VFS.FS.FSDEVICE.MATCHES}

This macro is used in filesystems discovery. Can be overridden on the host or linked template level

^.+$
{$VFS.FS.FSDEVICE.NOT_MATCHES}

This macro is used in filesystems discovery. Can be overridden on the host or linked template level

^\s$
{$VFS.FS.FSNAME.MATCHES}

This macro is used in filesystems discovery. Can be overridden on the host or linked template level

.+
{$VFS.FS.FSNAME.NOT_MATCHES}

This macro is used in filesystems discovery. Can be overridden on the host or linked template level

^(/dev|/sys|/run|/proc|.+/shm$)
{$VFS.FS.FSTYPE.MATCHES}

This macro is used in filesystems discovery. Can be overridden on the host or linked template level

^(btrfs|ext2|ext3|ext4|reiser|xfs|ffs|ufs|jfs|jfs2|vxfs|hfs|apfs|refs|ntfs|fat32|zfs)$
{$VFS.FS.FSTYPE.NOT_MATCHES}

This macro is used in filesystems discovery. Can be overridden on the host or linked template level

^\s$
{$VFS.FS.INODE.PFREE.MIN.CRIT}

-

10
{$VFS.FS.INODE.PFREE.MIN.WARN}

-

20
{$VFS.FS.PUSED.MAX.CRIT}

-

90
{$VFS.FS.PUSED.MAX.WARN}

-

80
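
Several triggers in this template reference these macros with a context, for example {$IF.UTIL.MAX:"{#IFNAME}"}. The sketch below illustrates only the fallback idea: a value defined for a specific context wins, otherwise the plain macro applies. The `macros` object, the `resolveMacro` helper, and the eth1 override are hypothetical and are not part of Zabbix or of this template. Real resolution also takes host, template, and global levels into account.

```javascript
// Hypothetical macro store: two defaults from the table above plus one
// made-up per-interface override for "eth1".
var macros = {
  '{$IF.UTIL.MAX}': '90',
  '{$IFCONTROL}': '1',
  '{$IF.UTIL.MAX:"eth1"}': '50'
};

// Look up NAME with a context first, then fall back to NAME without context.
function resolveMacro(name, context) {
  var withContext = name.replace(/\}$/, ':"' + context + '"}');
  if (Object.prototype.hasOwnProperty.call(macros, withContext)) {
    return macros[withContext];
  }
  return macros[name];
}

console.log(resolveMacro('{$IF.UTIL.MAX}', 'eth1')); // override applies
console.log(resolveMacro('{$IF.UTIL.MAX}', 'eth0')); // falls back to the default
```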

Template links

There are no template links in this template.

Discovery rules

Name Description Type Key and additional info
Network interface discovery

Discovery of network interfaces. Requires node_exporter v0.18 and up.

DEPENDENT net.if.discovery[node_exporter]

Preprocessing:

- PROMETHEUS_TO_JSON: {__name__=~"^node_network_info$"}

Filter:

AND

- A: {#IFNAME} MATCHES_REGEX {$NET.IF.IFNAME.MATCHES}

- B: {#IFNAME} NOT_MATCHES_REGEX {$NET.IF.IFNAME.NOT_MATCHES}

- C: {#IFALIAS} MATCHES_REGEX {$NET.IF.IFALIAS.MATCHES}

- D: {#IFALIAS} NOT_MATCHES_REGEX {$NET.IF.IFALIAS.NOT_MATCHES}

- E: {#IFOPERSTATUS} MATCHES_REGEX {$NET.IF.IFOPERSTATUS.MATCHES}

- F: {#IFOPERSTATUS} NOT_MATCHES_REGEX {$NET.IF.IFOPERSTATUS.NOT_MATCHES}

Mounted filesystem discovery

Discovery of file systems of different types.

DEPENDENT vfs.fs.discovery[node_exporter]

Preprocessing:

- PROMETHEUS_TO_JSON: {__name__=~"^node_filesystem_size(?:_bytes)?$", mountpoint=~".+"}

Filter:

AND

- A: {#FSTYPE} MATCHES_REGEX {$VFS.FS.FSTYPE.MATCHES}

- B: {#FSTYPE} NOT_MATCHES_REGEX {$VFS.FS.FSTYPE.NOT_MATCHES}

- C: {#FSNAME} MATCHES_REGEX {$VFS.FS.FSNAME.MATCHES}

- D: {#FSNAME} NOT_MATCHES_REGEX {$VFS.FS.FSNAME.NOT_MATCHES}

- E: {#FSDEVICE} MATCHES_REGEX {$VFS.FS.FSDEVICE.MATCHES}

- F: {#FSDEVICE} NOT_MATCHES_REGEX {$VFS.FS.FSDEVICE.NOT_MATCHES}

Block devices discovery

-

DEPENDENT vfs.dev.discovery[node_exporter]

Preprocessing:

- PROMETHEUS_TO_JSON: node_disk_io_now{device=~".+"}

Filter:

AND

- A: {#DEVNAME} MATCHES_REGEX {$VFS.DEV.DEVNAME.MATCHES}

- B: {#DEVNAME} NOT_MATCHES_REGEX {$VFS.DEV.DEVNAME.NOT_MATCHES}
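
As a rough illustration of how the AND filters in the discovery rules combine, the sketch below applies conditions A and B of the network interface rule to a few made-up interface names. The regular expressions are the default macro values listed earlier. JavaScript's RegExp stands in for Zabbix's own regex engine, and `keepInterface` and the sample names are hypothetical.

```javascript
// Default values of {$NET.IF.IFNAME.MATCHES} and {$NET.IF.IFNAME.NOT_MATCHES}.
var IFNAME_MATCHES = /^.*$/;
var IFNAME_NOT_MATCHES = /(^Software Loopback Interface|^NULL[0-9.]*$|^[Ll]o[0-9.]*$|^[Ss]ystem$|^Nu[0-9.]*$|^veth[0-9a-z]+$|docker[0-9]+|br-[a-z0-9]{12})/;

// Conditions A and B only: the name must match MATCHES and must not match NOT_MATCHES.
function keepInterface(ifname) {
  return IFNAME_MATCHES.test(ifname) && !IFNAME_NOT_MATCHES.test(ifname);
}

// Made-up interface names for illustration.
var names = ['lo', 'eth0', 'docker0', 'veth1a2b3c4', 'br-0123456789ab', 'enp3s0'];
var kept = names.filter(keepInterface);

console.log(kept); // eth0 and enp3s0 remain
```

The remaining conditions (C to F) work the same way on {#IFALIAS} and {#IFOPERSTATUS}.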

Items collected

Group Name Description Type Key and additional info
CPU Load average (1m avg)

-

DEPENDENT system.cpu.load.avg1[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: node_load1

CPU Load average (5m avg)

-

DEPENDENT system.cpu.load.avg5[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: node_load5

CPU Load average (15m avg)

-

DEPENDENT system.cpu.load.avg15[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: node_load15

CPU Number of CPUs

-

DEPENDENT system.cpu.num[node_exporter]

Preprocessing:

- PROMETHEUS_TO_JSON: {__name__=~"^node_cpu(?:_seconds_total)?$",cpu=~".+",mode="idle"}

- JAVASCRIPT:
  //count the number of cores
  return JSON.parse(value).length

CPU CPU utilization

CPU utilization in %

DEPENDENT system.cpu.util[node_exporter]

Preprocessing:

- JAVASCRIPT:
  //Calculate utilization
  return (100 - value)

CPU CPU idle time

The time the CPU has spent doing nothing.

DEPENDENT system.cpu.idle[node_exporter]

Preprocessing:

- PROMETHEUS_TO_JSON: {__name__=~"^node_cpu(?:_seconds_total)?$",cpu=~".+",mode="idle"}

- JAVASCRIPT:
  //calculates average, all cpu utilization
  var valueArr = JSON.parse(value);
  return valueArr.reduce(function(acc,obj){ return acc + parseFloat(obj['value']) },0)/valueArr.length;

- CHANGE_PER_SECOND

- MULTIPLIER: 100

CPU CPU system time

The time the CPU has spent running the kernel and its processes.

DEPENDENT system.cpu.system[node_exporter]

Preprocessing:

- PROMETHEUS_TO_JSON: {__name__=~"^node_cpu(?:_seconds_total)?$",cpu=~".+",mode="system"}

- JAVASCRIPT:
  //calculates average, all cpu utilization
  var valueArr = JSON.parse(value);
  return valueArr.reduce(function(acc,obj){ return acc + parseFloat(obj['value']) },0)/valueArr.length;

- CHANGE_PER_SECOND

- MULTIPLIER: 100

CPU CPU user time

The time the CPU has spent running users' processes that are not niced.

DEPENDENT system.cpu.user[node_exporter]

Preprocessing:

- PROMETHEUS_TO_JSON: {__name__=~"^node_cpu(?:_seconds_total)?$",cpu=~".+",mode="user"}

- JAVASCRIPT:
  //calculates average, all cpu utilization
  var valueArr = JSON.parse(value);
  return valueArr.reduce(function(acc,obj){ return acc + parseFloat(obj['value']) },0)/valueArr.length;

- CHANGE_PER_SECOND

- MULTIPLIER: 100

CPU CPU steal time

The amount of CPU 'stolen' from this virtual machine by the hypervisor for other tasks (such as running another virtual machine).

DEPENDENT system.cpu.steal[node_exporter]

Preprocessing:

- PROMETHEUS_TO_JSON: {__name__=~"^node_cpu(?:_seconds_total)?$",cpu=~".+",mode="steal"}

- JAVASCRIPT:
  //calculates average, all cpu utilization
  var valueArr = JSON.parse(value);
  return valueArr.reduce(function(acc,obj){ return acc + parseFloat(obj['value']) },0)/valueArr.length;

- CHANGE_PER_SECOND

- MULTIPLIER: 100

CPU CPU softirq time

The amount of time the CPU has been servicing software interrupts.

DEPENDENT system.cpu.softirq[node_exporter]

Preprocessing:

- PROMETHEUS_TO_JSON: {__name__=~"^node_cpu(?:_seconds_total)?$",cpu=~".+",mode="softirq"}

- JAVASCRIPT:
  //calculates average, all cpu utilization
  var valueArr = JSON.parse(value);
  return valueArr.reduce(function(acc,obj){ return acc + parseFloat(obj['value']) },0)/valueArr.length;

- CHANGE_PER_SECOND

- MULTIPLIER: 100

CPU CPU nice time

The time the CPU has spent running users' processes that have been niced.

DEPENDENT system.cpu.nice[node_exporter]

Preprocessing:

- PROMETHEUS_TO_JSON: {__name__=~"^node_cpu(?:_seconds_total)?$",cpu=~".+",mode="nice"}

- JAVASCRIPT:
  //calculates average, all cpu utilization
  var valueArr = JSON.parse(value);
  return valueArr.reduce(function(acc,obj){ return acc + parseFloat(obj['value']) },0)/valueArr.length;

- CHANGE_PER_SECOND

- MULTIPLIER: 100

CPU CPU iowait time

Amount of time the CPU has been waiting for I/O to complete.

DEPENDENT system.cpu.iowait[node_exporter]

Preprocessing:

- PROMETHEUS_TO_JSON: {__name__=~"^node_cpu(?:_seconds_total)?$",cpu=~".+",mode="iowait"}

- JAVASCRIPT:
  //calculates average, all cpu utilization
  var valueArr = JSON.parse(value);
  return valueArr.reduce(function(acc,obj){ return acc + parseFloat(obj['value']) },0)/valueArr.length;

- CHANGE_PER_SECOND

- MULTIPLIER: 100

CPU CPU interrupt time

The amount of time the CPU has been servicing hardware interrupts.

DEPENDENT system.cpu.interrupt[node_exporter]

Preprocessing:

- PROMETHEUS_TO_JSON: {__name__=~"^node_cpu(?:_seconds_total)?$",cpu=~".+",mode="irq"}

- JAVASCRIPT:
  //calculates average, all cpu utilization
  var valueArr = JSON.parse(value);
  return valueArr.reduce(function(acc,obj){ return acc + parseFloat(obj['value']) },0)/valueArr.length;

- CHANGE_PER_SECOND

- MULTIPLIER: 100

CPU CPU guest time

Guest time (time spent running a virtual CPU for a guest operating system)

DEPENDENT system.cpu.guest[node_exporter]

Preprocessing:

- PROMETHEUS_TO_JSON: {__name__=~"^node_cpu(?:_guest_seconds_total)?$",cpu=~".+",mode=~"^(?:user|guest)$"}

- JAVASCRIPT:
  //calculates average, all cpu utilization
  var valueArr = JSON.parse(value);
  return valueArr.reduce(function(acc,obj){ return acc + parseFloat(obj['value']) },0)/valueArr.length;

- CHANGE_PER_SECOND

- MULTIPLIER: 100

CPU CPU guest nice time

Time spent running a niced guest (virtual CPU for guest operating systems under the control of the Linux kernel)

DEPENDENT system.cpu.guest_nice[node_exporter]

Preprocessing:

- PROMETHEUS_TO_JSON: {__name__=~"^node_cpu(?:_guest_seconds_total)?$",cpu=~".+",mode=~"^(?:nice|guest_nice)$"}

- JAVASCRIPT:
  //calculates average, all cpu utilization
  var valueArr = JSON.parse(value);
  return valueArr.reduce(function(acc,obj){ return acc + parseFloat(obj['value']) },0)/valueArr.length;

- CHANGE_PER_SECOND

- MULTIPLIER: 100

CPU Interrupts per second

-

DEPENDENT system.cpu.intr[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: {__name__=~"node_intr"}

- CHANGE_PER_SECOND

CPU Context switches per second

-

DEPENDENT system.cpu.switches[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: {__name__=~"node_context_switches"}

- CHANGE_PER_SECOND

General System boot time

-

DEPENDENT system.boottime[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: {__name__=~"^node_boot_time(?:_seconds)?$"}

General System local time

System local time of the host.

DEPENDENT system.localtime[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: {__name__=~"^node_time(?:_seconds)?$"}

General System name

System host name.

DEPENDENT system.name[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: node_uname_info nodename

- DISCARD_UNCHANGED_HEARTBEAT: 1d

General System description

Labeled system information as provided by the uname system call.

DEPENDENT system.descr[node_exporter]

Preprocessing:

- PROMETHEUS_TO_JSON: node_uname_info

- JAVASCRIPT: var info = JSON.parse(value)[0]; return info.labels.sysname+' version: '+info.labels.release+' '+info.labels.version

- DISCARD_UNCHANGED_HEARTBEAT: 1d

General Maximum number of open file descriptors

It can be increased by using the sysctl utility or by modifying the /etc/sysctl.conf file.

DEPENDENT kernel.maxfiles[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: node_filefd_maximum

- DISCARD_UNCHANGED_HEARTBEAT: 1d

General Number of open file descriptors

-

DEPENDENT fd.open[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: node_filefd_allocated

Inventory Operating system

-

DEPENDENT system.sw.os[node_exporter]

Preprocessing:

- DISCARD_UNCHANGED_HEARTBEAT: 1d

Inventory Operating system architecture

Operating system architecture of the host.

DEPENDENT system.sw.arch[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: node_uname_info machine

- DISCARD_UNCHANGED_HEARTBEAT: 1d

Memory Memory utilization

Memory used percentage is calculated as (total-available)/total*100

CALCULATED vm.memory.util[node_exporter]

Expression:

(last("vm.memory.total[node_exporter]")-last("vm.memory.available[node_exporter]"))/last("vm.memory.total[node_exporter]")*100
Memory Total memory

Total memory in Bytes

DEPENDENT vm.memory.total[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: {__name__=~"node_memory_MemTotal"}

Memory Available memory

Available memory. On Linux, available = free + buffers + cache. On other platforms the calculation may vary. See also: https://www.zabbix.com/documentation/current/manual/appendix/items/vm.memory.size_params

DEPENDENT vm.memory.available[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: {__name__=~"node_memory_MemAvailable"}

Memory Total swap space

The total space of swap volume/file in bytes.

DEPENDENT system.swap.total[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: {__name__=~"node_memory_SwapTotal"}

Memory Free swap space

The free space of swap volume/file in bytes.

DEPENDENT system.swap.free[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: {__name__=~"node_memory_SwapFree"}

Memory Free swap space in %

The free space of swap volume/file in percent.

CALCULATED system.swap.pfree[node_exporter]

Expression:

last("system.swap.free[node_exporter]")/last("system.swap.total[node_exporter]")*100
Monitoring_agent Version of node_exporter running

-

DEPENDENT agent.version[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: node_exporter_build_info version

- DISCARD_UNCHANGED_HEARTBEAT: 1d

Network_interfaces Interface {#IFNAME}({#IFALIAS}): Bits received DEPENDENT net.if.in[node_exporter,"{#IFNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: node_network_receive_bytes_total{device="{#IFNAME}"}

- CHANGE_PER_SECOND

- MULTIPLIER: 8

Network_interfaces Interface {#IFNAME}({#IFALIAS}): Bits sent DEPENDENT net.if.out[node_exporter,"{#IFNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: node_network_transmit_bytes_total{device="{#IFNAME}"}

- CHANGE_PER_SECOND

- MULTIPLIER: 8

Network_interfaces Interface {#IFNAME}({#IFALIAS}): Outbound packets with errors DEPENDENT net.if.out.errors[node_exporter"{#IFNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: node_network_transmit_errs_total{device="{#IFNAME}"}

- CHANGE_PER_SECOND

Network_interfaces Interface {#IFNAME}({#IFALIAS}): Inbound packets with errors DEPENDENT net.if.in.errors[node_exporter,"{#IFNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: node_network_receive_errs_total{device="{#IFNAME}"}

- CHANGE_PER_SECOND

Network_interfaces Interface {#IFNAME}({#IFALIAS}): Inbound packets discarded DEPENDENT net.if.in.discards[node_exporter,"{#IFNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: node_network_receive_drop_total{device="{#IFNAME}"}

- CHANGE_PER_SECOND

Network_interfaces Interface {#IFNAME}({#IFALIAS}): Outbound packets discarded DEPENDENT net.if.out.discards[node_exporter,"{#IFNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: node_network_transmit_drop_total{device="{#IFNAME}"}

- CHANGE_PER_SECOND

Network_interfaces Interface {#IFNAME}({#IFALIAS}): Speed

Sets value to 0 if metric is missing in node_exporter output.

DEPENDENT net.if.speed[node_exporter,"{#IFNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: node_network_speed_bytes{device="{#IFNAME}"}

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- MULTIPLIER: 8

Network_interfaces Interface {#IFNAME}({#IFALIAS}): Interface type

node_network_protocol_type protocol_type value of /sys/class/net/.

DEPENDENT net.if.type[node_exporter,"{#IFNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: node_network_protocol_type{device="{#IFNAME}"}

Network_interfaces Interface {#IFNAME}({#IFALIAS}): Operational status

Indicates the interface RFC2863 operational state as a string.

Possible values are: "unknown", "notpresent", "down", "lowerlayerdown", "testing", "dormant", "up".

Reference: https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-class-net

DEPENDENT net.if.status[node_exporter,"{#IFNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: node_network_info{device="{#IFNAME}"} operstate

- JAVASCRIPT: var newvalue; switch(value) { case "up": newvalue = 1; break; case "down": newvalue = 2; break; case "testing": newvalue = 4; break; case "unknown": newvalue = 5; break; case "dormant": newvalue = 6; break; case "notPresent": newvalue = 7; break; default: newvalue = "Problem parsing interface operstate in JS"; } return newvalue;

Status System uptime

System uptime in 'N days, hh:mm:ss' format.

DEPENDENT system.uptime[node_exporter]

Preprocessing:

- PROMETHEUS_PATTERN: {__name__=~"^node_boot_time(?:_seconds)?$"}

- JAVASCRIPT:
  //use boottime to calculate uptime
  return (Math.floor(Date.now()/1000)-Number(value));

Storage {#FSNAME}: Free space

-

DEPENDENT vfs.fs.free[node_exporter,"{#FSNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: {__name__=~"^node_filesystem_avail(?:_bytes)?$", mountpoint="{#FSNAME}"}

Storage {#FSNAME}: Total space

Total space in Bytes

DEPENDENT vfs.fs.total[node_exporter,"{#FSNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: {__name__=~"^node_filesystem_size(?:_bytes)?$", mountpoint="{#FSNAME}"}

Storage {#FSNAME}: Used space

Used storage in Bytes

CALCULATED vfs.fs.used[node_exporter,"{#FSNAME}"]

Expression:

(last("vfs.fs.total[node_exporter,\"{#FSNAME}\"]")-last("vfs.fs.free[node_exporter,\"{#FSNAME}\"]"))
Storage {#FSNAME}: Space utilization

Space utilization in % for {#FSNAME}

CALCULATED vfs.fs.pused[node_exporter,"{#FSNAME}"]

Expression:

(last("vfs.fs.used[node_exporter,\"{#FSNAME}\"]")/last("vfs.fs.total[node_exporter,\"{#FSNAME}\"]"))*100
Storage {#FSNAME}: Free inodes in %

-

DEPENDENT vfs.fs.inode.pfree[node_exporter,"{#FSNAME}"]

Preprocessing:

- PROMETHEUS_TO_JSON: {__name__=~"node_filesystem_files.*",mountpoint="{#FSNAME}"}

- JAVASCRIPT:
  //count vfs.fs.inode.pfree
  var inode_free;
  var inode_total;
  JSON.parse(value).forEach(function(metric) {
    if (metric['name'] == 'node_filesystem_files'){
      inode_total = metric['value'];
    } else if (metric['name'] == 'node_filesystem_files_free'){
      inode_free = metric['value'];
    }
  });
  return (inode_free/inode_total)*100;

Storage {#DEVNAME}: Disk read rate

r/s. The number (after merges) of read requests completed per second for the device.

DEPENDENT vfs.dev.read.rate[node_exporter,"{#DEVNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: node_disk_reads_completed_total{device="{#DEVNAME}"}

- CHANGE_PER_SECOND

Storage {#DEVNAME}: Disk write rate

w/s. The number (after merges) of write requests completed per second for the device.

DEPENDENT vfs.dev.write.rate[node_exporter,"{#DEVNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: node_disk_writes_completed_total{device="{#DEVNAME}"}

- CHANGE_PER_SECOND

Storage {#DEVNAME}: Disk read request avg waiting time (r_await)

This formula contains two boolean expressions that evaluate to 1 or 0 in order to set the calculated metric to zero and to avoid a division-by-zero exception.

CALCULATED vfs.dev.read.await[node_exporter,"{#DEVNAME}"]

Expression:

(last("vfs.dev.read.time.rate[node_exporter,\"{#DEVNAME}\"]")/(last("vfs.dev.read.rate[node_exporter,\"{#DEVNAME}\"]")+(last("vfs.dev.read.rate[node_exporter,\"{#DEVNAME}\"]")=0)))*1000*(last("vfs.dev.read.rate[node_exporter,\"{#DEVNAME}\"]") > 0)
Storage {#DEVNAME}: Disk write request avg waiting time (w_await)

This formula contains two boolean expressions that evaluate to 1 or 0 in order to set the calculated metric to zero and to avoid a division-by-zero exception.

CALCULATED vfs.dev.write.await[node_exporter,"{#DEVNAME}"]

Expression:

(last("vfs.dev.write.time.rate[node_exporter,\"{#DEVNAME}\"]")/(last("vfs.dev.write.rate[node_exporter,\"{#DEVNAME}\"]")+(last("vfs.dev.write.rate[node_exporter,\"{#DEVNAME}\"]")=0)))*1000*(last("vfs.dev.write.rate[node_exporter,\"{#DEVNAME}\"]") > 0)
Storage {#DEVNAME}: Disk average queue size (avgqu-sz)

Current average disk queue, the number of requests outstanding on the disk at the time the performance data is collected.

DEPENDENT vfs.dev.queue_size[node_exporter,"{#DEVNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: node_disk_io_time_weighted_seconds_total{device="{#DEVNAME}"}

- CHANGE_PER_SECOND

Storage {#DEVNAME}: Disk utilization

This item is the percentage of elapsed time that the selected disk drive was busy servicing read or write requests.

DEPENDENT vfs.dev.util[node_exporter,"{#DEVNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: node_disk_io_time_seconds_total{device="{#DEVNAME}"}

- CHANGE_PER_SECOND

- MULTIPLIER: 100

Zabbix_raw_items Get node_exporter metrics

-

HTTP_AGENT node_exporter.get
Zabbix_raw_items {#DEVNAME}: Disk read time (rate)

Rate of total read time counter. Used in r_await calculation

DEPENDENT vfs.dev.read.time.rate[node_exporter,"{#DEVNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: node_disk_read_time_seconds_total{device="{#DEVNAME}"}

- CHANGE_PER_SECOND

Zabbix_raw_items {#DEVNAME}: Disk write time (rate)

Rate of total write time counter. Used in w_await calculation

DEPENDENT vfs.dev.write.time.rate[node_exporter,"{#DEVNAME}"]

Preprocessing:

- PROMETHEUS_PATTERN: node_disk_write_time_seconds_total{device="{#DEVNAME}"}

- CHANGE_PER_SECOND
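
Two of the calculations in the items above can be traced by hand. The sketch below first mimics the CPU time chain (average the per-core counters, take the change per second, multiply by 100) and then the r_await/w_await expression with its two boolean guards. It is an illustration under assumptions, not the template's implementation: the sample counter values, the 60-second interval, and the helper names `averageCounter`, `cpuPercent`, and `awaitMs` are made up, and the utilization step assumes that system.cpu.util is fed the idle percentage.

```javascript
// Same averaging logic as the template's JAVASCRIPT step for CPU time items.
function averageCounter(valueArr) {
  return valueArr.reduce(function (acc, obj) {
    return acc + parseFloat(obj['value']);
  }, 0) / valueArr.length;
}

// Stand-in for CHANGE_PER_SECOND followed by MULTIPLIER: 100.
function cpuPercent(prevAvg, currAvg, elapsedSeconds) {
  return ((currAvg - prevAvg) / elapsedSeconds) * 100;
}

// Hypothetical idle counters (in seconds) for two cores, sampled 60 s apart.
var firstSample = [{ value: '1000' }, { value: '2000' }];
var secondSample = [{ value: '1054' }, { value: '2048' }];

var idlePercent = cpuPercent(averageCounter(firstSample), averageCounter(secondSample), 60);
var utilPercent = 100 - idlePercent; // assumption: utilization = 100 - idle

// Stand-in for: time_rate / (rate + (rate = 0)) * 1000 * (rate > 0)
function awaitMs(timeRate, opRate) {
  var isZero = opRate === 0 ? 1 : 0;    // keeps the divisor non-zero
  var isPositive = opRate > 0 ? 1 : 0;  // forces the result to 0 when idle
  return (timeRate / (opRate + isZero)) * 1000 * isPositive;
}

console.log(idlePercent, utilPercent);
console.log(awaitMs(0.5, 100)); // busy disk: seconds per request converted to ms
console.log(awaitMs(0, 0));     // idle disk: guarded, no division by zero
```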

Triggers

Name Description Expression Severity Dependencies and additional info
Load average is too high (per CPU load over {$LOAD_AVG_PER_CPU.MAX.WARN} for 5m)

Per CPU load average is too high. Your system may be slow to respond.

{TEMPLATE_NAME:system.cpu.load.avg1[node_exporter].min(5m)}/{Template OS Linux by Prom:system.cpu.num[node_exporter].last()}>{$LOAD_AVG_PER_CPU.MAX.WARN} and {Template OS Linux by Prom:system.cpu.load.avg5[node_exporter].last()}>0 and {Template OS Linux by Prom:system.cpu.load.avg15[node_exporter].last()}>0 AVERAGE
High CPU utilization (over {$CPU.UTIL.CRIT}% for 5m)

CPU utilization is too high. The system might be slow to respond.

{TEMPLATE_NAME:system.cpu.util[node_exporter].min(5m)}>{$CPU.UTIL.CRIT} WARNING

Depends on:

- Load average is too high (per CPU load over {$LOAD_AVG_PER_CPU.MAX.WARN} for 5m)

System time is out of sync (diff with Zabbix server > {$SYSTEM.FUZZYTIME.MAX}s)

The host system time is different from the Zabbix server time.

{TEMPLATE_NAME:system.localtime[node_exporter].fuzzytime({$SYSTEM.FUZZYTIME.MAX})}=0 WARNING

Manual close: YES

System name has changed (new name: {ITEM.VALUE})

System name has changed. Ack to close.

{TEMPLATE_NAME:system.name[node_exporter].diff()}=1 and {TEMPLATE_NAME:system.name[node_exporter].strlen()}>0 INFO

Manual close: YES

Configured max number of open filedescriptors is too low (< {$KERNEL.MAXFILES.MIN})

-

{TEMPLATE_NAME:kernel.maxfiles[node_exporter].last()}<{$KERNEL.MAXFILES.MIN} INFO

Depends on:

- Running out of file descriptors (less than < 20% free)

Running out of file descriptors (less than < 20% free)

-

{TEMPLATE_NAME:fd.open[node_exporter].last()}/{Template OS Linux by Prom:kernel.maxfiles[node_exporter].last()}*100>80 WARNING
Operating system description has changed

Operating system description has changed. Possible reasons: the system has been updated or replaced. Ack to close.

{TEMPLATE_NAME:system.sw.os[node_exporter].diff()}=1 and {TEMPLATE_NAME:system.sw.os[node_exporter].strlen()}>0 INFO

Manual close: YES

Depends on:

- System name has changed (new name: {ITEM.VALUE})

High memory utilization ( >{$MEMORY.UTIL.MAX}% for 5m)

The system is running out of free memory.

{TEMPLATE_NAME:vm.memory.util[node_exporter].min(5m)}>{$MEMORY.UTIL.MAX} AVERAGE

Depends on:

- Lack of available memory ( < {$MEMORY.AVAILABLE.MIN} of {ITEM.VALUE2})

Lack of available memory ( < {$MEMORY.AVAILABLE.MIN} of {ITEM.VALUE2})

-

{TEMPLATE_NAME:vm.memory.available[node_exporter].min(5m)}<{$MEMORY.AVAILABLE.MIN} and {Template OS Linux by Prom:vm.memory.total[node_exporter].last()}>0 AVERAGE
High swap space usage ( less than {$SWAP.PFREE.MIN.WARN}% free)

This trigger is ignored if there is no swap configured.

{TEMPLATE_NAME:system.swap.pfree[node_exporter].min(5m)}<{$SWAP.PFREE.MIN.WARN} and {Template OS Linux by Prom:system.swap.total[node_exporter].last()}>0 WARNING

Depends on:

- High memory utilization ( >{$MEMORY.UTIL.MAX}% for 5m)

- Lack of available memory ( < {$MEMORY.AVAILABLE.MIN} of {ITEM.VALUE2})

Interface {#IFNAME}({#IFALIAS}): High bandwidth usage ( > {$IF.UTIL.MAX:"{#IFNAME}"}% )

The network interface utilization is close to its estimated maximum bandwidth.

({TEMPLATE_NAME:net.if.in[node_exporter,"{#IFNAME}"].avg(15m)}>({$IF.UTIL.MAX:"{#IFNAME}"}/100)*{Template OS Linux by Prom:net.if.speed[node_exporter,"{#IFNAME}"].last()} or {Template OS Linux by Prom:net.if.out[node_exporter,"{#IFNAME}"].avg(15m)}>({$IF.UTIL.MAX:"{#IFNAME}"}/100)*{Template OS Linux by Prom:net.if.speed[node_exporter,"{#IFNAME}"].last()}) and {Template OS Linux by Prom:net.if.speed[node_exporter,"{#IFNAME}"].last()}>0

Recovery expression:

{TEMPLATE_NAME:net.if.in[node_exporter,"{#IFNAME}"].avg(15m)}<(({$IF.UTIL.MAX:"{#IFNAME}"}-3)/100)*{Template OS Linux by Prom:net.if.speed[node_exporter,"{#IFNAME}"].last()} and {Template OS Linux by Prom:net.if.out[node_exporter,"{#IFNAME}"].avg(15m)}<(({$IF.UTIL.MAX:"{#IFNAME}"}-3)/100)*{Template OS Linux by Prom:net.if.speed[node_exporter,"{#IFNAME}"].last()}
WARNING

Manual close: YES

Depends on:

- Interface {#IFNAME}({#IFALIAS}): Link down

Interface {#IFNAME}({#IFALIAS}): High error rate ( > {$IF.ERRORS.WARN:"{#IFNAME}"} for 5m)

Recovers when below 80% of {$IF.ERRORS.WARN:"{#IFNAME}"} threshold

{TEMPLATE_NAME:net.if.in.errors[node_exporter,"{#IFNAME}"].min(5m)}>{$IF.ERRORS.WARN:"{#IFNAME}"} or {Template OS Linux by Prom:net.if.out.errors[node_exporter"{#IFNAME}"].min(5m)}>{$IF.ERRORS.WARN:"{#IFNAME}"}

Recovery expression:

{TEMPLATE_NAME:net.if.in.errors[node_exporter,"{#IFNAME}"].max(5m)}<{$IF.ERRORS.WARN:"{#IFNAME}"}*0.8 and {Template OS Linux by Prom:net.if.out.errors[node_exporter"{#IFNAME}"].max(5m)}<{$IF.ERRORS.WARN:"{#IFNAME}"}*0.8
WARNING

Manual close: YES

Depends on:

- Interface {#IFNAME}({#IFALIAS}): Link down

Interface {#IFNAME}({#IFALIAS}): Ethernet has changed to lower speed than it was before

This Ethernet connection has transitioned down from its known maximum speed. This might be a sign of autonegotiation issues. Ack to close.

{TEMPLATE_NAME:net.if.speed[node_exporter,"{#IFNAME}"].change()}<0 and {TEMPLATE_NAME:net.if.speed[node_exporter,"{#IFNAME}"].last()}>0 and ( {Template OS Linux by Prom:net.if.type[node_exporter,"{#IFNAME}"].last()}=6 or {Template OS Linux by Prom:net.if.type[node_exporter,"{#IFNAME}"].last()}=7 or {Template OS Linux by Prom:net.if.type[node_exporter,"{#IFNAME}"].last()}=11 or {Template OS Linux by Prom:net.if.type[node_exporter,"{#IFNAME}"].last()}=62 or {Template OS Linux by Prom:net.if.type[node_exporter,"{#IFNAME}"].last()}=69 or {Template OS Linux by Prom:net.if.type[node_exporter,"{#IFNAME}"].last()}=117 ) and ({Template OS Linux by Prom:net.if.status[node_exporter,"{#IFNAME}"].last()}<>2)

Recovery expression:

({TEMPLATE_NAME:net.if.speed[node_exporter,"{#IFNAME}"].change()}>0 and {TEMPLATE_NAME:net.if.speed[node_exporter,"{#IFNAME}"].prev()}>0) or ({Template OS Linux by Prom:net.if.status[node_exporter,"{#IFNAME}"].last()}=2)
INFO

Manual close: YES

Depends on:

- Interface {#IFNAME}({#IFALIAS}): Link down

Interface {#IFNAME}({#IFALIAS}): Ethernet has changed to lower speed than it was before

This Ethernet connection has transitioned down from its known maximum speed. This might be a sign of autonegotiation issues. Ack to close.

{TEMPLATE_NAME:net.if.type[node_exporter,"{#IFNAME}"].change()}<0 and {TEMPLATE_NAME:net.if.type[node_exporter,"{#IFNAME}"].last()}>0 and ({Template OS Linux by Prom:net.if.type[node_exporter,"{#IFNAME}"].last()}=6 or {Template OS Linux by Prom:net.if.type[node_exporter,"{#IFNAME}"].last()}=1) and ({Template OS Linux by Prom:net.if.status[node_exporter,"{#IFNAME}"].last()}<>2)

Recovery expression:

({TEMPLATE_NAME:net.if.type[node_exporter,"{#IFNAME}"].change()}>0 and {TEMPLATE_NAME:net.if.type[node_exporter,"{#IFNAME}"].prev()}>0) or ({Template OS Linux by Prom:net.if.status[node_exporter,"{#IFNAME}"].last()}=2)
INFO

Manual close: YES

Depends on:

- Interface {#IFNAME}({#IFALIAS}): Link down

Interface {#IFNAME}({#IFALIAS}): Link down

This trigger expression works as follows:

1. It can fire if the operational status is down.

2. {$IFCONTROL:"{#IFNAME}"}=1 - the user can redefine the context macro to 0 to mark the interface as not important. No new trigger will fire if such an interface goes down.

3. {TEMPLATE_NAME:METRIC.diff()}=1 - the trigger fires only if the operational status was up(1) at some point before. (So it does not fire for 'eternal off' interfaces.)

WARNING: if closed manually, the trigger won't fire again on the next poll, because of .diff.

{$IFCONTROL:"{#IFNAME}"}=1 and ({TEMPLATE_NAME:net.if.status[node_exporter,"{#IFNAME}"].last()}=2 and {TEMPLATE_NAME:net.if.status[node_exporter,"{#IFNAME}"].diff()}=1)

Recovery expression:

{TEMPLATE_NAME:net.if.status[node_exporter,"{#IFNAME}"].last()}<>2
AVERAGE

Manual close: YES

{HOST.NAME} has been restarted (uptime < 10m)

The device uptime is less than 10 minutes

{TEMPLATE_NAME:system.uptime[node_exporter].last()}<10m WARNING

Manual close: YES

{#FSNAME}: Disk space is critically low (used > {$VFS.FS.PUSED.MAX.CRIT:"{#FSNAME}"}%)

Two conditions should match: First, space utilization should be above {$VFS.FS.PUSED.MAX.CRIT:"{#FSNAME}"}.

Second condition should be one of the following:

- The disk free space is less than 5G.

- The disk will be full in less than 24 hours.

{TEMPLATE_NAME:vfs.fs.pused[node_exporter,"{#FSNAME}"].last()}>{$VFS.FS.PUSED.MAX.CRIT:"{#FSNAME}"} and (({Template OS Linux by Prom:vfs.fs.total[node_exporter,"{#FSNAME}"].last()}-{Template OS Linux by Prom:vfs.fs.used[node_exporter,"{#FSNAME}"].last()})<5G or {TEMPLATE_NAME:vfs.fs.pused[node_exporter,"{#FSNAME}"].timeleft(1h,,100)}<1d) AVERAGE

Manual close: YES

{#FSNAME}: Disk space is low (used > {$VFS.FS.PUSED.MAX.WARN:"{#FSNAME}"}%)

Two conditions should match: First, space utilization should be above {$VFS.FS.PUSED.MAX.WARN:"{#FSNAME}"}.

Second condition should be one of the following:

- The disk free space is less than 10G.

- The disk will be full in less than 24 hours.

{TEMPLATE_NAME:vfs.fs.pused[node_exporter,"{#FSNAME}"].last()}>{$VFS.FS.PUSED.MAX.WARN:"{#FSNAME}"} and (({Template OS Linux by Prom:vfs.fs.total[node_exporter,"{#FSNAME}"].last()}-{Template OS Linux by Prom:vfs.fs.used[node_exporter,"{#FSNAME}"].last()})<10G or {TEMPLATE_NAME:vfs.fs.pused[node_exporter,"{#FSNAME}"].timeleft(1h,,100)}<1d) WARNING

Manual close: YES

Depends on:

- {#FSNAME}: Disk space is critically low (used > {$VFS.FS.PUSED.MAX.CRIT:"{#FSNAME}"}%)

{#FSNAME}: Running out of free inodes (free < {$VFS.FS.INODE.PFREE.MIN.CRIT:"{#FSNAME}"}%)

It may become impossible to write to disk if there are no index nodes left.

As symptoms, 'No space left on device' or 'Disk is full' errors may be seen even though free space is available.

{TEMPLATE_NAME:vfs.fs.inode.pfree[node_exporter,"{#FSNAME}"].min(5m)}<{$VFS.FS.INODE.PFREE.MIN.CRIT:"{#FSNAME}"} AVERAGE
{#FSNAME}: Running out of free inodes (free < {$VFS.FS.INODE.PFREE.MIN.WARN:"{#FSNAME}"}%)

It may become impossible to write to disk if there are no index nodes left.

As symptoms, 'No space left on device' or 'Disk is full' errors may be seen even though free space is available.

{TEMPLATE_NAME:vfs.fs.inode.pfree[node_exporter,"{#FSNAME}"].min(5m)}<{$VFS.FS.INODE.PFREE.MIN.WARN:"{#FSNAME}"} WARNING

Depends on:

- {#FSNAME}: Running out of free inodes (free < {$VFS.FS.INODE.PFREE.MIN.CRIT:"{#FSNAME}"}%)

{#DEVNAME}: Disk read/write request responses are too high (read > {$VFS.DEV.READ.AWAIT.WARN:"{#DEVNAME}"} ms for 15m or write > {$VFS.DEV.WRITE.AWAIT.WARN:"{#DEVNAME}"} ms for 15m)

This trigger might indicate disk {#DEVNAME} saturation.

{TEMPLATE_NAME:vfs.dev.read.await[node_exporter,"{#DEVNAME}"].min(15m)} > {$VFS.DEV.READ.AWAIT.WARN:"{#DEVNAME}"} or {Template OS Linux by Prom:vfs.dev.write.await[node_exporter,"{#DEVNAME}"].min(15m)} > {$VFS.DEV.WRITE.AWAIT.WARN:"{#DEVNAME}"} WARNING

Manual close: YES

node_exporter is not available (or no data for 30m)

Failed to fetch system metrics from node_exporter in time.

{TEMPLATE_NAME:node_exporter.get.nodata(30m)}=1 WARNING

Manual close: YES
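
The High bandwidth usage trigger above pairs a problem expression with a recovery expression that is 3 percentage points lower, so the trigger does not flap around the threshold. The sketch below is one way to read that logic. `nextState` and the traffic figures are hypothetical, and the arguments stand for the avg(15m) and last() values used in the real expressions.

```javascript
// Returns the next state (true = problem) given the current state and inputs.
function nextState(inProblem, inBps, outBps, speedBps, utilMax) {
  var fireAt = (utilMax / 100) * speedBps;
  var recoverAt = ((utilMax - 3) / 100) * speedBps;
  if (!inProblem) {
    // Problem expression: either direction above the threshold, speed known.
    return speedBps > 0 && (inBps > fireAt || outBps > fireAt);
  }
  // Recovery expression: BOTH directions must drop below the lower threshold.
  return !(inBps < recoverAt && outBps < recoverAt);
}

var speed = 1e9; // hypothetical 1 Gbit/s link, {$IF.UTIL.MAX} = 90
var state = false;
state = nextState(state, 950e6, 100e6, speed, 90); // above 90%: problem opens
console.log(state);
state = nextState(state, 880e6, 100e6, speed, 90); // between 87% and 90%: stays open
console.log(state);
state = nextState(state, 800e6, 100e6, speed, 90); // below 87%: recovers
console.log(state);
```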

Feedback

Please report any issues with the template at https://support.zabbix.com

You can also provide feedback, discuss the template or ask for help with it at ZABBIX forums.

Known Issues

  • Description: node_exporter v0.16.0 renamed many metrics. CPU utilization for the 'guest' and 'guest_nice' metrics is not supported in this template with node_exporter < 0.16. Disk IO metrics are not supported. Other metrics are provided as 'best effort'.
    See https://github.com/prometheus/node_exporter/releases/tag/v0.16.0 for details.

    • Version: below 0.16.0
  • Description: metric node_network_info with label 'device' cannot be found, so network discovery is not possible.

    • Version: below 0.18
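
The 'best effort' support for older releases relies on patterns with an optional suffix, such as the `(?:_seconds_total)?` group used by the CPU items. The sketch below checks a few of the template's patterns against old-style and new-style metric names. It assumes that the pre-0.16 names are the unsuffixed forms that the optional groups imply, and it uses JavaScript's RegExp as a stand-in for the matching Zabbix performs.

```javascript
// Patterns copied from the item preprocessing steps in this template.
var cpuPattern = /^node_cpu(?:_seconds_total)?$/;
var bootTimePattern = /^node_boot_time(?:_seconds)?$/;
var fsSizePattern = /^node_filesystem_size(?:_bytes)?$/;

// Candidate metric names: unsuffixed (assumed old style) and suffixed (new style).
var candidates = [
  'node_cpu',
  'node_cpu_seconds_total',
  'node_cpu_guest_seconds_total',
  'node_boot_time',
  'node_boot_time_seconds',
  'node_filesystem_size_bytes'
];

candidates.forEach(function (name) {
  var matched = cpuPattern.test(name) || bootTimePattern.test(name) || fsSizePattern.test(name);
  console.log(name + ' -> ' + (matched ? 'matched' : 'not matched'));
});
```

The guest counter is not covered by the plain CPU pattern, which is consistent with the guest items using their own `(?:_guest_seconds_total)?` pattern.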

References

https://github.com/prometheus/node_exporter
