
This template is for Zabbix version: 7.0

Also available for: 6.4, 6.0

Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/nomad?at=release/7.0

HashiCorp Nomad by HTTP

Overview

This template is designed for monitoring HashiCorp Nomad with Zabbix. It works without any external scripts. Currently, the template supports discovery of Nomad servers and clients.

Requirements

Zabbix version: 7.0 and higher.

Tested versions

This template has been tested on:

HashiCorp Nomad version 1.5.6/1.6.0

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

1. Create a synthetic Nomad host. Its API endpoint should be one of the Nomad cluster members, a load-balancing service (if a cluster is used), or a single node in the selected Nomad region.
2. Define the {$NOMAD.ENDPOINT.API.URL} macro value with the correct web protocol, host, and port.
3. Prepare an ACL token with the node:read, namespace:read-job, agent:read, and management permissions applied. Define the {$NOMAD.TOKEN} macro value.

Refer to the vendor documentation about Nomad native ACL or Nomad Vault-generated tokens if you have the HashiCorp Vault integration configured.
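To make steps 2 and 3 concrete, here is a minimal Python sketch of the kind of request the synthetic host depends on. It assumes the node list is read from the standard Nomad /v1/nodes endpoint and that the token travels in the X-Nomad-Token header; the exact path and query the template uses are not shown in this document. The function name build_nodes_request is hypothetical, and the sketch only builds the request without sending it.

```python
from typing import Optional
from urllib.request import Request


def build_nodes_request(api_url: str, token: Optional[str]) -> Request:
    """Build (but do not send) a GET request for the Nomad node list.

    api_url plays the role of {$NOMAD.ENDPOINT.API.URL} and token the role
    of {$NOMAD.TOKEN}. The /v1/nodes path is an assumption for illustration.
    """
    headers = {}
    # When ACL is not used (setup step 3 skipped), no token header is sent.
    if token:
        headers["X-Nomad-Token"] = token
    return Request(api_url.rstrip("/") + "/v1/nodes", headers=headers, method="GET")


req = build_nodes_request("http://localhost:4646", "example-token")
print(req.full_url)                      # http://localhost:4646/v1/nodes
# urllib stores header names via str.capitalize(), hence the lookup spelling.
print(req.get_header("X-nomad-token"))   # example-token
```

A trailing slash in the macro value is stripped so that the URL does not end up with a double slash.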

Additional information:

The synthetic Nomad host is used only as an endpoint for server and client discovery (general cluster information). It is not monitored as a Nomad server or client, which prevents duplicate entities.
If you're not using ACL, skip the third setup step.
Server and client discovery is limited to a single region. If you're using a multi-region cluster, create one synthetic host per region.
The Nomad server and client templates can also be used separately. Feel free to use them if you prefer to create hosts manually.

Macros used

Name	Description	Default
{$NOMAD.ENDPOINT.API.URL}	API endpoint URL for one of the Nomad cluster members.	`http://localhost:4646`
{$NOMAD.TOKEN}	Nomad authentication token.	`<PUT YOUR AUTH TOKEN>`
{$NOMAD.DATA.TIMEOUT}	Response timeout for an API.	`15s`
{$NOMAD.HTTP.PROXY}	Sets the HTTP proxy for script and HTTP agent items. If this parameter is empty, then no proxy is used.
{$NOMAD.API.RESPONSE.SUCCESS}	HTTP API successful response code, used as the availability triggers threshold. Change if needed.	`200`
{$NOMAD.SERVER.NAME.MATCHES}	The filter to include HashiCorp Nomad servers by name.	`.*`
{$NOMAD.SERVER.NAME.NOT_MATCHES}	The filter to exclude HashiCorp Nomad servers by name.	`CHANGE_IF_NEEDED`
{$NOMAD.SERVER.DC.MATCHES}	The filter to include HashiCorp Nomad servers by the datacenter they belong to.	`.*`
{$NOMAD.SERVER.DC.NOT_MATCHES}	The filter to exclude HashiCorp Nomad servers by the datacenter they belong to.	`CHANGE_IF_NEEDED`
{$NOMAD.CLIENT.NAME.MATCHES}	The filter to include HashiCorp Nomad clients by name.	`.*`
{$NOMAD.CLIENT.NAME.NOT_MATCHES}	The filter to exclude HashiCorp Nomad clients by name.	`CHANGE_IF_NEEDED`
{$NOMAD.CLIENT.DC.MATCHES}	The filter to include HashiCorp Nomad clients by the datacenter they belong to.	`.*`
{$NOMAD.CLIENT.DC.NOT_MATCHES}	The filter to exclude HashiCorp Nomad clients by the datacenter they belong to.	`CHANGE_IF_NEEDED`
{$NOMAD.CLIENT.SCHEDULE.ELIGIBILITY.MATCHES}	The filter to include HashiCorp Nomad clients by scheduling eligibility.	`.*`
{$NOMAD.CLIENT.SCHEDULE.ELIGIBILITY.NOT_MATCHES}	The filter to exclude HashiCorp Nomad clients by scheduling eligibility.	`CHANGE_IF_NEEDED`

Items

Name	Description	Type	Key and additional info
Nomad clients get	Nomad clients data in raw format.	HTTP agent	nomad.client.nodes.get Preprocessing Check for not supported value: `any error` ⛔️Custom on fail: Set value to: `{"header":{"HTTP/1.1 408 Request timeout":""}}`
Client nodes API response	Client nodes API response message.	Dependent item	nomad.client.nodes.api.response Preprocessing JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`
Nomad servers get	Nomad servers data in raw format.	Script	nomad.server.nodes.get
Server-related APIs response	Server-related (`operator/raft/configuration`, `agent/members`) APIs error response message.	Dependent item	nomad.server.api.response Preprocessing JSON Path: `$.error` ⛔️Custom on fail: Set value to: `HTTP/1.1 200 OK` Discard unchanged with heartbeat: `1h`
Region	Current cluster region.	Dependent item	nomad.region Preprocessing JSON Path: `$..region.first()`
Nomad servers count	Nomad servers count.	Dependent item	nomad.servers.count Preprocessing JSON Path: `$[?(@.Name)].length()`
Nomad clients count	Nomad clients count.	Dependent item	nomad.clients.count Preprocessing JSON Path: `$.body[?(@.Name)].length()`
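The HTTP agent items appear to store a JSON object with "header" and "body" keys; this is inferred from the custom on-fail value {"header":{"HTTP/1.1 408 Request timeout":""}}. The Python sketch below approximates two of the dependent items under that assumption: status_line stands in for the response-message extraction (the real JavaScript step is not shown here), and clients_count mimics the JSONPath `$.body[?(@.Name)].length()`. Both function names are hypothetical and the sample payload is invented.

```python
import json


def status_line(raw: str) -> str:
    """Return the first key of the "header" object, e.g. "HTTP/1.1 200 OK"."""
    header = json.loads(raw).get("header") or {}
    return next(iter(header), "")


def clients_count(raw: str) -> int:
    """Approximate `$.body[?(@.Name)].length()`: count body entries with a Name key."""
    body = json.loads(raw).get("body") or []
    return sum(1 for node in body if isinstance(node, dict) and "Name" in node)


ok = json.dumps({
    "header": {"HTTP/1.1 200 OK": ""},
    "body": [{"Name": "client-1"}, {"Name": "client-2"}],
})
timeout = '{"header":{"HTTP/1.1 408 Request timeout":""}}'  # custom on-fail value

print(status_line(ok), clients_count(ok))            # HTTP/1.1 200 OK 2
print(status_line(timeout), clients_count(timeout))  # HTTP/1.1 408 Request timeout 0
```

The on-fail value has no "body", so any count derived from it is simply zero, while its status line still carries a readable error for the availability triggers.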

Triggers

Name	Description	Expression	Severity	Dependencies and additional info
Client nodes API connection has failed	Client nodes API connection has failed. Ensure that Nomad API URL and the necessary permissions have been defined correctly, check the service state and network connectivity between Nomad and Zabbix.	`find(/HashiCorp Nomad by HTTP/nomad.client.nodes.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0`	Average	Manual close: Yes
Server-related API connection has failed	Server-related API connection has failed. Ensure that Nomad API URL and the necessary permissions have been defined correctly, check the service state and network connectivity between Nomad and Zabbix.	`find(/HashiCorp Nomad by HTTP/nomad.server.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0`	Average	Manual close: Yes
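Both triggers fire when the latest response message does not contain the success code. Here is a rough Python model of `find(...,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0`. It assumes that, with an empty period, find() evaluates only the most recent value and that "like" is a case-sensitive substring match. api_connection_failed is a hypothetical name.

```python
def api_connection_failed(last_response: str, success_code: str = "200") -> bool:
    """Model of find(/host/key,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0."""
    # "like" is treated as a case-sensitive substring match on the latest value.
    found = 1 if success_code in last_response else 0
    # The trigger fires when find() returned 0, i.e. the code was not found.
    return found == 0


print(api_connection_failed("HTTP/1.1 200 OK"))               # False
print(api_connection_failed("HTTP/1.1 403 Forbidden"))        # True
print(api_connection_failed("HTTP/1.1 408 Request timeout"))  # True
```

Because this is a substring test, any message that merely contains "200" counts as success. Keep that in mind if you change {$NOMAD.API.RESPONSE.SUCCESS}.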

LLD rule Clients discovery

Name Description Type Key and additional info

Clients discovery

Name	Description	Type	Key and additional info
Clients discovery	Client nodes discovery.	Dependent item	nomad.clients.discovery Preprocessing JSON Path: `$.body` ⛔️Custom on fail: Discard value Discard unchanged with heartbeat: `1h`

Client nodes discovery.

Dependent item

nomad.clients.discovery

Preprocessing

JSON Path: $.body
⛔️Custom on fail: Discard value
Discard unchanged with heartbeat: 1h

LLD rule Servers discovery

Name Description Type Key and additional info

Servers discovery

Name	Description	Type	Key and additional info
Servers discovery	Server nodes discovery.	Dependent item	nomad.servers.discovery Preprocessing Check for error in JSON: `$.error` ⛔️Custom on fail: Discard value Discard unchanged with heartbeat: `1h`

Server nodes discovery.

Dependent item

nomad.servers.discovery

Preprocessing

Check for error in JSON: $.error
⛔️Custom on fail: Discard value
Discard unchanged with heartbeat: 1h

HashiCorp Nomad Client by HTTP

Overview

This template is designed for monitoring HashiCorp Nomad clients with Zabbix. It works without any external scripts.

Requirements

Zabbix version: 7.0 and higher.

Tested versions

This template has been tested on:

HashiCorp Nomad version 1.5.6/1.6.0

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

1. Enable telemetry in the HashiCorp Nomad agent configuration file and set the Prometheus metrics format.

Refer to the vendor documentation.

2. Prepare an ACL token with the node:read and namespace:read-job permissions applied. Define the {$NOMAD.TOKEN} macro value.

Refer to the vendor documentation about Nomad native ACL or Nomad Vault-generated tokens if you're using integration with HashiCorp Vault.

3. Set the values for the {$NOMAD.CLIENT.API.SCHEME} and {$NOMAD.CLIENT.API.PORT} macros to define the common Nomad API web scheme and connection port.

Additional information:

You only need to prepare an additional ACL token if you want to monitor Nomad clients as separate entities. If you're using client discovery, the token is inherited from the master host linked to the HashiCorp Nomad by HTTP template.
If you're not using ACL, skip the second setup step.
Nomad clients use the default web scheme (HTTP) and the default API port (4646). If you're using client discovery and need to redefine these macros for a particular host created from the prototype, use context macros such as {$NOMAD.CLIENT.API.SCHEME:NECESSARY.IP} and/or {$NOMAD.CLIENT.API.PORT:NECESSARY.IP} on the master host or template level.
Some metrics may not be collected, depending on your HashiCorp Nomad agent version and configuration.
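For discovered client hosts, the scheme and port resolve per host. The Python sketch below is a simplified model of Zabbix user macro context resolution: a macro with a matching static context wins, then the plain macro, then the template default from the table below. Regular-expression contexts and the real host/template inheritance order are not modelled. The /v1/metrics?format=prometheus path is the Nomad Prometheus metrics endpoint as we understand it and is an assumption about what the template queries; resolve and telemetry_url are hypothetical names.

```python
from typing import Mapping

# Template defaults for the two client connection macros.
DEFAULTS = {
    "{$NOMAD.CLIENT.API.SCHEME}": "http",
    "{$NOMAD.CLIENT.API.PORT}": "4646",
}


def resolve(macro: str, context: str, defined: Mapping[str, str]) -> str:
    """Resolve {$NAME} for a given context (here: a client IP address)."""
    name = macro[2:-1]                           # strip the leading "{$" and trailing "}"
    with_context = "{$" + name + ":" + context + "}"
    if with_context in defined:                  # context macro takes precedence
        return defined[with_context]
    if macro in defined:                         # then the plain macro
        return defined[macro]
    return DEFAULTS[macro]                       # finally the template default


def telemetry_url(ip: str, defined: Mapping[str, str]) -> str:
    scheme = resolve("{$NOMAD.CLIENT.API.SCHEME}", ip, defined)
    port = resolve("{$NOMAD.CLIENT.API.PORT}", ip, defined)
    return f"{scheme}://{ip}:{port}/v1/metrics?format=prometheus"


overrides = {
    "{$NOMAD.CLIENT.API.SCHEME:10.0.0.7}": "https",
    "{$NOMAD.CLIENT.API.PORT:10.0.0.7}": "8646",
}
print(telemetry_url("10.0.0.7", overrides))  # https://10.0.0.7:8646/v1/metrics?format=prometheus
print(telemetry_url("10.0.0.8", overrides))  # http://10.0.0.8:4646/v1/metrics?format=prometheus
```

Only the client at 10.0.0.7 picks up the overrides; every other discovered client keeps the defaults.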

Macros used

Name	Description	Default
{$NOMAD.CLIENT.API.SCHEME}	Nomad client API scheme.	`http`
{$NOMAD.CLIENT.API.PORT}	Nomad client API port.	`4646`
{$NOMAD.TOKEN}	Nomad authentication token.	`<PUT YOUR AUTH TOKEN>`
{$NOMAD.DATA.TIMEOUT}	Response timeout for an API.	`15s`
{$NOMAD.HTTP.PROXY}	Sets the HTTP proxy for HTTP agent items. If this parameter is empty, then no proxy is used.
{$NOMAD.API.RESPONSE.SUCCESS}	HTTP API successful response code, used as the availability triggers threshold. Change if needed.	`200`
{$NOMAD.CLIENT.RPC.PORT}	Nomad RPC service port.	`4647`
{$NOMAD.CLIENT.SERF.PORT}	Nomad serf service port.	`4648`
{$NOMAD.CLIENT.OPEN.FDS.MAX.WARN}	Maximum percentage of used file descriptors.	`90`
{$NOMAD.DISK.NAME.MATCHES}	The filter to include HashiCorp Nomad client disks by name.	`.*`
{$NOMAD.DISK.NAME.NOT_MATCHES}	The filter to exclude HashiCorp Nomad client disks by name.	`CHANGE_IF_NEEDED`
{$NOMAD.JOB.NAME.MATCHES}	The filter to include HashiCorp Nomad client jobs by name.	`.*`
{$NOMAD.JOB.NAME.NOT_MATCHES}	The filter to exclude HashiCorp Nomad client jobs by name.	`CHANGE_IF_NEEDED`
{$NOMAD.JOB.NAMESPACE.MATCHES}	The filter to include HashiCorp Nomad client jobs by namespace.	`.*`
{$NOMAD.JOB.NAMESPACE.NOT_MATCHES}	The filter to exclude HashiCorp Nomad client jobs by namespace.	`CHANGE_IF_NEEDED`
{$NOMAD.JOB.TYPE.MATCHES}	The filter to include HashiCorp Nomad client jobs by type.	`.*`
{$NOMAD.JOB.TYPE.NOT_MATCHES}	The filter to exclude HashiCorp Nomad client jobs by type.	`CHANGE_IF_NEEDED`
{$NOMAD.JOB.TASK.GROUP.MATCHES}	The filter to include HashiCorp Nomad client jobs by task group.	`.*`
{$NOMAD.JOB.TASK.GROUP.NOT_MATCHES}	The filter to exclude HashiCorp Nomad client jobs by task group.	`CHANGE_IF_NEEDED`
{$NOMAD.DRIVER.NAME.MATCHES}	The filter to include HashiCorp Nomad client drivers by name.	`.*`
{$NOMAD.DRIVER.NAME.NOT_MATCHES}	The filter to exclude HashiCorp Nomad client drivers by name.	`CHANGE_IF_NEEDED`
{$NOMAD.DRIVER.DETECT.MATCHES}	The filter to include HashiCorp Nomad client drivers by detection state. Possible filtering values: `true`, `false`.	`.*`
{$NOMAD.DRIVER.DETECT.NOT_MATCHES}	The filter to exclude HashiCorp Nomad client drivers by detection state. Possible filtering values: `true`, `false`.	`CHANGE_IF_NEEDED`
{$NOMAD.CPU.UTIL.MIN}	CPU utilization threshold. Measured as a percentage.	`90`
{$NOMAD.RAM.AVAIL.MIN}	Minimum available memory threshold. Measured as a percentage.	`5`
{$NOMAD.INODES.FREE.MIN.WARN}	Warning threshold of the filesystem metadata utilization. Measured as a percentage.	`20`
{$NOMAD.INODES.FREE.MIN.CRIT}	Critical threshold of the filesystem metadata utilization. Measured as a percentage.	`10`

Items

Name	Description	Type	Key and additional info
Telemetry get	Telemetry data in raw format.	HTTP agent	nomad.client.data.get Preprocessing Check for not supported value: `any error` ⛔️Custom on fail: Set value to: `{"header":{"HTTP/1.1 408 Request timeout":""}}`
Metrics	Nomad client metrics in raw format.	Dependent item	nomad.client.metrics.get Preprocessing JSON Path: `$.body` ⛔️Custom on fail: Discard value
Monitoring API response	Monitoring API response message.	Dependent item	nomad.client.data.api.response Preprocessing JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`
Service [rpc] state	Current [rpc] service state.	Simple check	net.tcp.service[tcp,,{$NOMAD.CLIENT.RPC.PORT}] Preprocessing Discard unchanged with heartbeat: `1h`
Service [serf] state	Current [serf] service state.	Simple check	net.tcp.service[tcp,,{$NOMAD.CLIENT.SERF.PORT}] Preprocessing Discard unchanged with heartbeat: `1h`
CPU allocated	Total amount of CPU shares the scheduler has allocated to tasks.	Dependent item	nomad.client.allocated.cpu Preprocessing Prometheus pattern: `VALUE(nomad_client_allocated_cpu)` ⛔️Custom on fail: Discard value
CPU unallocated	Total amount of CPU shares free for the scheduler to allocate to tasks.	Dependent item	nomad.client.unallocated.cpu Preprocessing Prometheus pattern: `VALUE(nomad_client_unallocated_cpu)` ⛔️Custom on fail: Discard value
Memory allocated	Total amount of memory the scheduler has allocated to tasks.	Dependent item	nomad.client.allocated.memory Preprocessing Prometheus pattern: `VALUE(nomad_client_allocated_memory)` ⛔️Custom on fail: Discard value Custom multiplier: `1.0E+6`
Memory unallocated	Total amount of memory free for the scheduler to allocate to tasks.	Dependent item	nomad.client.unallocated.memory Preprocessing Prometheus pattern: `VALUE(nomad_client_unallocated_memory)` ⛔️Custom on fail: Discard value Custom multiplier: `1.0E+6`
Disk allocated	Total amount of disk space the scheduler has allocated to tasks.	Dependent item	nomad.client.allocated.disk Preprocessing Prometheus pattern: `VALUE(nomad_client_allocated_disk)` ⛔️Custom on fail: Discard value Custom multiplier: `1.0E+6`
Disk unallocated	Total amount of disk space free for the scheduler to allocate to tasks.	Dependent item	nomad.client.unallocated.disk Preprocessing Prometheus pattern: `VALUE(nomad_client_unallocated_disk)` ⛔️Custom on fail: Discard value Custom multiplier: `1.0E+6`
Allocations blocked	Number of allocations waiting for previous versions.	Dependent item	nomad.client.allocations.blocked Preprocessing Prometheus pattern: `VALUE(nomad_client_allocations_blocked)` ⛔️Custom on fail: Set value to: `0`
Allocations migrating	Number of allocations migrating data from previous versions.	Dependent item	nomad.client.allocations.migrating Preprocessing Prometheus pattern: `VALUE(nomad_client_allocations_migrating)` ⛔️Custom on fail: Set value to: `0`
Allocations pending	Number of allocations pending (received by the client but not yet running).	Dependent item	nomad.client.allocations.pending Preprocessing Prometheus pattern: `VALUE(nomad_client_allocations_pending)` ⛔️Custom on fail: Set value to: `0`
Allocations starting	Number of allocations starting.	Dependent item	nomad.client.allocations.start Preprocessing Prometheus pattern: `VALUE(nomad_client_allocations_start)` ⛔️Custom on fail: Set value to: `0`
Allocations running	Number of allocations running.	Dependent item	nomad.client.allocations.running Preprocessing Prometheus pattern: `VALUE(nomad_client_allocations_running)` ⛔️Custom on fail: Set value to: `0`
Allocations terminal	Number of allocations in a terminal state.	Dependent item	nomad.client.allocations.terminal Preprocessing Prometheus pattern: `VALUE(nomad_client_allocations_terminal)` ⛔️Custom on fail: Set value to: `0`
Allocations failed, rate	Number of allocations failed.	Dependent item	nomad.client.allocations.failed Preprocessing Prometheus pattern: `SUM(nomad_client_allocs_failed)` ⛔️Custom on fail: Set value to: `0` Change per second Discard unchanged with heartbeat: `1h`
Allocations completed, rate	Number of allocations completed.	Dependent item	nomad.client.allocations.complete Preprocessing Prometheus pattern: `SUM(nomad_client_allocs_complete)` ⛔️Custom on fail: Set value to: `0` Change per second Discard unchanged with heartbeat: `1h`
Allocations restarted, rate	Number of allocations restarted.	Dependent item	nomad.client.allocations.restart Preprocessing Prometheus pattern: `SUM(nomad_client_allocs_restart)` ⛔️Custom on fail: Set value to: `0` Change per second Discard unchanged with heartbeat: `1h`
Allocations OOM killed	Number of allocations OOM killed.	Dependent item	nomad.client.allocations.oom_killed Preprocessing Prometheus pattern: `VALUE(nomad_client_allocs_oom_killed)` ⛔️Custom on fail: Set value to: `0` Discard unchanged with heartbeat: `1h`
CPU idle utilization	CPU utilization in idle state.	Dependent item	nomad.client.cpu.idle Preprocessing Prometheus pattern: `AVG(nomad_client_host_cpu_idle)` ⛔️Custom on fail: Discard value
CPU system utilization	CPU utilization in system space.	Dependent item	nomad.client.cpu.system Preprocessing Prometheus pattern: `AVG(nomad_client_host_cpu_system)` ⛔️Custom on fail: Discard value
CPU total utilization	Total CPU utilization.	Dependent item	nomad.client.cpu.total Preprocessing Prometheus pattern: `AVG(nomad_client_host_cpu_total)` ⛔️Custom on fail: Discard value
CPU user utilization	CPU utilization in user space.	Dependent item	nomad.client.cpu.user Preprocessing Prometheus pattern: `AVG(nomad_client_host_cpu_user)` ⛔️Custom on fail: Discard value
Memory available	Total amount of memory available to processes which includes free and cached memory.	Dependent item	nomad.client.memory.available Preprocessing Prometheus pattern: `VALUE(nomad_client_host_memory_available)` ⛔️Custom on fail: Discard value
Memory free	Amount of memory which is free.	Dependent item	nomad.client.memory.free Preprocessing Prometheus pattern: `VALUE(nomad_client_host_memory_free)`
Memory size	Total amount of physical memory on the node.	Dependent item	nomad.client.memory.total Preprocessing Prometheus pattern: `VALUE(nomad_client_host_memory_total)`
Memory used	Amount of memory used by processes.	Dependent item	nomad.client.memory.used Preprocessing Prometheus pattern: `VALUE(nomad_client_host_memory_used)`
Uptime	Uptime of the host running the Nomad client.	Dependent item	nomad.client.uptime Preprocessing Prometheus pattern: `VALUE(nomad_client_uptime)`
Node info get	Node info data in raw format.	HTTP agent	nomad.client.node.info.get Preprocessing Check for not supported value: `any error` ⛔️Custom on fail: Set value to: `{"header":{"HTTP/1.1 408 Request timeout":""}}`
Nomad client version	Nomad client version.	Dependent item	nomad.client.version Preprocessing JSON Path: `$.body..Version.first()`
Nodes API response	Nodes API response message.	Dependent item	nomad.client.node.info.api.response Preprocessing JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`
Allocated jobs get	Allocated jobs data in raw format.	HTTP agent	nomad.client.job.allocs.get Preprocessing Check for not supported value: `any error` ⛔️Custom on fail: Set value to: `{"header":{"HTTP/1.1 408 Request timeout":""}}`
Allocations API response	Allocations API response message.	Dependent item	nomad.client.job.allocs.api.response Preprocessing JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`
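The Prometheus pattern steps read individual series out of the raw telemetry text. The Python sketch below is a deliberately small stand-in for Zabbix's Prometheus preprocessing, not a reimplementation of it. It handles only simple `name{labels} value` lines, emulates VALUE(...) for a single series and SUM(...) across all series of a metric, applies the 1.0E+6 custom multiplier (which suggests Nomad reports these sizes in megabytes), and computes "Change per second" as the difference between two samples divided by the elapsed time. All function names are hypothetical and the exposition text is invented.

```python
import re
from typing import List, Tuple

# name, optional {labels}, value; timestamps and escaped braces are not handled.
LINE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)")


def samples(text: str, name: str) -> List[float]:
    """Values of every series of metric `name`, ignoring comment lines."""
    out = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = LINE.match(line)
        if m and m.group(1) == name:
            out.append(float(m.group(3)))
    return out


def value(text: str, name: str) -> float:
    """Like VALUE(name); this sketch rejects zero or multiple matches."""
    vals = samples(text, name)
    if len(vals) != 1:
        raise ValueError(f"expected one series for {name}, got {len(vals)}")
    return vals[0]


def total(text: str, name: str) -> float:
    """Like SUM(name): adds up the series across all label sets."""
    return sum(samples(text, name))


def change_per_second(prev: Tuple[float, float], curr: Tuple[float, float]) -> float:
    """(value - previous value) / (time - previous time); tuples are (time, value)."""
    (t0, v0), (t1, v1) = prev, curr
    return (v1 - v0) / (t1 - t0)


metrics = """# TYPE nomad_client_allocated_memory gauge
nomad_client_allocated_memory{node_id="n1"} 512
nomad_client_allocs_failed{task_group="web"} 3
nomad_client_allocs_failed{task_group="db"} 1
"""

memory_bytes = value(metrics, "nomad_client_allocated_memory") * 1.0e6  # custom multiplier
failed_now = total(metrics, "nomad_client_allocs_failed")
print(memory_bytes)                                   # 512000000.0
print(failed_now)                                     # 4.0
print(change_per_second((0, 1.0), (60, failed_now)))  # 0.05
```

The "rate" items therefore report how fast the cumulative counters grow, not their absolute values.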

Triggers

Name	Description	Expression	Severity	Dependencies and additional info
Monitoring API connection has failed	Monitoring API connection has failed. Ensure that Nomad API URL and the necessary permissions have been defined correctly, check the service state and network connectivity between Nomad and Zabbix.	`find(/HashiCorp Nomad Client by HTTP/nomad.client.data.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0`	Average	Manual close: Yes
Service [rpc] is down	Cannot establish the connection to [rpc] service port {$NOMAD.CLIENT.RPC.PORT}. Check the Nomad state and network connectivity between Nomad and Zabbix.	`last(/HashiCorp Nomad Client by HTTP/net.tcp.service[tcp,,{$NOMAD.CLIENT.RPC.PORT}]) = 0`	Average	Manual close: Yes
Service [serf] is down	Cannot establish the connection to [serf] service port {$NOMAD.CLIENT.SERF.PORT}. Check the Nomad state and network connectivity between Nomad and Zabbix.	`last(/HashiCorp Nomad Client by HTTP/net.tcp.service[tcp,,{$NOMAD.CLIENT.SERF.PORT}]) = 0`	Average	Manual close: Yes
OOM killed allocations found	OOM killed allocations found.	`last(/HashiCorp Nomad Client by HTTP/nomad.client.allocations.oom_killed) > 0`	Warning	Manual close: Yes
High CPU utilization	CPU utilization is too high. The system might be slow to respond.	`min(/HashiCorp Nomad Client by HTTP/nomad.client.cpu.total, 10m) >= {$NOMAD.CPU.UTIL.MIN}`	Average
High memory utilization	RAM utilization is too high. The system might be slow to respond.	`(min(/HashiCorp Nomad Client by HTTP/nomad.client.memory.available, 10m) / last(/HashiCorp Nomad Client by HTTP/nomad.client.memory.total))*100 <= {$NOMAD.RAM.AVAIL.MIN}`	Average
The host has been restarted	The host uptime is less than 10 minutes.	`last(/HashiCorp Nomad Client by HTTP/nomad.client.uptime) < 10m`	Warning	Manual close: Yes
Nomad client version has changed	Nomad client version has changed.	`change(/HashiCorp Nomad Client by HTTP/nomad.client.version)<>0`	Info	Manual close: Yes
Nodes API connection has failed	Nodes API connection has failed. Ensure that Nomad API URL and the necessary permissions have been defined correctly, check the service state and network connectivity between Nomad and Zabbix.	`find(/HashiCorp Nomad Client by HTTP/nomad.client.node.info.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0`	Average	Manual close: Yes Depends on: Monitoring API connection has failed
Allocations API connection has failed	Allocations API connection has failed. Ensure that Nomad API URL and the necessary permissions have been defined correctly, check the service state and network connectivity between Nomad and Zabbix.	`find(/HashiCorp Nomad Client by HTTP/nomad.client.job.allocs.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0`	Average	Manual close: Yes Depends on: Monitoring API connection has failed
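The two utilization triggers can be modelled as follows. This Python sketch assumes the 10-minute window is given as a list of collected values: min(...,10m) takes the smallest value in that window and last(...) takes the most recent one. The default thresholds mirror {$NOMAD.CPU.UTIL.MIN} and {$NOMAD.RAM.AVAIL.MIN}. The function names and sample numbers are invented.

```python
from typing import Sequence


def high_cpu(cpu_total_10m: Sequence[float], threshold: float = 90) -> bool:
    """min(nomad.client.cpu.total,10m) >= {$NOMAD.CPU.UTIL.MIN}."""
    # Fires only if every value in the window is at or above the threshold.
    return min(cpu_total_10m) >= threshold


def high_memory(available_10m: Sequence[float], total_last: float,
                min_avail_pct: float = 5) -> bool:
    """(min(memory.available,10m) / last(memory.total)) * 100 <= {$NOMAD.RAM.AVAIL.MIN}."""
    # Fires if the lowest available-memory value in the window, as a share of
    # the latest total, is at or below the percentage threshold.
    return min(available_10m) / total_last * 100 <= min_avail_pct


print(high_cpu([95, 97, 92]))                # True
print(high_cpu([95, 60, 92]))                # False
print(high_memory([4.0e9, 3.0e8], 8.0e9))    # True  (about 3.75 %)
print(high_memory([4.0e9, 3.0e9], 8.0e9))    # False (37.5 %)
```

Note the asymmetry: the CPU trigger needs the whole window above the threshold, while the memory trigger fires as soon as a single low reading falls within the window.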

LLD rule Drivers discovery

Name	Description	Type	Key and additional info
Drivers discovery	Client drivers discovery.	Dependent item	nomad.client.drivers.discovery Preprocessing JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`

Item prototypes for Drivers discovery

Name	Description	Type	Key and additional info
Driver [{#DRIVER.NAME}] state	Driver [{#DRIVER.NAME}] state.	Dependent item	nomad.client.driver.state["{#DRIVER.NAME}"] Preprocessing JSON Path: `$.body..Drivers.{#DRIVER.NAME}.Healthy.first()` Boolean to decimal Discard unchanged with heartbeat: `1h`
Driver [{#DRIVER.NAME}] detection state	Driver [{#DRIVER.NAME}] detection state.	Dependent item	nomad.client.driver.detected["{#DRIVER.NAME}"] Preprocessing JSON Path: `$.body..Drivers.{#DRIVER.NAME}.Detected.first()` Boolean to decimal

Trigger prototypes for Drivers discovery

Name	Description	Expression	Severity	Dependencies and additional info
Driver [{#DRIVER.NAME}] is in unhealthy state	The [{#DRIVER.NAME}] driver detected, but its state is unhealthy.	`last(/HashiCorp Nomad Client by HTTP/nomad.client.driver.state["{#DRIVER.NAME}"]) = 0 and last(/HashiCorp Nomad Client by HTTP/nomad.client.driver.detected["{#DRIVER.NAME}"]) = 1`	Warning	Manual close: Yes
Driver [{#DRIVER.NAME}] detection state has changed	The [{#DRIVER.NAME}] driver detection state has changed.	`change(/HashiCorp Nomad Client by HTTP/nomad.client.driver.detected["{#DRIVER.NAME}"]) <> 0`	Info	Manual close: Yes
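The driver item prototypes pull the Healthy and Detected flags out of the node info body and convert them with "Boolean to decimal"; the first trigger prototype then combines them. The Python sketch below assumes the body contains a Drivers object keyed by driver name, as the JSONPath `$.body..Drivers.{#DRIVER.NAME}.Healthy` suggests, and reads it directly rather than with a recursive search. driver_flags and driver_unhealthy are hypothetical names and the node data is invented.

```python
from typing import Any, Dict


def driver_flags(body: Dict[str, Any], driver: str) -> Dict[str, int]:
    """Read Healthy/Detected for one driver and apply "Boolean to decimal"."""
    info = body["Drivers"][driver]
    return {"state": int(bool(info["Healthy"])), "detected": int(bool(info["Detected"]))}


def driver_unhealthy(flags: Dict[str, int]) -> bool:
    """last(driver.state) = 0 and last(driver.detected) = 1."""
    return flags["state"] == 0 and flags["detected"] == 1


node = {  # invented sample of a node info body
    "Drivers": {
        "docker": {"Detected": True, "Healthy": False},
        "exec": {"Detected": True, "Healthy": True},
        "qemu": {"Detected": False, "Healthy": False},
    }
}

for name in ("docker", "exec", "qemu"):
    print(name, driver_unhealthy(driver_flags(node, name)))
# docker True
# exec False
# qemu False
```

Requiring detected = 1 prevents drivers that are not installed on the node, such as qemu above, from raising the unhealthy trigger.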

LLD rule Physical disks discovery

Name	Description	Type	Key and additional info
Physical disks discovery	Physical disks discovery.	Dependent item	nomad.client.disk.discovery Preprocessing Prometheus to JSON: `nomad_client_host_disk_available{disk=~".*"}`

Item prototypes for Physical disks discovery

Name	Description	Type	Key and additional info
Disk ["{#DEV.NAME}"] space available	Amount of space which is available on ["{#DEV.NAME}"] disk.	Dependent item	nomad.client.disk.available["{#DEV.NAME}"] Preprocessing Prometheus pattern: `VALUE(nomad_client_host_disk_available{disk="{#DEV.NAME}"})`
Disk ["{#DEV.NAME}"] inodes utilization	Disk space consumed by the inodes on ["{#DEV.NAME}"] disk.	Dependent item	nomad.client.disk.inodes_percent["{#DEV.NAME}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
Disk ["{#DEV.NAME}"] size	Total size of the ["{#DEV.NAME}"] device.	Dependent item	nomad.client.disk.size["{#DEV.NAME}"] Preprocessing Prometheus pattern: `VALUE(nomad_client_host_disk_size{disk="{#DEV.NAME}"})`
Disk ["{#DEV.NAME}"] space utilization	Percentage of disk ["{#DEV.NAME}"] space used.	Dependent item	nomad.client.disk.used_percent["{#DEV.NAME}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
Disk ["{#DEV.NAME}"] space used	Amount of disk ["{#DEV.NAME}"] space which has been used.	Dependent item	nomad.client.disk.used["{#DEV.NAME}"] Preprocessing Prometheus pattern: `VALUE(nomad_client_host_disk_used{disk="{#DEV.NAME}"})`

Trigger prototypes for Physical disks discovery

Name	Description	Expression	Severity	Dependencies and additional info
Running out of free inodes on [{#DEV.NAME}] device	It may become impossible to write to a disk if there are no index nodes left. The following error messages may be returned as symptoms, even though free space is still available: - No space left on device; - Disk is full.	`min(/HashiCorp Nomad Client by HTTP/nomad.client.disk.inodes_percent["{#DEV.NAME}"],5m) >= {$NOMAD.INODES.FREE.MIN.WARN:"{#DEV.NAME}"}`	Warning	Manual close: Yes Depends on: Running out of free inodes on [{#DEV.NAME}] device
Running out of free inodes on [{#DEV.NAME}] device	It may become impossible to write to a disk if there are no index nodes left. The following error messages may be returned as symptoms, even though free space is still available: - No space left on device; - Disk is full.	`min(/HashiCorp Nomad Client by HTTP/nomad.client.disk.inodes_percent["{#DEV.NAME}"],5m) >= {$NOMAD.INODES.FREE.MIN.CRIT:"{#DEV.NAME}"}`	Average	Manual close: Yes
High disk [{#DEV.NAME}] utilization	High disk [{#DEV.NAME}] utilization.	`min(/HashiCorp Nomad Client by HTTP/nomad.client.disk.used_percent["{#DEV.NAME}"],5m) >= {$NOMAD.DISK.UTIL.MIN.WARN:"{#DEV.NAME}"}`	Warning	Manual close: Yes Depends on: Running out of free inodes on [{#DEV.NAME}] device
High disk [{#DEV.NAME}] utilization	High disk [{#DEV.NAME}] utilization.	`min(/HashiCorp Nomad Client by HTTP/nomad.client.disk.used_percent["{#DEV.NAME}"],5m) >= {$NOMAD.DISK.UTIL.MIN.CRIT:"{#DEV.NAME}"}`	Average	Manual close: Yes
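The disk discovery rule turns every nomad_client_host_disk_available series into an LLD row, and the utilization trigger prototypes compare one item against two thresholds. The rough Python sketch below assumes that the "disk" label value becomes {#DEV.NAME} (the LLD macro mapping is not shown in this document). The threshold values passed in are hypothetical, because {$NOMAD.DISK.UTIL.MIN.WARN} and {$NOMAD.DISK.UTIL.MIN.CRIT} have no defaults listed in the macro table. The function names and metric text are invented.

```python
import re
from typing import Dict, List, Optional, Sequence

# Series of the metric the discovery rule filters on, with its label block.
SERIES = re.compile(r"^nomad_client_host_disk_available\{([^}]*)\}\s+\S+")
LABEL = re.compile(r'(\w+)="([^"]*)"')


def discover_disks(text: str) -> List[Dict[str, str]]:
    """One LLD row per disk label value (mapping to {#DEV.NAME} is assumed)."""
    rows = []
    for line in text.splitlines():
        m = SERIES.match(line.strip())
        if m:
            labels = dict(LABEL.findall(m.group(1)))
            if "disk" in labels:
                rows.append({"{#DEV.NAME}": labels["disk"]})
    return rows


def disk_severity(used_percent_5m: Sequence[float], warn: float, crit: float) -> Optional[str]:
    """min(used_percent,5m) compared against hypothetical warn/crit thresholds."""
    lowest = min(used_percent_5m)
    # Simplification: report only the highest matching severity.
    if lowest >= crit:
        return "Average"
    if lowest >= warn:
        return "Warning"
    return None


metrics = """nomad_client_host_disk_available{disk="/dev/sda1",node_id="n1"} 1.2e+10
nomad_client_host_disk_available{disk="/dev/sdb1",node_id="n1"} 5e+09
nomad_client_host_disk_size{disk="/dev/sda1",node_id="n1"} 5e+10
"""
print(discover_disks(metrics))
# [{'{#DEV.NAME}': '/dev/sda1'}, {'{#DEV.NAME}': '/dev/sdb1'}]
print(disk_severity([92, 95], warn=80, crit=90))  # Average
print(disk_severity([85, 88], warn=80, crit=90))  # Warning
```

The disk_size series is ignored by discovery because only the metric named in the filter produces rows.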

LLD rule Allocated jobs discovery

Name	Description	Type	Key and additional info
Allocated jobs discovery	Allocated jobs discovery.	Dependent item	nomad.client.alloc.discovery Preprocessing JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`

Item prototypes for Allocated jobs discovery

Name	Description	Type	Key and additional info
Job ["{#JOB.NAME}"] CPU allocated	Total CPU resources allocated by the ["{#JOB.NAME}"] job across all cores.	Dependent item	nomad.client.allocs.cpu.allocated["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
Job ["{#JOB.NAME}"] CPU system utilization	Total CPU resources consumed by the ["{#JOB.NAME}"] job in system space.	Dependent item	nomad.client.allocs.cpu.system["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
Job ["{#JOB.NAME}"] CPU user utilization	Total CPU resources consumed by the ["{#JOB.NAME}"] job in user space.	Dependent item	nomad.client.allocs.cpu.user["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
Job ["{#JOB.NAME}"] CPU total utilization	Total CPU resources consumed by the ["{#JOB.NAME}"] job across all cores.	Dependent item	nomad.client.allocs.cpu.total_percent["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
Job ["{#JOB.NAME}"] CPU throttled periods time	Total number of CPU periods that the ["{#JOB.NAME}"] job was throttled.	Dependent item	nomad.client.allocs.cpu.throttled_periods["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.` Custom multiplier: `1e-09`
Job ["{#JOB.NAME}"] CPU throttled time	Total time that the ["{#JOB.NAME}"] job was throttled.	Dependent item	nomad.client.allocs.cpu.throttled_time["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
Job ["{#JOB.NAME}"] CPU ticks	CPU ticks consumed by the process for the ["{#JOB.NAME}"] job in the last collection interval.	Dependent item	nomad.client.allocs.cpu.total_ticks["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
Job ["{#JOB.NAME}"] Memory allocated	Amount of memory allocated by the ["{#JOB.NAME}"] job.	Dependent item	nomad.client.allocs.memory.allocated["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
Job ["{#JOB.NAME}"] Memory cached	Amount of memory cached by the ["{#JOB.NAME}"] job.	Dependent item	nomad.client.allocs.memory.cache["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
Job ["{#JOB.NAME}"] Memory used	Total amount of memory used by the ["{#JOB.NAME}"] job.	Dependent item	nomad.client.allocs.memory.usage["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
Job ["{#JOB.NAME}"] Memory swapped	Amount of memory swapped by the ["{#JOB.NAME}"] job.	Dependent item	nomad.client.allocs.memory.swap["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`

HashiCorp Nomad Server by HTTP

Overview

This template is designed for monitoring HashiCorp Nomad servers with Zabbix. It works without any external scripts.

Requirements

Zabbix version: 7.0 and higher.

Tested versions

This template has been tested on:

HashiCorp Nomad version 1.5.6/1.6.0

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

1. Enable telemetry in the HashiCorp Nomad agent configuration file and set the Prometheus metrics format.

Refer to the vendor documentation.

2. Set the values for the {$NOMAD.SERVER.API.SCHEME} and {$NOMAD.SERVER.API.PORT} macros to define the common Nomad API web scheme and connection port.

Additional information:

The Nomad servers use the default web schema - HTTP and default API port - 4646. If you're using servers discovery and you need to re-define macros for the particular host created from prototype, use the context macros like {{$NOMAD.SERVER.API.SCHEME:NECESSARY.IP}} or/and {{$NOMAD.SERVER.API.PORT:NECESSARY.IP}} on master host or template level.
Some metrics may not be collected depending on your HashiCorp Nomad agent version, configuration and cluster role.
Set the {$NOMAD.REDUNDANCY.MIN} macro value based on the number of servers in your cluster so that the failure tolerance triggers work correctly.
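Raft needs a majority of servers, floor(n/2) + 1, to elect a leader and commit writes. An n-server cluster can therefore lose n minus that majority and keep working. Below is a minimal Python sketch of the arithmetic, assuming every server is a voting member. It reproduces the macro's default of 1 for a 3-node cluster and shows that a fourth server adds no extra tolerance.

```python
def quorum(servers: int) -> int:
    """Votes needed for a Raft majority."""
    if servers < 1:
        raise ValueError("a cluster needs at least one server")
    return servers // 2 + 1


def failure_tolerance(servers: int) -> int:
    """Servers that can fail while a majority is still reachable."""
    return servers - quorum(servers)


for n in (1, 3, 4, 5, 7):
    print(f"{n} servers: quorum {quorum(n)}, failure tolerance {failure_tolerance(n)}")
```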

Useful links:

Macros used

Name	Description	Default
{$NOMAD.SERVER.API.SCHEME}	Nomad server API scheme.	`http`
{$NOMAD.SERVER.API.PORT}	Nomad server API port.	`4646`
{$NOMAD.TOKEN}	Nomad authentication token.	`<PUT YOUR AUTH TOKEN>`
{$NOMAD.DATA.TIMEOUT}	Response timeout for an API.	`15s`
{$NOMAD.HTTP.PROXY}	Sets the HTTP proxy for HTTP agent item. If this parameter is empty, then no proxy is used.
{$NOMAD.API.RESPONSE.SUCCESS}	HTTP API successful response code. Availability triggers threshold. Change, if needed.	`200`
{$NOMAD.SERVER.RPC.PORT}	Nomad RPC service port.	`4647`
{$NOMAD.SERVER.SERF.PORT}	Nomad Serf service port.	`4648`
{$NOMAD.REDUNDANCY.MIN}	Number of redundant servers required to keep the cluster safe. The default value of '1' suits a 3-node cluster; change it if needed.	`1`
{$NOMAD.OPEN.FDS.MAX}	Maximum percentage of used file descriptors.	`90`
{$NOMAD.SERVER.LEADER.LATENCY}	Leader last contact latency threshold.	`0.3s`

Items

Name	Description	Type	Key and additional info
Telemetry get	Telemetry data in raw format.	HTTP agent	nomad.server.data.get Preprocessing Check for not supported value: `any error` ⛔️Custom on fail: Set value to: `{"header":{"HTTP/1.1 408 Request timeout":""}}`
Metrics	Nomad server metrics in raw format.	Dependent item	nomad.server.metrics.get Preprocessing JSON Path: `$.body` ⛔️Custom on fail: Discard value
Monitoring API response	Monitoring API response message.	Dependent item	nomad.server.data.api.response Preprocessing JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`
Internal stats get	Internal stats data in raw format.	HTTP agent	nomad.server.stats.get Preprocessing Check for not supported value: `any error` ⛔️Custom on fail: Set value to: `{"header":{"HTTP/1.1 408 Request timeout":""}}`
Internal stats API response	Internal stats API response message.	Dependent item	nomad.server.stats.api.response Preprocessing JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`
Nomad server version	Nomad server version.	Dependent item	nomad.server.version Preprocessing JSON Path: `$.body.config.Version.Version`
Nomad raft version	Nomad raft version.	Dependent item	nomad.raft.version Preprocessing JSON Path: `$.body.stats.raft.protocol_version` ⛔️Custom on fail: Discard value
Raft peers	Current number of Raft peers in the cluster.	Dependent item	nomad.server.raft.peers Preprocessing JSON Path: `$.body.stats.raft.num_peers` ⛔️Custom on fail: Discard value
Cluster role	Current role in the cluster.	Dependent item	nomad.server.raft.cluster_role Preprocessing JSON Path: `$.body.stats.raft.state` ⛔️Custom on fail: Discard value JavaScript: `The text is too long. Please see the template.`
CPU time, rate	Total user and system CPU time spent in seconds.	Dependent item	nomad.server.cpu.time Preprocessing Prometheus pattern: `VALUE(process_cpu_seconds_total)` ⛔️Custom on fail: Discard value Change per second
Memory used	Memory utilization in bytes.	Dependent item	nomad.server.runtime.alloc_bytes Preprocessing Prometheus pattern: `VALUE(nomad_runtime_alloc_bytes)` ⛔️Custom on fail: Discard value
Virtual memory size	Virtual memory size in bytes.	Dependent item	nomad.server.virtual_memory_bytes Preprocessing Prometheus pattern: `VALUE(process_virtual_memory_bytes)` ⛔️Custom on fail: Discard value
Resident memory size	Resident memory size in bytes.	Dependent item	nomad.server.resident_memory_bytes Preprocessing Prometheus pattern: `VALUE(process_resident_memory_bytes)` ⛔️Custom on fail: Discard value
Heap objects	Number of objects on the heap. General memory pressure indicator.	Dependent item	nomad.server.runtime.heap_objects Preprocessing Prometheus pattern: `VALUE(nomad_runtime_heap_objects)` ⛔️Custom on fail: Discard value
Open file descriptors	Number of open file descriptors.	Dependent item	nomad.server.process_open_fds Preprocessing Prometheus pattern: `VALUE(process_open_fds)` ⛔️Custom on fail: Discard value
Open file descriptors, max	Maximum number of open file descriptors.	Dependent item	nomad.server.process_max_fds Preprocessing Prometheus pattern: `VALUE(process_max_fds)` ⛔️Custom on fail: Discard value
Goroutines	Number of goroutines. General load pressure indicator.	Dependent item	nomad.server.runtime.num_goroutines Preprocessing Prometheus pattern: `VALUE(nomad_runtime_num_goroutines)` ⛔️Custom on fail: Discard value
Evaluations pending	Evaluations that are pending until an existing evaluation for the same job completes.	Dependent item	nomad.server.broker.total_pending Preprocessing Prometheus pattern: `VALUE(nomad_nomad_broker_total_pending)` ⛔️Custom on fail: Discard value
Evaluations ready	Number of evaluations ready to be processed.	Dependent item	nomad.server.broker.total_ready Preprocessing Prometheus pattern: `VALUE(nomad_nomad_broker_total_ready)` ⛔️Custom on fail: Discard value
Evaluations unacked	Evaluations dispatched for processing but incomplete.	Dependent item	nomad.server.broker.total_unacked Preprocessing Prometheus pattern: `VALUE(nomad_nomad_broker_total_unacked)` ⛔️Custom on fail: Discard value
CPU shares for blocked evaluations	Amount of CPU shares requested by blocked evals.	Dependent item	nomad.server.blocked_evals.cpu Preprocessing Prometheus pattern: `VALUE(nomad_nomad_blocked_evals_cpu)` ⛔️Custom on fail: Discard value
Memory shares by blocked evaluations	Amount of memory requested by blocked evals.	Dependent item	nomad.server.blocked_evals.memory Preprocessing Prometheus pattern: `VALUE(nomad_nomad_blocked_evals_memory)` ⛔️Custom on fail: Discard value
CPU shares for blocked job evaluations	Amount of CPU shares requested by blocked evals of a job.	Dependent item	nomad.server.blocked_evals.job.cpu Preprocessing Prometheus pattern: `VALUE(nomad_nomad_blocked_evals_job_cpu)` ⛔️Custom on fail: Discard value
Memory shares for blocked job evaluations	Amount of memory requested by blocked evals of a job.	Dependent item	nomad.server.blocked_evals.job.memory Preprocessing Prometheus pattern: `VALUE(nomad_nomad_blocked_evals_job_memory)` ⛔️Custom on fail: Discard value
Evaluations blocked	Count of evals in the blocked state for any reason (cluster resource exhaustion or quota limits).	Dependent item	nomad.server.blocked_evals.total_blocked Preprocessing Prometheus pattern: `VALUE(nomad_nomad_blocked_evals_total_blocked)` ⛔️Custom on fail: Discard value
Evaluations escaped	Count of evals that have escaped computed node classes. This indicates a scheduler optimization was skipped and is not usually a source of concern.	Dependent item	nomad.server.blocked_evals.total_escaped Preprocessing Prometheus pattern: `VALUE(nomad_nomad_blocked_evals_total_escaped)` ⛔️Custom on fail: Discard value
Evaluations waiting	Count of evals waiting to be enqueued.	Dependent item	nomad.server.broker.total_waiting Preprocessing Prometheus pattern: `VALUE(nomad_nomad_broker_total_waiting)` ⛔️Custom on fail: Discard value
Evaluations blocked due to quota limit	Count of blocked evals due to quota limits (the resources for these jobs are not counted in other blocked_evals metrics, except for total_blocked).	Dependent item	nomad.server.blocked_evals.total_quota_limit Preprocessing Prometheus pattern: `VALUE(nomad_nomad_blocked_evals_total_quota_limit)` ⛔️Custom on fail: Discard value
Evaluations enqueue time	Average time elapsed with evaluations waiting to be enqueued.	Dependent item	nomad.server.broker.eval_waiting Preprocessing Prometheus pattern: `AVG(nomad_nomad_eval_ack_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
RPC evaluation acknowledgement time	Time elapsed for Eval.Ack RPC call.	Dependent item	nomad.server.eval.ack Preprocessing Prometheus pattern: `VALUE(nomad_nomad_eval_ack_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
RPC job summary time	Time elapsed for Job.Summary RPC call.	Dependent item	nomad.server.job_summary.get_job_summary Preprocessing Prometheus pattern: `VALUE(nomad_nomad_job_summary_get_job_summary_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
Heartbeats active	Number of active heartbeat timers. Each timer represents a Nomad client connection.	Dependent item	nomad.server.heartbeat.active Preprocessing Prometheus pattern: `VALUE(nomad_nomad_heartbeat_active)` ⛔️Custom on fail: Discard value
RPC requests, rate	Number of RPC requests being handled.	Dependent item	nomad.server.rpc.request Preprocessing Prometheus pattern: `VALUE(nomad_nomad_rpc_request)` ⛔️Custom on fail: Discard value Change per second
RPC error requests, rate	Number of RPC requests being handled that result in an error.	Dependent item	nomad.server.rpc.request_error Preprocessing Prometheus pattern: `VALUE(nomad_nomad_rpc_request_error)` ⛔️Custom on fail: Discard value Change per second
RPC queries, rate	Number of RPC queries.	Dependent item	nomad.server.rpc.query Preprocessing Prometheus pattern: `VALUE(nomad_nomad_rpc_query)` ⛔️Custom on fail: Discard value Change per second
RPC job allocations time	Time elapsed for Job.Allocations RPC call.	Dependent item	nomad.server.job.allocations Preprocessing Prometheus pattern: `VALUE(nomad_nomad_job_allocations_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
RPC job evaluations time	Time elapsed for Job.Evaluations RPC call.	Dependent item	nomad.server.job.evaluations Preprocessing Prometheus pattern: `VALUE(nomad_nomad_job_evaluations_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
RPC get job time	Time elapsed for Job.GetJob RPC call.	Dependent item	nomad.server.job.get_job Preprocessing Prometheus pattern: `VALUE(nomad_nomad_job_get_job_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
Plan apply time	Time elapsed to apply a plan.	Dependent item	nomad.server.plan.apply Preprocessing Prometheus pattern: `VALUE(nomad_nomad_plan_apply_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
Plan evaluate time	Time elapsed to evaluate a plan.	Dependent item	nomad.server.plan.evaluate Preprocessing Prometheus pattern: `VALUE(nomad_nomad_plan_evaluate_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
RPC plan submit time	Time elapsed for Plan.Submit RPC call.	Dependent item	nomad.server.plan.submit Preprocessing Prometheus pattern: `VALUE(nomad_nomad_plan_submit_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
Plan raft index processing time	Time elapsed that planner waits for the raft index of the plan to be processed.	Dependent item	nomad.server.plan.wait_for_index Preprocessing Prometheus pattern: `VALUE(nomad_nomad_plan_wait_for_index_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
RPC list time	Time elapsed for Node.List RPC call.	Dependent item	nomad.server.client.list Preprocessing Prometheus pattern: `VALUE(nomad_nomad_client_list_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
RPC update allocations time	Time elapsed for Node.UpdateAlloc RPC call.	Dependent item	nomad.server.client.update_alloc Preprocessing Prometheus pattern: `VALUE(nomad_nomad_client_update_alloc_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
RPC update status time	Time elapsed for Node.UpdateStatus RPC call.	Dependent item	nomad.server.client.update_status Preprocessing Prometheus pattern: `VALUE(nomad_nomad_client_update_status_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
RPC get client allocs time	Time elapsed for Node.GetClientAllocs RPC call.	Dependent item	nomad.server.client.get_client_allocs Preprocessing Prometheus pattern: `VALUE(nomad_nomad_client_get_client_allocs_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
RPC eval dequeue time	Time elapsed for Eval.Dequeue RPC call.	Dependent item	nomad.server.client.dequeue Preprocessing Prometheus pattern: `VALUE(nomad_nomad_eval_dequeue_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
Vault token last renewal	Time since last successful Vault token renewal.	Dependent item	nomad.server.vault.token_last_renewal Preprocessing Prometheus pattern: `VALUE(nomad_nomad_vault_token_last_renewal)` ⛔️Custom on fail: Discard value Custom multiplier: `0.001`
Vault token next renewal	Time until next Vault token renewal attempt.	Dependent item	nomad.server.vault.token_next_renewal Preprocessing Prometheus pattern: `VALUE(nomad_nomad_vault_token_next_renewal)` ⛔️Custom on fail: Discard value Custom multiplier: `0.001`
Vault token TTL	Time to live for Vault token.	Dependent item	nomad.server.vault.token_ttl Preprocessing Prometheus pattern: `VALUE(nomad_nomad_vault_token_ttl)` ⛔️Custom on fail: Discard value Custom multiplier: `0.001`
Vault tokens revoked	Count of revoked tokens.	Dependent item	nomad.server.vault.distributed_tokens_revoked Preprocessing Prometheus pattern: `VALUE(nomad_nomad_vault_distributed_tokens_revoking)` ⛔️Custom on fail: Discard value
Jobs dead	Number of dead jobs.	Dependent item	nomad.server.job_status.dead Preprocessing Prometheus pattern: `VALUE(nomad_nomad_job_status_dead)` ⛔️Custom on fail: Set value to: `0`
Jobs pending	Number of pending jobs.	Dependent item	nomad.server.job_status.pending Preprocessing Prometheus pattern: `VALUE(nomad_nomad_job_status_pending)` ⛔️Custom on fail: Set value to: `0`
Jobs running	Number of running jobs.	Dependent item	nomad.server.job_status.running Preprocessing Prometheus pattern: `VALUE(nomad_nomad_job_status_running)` ⛔️Custom on fail: Set value to: `0`
Job allocations completed	Total number of completed allocations across all jobs.	Dependent item	nomad.server.job_summary.complete Preprocessing Prometheus pattern: `SUM(nomad_nomad_job_summary_complete)` ⛔️Custom on fail: Set value to: `0`
Job allocations failed	Total number of failed allocations across all jobs.	Dependent item	nomad.server.job_summary.failed Preprocessing Prometheus pattern: `SUM(nomad_nomad_job_summary_failed)` ⛔️Custom on fail: Set value to: `0`
Job allocations lost	Total number of lost allocations across all jobs.	Dependent item	nomad.server.job_summary.lost Preprocessing Prometheus pattern: `SUM(nomad_nomad_job_summary_lost)` ⛔️Custom on fail: Set value to: `0`
Job allocations unknown	Total number of allocations in an unknown state across all jobs.	Dependent item	nomad.server.job_summary.unknown Preprocessing Prometheus pattern: `SUM(nomad_nomad_job_summary_unknown)` ⛔️Custom on fail: Set value to: `0`
Job allocations queued	Total number of queued allocations across all jobs.	Dependent item	nomad.server.job_summary.queued Preprocessing Prometheus pattern: `SUM(nomad_nomad_job_summary_queued)` ⛔️Custom on fail: Set value to: `0`
Job allocations running	Total number of running allocations across all jobs.	Dependent item	nomad.server.job_summary.running Preprocessing Prometheus pattern: `SUM(nomad_nomad_job_summary_running)` ⛔️Custom on fail: Set value to: `0`
Job allocations starting	Total number of starting allocations across all jobs.	Dependent item	nomad.server.job_summary.starting Preprocessing Prometheus pattern: `SUM(nomad_nomad_job_summary_starting)` ⛔️Custom on fail: Set value to: `0`
Gossip time	Time elapsed to broadcast gossip messages.	Dependent item	nomad.server.memberlist.gossip Preprocessing Prometheus pattern: `VALUE(nomad_memberlist_gossip_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
Leader barrier time	Time elapsed to establish a raft barrier during leader transition.	Dependent item	nomad.server.leader.barrier Preprocessing Prometheus pattern: `VALUE(nomad_nomad_leader_barrier_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
Reconcile peer time	Time elapsed to reconcile a serf peer with state store.	Dependent item	nomad.server.leader.reconcile_member Preprocessing Prometheus pattern: `VALUE(nomad_nomad_leader_reconcileMember_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
Total reconcile time	Time elapsed to reconcile all serf peers with state store.	Dependent item	nomad.server.leader.reconcile Preprocessing Prometheus pattern: `VALUE(nomad_nomad_leader_reconcile_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
Leader last contact	Time since last contact with the leader. General indicator of Raft latency.	Dependent item	nomad.server.raft.leader.lastContact Preprocessing Prometheus pattern: `VALUE(nomad_raft_leader_lastContact{quantile="0.99"})` ⛔️Custom on fail: Discard value Replace: `NaN -> 0` Custom multiplier: `0.001`
Plan queue	Count of evals in the plan queue.	Dependent item	nomad.server.plan.queue_depth Preprocessing Prometheus pattern: `VALUE(nomad_nomad_plan_queue_depth)` ⛔️Custom on fail: Discard value
Worker evaluation create time	Time elapsed for worker to create an eval.	Dependent item	nomad.server.worker.create_eval Preprocessing Prometheus pattern: `VALUE(nomad_nomad_worker_create_eval_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
Worker evaluation dequeue time	Time elapsed for worker to dequeue an eval.	Dependent item	nomad.server.worker.dequeue_eval Preprocessing Prometheus pattern: `VALUE(nomad_nomad_worker_dequeue_eval_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
Worker invoke scheduler time	Time elapsed for worker to invoke the scheduler.	Dependent item	nomad.server.worker.invoke_scheduler_service Preprocessing Prometheus pattern: `VALUE(nomad_nomad_worker_invoke_scheduler_service_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
Worker acknowledgement send time	Time elapsed for worker to send acknowledgement.	Dependent item	nomad.server.worker.send_ack Preprocessing Prometheus pattern: `VALUE(nomad_nomad_worker_send_ack_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
Worker submit plan time	Time elapsed for worker to submit plan.	Dependent item	nomad.server.worker.submit_plan Preprocessing Prometheus pattern: `VALUE(nomad_nomad_worker_submit_plan_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
Worker update evaluation time	Time elapsed for worker to submit updated eval.	Dependent item	nomad.server.worker.update_eval Preprocessing Prometheus pattern: `VALUE(nomad_nomad_worker_update_eval_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
Worker log replication time	Time elapsed that worker waits for the raft index of the eval to be processed.	Dependent item	nomad.server.worker.wait_for_index Preprocessing Prometheus pattern: `VALUE(nomad_nomad_worker_wait_for_index_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
Raft calls blocked, rate	Count of blocking raft API calls.	Dependent item	nomad.server.raft.barrier Preprocessing Prometheus pattern: `VALUE(nomad_raft_barrier)` ⛔️Custom on fail: Discard value Change per second
Raft commit logs enqueued	Count of logs enqueued.	Dependent item	nomad.server.raft.commit_num_logs Preprocessing Prometheus pattern: `VALUE(nomad_raft_commitNumLogs)` ⛔️Custom on fail: Discard value
Raft transactions, rate	Number of Raft transactions.	Dependent item	nomad.server.raft.apply Preprocessing Prometheus pattern: `VALUE(nomad_raft_apply)` ⛔️Custom on fail: Set value to: `0` Change per second
Raft commit time	Time elapsed to commit writes.	Dependent item	nomad.server.raft.commit_time Preprocessing Prometheus pattern: `VALUE(nomad_raft_commitTime_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
Raft transaction commit time	Raft transaction commit time.	Dependent item	nomad.server.raft.replication.appendEntries Preprocessing Prometheus pattern: `AVG(nomad_raft_replication_appendEntries_rpc)` ⛔️Custom on fail: Discard value Custom multiplier: `0.001`
FSM apply time	Time elapsed to apply write to FSM.	Dependent item	nomad.server.raft.fsm.apply Preprocessing Prometheus pattern: `VALUE(nomad_raft_fsm_apply_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
FSM enqueue time	Time elapsed to enqueue write to FSM.	Dependent item	nomad.server.raft.fsm.enqueue Preprocessing Prometheus pattern: `VALUE(nomad_raft_fsm_enqueue_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
FSM autopilot time	Time elapsed to apply Autopilot raft entry.	Dependent item	nomad.server.raft.fsm.autopilot Preprocessing Prometheus pattern: `VALUE(nomad_nomad_fsm_autopilot_sum)` ⛔️Custom on fail: Set value to: `0` Custom multiplier: `1e-09`
FSM register node time	Time elapsed to apply RegisterNode raft entry.	Dependent item	nomad.server.raft.fsm.register_node Preprocessing Prometheus pattern: `VALUE(nomad_nomad_fsm_register_node_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
FSM index	Current index applied to FSM.	Dependent item	nomad.server.raft.applied_index Preprocessing Prometheus pattern: `VALUE(nomad_raft_appliedIndex)` ⛔️Custom on fail: Discard value
Raft last index	Most recent index seen.	Dependent item	nomad.server.raft.last_index Preprocessing Prometheus pattern: `VALUE(nomad_raft_lastIndex)` ⛔️Custom on fail: Discard value
Dispatch log time	Time elapsed to write log, mark in flight, and start replication.	Dependent item	nomad.server.raft.leader.dispatch_log Preprocessing Prometheus pattern: `VALUE(nomad_raft_leader_dispatchLog_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
Logs dispatched	Count of logs dispatched.	Dependent item	nomad.server.raft.leader.dispatch_num_logs Preprocessing Prometheus pattern: `VALUE(nomad_raft_leader_dispatchNumLogs)` ⛔️Custom on fail: Set value to: `0`
Heartbeat fails	Number of times a heartbeat timeout caused this server to start an election.	Dependent item	nomad.server.raft.transition.heartbeat_timeout Preprocessing Prometheus pattern: `VALUE(nomad_raft_transition_heartbeat_timeout)` ⛔️Custom on fail: Set value to: `0` Discard unchanged with heartbeat: `1h`
Objects freed, rate	Count of objects freed from heap by go runtime GC.	Dependent item	nomad.server.runtime.free_count Preprocessing Prometheus pattern: `VALUE(nomad_runtime_free_count)` ⛔️Custom on fail: Discard value Change per second
GC pause time	Go runtime GC pause times.	Dependent item	nomad.server.runtime.gc_pause_ns Preprocessing Prometheus pattern: `VALUE(nomad_runtime_gc_pause_ns_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
GC metadata size	Go runtime GC metadata size in bytes.	Dependent item	nomad.server.runtime.sys_bytes Preprocessing Prometheus pattern: `VALUE(nomad_runtime_sys_bytes)` ⛔️Custom on fail: Discard value
GC runs	Count of go runtime GC runs.	Dependent item	nomad.server.runtime.total_gc_runs Preprocessing Prometheus pattern: `VALUE(nomad_runtime_total_gc_runs)` ⛔️Custom on fail: Discard value
Memberlist events	Count of memberlist events received.	Dependent item	nomad.server.serf.queue.event Preprocessing Prometheus pattern: `VALUE(nomad_serf_queue_Event_sum)` ⛔️Custom on fail: Discard value
Memberlist changes	Count of memberlist changes.	Dependent item	nomad.server.serf.queue.intent Preprocessing Prometheus pattern: `VALUE(nomad_serf_queue_Intent_sum)` ⛔️Custom on fail: Discard value
Memberlist queries	Count of memberlist queries.	Dependent item	nomad.server.serf.queue.queries Preprocessing Prometheus pattern: `VALUE(nomad_serf_queue_Query_sum)` ⛔️Custom on fail: Discard value
Snapshot index	Current snapshot index.	Dependent item	nomad.server.state.snapshot.index Preprocessing Prometheus pattern: `VALUE(nomad_state_snapshotIndex)` ⛔️Custom on fail: Discard value
Services ready to schedule	Count of service evals ready to be scheduled.	Dependent item	nomad.server.broker.service_ready Preprocessing Prometheus pattern: `VALUE(nomad_nomad_broker_service_ready)` ⛔️Custom on fail: Discard value
Services unacknowledged	Count of unacknowledged service evals.	Dependent item	nomad.server.broker.service_unacked Preprocessing Prometheus pattern: `VALUE(nomad_nomad_broker_service_unacked)` ⛔️Custom on fail: Discard value
System evaluations ready to schedule	Count of system evals ready to be scheduled.	Dependent item	nomad.server.broker.system_ready Preprocessing Prometheus pattern: `VALUE(nomad_nomad_broker_system_ready)` ⛔️Custom on fail: Discard value
System evaluations unacknowledged	Count of unacknowledged system evals.	Dependent item	nomad.server.broker.system_unacked Preprocessing Prometheus pattern: `VALUE(nomad_nomad_broker_system_unacked)` ⛔️Custom on fail: Discard value
BoltDB free pages	Number of BoltDB free pages.	Dependent item	nomad.server.raft.boltdb.num_free_pages Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_numFreePages)` ⛔️Custom on fail: Discard value
BoltDB pending pages	Number of BoltDB pending pages.	Dependent item	nomad.server.raft.boltdb.num_pending_pages Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_numPendingPages)` ⛔️Custom on fail: Discard value
BoltDB free page bytes	Number of free page bytes.	Dependent item	nomad.server.raft.boltdb.free_page_bytes Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_freePageBytes)` ⛔️Custom on fail: Discard value
BoltDB freelist bytes	Number of freelist bytes.	Dependent item	nomad.server.raft.boltdb.freelist_bytes Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_freelistBytes)` ⛔️Custom on fail: Discard value
BoltDB read transactions, rate	Count of total read transactions.	Dependent item	nomad.server.raft.boltdb.total_read_txn Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_totalReadTxn)` ⛔️Custom on fail: Discard value Change per second
BoltDB open read transactions	Number of current open read transactions.	Dependent item	nomad.server.raft.boltdb.open_read_txn Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_openReadTxn)` ⛔️Custom on fail: Discard value
BoltDB pages in use	Number of pages in use.	Dependent item	nomad.server.raft.boltdb.txstats.page_count Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_pageCount)` ⛔️Custom on fail: Discard value
BoltDB page allocations, rate	Number of page allocations.	Dependent item	nomad.server.raft.boltdb.txstats.page_alloc Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_pageAlloc)` ⛔️Custom on fail: Discard value Change per second
BoltDB cursors	Count of total database cursors.	Dependent item	nomad.server.raft.boltdb.txstats.cursor_count Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_cursorCount)` ⛔️Custom on fail: Discard value Change per second
BoltDB nodes, rate	Count of total database nodes.	Dependent item	nomad.server.raft.boltdb.txstats.node_count Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_nodeCount)` ⛔️Custom on fail: Discard value Change per second
BoltDB node dereferences, rate	Count of total database node dereferences.	Dependent item	nomad.server.raft.boltdb.txstats.node_deref Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_nodeDeref)` ⛔️Custom on fail: Discard value Change per second
BoltDB rebalance operations, rate	Count of total rebalance operations.	Dependent item	nomad.server.raft.boltdb.txstats.rebalance Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_rebalance)` ⛔️Custom on fail: Discard value Change per second
BoltDB split operations, rate	Count of total split operations.	Dependent item	nomad.server.raft.boltdb.txstats.split Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_split)` ⛔️Custom on fail: Discard value Change per second
BoltDB spill operations, rate	Count of total spill operations.	Dependent item	nomad.server.raft.boltdb.txstats.spill Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_spill)` ⛔️Custom on fail: Discard value Change per second
BoltDB write operations, rate	Count of total write operations.	Dependent item	nomad.server.raft.boltdb.txstats.write Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_write)` ⛔️Custom on fail: Discard value Change per second
BoltDB rebalance time	Sample of rebalance operation times.	Dependent item	nomad.server.raft.boltdb.txstats.rebalance_time Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_rebalanceTime_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
BoltDB spill time	Sample of spill operation times.	Dependent item	nomad.server.raft.boltdb.txstats.spill_time Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_spillTime_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
BoltDB write time	Sample of write operation times.	Dependent item	nomad.server.raft.boltdb.txstats.write_time Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_writeTime_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
Service [rpc] state	Current [rpc] service state.	Simple check	net.tcp.service[tcp,,{$NOMAD.SERVER.RPC.PORT}] Preprocessing Discard unchanged with heartbeat: `1h`
Service [serf] state	Current [serf] service state.	Simple check	net.tcp.service[tcp,,{$NOMAD.SERVER.SERF.PORT}] Preprocessing Discard unchanged with heartbeat: `1h`
Namespace list time	Time elapsed for Namespace.ListNamespaces.	Dependent item	nomad.server.namespace.list_namespace Preprocessing Prometheus pattern: `VALUE(nomad_nomad_namespace_list_namespace_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
Autopilot state	Current autopilot state.	Dependent item	nomad.server.autopilot.state Preprocessing Prometheus pattern: `VALUE(nomad_nomad_autopilot_healthy)` ⛔️Custom on fail: Discard value
Autopilot failure tolerance	The number of redundant healthy servers that can fail without causing an outage.	Dependent item	nomad.server.autopilot.failure_tolerance Preprocessing Prometheus pattern: `VALUE(nomad_nomad_autopilot_failure_tolerance)` ⛔️Custom on fail: Discard value
FSM allocation client update time	Time elapsed to apply AllocClientUpdate raft entry.	Dependent item	nomad.server.alloc_client_update Preprocessing Prometheus pattern: `VALUE(nomad_nomad_fsm_alloc_client_update_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
FSM apply plan results time	Time elapsed to apply ApplyPlanResults raft entry.	Dependent item	nomad.server.fsm.apply_plan_results Preprocessing Prometheus pattern: `VALUE(nomad_nomad_fsm_apply_plan_results_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
FSM update evaluation time	Time elapsed to apply UpdateEval raft entry.	Dependent item	nomad.server.fsm.update_eval Preprocessing Prometheus pattern: `VALUE(nomad_nomad_fsm_update_eval_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
FSM job registration time	Time elapsed to apply RegisterJob raft entry.	Dependent item	nomad.server.fsm.register_job Preprocessing Prometheus pattern: `VALUE(nomad_nomad_fsm_register_job_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
Allocation reschedule attempts	Count of attempts to reschedule an allocation.	Dependent item	nomad.server.scheduler.allocs.rescheduled.attempted Preprocessing Prometheus pattern: `SUM(nomad_scheduler_allocs_reschedule_attempted)` ⛔️Custom on fail: Set value to: `0`
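Most items above are dependent items that pick one metric out of the raw Prometheus text and then apply a function and a custom multiplier. Below is a minimal Python sketch of that pipeline. It is a simplified emulation, not Zabbix's implementation. The parser handles only plain sample lines. The sketch's `VALUE` treats more than one match as an error. Its "Change per second" discards the value when a counter decreases, which is an assumption about how resets are handled. The sample scrape text, including its label names, is hypothetical.

```python
import math
import re
from typing import Dict, List, Optional, Tuple

# Simplified line grammar: name{label="value",...} value [timestamp]
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

Sample = Tuple[str, Dict[str, str], float]


def parse(text: str) -> List[Sample]:
    """Parse Prometheus text exposition lines into (name, labels, value)."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))  # float() accepts NaN and +Inf
    return samples


def select(samples: List[Sample], name: str, **labels: str) -> List[float]:
    """Values of samples with this name whose labels include every given pair."""
    return [
        value
        for sample_name, sample_labels, value in samples
        if sample_name == name and all(sample_labels.get(k) == v for k, v in labels.items())
    ]


def pattern_value(samples: List[Sample], name: str, **labels: str) -> float:
    """Like VALUE(...): in this sketch exactly one sample must match."""
    matches = select(samples, name, **labels)
    if len(matches) != 1:
        raise ValueError(f"{name}: expected 1 sample, found {len(matches)}")
    return matches[0]


def pattern_sum(samples: List[Sample], name: str, **labels: str) -> float:
    """Like SUM(...): add up every matching sample."""
    return sum(select(samples, name, **labels))


def pattern_avg(samples: List[Sample], name: str, **labels: str) -> float:
    """Like AVG(...): mean of every matching sample."""
    matches = select(samples, name, **labels)
    if not matches:
        raise ValueError(f"{name}: no samples")
    return sum(matches) / len(matches)


def change_per_second(prev: Tuple[float, float], curr: Tuple[float, float]) -> Optional[float]:
    """(value - previous value) / (time - previous time) for (timestamp, value) pairs."""
    (t0, v0), (t1, v1) = prev, curr
    if v1 < v0 or t1 <= t0:
        return None  # assumption: a counter reset discards this value
    return (v1 - v0) / (t1 - t0)


# Hypothetical scrape output; label names and values are illustrative only.
SAMPLE = """
# TYPE nomad_nomad_job_summary_running gauge
nomad_nomad_job_summary_running{job="web",task_group="app"} 2
nomad_nomad_job_summary_running{job="db",task_group="pg"} 1
nomad_raft_leader_lastContact{quantile="0.99"} NaN
nomad_nomad_plan_apply_sum 1500000
"""

samples = parse(SAMPLE)
running = pattern_sum(samples, "nomad_nomad_job_summary_running")           # SUM(...)
plan_apply = pattern_value(samples, "nomad_nomad_plan_apply_sum") * 1e-09   # custom multiplier
last_contact = pattern_value(samples, "nomad_raft_leader_lastContact", quantile="0.99")
last_contact = 0.0 if math.isnan(last_contact) else last_contact * 0.001    # NaN -> 0, then multiplier
```

`VALUE` fits single-series metrics such as `nomad_nomad_plan_apply_sum`. `SUM` fits the per-job `job_summary` series, which is why those items report cluster-wide totals.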

Triggers

Name	Description	Expression	Severity	Dependencies and additional info
Monitoring API connection has failed	Monitoring API connection has failed. Ensure that the Nomad API URL and the necessary permissions are defined correctly, and check the service state and network connectivity between Nomad and Zabbix.	`find(/HashiCorp Nomad Server by HTTP/nomad.server.data.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0`	Average	Manual close: Yes
Internal stats API connection has failed	Internal stats API connection has failed. Ensure that the Nomad API URL and the necessary permissions are defined correctly, and check the service state and network connectivity between Nomad and Zabbix.	`find(/HashiCorp Nomad Server by HTTP/nomad.server.stats.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0`	Average	Manual close: Yes Depends on: Monitoring API connection has failed
Nomad server version has changed	Nomad server version has changed.	`change(/HashiCorp Nomad Server by HTTP/nomad.server.version)<>0`	Info	Manual close: Yes
Cluster role has changed	Cluster role has changed.	`change(/HashiCorp Nomad Server by HTTP/nomad.server.raft.cluster_role) <> 0`	Info	Manual close: Yes
Current number of open files is too high	Heavy file descriptor usage (i.e., near the process file descriptor limit) indicates a potential file descriptor exhaustion issue.	`min(/HashiCorp Nomad Server by HTTP/nomad.server.process_open_fds,5m)/last(/HashiCorp Nomad Server by HTTP/nomad.server.process_max_fds)*100>{$NOMAD.OPEN.FDS.MAX}`	Warning
Dead jobs found	Jobs in the `Dead` state have been found. Check the {$NOMAD.SERVER.API.SCHEME}://{HOST.IP}:{$NOMAD.SERVER.API.PORT}/v1/jobs URL for details.	`last(/HashiCorp Nomad Server by HTTP/nomad.server.job_status.dead) > 0 and nodata(/HashiCorp Nomad Server by HTTP/nomad.server.job_status.dead,5m) = 0`	Warning	Manual close: Yes
Leader last contact timeout exceeded	The nomad.raft.leader.lastContact metric is a general indicator of Raft latency, which can be used to observe how Raft timing is performing and to guide infrastructure provisioning. If this number trends upward, look at CPU, disk IOPS, and network latency. nomad.raft.leader.lastContact should not get too close to the leader lease timeout of 500ms.	`min(/HashiCorp Nomad Server by HTTP/nomad.server.raft.leader.lastContact,5m) >= {$NOMAD.SERVER.LEADER.LATENCY} and nodata(/HashiCorp Nomad Server by HTTP/nomad.server.raft.leader.lastContact,5m) = 0`	Warning
Service [rpc] is down	Cannot establish the connection to [rpc] service port {$NOMAD.SERVER.RPC.PORT}. Check the Nomad state and network connectivity between Nomad and Zabbix.	`last(/HashiCorp Nomad Server by HTTP/net.tcp.service[tcp,,{$NOMAD.SERVER.RPC.PORT}]) = 0`	Average	Manual close: Yes
Service [serf] is down	Cannot establish the connection to [serf] service port {$NOMAD.SERVER.SERF.PORT}. Check the Nomad state and network connectivity between Nomad and Zabbix.	`last(/HashiCorp Nomad Server by HTTP/net.tcp.service[tcp,,{$NOMAD.SERVER.SERF.PORT}]) = 0`	Average	Manual close: Yes
Autopilot is unhealthy	Autopilot is in an unhealthy state. The probability of a successful failover is extremely low.	`last(/HashiCorp Nomad Server by HTTP/nomad.server.autopilot.state) = 0 and nodata(/HashiCorp Nomad Server by HTTP/nomad.server.autopilot.state,5m) = 0`	Average	Manual close: Yes
Autopilot redundancy is low	Autopilot redundancy is low. The cluster is at high risk of failure if one more server fails.	`last(/HashiCorp Nomad Server by HTTP/nomad.server.autopilot.failure_tolerance) < {$NOMAD.REDUNDANCY.MIN} and nodata(/HashiCorp Nomad Server by HTTP/nomad.server.autopilot.failure_tolerance,5m) = 0`	Warning	Manual close: Yes
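
The open-files trigger above divides the lowest open-descriptor count seen over the last 5 minutes by the latest process limit. The Python sketch below reproduces that arithmetic outside Zabbix, assuming the samples from the 5-minute window are already available as a list. The function name is hypothetical, and the threshold you pass stands in for whatever value {$NOMAD.OPEN.FDS.MAX} has on your host (its default is not listed in this section).

```python
from typing import Sequence


def open_fds_trigger(open_fds_5m: Sequence[float], max_fds: float, threshold_pct: float) -> bool:
    """Approximate min(open_fds,5m) / last(max_fds) * 100 > threshold.

    Hypothetical helper: open_fds_5m holds the samples collected in the
    5-minute window, max_fds is the latest process_max_fds value.
    """
    if not open_fds_5m or max_fds <= 0:
        # Zabbix would typically leave the expression in an unknown state here;
        # this sketch simply reports "not firing".
        return False
    usage_pct = min(open_fds_5m) / max_fds * 100
    return usage_pct > threshold_pct


# A single dip below the threshold keeps the trigger from firing,
# because min() looks at the lowest value in the window.
print(open_fds_trigger([950, 980, 1000], 1024, 90))  # True  (about 92.8 %)
print(open_fds_trigger([500, 980, 1000], 1024, 90))  # False (about 48.8 %)
```

Using `min()` over the window means the problem is raised only when descriptor usage stayed above the threshold for the whole 5 minutes, which filters out short spikes.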

Feedback

Please report any issues with the template at https://support.zabbix.com

You can also provide feedback, discuss the template, or ask for help on the ZABBIX forums.

This template is for Zabbix version: 6.4

Also available for: 7.0 6.0

Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/nomad?at=release/6.4

HashiCorp Nomad by HTTP

Overview

This template is designed for monitoring HashiCorp Nomad with Zabbix. It works without any external scripts. Currently, the template supports discovery of Nomad servers and clients.

Requirements

Zabbix version: 6.4 and higher.

Tested versions

This template has been tested on:

HashiCorp Nomad version 1.5.6/1.6.0

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

1. Create a synthetic Nomad host. It should point to one of the Nomad cluster members, a load-balancing service (if a cluster is used), or a single node in the selected Nomad region.
2. Define the {$NOMAD.ENDPOINT.API.URL} macro value with the correct web protocol, host, and port.
3. Prepare an ACL token with the node:read, namespace:read-job, agent:read, and management permissions applied. Define the {$NOMAD.TOKEN} macro value.

Refer to the vendor documentation about Nomad native ACL or Nomad Vault-generated tokens if you have the HashiCorp Vault integration configured.

Additional information:

The synthetic Nomad host is used only as an endpoint for server and client discovery (general cluster information). It is not monitored as a Nomad server or client, which prevents duplicate entities.
If you're not using ACLs, skip the 3rd setup step.
Nomad server and client discovery is limited to a single region. If you're using a multi-region cluster, create one synthetic host per region.
The Nomad server and client templates can also be used separately. Feel free to use them if you prefer to create hosts manually.

Useful links

Macros used

Name	Description	Default
{$NOMAD.ENDPOINT.API.URL}	API endpoint URL for one of the Nomad cluster members.	`http://localhost:4646`
{$NOMAD.TOKEN}	Nomad authentication token.	`<PUT YOUR AUTH TOKEN>`
{$NOMAD.DATA.TIMEOUT}	Response timeout for an API.	`15s`
{$NOMAD.HTTP.PROXY}	Sets the HTTP proxy for script and HTTP agent items. If this parameter is empty, then no proxy is used.
{$NOMAD.API.RESPONSE.SUCCESS}	Successful HTTP API response code. Used as the availability trigger threshold. Change if needed.	`200`
{$NOMAD.SERVER.NAME.MATCHES}	The filter to include HashiCorp Nomad servers by name.	`.*`
{$NOMAD.SERVER.NAME.NOT_MATCHES}	The filter to exclude HashiCorp Nomad servers by name.	`CHANGE_IF_NEEDED`
{$NOMAD.SERVER.DC.MATCHES}	The filter to include HashiCorp Nomad servers by the datacenter they belong to.	`.*`
{$NOMAD.SERVER.DC.NOT_MATCHES}	The filter to exclude HashiCorp Nomad servers by the datacenter they belong to.	`CHANGE_IF_NEEDED`
{$NOMAD.CLIENT.NAME.MATCHES}	The filter to include HashiCorp Nomad clients by name.	`.*`
{$NOMAD.CLIENT.NAME.NOT_MATCHES}	The filter to exclude HashiCorp Nomad clients by name.	`CHANGE_IF_NEEDED`
{$NOMAD.CLIENT.DC.MATCHES}	The filter to include HashiCorp Nomad clients by the datacenter they belong to.	`.*`
{$NOMAD.CLIENT.DC.NOT_MATCHES}	The filter to exclude HashiCorp Nomad clients by the datacenter they belong to.	`CHANGE_IF_NEEDED`
{$NOMAD.CLIENT.SCHEDULE.ELIGIBILITY.MATCHES}	The filter to include HashiCorp Nomad clients by scheduling eligibility.	`.*`
{$NOMAD.CLIENT.SCHEDULE.ELIGIBILITY.NOT_MATCHES}	The filter to exclude HashiCorp Nomad clients by scheduling eligibility.	`CHANGE_IF_NEEDED`
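
Before linking the template, it can help to check the endpoint and token manually. The Python sketch below builds the same kind of authenticated request, assuming Nomad's standard `X-Nomad-Token` header and the `/v1/nodes` endpoint; `nomad_request` and `fetch_json` are hypothetical helper names, and the 15-second timeout simply mirrors the {$NOMAD.DATA.TIMEOUT} default.

```python
import json
import urllib.request
from typing import Any, Optional


def nomad_request(base_url: str, path: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request against the Nomad HTTP API (hypothetical helper)."""
    req = urllib.request.Request(base_url.rstrip("/") + path, method="GET")
    if token:
        # Nomad reads the ACL token from the X-Nomad-Token header.
        req.add_header("X-Nomad-Token", token)
    return req


def fetch_json(req: urllib.request.Request, timeout: float = 15.0) -> Any:
    """Send the request and decode the JSON body.

    urllib raises urllib.error.HTTPError for non-2xx responses.
    """
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

For example, `fetch_json(nomad_request("http://localhost:4646", "/v1/nodes", token))` returns the list of client nodes visible to the token; an `HTTPError` with code 403 usually indicates a missing ACL permission.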

Items

Name	Description	Type	Key and additional info
HashiCorp Nomad: Nomad clients get	Nomad clients data in raw format.	HTTP agent	nomad.client.nodes.get Preprocessing Check for not supported value ⛔️Custom on fail: Set value to: `{"header":{"HTTP/1.1 408 Request timeout":""}}`
HashiCorp Nomad: Client nodes API response	Client nodes API response message.	Dependent item	nomad.client.nodes.api.response Preprocessing JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`
HashiCorp Nomad: Nomad servers get	Nomad servers data in raw format.	Script	nomad.server.nodes.get
HashiCorp Nomad: Server-related APIs response	Server-related (`operator/raft/configuration`, `agent/members`) APIs error response message.	Dependent item	nomad.server.api.response Preprocessing JSON Path: `$.error` ⛔️Custom on fail: Set value to: `HTTP/1.1 200 OK` Discard unchanged with heartbeat: `1h`
HashiCorp Nomad: Region	Current cluster region.	Dependent item	nomad.region Preprocessing JSON Path: `$..region.first()`
HashiCorp Nomad: Nomad servers count	Nomad servers count.	Dependent item	nomad.servers.count Preprocessing JSON Path: `$[?(@.Name)].length()`
HashiCorp Nomad: Nomad clients count	Nomad clients count.	Dependent item	nomad.clients.count Preprocessing JSON Path: `$.body[?(@.Name)].length()`

Triggers

Name	Description	Expression	Severity	Dependencies and additional info
HashiCorp Nomad: Client nodes API connection has failed	Client nodes API connection has failed. Ensure that the Nomad API URL and the necessary permissions have been defined correctly, and check the service state and network connectivity between Nomad and Zabbix.	`find(/HashiCorp Nomad by HTTP/nomad.client.nodes.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0`	Average	Manual close: Yes
HashiCorp Nomad: Server-related API connection has failed	Server-related API connection has failed. Ensure that the Nomad API URL and the necessary permissions have been defined correctly, and check the service state and network connectivity between Nomad and Zabbix.	`find(/HashiCorp Nomad by HTTP/nomad.server.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0`	Average	Manual close: Yes
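
Both availability triggers use `find(...,"like",...)`, which performs a substring match on the latest response message, so a message such as `HTTP/1.1 200 OK` counts as success. The Python sketch below mimics that check. The function name is hypothetical, and it assumes the response message items contain the HTTP status line, as the `HTTP/1.1 200 OK` fallback value of the server-related item suggests.

```python
def api_connection_failed(response_message: str, success_code: str = "200") -> bool:
    """Return True when the trigger condition find(..., "like", code) = 0 holds.

    "like" performs a substring search, so any message containing the
    success code (the default mirrors {$NOMAD.API.RESPONSE.SUCCESS}) passes.
    """
    return success_code not in response_message


for message in ("HTTP/1.1 200 OK", "HTTP/1.1 403 Forbidden", "HTTP/1.1 408 Request timeout"):
    print(message, "->", "PROBLEM" if api_connection_failed(message) else "OK")
```

Because the match is a plain substring search, the 408 fallback value set on timeouts also raises the problem, which is why the raw "get" items replace unsupported values with that header.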

LLD rule Clients discovery

Name	Description	Type	Key and additional info
Clients discovery	Client nodes discovery.	Dependent item	nomad.clients.discovery Preprocessing JSON Path: `$.body` ⛔️Custom on fail: Discard value Discard unchanged with heartbeat: `1h`

LLD rule Servers discovery

Name	Description	Type	Key and additional info
Servers discovery	Server nodes discovery.	Dependent item	nomad.servers.discovery Preprocessing Check for error in JSON: `$.error` ⛔️Custom on fail: Discard value Discard unchanged with heartbeat: `1h`
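
Discovered servers and clients are filtered with the {$NOMAD.SERVER.*} and {$NOMAD.CLIENT.*} MATCHES/NOT_MATCHES macros from the macros table: an entity is kept when its value matches the include regex and does not match the exclude regex. A minimal Python sketch of that logic, assuming regex search semantics (a match anywhere in the string) and Nomad node fields such as `Name` and `Datacenter`; `lld_filter` is a hypothetical name.

```python
import re
from typing import Dict, List


def lld_filter(rows: List[Dict[str, str]], field: str,
               matches: str = ".*", not_matches: str = "CHANGE_IF_NEEDED") -> List[Dict[str, str]]:
    """Keep rows whose field matches `matches` and does not match `not_matches`."""
    include = re.compile(matches)
    exclude = re.compile(not_matches)
    return [row for row in rows
            if include.search(row[field]) and not exclude.search(row[field])]


nodes = [
    {"Name": "client-1", "Datacenter": "dc1"},
    {"Name": "client-2", "Datacenter": "dc2"},
]
# With the defaults nothing is excluded (no name contains "CHANGE_IF_NEEDED").
print([n["Name"] for n in lld_filter(nodes, "Name")])  # ['client-1', 'client-2']
# Excluding the dc2 datacenter, as {$NOMAD.CLIENT.DC.NOT_MATCHES} would.
print([n["Name"] for n in lld_filter(nodes, "Datacenter", not_matches="^dc2$")])  # ['client-1']
```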

HashiCorp Nomad Client by HTTP

Overview

This template is designed for monitoring HashiCorp Nomad clients with Zabbix. It works without any external scripts.

Requirements

Zabbix version: 6.4 and higher.

Tested versions

This template has been tested on:

HashiCorp Nomad version 1.5.6/1.6.0

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

1. Enable telemetry in the HashiCorp Nomad agent configuration file and set the Prometheus metrics format.

Refer to the vendor documentation.

2. Prepare an ACL token with the node:read and namespace:read-job permissions applied. Define the {$NOMAD.TOKEN} macro value.

Refer to the vendor documentation about Nomad native ACL or Nomad Vault-generated tokens if you're using the HashiCorp Vault integration.

3. Set the values of the {$NOMAD.CLIENT.API.SCHEME} and {$NOMAD.CLIENT.API.PORT} macros to define the common Nomad API web scheme and connection port.

Additional information:

You only need to prepare an additional ACL token if you want to monitor Nomad clients as separate entities. If you're using client discovery, the token will be inherited from the master host linked to the HashiCorp Nomad by HTTP template.
If you're not using ACLs, skip the 2nd setup step.
Nomad clients use the default web scheme (HTTP) and the default API port (4646). If you're using client discovery and need to redefine the macros for a particular host created from a prototype, use context macros such as {$NOMAD.CLIENT.API.SCHEME:NECESSARY.IP} and/or {$NOMAD.CLIENT.API.PORT:NECESSARY.IP} at the master host or template level.
Some metrics may not be collected depending on your HashiCorp Nomad agent version and configuration.
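
The scheme and port macros, together with their optional per-IP context variants, determine the URL each discovered client is polled on. The Python sketch below illustrates the lookup order (context value first, then the plain macro) in a simplified form: it ignores Zabbix's regex contexts and host/template inheritance, and it assumes the telemetry item polls Nomad's `/v1/metrics?format=prometheus` endpoint. The dictionary layout and function names are hypothetical.

```python
from typing import Dict, Optional


def resolve_macro(macros: Dict[str, str], name: str, context: Optional[str] = None) -> str:
    """Return {$NAME:context} if defined, otherwise fall back to {$NAME}."""
    if context is not None:
        contextual = f"{name}:{context}"
        if contextual in macros:
            return macros[contextual]
    return macros[name]


def telemetry_url(macros: Dict[str, str], host_ip: str) -> str:
    """Build the client telemetry URL for one discovered host."""
    scheme = resolve_macro(macros, "NOMAD.CLIENT.API.SCHEME", host_ip)
    port = resolve_macro(macros, "NOMAD.CLIENT.API.PORT", host_ip)
    return f"{scheme}://{host_ip}:{port}/v1/metrics?format=prometheus"


macros = {
    "NOMAD.CLIENT.API.SCHEME": "http",          # template default
    "NOMAD.CLIENT.API.PORT": "4646",            # template default
    "NOMAD.CLIENT.API.PORT:10.0.0.5": "14646",  # context override for one client
}
print(telemetry_url(macros, "10.0.0.5"))  # http://10.0.0.5:14646/v1/metrics?format=prometheus
print(telemetry_url(macros, "10.0.0.6"))  # http://10.0.0.6:4646/v1/metrics?format=prometheus
```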

Useful links:

Macros used

Name	Description	Default
{$NOMAD.CLIENT.API.SCHEME}	Nomad client API scheme.	`http`
{$NOMAD.CLIENT.API.PORT}	Nomad client API port.	`4646`
{$NOMAD.TOKEN}	Nomad authentication token.	`<PUT YOUR AUTH TOKEN>`
{$NOMAD.DATA.TIMEOUT}	Response timeout for an API.	`15s`
{$NOMAD.HTTP.PROXY}	Sets the HTTP proxy for HTTP agent item. If this parameter is empty, then no proxy is used.
{$NOMAD.API.RESPONSE.SUCCESS}	Successful HTTP API response code. Used as the availability trigger threshold. Change if needed.	`200`
{$NOMAD.CLIENT.RPC.PORT}	Nomad RPC service port.	`4647`
{$NOMAD.CLIENT.SERF.PORT}	Nomad serf service port.	`4648`
{$NOMAD.CLIENT.OPEN.FDS.MAX.WARN}	Maximum percentage of used file descriptors.	`90`
{$NOMAD.DISK.NAME.MATCHES}	The filter to include HashiCorp Nomad client disks by name.	`.*`
{$NOMAD.DISK.NAME.NOT_MATCHES}	The filter to exclude HashiCorp Nomad client disks by name.	`CHANGE_IF_NEEDED`
{$NOMAD.JOB.NAME.MATCHES}	The filter to include HashiCorp Nomad client jobs by name.	`.*`
{$NOMAD.JOB.NAME.NOT_MATCHES}	The filter to exclude HashiCorp Nomad client jobs by name.	`CHANGE_IF_NEEDED`
{$NOMAD.JOB.NAMESPACE.MATCHES}	The filter to include HashiCorp Nomad client jobs by namespace.	`.*`
{$NOMAD.JOB.NAMESPACE.NOT_MATCHES}	The filter to exclude HashiCorp Nomad client jobs by namespace.	`CHANGE_IF_NEEDED`
{$NOMAD.JOB.TYPE.MATCHES}	The filter to include HashiCorp Nomad client jobs by type.	`.*`
{$NOMAD.JOB.TYPE.NOT_MATCHES}	The filter to exclude HashiCorp Nomad client jobs by type.	`CHANGE_IF_NEEDED`
{$NOMAD.JOB.TASK.GROUP.MATCHES}	The filter to include HashiCorp Nomad client jobs by the task group they belong to.	`.*`
{$NOMAD.JOB.TASK.GROUP.NOT_MATCHES}	The filter to exclude HashiCorp Nomad client jobs by the task group they belong to.	`CHANGE_IF_NEEDED`
{$NOMAD.DRIVER.NAME.MATCHES}	The filter to include HashiCorp Nomad client drivers by name.	`.*`
{$NOMAD.DRIVER.NAME.NOT_MATCHES}	The filter to exclude HashiCorp Nomad client drivers by name.	`CHANGE_IF_NEEDED`
{$NOMAD.DRIVER.DETECT.MATCHES}	The filter to include HashiCorp Nomad client drivers by detection state. Possible filtering values: `true`, `false`.	`.*`
{$NOMAD.DRIVER.DETECT.NOT_MATCHES}	The filter to exclude HashiCorp Nomad client drivers by detection state. Possible filtering values: `true`, `false`.	`CHANGE_IF_NEEDED`
{$NOMAD.CPU.UTIL.MIN}	CPU utilization threshold. Measured as a percentage.	`90`
{$NOMAD.RAM.AVAIL.MIN}	Minimum available memory threshold. Measured as a percentage.	`5`
{$NOMAD.INODES.FREE.MIN.WARN}	Warning threshold of the filesystem metadata utilization. Measured as a percentage.	`20`
{$NOMAD.INODES.FREE.MIN.CRIT}	Critical threshold of the filesystem metadata utilization. Measured as a percentage.	`10`

Items

Name	Description	Type	Key and additional info
HashiCorp Nomad Client: Telemetry get	Telemetry data in raw format.	HTTP agent	nomad.client.data.get Preprocessing Check for not supported value ⛔️Custom on fail: Set value to: `{"header":{"HTTP/1.1 408 Request timeout":""}}`
HashiCorp Nomad Client: Metrics	Nomad client metrics in raw format.	Dependent item	nomad.client.metrics.get Preprocessing JSON Path: `$.body` ⛔️Custom on fail: Discard value
HashiCorp Nomad Client: Monitoring API response	Monitoring API response message.	Dependent item	nomad.client.data.api.response Preprocessing JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`
HashiCorp Nomad Client: Service [rpc] state	Current [rpc] service state.	Simple check	net.tcp.service[tcp,,{$NOMAD.CLIENT.RPC.PORT}] Preprocessing Discard unchanged with heartbeat: `1h`
HashiCorp Nomad Client: Service [serf] state	Current [serf] service state.	Simple check	net.tcp.service[tcp,,{$NOMAD.CLIENT.SERF.PORT}] Preprocessing Discard unchanged with heartbeat: `1h`
HashiCorp Nomad Client: CPU allocated	Total amount of CPU shares the scheduler has allocated to tasks.	Dependent item	nomad.client.allocated.cpu Preprocessing Prometheus pattern: `VALUE(nomad_client_allocated_cpu)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Client: CPU unallocated	Total amount of CPU shares free for the scheduler to allocate to tasks.	Dependent item	nomad.client.unallocated.cpu Preprocessing Prometheus pattern: `VALUE(nomad_client_unallocated_cpu)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Client: Memory allocated	Total amount of memory the scheduler has allocated to tasks.	Dependent item	nomad.client.allocated.memory Preprocessing Prometheus pattern: `VALUE(nomad_client_allocated_memory)` ⛔️Custom on fail: Discard value Custom multiplier: `1.0E+6`
HashiCorp Nomad Client: Memory unallocated	Total amount of memory free for the scheduler to allocate to tasks.	Dependent item	nomad.client.unallocated.memory Preprocessing Prometheus pattern: `VALUE(nomad_client_unallocated_memory)` ⛔️Custom on fail: Discard value Custom multiplier: `1.0E+6`
HashiCorp Nomad Client: Disk allocated	Total amount of disk space the scheduler has allocated to tasks.	Dependent item	nomad.client.allocated.disk Preprocessing Prometheus pattern: `VALUE(nomad_client_allocated_disk)` ⛔️Custom on fail: Discard value Custom multiplier: `1.0E+6`
HashiCorp Nomad Client: Disk unallocated	Total amount of disk space free for the scheduler to allocate to tasks.	Dependent item	nomad.client.unallocated.disk Preprocessing Prometheus pattern: `VALUE(nomad_client_unallocated_disk)` ⛔️Custom on fail: Discard value Custom multiplier: `1.0E+6`
HashiCorp Nomad Client: Allocations blocked	Number of allocations waiting for previous versions.	Dependent item	nomad.client.allocations.blocked Preprocessing Prometheus pattern: `VALUE(nomad_client_allocations_blocked)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Client: Allocations migrating	Number of allocations migrating data from previous versions.	Dependent item	nomad.client.allocations.migrating Preprocessing Prometheus pattern: `VALUE(nomad_client_allocations_migrating)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Client: Allocations pending	Number of allocations pending (received by the client but not yet running).	Dependent item	nomad.client.allocations.pending Preprocessing Prometheus pattern: `VALUE(nomad_client_allocations_pending)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Client: Allocations starting	Number of allocations starting.	Dependent item	nomad.client.allocations.start Preprocessing Prometheus pattern: `VALUE(nomad_client_allocations_start)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Client: Allocations running	Number of allocations running.	Dependent item	nomad.client.allocations.running Preprocessing Prometheus pattern: `VALUE(nomad_client_allocations_running)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Client: Allocations terminal	Number of allocations in the terminal state.	Dependent item	nomad.client.allocations.terminal Preprocessing Prometheus pattern: `VALUE(nomad_client_allocations_terminal)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Client: Allocations failed, rate	Number of failed allocations per second.	Dependent item	nomad.client.allocations.failed Preprocessing Prometheus pattern: `SUM(nomad_client_allocs_failed)` ⛔️Custom on fail: Set value to: `0` Change per second Discard unchanged with heartbeat: `1h`
HashiCorp Nomad Client: Allocations completed, rate	Number of completed allocations per second.	Dependent item	nomad.client.allocations.complete Preprocessing Prometheus pattern: `SUM(nomad_client_allocs_complete)` ⛔️Custom on fail: Set value to: `0` Change per second Discard unchanged with heartbeat: `1h`
HashiCorp Nomad Client: Allocations restarted, rate	Number of restarted allocations per second.	Dependent item	nomad.client.allocations.restart Preprocessing Prometheus pattern: `SUM(nomad_client_allocs_restart)` ⛔️Custom on fail: Set value to: `0` Change per second Discard unchanged with heartbeat: `1h`
HashiCorp Nomad Client: Allocations OOM killed	Number of allocations killed due to out-of-memory (OOM) conditions.	Dependent item	nomad.client.allocations.oom_killed Preprocessing Prometheus pattern: `VALUE(nomad_client_allocs_oom_killed)` ⛔️Custom on fail: Set value to: `0` Discard unchanged with heartbeat: `1h`
HashiCorp Nomad Client: CPU idle utilization	CPU utilization in idle state.	Dependent item	nomad.client.cpu.idle Preprocessing Prometheus pattern: `AVG(nomad_client_host_cpu_idle)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Client: CPU system utilization	CPU utilization in system space.	Dependent item	nomad.client.cpu.system Preprocessing Prometheus pattern: `AVG(nomad_client_host_cpu_system)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Client: CPU total utilization	Total CPU utilization.	Dependent item	nomad.client.cpu.total Preprocessing Prometheus pattern: `AVG(nomad_client_host_cpu_total)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Client: CPU user utilization	CPU utilization in user space.	Dependent item	nomad.client.cpu.user Preprocessing Prometheus pattern: `AVG(nomad_client_host_cpu_user)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Client: Memory available	Total amount of memory available to processes which includes free and cached memory.	Dependent item	nomad.client.memory.available Preprocessing Prometheus pattern: `VALUE(nomad_client_host_memory_available)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Client: Memory free	Amount of memory which is free.	Dependent item	nomad.client.memory.free Preprocessing Prometheus pattern: `VALUE(nomad_client_host_memory_free)`
HashiCorp Nomad Client: Memory size	Total amount of physical memory on the node.	Dependent item	nomad.client.memory.total Preprocessing Prometheus pattern: `VALUE(nomad_client_host_memory_total)`
HashiCorp Nomad Client: Memory used	Amount of memory used by processes.	Dependent item	nomad.client.memory.used Preprocessing Prometheus pattern: `VALUE(nomad_client_host_memory_used)`
HashiCorp Nomad Client: Uptime	Uptime of the host running the Nomad client.	Dependent item	nomad.client.uptime Preprocessing Prometheus pattern: `VALUE(nomad_client_uptime)`
HashiCorp Nomad Client: Node info get	Node info data in raw format.	HTTP agent	nomad.client.node.info.get Preprocessing Check for not supported value ⛔️Custom on fail: Set value to: `{"header":{"HTTP/1.1 408 Request timeout":""}}`
HashiCorp Nomad Client: Nomad client version	Nomad client version.	Dependent item	nomad.client.version Preprocessing JSON Path: `$.body..Version.first()`
HashiCorp Nomad Client: Nodes API response	Nodes API response message.	Dependent item	nomad.client.node.info.api.response Preprocessing JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`
HashiCorp Nomad Client: Allocated jobs get	Allocated jobs data in raw format.	HTTP agent	nomad.client.job.allocs.get Preprocessing Check for not supported value ⛔️Custom on fail: Set value to: `{"header":{"HTTP/1.1 408 Request timeout":""}}`
HashiCorp Nomad Client: Allocations API response	Allocations API response message.	Dependent item	nomad.client.job.allocs.api.response Preprocessing JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`
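
Most client items are dependent items that apply a Prometheus pattern to the raw telemetry: `VALUE()` takes a single sample, `SUM()` and `AVG()` aggregate every matching sample (for example, one sample per CPU), and an optional custom multiplier rescales the result (`1.0E+6` for the memory and disk items). The Python sketch below imitates that preprocessing on Prometheus text output. It is a simplified stand-in for Zabbix's own parser (no label filters, no `}` inside label values), and the helper names are hypothetical.

```python
import re
from typing import List

# Metric name, optional {labels}, then the sample value.
SAMPLE_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)")


def samples(text: str, metric: str) -> List[float]:
    """Collect every sample value of `metric` from Prometheus text output."""
    values = []
    for line in text.splitlines():
        match = SAMPLE_RE.match(line.strip())  # comment lines ("# ...") never match
        if match and match.group(1) == metric:
            values.append(float(match.group(2)))
    return values


def prometheus_pattern(text: str, metric: str, func: str = "VALUE", multiplier: float = 1.0) -> float:
    """Apply VALUE/SUM/AVG to the matching samples, then the custom multiplier.

    Raises ValueError when nothing matches; that is the case the template
    handles with "Custom on fail" (discard the value or set it to 0).
    """
    values = samples(text, metric)
    if not values:
        raise ValueError(f"no samples for {metric}")
    if func == "VALUE":
        if len(values) != 1:
            raise ValueError(f"VALUE expects exactly one sample of {metric}")
        result = values[0]
    elif func == "SUM":
        result = sum(values)
    elif func == "AVG":
        result = sum(values) / len(values)
    else:
        raise ValueError(f"unsupported function {func}")
    return result * multiplier


telemetry = """\
# TYPE nomad_client_allocated_memory gauge
nomad_client_allocated_memory{datacenter="dc1"} 512
nomad_client_host_cpu_total{cpu="cpu0"} 10
nomad_client_host_cpu_total{cpu="cpu1"} 30
"""
print(prometheus_pattern(telemetry, "nomad_client_allocated_memory", multiplier=1.0e6))  # 512000000.0
print(prometheus_pattern(telemetry, "nomad_client_host_cpu_total", "AVG"))  # 20.0
```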

Triggers

Name	Description	Expression	Severity	Dependencies and additional info
HashiCorp Nomad Client: Monitoring API connection has failed	Monitoring API connection has failed. Ensure that the Nomad API URL and the necessary permissions have been defined correctly, and check the service state and network connectivity between Nomad and Zabbix.	`find(/HashiCorp Nomad Client by HTTP/nomad.client.data.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0`	Average	Manual close: Yes
HashiCorp Nomad Client: Service [rpc] is down	Cannot establish the connection to [rpc] service port {$NOMAD.CLIENT.RPC.PORT}. Check the Nomad state and network connectivity between Nomad and Zabbix.	`last(/HashiCorp Nomad Client by HTTP/net.tcp.service[tcp,,{$NOMAD.CLIENT.RPC.PORT}]) = 0`	Average	Manual close: Yes
HashiCorp Nomad Client: Service [serf] is down	Cannot establish the connection to [serf] service port {$NOMAD.CLIENT.SERF.PORT}. Check the Nomad state and network connectivity between Nomad and Zabbix.	`last(/HashiCorp Nomad Client by HTTP/net.tcp.service[tcp,,{$NOMAD.CLIENT.SERF.PORT}]) = 0`	Average	Manual close: Yes
HashiCorp Nomad Client: OOM killed allocations found	OOM killed allocations found.	`last(/HashiCorp Nomad Client by HTTP/nomad.client.allocations.oom_killed) > 0`	Warning	Manual close: Yes
HashiCorp Nomad Client: High CPU utilization	CPU utilization is too high. The system might be slow to respond.	`min(/HashiCorp Nomad Client by HTTP/nomad.client.cpu.total, 10m) >= {$NOMAD.CPU.UTIL.MIN}`	Average
HashiCorp Nomad Client: High memory utilization	RAM utilization is too high. The system might be slow to respond.	`(min(/HashiCorp Nomad Client by HTTP/nomad.client.memory.available, 10m) / last(/HashiCorp Nomad Client by HTTP/nomad.client.memory.total))*100 <= {$NOMAD.RAM.AVAIL.MIN}`	Average
HashiCorp Nomad Client: The host has been restarted	The host uptime is less than 10 minutes.	`last(/HashiCorp Nomad Client by HTTP/nomad.client.uptime) < 10m`	Warning	Manual close: Yes
HashiCorp Nomad Client: Nomad client version has changed	Nomad client version has changed.	`change(/HashiCorp Nomad Client by HTTP/nomad.client.version)<>0`	Info	Manual close: Yes
HashiCorp Nomad Client: Nodes API connection has failed	Nodes API connection has failed. Ensure that the Nomad API URL and the necessary permissions have been defined correctly, and check the service state and network connectivity between Nomad and Zabbix.	`find(/HashiCorp Nomad Client by HTTP/nomad.client.node.info.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0`	Average	Manual close: Yes Depends on: HashiCorp Nomad Client: Monitoring API connection has failed
HashiCorp Nomad Client: Allocations API connection has failed	Allocations API connection has failed. Ensure that the Nomad API URL and the necessary permissions have been defined correctly, and check the service state and network connectivity between Nomad and Zabbix.	`find(/HashiCorp Nomad Client by HTTP/nomad.client.job.allocs.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0`	Average	Manual close: Yes Depends on: HashiCorp Nomad Client: Monitoring API connection has failed
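
The memory and uptime triggers reduce to simple comparisons. A Python sketch, assuming the memory items are in bytes and the uptime item is in seconds (Nomad reports host memory in bytes and uptime in seconds, and these items apply no multiplier). The function names are hypothetical, and the 5 % default mirrors {$NOMAD.RAM.AVAIL.MIN}.

```python
from typing import Sequence


def high_memory_utilization(available_10m: Sequence[float], total: float,
                            avail_min_pct: float = 5.0) -> bool:
    """(min(available,10m) / last(total)) * 100 <= {$NOMAD.RAM.AVAIL.MIN}."""
    return min(available_10m) / total * 100 <= avail_min_pct


def host_restarted(uptime_seconds: float) -> bool:
    """last(uptime) < 10m; Zabbix expands the 10m suffix to 600 seconds."""
    return uptime_seconds < 600


# 8 GB host whose available memory dipped to 300 MB in the window: 3.75 % left.
print(high_memory_utilization([3.0e8, 4.0e8], 8.0e9))  # True
print(host_restarted(300), host_restarted(3600))       # True False
```

Note that the memory trigger takes the minimum of *available* memory, so a single low sample within the 10-minute window is enough to raise the problem, while the CPU trigger takes the minimum of *utilization* and therefore needs sustained load.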

LLD rule Drivers discovery

Name	Description	Type	Key and additional info
Drivers discovery	Client drivers discovery.	Dependent item	nomad.client.drivers.discovery Preprocessing JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`

Item prototypes for Drivers discovery

Name	Description	Type	Key and additional info
HashiCorp Nomad Client: Driver [{#DRIVER.NAME}] state	Driver [{#DRIVER.NAME}] state.	Dependent item	nomad.client.driver.state["{#DRIVER.NAME}"] Preprocessing JSON Path: `$.body..Drivers.{#DRIVER.NAME}.Healthy.first()` Boolean to decimal Discard unchanged with heartbeat: `1h`
HashiCorp Nomad Client: Driver [{#DRIVER.NAME}] detection state	Driver [{#DRIVER.NAME}] detection state.	Dependent item	nomad.client.driver.detected["{#DRIVER.NAME}"] Preprocessing JSON Path: `$.body..Drivers.{#DRIVER.NAME}.Detected.first()` Boolean to decimal

Trigger prototypes for Drivers discovery

Name	Description	Expression	Severity	Dependencies and additional info
HashiCorp Nomad Client: Driver [{#DRIVER.NAME}] is in unhealthy state	The [{#DRIVER.NAME}] driver is detected, but its state is unhealthy.	`last(/HashiCorp Nomad Client by HTTP/nomad.client.driver.state["{#DRIVER.NAME}"]) = 0 and last(/HashiCorp Nomad Client by HTTP/nomad.client.driver.detected["{#DRIVER.NAME}"]) = 1`	Warning	Manual close: Yes
HashiCorp Nomad Client: Driver [{#DRIVER.NAME}] detection state has changed	The [{#DRIVER.NAME}] driver detection state has changed.	`change(/HashiCorp Nomad Client by HTTP/nomad.client.driver.detected["{#DRIVER.NAME}"]) <> 0`	Info	Manual close: Yes
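
The driver items read the `Healthy` and `Detected` flags of each entry under `Drivers` in the node info response and convert them with Boolean to decimal; the unhealthy-state prototype fires only for drivers that are detected. A Python sketch of the same evaluation on an already decoded node info object (the `Drivers` layout follows Nomad's node API; the function names and sample data are hypothetical):

```python
from typing import Any, Dict, List, Tuple


def driver_states(node_info: Dict[str, Any]) -> Dict[str, Tuple[int, int]]:
    """Map driver name -> (healthy, detected) as 0/1, like Boolean to decimal."""
    drivers = node_info.get("Drivers") or {}
    return {
        name: (int(bool(info.get("Healthy"))), int(bool(info.get("Detected"))))
        for name, info in drivers.items()
    }


def unhealthy_drivers(node_info: Dict[str, Any]) -> List[str]:
    """Drivers matching the trigger condition: state = 0 and detected = 1."""
    return sorted(
        name for name, (healthy, detected) in driver_states(node_info).items()
        if healthy == 0 and detected == 1
    )


node = {"Drivers": {
    "docker": {"Detected": True, "Healthy": False},
    "exec": {"Detected": True, "Healthy": True},
    "qemu": {"Detected": False, "Healthy": False},  # not detected, so ignored
}}
print(unhealthy_drivers(node))  # ['docker']
```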

LLD rule Physical disks discovery

Name	Description	Type	Key and additional info
Physical disks discovery	Physical disks discovery.	Dependent item	nomad.client.disk.discovery Preprocessing Prometheus to JSON: `nomad_client_host_disk_available{disk=~".*"}`
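
Disk discovery converts every `nomad_client_host_disk_available` sample into a JSON object, and the `disk` label then serves as the device name. The Python sketch below approximates the Prometheus to JSON step with simplified parsing (label values must not contain `}`) and treats `=~` as a fully anchored regex, as PromQL does. Mapping `labels.disk` to {#DEV.NAME} is an assumption about the rule's LLD macro paths, and the function names are hypothetical.

```python
import re
from typing import Dict, List

SAMPLE_RE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)")
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def prometheus_to_json(text: str, metric: str, label_regex: Dict[str, str]) -> List[dict]:
    """Return {name, value, labels} objects for samples whose labels fully match."""
    rows = []
    for line in text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        if all(re.fullmatch(pattern, labels.get(key, "")) for key, pattern in label_regex.items()):
            rows.append({"name": metric, "value": match.group("value"), "labels": labels})
    return rows


def discover_disks(text: str) -> List[Dict[str, str]]:
    """Build LLD rows; {#DEV.NAME} is taken from the disk label (assumption)."""
    rows = prometheus_to_json(text, "nomad_client_host_disk_available", {"disk": ".*"})
    return [{"{#DEV.NAME}": row["labels"]["disk"]} for row in rows if "disk" in row["labels"]]


telemetry = """\
nomad_client_host_disk_available{disk="/dev/sda1",host="node-1"} 1.2e+10
nomad_client_host_disk_available{disk="/dev/sdb1",host="node-1"} 3.4e+10
nomad_client_host_disk_size{disk="/dev/sda1",host="node-1"} 5e+10
"""
print(discover_disks(telemetry))  # [{'{#DEV.NAME}': '/dev/sda1'}, {'{#DEV.NAME}': '/dev/sdb1'}]
```

The item prototypes then select their own sample with an exact label match, for example `VALUE(nomad_client_host_disk_size{disk="{#DEV.NAME}"})`.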

Item prototypes for Physical disks discovery

Name	Description	Type	Key and additional info
HashiCorp Nomad Client: Disk ["{#DEV.NAME}"] space available	Amount of space which is available on ["{#DEV.NAME}"] disk.	Dependent item	nomad.client.disk.available["{#DEV.NAME}"] Preprocessing Prometheus pattern: `VALUE(nomad_client_host_disk_available{disk="{#DEV.NAME}"})`
HashiCorp Nomad Client: Disk ["{#DEV.NAME}"] inodes utilization	Disk space consumed by the inodes on ["{#DEV.NAME}"] disk.	Dependent item	nomad.client.disk.inodes_percent["{#DEV.NAME}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
HashiCorp Nomad Client: Disk ["{#DEV.NAME}"] size	Total size of the ["{#DEV.NAME}"] device.	Dependent item	nomad.client.disk.size["{#DEV.NAME}"] Preprocessing Prometheus pattern: `VALUE(nomad_client_host_disk_size{disk="{#DEV.NAME}"})`
HashiCorp Nomad Client: Disk ["{#DEV.NAME}"] space utilization	Percentage of disk ["{#DEV.NAME}"] space used.	Dependent item	nomad.client.disk.used_percent["{#DEV.NAME}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
HashiCorp Nomad Client: Disk ["{#DEV.NAME}"] space used	Amount of disk ["{#DEV.NAME}"] space which has been used.	Dependent item	nomad.client.disk.used["{#DEV.NAME}"] Preprocessing Prometheus pattern: `VALUE(nomad_client_host_disk_used{disk="{#DEV.NAME}"})`

Trigger prototypes for Physical disks discovery

Name	Description	Expression	Severity	Dependencies and additional info
HashiCorp Nomad Client: Running out of free inodes on [{#DEV.NAME}] device	It may become impossible to write to a disk if there are no index nodes left. The following error messages may be returned as symptoms, even though free space is still available: - No space left on device; - Disk is full.	`min(/HashiCorp Nomad Client by HTTP/nomad.client.disk.inodes_percent["{#DEV.NAME}"],5m) >= {$NOMAD.INODES.FREE.MIN.WARN:"{#DEV.NAME}"}`	Warning	Manual close: Yes Depends on: HashiCorp Nomad Client: Running out of free inodes on [{#DEV.NAME}] device
HashiCorp Nomad Client: Running out of free inodes on [{#DEV.NAME}] device	It may become impossible to write to a disk if there are no index nodes left. The following error messages may be returned as symptoms, even though free space is still available: - No space left on device; - Disk is full.	`min(/HashiCorp Nomad Client by HTTP/nomad.client.disk.inodes_percent["{#DEV.NAME}"],5m) >= {$NOMAD.INODES.FREE.MIN.CRIT:"{#DEV.NAME}"}`	Average	Manual close: Yes
HashiCorp Nomad Client: High disk [{#DEV.NAME}] utilization	High disk [{#DEV.NAME}] utilization.	`min(/HashiCorp Nomad Client by HTTP/nomad.client.disk.used_percent["{#DEV.NAME}"],5m) >= {$NOMAD.DISK.UTIL.MIN.WARN:"{#DEV.NAME}"}`	Warning	Manual close: Yes Depends on: HashiCorp Nomad Client: Running out of free inodes on [{#DEV.NAME}] device
HashiCorp Nomad Client: High disk [{#DEV.NAME}] utilization	High disk [{#DEV.NAME}] utilization.	`min(/HashiCorp Nomad Client by HTTP/nomad.client.disk.used_percent["{#DEV.NAME}"],5m) >= {$NOMAD.DISK.UTIL.MIN.CRIT:"{#DEV.NAME}"}`	Average	Manual close: Yes

LLD rule Allocated jobs discovery

Name	Description	Type	Key and additional info
Allocated jobs discovery	Allocated jobs discovery.	Dependent item	nomad.client.alloc.discovery Preprocessing JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`
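
The allocated-jobs rule is implemented in JavaScript that is not reproduced here. As a rough, hypothetical illustration only, the Python sketch below derives one LLD row per job, task group, and namespace from a node's allocation list, which is the tuple the item prototype keys are built from. It assumes Nomad's allocation list fields (`JobID`, `TaskGroup`, `Namespace`, `JobType`, `ClientStatus`) and keeps running allocations only; the template's actual script may select allocations differently, and the `{#JOB.TYPE}` macro and helper names are hypothetical.

```python
from typing import Any, Dict, List


def discover_jobs(allocs: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """One LLD row per (job, task group, namespace) among running allocations."""
    seen = set()
    rows = []
    for alloc in allocs:
        if alloc.get("ClientStatus") != "running":
            continue
        key = (alloc["JobID"], alloc["TaskGroup"], alloc["Namespace"])
        if key in seen:
            continue  # several allocations of the same group share one set of items
        seen.add(key)
        rows.append({
            "{#JOB.NAME}": alloc["JobID"],
            "{#JOB.TASK.GROUP}": alloc["TaskGroup"],
            "{#JOB.NAMESPACE}": alloc["Namespace"],
            "{#JOB.TYPE}": alloc.get("JobType", ""),
        })
    return rows


def item_key(prefix: str, row: Dict[str, str]) -> str:
    """Render a prototype key; quotes inside names are not escaped in this sketch."""
    params = ",".join('"{}"'.format(row[m]) for m in ("{#JOB.NAME}", "{#JOB.TASK.GROUP}", "{#JOB.NAMESPACE}"))
    return f"{prefix}[{params}]"


allocs = [
    {"JobID": "web", "TaskGroup": "frontend", "Namespace": "default", "JobType": "service", "ClientStatus": "running"},
    {"JobID": "web", "TaskGroup": "frontend", "Namespace": "default", "JobType": "service", "ClientStatus": "running"},
    {"JobID": "backup", "TaskGroup": "db", "Namespace": "ops", "JobType": "batch", "ClientStatus": "complete"},
]
rows = discover_jobs(allocs)
print(len(rows))  # 1
print(item_key("nomad.client.allocs.cpu.user", rows[0]))  # nomad.client.allocs.cpu.user["web","frontend","default"]
```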

Item prototypes for Allocated jobs discovery

Name	Description	Type	Key and additional info
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] CPU allocated	Total CPU resources allocated by the ["{#JOB.NAME}"] job across all cores.	Dependent item	nomad.client.allocs.cpu.allocated["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] CPU system utilization	Total CPU resources consumed by the ["{#JOB.NAME}"] job in system space.	Dependent item	nomad.client.allocs.cpu.system["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] CPU user utilization	Total CPU resources consumed by the ["{#JOB.NAME}"] job in user space.	Dependent item	nomad.client.allocs.cpu.user["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] CPU total utilization	Total CPU resources consumed by the ["{#JOB.NAME}"] job across all cores.	Dependent item	nomad.client.allocs.cpu.total_percent["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] CPU throttled periods time	Total number of CPU periods during which the ["{#JOB.NAME}"] job was throttled.	Dependent item	nomad.client.allocs.cpu.throttled_periods["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.` Custom multiplier: `1e-09`
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] CPU throttled time	Total time that the ["{#JOB.NAME}"] job was throttled.	Dependent item	nomad.client.allocs.cpu.throttled_time["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] CPU ticks	CPU ticks consumed by the process for the ["{#JOB.NAME}"] job in the last collection interval.	Dependent item	nomad.client.allocs.cpu.total_ticks["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] Memory allocated	Amount of memory allocated by the ["{#JOB.NAME}"] job.	Dependent item	nomad.client.allocs.memory.allocated["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] Memory cached	Amount of memory cached by the ["{#JOB.NAME}"] job.	Dependent item	nomad.client.allocs.memory.cache["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] Memory used	Total amount of memory used by the ["{#JOB.NAME}"] job.	Dependent item	nomad.client.allocs.memory.usage["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] Memory swapped	Amount of memory swapped by the ["{#JOB.NAME}"] job.	Dependent item	nomad.client.allocs.memory.swap["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`

HashiCorp Nomad Server by HTTP

Overview

This template is designed to monitor HashiCorp Nomad servers by Zabbix. It works without any external scripts.

Requirements

Zabbix version: 7.0 and higher.

Tested versions

This template has been tested on:

HashiCorp Nomad version 1.5.6/1.6.0

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

Enable telemetry in the HashiCorp Nomad agent configuration file and set the Prometheus metrics format.

Refer to the vendor documentation.

Set the values for the {$NOMAD.SERVER.API.SCHEME} and {$NOMAD.SERVER.API.PORT} macros to define the common Nomad API web scheme and connection port.
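With telemetry enabled and the Prometheus format turned on, a Nomad agent serves its metrics on the `/v1/metrics?format=prometheus` endpoint, and an ACL token is passed in the `X-Nomad-Token` header. The sketch below builds, without sending, such a request from values standing in for the scheme, port and token macros. Whether the template's HTTP agent item uses exactly this URL is an assumption.

```python
from urllib.request import Request


def telemetry_request(host: str, scheme: str = "http", port: int = 4646,
                      token: str = "") -> Request:
    """Build (but do not send) a request for Prometheus-format Nomad metrics.

    scheme/port stand in for {$NOMAD.SERVER.API.SCHEME} and
    {$NOMAD.SERVER.API.PORT}; token stands in for {$NOMAD.TOKEN}.
    """
    url = f"{scheme}://{host}:{port}/v1/metrics?format=prometheus"
    headers = {"X-Nomad-Token": token} if token else {}  # no header when ACL is not used
    return Request(url, headers=headers)


req = telemetry_request("10.0.0.5", token="secret")
print(req.full_url)  # http://10.0.0.5:4646/v1/metrics?format=prometheus
```

To actually fetch the data, `urllib.request.urlopen(req, timeout=15)` would send it; the 15-second value mirrors the default of {$NOMAD.DATA.TIMEOUT}.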

Additional information:

The Nomad servers use the default web scheme, HTTP, and the default API port, 4646. If you're using servers discovery and need to redefine macros for a particular host created from a prototype, use context macros such as {$NOMAD.SERVER.API.SCHEME:NECESSARY.IP} and/or {$NOMAD.SERVER.API.PORT:NECESSARY.IP} at the master host or template level.
Some metrics may not be collected depending on your HashiCorp Nomad agent version, configuration and cluster role.
Don't forget to define the {$NOMAD.REDUNDANCY.MIN} macro value based on the number of nodes in your cluster, so that the failure tolerance triggers are configured correctly.
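Raft keeps working as long as a majority (quorum) of servers is reachable, so a cluster of N servers survives N - (floor(N/2) + 1) failures. This agrees with the default {$NOMAD.REDUNDANCY.MIN} value of 1 for a 3-node cluster. A minimal helper for picking the value:

```python
def failure_tolerance(servers: int) -> int:
    """Number of servers that can fail while a Raft majority (quorum) survives."""
    if servers < 1:
        raise ValueError("a cluster needs at least one server")
    quorum = servers // 2 + 1
    return servers - quorum


for n in (1, 3, 5):
    print(f"cluster of {n}: tolerates {failure_tolerance(n)} failure(s)")
# cluster of 1: tolerates 0 failure(s)
# cluster of 3: tolerates 1 failure(s)
# cluster of 5: tolerates 2 failure(s)
```

Even-sized clusters add no tolerance over the next smaller odd size (4 servers still tolerate only 1 failure), which is why odd server counts are the usual choice.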

Macros used

Name	Description	Default
{$NOMAD.SERVER.API.SCHEME}	Nomad server API scheme.	`http`
{$NOMAD.SERVER.API.PORT}	Nomad server API port.	`4646`
{$NOMAD.TOKEN}	Nomad authentication token.	`<PUT YOUR AUTH TOKEN>`
{$NOMAD.DATA.TIMEOUT}	Response timeout for an API.	`15s`
{$NOMAD.HTTP.PROXY}	Sets the HTTP proxy for HTTP agent items. If this parameter is empty, no proxy is used.
{$NOMAD.API.RESPONSE.SUCCESS}	HTTP API successful response code, used as the availability triggers threshold. Change if needed.	`200`
{$NOMAD.SERVER.RPC.PORT}	Nomad RPC service port.	`4647`
{$NOMAD.SERVER.SERF.PORT}	Nomad serf service port.	`4648`
{$NOMAD.REDUNDANCY.MIN}	Number of redundant servers required to keep the cluster safe. The default value, '1', suits a 3-node cluster. Change if needed.	`1`
{$NOMAD.OPEN.FDS.MAX}	Maximum percentage of used file descriptors.	`90`
{$NOMAD.SERVER.LEADER.LATENCY}	Leader last contact latency threshold.	`0.3s`
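The scheme and port macros above can be overridden for a single discovered server with a context macro whose context is that server's IP. The lookup rule (a context-specific definition wins, otherwise the plain macro is used) can be sketched as follows. This flattens Zabbix's host, template and global levels into one dictionary, and the IP addresses and override value are made up.

```python
from typing import Mapping, Optional


def resolve_macro(macros: Mapping[str, str], name: str,
                  context: Optional[str] = None) -> str:
    """Return a user macro value, preferring the context-specific definition.

    Simplification: one flat mapping stands in for Zabbix's host, template
    and global levels, and contexts are written unquoted.
    """
    if context is not None:
        with_context = "{$" + name + ":" + context + "}"
        if with_context in macros:
            return macros[with_context]
    # No context match: fall back to the plain macro.
    return macros["{$" + name + "}"]


macros = {
    "{$NOMAD.SERVER.API.PORT}": "4646",
    "{$NOMAD.SERVER.API.PORT:10.0.0.7}": "14646",  # hypothetical override
}
print(resolve_macro(macros, "NOMAD.SERVER.API.PORT", "10.0.0.7"))  # 14646
print(resolve_macro(macros, "NOMAD.SERVER.API.PORT", "10.0.0.8"))  # 4646
```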

Items

Name	Description	Type	Key and additional info
HashiCorp Nomad Server: Telemetry get	Telemetry data in raw format.	HTTP agent	nomad.server.data.get Preprocessing Check for not supported value ⛔️Custom on fail: Set value to: `{"header":{"HTTP/1.1 408 Request timeout":""}}`
HashiCorp Nomad Server: Metrics	Nomad server metrics in raw format.	Dependent item	nomad.server.metrics.get Preprocessing JSON Path: `$.body` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Monitoring API response	Monitoring API response message.	Dependent item	nomad.server.data.api.response Preprocessing JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`
HashiCorp Nomad Server: Internal stats get	Internal stats data in raw format.	HTTP agent	nomad.server.stats.get Preprocessing Check for not supported value ⛔️Custom on fail: Set value to: `{"header":{"HTTP/1.1 408 Request timeout":""}}`
HashiCorp Nomad Server: Internal stats API response	Internal stats API response message.	Dependent item	nomad.server.stats.api.response Preprocessing JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`
HashiCorp Nomad Server: Nomad server version	Nomad server version.	Dependent item	nomad.server.version Preprocessing JSON Path: `$.body.config.Version.Version`
HashiCorp Nomad Server: Nomad raft version	Nomad raft version.	Dependent item	nomad.raft.version Preprocessing JSON Path: `$.body.stats.raft.protocol_version` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Raft peers	Current number of Raft peers in the cluster.	Dependent item	nomad.server.raft.peers Preprocessing JSON Path: `$.body.stats.raft.num_peers` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Cluster role	Current role in the cluster.	Dependent item	nomad.server.raft.cluster_role Preprocessing JSON Path: `$.body.stats.raft.state` ⛔️Custom on fail: Discard value JavaScript: `The text is too long. Please see the template.`
HashiCorp Nomad Server: CPU time, rate	Total user and system CPU time spent in seconds.	Dependent item	nomad.server.cpu.time Preprocessing Prometheus pattern: `VALUE(process_cpu_seconds_total)` ⛔️Custom on fail: Discard value Change per second
HashiCorp Nomad Server: Memory used	Memory utilization in bytes.	Dependent item	nomad.server.runtime.alloc_bytes Preprocessing Prometheus pattern: `VALUE(nomad_runtime_alloc_bytes)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Virtual memory size	Virtual memory size in bytes.	Dependent item	nomad.server.virtual_memory_bytes Preprocessing Prometheus pattern: `VALUE(process_virtual_memory_bytes)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Resident memory size	Resident memory size in bytes.	Dependent item	nomad.server.resident_memory_bytes Preprocessing Prometheus pattern: `VALUE(process_resident_memory_bytes)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Heap objects	Number of objects on the heap. General memory pressure indicator.	Dependent item	nomad.server.runtime.heap_objects Preprocessing Prometheus pattern: `VALUE(nomad_runtime_heap_objects)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Open file descriptors	Number of open file descriptors.	Dependent item	nomad.server.process_open_fds Preprocessing Prometheus pattern: `VALUE(process_open_fds)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Open file descriptors, max	Maximum number of open file descriptors.	Dependent item	nomad.server.process_max_fds Preprocessing Prometheus pattern: `VALUE(process_max_fds)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Goroutines	Number of goroutines; a general load pressure indicator.	Dependent item	nomad.server.runtime.num_goroutines Preprocessing Prometheus pattern: `VALUE(nomad_runtime_num_goroutines)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Evaluations pending	Evaluations that are pending until an existing evaluation for the same job completes.	Dependent item	nomad.server.broker.total_pending Preprocessing Prometheus pattern: `VALUE(nomad_nomad_broker_total_pending)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Evaluations ready	Number of evaluations ready to be processed.	Dependent item	nomad.server.broker.total_ready Preprocessing Prometheus pattern: `VALUE(nomad_nomad_broker_total_ready)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Evaluations unacked	Evaluations dispatched for processing but incomplete.	Dependent item	nomad.server.broker.total_unacked Preprocessing Prometheus pattern: `VALUE(nomad_nomad_broker_total_unacked)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: CPU shares for blocked evaluations	Amount of CPU shares requested by blocked evals.	Dependent item	nomad.server.blocked_evals.cpu Preprocessing Prometheus pattern: `VALUE(nomad_nomad_blocked_evals_cpu)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Memory shares by blocked evaluations	Amount of memory requested by blocked evals.	Dependent item	nomad.server.blocked_evals.memory Preprocessing Prometheus pattern: `VALUE(nomad_nomad_blocked_evals_memory)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: CPU shares for blocked job evaluations	Amount of CPU shares requested by blocked evals of a job.	Dependent item	nomad.server.blocked_evals.job.cpu Preprocessing Prometheus pattern: `VALUE(nomad_nomad_blocked_evals_job_cpu)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Memory shares for blocked job evaluations	Amount of memory requested by blocked evals of a job.	Dependent item	nomad.server.blocked_evals.job.memory Preprocessing Prometheus pattern: `VALUE(nomad_nomad_blocked_evals_job_memory)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Evaluations blocked	Count of evals in the blocked state for any reason (cluster resource exhaustion or quota limits).	Dependent item	nomad.server.blocked_evals.total_blocked Preprocessing Prometheus pattern: `VALUE(nomad_nomad_blocked_evals_total_blocked)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Evaluations escaped	Count of evals that have escaped computed node classes. This indicates a scheduler optimization was skipped and is not usually a source of concern.	Dependent item	nomad.server.blocked_evals.total_escaped Preprocessing Prometheus pattern: `VALUE(nomad_nomad_blocked_evals_total_escaped)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Evaluations waiting	Count of evals waiting to be enqueued.	Dependent item	nomad.server.broker.total_waiting Preprocessing Prometheus pattern: `VALUE(nomad_nomad_broker_total_waiting)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Evaluations blocked due to quota limit	Count of blocked evals due to quota limits (the resources for these jobs are not counted in other blocked_evals metrics, except for total_blocked).	Dependent item	nomad.server.blocked_evals.total_quota_limit Preprocessing Prometheus pattern: `VALUE(nomad_nomad_blocked_evals_total_quota_limit)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Evaluations enqueue time	Average time elapsed with evaluations waiting to be enqueued.	Dependent item	nomad.server.broker.eval_waiting Preprocessing Prometheus pattern: `AVG(nomad_nomad_eval_ack_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: RPC evaluation acknowledgement time	Time elapsed for Eval.Ack RPC call.	Dependent item	nomad.server.eval.ack Preprocessing Prometheus pattern: `VALUE(nomad_nomad_eval_ack_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: RPC job summary time	Time elapsed for Job.Summary RPC call.	Dependent item	nomad.server.job_summary.get_job_summary Preprocessing Prometheus pattern: `VALUE(nomad_nomad_job_summary_get_job_summary_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Heartbeats active	Number of active heartbeat timers. Each timer represents a Nomad client connection.	Dependent item	nomad.server.heartbeat.active Preprocessing Prometheus pattern: `VALUE(nomad_nomad_heartbeat_active)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: RPC requests, rate	Number of RPC requests being handled.	Dependent item	nomad.server.rpc.request Preprocessing Prometheus pattern: `VALUE(nomad_nomad_rpc_request)` ⛔️Custom on fail: Discard value Change per second
HashiCorp Nomad Server: RPC error requests, rate	Number of RPC requests being handled that result in an error.	Dependent item	nomad.server.rpc.request_error Preprocessing Prometheus pattern: `VALUE(nomad_nomad_rpc_request)` ⛔️Custom on fail: Discard value Change per second
HashiCorp Nomad Server: RPC queries, rate	Number of RPC queries.	Dependent item	nomad.server.rpc.query Preprocessing Prometheus pattern: `VALUE(nomad_nomad_rpc_query)` ⛔️Custom on fail: Discard value Change per second
HashiCorp Nomad Server: RPC job allocations time	Time elapsed for Job.Allocations RPC call.	Dependent item	nomad.server.job.allocations Preprocessing Prometheus pattern: `VALUE(nomad_nomad_job_allocations_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: RPC job evaluations time	Time elapsed for Job.Evaluations RPC call.	Dependent item	nomad.server.job.evaluations Preprocessing Prometheus pattern: `VALUE(nomad_nomad_job_evaluations_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: RPC get job time	Time elapsed for Job.GetJob RPC call.	Dependent item	nomad.server.job.get_job Preprocessing Prometheus pattern: `VALUE(nomad_nomad_job_get_job_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Plan apply time	Time elapsed to apply a plan.	Dependent item	nomad.server.plan.apply Preprocessing Prometheus pattern: `VALUE(nomad_nomad_plan_apply_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Plan evaluate time	Time elapsed to evaluate a plan.	Dependent item	nomad.server.plan.evaluate Preprocessing Prometheus pattern: `VALUE(nomad_nomad_plan_evaluate_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: RPC plan submit time	Time elapsed for Plan.Submit RPC call.	Dependent item	nomad.server.plan.submit Preprocessing Prometheus pattern: `VALUE(nomad_nomad_plan_submit_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Plan raft index processing time	Time the planner waits for the raft index of the plan to be processed.	Dependent item	nomad.server.plan.wait_for_index Preprocessing Prometheus pattern: `VALUE(nomad_nomad_plan_wait_for_index_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: RPC list time	Time elapsed for Node.List RPC call.	Dependent item	nomad.server.client.list Preprocessing Prometheus pattern: `VALUE(nomad_nomad_client_list_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: RPC update allocations time	Time elapsed for Node.UpdateAlloc RPC call.	Dependent item	nomad.server.client.update_alloc Preprocessing Prometheus pattern: `VALUE(nomad_nomad_client_update_alloc_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: RPC update status time	Time elapsed for Node.UpdateStatus RPC call.	Dependent item	nomad.server.client.update_status Preprocessing Prometheus pattern: `VALUE(nomad_nomad_client_update_status_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: RPC get client allocs time	Time elapsed for Node.GetClientAllocs RPC call.	Dependent item	nomad.server.client.get_client_allocs Preprocessing Prometheus pattern: `VALUE(nomad_nomad_client_get_client_allocs_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: RPC eval dequeue time	Time elapsed for Eval.Dequeue RPC call.	Dependent item	nomad.server.client.dequeue Preprocessing Prometheus pattern: `VALUE(nomad_nomad_eval_dequeue_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Vault token last renewal	Time since last successful Vault token renewal.	Dependent item	nomad.server.vault.token_last_renewal Preprocessing Prometheus pattern: `VALUE(nomad_nomad_vault_token_last_renewal)` ⛔️Custom on fail: Discard value Custom multiplier: `0.001`
HashiCorp Nomad Server: Vault token next renewal	Time until next Vault token renewal attempt.	Dependent item	nomad.server.vault.token_next_renewal Preprocessing Prometheus pattern: `VALUE(nomad_nomad_vault_token_next_renewal)` ⛔️Custom on fail: Discard value Custom multiplier: `0.001`
HashiCorp Nomad Server: Vault token TTL	Time to live for Vault token.	Dependent item	nomad.server.vault.token_ttl Preprocessing Prometheus pattern: `VALUE(nomad_nomad_vault_token_ttl)` ⛔️Custom on fail: Discard value Custom multiplier: `0.001`
HashiCorp Nomad Server: Vault tokens revoked	Count of revoked tokens.	Dependent item	nomad.server.vault.distributed_tokens_revoked Preprocessing Prometheus pattern: `VALUE(nomad_nomad_vault_distributed_tokens_revoking)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Jobs dead	Number of dead jobs.	Dependent item	nomad.server.job_status.dead Preprocessing Prometheus pattern: `VALUE(nomad_nomad_job_status_dead)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Server: Jobs pending	Number of pending jobs.	Dependent item	nomad.server.job_status.pending Preprocessing Prometheus pattern: `VALUE(nomad_nomad_job_status_pending)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Server: Jobs running	Number of running jobs.	Dependent item	nomad.server.job_status.running Preprocessing Prometheus pattern: `VALUE(nomad_nomad_job_status_running)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Server: Job allocations completed	Number of complete allocations for a job.	Dependent item	nomad.server.job_summary.complete Preprocessing Prometheus pattern: `SUM(nomad_nomad_job_summary_complete)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Server: Job allocations failed	Number of failed allocations for a job.	Dependent item	nomad.server.job_summary.failed Preprocessing Prometheus pattern: `SUM(nomad_nomad_job_summary_failed)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Server: Job allocations lost	Number of lost allocations for a job.	Dependent item	nomad.server.job_summary.lost Preprocessing Prometheus pattern: `SUM(nomad_nomad_job_summary_lost)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Server: Job allocations unknown	Number of unknown allocations for a job.	Dependent item	nomad.server.job_summary.unknown Preprocessing Prometheus pattern: `SUM(nomad_nomad_job_summary_unknown)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Server: Job allocations queued	Number of queued allocations for a job.	Dependent item	nomad.server.job_summary.queued Preprocessing Prometheus pattern: `SUM(nomad_nomad_job_summary_queued)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Server: Job allocations running	Number of running allocations for a job.	Dependent item	nomad.server.job_summary.running Preprocessing Prometheus pattern: `SUM(nomad_nomad_job_summary_running)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Server: Job allocations starting	Number of starting allocations for a job.	Dependent item	nomad.server.job_summary.starting Preprocessing Prometheus pattern: `SUM(nomad_nomad_job_summary_starting)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Server: Gossip time	Time elapsed to broadcast gossip messages.	Dependent item	nomad.server.memberlist.gossip Preprocessing Prometheus pattern: `VALUE(nomad_memberlist_gossip_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Leader barrier time	Time elapsed to establish a raft barrier during leader transition.	Dependent item	nomad.server.leader.barrier Preprocessing Prometheus pattern: `VALUE(nomad_nomad_leader_barrier_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Reconcile peer time	Time elapsed to reconcile a serf peer with the state store.	Dependent item	nomad.server.leader.reconcile_member Preprocessing Prometheus pattern: `VALUE(nomad_nomad_leader_reconcileMember_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Total reconcile time	Time elapsed to reconcile all serf peers with the state store.	Dependent item	nomad.server.leader.reconcile Preprocessing Prometheus pattern: `VALUE(nomad_nomad_leader_reconcile_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Leader last contact	Time since the last contact with the leader. General indicator of Raft latency.	Dependent item	nomad.server.raft.leader.lastContact Preprocessing Prometheus pattern: `VALUE(nomad_raft_leader_lastContact{quantile="0.99"})` ⛔️Custom on fail: Discard value Replace: `NaN -> 0` Custom multiplier: `0.001`
HashiCorp Nomad Server: Plan queue	Count of evals in the plan queue.	Dependent item	nomad.server.plan.queue_depth Preprocessing Prometheus pattern: `VALUE(nomad_nomad_plan_queue_depth)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Worker evaluation create time	Time elapsed for worker to create an eval.	Dependent item	nomad.server.worker.create_eval Preprocessing Prometheus pattern: `VALUE(nomad_nomad_worker_dequeue_eval_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Worker evaluation dequeue time	Time elapsed for worker to dequeue an eval.	Dependent item	nomad.server.worker.dequeue_eval Preprocessing Prometheus pattern: `VALUE(nomad_nomad_worker_dequeue_eval_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Worker invoke scheduler time	Time elapsed for worker to invoke the scheduler.	Dependent item	nomad.server.worker.invoke_scheduler_service Preprocessing Prometheus pattern: `VALUE(nomad_nomad_worker_invoke_scheduler_service_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Worker acknowledgement send time	Time elapsed for worker to send acknowledgement.	Dependent item	nomad.server.worker.send_ack Preprocessing Prometheus pattern: `VALUE(nomad_nomad_worker_send_ack_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Worker submit plan time	Time elapsed for worker to submit plan.	Dependent item	nomad.server.worker.submit_plan Preprocessing Prometheus pattern: `VALUE(nomad_nomad_worker_submit_plan_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Worker update evaluation time	Time elapsed for worker to submit updated eval.	Dependent item	nomad.server.worker.update_eval Preprocessing Prometheus pattern: `VALUE(nomad_nomad_worker_update_eval_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Worker log replication time	Time the worker waits for the raft index of the eval to be processed.	Dependent item	nomad.server.worker.wait_for_index Preprocessing Prometheus pattern: `VALUE(nomad_nomad_worker_wait_for_index_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Raft calls blocked, rate	Count of blocking raft API calls.	Dependent item	nomad.server.raft.barrier Preprocessing Prometheus pattern: `VALUE(nomad_raft_barrier)` ⛔️Custom on fail: Discard value Change per second
HashiCorp Nomad Server: Raft commit logs enqueued	Count of logs enqueued.	Dependent item	nomad.server.raft.commit_num_logs Preprocessing Prometheus pattern: `VALUE(nomad_raft_commitNumLogs)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Raft transactions, rate	Number of Raft transactions.	Dependent item	nomad.server.raft.apply Preprocessing Prometheus pattern: `VALUE(nomad_raft_apply)` ⛔️Custom on fail: Set value to: `0` Change per second
HashiCorp Nomad Server: Raft commit time	Time elapsed to commit writes.	Dependent item	nomad.server.raft.commit_time Preprocessing Prometheus pattern: `VALUE(nomad_nomad_worker_dequeue_eval_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Raft transaction commit time	Raft transaction commit time.	Dependent item	nomad.server.raft.replication.appendEntries Preprocessing Prometheus pattern: `AVG(nomad_raft_replication_appendEntries_rpc)` ⛔️Custom on fail: Discard value Custom multiplier: `0.001`
HashiCorp Nomad Server: FSM apply time	Time elapsed to apply write to FSM.	Dependent item	nomad.server.raft.fsm.apply Preprocessing Prometheus pattern: `VALUE(nomad_raft_fsm_apply_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: FSM enqueue time	Time elapsed to enqueue write to FSM.	Dependent item	nomad.server.raft.fsm.enqueue Preprocessing Prometheus pattern: `VALUE(nomad_raft_fsm_enqueue_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: FSM autopilot time	Time elapsed to apply Autopilot raft entry.	Dependent item	nomad.server.raft.fsm.autopilot Preprocessing Prometheus pattern: `VALUE(nomad_nomad_fsm_autopilot_sum)` ⛔️Custom on fail: Set value to: `0` Custom multiplier: `1e-09`
HashiCorp Nomad Server: FSM register node time	Time elapsed to apply RegisterNode raft entry.	Dependent item	nomad.server.raft.fsm.register_node Preprocessing Prometheus pattern: `VALUE(nomad_nomad_fsm_register_node_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: FSM index	Current index applied to FSM.	Dependent item	nomad.server.raft.applied_index Preprocessing Prometheus pattern: `VALUE(nomad_raft_appliedIndex)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Raft last index	Most recent index seen.	Dependent item	nomad.server.raft.last_index Preprocessing Prometheus pattern: `VALUE(nomad_raft_lastIndex)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Dispatch log time	Time elapsed to write log, mark in flight, and start replication.	Dependent item	nomad.server.raft.leader.dispatch_log Preprocessing Prometheus pattern: `VALUE(nomad_raft_leader_dispatchLog_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Logs dispatched	Count of logs dispatched.	Dependent item	nomad.server.raft.leader.dispatch_num_logs Preprocessing Prometheus pattern: `VALUE(nomad_raft_leader_dispatchNumLogs)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Server: Heartbeat fails	Number of times the server failed to heartbeat and started an election.	Dependent item	nomad.server.raft.transition.heartbeat_timeout Preprocessing Prometheus pattern: `VALUE(nomad_raft_transition_heartbeat_timeout)` ⛔️Custom on fail: Set value to: `0` Discard unchanged with heartbeat: `1h`
HashiCorp Nomad Server: Objects freed, rate	Count of objects freed from the heap by the Go runtime GC.	Dependent item	nomad.server.runtime.free_count Preprocessing Prometheus pattern: `VALUE(nomad_runtime_free_count)` ⛔️Custom on fail: Discard value Change per second
HashiCorp Nomad Server: GC pause time	Go runtime GC pause times.	Dependent item	nomad.server.runtime.gc_pause_ns Preprocessing Prometheus pattern: `VALUE(nomad_runtime_gc_pause_ns_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: GC metadata size	Go runtime GC metadata size in bytes.	Dependent item	nomad.server.runtime.sys_bytes Preprocessing Prometheus pattern: `VALUE(nomad_runtime_sys_bytes)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: GC runs	Count of Go runtime GC runs.	Dependent item	nomad.server.runtime.total_gc_runs Preprocessing Prometheus pattern: `VALUE(nomad_runtime_total_gc_runs)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Memberlist events	Count of memberlist events received.	Dependent item	nomad.server.serf.queue.event Preprocessing Prometheus pattern: `VALUE(nomad_serf_queue_Event_sum)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Memberlist changes	Count of memberlist changes.	Dependent item	nomad.server.serf.queue.intent Preprocessing Prometheus pattern: `VALUE(nomad_serf_queue_Intent_sum)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Memberlist queries	Count of memberlist queries.	Dependent item	nomad.server.serf.queue.queries Preprocessing Prometheus pattern: `VALUE(nomad_serf_queue_Query_sum)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Snapshot index	Current snapshot index.	Dependent item	nomad.server.state.snapshot.index Preprocessing Prometheus pattern: `VALUE(nomad_state_snapshotIndex)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Services ready to schedule	Count of service evals ready to be scheduled.	Dependent item	nomad.server.broker.service_ready Preprocessing Prometheus pattern: `VALUE(nomad_nomad_broker_service_ready)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Services unacknowledged	Count of unacknowledged service evals.	Dependent item	nomad.server.broker.service_unacked Preprocessing Prometheus pattern: `VALUE(nomad_nomad_broker_service_unacked)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: System evaluations ready to schedule	Count of system evals ready to be scheduled.	Dependent item	nomad.server.broker.system_ready Preprocessing Prometheus pattern: `VALUE(nomad_nomad_broker_system_ready)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: System evaluations unacknowledged	Count of unacknowledged system evals.	Dependent item	nomad.server.broker.system_unacked Preprocessing Prometheus pattern: `VALUE(nomad_nomad_broker_system_unacked)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: BoltDB free pages	Number of BoltDB free pages.	Dependent item	nomad.server.raft.boltdb.num_free_pages Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_numFreePages)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: BoltDB pending pages	Number of BoltDB pending pages.	Dependent item	nomad.server.raft.boltdb.num_pending_pages Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_numPendingPages)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: BoltDB free page bytes	Number of free page bytes.	Dependent item	nomad.server.raft.boltdb.free_page_bytes Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_freePageBytes)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: BoltDB freelist bytes	Number of freelist bytes.	Dependent item	nomad.server.raft.boltdb.freelist_bytes Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_freelistBytes)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: BoltDB read transactions, rate	Count of total read transactions.	Dependent item	nomad.server.raft.boltdb.total_read_txn Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_totalReadTxn)` ⛔️Custom on fail: Discard value Change per second
HashiCorp Nomad Server: BoltDB open read transactions	Number of current open read transactions.	Dependent item	nomad.server.raft.boltdb.open_read_txn Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_openReadTxn)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: BoltDB pages in use	Number of pages in use.	Dependent item	nomad.server.raft.boltdb.txstats.page_count Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_pageCount)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: BoltDB page allocations, rate	Number of page allocations.	Dependent item	nomad.server.raft.boltdb.txstats.page_alloc Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_pageAlloc)` ⛔️Custom on fail: Discard value Change per second
HashiCorp Nomad Server: BoltDB cursors	Count of total database cursors.	Dependent item	nomad.server.raft.boltdb.txstats.cursor_count Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_cursorCount)` ⛔️Custom on fail: Discard value Change per second
HashiCorp Nomad Server: BoltDB nodes, rate	Count of total database nodes.	Dependent item	nomad.server.raft.boltdb.txstats.node_count Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_nodeCount)` ⛔️Custom on fail: Discard value Change per second
HashiCorp Nomad Server: BoltDB node dereferences, rate	Count of total database node dereferences.	Dependent item	nomad.server.raft.boltdb.txstats.node_deref Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_nodeDeref)` ⛔️Custom on fail: Discard value Change per second
HashiCorp Nomad Server: BoltDB rebalance operations, rate	Count of total rebalance operations.	Dependent item	nomad.server.raft.boltdb.txstats.rebalance Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_rebalance)` ⛔️Custom on fail: Discard value Change per second
HashiCorp Nomad Server: BoltDB split operations, rate	Count of total split operations.	Dependent item	nomad.server.raft.boltdb.txstats.split Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_split)` ⛔️Custom on fail: Discard value Change per second
HashiCorp Nomad Server: BoltDB spill operations, rate	Count of total spill operations.	Dependent item	nomad.server.raft.boltdb.txstats.spill Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_spill)` ⛔️Custom on fail: Discard value Change per second
HashiCorp Nomad Server: BoltDB write operations, rate	Count of total write operations.	Dependent item	nomad.server.raft.boltdb.txstats.write Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_write)` ⛔️Custom on fail: Discard value Change per second
HashiCorp Nomad Server: BoltDB rebalance time	Sample of rebalance operation times.	Dependent item	nomad.server.raft.boltdb.txstats.rebalance_time Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_rebalanceTime_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: BoltDB spill time	Sample of spill operation times.	Dependent item	nomad.server.raft.boltdb.txstats.spill_time Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_spillTime_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: BoltDB write time	Sample of write operation times.	Dependent item	nomad.server.raft.boltdb.txstats.write_time Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_writeTime_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Service [rpc] state	Current [rpc] service state.	Simple check	net.tcp.service[tcp,,{$NOMAD.SERVER.RPC.PORT}] Preprocessing Discard unchanged with heartbeat: `1h`
HashiCorp Nomad Server: Service [serf] state	Current [serf] service state.	Simple check	net.tcp.service[tcp,,{$NOMAD.SERVER.SERF.PORT}] Preprocessing Discard unchanged with heartbeat: `1h`
HashiCorp Nomad Server: Namespace list time	Time elapsed for Namespace.ListNamespaces.	Dependent item	nomad.server.namespace.list_namespace Preprocessing Prometheus pattern: `VALUE(nomad_nomad_namespace_list_namespace_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Autopilot state	Current autopilot state.	Dependent item	nomad.server.autopilot.state Preprocessing Prometheus pattern: `VALUE(nomad_nomad_autopilot_healthy)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Autopilot failure tolerance	The number of redundant healthy servers that can fail without causing an outage.	Dependent item	nomad.server.autopilot.failure_tolerance Preprocessing Prometheus pattern: `VALUE(nomad_nomad_autopilot_failure_tolerance)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: FSM allocation client update time	Time elapsed to apply AllocClientUpdate raft entry.	Dependent item	nomad.server.alloc_client_update Preprocessing Prometheus pattern: `VALUE(nomad_nomad_fsm_alloc_client_update_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: FSM apply plan results time	Time elapsed to apply ApplyPlanResults raft entry.	Dependent item	nomad.server.fsm.apply_plan_results Preprocessing Prometheus pattern: `VALUE(nomad_nomad_fsm_apply_plan_results_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: FSM update evaluation time	Time elapsed to apply UpdateEval raft entry.	Dependent item	nomad.server.fsm.update_eval Preprocessing Prometheus pattern: `VALUE(nomad_nomad_fsm_update_eval_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: FSM job registration time	Time elapsed to apply RegisterJob raft entry.	Dependent item	nomad.server.fsm.register_job Preprocessing Prometheus pattern: `VALUE(nomad_nomad_fsm_register_job_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Allocation reschedule attempts	Count of attempts to reschedule an allocation.	Dependent item	nomad.server.scheduler.allocs.rescheduled.attempted Preprocessing Prometheus pattern: `SUM(nomad_scheduler_allocs_reschedule_attempted)` ⛔️Custom on fail: Set value to: `0`
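
Most server items above are dependent items. Each one picks a single metric out of the raw Prometheus text with a pattern such as `VALUE(...)` or `SUM(...)`, and some then apply a custom multiplier. The Python sketch below approximates those steps to show what each item ends up storing. It is not Zabbix's implementation, and the helper names are hypothetical. It assumes that a missing metric (and, for `VALUE`, more than one matching series) counts as a failure, which the "Custom on fail" setting turns into a discarded value (`None` here) or a fixed value.

```python
import re

# One sample line of the Prometheus text exposition format:
#   metric_name{optional="labels"} value [optional timestamp]
_SAMPLE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{[^}]*\})?\s+(?P<value>\S+)"
)


def samples(exposition, metric):
    """Return the values of every series of `metric`, whatever its labels."""
    values = []
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank, HELP and TYPE lines
            continue
        match = _SAMPLE.match(line)
        if match and match.group("name") == metric:
            values.append(float(match.group("value")))
    return values


def prometheus_value(exposition, metric, multiplier=1.0, on_fail=None):
    """Approximates VALUE(metric) followed by an optional custom multiplier."""
    found = samples(exposition, metric)
    if len(found) != 1:
        return on_fail  # None stands for "Discard value"
    return found[0] * multiplier


def prometheus_sum(exposition, metric, on_fail=None):
    """Approximates SUM(metric): adds up all labelled series of the metric."""
    found = samples(exposition, metric)
    return sum(found) if found else on_fail
```

For example, `prometheus_value(text, "nomad_nomad_fsm_update_eval_sum", 1e-09)` scales the raw `_sum` counter by `1e-09`, and `prometheus_sum(text, "nomad_scheduler_allocs_reschedule_attempted", on_fail=0)` mirrors the reschedule attempts item, which falls back to `0`.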

Triggers

Name	Description	Expression	Severity	Dependencies and additional info
HashiCorp Nomad Server: Monitoring API connection has failed	Monitoring API connection has failed. Ensure that Nomad API URL and the necessary permissions have been defined correctly, check the service state and network connectivity between Nomad and Zabbix.	`find(/HashiCorp Nomad Server by HTTP/nomad.server.data.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0`	Average	Manual close: Yes
HashiCorp Nomad Server: Internal stats API connection has failed	Internal stats API connection has failed. Ensure that Nomad API URL and the necessary permissions have been defined correctly, check the service state and network connectivity between Nomad and Zabbix.	`find(/HashiCorp Nomad Server by HTTP/nomad.server.stats.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0`	Average	Manual close: Yes Depends on: HashiCorp Nomad Server: Monitoring API connection has failed
HashiCorp Nomad Server: Nomad server version has changed	Nomad server version has changed.	`change(/HashiCorp Nomad Server by HTTP/nomad.server.version)<>0`	Info	Manual close: Yes
HashiCorp Nomad Server: Cluster role has changed	Cluster role has changed.	`change(/HashiCorp Nomad Server by HTTP/nomad.server.raft.cluster_role) <> 0`	Info	Manual close: Yes
HashiCorp Nomad Server: Current number of open files is too high	Heavy file descriptor usage (i.e., near the process file descriptor limit) indicates a potential file descriptor exhaustion issue.	`min(/HashiCorp Nomad Server by HTTP/nomad.server.process_open_fds,5m)/last(/HashiCorp Nomad Server by HTTP/nomad.server.process_max_fds)*100>{$NOMAD.OPEN.FDS.MAX}`	Warning
HashiCorp Nomad Server: Dead jobs found	Jobs with the `Dead` state discovered. Check the {$NOMAD.SERVER.API.SCHEME}://{HOST.IP}:{$NOMAD.SERVER.API.PORT}/v1/jobs URL for the details.	`last(/HashiCorp Nomad Server by HTTP/nomad.server.job_status.dead) > 0 and nodata(/HashiCorp Nomad Server by HTTP/nomad.server.job_status.dead,5m) = 0`	Warning	Manual close: Yes
HashiCorp Nomad Server: Leader last contact timeout exceeded	The nomad.raft.leader.lastContact metric is a general indicator of Raft latency, which can be used to observe how Raft timing is performing and to guide infrastructure provisioning. If this number trends upwards, look at CPU, disk IOPS, and network latency. The nomad.raft.leader.lastContact value should not get too close to the leader lease timeout of 500ms.	`min(/HashiCorp Nomad Server by HTTP/nomad.server.raft.leader.lastContact,5m) >= {$NOMAD.SERVER.LEADER.LATENCY} and nodata(/HashiCorp Nomad Server by HTTP/nomad.server.raft.leader.lastContact,5m) = 0`	Warning
HashiCorp Nomad Server: Service [rpc] is down	Cannot establish the connection to [rpc] service port {$NOMAD.SERVER.RPC.PORT}. Check the Nomad state and network connectivity between Nomad and Zabbix.	`last(/HashiCorp Nomad Server by HTTP/net.tcp.service[tcp,,{$NOMAD.SERVER.RPC.PORT}]) = 0`	Average	Manual close: Yes
HashiCorp Nomad Server: Service [serf] is down	Cannot establish the connection to [serf] service port {$NOMAD.SERVER.SERF.PORT}. Check the Nomad state and network connectivity between Nomad and Zabbix.	`last(/HashiCorp Nomad Server by HTTP/net.tcp.service[tcp,,{$NOMAD.SERVER.SERF.PORT}]) = 0`	Average	Manual close: Yes
HashiCorp Nomad Server: Autopilot is unhealthy	The autopilot is in an unhealthy state. The probability of a successful failover is extremely low.	`last(/HashiCorp Nomad Server by HTTP/nomad.server.autopilot.state) = 0 and nodata(/HashiCorp Nomad Server by HTTP/nomad.server.autopilot.state,5m) = 0`	Average	Manual close: Yes
HashiCorp Nomad Server: Autopilot redundancy is low	The autopilot redundancy is low. The risk of a cluster crash is high: one more server failure could cause an outage.	`last(/HashiCorp Nomad Server by HTTP/nomad.server.autopilot.failure_tolerance) < {$NOMAD.REDUNDANCY.MIN} and nodata(/HashiCorp Nomad Server by HTTP/nomad.server.autopilot.failure_tolerance,5m) = 0`	Warning	Manual close: Yes
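
The open files trigger compares the lowest open descriptor count seen in the last 5 minutes with the latest descriptor limit. A small Python sketch of that arithmetic shows why a short spike alone does not fire it. The function name is hypothetical, and the threshold passed in stands for whatever {$NOMAD.OPEN.FDS.MAX} is set to.

```python
def open_fds_too_high(open_fds_last_5m, max_fds, threshold_pct):
    """Approximates min(open_fds,5m) / last(max_fds) * 100 > threshold.

    open_fds_last_5m: samples of nomad.server.process_open_fds from the window.
    max_fds: latest value of nomad.server.process_max_fds.
    """
    if not open_fds_last_5m or max_fds <= 0:
        # Assumption: missing data means "cannot evaluate", not a problem.
        return False
    lowest = min(open_fds_last_5m)
    return lowest / max_fds * 100 > threshold_pct
```

With a limit of 1024 and a threshold of 90, the samples `[300, 990, 310]` give 300 / 1024 * 100 ≈ 29.3 %, so the trigger stays OK; it fires only when every sample in the window is at least 922 descriptors.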

Feedback

Please report any issues with the template at https://support.zabbix.com

You can also provide feedback, discuss the template, or ask for help on the Zabbix forums.

This template is for Zabbix version: 6.0

Also available for: 7.0 6.4

Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/nomad?at=release/6.0

HashiCorp Nomad by HTTP

Overview

This template is designed for monitoring HashiCorp Nomad with Zabbix. It works without any external scripts. Currently, the template supports discovery of Nomad servers and clients.

Requirements

Zabbix version: 6.0 and higher.

Tested versions

This template has been tested on:

HashiCorp Nomad version 1.5.6/1.6.0

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

1. Create a synthetic Nomad host. It should point to one of the Nomad cluster members, a load-balancing service (if a cluster is used), or a single node in the selected Nomad region.
2. Define the {$NOMAD.ENDPOINT.API.URL} macro value with the correct web protocol, host, and port.
3. Prepare an ACL token with node:read, namespace:read-job, agent:read, and management permissions applied. Define the {$NOMAD.TOKEN} macro value.

Refer to the vendor documentation about Nomad native ACL or Nomad Vault-generated tokens if you have the HashiCorp Vault integration configured.
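
Before entering the token into {$NOMAD.TOKEN}, it can help to call the API manually with that token. Nomad's HTTP API reads the ACL token from the `X-Nomad-Token` request header. The Python sketch below (standard library only; the helper names are hypothetical) builds such a request against {$NOMAD.ENDPOINT.API.URL}. Paths such as `/v1/nodes`, `/v1/agent/members` and `/v1/operator/raft/configuration` are examples to try, not a statement of exactly which calls the template makes.

```python
import json
import urllib.request


def nomad_request(endpoint_url, path, token=None):
    """Build (but do not send) a GET request to the Nomad HTTP API."""
    url = endpoint_url.rstrip("/") + path
    request = urllib.request.Request(url, method="GET")
    if token:  # leave the header out when ACLs are disabled
        request.add_header("X-Nomad-Token", token)
    return request


def fetch_json(request, timeout=15):
    """Send the request and decode the JSON body; HTTP errors raise HTTPError."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For instance, `fetch_json(nomad_request("http://localhost:4646", "/v1/agent/members", token))` should return the member list; a `403` error typically means the token lacks one of the required permissions.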

Additional information:

The synthetic Nomad host is used only as an endpoint for server and client discovery (general cluster information). It is not monitored as a Nomad server or client, which prevents duplicate entities.
If you're not using ACL, skip the 3rd setup step.
Nomad server/client discovery is limited to one region. If you're using a multi-region cluster, create one synthetic host per region.
The Nomad server/client templates can also be used separately. Feel free to use them if you prefer creating hosts manually.

Macros used

Name	Description	Default
{$NOMAD.ENDPOINT.API.URL}	API endpoint URL for one of the Nomad cluster members.	`http://localhost:4646`
{$NOMAD.TOKEN}	Nomad authentication token.	`<PUT YOUR AUTH TOKEN>`
{$NOMAD.DATA.TIMEOUT}	Response timeout for an API.	`15s`
{$NOMAD.HTTP.PROXY}	Sets the HTTP proxy for script and HTTP agent items. If this parameter is empty, then no proxy is used.
{$NOMAD.API.RESPONSE.SUCCESS}	Successful HTTP API response code, used as the threshold of the availability triggers. Change if needed.	`200`
{$NOMAD.SERVER.NAME.MATCHES}	The filter to include HashiCorp Nomad servers by name.	`.*`
{$NOMAD.SERVER.NAME.NOT_MATCHES}	The filter to exclude HashiCorp Nomad servers by name.	`CHANGE_IF_NEEDED`
{$NOMAD.SERVER.DC.MATCHES}	The filter to include HashiCorp Nomad servers by datacenter.	`.*`
{$NOMAD.SERVER.DC.NOT_MATCHES}	The filter to exclude HashiCorp Nomad servers by datacenter.	`CHANGE_IF_NEEDED`
{$NOMAD.CLIENT.NAME.MATCHES}	The filter to include HashiCorp Nomad clients by name.	`.*`
{$NOMAD.CLIENT.NAME.NOT_MATCHES}	The filter to exclude HashiCorp Nomad clients by name.	`CHANGE_IF_NEEDED`
{$NOMAD.CLIENT.DC.MATCHES}	The filter to include HashiCorp Nomad clients by datacenter.	`.*`
{$NOMAD.CLIENT.DC.NOT_MATCHES}	The filter to exclude HashiCorp Nomad clients by datacenter.	`CHANGE_IF_NEEDED`
{$NOMAD.CLIENT.SCHEDULE.ELIGIBILITY.MATCHES}	The filter to include HashiCorp Nomad clients by scheduling eligibility.	`.*`
{$NOMAD.CLIENT.SCHEDULE.ELIGIBILITY.NOT_MATCHES}	The filter to exclude HashiCorp Nomad clients by scheduling eligibility.	`CHANGE_IF_NEEDED`
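
The `*.MATCHES` / `*.NOT_MATCHES` pairs act as include and exclude regular expressions in the discovery filters: an entity is kept when its value matches the first and does not match the second. The default `CHANGE_IF_NEEDED` never occurs in a real name, so nothing is excluded by default. A rough Python model of that logic follows; the record field names are hypothetical, and Python's `re.search` stands in for the regular expression engine Zabbix uses.

```python
import re


def lld_filter(entries, field, matches=".*", not_matches="CHANGE_IF_NEEDED"):
    """Keep entries whose `field` matches `matches` and not `not_matches`.

    re.search is used, so a pattern only has to match part of the value,
    in the same way an unanchored regular expression would.
    """
    include = re.compile(matches)
    exclude = re.compile(not_matches)
    return [
        entry
        for entry in entries
        if include.search(entry[field]) and not exclude.search(entry[field])
    ]
```

Setting {$NOMAD.CLIENT.NAME.NOT_MATCHES} to `^test-`, for example, would correspond to `lld_filter(clients, "Name", not_matches="^test-")`, dropping every client whose name starts with `test-`.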

Items

Name	Description	Type	Key and additional info
HashiCorp Nomad: Nomad clients get	Nomad clients data in raw format.	HTTP agent	nomad.client.nodes.get Preprocessing Check for not supported value ⛔️Custom on fail: Set value to: `{"header":{"HTTP/1.1 408 Request timeout":""}}`
HashiCorp Nomad: Client nodes API response	Client nodes API response message.	Dependent item	nomad.client.nodes.api.response Preprocessing JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`
HashiCorp Nomad: Nomad servers get	Nomad servers data in raw format.	Script	nomad.server.nodes.get
HashiCorp Nomad: Server-related APIs response	Server-related (`operator/raft/configuration`, `agent/members`) APIs error response message.	Dependent item	nomad.server.api.response Preprocessing JSON Path: `$.error` ⛔️Custom on fail: Set value to: `HTTP/1.1 200 OK` Discard unchanged with heartbeat: `1h`
HashiCorp Nomad: Region	Current cluster region.	Dependent item	nomad.region Preprocessing JSON Path: `$..region.first()`
HashiCorp Nomad: Nomad servers count	Nomad servers count.	Dependent item	nomad.servers.count Preprocessing JSON Path: `$[?(@.Name)].length()`
HashiCorp Nomad: Nomad clients count	Nomad clients count.	Dependent item	nomad.clients.count Preprocessing JSON Path: `$.body[?(@.Name)].length()`
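
The two count items rely on the JSONPath filter `[?(@.Name)]`, which keeps only the array elements that have a `Name` field, followed by `length()`. The client variant first steps into `$.body` because the HTTP agent item stores the response as a `header`/`body` object (compare the `{"header":...}` fallback value of the raw item). A Python approximation, with hypothetical helper names:

```python
def count_named(items):
    """Approximates $[?(@.Name)].length(): count elements with a Name key."""
    return sum(1 for item in items if isinstance(item, dict) and "Name" in item)


def clients_count(http_item_value):
    """Approximates $.body[?(@.Name)].length() on the HTTP agent item value.

    A value without "body" (such as the 408 fallback) raises KeyError here,
    roughly where the JSONPath step would fail in Zabbix.
    """
    return count_named(http_item_value["body"])
```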

Triggers

Name	Description	Expression	Severity	Dependencies and additional info
HashiCorp Nomad: Client nodes API connection has failed	Client nodes API connection has failed. Ensure that Nomad API URL and the necessary permissions have been defined correctly, check the service state and network connectivity between Nomad and Zabbix.	`find(/HashiCorp Nomad by HTTP/nomad.client.nodes.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0`	Average	Manual close: Yes
HashiCorp Nomad: Server-related API connection has failed	Server-related API connection has failed. Ensure that Nomad API URL and the necessary permissions have been defined correctly, check the service state and network connectivity between Nomad and Zabbix.	`find(/HashiCorp Nomad by HTTP/nomad.server.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0`	Average	Manual close: Yes
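
Both triggers above depend on a response message item. For the server-related APIs, the JSONPath step `$.error` extracts an error message when one is present; when it is absent, the step fails and "Custom on fail" substitutes `HTTP/1.1 200 OK`. The trigger then uses `find(...,"like",...)`, which is a substring test against the latest value. A Python sketch of the two steps, assuming the script item reports failures in an `error` field as the JSONPath implies (function names are hypothetical):

```python
def response_message(item_value, success_message="HTTP/1.1 200 OK"):
    """Approximates JSON Path $.error with "Custom on fail: Set value to"."""
    if not isinstance(item_value, dict) or "error" not in item_value:
        return success_message  # the JSONPath step failed: use the fallback
    return str(item_value["error"])


def connection_failed(last_message, success_code="200"):
    """Approximates find(/host/key,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0.

    "like" checks whether the value contains the pattern, so the problem
    starts when the latest message does not contain the success code.
    """
    return success_code not in last_message
```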

LLD rule Clients discovery

Name	Description	Type	Key and additional info
Clients discovery	Client nodes discovery.	Dependent item	nomad.clients.discovery Preprocessing JSON Path: `$.body` ⛔️Custom on fail: Discard value Discard unchanged with heartbeat: `1h`

LLD rule Servers discovery

Name	Description	Type	Key and additional info
Servers discovery	Server nodes discovery.	Dependent item	nomad.servers.discovery Preprocessing Check for error in JSON: `$.error` ⛔️Custom on fail: Discard value Discard unchanged with heartbeat: `1h`

HashiCorp Nomad Client by HTTP

Overview

This template is designed for monitoring HashiCorp Nomad clients with Zabbix. It works without any external scripts.

Requirements

Zabbix version: 6.0 and higher.

Tested versions

This template has been tested on:

HashiCorp Nomad version 1.5.6/1.6.0

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

1. Enable telemetry in the HashiCorp Nomad agent configuration file and set the Prometheus metrics format.

Refer to the vendor documentation.

2. Prepare an ACL token with node:read and namespace:read-job permissions applied. Define the {$NOMAD.TOKEN} macro value.

Refer to the vendor documentation about Nomad native ACL or Nomad Vault-generated tokens if you're using the HashiCorp Vault integration.

3. Set the values of the {$NOMAD.CLIENT.API.SCHEME} and {$NOMAD.CLIENT.API.PORT} macros to define the common Nomad API web scheme and connection port.

Additional information:

An additional ACL token is needed only if you want to monitor Nomad clients as separate entities. If you're using client discovery, the token is inherited from the master host linked to the HashiCorp Nomad by HTTP template.
If you're not using ACL, skip the 2nd setup step.
Nomad clients use the default web scheme (HTTP) and the default API port (4646). If you're using client discovery and need to redefine these macros for a particular host created from a prototype, use context macros such as {$NOMAD.CLIENT.API.SCHEME:NECESSARY.IP} and/or {$NOMAD.CLIENT.API.PORT:NECESSARY.IP} at the master host or template level.
Some metrics may not be collected depending on your HashiCorp Nomad agent version and configuration.
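
The context macro fallback described above can be modelled in a few lines: a macro with a matching context wins, otherwise the plain macro applies. The Python sketch below assumes that the context value is the client IP address (as the NECESSARY.IP placeholder suggests) and that client telemetry is read from Nomad's `/v1/metrics?format=prometheus` endpoint. Both assumptions, and the helper names, are illustrative only.

```python
def resolve_macro(macros, name, context=None):
    """Prefer {$NAME:context} when it is defined, else fall back to {$NAME}."""
    if context is not None:
        contextual = "{$%s:%s}" % (name, context)
        if contextual in macros:
            return macros[contextual]
    return macros["{$%s}" % name]


def client_metrics_url(macros, ip):
    """Build the telemetry URL for one discovered client (hypothetical helper)."""
    scheme = resolve_macro(macros, "NOMAD.CLIENT.API.SCHEME", ip)
    port = resolve_macro(macros, "NOMAD.CLIENT.API.PORT", ip)
    return f"{scheme}://{ip}:{port}/v1/metrics?format=prometheus"
```

With `{$NOMAD.CLIENT.API.PORT:10.0.0.7}` defined as `14646`, only that client is polled on port 14646; every other client keeps the default 4646.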

Macros used

Name	Description	Default
{$NOMAD.CLIENT.API.SCHEME}	Nomad client API scheme.	`http`
{$NOMAD.CLIENT.API.PORT}	Nomad client API port.	`4646`
{$NOMAD.TOKEN}	Nomad authentication token.	`<PUT YOUR AUTH TOKEN>`
{$NOMAD.DATA.TIMEOUT}	Response timeout for an API.	`15s`
{$NOMAD.HTTP.PROXY}	Sets the HTTP proxy for HTTP agent items. If this parameter is empty, then no proxy is used.
{$NOMAD.API.RESPONSE.SUCCESS}	Successful HTTP API response code, used as the threshold of the availability triggers. Change if needed.	`200`
{$NOMAD.CLIENT.RPC.PORT}	Nomad RPC service port.	`4647`
{$NOMAD.CLIENT.SERF.PORT}	Nomad serf service port.	`4648`
{$NOMAD.CLIENT.OPEN.FDS.MAX.WARN}	Maximum percentage of used file descriptors.	`90`
{$NOMAD.DISK.NAME.MATCHES}	The filter to include HashiCorp Nomad client disks by name.	`.*`
{$NOMAD.DISK.NAME.NOT_MATCHES}	The filter to exclude HashiCorp Nomad client disks by name.	`CHANGE_IF_NEEDED`
{$NOMAD.JOB.NAME.MATCHES}	The filter to include HashiCorp Nomad client jobs by name.	`.*`
{$NOMAD.JOB.NAME.NOT_MATCHES}	The filter to exclude HashiCorp Nomad client jobs by name.	`CHANGE_IF_NEEDED`
{$NOMAD.JOB.NAMESPACE.MATCHES}	The filter to include HashiCorp Nomad client jobs by namespace.	`.*`
{$NOMAD.JOB.NAMESPACE.NOT_MATCHES}	The filter to exclude HashiCorp Nomad client jobs by namespace.	`CHANGE_IF_NEEDED`
{$NOMAD.JOB.TYPE.MATCHES}	The filter to include HashiCorp Nomad client jobs by type.	`.*`
{$NOMAD.JOB.TYPE.NOT_MATCHES}	The filter to exclude HashiCorp Nomad client jobs by type.	`CHANGE_IF_NEEDED`
{$NOMAD.JOB.TASK.GROUP.MATCHES}	The filter to include HashiCorp Nomad client jobs by task group.	`.*`
{$NOMAD.JOB.TASK.GROUP.NOT_MATCHES}	The filter to exclude HashiCorp Nomad client jobs by task group.	`CHANGE_IF_NEEDED`
{$NOMAD.DRIVER.NAME.MATCHES}	The filter to include HashiCorp Nomad client drivers by name.	`.*`
{$NOMAD.DRIVER.NAME.NOT_MATCHES}	The filter to exclude HashiCorp Nomad client drivers by name.	`CHANGE_IF_NEEDED`
{$NOMAD.DRIVER.DETECT.MATCHES}	The filter to include HashiCorp Nomad client drivers by detection state. Possible filtering values: `true`, `false`.	`.*`
{$NOMAD.DRIVER.DETECT.NOT_MATCHES}	The filter to exclude HashiCorp Nomad client drivers by detection state. Possible filtering values: `true`, `false`.	`CHANGE_IF_NEEDED`
{$NOMAD.CPU.UTIL.MIN}	CPU utilization threshold. Measured as a percentage.	`90`
{$NOMAD.RAM.AVAIL.MIN}	Minimum available RAM threshold. Measured as a percentage.	`5`
{$NOMAD.INODES.FREE.MIN.WARN}	Warning threshold of free filesystem inodes (metadata capacity). Measured as a percentage.	`20`
{$NOMAD.INODES.FREE.MIN.CRIT}	Critical threshold of free filesystem inodes (metadata capacity). Measured as a percentage.	`10`

Items

Name	Description	Type	Key and additional info
HashiCorp Nomad Client: Telemetry get	Telemetry data in raw format.	HTTP agent	nomad.client.data.get Preprocessing Check for not supported value ⛔️Custom on fail: Set value to: `{"header":{"HTTP/1.1 408 Request timeout":""}}`
HashiCorp Nomad Client: Metrics	Nomad client metrics in raw format.	Dependent item	nomad.client.metrics.get Preprocessing JSON Path: `$.body` ⛔️Custom on fail: Discard value
HashiCorp Nomad Client: Monitoring API response	Monitoring API response message.	Dependent item	nomad.client.data.api.response Preprocessing JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`
HashiCorp Nomad Client: Service [rpc] state	Current [rpc] service state.	Simple check	net.tcp.service[tcp,,{$NOMAD.CLIENT.RPC.PORT}] Preprocessing Discard unchanged with heartbeat: `1h`
HashiCorp Nomad Client: Service [serf] state	Current [serf] service state.	Simple check	net.tcp.service[tcp,,{$NOMAD.CLIENT.SERF.PORT}] Preprocessing Discard unchanged with heartbeat: `1h`
HashiCorp Nomad Client: CPU allocated	Total amount of CPU shares the scheduler has allocated to tasks.	Dependent item	nomad.client.allocated.cpu Preprocessing Prometheus pattern: `VALUE(nomad_client_allocated_cpu)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Client: CPU unallocated	Total amount of CPU shares free for the scheduler to allocate to tasks.	Dependent item	nomad.client.unallocated.cpu Preprocessing Prometheus pattern: `VALUE(nomad_client_unallocated_cpu)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Client: Memory allocated	Total amount of memory the scheduler has allocated to tasks.	Dependent item	nomad.client.allocated.memory Preprocessing Prometheus pattern: `VALUE(nomad_client_allocated_memory)` ⛔️Custom on fail: Discard value Custom multiplier: `1.0E+6`
HashiCorp Nomad Client: Memory unallocated	Total amount of memory free for the scheduler to allocate to tasks.	Dependent item	nomad.client.unallocated.memory Preprocessing Prometheus pattern: `VALUE(nomad_client_unallocated_memory)` ⛔️Custom on fail: Discard value Custom multiplier: `1.0E+6`
HashiCorp Nomad Client: Disk allocated	Total amount of disk space the scheduler has allocated to tasks.	Dependent item	nomad.client.allocated.disk Preprocessing Prometheus pattern: `VALUE(nomad_client_allocated_disk)` ⛔️Custom on fail: Discard value Custom multiplier: `1.0E+6`
HashiCorp Nomad Client: Disk unallocated	Total amount of disk space free for the scheduler to allocate to tasks.	Dependent item	nomad.client.unallocated.disk Preprocessing Prometheus pattern: `VALUE(nomad_client_unallocated_disk)` ⛔️Custom on fail: Discard value Custom multiplier: `1.0E+6`
HashiCorp Nomad Client: Allocations blocked	Number of allocations waiting for previous versions.	Dependent item	nomad.client.allocations.blocked Preprocessing Prometheus pattern: `VALUE(nomad_client_allocations_blocked)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Client: Allocations migrating	Number of allocations migrating data from previous versions.	Dependent item	nomad.client.allocations.migrating Preprocessing Prometheus pattern: `VALUE(nomad_client_allocations_migrating)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Client: Allocations pending	Number of allocations pending (received by the client but not yet running).	Dependent item	nomad.client.allocations.pending Preprocessing Prometheus pattern: `VALUE(nomad_client_allocations_pending)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Client: Allocations starting	Number of allocations starting.	Dependent item	nomad.client.allocations.start Preprocessing Prometheus pattern: `VALUE(nomad_client_allocations_start)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Client: Allocations running	Number of allocations running.	Dependent item	nomad.client.allocations.running Preprocessing Prometheus pattern: `VALUE(nomad_client_allocations_running)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Client: Allocations terminal	Number of allocations in a terminal state.	Dependent item	nomad.client.allocations.terminal Preprocessing Prometheus pattern: `VALUE(nomad_client_allocations_terminal)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Client: Allocations failed, rate	Number of allocations failed.	Dependent item	nomad.client.allocations.failed Preprocessing Prometheus pattern: `SUM(nomad_client_allocs_failed)` ⛔️Custom on fail: Set value to: `0` Change per second Discard unchanged with heartbeat: `1h`
HashiCorp Nomad Client: Allocations completed, rate	Number of allocations completed.	Dependent item	nomad.client.allocations.complete Preprocessing Prometheus pattern: `SUM(nomad_client_allocs_complete)` ⛔️Custom on fail: Set value to: `0` Change per second Discard unchanged with heartbeat: `1h`
HashiCorp Nomad Client: Allocations restarted, rate	Number of allocations restarted.	Dependent item	nomad.client.allocations.restart Preprocessing Prometheus pattern: `SUM(nomad_client_allocs_restart)` ⛔️Custom on fail: Set value to: `0` Change per second Discard unchanged with heartbeat: `1h`
HashiCorp Nomad Client: Allocations OOM killed	Number of allocations OOM killed.	Dependent item	nomad.client.allocations.oom_killed Preprocessing Prometheus pattern: `VALUE(nomad_client_allocs_oom_killed)` ⛔️Custom on fail: Set value to: `0` Discard unchanged with heartbeat: `1h`
HashiCorp Nomad Client: CPU idle utilization	CPU utilization in idle state.	Dependent item	nomad.client.cpu.idle Preprocessing Prometheus pattern: `AVG(nomad_client_host_cpu_idle)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Client: CPU system utilization	CPU utilization in system space.	Dependent item	nomad.client.cpu.system Preprocessing Prometheus pattern: `AVG(nomad_client_host_cpu_system)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Client: CPU total utilization	Total CPU utilization.	Dependent item	nomad.client.cpu.total Preprocessing Prometheus pattern: `AVG(nomad_client_host_cpu_total)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Client: CPU user utilization	CPU utilization in user space.	Dependent item	nomad.client.cpu.user Preprocessing Prometheus pattern: `AVG(nomad_client_host_cpu_user)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Client: Memory available	Total amount of memory available to processes which includes free and cached memory.	Dependent item	nomad.client.memory.available Preprocessing Prometheus pattern: `VALUE(nomad_client_host_memory_available)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Client: Memory free	Amount of memory which is free.	Dependent item	nomad.client.memory.free Preprocessing Prometheus pattern: `VALUE(nomad_client_host_memory_free)`
HashiCorp Nomad Client: Memory size	Total amount of physical memory on the node.	Dependent item	nomad.client.memory.total Preprocessing Prometheus pattern: `VALUE(nomad_client_host_memory_total)`
HashiCorp Nomad Client: Memory used	Amount of memory used by processes.	Dependent item	nomad.client.memory.used Preprocessing Prometheus pattern: `VALUE(nomad_client_host_memory_used)`
HashiCorp Nomad Client: Uptime	Uptime of the host running the Nomad client.	Dependent item	nomad.client.uptime Preprocessing Prometheus pattern: `VALUE(nomad_client_uptime)`
HashiCorp Nomad Client: Node info get	Node info data in raw format.	HTTP agent	nomad.client.node.info.get Preprocessing Check for not supported value ⛔️Custom on fail: Set value to: `{"header":{"HTTP/1.1 408 Request timeout":""}}`
HashiCorp Nomad Client: Nomad client version	Nomad client version.	Dependent item	nomad.client.version Preprocessing JSON Path: `$.body..Version.first()`
HashiCorp Nomad Client: Nodes API response	Nodes API response message.	Dependent item	nomad.client.node.info.api.response Preprocessing JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`
HashiCorp Nomad Client: Allocated jobs get	Allocated jobs data in raw format.	HTTP agent	nomad.client.job.allocs.get Preprocessing Check for not supported value ⛔️Custom on fail: Set value to: `{"header":{"HTTP/1.1 408 Request timeout":""}}`
HashiCorp Nomad Client: Allocations API response	Allocations API response message.	Dependent item	nomad.client.job.allocs.api.response Preprocessing JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`
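
Items marked "Change per second" (for example, the failed, completed, and restarted allocation rates) store the difference between two consecutive counter readings divided by the time between them. A Python sketch of that step follows; the sample tuples and function name are hypothetical, and the counter-reset branch models Zabbix discarding a value that is smaller than the previous one.

```python
def change_per_second(previous, current):
    """Rate between two (unix_time_seconds, counter_value) samples.

    Returns None when the counter went down (for example after a Nomad
    restart), modelling a discarded value.
    """
    (t0, v0), (t1, v1) = previous, current
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    if v1 < v0:
        return None
    return (v1 - v0) / (t1 - t0)
```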

Triggers

Name	Description	Expression	Severity	Dependencies and additional info
HashiCorp Nomad Client: Monitoring API connection has failed	Monitoring API connection has failed. Ensure that Nomad API URL and the necessary permissions have been defined correctly, check the service state and network connectivity between Nomad and Zabbix.	`find(/HashiCorp Nomad Client by HTTP/nomad.client.data.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0`	Average	Manual close: Yes
HashiCorp Nomad Client: Service [rpc] is down	Cannot establish the connection to [rpc] service port {$NOMAD.CLIENT.RPC.PORT}. Check the Nomad state and network connectivity between Nomad and Zabbix.	`last(/HashiCorp Nomad Client by HTTP/net.tcp.service[tcp,,{$NOMAD.CLIENT.RPC.PORT}]) = 0`	Average	Manual close: Yes
HashiCorp Nomad Client: Service [serf] is down	Cannot establish the connection to [serf] service port {$NOMAD.CLIENT.SERF.PORT}. Check the Nomad state and network connectivity between Nomad and Zabbix.	`last(/HashiCorp Nomad Client by HTTP/net.tcp.service[tcp,,{$NOMAD.CLIENT.SERF.PORT}]) = 0`	Average	Manual close: Yes
HashiCorp Nomad Client: OOM killed allocations found	OOM killed allocations found.	`last(/HashiCorp Nomad Client by HTTP/nomad.client.allocations.oom_killed) > 0`	Warning	Manual close: Yes
HashiCorp Nomad Client: High CPU utilization	CPU utilization is too high. The system might be slow to respond.	`min(/HashiCorp Nomad Client by HTTP/nomad.client.cpu.total, 10m) >= {$NOMAD.CPU.UTIL.MIN}`	Average
HashiCorp Nomad Client: High memory utilization	RAM utilization is too high. The system might be slow to respond.	`(min(/HashiCorp Nomad Client by HTTP/nomad.client.memory.available, 10m) / last(/HashiCorp Nomad Client by HTTP/nomad.client.memory.total))*100 <= {$NOMAD.RAM.AVAIL.MIN}`	Average
HashiCorp Nomad Client: The host has been restarted	The host uptime is less than 10 minutes.	`last(/HashiCorp Nomad Client by HTTP/nomad.client.uptime) < 10m`	Warning	Manual close: Yes
HashiCorp Nomad Client: Nomad client version has changed	Nomad client version has changed.	`change(/HashiCorp Nomad Client by HTTP/nomad.client.version)<>0`	Info	Manual close: Yes
HashiCorp Nomad Client: Nodes API connection has failed	Nodes API connection has failed. Ensure that the Nomad API URL and the necessary permissions have been defined correctly. Check the service state and network connectivity between Nomad and Zabbix.	`find(/HashiCorp Nomad Client by HTTP/nomad.client.node.info.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0`	Average	Manual close: Yes Depends on: HashiCorp Nomad Client: Monitoring API connection has failed
HashiCorp Nomad Client: Allocations API connection has failed	Allocations API connection has failed. Ensure that the Nomad API URL and the necessary permissions have been defined correctly. Check the service state and network connectivity between Nomad and Zabbix.	`find(/HashiCorp Nomad Client by HTTP/nomad.client.job.allocs.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0`	Average	Manual close: Yes Depends on: HashiCorp Nomad Client: Monitoring API connection has failed
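
The CPU and memory triggers above both apply `min()` to a 10-minute window, but the effect differs between them. For CPU, `min() >= threshold` holds only if every sample in the window is at or above the threshold, so a short spike does not fire the trigger. For memory, `min()` selects the lowest free-memory reading, so a single low sample in the window is enough. The Python sketch below reproduces that arithmetic. The thresholds are plain numbers standing in for {$NOMAD.CPU.UTIL.MIN} and {$NOMAD.RAM.AVAIL.MIN}, and the values used in the example are hypothetical. The function names are illustrative helpers, not part of the template.

```python
from typing import Sequence


def cpu_trigger_fires(cpu_samples_pct: Sequence[float], cpu_util_min: float) -> bool:
    """min(nomad.client.cpu.total, 10m) >= {$NOMAD.CPU.UTIL.MIN}."""
    if not cpu_samples_pct:
        return False  # no data in the window: this sketch simply does not fire
    # True only if every sample in the window reached the threshold.
    return min(cpu_samples_pct) >= cpu_util_min


def memory_trigger_fires(available_samples_bytes: Sequence[float],
                         total_bytes: float,
                         ram_avail_min_pct: float) -> bool:
    """(min(memory.available, 10m) / last(memory.total)) * 100 <= {$NOMAD.RAM.AVAIL.MIN}."""
    if not available_samples_bytes:
        return False
    if total_bytes <= 0:
        raise ValueError("total memory must be positive")
    # Lowest free-memory reading in the window, as a percentage of total RAM.
    available_pct = min(available_samples_bytes) / total_bytes * 100
    return available_pct <= ram_avail_min_pct


if __name__ == "__main__":
    # Hypothetical thresholds: 90% CPU, 5% free RAM.
    print(cpu_trigger_fires([95, 92, 97], 90))       # True
    print(cpu_trigger_fires([95, 85, 97], 90))       # False: one sample below 90
    print(memory_trigger_fires([4e9, 3e8], 8e9, 5))  # True: 3.75% free at the low point
```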

LLD rule Drivers discovery

Name	Description	Type	Key and additional info
Drivers discovery	Client drivers discovery.	Dependent item	nomad.client.drivers.discovery Preprocessing JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`
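
This documentation does not reproduce the discovery JavaScript. As a hypothetical illustration only, assume the master item stores the HTTP response as a JSON object whose `body` holds the Nomad node object with its `Drivers` map. The `$.body..Drivers` JSONPath used by the item prototypes points in that direction. Under that assumption, turning the map into Zabbix LLD rows with a {#DRIVER.NAME} macro could look like this:

```python
import json


def discover_drivers(master_value: str) -> str:
    """Return a Zabbix LLD JSON array with one {#DRIVER.NAME} row per driver.

    Hypothetical sketch: assumes {"body": {"Drivers": {"<name>": {...}, ...}}}.
    """
    data = json.loads(master_value)
    drivers = data.get("body", {}).get("Drivers") or {}
    # Sorted for stable output; Zabbix does not require any particular order.
    rows = [{"{#DRIVER.NAME}": name} for name in sorted(drivers)]
    return json.dumps(rows)


if __name__ == "__main__":
    sample = json.dumps({"body": {"Drivers": {"exec": {}, "docker": {}}}})
    print(discover_drivers(sample))
```

The "Discard unchanged with heartbeat: `1h`" step means the discovery rule receives a new value only when the driver list changes, or at least once per hour.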

Item prototypes for Drivers discovery

Name	Description	Type	Key and additional info
HashiCorp Nomad Client: Driver [{#DRIVER.NAME}] state	Driver [{#DRIVER.NAME}] state.	Dependent item	nomad.client.driver.state["{#DRIVER.NAME}"] Preprocessing JSON Path: `$.body..Drivers.{#DRIVER.NAME}.Healthy.first()` Boolean to decimal Discard unchanged with heartbeat: `1h`
HashiCorp Nomad Client: Driver [{#DRIVER.NAME}] detection state	Driver [{#DRIVER.NAME}] detection state.	Dependent item	nomad.client.driver.detected["{#DRIVER.NAME}"] Preprocessing JSON Path: `$.body..Drivers.{#DRIVER.NAME}.Detected.first()` Boolean to decimal
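
Each prototype reads one boolean field of the driver entry and converts it to 1 or 0 with "Boolean to decimal". The Python sketch below mimics those two steps. It also adds the condition of the "is in unhealthy state" trigger prototype from the next table: state 0 while detection is 1. As a simplification, it assumes `Drivers` sits directly under `body`, whereas the real `..` JSONPath also matches deeper levels. `driver_flag` and `driver_unhealthy` are hypothetical names.

```python
import json


def driver_flag(master_value: str, driver: str, field: str) -> int:
    """Approximate `$.body..Drivers.<driver>.<field>.first()` + Boolean to decimal."""
    data = json.loads(master_value)
    # Simplification: look only at body.Drivers, not at every nesting level.
    value = data["body"]["Drivers"][driver][field]
    return 1 if value is True else 0


def driver_unhealthy(state: int, detected: int) -> bool:
    """Trigger prototype condition: the driver is detected but not healthy."""
    return state == 0 and detected == 1


if __name__ == "__main__":
    sample = json.dumps({"body": {"Drivers": {"docker": {"Detected": True, "Healthy": False}}}})
    state = driver_flag(sample, "docker", "Healthy")
    detected = driver_flag(sample, "docker", "Detected")
    print(state, detected, driver_unhealthy(state, detected))  # 0 1 True
```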

Trigger prototypes for Drivers discovery

Name	Description	Expression	Severity	Dependencies and additional info
HashiCorp Nomad Client: Driver [{#DRIVER.NAME}] is in unhealthy state	The [{#DRIVER.NAME}] driver is detected, but its state is unhealthy.	`last(/HashiCorp Nomad Client by HTTP/nomad.client.driver.state["{#DRIVER.NAME}"]) = 0 and last(/HashiCorp Nomad Client by HTTP/nomad.client.driver.detected["{#DRIVER.NAME}"]) = 1`	Warning	Manual close: Yes
HashiCorp Nomad Client: Driver [{#DRIVER.NAME}] detection state has changed	The [{#DRIVER.NAME}] driver detection state has changed.	`change(/HashiCorp Nomad Client by HTTP/nomad.client.driver.detected["{#DRIVER.NAME}"]) <> 0`	Info	Manual close: Yes

LLD rule Physical disks discovery

Name	Description	Type	Key and additional info
Physical disks discovery	Physical disks discovery.	Dependent item	nomad.client.disk.discovery Preprocessing Prometheus to JSON: `nomad_client_host_disk_available{disk=~".*"}`
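
The "Prometheus to JSON" step turns every series of `nomad_client_host_disk_available` into a JSON row. The discovery rule presumably builds {#DEV.NAME} from the `disk` label. The Python sketch below is a simplified stand-in for that step, not Zabbix's implementation:

- It keeps only `name`, `labels` and `value` per row.
- It skips `# HELP`/`# TYPE` comments.
- It does not handle label values containing `}`.

The extra `host` label in the sample input is illustrative.

```python
import re
from typing import Dict, List

# metric_name{label="value",...} value  (trailing timestamps, if any, are ignored)
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def prometheus_to_json(text: str, metric: str) -> List[Dict]:
    """Return one row per sample line of `metric` (simplified Prometheus to JSON)."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        rows.append({"name": metric, "labels": labels, "value": match.group("value")})
    return rows


def discover_disks(text: str) -> List[Dict[str, str]]:
    """LLD rows with {#DEV.NAME} taken from the `disk` label."""
    rows = prometheus_to_json(text, "nomad_client_host_disk_available")
    return [{"{#DEV.NAME}": row["labels"]["disk"]} for row in rows if "disk" in row["labels"]]


if __name__ == "__main__":
    sample = (
        "# TYPE nomad_client_host_disk_available gauge\n"
        'nomad_client_host_disk_available{disk="/dev/sda1",host="node1"} 1.2e+10\n'
        'nomad_client_host_disk_size{disk="/dev/sda1",host="node1"} 2e+10\n'
    )
    print(discover_disks(sample))  # [{'{#DEV.NAME}': '/dev/sda1'}]
```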

Item prototypes for Physical disks discovery

Name	Description	Type	Key and additional info
HashiCorp Nomad Client: Disk ["{#DEV.NAME}"] space available	Amount of space which is available on ["{#DEV.NAME}"] disk.	Dependent item	nomad.client.disk.available["{#DEV.NAME}"] Preprocessing Prometheus pattern: `VALUE(nomad_client_host_disk_available{disk="{#DEV.NAME}"})`
HashiCorp Nomad Client: Disk ["{#DEV.NAME}"] inodes utilization	Disk space consumed by the inodes on ["{#DEV.NAME}"] disk.	Dependent item	nomad.client.disk.inodes_percent["{#DEV.NAME}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
HashiCorp Nomad Client: Disk ["{#DEV.NAME}"] size	Total size of the ["{#DEV.NAME}"] device.	Dependent item	nomad.client.disk.size["{#DEV.NAME}"] Preprocessing Prometheus pattern: `VALUE(nomad_client_host_disk_size{disk="{#DEV.NAME}"})`
HashiCorp Nomad Client: Disk ["{#DEV.NAME}"] space utilization	Percentage of disk ["{#DEV.NAME}"] space used.	Dependent item	nomad.client.disk.used_percent["{#DEV.NAME}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
HashiCorp Nomad Client: Disk ["{#DEV.NAME}"] space used	Amount of disk ["{#DEV.NAME}"] space which has been used.	Dependent item	nomad.client.disk.used["{#DEV.NAME}"] Preprocessing Prometheus pattern: `VALUE(nomad_client_host_disk_used{disk="{#DEV.NAME}"})`

Trigger prototypes for Physical disks discovery

Name	Description	Expression	Severity	Dependencies and additional info
HashiCorp Nomad Client: Running out of free inodes on [{#DEV.NAME}] device	It may become impossible to write to a disk if there are no index nodes left. The following error messages may be returned as symptoms, even though free space is still available: - No space left on device; - Disk is full.	`min(/HashiCorp Nomad Client by HTTP/nomad.client.disk.inodes_percent["{#DEV.NAME}"],5m) >= {$NOMAD.INODES.FREE.MIN.WARN:"{#DEV.NAME}"}`	Warning	Manual close: Yes Depends on: HashiCorp Nomad Client: Running out of free inodes on [{#DEV.NAME}] device
HashiCorp Nomad Client: Running out of free inodes on [{#DEV.NAME}] device	It may become impossible to write to a disk if there are no index nodes left. The following error messages may be returned as symptoms, even though free space is still available: - No space left on device; - Disk is full.	`min(/HashiCorp Nomad Client by HTTP/nomad.client.disk.inodes_percent["{#DEV.NAME}"],5m) >= {$NOMAD.INODES.FREE.MIN.CRIT:"{#DEV.NAME}"}`	Average	Manual close: Yes
HashiCorp Nomad Client: High disk [{#DEV.NAME}] utilization	High disk [{#DEV.NAME}] utilization.	`min(/HashiCorp Nomad Client by HTTP/nomad.client.disk.used_percent["{#DEV.NAME}"],5m) >= {$NOMAD.DISK.UTIL.MIN.WARN:"{#DEV.NAME}"}`	Warning	Manual close: Yes Depends on: HashiCorp Nomad Client: High disk [{#DEV.NAME}] utilization
HashiCorp Nomad Client: High disk [{#DEV.NAME}] utilization	High disk [{#DEV.NAME}] utilization.	`min(/HashiCorp Nomad Client by HTTP/nomad.client.disk.used_percent["{#DEV.NAME}"],5m) >= {$NOMAD.DISK.UTIL.MIN.CRIT:"{#DEV.NAME}"}`	Average	Manual close: Yes
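
Both pairs of triggers use user macros with the device as context, for example {$NOMAD.INODES.FREE.MIN.WARN:"{#DEV.NAME}"}. In Zabbix, a macro defined with a matching context, such as {$NOMAD.INODES.FREE.MIN.CRIT:"/dev/sdb1"}, overrides the context-free macro for that device only. For every other device, the plain macro value applies.

The Python sketch below models that lookup and the inode trigger pair. In that pair, the Warning trigger depends on the same-named Average trigger, so only the more severe problem is reported. The threshold numbers and device names are hypothetical, because the template defaults are not listed in this section.

```python
from typing import Dict, Optional, Sequence, Tuple

# (macro name, context or None) -> value
MacroTable = Dict[Tuple[str, Optional[str]], float]


def resolve_macro(macros: MacroTable, name: str, context: Optional[str] = None) -> float:
    """Use {$NAME:"context"} when defined, else fall back to the context-free {$NAME}."""
    if context is not None and (name, context) in macros:
        return macros[(name, context)]
    return macros[(name, None)]


def inode_severity(inodes_pct_samples: Sequence[float], device: str,
                   macros: MacroTable) -> Optional[str]:
    """Severity of the 'Running out of free inodes' triggers over a 5m window."""
    if not inodes_pct_samples:
        return None
    lowest = min(inodes_pct_samples)  # min(...,5m): every sample must reach the limit
    crit = resolve_macro(macros, "{$NOMAD.INODES.FREE.MIN.CRIT}", device)
    warn = resolve_macro(macros, "{$NOMAD.INODES.FREE.MIN.WARN}", device)
    if lowest >= crit:
        return "Average"  # the dependent Warning trigger is suppressed
    if lowest >= warn:
        return "Warning"
    return None


if __name__ == "__main__":
    macros = {
        ("{$NOMAD.INODES.FREE.MIN.WARN}", None): 80.0,         # hypothetical values
        ("{$NOMAD.INODES.FREE.MIN.CRIT}", None): 90.0,
        ("{$NOMAD.INODES.FREE.MIN.CRIT}", "/dev/sdb1"): 95.0,  # per-device override
    }
    print(inode_severity([91, 93], "/dev/sda1", macros))  # Average
    print(inode_severity([91, 93], "/dev/sdb1", macros))  # Warning
```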

LLD rule Allocated jobs discovery

Name	Description	Type	Key and additional info
Allocated jobs discovery	Allocated jobs discovery.	Dependent item	nomad.client.alloc.discovery Preprocessing JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`

Item prototypes for Allocated jobs discovery

Name	Description	Type	Key and additional info
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] CPU allocated	Total CPU resources allocated by the ["{#JOB.NAME}"] job across all cores.	Dependent item	nomad.client.allocs.cpu.allocated["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] CPU system utilization	Total CPU resources consumed by the ["{#JOB.NAME}"] job in system space.	Dependent item	nomad.client.allocs.cpu.system["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] CPU user utilization	Total CPU resources consumed by the ["{#JOB.NAME}"] job in user space.	Dependent item	nomad.client.allocs.cpu.user["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] CPU total utilization	Total CPU resources consumed by the ["{#JOB.NAME}"] job across all cores.	Dependent item	nomad.client.allocs.cpu.total_percent["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] CPU throttled periods time	Total number of CPU periods that the ["{#JOB.NAME}"] job was throttled.	Dependent item	nomad.client.allocs.cpu.throttled_periods["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.` Custom multiplier: `1e-09`
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] CPU throttled time	Total time that the ["{#JOB.NAME}"] job was throttled.	Dependent item	nomad.client.allocs.cpu.throttled_time["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] CPU ticks	CPU ticks consumed by the process for the ["{#JOB.NAME}"] job in the last collection interval.	Dependent item	nomad.client.allocs.cpu.total_ticks["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] Memory allocated	Amount of memory allocated by the ["{#JOB.NAME}"] job.	Dependent item	nomad.client.allocs.memory.allocated["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] Memory cached	Amount of memory cached by the ["{#JOB.NAME}"] job.	Dependent item	nomad.client.allocs.memory.cache["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] Memory used	Total amount of memory used by the ["{#JOB.NAME}"] job.	Dependent item	nomad.client.allocs.memory.usage["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
HashiCorp Nomad Client: Job ["{#JOB.NAME}"] Memory swapped	Amount of memory swapped by the ["{#JOB.NAME}"] job.	Dependent item	nomad.client.allocs.memory.swap["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] Preprocessing Prometheus pattern: `The text is too long. Please see the template.`
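
Every prototype in this rule is keyed by job name, task group and namespace, while the underlying Prometheus patterns are too long to show here. The Python below is a sketch built on assumptions: it assumes the allocation metrics carry `job`, `task_group` and `namespace` labels, and that one item value is the sum over all matching allocations on the client. The throttled-time prototypes also apply a custom multiplier of `1e-09`, which the sketch exposes as a parameter. The label names, the summing behaviour and `job_metric` itself are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple

# One parsed sample: (labels, value), e.g. produced by a Prometheus text parser.
Sample = Tuple[Dict[str, str], float]


def job_metric(samples: Iterable[Sample], job: str, task_group: str, namespace: str,
               multiplier: float = 1.0) -> Optional[float]:
    """Sum one allocation metric over all series for a job/task group/namespace."""
    wanted = {"job": job, "task_group": task_group, "namespace": namespace}
    total = 0.0
    matched = False
    for labels, value in samples:
        if all(labels.get(key) == val for key, val in wanted.items()):
            total += value
            matched = True
    if not matched:
        return None  # no allocation of this job on the client
    return total * multiplier


if __name__ == "__main__":
    samples = [
        ({"job": "web", "task_group": "frontend", "namespace": "default", "alloc_id": "a1"}, 2.0e9),
        ({"job": "web", "task_group": "frontend", "namespace": "default", "alloc_id": "a2"}, 5.0e8),
        ({"job": "db", "task_group": "storage", "namespace": "default", "alloc_id": "b1"}, 7.0),
    ]
    print(job_metric(samples, "web", "frontend", "default", multiplier=1e-09))
```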

HashiCorp Nomad Server by HTTP

Overview

This template is designed to monitor HashiCorp Nomad servers with Zabbix. It works without any external scripts.

Requirements

Zabbix version: 7.0 and higher.

Tested versions

This template has been tested on:

HashiCorp Nomad version 1.5.6/1.6.0

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

Enable telemetry in the HashiCorp Nomad agent configuration file and set the Prometheus metrics format.

Refer to the vendor documentation.

Set the values of the {$NOMAD.SERVER.API.SCHEME} and {$NOMAD.SERVER.API.PORT} macros to define the common Nomad API web scheme and connection port.
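
For manual testing, the Python sketch below builds an equivalent request. It rests on these assumptions:

- It uses the Nomad HTTP API endpoint `/v1/metrics?format=prometheus`, which returns Prometheus-formatted metrics once telemetry is configured that way.
- It sends the ACL token in the `X-Nomad-Token` header.
- The scheme and port presumably correspond to the two macros above, but the exact URL requested by the template's HTTP agent item is not shown in this section.

`build_metrics_request` and `fetch_metrics` are hypothetical helpers. The 15-second timeout mirrors the default of {$NOMAD.DATA.TIMEOUT}.

```python
import urllib.request
from typing import Optional, Tuple


def build_metrics_request(host: str, scheme: str = "http", port: int = 4646,
                          token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request for Nomad's Prometheus-formatted metrics."""
    url = f"{scheme}://{host}:{port}/v1/metrics?format=prometheus"
    request = urllib.request.Request(url, method="GET")
    if token:
        # Only needed when ACLs are enabled.
        request.add_header("X-Nomad-Token", token)
    return request


def fetch_metrics(request: urllib.request.Request, timeout: float = 15.0) -> Tuple[int, str]:
    """Perform the request and return (HTTP status, body text)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")
```

A successful call returns status 200, the value that {$NOMAD.API.RESPONSE.SUCCESS} treats as success by default. Note that `urllib` raises `HTTPError` for error statuses instead of returning them.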

Additional information:

The Nomad servers use the default web scheme (HTTP) and the default API port (4646). If you're using server discovery and need to redefine the macros for a particular host created from a prototype, use context macros such as {$NOMAD.SERVER.API.SCHEME:NECESSARY.IP} and/or {$NOMAD.SERVER.API.PORT:NECESSARY.IP} on the master host or template level.
Some metrics may not be collected, depending on your HashiCorp Nomad agent version, configuration and cluster role.
Define the {$NOMAD.REDUNDANCY.MIN} macro value based on the number of nodes in your cluster, so that the failure tolerance triggers are configured correctly.
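
Nomad servers replicate state with Raft, which needs a strict majority (quorum) of servers to remain available. A cluster of N servers can therefore lose at most N minus the quorum size. Deriving {$NOMAD.REDUNDANCY.MIN} from that tolerance is one reasonable approach, and it agrees with the default of 1 for a 3-node cluster. The helper name below is hypothetical.

```python
def failure_tolerance(servers: int) -> int:
    """Number of servers that can fail while a Raft quorum (strict majority) remains."""
    if servers < 1:
        raise ValueError("a cluster needs at least one server")
    quorum = servers // 2 + 1
    return servers - quorum


if __name__ == "__main__":
    for n in (1, 3, 4, 5, 7):
        print(n, failure_tolerance(n))  # pairs printed: 1 0 / 3 1 / 4 1 / 5 2 / 7 3
```

Adding a fourth server to a 3-node cluster does not raise the tolerance, which is why odd cluster sizes are usually recommended.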

Useful links:

Macros used

Name	Description	Default
{$NOMAD.SERVER.API.SCHEME}	Nomad server API scheme.	`http`
{$NOMAD.SERVER.API.PORT}	Nomad server API port.	`4646`
{$NOMAD.TOKEN}	Nomad authentication token.	`<PUT YOUR AUTH TOKEN>`
{$NOMAD.DATA.TIMEOUT}	Response timeout for an API.	`15s`
{$NOMAD.HTTP.PROXY}	Sets the HTTP proxy for the HTTP agent item. If this parameter is empty, no proxy is used.
{$NOMAD.API.RESPONSE.SUCCESS}	HTTP API successful response code; used as the threshold for the availability triggers. Change if needed.	`200`
{$NOMAD.SERVER.RPC.PORT}	Nomad RPC service port.	`4647`
{$NOMAD.SERVER.SERF.PORT}	Nomad serf service port.	`4648`
{$NOMAD.REDUNDANCY.MIN}	Number of redundant servers required to keep the cluster safe. The default value of '1' suits a 3-node cluster; change it if needed.	`1`
{$NOMAD.OPEN.FDS.MAX}	Maximum percentage of used file descriptors.	`90`
{$NOMAD.SERVER.LEADER.LATENCY}	Leader last contact latency threshold.	`0.3s`
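
{$NOMAD.OPEN.FDS.MAX} is a percentage, while the "Open file descriptors" and "Open file descriptors, max" items report absolute counts, so the two counts must be related first. The server trigger expressions are not listed in this section. The sketch below only illustrates the arithmetic, and its strict `>` comparison is an assumption.

```python
def open_fds_pct(open_fds: float, max_fds: float) -> float:
    """Share of the file descriptor limit in use, in percent."""
    if max_fds <= 0:
        raise ValueError("max_fds must be positive")
    return open_fds / max_fds * 100


def fds_near_limit(open_fds: float, max_fds: float, open_fds_max_pct: float = 90.0) -> bool:
    """True when usage exceeds {$NOMAD.OPEN.FDS.MAX} (default 90)."""
    return open_fds_pct(open_fds, max_fds) > open_fds_max_pct


if __name__ == "__main__":
    print(open_fds_pct(512, 1024))    # 50.0
    print(fds_near_limit(950, 1024))  # True (about 92.8%)
```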

Items

Name	Description	Type	Key and additional info
HashiCorp Nomad Server: Telemetry get	Telemetry data in raw format.	HTTP agent	nomad.server.data.get Preprocessing Check for not supported value ⛔️Custom on fail: Set value to: `{"header":{"HTTP/1.1 408 Request timeout":""}}`
HashiCorp Nomad Server: Metrics	Nomad server metrics in raw format.	Dependent item	nomad.server.metrics.get Preprocessing JSON Path: `$.body` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Monitoring API response	Monitoring API response message.	Dependent item	nomad.server.data.api.response Preprocessing JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`
HashiCorp Nomad Server: Internal stats get	Internal stats data in raw format.	HTTP agent	nomad.server.stats.get Preprocessing Check for not supported value ⛔️Custom on fail: Set value to: `{"header":{"HTTP/1.1 408 Request timeout":""}}`
HashiCorp Nomad Server: Internal stats API response	Internal stats API response message.	Dependent item	nomad.server.stats.api.response Preprocessing JavaScript: `The text is too long. Please see the template.` Discard unchanged with heartbeat: `1h`
HashiCorp Nomad Server: Nomad server version	Nomad server version.	Dependent item	nomad.server.version Preprocessing JSON Path: `$.body.config.Version.Version`
HashiCorp Nomad Server: Nomad raft version	Nomad raft version.	Dependent item	nomad.raft.version Preprocessing JSON Path: `$.body.stats.raft.protocol_version` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Raft peers	Current cluster raft peers amount.	Dependent item	nomad.server.raft.peers Preprocessing JSON Path: `$.body.stats.raft.num_peers` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Cluster role	Current role in the cluster.	Dependent item	nomad.server.raft.cluster_role Preprocessing JSON Path: `$.body.stats.raft.state` ⛔️Custom on fail: Discard value JavaScript: `The text is too long. Please see the template.`
HashiCorp Nomad Server: CPU time, rate	Total user and system CPU time spent in seconds.	Dependent item	nomad.server.cpu.time Preprocessing Prometheus pattern: `VALUE(process_cpu_seconds_total)` ⛔️Custom on fail: Discard value Change per second
HashiCorp Nomad Server: Memory used	Memory utilization in bytes.	Dependent item	nomad.server.runtime.alloc_bytes Preprocessing Prometheus pattern: `VALUE(nomad_runtime_alloc_bytes)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Virtual memory size	Virtual memory size in bytes.	Dependent item	nomad.server.virtual_memory_bytes Preprocessing Prometheus pattern: `VALUE(process_virtual_memory_bytes)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Resident memory size	Resident memory size in bytes.	Dependent item	nomad.server.resident_memory_bytes Preprocessing Prometheus pattern: `VALUE(process_resident_memory_bytes)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Heap objects	Number of objects on the heap. General memory pressure indicator.	Dependent item	nomad.server.runtime.heap_objects Preprocessing Prometheus pattern: `VALUE(nomad_runtime_heap_objects)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Open file descriptors	Number of open file descriptors.	Dependent item	nomad.server.process_open_fds Preprocessing Prometheus pattern: `VALUE(process_open_fds)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Open file descriptors, max	Maximum number of open file descriptors.	Dependent item	nomad.server.process_max_fds Preprocessing Prometheus pattern: `VALUE(process_max_fds)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Goroutines	Number of goroutines and general load pressure indicator.	Dependent item	nomad.server.runtime.num_goroutines Preprocessing Prometheus pattern: `VALUE(nomad_runtime_num_goroutines)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Evaluations pending	Evaluations that are pending until an existing evaluation for the same job completes.	Dependent item	nomad.server.broker.total_pending Preprocessing Prometheus pattern: `VALUE(nomad_nomad_broker_total_pending)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Evaluations ready	Number of evaluations ready to be processed.	Dependent item	nomad.server.broker.total_ready Preprocessing Prometheus pattern: `VALUE(nomad_nomad_broker_total_ready)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Evaluations unacked	Evaluations dispatched for processing but incomplete.	Dependent item	nomad.server.broker.total_unacked Preprocessing Prometheus pattern: `VALUE(nomad_nomad_broker_total_unacked)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: CPU shares for blocked evaluations	Amount of CPU shares requested by blocked evals.	Dependent item	nomad.server.blocked_evals.cpu Preprocessing Prometheus pattern: `VALUE(nomad_nomad_blocked_evals_cpu)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Memory shares by blocked evaluations	Amount of memory requested by blocked evals.	Dependent item	nomad.server.blocked_evals.memory Preprocessing Prometheus pattern: `VALUE(nomad_nomad_blocked_evals_memory)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: CPU shares for blocked job evaluations	Amount of CPU shares requested by blocked evals of a job.	Dependent item	nomad.server.blocked_evals.job.cpu Preprocessing Prometheus pattern: `VALUE(nomad_nomad_blocked_evals_job_cpu)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Memory shares for blocked job evaluations	Amount of memory requested by blocked evals of a job.	Dependent item	nomad.server.blocked_evals.job.memory Preprocessing Prometheus pattern: `VALUE(nomad_nomad_blocked_evals_job_memory)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Evaluations blocked	Count of evals in the blocked state for any reason (cluster resource exhaustion or quota limits).	Dependent item	nomad.server.blocked_evals.total_blocked Preprocessing Prometheus pattern: `VALUE(nomad_nomad_blocked_evals_total_blocked)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Evaluations escaped	Count of evals that have escaped computed node classes. This indicates a scheduler optimization was skipped and is not usually a source of concern.	Dependent item	nomad.server.blocked_evals.total_escaped Preprocessing Prometheus pattern: `VALUE(nomad_nomad_blocked_evals_total_escaped)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Evaluations waiting	Count of evals waiting to be enqueued.	Dependent item	nomad.server.broker.total_waiting Preprocessing Prometheus pattern: `VALUE(nomad_nomad_broker_total_waiting)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Evaluations blocked due to quota limit	Count of blocked evals due to quota limits (the resources for these jobs are not counted in other blocked_evals metrics, except for total_blocked).	Dependent item	nomad.server.blocked_evals.total_quota_limit Preprocessing Prometheus pattern: `VALUE(nomad_nomad_blocked_evals_total_quota_limit)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Evaluations enqueue time	Average time elapsed with evaluations waiting to be enqueued.	Dependent item	nomad.server.broker.eval_waiting Preprocessing Prometheus pattern: `AVG(nomad_nomad_eval_ack_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: RPC evaluation acknowledgement time	Time elapsed for Eval.Ack RPC call.	Dependent item	nomad.server.eval.ack Preprocessing Prometheus pattern: `VALUE(nomad_nomad_eval_ack_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: RPC job summary time	Time elapsed for Job.Summary RPC call.	Dependent item	nomad.server.job_summary.get_job_summary Preprocessing Prometheus pattern: `VALUE(nomad_nomad_job_summary_get_job_summary_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Heartbeats active	Number of active heartbeat timers. Each timer represents a Nomad client connection.	Dependent item	nomad.server.heartbeat.active Preprocessing Prometheus pattern: `VALUE(nomad_nomad_heartbeat_active)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: RPC requests, rate	Number of RPC requests being handled.	Dependent item	nomad.server.rpc.request Preprocessing Prometheus pattern: `VALUE(nomad_nomad_rpc_request)` ⛔️Custom on fail: Discard value Change per second
HashiCorp Nomad Server: RPC error requests, rate	Number of RPC requests being handled that result in an error.	Dependent item	nomad.server.rpc.request_error Preprocessing Prometheus pattern: `VALUE(nomad_nomad_rpc_request_error)` ⛔️Custom on fail: Discard value Change per second
HashiCorp Nomad Server: RPC queries, rate	Number of RPC queries.	Dependent item	nomad.server.rpc.query Preprocessing Prometheus pattern: `VALUE(nomad_nomad_rpc_query)` ⛔️Custom on fail: Discard value Change per second
HashiCorp Nomad Server: RPC job allocations time	Time elapsed for Job.Allocations RPC call.	Dependent item	nomad.server.job.allocations Preprocessing Prometheus pattern: `VALUE(nomad_nomad_job_allocations_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: RPC job evaluations time	Time elapsed for Job.Evaluations RPC call.	Dependent item	nomad.server.job.evaluations Preprocessing Prometheus pattern: `VALUE(nomad_nomad_job_evaluations_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: RPC get job time	Time elapsed for Job.GetJob RPC call.	Dependent item	nomad.server.job.get_job Preprocessing Prometheus pattern: `VALUE(nomad_nomad_job_get_job_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Plan apply time	Time elapsed to apply a plan.	Dependent item	nomad.server.plan.apply Preprocessing Prometheus pattern: `VALUE(nomad_nomad_plan_apply_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Plan evaluate time	Time elapsed to evaluate a plan.	Dependent item	nomad.server.plan.evaluate Preprocessing Prometheus pattern: `VALUE(nomad_nomad_plan_evaluate_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: RPC plan submit time	Time elapsed for Plan.Submit RPC call.	Dependent item	nomad.server.plan.submit Preprocessing Prometheus pattern: `VALUE(nomad_nomad_plan_submit_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Plan raft index processing time	Time elapsed that planner waits for the raft index of the plan to be processed.	Dependent item	nomad.server.plan.wait_for_index Preprocessing Prometheus pattern: `VALUE(nomad_nomad_plan_wait_for_index_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: RPC list time	Time elapsed for Node.List RPC call.	Dependent item	nomad.server.client.list Preprocessing Prometheus pattern: `VALUE(nomad_nomad_client_list_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: RPC update allocations time	Time elapsed for Node.UpdateAlloc RPC call.	Dependent item	nomad.server.client.update_alloc Preprocessing Prometheus pattern: `VALUE(nomad_nomad_client_update_alloc_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: RPC update status time	Time elapsed for Node.UpdateStatus RPC call.	Dependent item	nomad.server.client.update_status Preprocessing Prometheus pattern: `VALUE(nomad_nomad_client_update_status_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: RPC get client allocs time	Time elapsed for Node.GetClientAllocs RPC call.	Dependent item	nomad.server.client.get_client_allocs Preprocessing Prometheus pattern: `VALUE(nomad_nomad_client_get_client_allocs_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: RPC eval dequeue time	Time elapsed for Eval.Dequeue RPC call.	Dependent item	nomad.server.client.dequeue Preprocessing Prometheus pattern: `VALUE(nomad_nomad_eval_dequeue_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Vault token last renewal	Time since last successful Vault token renewal.	Dependent item	nomad.server.vault.token_last_renewal Preprocessing Prometheus pattern: `VALUE(nomad_nomad_vault_token_last_renewal)` ⛔️Custom on fail: Discard value Custom multiplier: `0.001`
HashiCorp Nomad Server: Vault token next renewal	Time until next Vault token renewal attempt.	Dependent item	nomad.server.vault.token_next_renewal Preprocessing Prometheus pattern: `VALUE(nomad_nomad_vault_token_next_renewal)` ⛔️Custom on fail: Discard value Custom multiplier: `0.001`
HashiCorp Nomad Server: Vault token TTL	Time to live for Vault token.	Dependent item	nomad.server.vault.token_ttl Preprocessing Prometheus pattern: `VALUE(nomad_nomad_vault_token_ttl)` ⛔️Custom on fail: Discard value Custom multiplier: `0.001`
HashiCorp Nomad Server: Vault tokens revoked	Count of revoked tokens.	Dependent item	nomad.server.vault.distributed_tokens_revoked Preprocessing Prometheus pattern: `VALUE(nomad_nomad_vault_distributed_tokens_revoking)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Jobs dead	Number of dead jobs.	Dependent item	nomad.server.job_status.dead Preprocessing Prometheus pattern: `VALUE(nomad_nomad_job_status_dead)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Server: Jobs pending	Number of pending jobs.	Dependent item	nomad.server.job_status.pending Preprocessing Prometheus pattern: `VALUE(nomad_nomad_job_status_pending)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Server: Jobs running	Number of running jobs.	Dependent item	nomad.server.job_status.running Preprocessing Prometheus pattern: `VALUE(nomad_nomad_job_status_running)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Server: Job allocations completed	Number of complete allocations for a job.	Dependent item	nomad.server.job_summary.complete Preprocessing Prometheus pattern: `SUM(nomad_nomad_job_summary_complete)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Server: Job allocations failed	Number of failed allocations for a job.	Dependent item	nomad.server.job_summary.failed Preprocessing Prometheus pattern: `SUM(nomad_nomad_job_summary_failed)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Server: Job allocations lost	Number of lost allocations for a job.	Dependent item	nomad.server.job_summary.lost Preprocessing Prometheus pattern: `SUM(nomad_nomad_job_summary_lost)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Server: Job allocations unknown	Number of unknown allocations for a job.	Dependent item	nomad.server.job_summary.unknown Preprocessing Prometheus pattern: `SUM(nomad_nomad_job_summary_unknown)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Server: Job allocations queued	Number of queued allocations for a job.	Dependent item	nomad.server.job_summary.queued Preprocessing Prometheus pattern: `SUM(nomad_nomad_job_summary_queued)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Server: Job allocations running	Number of running allocations for a job.	Dependent item	nomad.server.job_summary.running Preprocessing Prometheus pattern: `SUM(nomad_nomad_job_summary_running)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Server: Job allocations starting	Number of starting allocations for a job.	Dependent item	nomad.server.job_summary.starting Preprocessing Prometheus pattern: `SUM(nomad_nomad_job_summary_starting)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Server: Gossip time	Time elapsed to broadcast gossip messages.	Dependent item	nomad.server.memberlist.gossip Preprocessing Prometheus pattern: `VALUE(nomad_memberlist_gossip_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Leader barrier time	Time elapsed to establish a raft barrier during leader transition.	Dependent item	nomad.server.leader.barrier Preprocessing Prometheus pattern: `VALUE(nomad_nomad_leader_barrier_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Reconcile peer time	Time elapsed to reconcile a serf peer with state store.	Dependent item	nomad.server.leader.reconcile_member Preprocessing Prometheus pattern: `VALUE(nomad_nomad_leader_reconcileMember_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Total reconcile time	Time elapsed to reconcile all serf peers with state store.	Dependent item	nomad.server.leader.reconcile Preprocessing Prometheus pattern: `VALUE(nomad_nomad_leader_reconcile_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Leader last contact	Time since last contact to leader. General indicator of Raft latency.	Dependent item	nomad.server.raft.leader.lastContact Preprocessing Prometheus pattern: `VALUE(nomad_raft_leader_lastContact{quantile="0.99"})` ⛔️Custom on fail: Discard value Replace: `NaN -> 0` Custom multiplier: `0.001`
HashiCorp Nomad Server: Plan queue	Count of evals in the plan queue.	Dependent item	nomad.server.plan.queue_depth Preprocessing Prometheus pattern: `VALUE(nomad_nomad_plan_queue_depth)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Worker evaluation create time	Time elapsed for worker to create an eval.	Dependent item	nomad.server.worker.create_eval Preprocessing Prometheus pattern: `VALUE(nomad_nomad_worker_create_eval_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Worker evaluation dequeue time	Time elapsed for worker to dequeue an eval.	Dependent item	nomad.server.worker.dequeue_eval Preprocessing Prometheus pattern: `VALUE(nomad_nomad_worker_dequeue_eval_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Worker invoke scheduler time	Time elapsed for worker to invoke the scheduler.	Dependent item	nomad.server.worker.invoke_scheduler_service Preprocessing Prometheus pattern: `VALUE(nomad_nomad_worker_invoke_scheduler_service_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Worker acknowledgement send time	Time elapsed for worker to send acknowledgement.	Dependent item	nomad.server.worker.send_ack Preprocessing Prometheus pattern: `VALUE(nomad_nomad_worker_send_ack_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Worker submit plan time	Time elapsed for worker to submit plan.	Dependent item	nomad.server.worker.submit_plan Preprocessing Prometheus pattern: `VALUE(nomad_nomad_worker_submit_plan_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Worker update evaluation time	Time elapsed for worker to submit updated eval.	Dependent item	nomad.server.worker.update_eval Preprocessing Prometheus pattern: `VALUE(nomad_nomad_worker_update_eval_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Worker log replication time	Time the worker waits for the Raft index of the eval to be processed.	Dependent item	nomad.server.worker.wait_for_index Preprocessing Prometheus pattern: `VALUE(nomad_nomad_worker_wait_for_index_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Raft calls blocked, rate	Count of blocking raft API calls.	Dependent item	nomad.server.raft.barrier Preprocessing Prometheus pattern: `VALUE(nomad_raft_barrier)` ⛔️Custom on fail: Discard value Change per second
HashiCorp Nomad Server: Raft commit logs enqueued	Count of logs enqueued.	Dependent item	nomad.server.raft.commit_num_logs Preprocessing Prometheus pattern: `VALUE(nomad_raft_commitNumLogs)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Raft transactions, rate	Number of Raft transactions.	Dependent item	nomad.server.raft.apply Preprocessing Prometheus pattern: `VALUE(nomad_raft_apply)` ⛔️Custom on fail: Set value to: `0` Change per second
HashiCorp Nomad Server: Raft commit time	Time elapsed to commit writes.	Dependent item	nomad.server.raft.commit_time Preprocessing Prometheus pattern: `VALUE(nomad_raft_commitTime_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Raft transaction commit time	Average time of the Raft AppendEntries RPCs that replicate log entries to followers.	Dependent item	nomad.server.raft.replication.appendEntries Preprocessing Prometheus pattern: `AVG(nomad_raft_replication_appendEntries_rpc)` ⛔️Custom on fail: Discard value Custom multiplier: `0.001`
HashiCorp Nomad Server: FSM apply time	Time elapsed to apply write to FSM.	Dependent item	nomad.server.raft.fsm.apply Preprocessing Prometheus pattern: `VALUE(nomad_raft_fsm_apply_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: FSM enqueue time	Time elapsed to enqueue write to FSM.	Dependent item	nomad.server.raft.fsm.enqueue Preprocessing Prometheus pattern: `VALUE(nomad_raft_fsm_enqueue_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: FSM autopilot time	Time elapsed to apply Autopilot raft entry.	Dependent item	nomad.server.raft.fsm.autopilot Preprocessing Prometheus pattern: `VALUE(nomad_nomad_fsm_autopilot_sum)` ⛔️Custom on fail: Set value to: `0` Custom multiplier: `1e-09`
HashiCorp Nomad Server: FSM register node time	Time elapsed to apply RegisterNode raft entry.	Dependent item	nomad.server.raft.fsm.register_node Preprocessing Prometheus pattern: `VALUE(nomad_nomad_fsm_register_node_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: FSM index	Current index applied to FSM.	Dependent item	nomad.server.raft.applied_index Preprocessing Prometheus pattern: `VALUE(nomad_raft_appliedIndex)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Raft last index	Most recent index seen.	Dependent item	nomad.server.raft.last_index Preprocessing Prometheus pattern: `VALUE(nomad_raft_lastIndex)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Dispatch log time	Time elapsed to write log, mark in flight, and start replication.	Dependent item	nomad.server.raft.leader.dispatch_log Preprocessing Prometheus pattern: `VALUE(nomad_raft_leader_dispatchLog_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Logs dispatched	Count of logs dispatched.	Dependent item	nomad.server.raft.leader.dispatch_num_logs Preprocessing Prometheus pattern: `VALUE(nomad_raft_leader_dispatchNumLogs)` ⛔️Custom on fail: Set value to: `0`
HashiCorp Nomad Server: Heartbeat fails	Count of heartbeat timeouts after which the server started an election.	Dependent item	nomad.server.raft.transition.heartbeat_timeout Preprocessing Prometheus pattern: `VALUE(nomad_raft_transition_heartbeat_timeout)` ⛔️Custom on fail: Set value to: `0` Discard unchanged with heartbeat: `1h`
HashiCorp Nomad Server: Objects freed, rate	Count of objects freed from the heap by the Go runtime GC.	Dependent item	nomad.server.runtime.free_count Preprocessing Prometheus pattern: `VALUE(nomad_runtime_free_count)` ⛔️Custom on fail: Discard value Change per second
HashiCorp Nomad Server: GC pause time	Go runtime GC pause times.	Dependent item	nomad.server.runtime.gc_pause_ns Preprocessing Prometheus pattern: `VALUE(nomad_runtime_gc_pause_ns_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: GC metadata size	Go runtime GC metadata size in bytes.	Dependent item	nomad.server.runtime.sys_bytes Preprocessing Prometheus pattern: `VALUE(nomad_runtime_sys_bytes)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: GC runs	Count of Go runtime GC runs.	Dependent item	nomad.server.runtime.total_gc_runs Preprocessing Prometheus pattern: `VALUE(nomad_runtime_total_gc_runs)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Memberlist events	Count of memberlist events received.	Dependent item	nomad.server.serf.queue.event Preprocessing Prometheus pattern: `VALUE(nomad_serf_queue_Event_sum)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Memberlist changes	Count of memberlist changes.	Dependent item	nomad.server.serf.queue.intent Preprocessing Prometheus pattern: `VALUE(nomad_serf_queue_Intent_sum)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Memberlist queries	Count of memberlist queries.	Dependent item	nomad.server.serf.queue.queries Preprocessing Prometheus pattern: `VALUE(nomad_serf_queue_Query_sum)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Snapshot index	Current snapshot index.	Dependent item	nomad.server.state.snapshot.index Preprocessing Prometheus pattern: `VALUE(nomad_state_snapshotIndex)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Services ready to schedule	Count of service evals ready to be scheduled.	Dependent item	nomad.server.broker.service_ready Preprocessing Prometheus pattern: `VALUE(nomad_nomad_broker_service_ready)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Services unacknowledged	Count of unacknowledged service evals.	Dependent item	nomad.server.broker.service_unacked Preprocessing Prometheus pattern: `VALUE(nomad_nomad_broker_service_unacked)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: System evaluations ready to schedule	Count of system evals ready to be scheduled.	Dependent item	nomad.server.broker.system_ready Preprocessing Prometheus pattern: `VALUE(nomad_nomad_broker_system_ready)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: System evaluations unacknowledged	Count of unacknowledged system evals.	Dependent item	nomad.server.broker.system_unacked Preprocessing Prometheus pattern: `VALUE(nomad_nomad_broker_system_unacked)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: BoltDB free pages	Number of BoltDB free pages.	Dependent item	nomad.server.raft.boltdb.num_free_pages Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_numFreePages)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: BoltDB pending pages	Number of BoltDB pending pages.	Dependent item	nomad.server.raft.boltdb.num_pending_pages Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_numPendingPages)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: BoltDB free page bytes	Number of free page bytes.	Dependent item	nomad.server.raft.boltdb.free_page_bytes Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_freePageBytes)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: BoltDB freelist bytes	Number of freelist bytes.	Dependent item	nomad.server.raft.boltdb.freelist_bytes Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_freelistBytes)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: BoltDB read transactions, rate	Count of total read transactions.	Dependent item	nomad.server.raft.boltdb.total_read_txn Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_totalReadTxn)` ⛔️Custom on fail: Discard value Change per second
HashiCorp Nomad Server: BoltDB open read transactions	Number of currently open read transactions.	Dependent item	nomad.server.raft.boltdb.open_read_txn Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_openReadTxn)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: BoltDB pages in use	Number of pages in use.	Dependent item	nomad.server.raft.boltdb.txstats.page_count Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_pageCount)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: BoltDB page allocations, rate	Number of page allocations.	Dependent item	nomad.server.raft.boltdb.txstats.page_alloc Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_pageAlloc)` ⛔️Custom on fail: Discard value Change per second
HashiCorp Nomad Server: BoltDB cursors	Count of total database cursors.	Dependent item	nomad.server.raft.boltdb.txstats.cursor_count Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_cursorCount)` ⛔️Custom on fail: Discard value Change per second
HashiCorp Nomad Server: BoltDB nodes, rate	Count of total database nodes.	Dependent item	nomad.server.raft.boltdb.txstats.node_count Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_nodeCount)` ⛔️Custom on fail: Discard value Change per second
HashiCorp Nomad Server: BoltDB node dereferences, rate	Count of total database node dereferences.	Dependent item	nomad.server.raft.boltdb.txstats.node_deref Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_nodeDeref)` ⛔️Custom on fail: Discard value Change per second
HashiCorp Nomad Server: BoltDB rebalance operations, rate	Count of total rebalance operations.	Dependent item	nomad.server.raft.boltdb.txstats.rebalance Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_rebalance)` ⛔️Custom on fail: Discard value Change per second
HashiCorp Nomad Server: BoltDB split operations, rate	Count of total split operations.	Dependent item	nomad.server.raft.boltdb.txstats.split Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_split)` ⛔️Custom on fail: Discard value Change per second
HashiCorp Nomad Server: BoltDB spill operations, rate	Count of total spill operations.	Dependent item	nomad.server.raft.boltdb.txstats.spill Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_spill)` ⛔️Custom on fail: Discard value Change per second
HashiCorp Nomad Server: BoltDB write operations, rate	Count of total write operations.	Dependent item	nomad.server.raft.boltdb.txstats.write Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_write)` ⛔️Custom on fail: Discard value Change per second
HashiCorp Nomad Server: BoltDB rebalance time	Sample of rebalance operation times.	Dependent item	nomad.server.raft.boltdb.txstats.rebalance_time Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_rebalanceTime_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: BoltDB spill time	Sample of spill operation times.	Dependent item	nomad.server.raft.boltdb.txstats.spill_time Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_spillTime_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: BoltDB write time	Sample of write operation times.	Dependent item	nomad.server.raft.boltdb.txstats.write_time Preprocessing Prometheus pattern: `VALUE(nomad_raft_boltdb_txstats_writeTime_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Service [rpc] state	Current [rpc] service state.	Simple check	net.tcp.service[tcp,,{$NOMAD.SERVER.RPC.PORT}] Preprocessing Discard unchanged with heartbeat: `1h`
HashiCorp Nomad Server: Service [serf] state	Current [serf] service state.	Simple check	net.tcp.service[tcp,,{$NOMAD.SERVER.SERF.PORT}] Preprocessing Discard unchanged with heartbeat: `1h`
HashiCorp Nomad Server: Namespace list time	Time elapsed for Namespace.ListNamespaces.	Dependent item	nomad.server.namespace.list_namespace Preprocessing Prometheus pattern: `VALUE(nomad_nomad_namespace_list_namespace_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Autopilot state	Current autopilot state.	Dependent item	nomad.server.autopilot.state Preprocessing Prometheus pattern: `VALUE(nomad_nomad_autopilot_healthy)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: Autopilot failure tolerance	The number of redundant healthy servers that can fail without causing an outage.	Dependent item	nomad.server.autopilot.failure_tolerance Preprocessing Prometheus pattern: `VALUE(nomad_nomad_autopilot_failure_tolerance)` ⛔️Custom on fail: Discard value
HashiCorp Nomad Server: FSM allocation client update time	Time elapsed to apply AllocClientUpdate raft entry.	Dependent item	nomad.server.alloc_client_update Preprocessing Prometheus pattern: `VALUE(nomad_nomad_fsm_alloc_client_update_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: FSM apply plan results time	Time elapsed to apply ApplyPlanResults raft entry.	Dependent item	nomad.server.fsm.apply_plan_results Preprocessing Prometheus pattern: `VALUE(nomad_nomad_fsm_apply_plan_results_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: FSM update evaluation time	Time elapsed to apply UpdateEval raft entry.	Dependent item	nomad.server.fsm.update_eval Preprocessing Prometheus pattern: `VALUE(nomad_nomad_fsm_update_eval_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: FSM job registration time	Time elapsed to apply RegisterJob raft entry.	Dependent item	nomad.server.fsm.register_job Preprocessing Prometheus pattern: `VALUE(nomad_nomad_fsm_register_job_sum)` ⛔️Custom on fail: Discard value Custom multiplier: `1e-09`
HashiCorp Nomad Server: Allocation reschedule attempts	Count of attempts to reschedule an allocation.	Dependent item	nomad.server.scheduler.allocs.rescheduled.attempted Preprocessing Prometheus pattern: `SUM(nomad_scheduler_allocs_reschedule_attempted)` ⛔️Custom on fail: Set value to: `0`
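Most dependent items above share one preprocessing chain: a Prometheus pattern selects a single sample from the server's metrics output, an optional Replace step maps `NaN` to `0`, and a custom multiplier (`0.001` or `1e-09` in this template) rescales the result. The Python sketch below imitates that chain for illustration only. It is not Zabbix's implementation; the helper names `prometheus_value` and `preprocess` are hypothetical, the parser handles only simple `name{label="value"} number` lines, and the sample metrics text is made up.

```python
import math
import re
from typing import Dict, Optional

# Hypothetical minimal parser for lines of the Prometheus text format such as
#   nomad_raft_leader_lastContact{quantile="0.99"} 12.5
# It ignores escaping, timestamps, and other edge cases of the real format.
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def prometheus_value(text: str, name: str, labels: Optional[Dict[str, str]] = None) -> float:
    """Return the first sample matching name and labels, roughly like VALUE(name{...})."""
    wanted = labels or {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = LINE_RE.match(line)
        if not m or m.group("name") != name:
            continue
        found = dict(LABEL_RE.findall(m.group("labels") or ""))
        if all(found.get(k) == v for k, v in wanted.items()):
            return float(m.group("value"))  # float() also accepts "NaN" and "+Inf"
    # A failed step; with "Custom on fail: Discard value" Zabbix drops the value.
    raise LookupError(f"no sample for {name} {wanted}")


def preprocess(text: str, name: str, labels: Optional[Dict[str, str]] = None,
               nan_to_zero: bool = False, multiplier: float = 1.0) -> float:
    """Prometheus pattern -> optional "NaN -> 0" replace -> custom multiplier."""
    value = prometheus_value(text, name, labels)
    if nan_to_zero and math.isnan(value):
        value = 0.0
    return value * multiplier


# Made-up sample output, for illustration only.
sample = """\
# TYPE nomad_raft_leader_lastContact summary
nomad_raft_leader_lastContact{quantile="0.5"} 10
nomad_raft_leader_lastContact{quantile="0.99"} NaN
nomad_nomad_worker_send_ack_sum 2500000
"""

# Mirrors "Leader last contact": quantile="0.99", NaN -> 0, multiplier 0.001.
last_contact = preprocess(sample, "nomad_raft_leader_lastContact",
                          {"quantile": "0.99"}, nan_to_zero=True, multiplier=0.001)
# Mirrors a "_sum" timing item such as "Worker acknowledgement send time".
send_ack = preprocess(sample, "nomad_nomad_worker_send_ack_sum", multiplier=1e-09)
```

Because the 0.99 quantile in the sample is `NaN`, the Replace step turns `last_contact` into `0.0` instead of letting the multiplier propagate `NaN`.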

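Items whose preprocessing ends with "Change per second" (for example `Raft transactions, rate` and the BoltDB `..., rate` items) turn a steadily growing counter into a rate: (current value − previous value) / (current time − previous time). The sketch below assumes the behaviour described in the Zabbix preprocessing documentation, where the first sample produces nothing and a decrease (such as a counter reset after a Nomad restart) stores nothing. The class name `ChangePerSecond` is hypothetical, and the zero-interval guard is an addition of this sketch.

```python
from typing import Optional, Tuple


class ChangePerSecond:
    """Hypothetical stand-in for the Zabbix "Change per second" preprocessing step."""

    def __init__(self) -> None:
        self.prev: Optional[Tuple[float, float]] = None  # (timestamp, value)

    def feed(self, ts: float, value: float) -> Optional[float]:
        """Return the per-second rate since the previous sample, or None."""
        prev, self.prev = self.prev, (ts, value)
        if prev is None:
            return None  # first sample: nothing to compare against yet
        prev_ts, prev_value = prev
        if value < prev_value:
            return None  # counter went backwards (e.g. Nomad restart): store nothing
        if ts <= prev_ts:
            return None  # guard against a zero or negative interval in this sketch
        return (value - prev_value) / (ts - prev_ts)


rate = ChangePerSecond()
print(rate.feed(0, 100))    # None
print(rate.feed(60, 160))   # 1.0
print(rate.feed(120, 40))   # None
print(rate.feed(180, 100))  # 1.0
```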
Triggers

Name	Description	Expression	Severity	Dependencies and additional info
HashiCorp Nomad Server: Monitoring API connection has failed	Monitoring API connection has failed. Ensure that Nomad API URL and the necessary permissions have been defined correctly, check the service state and network connectivity between Nomad and Zabbix.	`find(/HashiCorp Nomad Server by HTTP/nomad.server.data.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0`	Average	Manual close: Yes
HashiCorp Nomad Server: Internal stats API connection has failed	Internal stats API connection has failed. Ensure that Nomad API URL and the necessary permissions have been defined correctly, check the service state and network connectivity between Nomad and Zabbix.	`find(/HashiCorp Nomad Server by HTTP/nomad.server.stats.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0`	Average	Manual close: Yes Depends on: HashiCorp Nomad Server: Monitoring API connection has failed
HashiCorp Nomad Server: Nomad server version has changed	Nomad server version has changed.	`change(/HashiCorp Nomad Server by HTTP/nomad.server.version)<>0`	Info	Manual close: Yes
HashiCorp Nomad Server: Cluster role has changed	Cluster role has changed.	`change(/HashiCorp Nomad Server by HTTP/nomad.server.raft.cluster_role) <> 0`	Info	Manual close: Yes
HashiCorp Nomad Server: Current number of open files is too high	Heavy file descriptor usage (i.e., near the process file descriptor limit) indicates a potential file descriptor exhaustion issue.	`min(/HashiCorp Nomad Server by HTTP/nomad.server.process_open_fds,5m)/last(/HashiCorp Nomad Server by HTTP/nomad.server.process_max_fds)*100>{$NOMAD.OPEN.FDS.MAX}`	Warning
HashiCorp Nomad Server: Dead jobs found	Jobs in the `Dead` state have been discovered. Check the {$NOMAD.SERVER.API.SCHEME}://{HOST.IP}:{$NOMAD.SERVER.API.PORT}/v1/jobs URL for details.	`last(/HashiCorp Nomad Server by HTTP/nomad.server.job_status.dead) > 0 and nodata(/HashiCorp Nomad Server by HTTP/nomad.server.job_status.dead,5m) = 0`	Warning	Manual close: Yes
HashiCorp Nomad Server: Leader last contact timeout exceeded	The nomad.raft.leader.lastContact metric is a general indicator of Raft latency that can be used to observe how Raft timing is performing and guide infrastructure provisioning. If this number trends upwards, look at CPU, disk IOPS, and network latency. nomad.raft.leader.lastContact should not get too close to the leader lease timeout of 500ms.	`min(/HashiCorp Nomad Server by HTTP/nomad.server.raft.leader.lastContact,5m) >= {$NOMAD.SERVER.LEADER.LATENCY} and nodata(/HashiCorp Nomad Server by HTTP/nomad.server.raft.leader.lastContact,5m) = 0`	Warning
HashiCorp Nomad Server: Service [rpc] is down	Cannot establish the connection to [rpc] service port {$NOMAD.SERVER.RPC.PORT}. Check the Nomad state and network connectivity between Nomad and Zabbix.	`last(/HashiCorp Nomad Server by HTTP/net.tcp.service[tcp,,{$NOMAD.SERVER.RPC.PORT}]) = 0`	Average	Manual close: Yes
HashiCorp Nomad Server: Service [serf] is down	Cannot establish the connection to [serf] service port {$NOMAD.SERVER.SERF.PORT}. Check the Nomad state and network connectivity between Nomad and Zabbix.	`last(/HashiCorp Nomad Server by HTTP/net.tcp.service[tcp,,{$NOMAD.SERVER.SERF.PORT}]) = 0`	Average	Manual close: Yes
HashiCorp Nomad Server: Autopilot is unhealthy	Autopilot is in an unhealthy state. The probability of a successful failover is extremely low.	`last(/HashiCorp Nomad Server by HTTP/nomad.server.autopilot.state) = 0 and nodata(/HashiCorp Nomad Server by HTTP/nomad.server.autopilot.state,5m) = 0`	Average	Manual close: Yes
HashiCorp Nomad Server: Autopilot redundancy is low	Autopilot redundancy is low. The cluster is at high risk of an outage if one more server fails.	`last(/HashiCorp Nomad Server by HTTP/nomad.server.autopilot.failure_tolerance) < {$NOMAD.REDUNDANCY.MIN} and nodata(/HashiCorp Nomad Server by HTTP/nomad.server.autopilot.failure_tolerance,5m) = 0`	Warning	Manual close: Yes
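The "Current number of open files is too high" trigger uses `min()` over 5 minutes rather than `last()`, so it fires only when descriptor usage stayed above the threshold for the whole window; a single dip below it keeps the trigger quiet. A Python sketch of that arithmetic follows. The function name `open_fds_problem` is hypothetical, and the threshold of 90 in the example is an illustrative value, not the template's default for {$NOMAD.OPEN.FDS.MAX}.

```python
from typing import Sequence


def open_fds_problem(open_fds_5m: Sequence[float], max_fds_last: float,
                     threshold_pct: float) -> bool:
    """Evaluate min(open_fds,5m) / last(max_fds) * 100 > threshold_pct.

    open_fds_5m:   nomad.server.process_open_fds samples from the last 5 minutes
    max_fds_last:  latest nomad.server.process_max_fds value
    threshold_pct: stands in for {$NOMAD.OPEN.FDS.MAX}
    """
    if not open_fds_5m or max_fds_last <= 0:
        return False  # no usable data: this sketch reports "no problem"
    return min(open_fds_5m) / max_fds_last * 100 > threshold_pct


# The threshold of 90 below is a hypothetical value, not the template default.
print(open_fds_problem([950, 960, 940], 1024, 90))  # True: 940/1024 is about 91.8%
print(open_fds_problem([950, 500, 960], 1024, 90))  # False: one dip to 500 clears it
```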

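The "Autopilot redundancy is low" trigger compares `nomad.server.autopilot.failure_tolerance` with {$NOMAD.REDUNDANCY.MIN}. The value follows from Raft quorum arithmetic: a cluster of n voting servers needs floor(n/2) + 1 of them to keep quorum, so it can lose only the healthy voters above that number. The sketch assumes Nomad's autopilot derives failure tolerance as healthy voters minus quorum size, clamped at zero, as HashiCorp's autopilot library appears to do; treat it as an approximation, not Nomad's exact code. Both function names are hypothetical.

```python
def quorum_size(voters: int) -> int:
    """Smallest majority of a Raft cluster with the given number of voting servers."""
    return voters // 2 + 1


def failure_tolerance(voters: int, healthy_voters: int) -> int:
    """Healthy voters that could still fail before quorum is lost (never below 0).

    Assumption: mirrors how HashiCorp's autopilot derives failure tolerance;
    treat it as an approximation rather than Nomad's exact code.
    """
    return max(0, healthy_voters - quorum_size(voters))


for n in (1, 3, 5):
    print(n, "servers ->", failure_tolerance(n, n))  # 0, 1, 2
print(failure_tolerance(5, 4))  # 1: one server already unhealthy
```

Under this arithmetic, a healthy three-server cluster has a tolerance of 1, and losing one server drops it to 0; the trigger fires whenever the tolerance falls below the configured minimum.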
Feedback

Please report any issues with the template at https://support.zabbix.com

You can also provide feedback, discuss the template, or ask for help at ZABBIX forums
