HashiCorp Consul Node by HTTP
Overview
This template monitors HashiCorp Consul with Zabbix and works without any external scripts.
Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.
Do not forget to enable the Prometheus format for metrics export.
See documentation.
More information about the metrics can be found in the official documentation.
The template HashiCorp Consul Node by HTTP collects metrics by HTTP agent from the /v1/agent/metrics endpoint.
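
As a rough illustration of the bulk-collection idea, the sketch below parses one Prometheus-format payload and then reads several values from it, similar in spirit to one master item feeding many dependent items. The sample payload and the helper name `parse_metrics` are made up for illustration; this is not the template's actual preprocessing.

```python
import re

# Hypothetical sample of Prometheus text output; real payloads are much larger.
SAMPLE = """\
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 42
process_max_fds 1024
consul_runtime_gc_pause_ns{quantile="0.5"} 51200
consul_runtime_gc_pause_ns{quantile="0.9"} 98304
"""

# metric name, optional {labels}, then the sample value
LINE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def parse_metrics(text):
    """Map 'name{labels}' to a float for every sample line in the payload."""
    values = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match:
            name, labels, value = match.groups()
            values[name + (labels or "")] = float(value)
    return values


# One parse, many lookups: each lookup plays the role of a dependent item.
metrics = parse_metrics(SAMPLE)
print(metrics["process_open_fds"], metrics["process_max_fds"])
```

The regular expression is deliberately simple and would not handle label values that contain a closing brace.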
Requirements
Zabbix version: 7.0 and higher.
Tested versions
This template has been tested on:
- HashiCorp Consul 1.10.0
Configuration
Zabbix should be configured according to the instructions in the Templates out of the box section.
Setup
Internal service metrics are collected from the /v1/agent/metrics endpoint. Do not forget to enable the Prometheus format for metrics export. See documentation. The template requires authorization via an API token.
Don't forget to change the macros {$CONSUL.NODE.API.URL} and {$CONSUL.TOKEN}.
Also, see the Macros section for a list of macros used to set trigger values.
More information about the metrics can be found in the official documentation.
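
To check the URL and token before assigning them to the macros, a request like the one below can be built by hand. This is a minimal sketch: it assumes the token is sent in the `X-Consul-Token` header and that the Prometheus format is requested with the `format=prometheus` query parameter; the helper name and the token value are placeholders, and how the template itself passes the token is not shown here.

```python
import urllib.request


def build_metrics_request(api_url, token):
    """Build (but do not send) a request for agent metrics in Prometheus format."""
    url = api_url.rstrip("/") + "/v1/agent/metrics?format=prometheus"
    # Assumption: the ACL token is passed in the X-Consul-Token header.
    return urllib.request.Request(url, headers={"X-Consul-Token": token})


# "example-token" is a placeholder, not a real token.
request = build_metrics_request("http://localhost:8500", "example-token")
print(request.full_url)

# To actually fetch the payload from a running agent:
# body = urllib.request.urlopen(request, timeout=5).read().decode()
```

If the agent answers with an error about the Prometheus format, the export is most likely not enabled in the Consul telemetry configuration.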
This template supports Consul namespaces. You can set the macros {$CONSUL.LLD.FILTER.SERVICE_NAMESPACE.MATCHES} and {$CONSUL.LLD.FILTER.SERVICE_NAMESPACE.NOT_MATCHES} if you want to filter discovered services by namespace.
In the Open Source version, the service namespace is set to 'None'.
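
The effect of a MATCHES / NOT_MATCHES macro pair can be approximated as follows. This is only a sketch: it uses Python's `re` module rather than the regular expression engine Zabbix uses, and the function name and sample namespaces are hypothetical.

```python
import re


def is_discovered(value, matches=".*", not_matches="CHANGE IF NEEDED"):
    """Keep a value only if it matches `matches` and does not match `not_matches`."""
    return bool(re.search(matches, value)) and not re.search(not_matches, value)


# Hypothetical namespaces; the Open Source version would report 'None'.
namespaces = ["default", "team-a", "team-b-staging"]
kept = [ns for ns in namespaces
        if is_discovered(ns, matches="^team-", not_matches="staging$")]
print(kept)
```

With the default macro values every namespace passes, because `.*` matches anything and `CHANGE IF NEEDED` is unlikely to appear in a real namespace name.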
NOTE. Some metrics may not be collected depending on your HashiCorp Consul instance version and configuration.
NOTE. You may also be interested in the Envoy Proxy by HTTP template.
Macros used
Name | Description | Default |
---|---|---|
{$CONSUL.NODE.API.URL} | Consul instance URL. | http://localhost:8500 |
{$CONSUL.TOKEN} | Consul auth token. | <PUT YOUR AUTH TOKEN> |
{$CONSUL.OPEN.FDS.MAX.WARN} | Maximum percentage of used file descriptors. | 90 |
{$CONSUL.LLD.FILTER.LOCAL_SERVICE_NAME.MATCHES} | Filter for discoverable services on the local node. | .* |
{$CONSUL.LLD.FILTER.LOCAL_SERVICE_NAME.NOT_MATCHES} | Filter to exclude discovered services on the local node. | CHANGE IF NEEDED |
{$CONSUL.LLD.FILTER.SERVICE_NAMESPACE.MATCHES} | Filter for discoverable services by namespace on the local node. Enterprise only; in the Open Source version, the namespace is set to 'None'. | .* |
{$CONSUL.LLD.FILTER.SERVICE_NAMESPACE.NOT_MATCHES} | Filter to exclude discovered services by namespace on the local node. Enterprise only; in the Open Source version, the namespace is set to 'None'. | CHANGE IF NEEDED |
{$CONSUL.NODE.HEALTH_SCORE.MAX.WARN} | Maximum acceptable value of node's health score for WARNING trigger expression. | 2 |
{$CONSUL.NODE.HEALTH_SCORE.MAX.HIGH} | Maximum acceptable value of node's health score for AVERAGE trigger expression. | 4 |
Items
Name | Description | Type | Key and additional info |
---|---|---|---|
Get instance metrics | Get raw metrics from the Consul instance /metrics endpoint. | HTTP agent | consul.get_metrics |
Get node info | Get configuration and member information of the local agent. | HTTP agent | consul.get_node_info |
Role | Role of the current Consul agent. | Dependent item | consul.role |
Version | Version of the Consul agent. | Dependent item | consul.version |
Number of services | Number of services on the current node. | Dependent item | consul.services_number |
Number of checks | Number of checks on the current node. | Dependent item | consul.checks_number |
Number of check monitors | Number of check monitors on the current node. | Dependent item | consul.check_monitors_number |
Process CPU seconds, total | Total user and system CPU time spent in seconds. | Dependent item | consul.cpu_seconds_total.rate |
Virtual memory size | Virtual memory size in bytes. | Dependent item | consul.virtual_memory_bytes |
RSS memory usage | Resident memory size in bytes. | Dependent item | consul.resident_memory_bytes |
Goroutine count | The number of goroutines on the Consul instance. | Dependent item | consul.goroutines |
Open file descriptors | Number of open file descriptors. | Dependent item | consul.process_open_fds |
Open file descriptors, max | Maximum number of open file descriptors. | Dependent item | consul.process_max_fds |
Client RPC, per second | Number of times per second that a Consul agent in client mode makes an RPC request to a Consul server. This gives a measure of how much a given agent is loading the Consul servers. This is only generated by agents in client mode, not Consul servers. | Dependent item | consul.client_rpc |
Client RPC failed, per second | Number of times per second that a Consul agent in client mode makes an RPC request to a Consul server and fails. | Dependent item | consul.client_rpc_failed |
TCP connections, accepted per second | This metric counts the number of times a Consul agent has accepted an incoming TCP stream connection per second. | Dependent item | consul.memberlist.tcp_accept |
TCP connections, per second | This metric counts the number of times a Consul agent has initiated a push/pull sync with another agent per second. | Dependent item | consul.memberlist.tcp_connect |
TCP send bytes, per second | This metric measures the total number of bytes sent by a Consul agent through the TCP protocol per second. | Dependent item | consul.memberlist.tcp_sent |
UDP received bytes, per second | This metric measures the total number of bytes received by a Consul agent through the UDP protocol per second. | Dependent item | consul.memberlist.udp_received |
UDP sent bytes, per second | This metric measures the total number of bytes sent by a Consul agent through the UDP protocol per second. | Dependent item | consul.memberlist.udp_sent |
GC pause, p90 | The 90th percentile of the time consumed by stop-the-world garbage collection (GC) pauses since Consul started, in milliseconds. | Dependent item | consul.gc_pause.p90 |
GC pause, p50 | The 50th percentile (median) of the time consumed by stop-the-world garbage collection (GC) pauses since Consul started, in milliseconds. | Dependent item | consul.gc_pause.p50 |
Memberlist: degraded | This metric counts the number of times the Consul agent has performed failure detection on another agent at a slower probe rate. The agent uses its own health metric as an indicator to perform this action. If its health score is low, it means that the node is healthy, and vice versa. | Dependent item | consul.memberlist.degraded |
Memberlist: health score | This metric describes a node's perception of its own health based on how well it is meeting the soft real-time requirements of the protocol. This metric ranges from 0 to 8, where 0 indicates "totally healthy". | Dependent item | consul.memberlist.health_score |
Memberlist: gossip, p90 | The 90th percentile of the number of gossip messages broadcast to a set of randomly selected nodes. | Dependent item | consul.memberlist.dispatch_log.p90 |
Memberlist: gossip, p50 | The 50th percentile (median) of the number of gossip messages broadcast to a set of randomly selected nodes. | Dependent item | consul.memberlist.gossip.p50 |
Memberlist: msg alive | This metric counts the number of alive Consul agents that the agent has mapped out so far, based on the message information given by the network layer. | Dependent item | consul.memberlist.msg.alive |
Memberlist: msg dead | This metric counts the number of times a Consul agent has marked another agent to be a dead node. | Dependent item | consul.memberlist.msg.dead |
Memberlist: msg suspect | The number of times a Consul agent suspects another as failed while probing during gossip protocol. | Dependent item | consul.memberlist.msg.suspect |
Memberlist: probe node, p90 | The 90th percentile of the time taken to perform a single round of failure detection on a select Consul agent. | Dependent item | consul.memberlist.probe_node.p90 |
Memberlist: probe node, p50 | The 50th percentile (median) of the time taken to perform a single round of failure detection on a select Consul agent. | Dependent item | consul.memberlist.probe_node.p50 |
Memberlist: push pull node, p90 | The 90th percentile of the number of Consul agents that have exchanged state with this agent. | Dependent item | consul.memberlist.push_pull_node.p90 |
Memberlist: push pull node, p50 | The 50th percentile (median) of the number of Consul agents that have exchanged state with this agent. | Dependent item | consul.memberlist.push_pull_node.p50 |
KV store: apply, p90 | The 90th percentile of the time it takes to complete an update to the KV store. | Dependent item | consul.kvs.apply.p90 |
KV store: apply, p50 | The 50th percentile (median) of the time it takes to complete an update to the KV store. | Dependent item | consul.kvs.apply.p50 |
KV store: apply, rate | The number of updates to the KV store per second. | Dependent item | consul.kvs.apply.rate |
Serf member: flap, rate | Increments when an agent is marked dead and then recovers within a short time period. This can be an indicator of overloaded agents, network problems, or configuration errors where agents cannot connect to each other on the required ports. Shown as events per second. | Dependent item | consul.serf.member.flap.rate |
Serf member: failed, rate | Increments when an agent is marked dead. This can be an indicator of overloaded agents, network problems, or configuration errors where agents cannot connect to each other on the required ports. Shown as events per second. | Dependent item | consul.serf.member.failed.rate |
Serf member: join, rate | Increments when an agent joins the cluster. If an agent flapped or failed, this counter also increments when it re-joins. Shown as events per second. | Dependent item | consul.serf.member.join.rate |
Serf member: left, rate | Increments when an agent leaves the cluster. Shown as events per second. | Dependent item | consul.serf.member.left.rate |
Serf member: update, rate | Increments when a Consul agent updates. Shown as events per second. | Dependent item | consul.serf.member.update.rate |
ACL: resolves, rate | The number of ACL resolves per second. | Dependent item | consul.acl.resolves.rate |
Catalog: register, rate | The number of catalog register operations per second. | Dependent item | consul.catalog.register.rate |
Catalog: deregister, rate | The number of catalog deregister operations per second. | Dependent item | consul.catalog.deregister.rate |
Snapshot: append line, p90 | The 90th percentile of the time taken by the Consul agent to append an entry into the existing log. | Dependent item | consul.snapshot.append_line.p90 |
Snapshot: append line, p50 | The 50th percentile (median) of the time taken by the Consul agent to append an entry into the existing log. | Dependent item | consul.snapshot.append_line.p50 |
Snapshot: append line, rate | The number of snapshot appendLine operations per second. | Dependent item | consul.snapshot.append_line.rate |
Snapshot: compact, p90 | The 90th percentile of the time taken by the Consul agent to compact a log. This operation occurs only when the snapshot becomes large enough to justify the compaction. | Dependent item | consul.snapshot.compact.p90 |
Snapshot: compact, p50 | The 50th percentile (median) of the time taken by the Consul agent to compact a log. This operation occurs only when the snapshot becomes large enough to justify the compaction. | Dependent item | consul.snapshot.compact.p50 |
Snapshot: compact, rate | The number of snapshot compact operations per second. | Dependent item | consul.snapshot.compact.rate |
Get local services | Get all the services that are registered with the local agent and their status. | Script | consul.get_local_services |
Get local services check | Data collection check. | Dependent item | consul.get_local_services.check |
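
Many of the items above are reported "per second" or as a "rate". Assuming these are derived as a simple per-second change between two consecutive samples of a growing counter, the arithmetic can be sketched as below. The function name and the sample numbers are illustrative only.

```python
def change_per_second(prev_value, prev_time, curr_value, curr_time):
    """Per-second rate between two samples of a monotonically increasing counter."""
    elapsed = curr_time - prev_time
    if elapsed <= 0 or curr_value < prev_value:
        # No usable rate: the clock did not advance or the counter was reset.
        return None
    return (curr_value - prev_value) / elapsed


# A counter that grew from 1200 to 1500 over 60 seconds.
print(change_per_second(1200, 0, 1500, 60))
```

Returning `None` on a counter reset (for example after an agent restart) avoids reporting a misleading negative rate.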
Triggers
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
HashiCorp Consul Node: Version has been changed | Consul version has changed. Acknowledge to close the problem manually. | last(/HashiCorp Consul Node by HTTP/consul.version,#1)<>last(/HashiCorp Consul Node by HTTP/consul.version,#2) and length(last(/HashiCorp Consul Node by HTTP/consul.version))>0 | Info | Manual close: Yes |
HashiCorp Consul Node: Current number of open files is too high | "Heavy file descriptor usage (i.e., near the process’s file descriptor limit) indicates a potential file descriptor exhaustion issue." | min(/HashiCorp Consul Node by HTTP/consul.process_open_fds,5m)/last(/HashiCorp Consul Node by HTTP/consul.process_max_fds)*100>{$CONSUL.OPEN.FDS.MAX.WARN} | Warning | |
HashiCorp Consul Node: Node's health score is warning | This metric ranges from 0 to 8, where 0 indicates "totally healthy". | max(/HashiCorp Consul Node by HTTP/consul.memberlist.health_score,#3)>{$CONSUL.NODE.HEALTH_SCORE.MAX.WARN} | Warning | Depends on: |
HashiCorp Consul Node: Node's health score is critical | This metric ranges from 0 to 8, where 0 indicates "totally healthy". | max(/HashiCorp Consul Node by HTTP/consul.memberlist.health_score,#3)>{$CONSUL.NODE.HEALTH_SCORE.MAX.HIGH} | Average | |
HashiCorp Consul Node: Failed to get local services | Failed to get local services. Check debug log for more information. | length(last(/HashiCorp Consul Node by HTTP/consul.get_local_services.check))>0 | Warning | |
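
The arithmetic behind the file descriptor and health score expressions can be mirrored in a few lines. This is a simplified sketch with hypothetical function names and sample values; it ignores Zabbix specifics such as unsupported items or missing data.

```python
def open_fds_too_high(open_fds_last_5m, max_fds, warn_percent=90):
    """Mirror of min(open_fds,5m) / last(max_fds) * 100 > threshold."""
    return min(open_fds_last_5m) / max_fds * 100 > warn_percent


def health_score_exceeds(scores, threshold):
    """Mirror of max(health_score,#3) > threshold over the latest three values."""
    return max(scores[-3:]) > threshold


# Lowest of the samples is 940 of 1024 descriptors, about 91.8 percent.
print(open_fds_too_high([950, 940, 960], 1024))
# Latest three scores peak at 3: above the default WARN (2), below HIGH (4).
print(health_score_exceeds([0, 3, 1], 2), health_score_exceeds([0, 3, 1], 4))
```

Because the first expression uses `min` over five minutes, usage has to stay above the threshold for the whole window before the trigger fires. The health score expression uses `max` over the last three values, so a single high sample is enough.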
LLD rule Local node services discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Local node services discovery | Discover metrics for services that are registered with the local agent. | Dependent item | consul.node_services_lld |
Item prototypes for Local node services discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
["{#SERVICE_NAME}"]: Aggregated status | Aggregated values of all health checks for the service instance. | Dependent item | consul.service.aggregated_state["{#SERVICE_ID}"] |
["{#SERVICE_NAME}"]: Check ["{#SERVICE_CHECK_NAME}"]: Status | Current state of health check for the service. | Dependent item | consul.service.check.state["{#SERVICE_ID}/{#SERVICE_CHECK_ID}"] |
["{#SERVICE_NAME}"]: Check ["{#SERVICE_CHECK_NAME}"]: Output | Current output of health check for the service. | Dependent item | consul.service.check.output["{#SERVICE_ID}/{#SERVICE_CHECK_ID}"] |
Trigger prototypes for Local node services discovery
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
HashiCorp Consul Node: Aggregated status is 'warning' | Aggregated state of service on the local agent is 'warning'. | last(/HashiCorp Consul Node by HTTP/consul.service.aggregated_state["{#SERVICE_ID}"]) = 1 | Warning | |
HashiCorp Consul Node: Aggregated status is 'critical' | Aggregated state of service on the local agent is 'critical'. | last(/HashiCorp Consul Node by HTTP/consul.service.aggregated_state["{#SERVICE_ID}"]) = 2 | Average | |
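
One way to picture the aggregated status is "the worst check wins". The sketch below assumes that rule and a numeric mapping in which 1 and 2 follow the trigger expressions above, while 0 for 'passing' is an assumption made here; the template's actual mapping and aggregation logic may differ.

```python
# 1 and 2 follow the trigger prototypes; 0 for 'passing' is an assumption.
STATUS_CODE = {"passing": 0, "warning": 1, "critical": 2}


def aggregated_state(check_statuses):
    """Return the worst status code among a service's health checks."""
    return max((STATUS_CODE[status] for status in check_statuses), default=0)


# A service with one passing and one warning check aggregates to 'warning'.
print(aggregated_state(["passing", "warning"]))
```

A service with no checks is treated as passing in this sketch, which is also an assumption.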
LLD rule HTTP API methods discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
HTTP API methods discovery | Discover metrics specific to HTTP API methods. | Dependent item | consul.http_api_discovery |
Item prototypes for HTTP API methods discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
HTTP request: ["{#HTTP_METHOD}"], p90 | The 90th percentile of how long it takes to service the given HTTP request for the given verb. | Dependent item | consul.http.api.p90["{#HTTP_METHOD}"] |
HTTP request: ["{#HTTP_METHOD}"], p50 | The 50th percentile (median) of how long it takes to service the given HTTP request for the given verb. | Dependent item | consul.http.api.p50["{#HTTP_METHOD}"] |
HTTP request: ["{#HTTP_METHOD}"], rate | The number of HTTP requests for the given verb per second. | Dependent item | consul.http.api.rate["{#HTTP_METHOD}"] |
LLD rule Raft server metrics discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Raft server metrics discovery | Discover raft metrics for server nodes. | Dependent item | consul.raft.server.discovery |
Item prototypes for Raft server metrics discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Raft state | Current state of Consul agent. | Dependent item | consul.raft.state[{#SINGLETON}] |
Raft state: leader | Increments when a server becomes a leader. | Dependent item | consul.raft.state_leader[{#SINGLETON}] |
Raft state: candidate | The number of initiated leader elections. | Dependent item | consul.raft.state_candidate[{#SINGLETON}] |
Raft: apply, rate | Incremented whenever a leader first passes a message into the Raft commit process (called an Apply operation). This metric describes the arrival rate of new logs into Raft per second. | Dependent item | consul.raft.apply.rate[{#SINGLETON}] |
LLD rule Raft leader metrics discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Raft leader metrics discovery | Discover raft metrics for leader nodes. | Dependent item | consul.raft.leader.discovery |
Item prototypes for Raft leader metrics discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Raft state: leader last contact, p90 | The 90th percentile of how long it takes a leader node to communicate with followers during a leader lease check, in milliseconds. | Dependent item | consul.raft.leader_last_contact.p90[{#SINGLETON}] |
Raft state: leader last contact, p50 | The 50th percentile (median) of how long it takes a leader node to communicate with followers during a leader lease check, in milliseconds. | Dependent item | consul.raft.leader_last_contact.p50[{#SINGLETON}] |
Raft state: commit time, p90 | The 90th percentile of the time it takes to commit a new entry to the raft log on the leader, in milliseconds. | Dependent item | consul.raft.commit_time.p90[{#SINGLETON}] |
Raft state: commit time, p50 | The 50th percentile (median) of the time it takes to commit a new entry to the raft log on the leader, in milliseconds. | Dependent item | consul.raft.commit_time.p50[{#SINGLETON}] |
Raft state: dispatch log, p90 | The 90th percentile of the time it takes for the leader to write log entries to disk, in milliseconds. | Dependent item | consul.raft.dispatch_log.p90[{#SINGLETON}] |
Raft state: dispatch log, p50 | The 50th percentile (median) of the time it takes for the leader to write log entries to disk, in milliseconds. | Dependent item | consul.raft.dispatch_log.p50[{#SINGLETON}] |
Raft state: dispatch log, rate | The number of times a Raft leader writes a log to disk per second. | Dependent item | consul.raft.dispatch_log.rate[{#SINGLETON}] |
Raft state: commit, rate | The number of new entries committed to the Raft log on the leader per second. | Dependent item | consul.raft.commit_time.rate[{#SINGLETON}] |
Autopilot healthy | Tracks the overall health of the local server cluster. 1 if all servers are healthy, 0 if one or more are unhealthy. | Dependent item | consul.autopilot.healthy[{#SINGLETON}] |
Feedback
Please report any issues with the template at https://support.zabbix.com
You can also provide feedback, discuss the template, or ask for help on the Zabbix forums.