etcd

etcd is an open source distributed key-value store used to hold and manage the critical information that distributed systems need to keep running. Most notably, it manages the configuration data, state data, and metadata for Kubernetes, the popular container orchestration platform.

This template is for Zabbix version: 6.2
Also available for: 6.0 5.4 5.0

Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/etcd_http?at=release/6.2

Etcd by HTTP

Overview

For Zabbix version: 6.2 and higher. This template monitors etcd with Zabbix and works without any external scripts. Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.

The Etcd by HTTP template collects metrics from the /metrics endpoint using the HTTP agent.

Refer to the vendor documentation.

For users of etcd version <= 3.4:

In etcd v3.5, some metrics have been deprecated. For more details, see Upgrade etcd from 3.4 to 3.5. Please upgrade your etcd instance or use an older version of the Etcd by HTTP template.

This template has been tested on:

  • Etcd, version 3.5.6

Setup

See Zabbix template operation for basic instructions.

Follow these instructions:

  1. Import the template into Zabbix.
  2. After importing the template, make sure that etcd allows the collection of metrics. You can test it by running: curl -L http://localhost:2379/metrics.
  3. Check if etcd is accessible from Zabbix proxy or Zabbix server depending on where you are planning to do the monitoring. To verify it, run curl -L http://<etcd_node_address>:2379/metrics.
  4. Add the template to each etcd node. By default, the template uses the client port. You can configure the metrics endpoint location with the --listen-metrics-urls flag (for more details, see the etcd documentation).
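
The reachability checks in steps 2 and 3 can also be scripted. The following is a minimal Python sketch, not part of the template: the helper names (metrics_url, looks_like_etcd_metrics, check_endpoint) are hypothetical, and it assumes that the endpoint answers in the Prometheus text format with metric names starting with etcd_.

```python
import urllib.request


def metrics_url(host, port=2379, scheme="http"):
    """Build the same URL that the curl commands above request."""
    return f"{scheme}://{host}:{port}/metrics"


def looks_like_etcd_metrics(body):
    """Return True if the text contains at least one etcd_* sample line."""
    for line in body.splitlines():
        line = line.strip()
        # Skip blank lines and Prometheus comment lines (# HELP / # TYPE).
        if not line or line.startswith("#"):
            continue
        if line.startswith("etcd_"):
            return True
    return False


def check_endpoint(host, port=2379, scheme="http", timeout=5):
    """Fetch /metrics and report whether etcd metrics are exposed.

    Needs network access to the etcd node. Connection errors are not
    handled here and propagate to the caller.
    """
    url = metrics_url(host, port, scheme)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    return looks_like_etcd_metrics(body)
```

Run check_endpoint("<etcd_node_address>") from the Zabbix server or proxy host, so that the test follows the same network path as the monitoring itself.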

Additional points to consider:

  • If you have specified a non-standard port or scheme for etcd, change the macros {$ETCD.SCHEME} and {$ETCD.PORT} accordingly.
  • You can set the {$ETCD.USER} and {$ETCD.PASSWORD} macros in the template for use on a host level if necessary.
  • To test availability, run: zabbix_get -s etcd-host -k etcd.health.
  • See the macros section, as the macros set the trigger thresholds.

Configuration

No specific Zabbix configuration is required.

Macros used

{$ETCD.GRPC.ERRORS.MAX.WARN}
Description: The maximum number of gRPC request failures.
Default: 1

{$ETCD.GRPC_CODE.MATCHES}
Description: The filter of discoverable gRPC codes. See more details on https://github.com/grpc/grpc/blob/master/doc/statuscodes.md.
Default: .*

{$ETCD.GRPC_CODE.NOT_MATCHES}
Description: The filter to exclude discovered gRPC codes. See more details on https://github.com/grpc/grpc/blob/master/doc/statuscodes.md.
Default: CHANGE_IF_NEEDED

{$ETCD.GRPC_CODE.TRIGGER.MATCHES}
Description: The filter of discoverable gRPC codes, which will create triggers.
Default: Aborted|Unavailable

{$ETCD.HTTP.FAIL.MAX.WARN}
Description: The maximum number of HTTP request failures.
Default: 2

{$ETCD.LEADER.CHANGES.MAX.WARN}
Description: The maximum number of leader changes.
Default: 5

{$ETCD.OPEN.FDS.MAX.WARN}
Description: The maximum percentage of used file descriptors.
Default: 90

{$ETCD.PASSWORD}
Description: -
Default: (empty)

{$ETCD.PORT}
Description: The port of etcd API endpoint.
Default: 2379

{$ETCD.PROPOSAL.FAIL.MAX.WARN}
Description: The maximum number of proposal failures.
Default: 2

{$ETCD.PROPOSAL.PENDING.MAX.WARN}
Description: The maximum number of proposals in queue.
Default: 5

{$ETCD.SCHEME}
Description: The request scheme which may be http or https.
Default: http

{$ETCD.USER}
Description: -
Default: (empty)
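
To illustrate how the {$ETCD.SCHEME} and {$ETCD.PORT} macros end up in a request, here is a hedged Python sketch of macro expansion in which a host-level value overrides the template default. The URL pattern and the resolve_url helper are assumptions made for illustration; check the template itself for the actual item URL.

```python
import re

# Template-level defaults taken from the macros listed above.
TEMPLATE_MACROS = {
    "{$ETCD.SCHEME}": "http",
    "{$ETCD.PORT}": "2379",
}

# Assumed URL pattern; the real item URL is defined in the template.
URL_PATTERN = "{$ETCD.SCHEME}://{HOST.CONN}:{$ETCD.PORT}/metrics"


def resolve_url(host_conn, host_macros=None, template_macros=TEMPLATE_MACROS):
    """Expand user macros, letting host-level values win over template values."""
    merged = dict(template_macros)
    merged.update(host_macros or {})
    url = URL_PATTERN.replace("{HOST.CONN}", host_conn)
    # Macros without a known value are left untouched.
    return re.sub(
        r"\{\$[A-Z0-9_.]+\}",
        lambda m: merged.get(m.group(0), m.group(0)),
        url,
    )
```

For example, resolve_url("etcd-1", {"{$ETCD.PORT}": "2381"}) keeps the default scheme and changes only the port.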

Template links

There are no template links in this template.

Discovery rules

Name Description Type Key and additional info
gRPC codes discovery

-

DEPENDENT etcd.grpc_code.discovery

Preprocessing:

- PROMETHEUS_TO_JSON: grpc_server_handled_total

- JAVASCRIPT: The text is too long. Please see the template.

- DISCARD_UNCHANGED_HEARTBEAT: 1h

Filter:

AND

- {#GRPC.CODE} NOT_MATCHES_REGEX {$ETCD.GRPC_CODE.NOT_MATCHES}

- {#GRPC.CODE} MATCHES_REGEX {$ETCD.GRPC_CODE.MATCHES}

Overrides:

trigger
- {#GRPC.CODE} MATCHES_REGEX {$ETCD.GRPC_CODE.TRIGGER.MATCHES}
- TRIGGER_PROTOTYPE LIKE Too many failed gRPC requests
- DISCOVER

Peers discovery

-

DEPENDENT etcd.peer.discovery

Preprocessing:

- PROMETHEUS_TO_JSON: etcd_network_peer_sent_bytes_total
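
The filter and override of the gRPC codes discovery rule can be read as three regular expression checks. The Python sketch below approximates that logic with the default macro values. The function names are hypothetical, and Python's re module stands in for the regular expression engine used by Zabbix, so edge cases may differ.

```python
import re


def discover_codes(codes, matches=r".*", not_matches=r"CHANGE_IF_NEEDED"):
    """Keep codes that match `matches` AND do not match `not_matches`."""
    return [
        code
        for code in codes
        if re.search(matches, code) and not re.search(not_matches, code)
    ]


def codes_with_triggers(codes, trigger_matches=r"Aborted|Unavailable"):
    """Codes for which the override enables the trigger prototype."""
    return [code for code in codes if re.search(trigger_matches, code)]
```

With the defaults, every discovered code gets an item, because CHANGE_IF_NEEDED matches no real gRPC code, while only Aborted and Unavailable get a trigger.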

Items collected

Group Name Description Type Key and additional info
Etcd Etcd: Service's TCP port state

-

SIMPLE net.tcp.service["{$ETCD.SCHEME}","{HOST.CONN}","{$ETCD.PORT}"]

Preprocessing:

- DISCARD_UNCHANGED_HEARTBEAT: 10m

Etcd Etcd: Node health

-

HTTP_AGENT etcd.health

Preprocessing:

- JSONPATH: $.health

- BOOL_TO_DECIMAL

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- DISCARD_UNCHANGED_HEARTBEAT: 10m

Etcd Etcd: Server is a leader

Defines whether or not this member is a leader:

1 - it is;

0 - otherwise.

DEPENDENT etcd.is.leader

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_is_leader

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- DISCARD_UNCHANGED_HEARTBEAT: 10m

Etcd Etcd: Server has a leader

Defines whether or not a leader exists:

1 - it exists;

0 - it does not.

DEPENDENT etcd.has.leader

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_has_leader

- DISCARD_UNCHANGED_HEARTBEAT: 10m

Etcd Etcd: Leader changes

The number of leader changes the member has seen since its start.

DEPENDENT etcd.leader.changes

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_leader_changes_seen_total

Etcd Etcd: Proposals committed per second

The number of consensus proposals committed.

DEPENDENT etcd.proposals.committed.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_proposals_committed_total

- CHANGE_PER_SECOND

Etcd Etcd: Proposals applied per second

The number of consensus proposals applied.

DEPENDENT etcd.proposals.applied.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_proposals_applied_total

- CHANGE_PER_SECOND

Etcd Etcd: Proposals failed per second

The number of failed proposals seen.

DEPENDENT etcd.proposals.failed.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_proposals_failed_total

- CHANGE_PER_SECOND

Etcd Etcd: Proposals pending

The current number of pending proposals to commit.

DEPENDENT etcd.proposals.pending

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_proposals_pending

Etcd Etcd: Reads per second

The number of read actions by get/getRecursive, local to this member.

DEPENDENT etcd.reads.rate

Preprocessing:

- PROMETHEUS_TO_JSON: etcd_debugging_store_reads_total

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: Writes per second

The number of writes (e.g., set/compareAndDelete) seen by this member.

DEPENDENT etcd.writes.rate

Preprocessing:

- PROMETHEUS_TO_JSON: etcd_debugging_store_writes_total

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: Client gRPC received bytes per second

The number of bytes received from gRPC clients per second.

DEPENDENT etcd.network.grpc.received.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_network_client_grpc_received_bytes_total

- CHANGE_PER_SECOND

Etcd Etcd: Client gRPC sent bytes per second

The number of bytes sent to gRPC clients per second.

DEPENDENT etcd.network.grpc.sent.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_network_client_grpc_sent_bytes_total

- CHANGE_PER_SECOND

Etcd Etcd: HTTP requests received

The number of requests received into the system (successfully parsed and authenticated).

DEPENDENT etcd.http.requests.rate

Preprocessing:

- PROMETHEUS_TO_JSON: etcd_http_received_total

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: HTTP 5XX

The number of handled failures of requests (non-watches), by the method (GET/PUT etc.), and the code 5XX.

DEPENDENT etcd.http.requests.5xx.rate

Preprocessing:

- PROMETHEUS_TO_JSON: etcd_http_failed_total{code=~"5.+"}

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: HTTP 4XX

The number of handled failures of requests (non-watches), by the method (GET/PUT etc.), and the code 4XX.

DEPENDENT etcd.http.requests.4xx.rate

Preprocessing:

- PROMETHEUS_TO_JSON: etcd_http_failed_total{code=~"4.+"}

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: RPCs received per second

The number of RPC stream messages received on the server.

DEPENDENT etcd.grpc.received.rate

Preprocessing:

- PROMETHEUS_TO_JSON: grpc_server_msg_received_total

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: RPCs sent per second

The number of gRPC stream messages sent by the server.

DEPENDENT etcd.grpc.sent.rate

Preprocessing:

- PROMETHEUS_TO_JSON: grpc_server_msg_sent_total

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: RPCs started per second

The number of RPCs started on the server.

DEPENDENT etcd.grpc.started.rate

Preprocessing:

- PROMETHEUS_TO_JSON: grpc_server_started_total

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: Server version

The version of the etcd server.

DEPENDENT etcd.server.version

Preprocessing:

- JSONPATH: $.etcdserver

- DISCARD_UNCHANGED_HEARTBEAT: 1d

Etcd Etcd: Cluster version

The version of the etcd cluster.

DEPENDENT etcd.cluster.version

Preprocessing:

- JSONPATH: $.etcdcluster

- DISCARD_UNCHANGED_HEARTBEAT: 1d

Etcd Etcd: DB size

The total size of the underlying database.

DEPENDENT etcd.db.size

Preprocessing:

- PROMETHEUS_PATTERN: etcd_mvcc_db_total_size_in_bytes

Etcd Etcd: Keys compacted per second

The number of DB keys compacted per second.

DEPENDENT etcd.keys.compacted.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_db_compaction_keys_total

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- CHANGE_PER_SECOND

Etcd Etcd: Keys expired per second

The number of expired keys per second.

DEPENDENT etcd.keys.expired.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_store_expires_total

- CHANGE_PER_SECOND

Etcd Etcd: Keys total

The total number of keys.

DEPENDENT etcd.keys.total

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_keys_total

Etcd Etcd: Uptime

Etcd server uptime.

DEPENDENT etcd.uptime

Preprocessing:

- PROMETHEUS_PATTERN: process_start_time_seconds

- JAVASCRIPT: return (Math.floor(Date.now()/1000)-Number(value)); // use boot time to calculate uptime

Etcd Etcd: Virtual memory

The size of virtual memory expressed in bytes.

DEPENDENT etcd.virtual.bytes

Preprocessing:

- PROMETHEUS_PATTERN: process_virtual_memory_bytes

Etcd Etcd: Resident memory

The size of resident memory expressed in bytes.

DEPENDENT etcd.res.bytes

Preprocessing:

- PROMETHEUS_PATTERN: process_resident_memory_bytes

Etcd Etcd: CPU

The total user and system CPU time spent in seconds.

DEPENDENT etcd.cpu.util

Preprocessing:

- PROMETHEUS_PATTERN: process_cpu_seconds_total

- CHANGE_PER_SECOND

Etcd Etcd: Open file descriptors

The number of open file descriptors.

DEPENDENT etcd.open.fds

Preprocessing:

- PROMETHEUS_PATTERN: process_open_fds

Etcd Etcd: Maximum open file descriptors

The maximum number of open file descriptors.

DEPENDENT etcd.max.fds

Preprocessing:

- PROMETHEUS_PATTERN: process_max_fds

Etcd Etcd: Deletes per second

The number of deletes seen by this member per second.

DEPENDENT etcd.delete.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_mvcc_delete_total

- CHANGE_PER_SECOND

Etcd Etcd: PUT per second

The number of puts seen by this member per second.

DEPENDENT etcd.put.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_mvcc_put_total

- CHANGE_PER_SECOND

Etcd Etcd: Range per second

The number of ranges seen by this member per second.

DEPENDENT etcd.range.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_range_total

- CHANGE_PER_SECOND

Etcd Etcd: Transaction per second

The number of transactions seen by this member per second.

DEPENDENT etcd.txn.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_range_total

- CHANGE_PER_SECOND

Etcd Etcd: Pending events

The total number of pending events to be sent.

DEPENDENT etcd.events.sent.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_pending_events_total

Etcd Etcd: RPCs completed with code {#GRPC.CODE}

The number of RPCs completed on the server with grpc_code {#GRPC.CODE}.

DEPENDENT etcd.grpc.handled.rate[{#GRPC.CODE}]

Preprocessing:

- PROMETHEUS_TO_JSON: grpc_server_handled_total{grpc_method="{#GRPC.CODE}"}

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: Etcd peer {#ETCD.PEER}: Bytes sent

The number of bytes sent to a peer with the ID {#ETCD.PEER}.

DEPENDENT etcd.bytes.sent.rate[{#ETCD.PEER}]

Preprocessing:

- PROMETHEUS_PATTERN: etcd_network_peer_sent_bytes_total{To="{#ETCD.PEER}"}

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- CHANGE_PER_SECOND

Etcd Etcd: Etcd peer {#ETCD.PEER}: Bytes received

The number of bytes received from a peer with the ID {#ETCD.PEER}.

DEPENDENT etcd.bytes.received.rate[{#ETCD.PEER}]

Preprocessing:

- PROMETHEUS_PATTERN: etcd_network_peer_received_bytes_total{From="{#ETCD.PEER}"}

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- CHANGE_PER_SECOND

Etcd Etcd: Etcd peer {#ETCD.PEER}: Send failures

The number of send failures to a peer with the ID {#ETCD.PEER}.

DEPENDENT etcd.sent.fail.rate[{#ETCD.PEER}]

Preprocessing:

- PROMETHEUS_PATTERN: etcd_network_peer_sent_failures_total{To="{#ETCD.PEER}"}

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- CHANGE_PER_SECOND

Etcd Etcd: Etcd peer {#ETCD.PEER}: Receive failures

The number of receive failures from a peer with the ID {#ETCD.PEER}.

DEPENDENT etcd.received.fail.rate[{#ETCD.PEER}]

Preprocessing:

- PROMETHEUS_PATTERN: etcd_network_peer_received_failures_total{To="{#ETCD.PEER}"}

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- CHANGE_PER_SECOND

Zabbix raw items Etcd: Get node metrics

-

HTTP_AGENT etcd.get_metrics
Zabbix raw items Etcd: Get version

-

HTTP_AGENT etcd.get_version
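
Many of the dependent items above chain two preprocessing steps: a Prometheus pattern that extracts one sample from the raw /metrics text, followed by a change per second calculation. The Python sketch below illustrates that chain under simplifying assumptions: it handles only samples without labels, the function names are hypothetical, and it is not the Zabbix implementation.

```python
def prometheus_value(text, metric):
    """Return the value of the first unlabeled sample named `metric`."""
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines (# HELP / # TYPE).
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == metric:
            return float(parts[1])
    raise LookupError(f"metric not found: {metric}")


def change_per_second(prev_value, prev_time, value, time):
    """Compute (value - prev_value) / (time - prev_time).

    Returns None when no rate can be derived: no time has elapsed, or the
    counter went backwards (for example, after an etcd restart).
    """
    if time <= prev_time or value < prev_value:
        return None
    return (value - prev_value) / (time - prev_time)
```

For example, a counter that grows from 1200 to 1500 within 60 seconds yields a rate of 5 per second, which is what an item such as Proposals committed per second stores.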

Triggers

Name Description Expression Severity Dependencies and additional info
Etcd: Service is unavailable

-

last(/Etcd by HTTP/net.tcp.service["{$ETCD.SCHEME}","{HOST.CONN}","{$ETCD.PORT}"])=0 AVERAGE

Manual close: YES

Etcd: Node healthcheck failed

See more details on https://etcd.io/docs/v3.5/op-guide/monitoring/#health-check.

last(/Etcd by HTTP/etcd.health)=0 AVERAGE

Depends on:

- Etcd: Service is unavailable

Etcd: Failed to fetch info data

Zabbix has not received data for items for the last 30 minutes.

nodata(/Etcd by HTTP/etcd.is.leader,30m)=1 WARNING

Manual close: YES

Depends on:

- Etcd: Service is unavailable

Etcd: Member has no leader

If a member does not have a leader, it is totally unavailable.

last(/Etcd by HTTP/etcd.has.leader)=0 AVERAGE
Etcd: Instance has seen too many leader changes

Rapid leadership changes impact the performance of etcd significantly. It also signals that the leader is unstable, perhaps due to network connectivity issues or excessive load hitting the etcd cluster.

(max(/Etcd by HTTP/etcd.leader.changes,15m)-min(/Etcd by HTTP/etcd.leader.changes,15m))>{$ETCD.LEADER.CHANGES.MAX.WARN} WARNING
Etcd: Too many proposal failures

Normally related to two issues: temporary failures related to a leader election or longer downtime caused by a loss of quorum in the cluster.

min(/Etcd by HTTP/etcd.proposals.failed.rate,5m)>{$ETCD.PROPOSAL.FAIL.MAX.WARN} WARNING
Etcd: Too many proposals are queued to commit

A rising number of pending proposals suggests that there is a high client load or that the member cannot commit proposals.

min(/Etcd by HTTP/etcd.proposals.pending,5m)>{$ETCD.PROPOSAL.PENDING.MAX.WARN} WARNING
Etcd: Too many HTTP requests failures

Too many requests failed on etcd instance with the 5xx HTTP code.

min(/Etcd by HTTP/etcd.http.requests.5xx.rate,5m)>{$ETCD.HTTP.FAIL.MAX.WARN} WARNING
Etcd: Server version has changed

The Etcd version has changed. Acknowledge to close manually.

last(/Etcd by HTTP/etcd.server.version,#1)<>last(/Etcd by HTTP/etcd.server.version,#2) and length(last(/Etcd by HTTP/etcd.server.version))>0 INFO

Manual close: YES

Etcd: Cluster version has changed

The Etcd cluster version has changed. Acknowledge to close manually.

last(/Etcd by HTTP/etcd.cluster.version,#1)<>last(/Etcd by HTTP/etcd.cluster.version,#2) and length(last(/Etcd by HTTP/etcd.cluster.version))>0 INFO

Manual close: YES

Etcd: Host has been restarted

The host uptime is less than 10 minutes.

last(/Etcd by HTTP/etcd.uptime)<10m INFO

Manual close: YES

Etcd: Current number of open files is too high

Heavy file descriptor usage (i.e., close to the process's file descriptor limit) indicates a potential file descriptor exhaustion issue.

If the file descriptors are exhausted, etcd may panic because it cannot create new WAL files.

min(/Etcd by HTTP/etcd.open.fds,5m)/last(/Etcd by HTTP/etcd.max.fds)*100>{$ETCD.OPEN.FDS.MAX.WARN} WARNING
Etcd: Too many failed gRPC requests with code: {#GRPC.CODE}

-

min(/Etcd by HTTP/etcd.grpc.handled.rate[{#GRPC.CODE}],5m)>{$ETCD.GRPC.ERRORS.MAX.WARN} WARNING
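
To make the trigger expressions above easier to read, here is a small Python sketch of two of them, evaluated over lists of recent values. The function names are hypothetical, and the lists stand in for the item history that Zabbix selects for the 15m and 5m windows.

```python
def too_many_leader_changes(samples_15m, threshold=5):
    """max(...,15m) - min(...,15m) > {$ETCD.LEADER.CHANGES.MAX.WARN}."""
    return (max(samples_15m) - min(samples_15m)) > threshold


def open_fds_too_high(open_fds_5m, max_fds, threshold_pct=90):
    """min(open fds,5m) / last(max fds) * 100 > {$ETCD.OPEN.FDS.MAX.WARN}."""
    return min(open_fds_5m) / max_fds * 100 > threshold_pct
```

Because the second expression uses min() over the window, a single short spike does not fire the trigger: the usage has to stay above the threshold for the whole 5 minutes.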

Feedback

Please report any issues with the template at https://support.zabbix.com.
