etcd

etcd is an open source distributed key-value store used to hold and manage the critical information that distributed systems need to keep running. Most notably, it manages the configuration data, state data, and metadata for Kubernetes, the popular container orchestration platform.

Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/etcd_http


Template App Etcd by HTTP

Overview

For Zabbix version: 5.0
This template monitors Etcd with Zabbix and works without any external scripts. Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.

Template App Etcd collects metrics with the HTTP agent from the /metrics endpoint. See https://etcd.io/docs/v3.4.0/op-guide/monitoring/#metrics-endpoint.

This template was tested on:

  • Etcd, version 3.0+

Setup

  1. Import the template into Zabbix.
  2. After importing the template, make sure that etcd allows metric collection. Test by running: curl -L http://localhost:2379/metrics
  3. Check that etcd is accessible from the Zabbix proxy or Zabbix server, depending on where you plan to do the monitoring. To verify, run: curl -L http://<etcd_node_address>:2379/metrics
  4. Add the template to each node with etcd. By default, the template uses the client port. You can configure the metrics endpoint location with the --listen-metrics-urls flag (see the etcd docs).

    If you have specified a non-standard port for etcd, don't forget to change the macros {$ETCD.SCHEME} and {$ETCD.PORT}.

    If needed, you can set the {$ETCD.USER} and {$ETCD.PASSWORD} macros in the template for use at the host level.

    Test availability: zabbix_get -s etcd-host -k etcd.health

Also see the Macros used section, because the macros set the trigger thresholds.
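
The endpoint checks in the setup steps can also be reasoned about in code. The following is a minimal sketch, not part of the template: the helper names metricsUrl and missingMetrics and the sample response body are assumptions made for illustration. It builds the metrics URL from the same parts as {$ETCD.SCHEME}, the host address, and {$ETCD.PORT}, and reports which expected metric names are absent from a /metrics response.

```javascript
// Hypothetical helpers; only the macro meanings and metric names come from the template.

// Builds the URL that the curl checks in the setup steps request.
function metricsUrl(scheme, host, port) {
  return scheme + '://' + host + ':' + port + '/metrics';
}

// Returns the names from `required` that have no sample line in `body`.
function missingMetrics(body, required) {
  var present = {};
  body.split('\n').forEach(function (rawLine) {
    var line = rawLine.trim();
    if (line === '' || line.charAt(0) === '#') { return; } // skip blank lines and HELP/TYPE comments
    var name = line.split(/[{\s]/)[0];                     // metric name ends at '{' or whitespace
    present[name] = true;
  });
  return required.filter(function (name) { return !present[name]; });
}

// Made-up sample of a /metrics response body.
var sample = [
  '# TYPE etcd_server_has_leader gauge',
  'etcd_server_has_leader 1',
  'process_open_fds 42',
  'grpc_server_handled_total{grpc_code="OK",grpc_method="Range"} 10'
].join('\n');

console.log(metricsUrl('http', 'localhost', '2379'));
console.log(missingMetrics(sample, ['etcd_server_has_leader', 'process_max_fds']));
```

A non-empty result from missingMetrics would suggest that the endpoint is reachable but does not expose everything the template expects.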

Zabbix configuration

No specific Zabbix configuration is required.

Macros used

Each entry gives the macro name, its description, and its default value.

  • {$ETCD.GRPC.ERRORS.MAX.WARN}: Maximum number of gRPC request failures. Default: 1
  • {$ETCD.GRPC_CODE.MATCHES}: Filter of discoverable gRPC codes (https://github.com/grpc/grpc/blob/master/doc/statuscodes.md). Default: .*
  • {$ETCD.GRPC_CODE.NOT_MATCHES}: Filter to exclude discovered gRPC codes (https://github.com/grpc/grpc/blob/master/doc/statuscodes.md). Default: CHANGE_IF_NEEDED
  • {$ETCD.GRPC_CODE.TRIGGER.MATCHES}: Filter of discoverable gRPC codes for which triggers will be created. Default: Aborted|Unavailable
  • {$ETCD.HTTP.FAIL.MAX.WARN}: Maximum number of HTTP request failures. Default: 2
  • {$ETCD.LEADER.CHANGES.MAX.WARN}: Maximum number of leader changes. Default: 5
  • {$ETCD.OPEN.FDS.MAX.WARN}: Maximum percentage of used file descriptors. Default: 90
  • {$ETCD.PASSWORD}: (no description). Default: empty
  • {$ETCD.PORT}: The port of the Etcd API endpoint. Default: 2379
  • {$ETCD.PROPOSAL.FAIL.MAX.WARN}: Maximum number of proposal failures. Default: 2
  • {$ETCD.PROPOSAL.PENDING.MAX.WARN}: Maximum number of proposals in queue. Default: 5
  • {$ETCD.SCHEME}: Request scheme, which may be http or https. Default: http
  • {$ETCD.USER}: (no description). Default: empty

Template links

There are no template links in this template.

Discovery rules

Name Description Type Key and additional info
gRPC codes discovery DEPENDENT etcd.grpc_code.discovery

Preprocessing:

- PROMETHEUS_TO_JSON: grpc_server_handled_total

- JAVASCRIPT: var data = JSON.parse(value), lookup = {}, result =[]; for (var item, i = 0; item = data[i++];) { var code = item.labels.grpc_code; if (!(code in lookup)) { lookup[code] = 1; result.push({ "{#GRPC.CODE}": code}); } } return JSON.stringify(result);

- DISCARD_UNCHANGED_HEARTBEAT: 1h

Filter:

AND

- A: {#GRPC.CODE} NOT_MATCHES_REGEX {$ETCD.GRPC_CODE.NOT_MATCHES}

- B: {#GRPC.CODE} MATCHES_REGEX {$ETCD.GRPC_CODE.MATCHES}

Peers discovery DEPENDENT etcd.peer.discovery

Preprocessing:

- PROMETHEUS_TO_JSON: etcd_network_peer_sent_bytes_total
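
To make the gRPC codes discovery rule easier to follow, here is a sketch that runs its JavaScript step on sample data and then applies the two filter conditions. The discovery logic is the script from the rule, wrapped in a function. The filterCodes helper and the sample input are assumptions for illustration, and JavaScript RegExp only approximates the regular expression engine Zabbix uses for the filter.

```javascript
// Discovery step from the rule; `value` is the output of PROMETHEUS_TO_JSON.
function discoverGrpcCodes(value) {
  var data = JSON.parse(value), lookup = {}, result = [];
  for (var item, i = 0; item = data[i++];) {
    var code = item.labels.grpc_code;
    if (!(code in lookup)) {          // emit each gRPC code only once
      lookup[code] = 1;
      result.push({ '{#GRPC.CODE}': code });
    }
  }
  return JSON.stringify(result);
}

// Hypothetical stand-in for the AND filter:
// keep codes that match `matches` and do not match `notMatches`.
function filterCodes(lldJson, matches, notMatches) {
  var include = new RegExp(matches), exclude = new RegExp(notMatches);
  return JSON.parse(lldJson).filter(function (row) {
    var code = row['{#GRPC.CODE}'];
    return include.test(code) && !exclude.test(code);
  });
}

// Made-up sample shaped like PROMETHEUS_TO_JSON output (other fields omitted).
var sample = JSON.stringify([
  { name: 'grpc_server_handled_total', value: '10', labels: { grpc_code: 'OK', grpc_method: 'Range' } },
  { name: 'grpc_server_handled_total', value: '3', labels: { grpc_code: 'OK', grpc_method: 'Put' } },
  { name: 'grpc_server_handled_total', value: '1', labels: { grpc_code: 'Unavailable', grpc_method: 'Watch' } }
]);

var discovered = discoverGrpcCodes(sample);
console.log(discovered);
// Default macro values: {$ETCD.GRPC_CODE.MATCHES} = .* and {$ETCD.GRPC_CODE.NOT_MATCHES} = CHANGE_IF_NEEDED
console.log(filterCodes(discovered, '.*', 'CHANGE_IF_NEEDED'));
```

With the default macro values every discovered code passes the filter, because the placeholder CHANGE_IF_NEEDED does not match any real gRPC status code.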

Items collected

Group Name Description Type Key and additional info
Etcd Etcd: Service's TCP port state

-

SIMPLE net.tcp.service["{$ETCD.SCHEME}","{HOST.CONN}","{$ETCD.PORT}"]

Preprocessing:

- DISCARD_UNCHANGED_HEARTBEAT: 10m

Etcd Etcd: Node health

-

HTTP_AGENT etcd.health

Preprocessing:

- JSONPATH: $.health

- BOOL_TO_DECIMAL

- DISCARD_UNCHANGED_HEARTBEAT: 10m

Etcd Etcd: Server is a leader

Whether or not this member is a leader. 1 if it is, 0 otherwise.

DEPENDENT etcd.is.leader

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_is_leader

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- DISCARD_UNCHANGED_HEARTBEAT: 10m

Etcd Etcd: Server has a leader

Whether or not a leader exists. 1 if a leader exists, 0 otherwise.

DEPENDENT etcd.has.leader

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_has_leader

- DISCARD_UNCHANGED_HEARTBEAT: 10m

Etcd Etcd: Leader changes

The number of leader changes the member has seen since its start.

DEPENDENT etcd.leader.changes

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_leader_changes_seen_total

Etcd Etcd: Proposals committed per second

The number of consensus proposals committed.

DEPENDENT etcd.proposals.committed.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_proposals_committed_total

- CHANGE_PER_SECOND

Etcd Etcd: Proposals applied per second

The number of consensus proposals applied.

DEPENDENT etcd.proposals.applied.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_proposals_applied_total

- CHANGE_PER_SECOND

Etcd Etcd: Proposals failed per second

The number of failed proposals seen.

DEPENDENT etcd.proposals.failed.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_proposals_failed_total

- CHANGE_PER_SECOND

Etcd Etcd: Proposals pending

The current number of pending proposals to commit.

DEPENDENT etcd.proposals.pending

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_proposals_pending

Etcd Etcd: Reads per second

Number of read actions (get/getRecursive), local to this member.

DEPENDENT etcd.reads.rate

Preprocessing:

- PROMETHEUS_TO_JSON: etcd_debugging_store_reads_total

- JAVASCRIPT: /* calculates total reads */ var valueArr = JSON.parse(value); return valueArr.reduce(function(acc,obj){ return acc + parseFloat(obj['value']) },0);

- CHANGE_PER_SECOND

Etcd Etcd: Writes per second

Number of writes (e.g. set/compareAndDelete) seen by this member.

DEPENDENT etcd.writes.rate

Preprocessing:

- PROMETHEUS_TO_JSON: etcd_debugging_store_writes_total

- JAVASCRIPT: var valueArr = JSON.parse(value); return valueArr.reduce(function(acc,obj){ return acc + parseFloat(obj['value']) },0);

- CHANGE_PER_SECOND

Etcd Etcd: Client gRPC received bytes per second

The number of bytes received from gRPC clients per second.

DEPENDENT etcd.network.grpc.received.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_network_client_grpc_received_bytes_total

- CHANGE_PER_SECOND

Etcd Etcd: Client gRPC sent bytes per second

The number of bytes sent to gRPC clients per second.

DEPENDENT etcd.network.grpc.sent.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_network_client_grpc_sent_bytes_total

- CHANGE_PER_SECOND

Etcd Etcd: HTTP requests received

Number of requests received into the system (successfully parsed and authd).

DEPENDENT etcd.http.requests.rate

Preprocessing:

- PROMETHEUS_TO_JSON: etcd_http_received_total

- JAVASCRIPT: var valueArr = JSON.parse(value); return valueArr.reduce(function(acc,obj){ return acc + parseFloat(obj['value']) },0);

- CHANGE_PER_SECOND

Etcd Etcd: HTTP 5XX

Number of handle failures of requests (non-watches), by method (GET/PUT etc.), and code 5XX.

DEPENDENT etcd.http.requests.5xx.rate

Preprocessing:

- PROMETHEUS_TO_JSON: etcd_http_failed_total{code=~"5.+"}

- JAVASCRIPT: var valueArr = JSON.parse(value); return valueArr.reduce(function(acc,obj){ return acc + parseFloat(obj['value']) },0);

- CHANGE_PER_SECOND

Etcd Etcd: HTTP 4XX

Number of handle failures of requests (non-watches), by method (GET/PUT etc.), and code 4XX.

DEPENDENT etcd.http.requests.4xx.rate

Preprocessing:

- PROMETHEUS_TO_JSON: etcd_http_failed_total{code=~"4.+"}

- JAVASCRIPT: var valueArr = JSON.parse(value); return valueArr.reduce(function(acc,obj){ return acc + parseFloat(obj['value']) },0);

- CHANGE_PER_SECOND

Etcd Etcd: RPCs received per second

The number of RPC stream messages received on the server.

DEPENDENT etcd.grpc.received.rate

Preprocessing:

- PROMETHEUS_TO_JSON: grpc_server_msg_received_total

- JAVASCRIPT: var valueArr = JSON.parse(value); return valueArr.reduce(function(acc,obj){ return acc + parseFloat(obj['value']) },0);

- CHANGE_PER_SECOND

Etcd Etcd: RPCs sent per second

The number of gRPC stream messages sent by the server.

DEPENDENT etcd.grpc.sent.rate

Preprocessing:

- PROMETHEUS_TO_JSON: grpc_server_msg_sent_total

- JAVASCRIPT: var valueArr = JSON.parse(value); return valueArr.reduce(function(acc,obj){ return acc + parseFloat(obj['value']) },0);

- CHANGE_PER_SECOND

Etcd Etcd: RPCs started per second

The number of RPCs started on the server.

DEPENDENT etcd.grpc.started.rate

Preprocessing:

- PROMETHEUS_TO_JSON: grpc_server_started_total

- JAVASCRIPT: var valueArr = JSON.parse(value); return valueArr.reduce(function(acc,obj){ return acc + parseFloat(obj['value']) },0);

- CHANGE_PER_SECOND

Etcd Etcd: Server version

Version of the Etcd server.

DEPENDENT etcd.server.version

Preprocessing:

- JSONPATH: $.etcdserver

- DISCARD_UNCHANGED_HEARTBEAT: 1d

Etcd Etcd: Cluster version

Version of the Etcd cluster.

DEPENDENT etcd.cluster.version

Preprocessing:

- JSONPATH: $.etcdcluster

- DISCARD_UNCHANGED_HEARTBEAT: 1d

Etcd Etcd: DB size

Total size of the underlying database.

DEPENDENT etcd.db.size

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_db_total_size_in_bytes

Etcd Etcd: Keys compacted per second

The number of DB keys compacted per second.

DEPENDENT etcd.keys.compacted.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_db_compaction_keys_total

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- CHANGE_PER_SECOND

Etcd Etcd: Keys expired per second

The number of expired keys per second.

DEPENDENT etcd.keys.expired.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_store_expires_total

- CHANGE_PER_SECOND

Etcd Etcd: Keys total

Total number of keys.

DEPENDENT etcd.keys.total

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_keys_total

Etcd Etcd: Uptime

Etcd server uptime.

DEPENDENT etcd.uptime

Preprocessing:

- PROMETHEUS_PATTERN: process_start_time_seconds

- JAVASCRIPT: /* use boottime to calculate uptime */ return (Math.floor(Date.now()/1000)-Number(value));

Etcd Etcd: Virtual memory

Virtual memory size in bytes.

DEPENDENT etcd.virtual.bytes

Preprocessing:

- PROMETHEUS_PATTERN: process_virtual_memory_bytes

Etcd Etcd: Resident memory

Resident memory size in bytes.

DEPENDENT etcd.res.bytes

Preprocessing:

- PROMETHEUS_PATTERN: process_resident_memory_bytes

Etcd Etcd: CPU

Total user and system CPU time spent in seconds.

DEPENDENT etcd.cpu.util

Preprocessing:

- PROMETHEUS_PATTERN: process_cpu_seconds_total

- CHANGE_PER_SECOND

Etcd Etcd: Open file descriptors

Number of open file descriptors.

DEPENDENT etcd.open.fds

Preprocessing:

- PROMETHEUS_PATTERN: process_open_fds

Etcd Etcd: Maximum open file descriptors

The maximum number of open file descriptors.

DEPENDENT etcd.max.fds

Preprocessing:

- PROMETHEUS_PATTERN: process_max_fds

Etcd Etcd: Deletes per second

The number of deletes seen by this member per second.

DEPENDENT etcd.delete.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_delete_total

- CHANGE_PER_SECOND

Etcd Etcd: PUT per second

The number of puts seen by this member per second.

DEPENDENT etcd.put.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_put_total

- CHANGE_PER_SECOND

Etcd Etcd: Range per second

The number of ranges seen by this member per second.

DEPENDENT etcd.range.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_range_total

- CHANGE_PER_SECOND

Etcd Etcd: Transaction per second

The number of transactions seen by this member per second.

DEPENDENT etcd.txn.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_txn_total

- CHANGE_PER_SECOND

Etcd Etcd: Events sent per second

The number of events sent by this member per second

DEPENDENT etcd.events.sent.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_events_total

- CHANGE_PER_SECOND

Etcd Etcd: Pending events

Total number of pending events to be sent.

DEPENDENT etcd.pending.events

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_pending_events_total

Etcd Etcd: RPCs completed with code {#GRPC.CODE}

The number of RPCs completed on the server with grpc_code {#GRPC.CODE}

DEPENDENT etcd.grpc.handled.rate[{#GRPC.CODE}]

Preprocessing:

- PROMETHEUS_TO_JSON: grpc_server_handled_total{grpc_code="{#GRPC.CODE}"}

- JAVASCRIPT: var valueArr = JSON.parse(value); return valueArr.reduce(function(acc,obj){ return acc + parseFloat(obj['value']) },0);

- CHANGE_PER_SECOND

Etcd Etcd: Etcd peer {#ETCD.PEER}: Bytes sent

The number of bytes sent to peer with ID {#ETCD.PEER}

DEPENDENT etcd.bytes.sent.rate[{#ETCD.PEER}]

Preprocessing:

- PROMETHEUS_PATTERN: etcd_network_peer_sent_bytes_total{To="{#ETCD.PEER}"}

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- CHANGE_PER_SECOND

Etcd Etcd: Etcd peer {#ETCD.PEER}: Bytes received

The number of bytes received from peer with ID {#ETCD.PEER}

DEPENDENT etcd.bytes.received.rate[{#ETCD.PEER}]

Preprocessing:

- PROMETHEUS_PATTERN: etcd_network_peer_received_bytes_total{From="{#ETCD.PEER}"}

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- CHANGE_PER_SECOND

Etcd Etcd: Etcd peer {#ETCD.PEER}: Send failures

The number of send failures to the peer with ID {#ETCD.PEER}

DEPENDENT etcd.sent.fail.rate[{#ETCD.PEER}]

Preprocessing:

- PROMETHEUS_PATTERN: etcd_network_peer_sent_failures_total{To="{#ETCD.PEER}"}

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- CHANGE_PER_SECOND

Etcd Etcd: Etcd peer {#ETCD.PEER}: Receive failures

The number of receive failures from the peer with ID {#ETCD.PEER}

DEPENDENT etcd.received.fail.rate[{#ETCD.PEER}]

Preprocessing:

- PROMETHEUS_PATTERN: etcd_network_peer_received_failures_total{From="{#ETCD.PEER}"}

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- CHANGE_PER_SECOND

Zabbix_raw_items Etcd: Get node metrics

-

HTTP_AGENT etcd.get_metrics

Zabbix_raw_items Etcd: Get version

-

HTTP_AGENT etcd.get_version
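
Many of the items above share the same preprocessing chain: sum all label series of a metric with the JavaScript step, then apply CHANGE_PER_SECOND. The sketch below walks through that chain and the uptime calculation on made-up samples. The function names, the clock field, and the sample values are assumptions for illustration. The rate formula reflects how Zabbix documents CHANGE_PER_SECOND: (value - previous value) / (time - previous time).

```javascript
// Same reduction as the JAVASCRIPT step used by the rate items.
function sumSeries(value) {
  var valueArr = JSON.parse(value);
  return valueArr.reduce(function (acc, obj) {
    return acc + parseFloat(obj['value']);
  }, 0);
}

// Hypothetical model of CHANGE_PER_SECOND; `clock` is the collection time in seconds.
function changePerSecond(prev, curr) {
  return (curr.value - prev.value) / (curr.clock - prev.clock);
}

// Same arithmetic as the uptime item, with "now" passed in so the result is reproducible.
function uptimeSeconds(processStartTime, nowMs) {
  return Math.floor(nowMs / 1000) - Number(processStartTime);
}

// Two made-up polls of etcd_http_received_total, 60 seconds apart.
var firstPoll = JSON.stringify([
  { name: 'etcd_http_received_total', value: '100', labels: { method: 'GET' } },
  { name: 'etcd_http_received_total', value: '20', labels: { method: 'PUT' } }
]);
var secondPoll = JSON.stringify([
  { name: 'etcd_http_received_total', value: '160', labels: { method: 'GET' } },
  { name: 'etcd_http_received_total', value: '20', labels: { method: 'PUT' } }
]);

var rate = changePerSecond(
  { value: sumSeries(firstPoll), clock: 1000 },
  { value: sumSeries(secondPoll), clock: 1060 }
);
console.log(rate);                                       // (180 - 120) / 60 = 1
console.log(uptimeSeconds('1600000000', 1600000600000)); // 600
```

Summing before taking the rate means the item reports one combined rate across all label combinations instead of a rate per label.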

Triggers

Name Description Expression Severity Dependencies and additional info
Etcd: Service is unavailable

-

{TEMPLATE_NAME:net.tcp.service["{$ETCD.SCHEME}","{HOST.CONN}","{$ETCD.PORT}"].last()}=0 AVERAGE

Manual close: YES

Etcd: Node healthcheck failed

https://etcd.io/docs/v3.4.0/op-guide/monitoring/#health-check

{TEMPLATE_NAME:etcd.health.last()}=0 AVERAGE

Depends on:

- Etcd: Service is unavailable

Etcd: Failed to fetch info data (or no data for 30m)

Zabbix has not received data for items for the last 30 minutes

{TEMPLATE_NAME:etcd.is.leader.nodata(30m)}=1 WARNING

Manual close: YES

Depends on:

- Etcd: Service is unavailable

Etcd: Member has no leader

"If a member does not have a leader, it is totally unavailable."

{TEMPLATE_NAME:etcd.has.leader.last()}=0 AVERAGE

Etcd: Instance has seen too many leader changes (over {$ETCD.LEADER.CHANGES.MAX.WARN} for 15m)

Rapid leadership changes impact the performance of etcd significantly. They also signal that the leader is unstable, perhaps due to network connectivity issues or excessive load hitting the etcd cluster.

{TEMPLATE_NAME:etcd.leader.changes.delta(15m)}>{$ETCD.LEADER.CHANGES.MAX.WARN} WARNING

Etcd: Too many proposal failures (over {$ETCD.PROPOSAL.FAIL.MAX.WARN} for 5m)

"Normally related to two issues: temporary failures related to a leader election or longer downtime caused by a loss of quorum in the cluster."

{TEMPLATE_NAME:etcd.proposals.failed.rate.min(5m)}>{$ETCD.PROPOSAL.FAIL.MAX.WARN} WARNING

Etcd: Too many proposals are queued to commit (over {$ETCD.PROPOSAL.PENDING.MAX.WARN} for 5m)

"Rising pending proposals suggest there is a high client load or the member cannot commit proposals."

{TEMPLATE_NAME:etcd.proposals.pending.min(5m)}>{$ETCD.PROPOSAL.PENDING.MAX.WARN} WARNING

Etcd: Too many HTTP requests failures (over {$ETCD.HTTP.FAIL.MAX.WARN} for 5m)

"Too many requests failed on the etcd instance with a 5xx HTTP code"

{TEMPLATE_NAME:etcd.http.requests.5xx.rate.min(5m)}>{$ETCD.HTTP.FAIL.MAX.WARN} WARNING

Etcd: Server version has changed (new version: {ITEM.VALUE})

Etcd version has changed. Ack to close.

{TEMPLATE_NAME:etcd.server.version.diff()}=1 and {TEMPLATE_NAME:etcd.server.version.strlen()}>0 INFO

Manual close: YES

Etcd: Cluster version has changed (new version: {ITEM.VALUE})

Etcd version has changed. Ack to close.

{TEMPLATE_NAME:etcd.cluster.version.diff()}=1 and {TEMPLATE_NAME:etcd.cluster.version.strlen()}>0 INFO

Manual close: YES

Etcd: has been restarted (uptime < 10m)

Uptime is less than 10 minutes

{TEMPLATE_NAME:etcd.uptime.last()}<10m INFO

Manual close: YES

Etcd: Current number of open files is too high (over {$ETCD.OPEN.FDS.MAX.WARN}% for 5m)

"Heavy file descriptor usage (i.e., near the process’s file descriptor limit) indicates a potential file descriptor exhaustion issue. If the file descriptors are exhausted, etcd may panic because it cannot create new WAL files."

{TEMPLATE_NAME:etcd.open.fds.min(5m)}/{Template App Etcd by HTTP:etcd.max.fds.last()}*100>{$ETCD.OPEN.FDS.MAX.WARN} WARNING

Etcd: Too many failed gRPC requests with code: {#GRPC.CODE} (over {$ETCD.GRPC.ERRORS.MAX.WARN} in 5m)

-

{TEMPLATE_NAME:etcd.grpc.handled.rate[{#GRPC.CODE}].min(5m)}>{$ETCD.GRPC.ERRORS.MAX.WARN} WARNING
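
As a worked example of how these threshold triggers evaluate, the sketch below reproduces the arithmetic of the open file descriptors trigger: the minimum of etcd.open.fds over the window, divided by the last etcd.max.fds value, times 100, compared with {$ETCD.OPEN.FDS.MAX.WARN}. The function names and sample values are assumptions for illustration. Zabbix evaluates the real expression itself.

```javascript
// Percentage of the descriptor limit used by the lowest sample in the window.
function fdUsagePercent(openFdsSamples, maxFds) {
  var minOpen = Math.min.apply(null, openFdsSamples); // mirrors min(5m)
  return minOpen / maxFds * 100;
}

// True when the trigger condition holds for the given window.
function fdTriggerFires(openFdsSamples, maxFds, warnPercent) {
  return fdUsagePercent(openFdsSamples, maxFds) > warnPercent;
}

// Made-up samples collected during a 5 minute window, with a limit of 1024 descriptors.
console.log(fdTriggerFires([950, 940, 960, 930, 955], 1024, 90)); // min 930, about 90.8% -> true
console.log(fdTriggerFires([950, 600, 960], 1024, 90));           // min 600, about 58.6% -> false
```

Because the expression uses the minimum over the window, a single low sample keeps the trigger from firing. The condition holds only when usage stays above the threshold for the whole 5 minutes.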

Feedback

Please report any issues with the template at https://support.zabbix.com
