Envoy Proxy

Envoy is an open source edge and service proxy, designed for cloud-native applications.

This template is for Zabbix version: 7.0
Also available for: 6.4 6.2 6.0

Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/envoy_proxy_http?at=release/7.0

Envoy Proxy by HTTP

Overview

This template monitors Envoy Proxy with Zabbix and works without any external scripts. Most of the metrics are collected in a single request, thanks to Zabbix bulk data collection.

The Envoy Proxy by HTTP template uses an HTTP agent item to collect metrics from the {$ENVOY.METRICS.PATH} endpoint (default: /stats/prometheus).

Requirements

Zabbix version: 7.0 and higher.

Tested versions

This template has been tested on:

  • Envoy Proxy 1.20.2

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

Internal service metrics are collected from the {$ENVOY.METRICS.PATH} endpoint (default: /stats/prometheus). For details, see https://www.envoyproxy.io/docs/envoy/v1.20.0/operations/stats_overview

Remember to set the macros {$ENVOY.URL} and {$ENVOY.METRICS.PATH} to match your instance. Also, see the Macros used section for the macros that set trigger thresholds.

NOTE. Some metrics may not be collected depending on your Envoy Proxy instance version and configuration.
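
Before linking the template, it can help to confirm by hand that the endpoint answers. The sketch below joins the two macro values into one scrape URL and fetches it. The function names `build_metrics_url` and `fetch_metrics` are hypothetical helpers for this check, not part of Zabbix or Envoy; the defaults are the macro defaults listed in this document.

```python
from urllib.request import urlopen


def build_metrics_url(envoy_url: str, metrics_path: str) -> str:
    """Join {$ENVOY.URL} and {$ENVOY.METRICS.PATH} into a single URL."""
    return envoy_url.rstrip("/") + "/" + metrics_path.lstrip("/")


def fetch_metrics(
    envoy_url: str = "http://localhost:9901",
    metrics_path: str = "/stats/prometheus",
    timeout: float = 5.0,
) -> str:
    """Fetch the raw Prometheus-format payload (needs a reachable Envoy)."""
    with urlopen(build_metrics_url(envoy_url, metrics_path), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Print the first few lines of the payload as a quick sanity check.
    print("\n".join(fetch_metrics().splitlines()[:5]))
```

If the request fails or returns an empty body, the template's master item will most likely fail in the same way, so check the admin interface address and port first.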

Macros used

Name | Description | Default
{$ENVOY.URL} | Instance URL. | http://localhost:9901
{$ENVOY.METRICS.PATH} | The path from which Zabbix scrapes metrics in Prometheus format. | /stats/prometheus
{$ENVOY.CERT.MIN} | Minimum number of days before certificate expiration, used in the trigger expression. | 7

Items

Name Description Type Key and additional info
Get node metrics

Get server metrics.

HTTP agent envoy.get_metrics

Preprocessing

  • Check for not supported value: any error

    ⛔️Custom on fail: Discard value

Server state

State of the server.

Live - (default) Server is live and serving traffic.

Draining - Server is draining listeners in response to external health checks failing.

Pre initializing - Server has not yet completed cluster manager initialization.

Initializing - Server is running the cluster manager initialization callbacks (e.g., RDS).

Dependent item envoy.server.state

Preprocessing

  • Prometheus pattern: VALUE(envoy_server_state)

  • Discard unchanged with heartbeat: 3h

Server live

1 if the server is not currently draining, 0 otherwise.

Dependent item envoy.server.live

Preprocessing

  • Prometheus pattern: VALUE(envoy_server_live)

  • Discard unchanged with heartbeat: 3h

Uptime

Current server uptime in seconds.

Dependent item envoy.server.uptime

Preprocessing

  • Prometheus pattern: VALUE(envoy_server_uptime)

    ⛔️Custom on fail: Discard value

Certificate expiration, day before

Number of days until the next certificate being managed will expire.

Dependent item envoy.server.days_until_first_cert_expiring

Preprocessing

  • Prometheus pattern: VALUE(envoy_server_days_until_first_cert_expiring)

Server concurrency

Number of worker threads.

Dependent item envoy.server.concurrency

Preprocessing

  • Prometheus pattern: VALUE(envoy_server_concurrency)

Memory allocated

Current amount of allocated memory in bytes. Total of both new and old Envoy processes on hot restart.

Dependent item envoy.server.memory_allocated

Preprocessing

  • Prometheus pattern: VALUE(envoy_server_memory_allocated)

Memory heap size

Current reserved heap size in bytes. New Envoy process heap size on hot restart.

Dependent item envoy.server.memory_heap_size

Preprocessing

  • Prometheus pattern: VALUE(envoy_server_memory_heap_size)

Memory physical size

Current estimate of total bytes of the physical memory. New Envoy process physical memory size on hot restart.

Dependent item envoy.server.memory_physical_size

Preprocessing

  • Prometheus pattern: VALUE(envoy_server_memory_physical_size)

Filesystem, flushed by timer rate

Total number of times internal flush buffers are written to a file due to flush timeout per second.

Dependent item envoy.filesystem.flushed_by_timer.rate

Preprocessing

  • Prometheus pattern: VALUE(envoy_filesystem_flushed_by_timer)

  • Change per second
Filesystem, write completed rate

Total number of times a file was written per second.

Dependent item envoy.filesystem.write_completed.rate

Preprocessing

  • Prometheus pattern: VALUE(envoy_filesystem_write_completed)

  • Change per second
Filesystem, write failed rate

Total number of times an error occurred during a file write operation per second.

Dependent item envoy.filesystem.write_failed.rate

Preprocessing

  • Prometheus pattern: VALUE(envoy_filesystem_write_failed)

  • Change per second
Filesystem, reopen failed rate

Total number of times a file failed to open per second.

Dependent item envoy.filesystem.reopen_failed.rate

Preprocessing

  • Prometheus pattern: VALUE(envoy_filesystem_reopen_failed)

  • Change per second
Connections, total

Total connections of both new and old Envoy processes.

Dependent item envoy.server.total_connections

Preprocessing

  • Prometheus pattern: VALUE(envoy_server_total_connections)

Connections, parent

Total connections of the old Envoy process on hot restart.

Dependent item envoy.server.parent_connections

Preprocessing

  • Prometheus pattern: VALUE(envoy_server_parent_connections)

Clusters, warming

Number of currently warming (not active) clusters.

Dependent item envoy.cluster_manager.warming_clusters

Preprocessing

  • Prometheus pattern: VALUE(envoy_cluster_manager_warming_clusters)

Clusters, active

Number of currently active (warmed) clusters.

Dependent item envoy.cluster_manager.active_clusters

Preprocessing

  • Prometheus pattern: VALUE(envoy_cluster_manager_active_clusters)

Clusters, added rate

Total clusters added (either via static config or CDS) per second.

Dependent item envoy.cluster_manager.cluster_added.rate

Preprocessing

  • Prometheus pattern: VALUE(envoy_cluster_manager_cluster_added)

  • Change per second
Clusters, modified rate

Total clusters modified (via CDS) per second.

Dependent item envoy.cluster_manager.cluster_modified.rate

Preprocessing

  • Prometheus pattern: VALUE(envoy_cluster_manager_cluster_modified)

  • Change per second
Clusters, removed rate

Total clusters removed (via CDS) per second.

Dependent item envoy.cluster_manager.cluster_removed.rate

Preprocessing

  • Prometheus pattern: VALUE(envoy_cluster_manager_cluster_removed)

  • Change per second
Clusters, updates rate

Total cluster updates per second.

Dependent item envoy.cluster_manager.cluster_updated.rate

Preprocessing

  • Prometheus pattern: VALUE(envoy_cluster_manager_cluster_updated)

  • Change per second
Listeners, active

Number of currently active listeners.

Dependent item envoy.listener_manager.total_listeners_active

Preprocessing

  • Prometheus pattern: SUM(envoy_listener_manager_total_listeners_active)

Listeners, draining

Number of currently draining listeners.

Dependent item envoy.listener_manager.total_listeners_draining

Preprocessing

  • Prometheus pattern: SUM(envoy_listener_manager_total_listeners_draining)

Listener, warming

Number of currently warming listeners.

Dependent item envoy.listener_manager.total_listeners_warming

Preprocessing

  • Prometheus pattern: SUM(envoy_listener_manager_total_listeners_warming)

Listener manager, initialized

A boolean (1 if started and 0 otherwise) that indicates whether listeners have been initialized on workers.

Dependent item envoy.listener_manager.workers_started

Preprocessing

  • Prometheus pattern: VALUE(envoy_listener_manager_workers_started)

  • Discard unchanged with heartbeat: 3h

Listeners, create failure

Total failed listener object additions to workers per second.

Dependent item envoy.listener_manager.listener_create_failure.rate

Preprocessing

  • Prometheus pattern: VALUE(envoy_listener_manager_listener_create_failure)

  • Change per second
Listeners, create success

Total listener objects successfully added to workers per second.

Dependent item envoy.listener_manager.listener_create_success.rate

Preprocessing

  • Prometheus pattern: VALUE(envoy_listener_manager_listener_create_success)

  • Change per second
Listeners, added

Total listeners added (either via static config or LDS) per second.

Dependent item envoy.listener_manager.listener_added.rate

Preprocessing

  • Prometheus pattern: VALUE(envoy_listener_manager_listener_added)

  • Change per second
Listeners, stopped

Total listeners stopped per second.

Dependent item envoy.listener_manager.listener_stopped.rate

Preprocessing

  • Prometheus pattern: VALUE(envoy_listener_manager_listener_stopped)

  • Change per second
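
The preprocessing steps listed for the items above can be approximated outside Zabbix for debugging. The following is a minimal sketch, not Zabbix's implementation: `value` mimics a `VALUE(...)` Prometheus pattern, `total` mimics `SUM(...)`, and `change_per_second` mimics the "Change per second" step. The sample payload is made up, the regular expression handles only simple label blocks, and the numeric state codes are an assumption that follows the order given in the Server state description (with 0 meaning Live, which matches the trigger `last(...) > 0`).

```python
import re

# Illustrative payload in the Prometheus text format; the values are made up.
SAMPLE = """\
# TYPE envoy_server_state gauge
envoy_server_state{} 0
# TYPE envoy_server_uptime gauge
envoy_server_uptime{} 4242
# TYPE envoy_listener_manager_total_listeners_active gauge
envoy_listener_manager_total_listeners_active{} 2
"""

# Metric name, optional {labels} block, then the sample value.
LINE_RE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)")

# Assumed mapping of envoy_server_state codes to the names used above.
SERVER_STATES = {0: "Live", 1: "Draining", 2: "Pre initializing", 3: "Initializing"}


def samples(text, metric):
    """Return the values of every sample line whose name equals `metric`."""
    values = []
    for line in text.splitlines():
        match = LINE_RE.match(line)  # comment lines start with '#', so they never match
        if match and match.group(1) == metric:
            values.append(float(match.group(2)))
    return values


def value(text, metric):
    """Like VALUE(metric): expect exactly one matching sample."""
    found = samples(text, metric)
    if len(found) != 1:
        raise ValueError(f"expected one sample of {metric}, got {len(found)}")
    return found[0]


def total(text, metric):
    """Like SUM(metric): add up all matching samples."""
    return sum(samples(text, metric))


def change_per_second(prev_value, prev_clock, current, clock):
    """(current - prev_value) / (clock - prev_clock); None if the counter went backwards."""
    if clock <= prev_clock or current < prev_value:
        return None
    return (current - prev_value) / (clock - prev_clock)
```

For example, `SERVER_STATES[int(value(SAMPLE, "envoy_server_state"))]` gives the state name for the sample payload, and two consecutive readings of a counter such as `envoy_filesystem_write_completed` can be passed to `change_per_second` to get the stored rate.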

Triggers

Name | Description | Expression | Severity | Dependencies and additional info
Server state is not live | | last(/Envoy Proxy by HTTP/envoy.server.state) > 0 | Average |
Service has been restarted | Uptime is less than 10 minutes. | last(/Envoy Proxy by HTTP/envoy.server.uptime)<10m | Info | Manual close: Yes
Failed to fetch metrics data | Zabbix has not received data for items for the last 10 minutes. | nodata(/Envoy Proxy by HTTP/envoy.server.uptime,10m)=1 | Warning | Manual close: Yes
SSL certificate expires soon | Please check certificate. Less than {$ENVOY.CERT.MIN} days left until the next certificate being managed will expire. | last(/Envoy Proxy by HTTP/envoy.server.days_until_first_cert_expiring)<{$ENVOY.CERT.MIN} | Warning |
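
To make the trigger conditions above easier to reason about, here is a rough Python restatement of the four expressions. It is only a sketch: `evaluate_triggers` and its parameter names are hypothetical, `seconds_since_last_value` stands in for what `nodata(...,10m)` tracks, and real Zabbix evaluation (timing, manual close, recovery) is more involved.

```python
def evaluate_triggers(
    state,
    uptime_seconds,
    cert_days_left,
    seconds_since_last_value,
    cert_min_days=7,  # default of {$ENVOY.CERT.MIN}
):
    """Return the names of the triggers that would be in a problem state."""
    problems = []
    if state > 0:  # last(envoy.server.state) > 0
        problems.append("Server state is not live")
    if uptime_seconds < 10 * 60:  # last(envoy.server.uptime) < 10m
        problems.append("Service has been restarted")
    if seconds_since_last_value > 10 * 60:  # nodata(envoy.server.uptime,10m) = 1
        problems.append("Failed to fetch metrics data")
    if cert_days_left < cert_min_days:  # last(days_until_first_cert_expiring) < macro
        problems.append("SSL certificate expires soon")
    return problems


if __name__ == "__main__":
    # A draining server that restarted two minutes ago with a nearly expired certificate.
    print(evaluate_triggers(1, 120, 3, 5))
```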

LLD rule Cluster metrics discovery

Name Description Type Key and additional info
Cluster metrics discovery Dependent item envoy.lld.cluster

Preprocessing

  • Prometheus to JSON: envoy_cluster_membership_total

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 3h
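
The JavaScript step of this rule is not reproduced here, so the following is only a guess at the general shape of such a discovery pipeline, not the template's actual code. It converts matching Prometheus samples to rows with a `labels` dictionary (loosely modelled on what a "Prometheus to JSON" step yields) and then emits one LLD object per distinct cluster. The label name `envoy_cluster_name`, the sample payload, and the helper names are assumptions.

```python
import json
import re

# Illustrative payload; cluster names and values are made up.
SAMPLE = """\
# TYPE envoy_cluster_membership_total gauge
envoy_cluster_membership_total{envoy_cluster_name="service_a"} 3
envoy_cluster_membership_total{envoy_cluster_name="service_b"} 1
envoy_cluster_membership_healthy{envoy_cluster_name="service_a"} 2
"""

SAMPLE_RE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)\{([^}]*)\}\s+(\S+)$")
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def prometheus_to_rows(text, metric):
    """Collect samples of `metric` as dicts with name, value and labels."""
    rows = []
    for line in text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match and match.group(1) == metric:
            rows.append(
                {
                    "name": match.group(1),
                    "value": match.group(3),
                    "labels": dict(LABEL_RE.findall(match.group(2))),
                }
            )
    return rows


def discover(rows, label, macro):
    """Build LLD objects such as {"{#CLUSTER_NAME}": "service_a"}, without duplicates."""
    seen = []
    for row in rows:
        name = row["labels"].get(label)
        if name is not None and name not in seen:
            seen.append(name)
    return [{macro: name} for name in seen]


if __name__ == "__main__":
    rows = prometheus_to_rows(SAMPLE, "envoy_cluster_membership_total")
    print(json.dumps(discover(rows, "envoy_cluster_name", "{#CLUSTER_NAME}"), indent=2))
```

Each discovered object would then drive one set of the item prototypes, with the macro substituted into the item name and key.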

Item prototypes for Cluster metrics discovery

Name Description Type Key and additional info
Cluster ["{#CLUSTER_NAME}"]: Membership, total

Current cluster membership total.

Dependent item envoy.cluster.membership_total["{#CLUSTER_NAME}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

Cluster ["{#CLUSTER_NAME}"]: Membership, healthy

Current cluster healthy total (inclusive of both health checking and outlier detection).

Dependent item envoy.cluster.membership_healthy["{#CLUSTER_NAME}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

Cluster ["{#CLUSTER_NAME}"]: Membership, unhealthy

Current cluster unhealthy total.

Calculated envoy.cluster.membership_unhealthy["{#CLUSTER_NAME}"]
Cluster ["{#CLUSTER_NAME}"]: Membership, degraded

Current cluster degraded total.

Dependent item envoy.cluster.membership_degraded["{#CLUSTER_NAME}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

Cluster ["{#CLUSTER_NAME}"]: Connections, total

Current cluster total connections.

Dependent item envoy.cluster.upstream_cx_total["{#CLUSTER_NAME}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

Cluster ["{#CLUSTER_NAME}"]: Connections, active

Current cluster total active connections.

Dependent item envoy.cluster.upstream_cx_active["{#CLUSTER_NAME}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

Cluster ["{#CLUSTER_NAME}"]: Requests total, rate

Current cluster request total per second.

Dependent item envoy.cluster.upstream_rq_total.rate["{#CLUSTER_NAME}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

  • Change per second
Cluster ["{#CLUSTER_NAME}"]: Requests timeout, rate

Current cluster requests that timed out waiting for a response per second.

Dependent item envoy.cluster.upstream_rq_timeout.rate["{#CLUSTER_NAME}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

  • Change per second
Cluster ["{#CLUSTER_NAME}"]: Requests completed, rate

Total upstream requests completed per second.

Dependent item envoy.cluster.upstream_rq_completed.rate["{#CLUSTER_NAME}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

  • Change per second
Cluster ["{#CLUSTER_NAME}"]: Requests 2xx, rate

Aggregate 2xx HTTP response codes per second.

Dependent item envoy.cluster.upstream_rq_2x.rate["{#CLUSTER_NAME}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

  • Change per second
Cluster ["{#CLUSTER_NAME}"]: Requests 3xx, rate

Aggregate 3xx HTTP response codes per second.

Dependent item envoy.cluster.upstream_rq_3x.rate["{#CLUSTER_NAME}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

  • Change per second
Cluster ["{#CLUSTER_NAME}"]: Requests 4xx, rate

Aggregate 4xx HTTP response codes per second.

Dependent item envoy.cluster.upstream_rq_4x.rate["{#CLUSTER_NAME}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

  • Change per second
Cluster ["{#CLUSTER_NAME}"]: Requests 5xx, rate

Aggregate 5xx HTTP response codes per second.

Dependent item envoy.cluster.upstream_rq_5x.rate["{#CLUSTER_NAME}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

  • Change per second
Cluster ["{#CLUSTER_NAME}"]: Requests pending

Total active requests pending a connection pool connection.

Dependent item envoy.cluster.upstream_rq_pending_active["{#CLUSTER_NAME}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

Cluster ["{#CLUSTER_NAME}"]: Requests active

Total active requests.

Dependent item envoy.cluster.upstream_rq_active["{#CLUSTER_NAME}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

Cluster ["{#CLUSTER_NAME}"]: Upstream bytes out, rate

Total sent connection bytes per second.

Dependent item envoy.cluster.upstream_cx_tx_bytes_total.rate["{#CLUSTER_NAME}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

  • Change per second
Cluster ["{#CLUSTER_NAME}"]: Upstream bytes in, rate

Total received connection bytes per second.

Dependent item envoy.cluster.upstream_cx_rx_bytes_total.rate["{#CLUSTER_NAME}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

  • Change per second

Trigger prototypes for Cluster metrics discovery

Name | Description | Expression | Severity | Dependencies and additional info
There are unhealthy clusters | | last(/Envoy Proxy by HTTP/envoy.cluster.membership_unhealthy["{#CLUSTER_NAME}"]) > 0 | Average |
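
The "Membership, unhealthy" item is a calculated item, and its formula is not shown in this document. A plausible reading, stated here as an assumption, is that it subtracts the healthy count from the total count for each cluster; the trigger prototype then fires when the result is above zero. The function names below are hypothetical.

```python
def membership_unhealthy(membership_total, membership_healthy):
    """Assumed formula: members that are in the cluster but not healthy."""
    return membership_total - membership_healthy


def clusters_with_problems(clusters):
    """Return names of clusters where the unhealthy count is above zero.

    `clusters` maps a cluster name to a (total, healthy) pair.
    """
    return [
        name
        for name, (total, healthy) in clusters.items()
        if membership_unhealthy(total, healthy) > 0
    ]


if __name__ == "__main__":
    # Made-up readings: service_a has one unhealthy member.
    print(clusters_with_problems({"service_a": (3, 2), "service_b": (1, 1)}))
```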

LLD rule Listeners metrics discovery

Name Description Type Key and additional info
Listeners metrics discovery Dependent item envoy.lld.listeners

Preprocessing

  • Prometheus to JSON: envoy_listener_downstream_cx_active

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 3h

Item prototypes for Listeners metrics discovery

Name Description Type Key and additional info
Listener ["{#LISTENER_ADDRESS}"]: Connections, active

Total active connections.

Dependent item envoy.listener.downstream_cx_active["{#LISTENER_ADDRESS}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

Listener ["{#LISTENER_ADDRESS}"]: Connections, rate

Total connections per second.

Dependent item envoy.listener.downstream_cx_total.rate["{#LISTENER_ADDRESS}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

  • Change per second
Listener ["{#LISTENER_ADDRESS}"]: Sockets, undergoing

Sockets currently undergoing listener filter processing.

Dependent item envoy.listener.downstream_pre_cx_active["{#LISTENER_ADDRESS}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

LLD rule HTTP metrics discovery

Name Description Type Key and additional info
HTTP metrics discovery Dependent item envoy.lld.http

Preprocessing

  • Prometheus to JSON: envoy_http_downstream_rq_total

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 3h

Item prototypes for HTTP metrics discovery

Name Description Type Key and additional info
HTTP ["{#CONN_MANAGER}"]: Requests, rate

Total requests per second.

Dependent item envoy.http.downstream_rq_total.rate["{#CONN_MANAGER}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

  • Change per second
HTTP ["{#CONN_MANAGER}"]: Requests, active

Total active requests.

Dependent item envoy.http.downstream_rq_active["{#CONN_MANAGER}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

HTTP ["{#CONN_MANAGER}"]: Requests timeout, rate

Total requests closed due to a timeout on the request path per second.

Dependent item envoy.http.downstream_rq_timeout["{#CONN_MANAGER}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

  • Change per second
HTTP ["{#CONN_MANAGER}"]: Connections, rate

Total connections per second.

Dependent item envoy.http.downstream_cx_total["{#CONN_MANAGER}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

  • Change per second
HTTP ["{#CONN_MANAGER}"]: Connections, active

Total active connections.

Dependent item envoy.http.downstream_cx_active["{#CONN_MANAGER}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

HTTP ["{#CONN_MANAGER}"]: Bytes in, rate

Total bytes received per second.

Dependent item envoy.http.downstream_cx_rx_bytes_total.rate["{#CONN_MANAGER}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

  • Change per second
HTTP ["{#CONN_MANAGER}"]: Bytes out, rate

Total bytes sent per second.

Dependent item envoy.http.downstream_cx_tx_bytes_tota.rate["{#CONN_MANAGER}"]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

  • Change per second
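
The Prometheus patterns of the item prototypes are too long to show above, but they presumably select the sample whose label matches the discovered macro, for example something like `envoy_http_downstream_rq_active{envoy_http_conn_manager_prefix="{#CONN_MANAGER}"}`. That pattern and the label name are assumptions, not copied from the template. The sketch below shows the same idea in Python with a hypothetical helper and made-up data.

```python
import re

# Illustrative payload; connection manager prefixes and values are made up.
SAMPLE = """\
envoy_http_downstream_rq_active{envoy_http_conn_manager_prefix="admin"} 1
envoy_http_downstream_rq_active{envoy_http_conn_manager_prefix="ingress_http"} 7
"""

LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def value_for_label(text, metric, label, wanted):
    """Return the first sample of `metric` whose `label` equals `wanted`, else None."""
    line_re = re.compile(r"^" + re.escape(metric) + r"\{([^}]*)\}\s+(\S+)$")
    for line in text.splitlines():
        match = line_re.match(line.strip())
        if match and dict(LABEL_RE.findall(match.group(1))).get(label) == wanted:
            return float(match.group(2))
    return None


if __name__ == "__main__":
    print(
        value_for_label(
            SAMPLE,
            "envoy_http_downstream_rq_active",
            "envoy_http_conn_manager_prefix",
            "ingress_http",
        )
    )
```

Rate items would apply a per-second change calculation to two consecutive values obtained this way.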

Feedback

Please report any issues with the template at https://support.zabbix.com

You can also provide feedback, discuss the template, or ask for help on the Zabbix forums.
