Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/db/cockroachdb_http?at=release/7.0
CockroachDB by HTTP
Overview
This template monitors CockroachDB nodes with Zabbix and works without any external scripts. Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.
The template collects metrics with the HTTP agent from the Prometheus endpoint and the health endpoints.
Internal node metrics are collected from the Prometheus /_status/vars endpoint. Node health metrics are collected from the /health and /health?ready=1 endpoints. The template does not require a session token.
Note that some metrics may not be collected, depending on your CockroachDB version and configuration.
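To illustrate the bulk collection idea, the following is a minimal Python sketch of how one raw payload from /_status/vars could be parsed once and then queried for individual values, much like dependent items do. It is not the template's actual preprocessing. The function names and the sample metric names are illustrative assumptions, and the parser assumes sample lines carry no trailing timestamp.

```python
from typing import Dict, Optional

# Illustrative sample payload; the metric names are assumptions for this sketch.
SAMPLE = """\
# HELP sys_uptime Process uptime
# TYPE sys_uptime gauge
sys_uptime 3600
capacity_available{store="1"} 1000
"""


def parse_prometheus_text(payload: str) -> Dict[str, float]:
    """Parse Prometheus text exposition lines into {series: value}."""
    samples: Dict[str, float] = {}
    for line in payload.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Assumption: the value is the last space-separated token (no timestamp).
        series, _, value = line.rpartition(" ")
        try:
            samples[series] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return samples


def get_metric(samples: Dict[str, float], series: str) -> Optional[float]:
    """Return one value, or None if this CockroachDB build does not expose it."""
    return samples.get(series)
```

Returning None for an absent series mirrors the note above: a metric missing in a given CockroachDB version simply yields no value instead of failing the whole collection.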
Requirements
Zabbix version: 7.0 and higher.
Tested versions
This template has been tested on:
- CockroachDB 21.2.8
Configuration
Zabbix should be configured according to the instructions in the Templates out of the box section.
Setup
Set the hostname or IP address of the CockroachDB node host in the {$COCKROACHDB.API.HOST} macro. You can also change the port in the {$COCKROACHDB.API.PORT} macro and the scheme in the {$COCKROACHDB.API.SCHEME} macro if necessary.
Also, see the Macros section for a list of macros used to set trigger values.
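As a rough illustration of how the three connection macros combine into request URLs, here is a small Python sketch. The helper name `endpoint_urls` and the dictionary layout are hypothetical. Only the macro names, their defaults (http, 8080) and the endpoint paths come from this document.

```python
from typing import Dict, Optional
from urllib.parse import urlunsplit

# Defaults as listed in the Macros section of this document.
DEFAULT_MACROS = {
    "{$COCKROACHDB.API.SCHEME}": "http",
    "{$COCKROACHDB.API.PORT}": "8080",
}


def endpoint_urls(host: str, macros: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Build the URLs of the endpoints the template polls (hypothetical helper)."""
    merged = {**DEFAULT_MACROS, **(macros or {})}
    scheme = merged["{$COCKROACHDB.API.SCHEME}"]
    netloc = f"{host}:{merged['{$COCKROACHDB.API.PORT}']}"
    # (path, query) pairs for each endpoint named in the Overview.
    endpoints = {
        "metrics": ("/_status/vars", ""),
        "health": ("/health", ""),
        "readiness": ("/health", "ready=1"),
    }
    return {
        name: urlunsplit((scheme, netloc, path, query, ""))
        for name, (path, query) in endpoints.items()
    }
```

For example, `endpoint_urls("db1.example.com")` uses the defaults, while passing `{"{$COCKROACHDB.API.SCHEME}": "https"}` switches every URL to HTTPS.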
Macros used
Name | Description | Default |
---|---|---|
{$COCKROACHDB.API.HOST} | The hostname or IP address of the CockroachDB host. | <SET COCKROACHDB HOST> |
{$COCKROACHDB.API.PORT} | The port of CockroachDB API and Prometheus endpoint. | 8080 |
{$COCKROACHDB.API.SCHEME} | Request scheme which may be http or https. | http |
{$COCKROACHDB.STORE.USED.MIN.WARN} | The warning threshold of the available disk space in percent. | 20 |
{$COCKROACHDB.STORE.USED.MIN.CRIT} | The critical threshold of the available disk space in percent. | 10 |
{$COCKROACHDB.OPEN.FDS.MAX.WARN} | Maximum percentage of used file descriptors. | 80 |
{$COCKROACHDB.CERT.NODE.EXPIRY.WARN} | Number of days until the node certificate expires. | 30 |
{$COCKROACHDB.CERT.CA.EXPIRY.WARN} | Number of days until the CA certificate expires. | 90 |
{$COCKROACHDB.CLOCK.OFFSET.MAX.WARN} | Maximum clock offset of the node against the rest of the cluster in milliseconds for trigger expression. | 300 |
{$COCKROACHDB.STATEMENTS.ERRORS.MAX.WARN} | Maximum number of SQL statements errors for trigger expression. | 2 |
Items
Name | Description | Type | Key and additional info |
---|---|---|---|
Get metrics | Get raw metrics from the Prometheus endpoint. | HTTP agent | cockroachdb.get_metrics |
Get health | Get node /health endpoint | HTTP agent | cockroachdb.get_health |
Get readiness | Get node /health?ready=1 endpoint | HTTP agent | cockroachdb.get_readiness |
Service ping | Check if HTTP/HTTPS service accepts TCP connections. | Simple check | net.tcp.service["{$COCKROACHDB.API.SCHEME}","{$COCKROACHDB.API.HOST}","{$COCKROACHDB.API.PORT}"] |
Clock offset | Mean clock offset of the node against the rest of the cluster. | Dependent item | cockroachdb.clock.offset |
Version | Build information. | Dependent item | cockroachdb.version |
CPU: System time | System CPU time. | Dependent item | cockroachdb.cpu.system_time |
CPU: User time | User CPU time. | Dependent item | cockroachdb.cpu.user_time |
CPU: Utilization | The CPU utilization expressed in %. | Dependent item | cockroachdb.cpu.util |
Disk: IOPS in progress, rate | Number of disk IO operations currently in progress on this host. | Dependent item | cockroachdb.disk.iops.in_progress.rate |
Disk: Reads, rate | Bytes read from all disks per second since this process started | Dependent item | cockroachdb.disk.read.rate |
Disk: Read IOPS, rate | Number of disk read operations per second across all disks since this process started. | Dependent item | cockroachdb.disk.iops.read.rate |
Disk: Writes, rate | Bytes written to all disks per second since this process started. | Dependent item | cockroachdb.disk.write.rate |
Disk: Write IOPS, rate | Disk write operations per second across all disks since this process started. | Dependent item | cockroachdb.disk.iops.write.rate |
File descriptors: Limit | Open file descriptors soft limit of the process. | Dependent item | cockroachdb.descriptors.limit |
File descriptors: Open | The number of open file descriptors. | Dependent item | cockroachdb.descriptors.open |
GC: Pause time | The amount of processor time used by Go's garbage collector across all nodes. During garbage collection, application code execution is paused. | Dependent item | cockroachdb.gc.pause_time |
GC: Runs, rate | The number of times that Go's garbage collector was invoked per second across all nodes. | Dependent item | cockroachdb.gc.runs.rate |
Go: Goroutines count | Current number of Goroutines. This count should rise and fall based on load. | Dependent item | cockroachdb.go.goroutines.count |
KV transactions: Aborted, rate | Number of aborted KV transactions per second. | Dependent item | cockroachdb.kv.transactions.aborted.rate |
KV transactions: Committed, rate | Number of KV transactions (including 1PC) committed per second. | Dependent item | cockroachdb.kv.transactions.committed.rate |
Live nodes count | The number of live nodes in the cluster (will be 0 if this node is not itself live). | Dependent item | cockroachdb.live_count |
Liveness heartbeats, rate | Number of successful node liveness heartbeats per second from this node. | Dependent item | cockroachdb.heartbeaths.success.rate |
Memory: Allocated by Cgo | Current bytes of memory allocated by the C layer. | Dependent item | cockroachdb.memory.cgo.allocated |
Memory: Allocated by Go | Current bytes of memory allocated by the Go layer. | Dependent item | cockroachdb.memory.go.allocated |
Memory: Managed by Cgo | Total bytes of memory managed by the C layer. | Dependent item | cockroachdb.memory.cgo.managed |
Memory: Managed by Go | Total bytes of memory managed by the Go layer. | Dependent item | cockroachdb.memory.go.managed |
Memory: Total usage | Resident set size (RSS) of memory in use by the node. | Dependent item | cockroachdb.memory.total |
Network: Bytes received, rate | Bytes received per second on all network interfaces since this process started. | Dependent item | cockroachdb.network.bytes.received.rate |
Network: Bytes sent, rate | Bytes sent per second on all network interfaces since this process started. | Dependent item | cockroachdb.network.bytes.sent.rate |
Time series: Sample errors, rate | The number of errors encountered while attempting to write metrics to disk, per second. | Dependent item | cockroachdb.ts.samples.errors.rate |
Time series: Samples written, rate | The number of successfully written metric samples per second. | Dependent item | cockroachdb.ts.samples.written.rate |
Slow requests: DistSender RPCs | Number of RPCs stuck or retrying for a long time. | Dependent item | cockroachdb.slow_requests.rpc |
SQL: Bytes received, rate | Total amount of incoming SQL client network traffic in bytes per second. | Dependent item | cockroachdb.sql.bytes.received.rate |
SQL: Bytes sent, rate | Total amount of outgoing SQL client network traffic in bytes per second. | Dependent item | cockroachdb.sql.bytes.sent.rate |
Memory: Allocated by SQL | Current SQL statement memory usage for root. | Dependent item | cockroachdb.memory.sql |
SQL: Schema changes, rate | Total number of SQL DDL statements successfully executed per second. | Dependent item | cockroachdb.sql.schema_changes.rate |
SQL sessions: Open | Total number of open SQL sessions. | Dependent item | cockroachdb.sql.sessions |
SQL statements: Active | Total number of SQL statements currently active. | Dependent item | cockroachdb.sql.statements.active |
SQL statements: DELETE, rate | A moving average of the number of DELETE statements successfully executed per second. | Dependent item | cockroachdb.sql.statements.delete.rate |
SQL statements: Executed, rate | Number of SQL queries executed per second. | Dependent item | cockroachdb.sql.statements.executed.rate |
SQL statements: Denials, rate | The number of statements denied per second by a feature flag. | Dependent item | cockroachdb.sql.statements.denials.rate |
SQL statements: Active flows distributed, rate | The number of distributed SQL flows currently active per second. | Dependent item | cockroachdb.sql.statements.flows.active.rate |
SQL statements: INSERT, rate | A moving average of the number of INSERT statements successfully executed per second. | Dependent item | cockroachdb.sql.statements.insert.rate |
SQL statements: SELECT, rate | A moving average of the number of SELECT statements successfully executed per second. | Dependent item | cockroachdb.sql.statements.select.rate |
SQL statements: UPDATE, rate | A moving average of the number of UPDATE statements successfully executed per second. | Dependent item | cockroachdb.sql.statements.update.rate |
SQL statements: Contention, rate | Total number of SQL statements that experienced contention per second. | Dependent item | cockroachdb.sql.statements.contention.rate |
SQL statements: Errors, rate | Total number of statements which returned a planning or runtime error per second. | Dependent item | cockroachdb.sql.statements.errors.rate |
SQL transactions: Open | Total number of currently open SQL transactions. | Dependent item | cockroachdb.sql.transactions.open |
SQL transactions: Aborted, rate | Total number of SQL transaction abort errors per second. | Dependent item | cockroachdb.sql.transactions.aborted.rate |
SQL transactions: Committed, rate | Total number of SQL transaction COMMIT statements successfully executed per second. | Dependent item | cockroachdb.sql.transactions.committed.rate |
SQL transactions: Initiated, rate | Total number of SQL transaction BEGIN statements successfully executed per second. | Dependent item | cockroachdb.sql.transactions.initiated.rate |
SQL transactions: Rolled back, rate | Total number of SQL transaction ROLLBACK statements successfully executed per second. | Dependent item | cockroachdb.sql.transactions.rollbacks.rate |
Uptime | Process uptime. | Dependent item | cockroachdb.uptime |
Node certificate expiration date | Node certificate expires at that date. | Dependent item | cockroachdb.cert.expire_date.node |
CA certificate expiration date | CA certificate expires at that date. | Dependent item | cockroachdb.cert.expire_date.ca |
Triggers
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
Node is unhealthy | Node's /health endpoint has returned HTTP 500 Internal Server Error which indicates unhealthy mode. | last(/CockroachDB by HTTP/cockroachdb.get_health) = 500 | Average | Depends on: |
Node is not ready | Node's /health?ready=1 endpoint has returned HTTP 503 Service Unavailable. Possible reasons: | last(/CockroachDB by HTTP/cockroachdb.get_readiness) = 503 and last(/CockroachDB by HTTP/cockroachdb.uptime) > 5m | Average | Depends on: |
Service is down | | last(/CockroachDB by HTTP/net.tcp.service["{$COCKROACHDB.API.SCHEME}","{$COCKROACHDB.API.HOST}","{$COCKROACHDB.API.PORT}"]) = 0 | Average | |
Clock offset is too high | Cockroach-measured clock offset is nearing limit (by default, servers kill themselves at 400ms from the mean). | min(/CockroachDB by HTTP/cockroachdb.clock.offset,5m) > {$COCKROACHDB.CLOCK.OFFSET.MAX.WARN} * 0.001 | Warning | |
Version has changed | | last(/CockroachDB by HTTP/cockroachdb.version) <> last(/CockroachDB by HTTP/cockroachdb.version,#2) and length(last(/CockroachDB by HTTP/cockroachdb.version)) > 0 | Info | |
Current number of open files is too high | Getting close to open file descriptor limit. | min(/CockroachDB by HTTP/cockroachdb.descriptors.open,10m) / last(/CockroachDB by HTTP/cockroachdb.descriptors.limit) * 100 > {$COCKROACHDB.OPEN.FDS.MAX.WARN} | Warning | |
Node is not executing SQL | Node is not executing SQL despite having connections. | last(/CockroachDB by HTTP/cockroachdb.sql.sessions) > 0 and last(/CockroachDB by HTTP/cockroachdb.sql.statements.executed.rate) = 0 | Warning | |
SQL statements errors rate is too high | | min(/CockroachDB by HTTP/cockroachdb.sql.statements.errors.rate,5m) > {$COCKROACHDB.STATEMENTS.ERRORS.MAX.WARN} | Warning | |
Node has been restarted | Uptime is less than 10 minutes. | last(/CockroachDB by HTTP/cockroachdb.uptime) < 10m | Info | |
Failed to fetch node data | Zabbix has not received data for items for the last 5 minutes. | nodata(/CockroachDB by HTTP/cockroachdb.uptime,5m) = 1 | Warning | Depends on: |
Node certificate expires soon | Node certificate expires soon. | (last(/CockroachDB by HTTP/cockroachdb.cert.expire_date.node) - now()) / 86400 < {$COCKROACHDB.CERT.NODE.EXPIRY.WARN} | Warning | |
CA certificate expires soon | CA certificate expires soon. | (last(/CockroachDB by HTTP/cockroachdb.cert.expire_date.ca) - now()) / 86400 < {$COCKROACHDB.CERT.CA.EXPIRY.WARN} | Warning | |
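The arithmetic in several trigger expressions above can be hard to read at a glance, so here is a hedged Python sketch that restates four of them as plain functions. The function names and argument shapes are illustrative and not part of Zabbix. Each list argument stands for the samples Zabbix would see in the evaluation window, and the defaults are taken from the Macros section.

```python
import time
from typing import Optional, Sequence


def clock_offset_too_high(offsets_s: Sequence[float], max_warn_ms: float = 300) -> bool:
    """min(offset,5m) > macro * 0.001: the macro is in ms, the item in seconds.

    Using min() means every sample in the window must exceed the threshold.
    """
    return min(offsets_s) > max_warn_ms * 0.001


def open_fds_too_high(open_samples: Sequence[float], limit: float, max_warn_pct: float = 80) -> bool:
    """min(open,10m) / last(limit) * 100 > macro."""
    return min(open_samples) / limit * 100 > max_warn_pct


def cert_expires_soon(expire_ts: float, warn_days: float, now: Optional[float] = None) -> bool:
    """(last(expire_date) - now()) / 86400 < macro, with timestamps in Unix seconds."""
    now = time.time() if now is None else now
    return (expire_ts - now) / 86400 < warn_days


def node_not_ready(status_code: int, uptime_s: float) -> bool:
    """Readiness returned 503 and the node has been up for more than 5 minutes."""
    return status_code == 503 and uptime_s > 5 * 60
```

The uptime condition in `node_not_ready` shows why a freshly restarted node does not raise the readiness problem during its first five minutes.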
LLD rule Storage metrics discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Storage metrics discovery | Discover per store metrics. | Dependent item | cockroachdb.store.discovery |
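As an illustration of what per-store discovery could look like, the sketch below derives a Zabbix low-level discovery payload (a JSON array of objects keyed by the {#STORE} macro) from raw Prometheus text. This document does not show how the template actually builds the list, so the approach of reading a `store` label, the function name, and the sample metric names are all assumptions.

```python
import json
import re

# Assumption: per-store series carry a store="<id>" label.
STORE_LABEL = re.compile(r'\bstore="([^"]*)"')


def discover_stores(payload: str) -> str:
    """Return LLD JSON such as [{"{#STORE}": "1"}] for every distinct store label."""
    stores = []
    for line in payload.splitlines():
        # Comment lines (# HELP / # TYPE) never carry sample labels.
        if line.startswith("#"):
            continue
        match = STORE_LABEL.search(line)
        if match and match.group(1) not in stores:
            stores.append(match.group(1))  # keep first-seen order, no duplicates
    return json.dumps([{"{#STORE}": store} for store in stores])
```

Each object in the resulting array would produce one set of item and trigger prototypes, with {#STORE} substituted into names and keys.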
Item prototypes for Storage metrics discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Storage [{#STORE}]: Bytes: Live | Number of logical bytes stored in live key-value pairs on this node. Live data excludes historical and deleted data. | Dependent item | cockroachdb.storage.bytes.[{#STORE},live] |
Storage [{#STORE}]: Bytes: System | Number of physical bytes stored in system key-value pairs. | Dependent item | cockroachdb.storage.bytes.[{#STORE},system] |
Storage [{#STORE}]: Capacity available | Available storage capacity. | Dependent item | cockroachdb.storage.capacity.[{#STORE},available] |
Storage [{#STORE}]: Capacity total | Total storage capacity. This value may be explicitly set using --store. If a store size has not been set, this metric displays the actual disk capacity. | Dependent item | cockroachdb.storage.capacity.[{#STORE},total] |
Storage [{#STORE}]: Capacity used | Disk space in use by CockroachDB data on this node. This excludes the Cockroach binary, operating system, and other system files. | Dependent item | cockroachdb.storage.capacity.[{#STORE},used] |
Storage [{#STORE}]: Capacity available in % | Available storage capacity in %. | Calculated | cockroachdb.storage.capacity.[{#STORE},available_percent] |
Storage [{#STORE}]: Replication: Lease holders | Number of lease holders. | Dependent item | cockroachdb.replication.[{#STORE},lease_holders] |
Storage [{#STORE}]: Bytes: Logical | Number of logical bytes stored in key-value pairs on this node. This includes historical and deleted data. | Dependent item | cockroachdb.storage.bytes.[{#STORE},logical] |
Storage [{#STORE}]: Rebalancing: Average queries, rate | Number of kv-level requests received per second by the store, averaged over a large time period as used in rebalancing decisions. | Dependent item | cockroachdb.rebalancing.queries.average.[{#STORE},rate] |
Storage [{#STORE}]: Rebalancing: Average writes, rate | Number of keys written (i.e. applied by raft) per second to the store, averaged over a large time period as used in rebalancing decisions. | Dependent item | cockroachdb.rebalancing.writes.average.[{#STORE},rate] |
Storage [{#STORE}]: Queue processing failures: Consistency, rate | Number of replicas which failed processing in the consistency checker queue per second. | Dependent item | cockroachdb.queue.processing_failures.consistency.[{#STORE},rate] |
Storage [{#STORE}]: Queue processing failures: GC, rate | Number of replicas which failed processing in the GC queue per second. | Dependent item | cockroachdb.queue.processing_failures.gc.[{#STORE},rate] |
Storage [{#STORE}]: Queue processing failures: Raft log, rate | Number of replicas which failed processing in the Raft log queue per second. | Dependent item | cockroachdb.queue.processing_failures.raftlog.[{#STORE},rate] |
Storage [{#STORE}]: Queue processing failures: Raft snapshot, rate | Number of replicas which failed processing in the Raft repair queue per second. | Dependent item | cockroachdb.queue.processing_failures.raftsnapshot.[{#STORE},rate] |
Storage [{#STORE}]: Queue processing failures: Replica GC, rate | Number of replicas which failed processing in the replica GC queue per second. | Dependent item | cockroachdb.queue.processing_failures.gc_replica.[{#STORE},rate] |
Storage [{#STORE}]: Queue processing failures: Replicate, rate | Number of replicas which failed processing in the replicate queue per second. | Dependent item | cockroachdb.queue.processing_failures.replicate.[{#STORE},rate] |
Storage [{#STORE}]: Queue processing failures: Split, rate | Number of replicas which failed processing in the split queue per second. | Dependent item | cockroachdb.queue.processing_failures.split.[{#STORE},rate] |
Storage [{#STORE}]: Queue processing failures: Time series maintenance, rate | Number of replicas which failed processing in the time series maintenance queue per second. | Dependent item | cockroachdb.queue.processing_failures.tsmaintenance.[{#STORE},rate] |
Storage [{#STORE}]: Ranges count | Number of ranges. | Dependent item | cockroachdb.ranges.[{#STORE},count] |
Storage [{#STORE}]: Ranges unavailable | Number of ranges with fewer live replicas than needed for quorum. | Dependent item | cockroachdb.ranges.[{#STORE},unavailable] |
Storage [{#STORE}]: Ranges underreplicated | Number of ranges with fewer live replicas than the replication target. | Dependent item | cockroachdb.ranges.[{#STORE},underreplicated] |
Storage [{#STORE}]: RocksDB read amplification | The average number of real read operations executed per logical read operation. | Dependent item | cockroachdb.rocksdb.[{#STORE},read_amp] |
Storage [{#STORE}]: RocksDB cache hits, rate | Count of block cache hits per second. | Dependent item | cockroachdb.rocksdb.cache.hits.[{#STORE},rate] |
Storage [{#STORE}]: RocksDB cache misses, rate | Count of block cache misses per second. | Dependent item | cockroachdb.rocksdb.cache.misses.[{#STORE},rate] |
Storage [{#STORE}]: RocksDB cache hit ratio | Block cache hit ratio in %. | Calculated | cockroachdb.rocksdb.cache.[{#STORE},hit_ratio] |
Storage [{#STORE}]: Replication: Replicas | Number of replicas. | Dependent item | cockroachdb.replication.replicas.[{#STORE},count] |
Storage [{#STORE}]: Replication: Replicas quiesced | Number of quiesced replicas. | Dependent item | cockroachdb.replication.replicas.[{#STORE},quiesced] |
Storage [{#STORE}]: Slow requests: Latch acquisitions | Number of requests that have been stuck for a long time acquiring latches. | Dependent item | cockroachdb.slow_requests.[{#STORE},latch_acquisitions] |
Storage [{#STORE}]: Slow requests: Lease acquisitions | Number of requests that have been stuck for a long time acquiring a lease. | Dependent item | cockroachdb.slow_requests.[{#STORE},lease_acquisitions] |
Storage [{#STORE}]: Slow requests: Raft proposals | Number of requests that have been stuck for a long time in raft. | Dependent item | cockroachdb.slow_requests.[{#STORE},raft_proposals] |
Storage [{#STORE}]: RocksDB SSTables | The number of SSTables in use. | Dependent item | cockroachdb.rocksdb.[{#STORE},sstables] |
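The two Calculated item prototypes above are derived from other per-store items, but their formulas are not shown in this document. The Python sketch below gives one plausible reading, assuming "available in %" is available over total capacity and "hit ratio" is hits over hits plus misses. Both formulas and both function names are assumptions, not the template's confirmed definitions.

```python
from typing import Optional


def available_percent(available_bytes: float, total_bytes: float) -> Optional[float]:
    """Assumed formula: available capacity as a percentage of total capacity."""
    if total_bytes <= 0:
        return None  # nothing sensible to report without a known capacity
    return available_bytes / total_bytes * 100


def cache_hit_ratio(hits_rate: float, misses_rate: float) -> Optional[float]:
    """Assumed formula: block cache hits as a percentage of all cache lookups."""
    lookups = hits_rate + misses_rate
    if lookups == 0:
        return None  # no lookups in the interval, so the ratio is undefined
    return hits_rate / lookups * 100
```

Returning None for a zero denominator is a choice made for this sketch. It avoids reporting a misleading 0% when the inputs are simply missing or idle.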
Trigger prototypes for Storage metrics discovery
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
Storage [{#STORE}]: Available storage capacity is low | Storage is running low on free space (less than {$COCKROACHDB.STORE.USED.MIN.WARN}% available). | max(/CockroachDB by HTTP/cockroachdb.storage.capacity.[{#STORE},available_percent],5m) < {$COCKROACHDB.STORE.USED.MIN.WARN} | Warning | Depends on: |
Storage [{#STORE}]: Available storage capacity is critically low | Storage is running critically low on free space (less than {$COCKROACHDB.STORE.USED.MIN.CRIT}% available). | max(/CockroachDB by HTTP/cockroachdb.storage.capacity.[{#STORE},available_percent],5m) < {$COCKROACHDB.STORE.USED.MIN.CRIT} | Average | |
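Both trigger prototypes compare the maximum of the available-percent samples over 5 minutes against a threshold, so a problem is raised only when even the best sample in the window is below the limit. The Python sketch below restates that logic with the default thresholds from the Macros section. The function name is hypothetical. The dependency target of the warning trigger is not shown in this document, so letting the critical result take precedence over the warning is an assumption.

```python
from typing import Optional, Sequence


def storage_severity(avail_pct_window: Sequence[float],
                     warn: float = 20, crit: float = 10) -> Optional[str]:
    """Return the severity that would be shown for one store, or None if healthy."""
    best = max(avail_pct_window)  # max(...,5m): the highest sample in the window
    if best < crit:
        # Assumption: the warning depends on this trigger and is suppressed.
        return "Average"
    if best < warn:
        return "Warning"
    return None
```

For instance, a window of 18, 15 and 19 percent yields "Warning", while a single sample of 25 percent in the window keeps the store in the healthy state.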
Feedback
Please report any issues with the template at https://support.zabbix.com.
You can also provide feedback, discuss the template, or ask for help on the Zabbix forums.