TiDB

TiDB is an open-source NewSQL database that supports Hybrid Transactional and Analytical Processing workloads. It is MySQL compatible and can provide horizontal scalability, strong consistency, and high availability. It is developed and supported primarily by PingCAP, Inc. and licensed under Apache 2.0.

This template is for Zabbix version: 6.0
Also available for: 5.4

Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/db/tidb_http/tidb_tidb_http?at=release/6.0

TiDB by HTTP

Overview

For Zabbix version: 6.0 and higher
This template monitors the TiDB server of a TiDB cluster with Zabbix and works without any external scripts. Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.

The template TiDB by HTTP collects metrics by HTTP agent from the TiDB /metrics endpoint and from the monitoring API. See https://docs.pingcap.com/tidb/stable/tidb-monitoring-api.
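To illustrate the bulk collection idea, here is a minimal sketch of how one Prometheus text response can be turned into a list of {name, value, labels} objects that dependent items then filter. It is a simplified stand-in written for this page, not Zabbix's actual PROMETHEUS_TO_JSON implementation; the function name `parsePrometheus` and the sample values are hypothetical.

```javascript
// Simplified sketch (assumption): parse Prometheus text exposition lines
// into objects shaped like {name, value, labels}. Escapes inside label
// values are not unescaped, and timestamps are ignored.
function parsePrometheus(text) {
  var result = [];
  var lines = text.split("\n");
  for (var i = 0; i < lines.length; i++) {
    var line = lines[i].trim();
    // Skip blank lines and "# HELP" / "# TYPE" comments.
    if (line === "" || line.charAt(0) === "#") continue;
    // Metric name, optional {labels}, then the sample value.
    var m = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)/.exec(line);
    if (!m) continue;
    var labels = {};
    var labelRe = /([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"/g;
    var lm;
    while ((lm = labelRe.exec(m[2] || "")) !== null) {
      labels[lm[1]] = lm[2];
    }
    // The value is kept as a string, as in the JSON that items filter.
    result.push({ name: m[1], value: m[3], labels: labels });
  }
  return result;
}

// Hypothetical sample of a /metrics response.
var sample = [
  "# TYPE tidb_server_query_total counter",
  'tidb_server_query_total{result="OK",type="Query"} 120',
  'tidb_server_query_total{result="Error",type="Query"} 3',
  "go_goroutines 42"
].join("\n");

var metrics = parsePrometheus(sample);
```

One HTTP request feeds every dependent item this way, so adding more items does not add more load on the TiDB server.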

This template was tested on:

  • TiDB cluster, version 4.0.10

Setup

See Zabbix template operation for basic instructions.

This template works with the TiDB server of a TiDB cluster. Internal service metrics are collected from the TiDB /metrics endpoint and from the monitoring API. See https://docs.pingcap.com/tidb/stable/tidb-monitoring-api. Don't forget to change the macros {$TIDB.URL} and {$TIDB.PORT}. Also, see the Macros section for a list of macros used to set trigger values.
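As a rough sketch of how the two macros combine into request URLs, the helper below builds the metrics and status endpoints from a host and a port. The helper name `buildEndpoints`, the "http://" scheme, and the "/status" path are assumptions made for illustration; check the HTTP agent items in the template for the exact URLs it uses.

```javascript
// Hypothetical helper: combine {$TIDB.URL} and {$TIDB.PORT} into endpoint URLs.
// The scheme and the "/status" path are assumptions, not taken from the template.
function buildEndpoints(url, port) {
  var base = "http://" + url + ":" + port;
  return {
    metrics: base + "/metrics", // Prometheus-format instance metrics
    status: base + "/status"    // monitoring API status document
  };
}

// Using the macro defaults listed in the Macros section.
var endpoints = buildEndpoints("localhost", 10080);
```

Requesting both URLs manually from the Zabbix server or proxy host is a quick way to confirm that the macros point at a reachable TiDB instance before linking the template.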

Zabbix configuration

No specific Zabbix configuration is required.

Macros used

Name | Description | Default
{$TIDB.DDL.WAITING.MAX.WARN} | Maximum number of DDL tasks that are waiting | 5
{$TIDB.GC_ACTIONS.ERRORS.MAX.WARN} | Maximum number of GC-related operations failures | 1
{$TIDB.HEAP.USAGE.MAX.WARN} | Maximum heap memory used | 10G
{$TIDB.MONITOR_KEEP_ALIVE.MAX.WARN} | Minimum number of keep alive operations | 10
{$TIDB.OPEN.FDS.MAX.WARN} | Maximum percentage of used file descriptors | 90
{$TIDB.PORT} | The port of TiDB server metrics web endpoint | 10080
{$TIDB.REGION_ERROR.MAX.WARN} | Maximum number of region related errors | 50
{$TIDB.SCHEMA_LEASE_ERRORS.MAX.WARN} | Maximum number of schema lease errors | 0
{$TIDB.SCHEMA_LOAD_ERRORS.MAX.WARN} | Maximum number of load schema errors | 1
{$TIDB.TIME_JUMP_BACK.MAX.WARN} | Maximum number of times that the operating system rewinds every second | 1
{$TIDB.URL} | TiDB server URL | localhost

Template links

There are no template links in this template.

Discovery rules

Name Description Type Key and additional info
GC action results discovery

Discovers GC action result metrics.

DEPENDENT tidb.tikvclient_gc_action.discovery

Preprocessing:

- JSONPATH: $[?(@.name=="tidb_tikvclient_gc_action_result")]

- JAVASCRIPT: The text is too long. Please see the template.

- DISCARD_UNCHANGED_HEARTBEAT: 1h

Overrides:

Failed GC-related operations trigger
- {#TYPE} MATCHES_REGEX failed
- TRIGGER_PROTOTYPE LIKE Too many failed GC-related operations - DISCOVER

KV backoff discovery

Discovers KV backoff specific metrics.

DEPENDENT tidb.tikvclient_backoff.discovery

Preprocessing:

- JSONPATH: $[?(@.name=="tidb_tikvclient_backoff_total")]

- JAVASCRIPT: The text is too long. Please see the template.

- DISCARD_UNCHANGED_HEARTBEAT: 1h

KV metrics discovery

Discovers KV specific metrics.

DEPENDENT tidb.kv_ops.discovery

Preprocessing:

- JSONPATH: $[?(@.name=="tidb_tikvclient_txn_cmd_duration_seconds_count")]

- JAVASCRIPT: The text is too long. Please see the template.

- DISCARD_UNCHANGED_HEARTBEAT: 1h

Lock resolves discovery

Discovers lock resolve specific metrics.

DEPENDENT tidb.tikvclient_lock_resolver_action.discovery

Preprocessing:

- JSONPATH: $[?(@.name=="tidb_tikvclient_lock_resolver_actions_total")]

- JAVASCRIPT: The text is too long. Please see the template.

- DISCARD_UNCHANGED_HEARTBEAT: 1h

QPS metrics discovery

Discovers QPS specific metrics.

DEPENDENT tidb.qps.discovery

Preprocessing:

- JSONPATH: $[?(@.name=="tidb_server_query_total")]

- JAVASCRIPT: The text is too long. Please see the template.

- DISCARD_UNCHANGED_HEARTBEAT: 1h

Statement metrics discovery

Discovers statement specific metrics.

DEPENDENT tidb.statement.discover

Preprocessing:

- JSONPATH: $[?(@.name=="tidb_executor_statement_total")]

- JAVASCRIPT: The text is too long. Please see the template.

- DISCARD_UNCHANGED_HEARTBEAT: 1h
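The JAVASCRIPT step of each discovery rule is not reproduced on this page. As an illustration only, a step of this kind could collapse the samples selected by the preceding JSONPATH step into one low-level discovery row per distinct `type` label. The function name `discoverTypes` and the sample input are hypothetical; the real script in the template may differ.

```javascript
// Sketch (assumption): turn filtered metric samples into LLD rows,
// one row per distinct "type" label, exposed as the {#TYPE} macro.
function discoverTypes(value) {
  var metrics = JSON.parse(value);
  var seen = {};
  var rows = [];
  metrics.forEach(function (m) {
    var type = m.labels ? m.labels.type : undefined;
    if (type !== undefined && !Object.prototype.hasOwnProperty.call(seen, type)) {
      seen[type] = true;
      rows.push({ "{#TYPE}": type });
    }
  });
  // Preprocessing steps pass values along as strings.
  return JSON.stringify(rows);
}

// Hypothetical input: what a JSONPATH step might hand over for tidb_server_query_total.
var filtered = JSON.stringify([
  { name: "tidb_server_query_total", value: "120", labels: { result: "OK", type: "Query" } },
  { name: "tidb_server_query_total", value: "3", labels: { result: "Error", type: "Query" } },
  { name: "tidb_server_query_total", value: "20", labels: { result: "OK", type: "Ping" } }
]);

var lld = discoverTypes(filtered);
```

Each row then produces one set of item prototypes, such as the `[{#TYPE}]` items listed under Items collected.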

Items collected

Group Name Description Type Key and additional info
TiDB node TiDB: Status

Status of TiDB instance.

DEPENDENT tidb.status

Preprocessing:

- JSONPATH: $.status

⛔️ON_FAIL: CUSTOM_VALUE -> 1

- DISCARD_UNCHANGED_HEARTBEAT: 1h

TiDB node TiDB: Total "error" server query, rate

The number of queries per second on the TiDB instance whose command execution failed.

DEPENDENT tidb.server_query.error.rate

Preprocessing:

- JSONPATH: $[?(@.name == "tidb_server_query_total" && @.labels.result == "Error")].value.sum()

- CHANGE_PER_SECOND

TiDB node TiDB: Total "ok" server query, rate

The number of queries per second on the TiDB instance whose command execution succeeded.

DEPENDENT tidb.server_query.ok.rate

Preprocessing:

- JSONPATH: $[?(@.name == "tidb_server_query_total" && @.labels.result == "OK")].value.sum()

- CHANGE_PER_SECOND

TiDB node TiDB: Total server query, rate

The number of queries per second on TiDB instance.

DEPENDENT tidb.server_query.rate

Preprocessing:

- JSONPATH: $[?(@.name == "tidb_server_query_total")].value.sum()

- CHANGE_PER_SECOND

TiDB node TiDB: SQL statements, rate

The total number of SQL statements executed per second.

DEPENDENT tidb.statement_total.rate

Preprocessing:

- JSONPATH: $[?(@.name=="tidb_executor_statement_total")].value.sum()

- CHANGE_PER_SECOND

TiDB node TiDB: Failed Query, rate

The number of errors per second that occurred when executing SQL statements (such as syntax errors and primary key conflicts).

DEPENDENT tidb.execute_error.rate

Preprocessing:

- JSONPATH: $[?(@.name=="tidb_server_execute_error_total")].value.sum()

⛔️ON_FAIL: DISCARD_VALUE ->

- CHANGE_PER_SECOND

TiDB node TiDB: KV commands, rate

The number of executed KV commands per second.

DEPENDENT tidb.tikvclient_txn.rate

Preprocessing:

- JSONPATH: $[?(@.name=="tidb_tikvclient_txn_cmd_duration_seconds_count")].value.sum()

- CHANGE_PER_SECOND

TiDB node TiDB: PD TSO commands, rate

The number of TSO commands that TiDB obtains from PD per second.

DEPENDENT tidb.pd_tso_cmd.rate

Preprocessing:

- JSONPATH: $[?(@.name=="pd_client_cmd_handle_cmds_duration_seconds_count" && @.labels.type == "tso")].value.first()

- CHANGE_PER_SECOND

TiDB node TiDB: PD TSO requests, rate

The number of TSO requests that TiDB obtains from PD per second.

DEPENDENT tidb.pd_tso_request.rate

Preprocessing:

- JSONPATH: $[?(@.name=="pd_client_request_handle_requests_duration_seconds_count" && @.labels.type == "tso")].value.first()

- CHANGE_PER_SECOND

TiDB node TiDB: TiClient region errors, rate

The number of region related errors returned by TiKV per second.

DEPENDENT tidb.tikvclient_region_err.rate

Preprocessing:

- JSONPATH: $[?(@.name=="tidb_tikvclient_region_err_total")].value.sum()

- CHANGE_PER_SECOND

TiDB node TiDB: Lock resolves, rate

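The number of TiDB operations that resolve locks per second. When TiDB's read or write request encounters a lock, it tries to resolve the lock.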

DEPENDENT tidb.tikvclient_lock_resolver_action.rate

Preprocessing:

- JSONPATH: $[?(@.name=="tidb_tikvclient_lock_resolver_actions_total")].value.sum()

- CHANGE_PER_SECOND

TiDB node TiDB: DDL waiting jobs

The number of DDL tasks that are waiting.

DEPENDENT tidb.ddl_waiting_jobs

Preprocessing:

- JSONPATH: $[?(@.name=="tidb_ddl_waiting_jobs")].value.sum()

TiDB node TiDB: Load schema total, rate

The statistics of the schemas that TiDB obtains from TiKV per second.

DEPENDENT tidb.domain_load_schema.rate

Preprocessing:

- JSONPATH: $[?(@.name=="tidb_domain_load_schema_total")].value.sum()

- CHANGE_PER_SECOND

TiDB node TiDB: Load schema failed, rate

The total number of failures to reload the latest schema information in TiDB per second.

DEPENDENT tidb.domain_load_schema.failed.rate

Preprocessing:

- JSONPATH: $[?(@.name=="tidb_domain_load_schema_total" && @.labels.type == "failed")].value.first()

⛔️ON_FAIL: DISCARD_VALUE ->

- CHANGE_PER_SECOND

TiDB node TiDB: Schema lease "outdate" errors, rate

The number of schema lease errors per second.

"outdate" errors mean that the schema cannot be updated, which is a more serious error and triggers an alert.

DEPENDENT tidb.session_schema_lease_error.outdate.rate

Preprocessing:

- JSONPATH: $[?(@.name=="tidb_session_schema_lease_error_total" && @.labels.type == "outdate")].value.first()

⛔️ON_FAIL: DISCARD_VALUE ->

- CHANGE_PER_SECOND

TiDB node TiDB: Schema lease "change" errors, rate

The number of schema lease errors per second.

"change" means that the schema has changed.

DEPENDENT tidb.session_schema_lease_error.change.rate

Preprocessing:

- JSONPATH: $[?(@.name=="tidb_session_schema_lease_error_total" && @.labels.type == "change")].value.first()

⛔️ON_FAIL: DISCARD_VALUE ->

- CHANGE_PER_SECOND

TiDB node TiDB: KV backoff, rate

The number of errors returned by TiKV.

DEPENDENT tidb.tikvclient_backoff.rate

Preprocessing:

- JSONPATH: $[?(@.name=="tidb_tikvclient_backoff_total")].value.sum()

⛔️ON_FAIL: DISCARD_VALUE ->

- CHANGE_PER_SECOND

TiDB node TiDB: Keep alive, rate

The number of times that the metrics are refreshed on TiDB instance per minute.

DEPENDENT tidb.monitor_keep_alive.rate

Preprocessing:

- JSONPATH: $[?(@.name=="tidb_monitor_keep_alive_total")].value.first()

⛔️ON_FAIL: DISCARD_VALUE ->

- SIMPLE_CHANGE

TiDB node TiDB: Server connections

The number of connections on the current TiDB instance.

DEPENDENT tidb.tidb_server_connections

Preprocessing:

- JSONPATH: $[?(@.name=="tidb_server_connections")].value.first()

TiDB node TiDB: Heap memory usage

Number of heap bytes that are in use.

DEPENDENT tidb.heap_bytes

Preprocessing:

- JSONPATH: $[?(@.name=="go_memstats_heap_inuse_bytes")].value.first()

TiDB node TiDB: RSS memory usage

Resident memory size in bytes.

DEPENDENT tidb.rss_bytes

Preprocessing:

- JSONPATH: $[?(@.name=="process_resident_memory_bytes")].value.first()

TiDB node TiDB: Goroutine count

The number of Goroutines on TiDB instance.

DEPENDENT tidb.goroutines

Preprocessing:

- JSONPATH: $[?(@.name=="go_goroutines")].value.first()

TiDB node TiDB: Open file descriptors

Number of open file descriptors.

DEPENDENT tidb.process_open_fds

Preprocessing:

- JSONPATH: $[?(@.name=="process_open_fds")].value.first()

TiDB node TiDB: Open file descriptors, max

Maximum number of open file descriptors.

DEPENDENT tidb.process_max_fds

Preprocessing:

- JSONPATH: $[?(@.name=="process_max_fds")].value.first()

TiDB node TiDB: CPU

Total user and system CPU usage ratio.

DEPENDENT tidb.cpu.util

Preprocessing:

- JSONPATH: $[?(@.name=="process_cpu_seconds_total")].value.first()

- CHANGE_PER_SECOND

- MULTIPLIER: 100

TiDB node TiDB: Uptime

The runtime of each TiDB instance.

DEPENDENT tidb.uptime

Preprocessing:

- JSONPATH: $[?(@.name=="process_start_time_seconds")].value.first()

- JAVASCRIPT: return (Math.floor(Date.now()/1000)-Number(value)); // use boot time to calculate uptime

TiDB node TiDB: Version

Version of the TiDB instance.

DEPENDENT tidb.version

Preprocessing:

- JSONPATH: $.version

- DISCARD_UNCHANGED_HEARTBEAT: 3h

TiDB node TiDB: Time jump back, rate

The number of times that the operating system rewinds every second.

DEPENDENT tidb.monitor_time_jump_back.rate

Preprocessing:

- JSONPATH: $[?(@.name=="tidb_monitor_time_jump_back_total")].value.first()

- CHANGE_PER_SECOND

TiDB node TiDB: Server critical error, rate

The number of critical errors that occurred in TiDB per second.

DEPENDENT tidb.tidb_server_critical_error_total.rate

Preprocessing:

- JSONPATH: $[?(@.name=="tidb_server_critical_error_total")].value.first()

- CHANGE_PER_SECOND

TiDB node TiDB: Server panic, rate

The number of panics that occurred in TiDB per second.

DEPENDENT tidb.tidb_server_panic_total.rate

Preprocessing:

- JSONPATH: $[?(@.name=="tidb_server_panic_total")].value.first()

⛔️ON_FAIL: DISCARD_VALUE ->

- CHANGE_PER_SECOND

TiDB node TiDB: Server query "OK": {#TYPE}, rate

The number of queries per second on the TiDB instance whose command execution succeeded.

DEPENDENT tidb.server_query.ok.rate[{#TYPE}]

Preprocessing:

- JSONPATH: $[?(@.name == "tidb_server_query_total" && @.labels.result == "OK" && @.labels.type == "{#TYPE}")].value.first()

- CHANGE_PER_SECOND

TiDB node TiDB: Server query "Error": {#TYPE}, rate

The number of queries per second on the TiDB instance whose command execution failed.

DEPENDENT tidb.server_query.error.rate[{#TYPE}]

Preprocessing:

- JSONPATH: $[?(@.name == "tidb_server_query_total" && @.labels.result == "Error" && @.labels.type == "{#TYPE}")].value.first()

- CHANGE_PER_SECOND

TiDB node TiDB: SQL statements: {#TYPE}, rate

The number of SQL statements executed per second.

DEPENDENT tidb.statement.rate[{#TYPE}]

Preprocessing:

- JSONPATH: $[?(@.name=="tidb_executor_statement_total" && @.labels.type == "{#TYPE}")].value.first()

- CHANGE_PER_SECOND

TiDB node TiDB: KV Commands: {#TYPE}, rate

The number of executed KV commands per second.

DEPENDENT tidb.tikvclient_txn.rate[{#TYPE}]

Preprocessing:

- JSONPATH: $[?(@.name=="tidb_tikvclient_txn_cmd_duration_seconds_count" && @.labels.type == "{#TYPE}")].value.first()

- CHANGE_PER_SECOND

TiDB node TiDB: Lock resolves: {#TYPE}, rate

The number of TiDB operations that resolve locks per second. When TiDB's read or write request encounters a lock, it tries to resolve the lock.

DEPENDENT tidb.tikvclient_lock_resolver_action.rate[{#TYPE}]

Preprocessing:

- JSONPATH: $[?(@.name=="tidb_tikvclient_lock_resolver_actions_total" && @.labels.type == "{#TYPE}")].value.first()

- CHANGE_PER_SECOND

TiDB node TiDB: KV backoff: {#TYPE}, rate

The number of errors returned by TiKV.

DEPENDENT tidb.tikvclient_backoff.rate[{#TYPE}]

Preprocessing:

- JSONPATH: $[?(@.name=="tidb_tikvclient_backoff_total" && @.labels.type == "{#TYPE}")].value.first()

- CHANGE_PER_SECOND

TiDB node TiDB: GC action result: {#TYPE}, rate

The number of results of GC-related operations per second.

DEPENDENT tidb.tikvclient_gc_action.rate[{#TYPE}]

Preprocessing:

- JSONPATH: $[?(@.name=="tidb_tikvclient_gc_action_result" && @.labels.type == "{#TYPE}")].value.first()

- CHANGE_PER_SECOND

Zabbix raw items TiDB: Get instance metrics

Get TiDB instance metrics.

HTTP_AGENT tidb.get_metrics

Preprocessing:

- CHECK_NOT_SUPPORTED

⛔️ON_FAIL: DISCARD_VALUE ->

- PROMETHEUS_TO_JSON

Zabbix raw items TiDB: Get instance status

Get TiDB instance status info.

HTTP_AGENT tidb.get_status

Preprocessing:

- CHECK_NOT_SUPPORTED

⛔️ON_FAIL: CUSTOM_VALUE -> {"status": "0"}
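Most items above follow the same pattern: a JSONPATH step selects samples by metric name and labels and sums their values, then CHANGE_PER_SECOND converts the growing counter into a rate. The sketch below mimics that pattern in plain JavaScript. The function names and sample numbers are hypothetical, and the rate formula (difference in value divided by difference in time) is a simplified model of the preprocessing step.

```javascript
// Sketch (assumption): emulate $[?(@.name == N && @.labels.K == V)].value.sum()
function sumMetric(metrics, name, labels) {
  return metrics
    .filter(function (m) {
      if (m.name !== name) return false;
      for (var key in labels) {
        if (!m.labels || m.labels[key] !== labels[key]) return false;
      }
      return true;
    })
    .reduce(function (total, m) { return total + Number(m.value); }, 0);
}

// Simplified model of CHANGE_PER_SECOND between two collections.
function changePerSecond(prevValue, prevTime, currValue, currTime) {
  return (currValue - prevValue) / (currTime - prevTime);
}

// Two hypothetical collections, taken 60 seconds apart.
var first = [
  { name: "tidb_server_query_total", value: "100", labels: { result: "OK", type: "Query" } },
  { name: "tidb_server_query_total", value: "20", labels: { result: "OK", type: "Ping" } },
  { name: "tidb_server_query_total", value: "3", labels: { result: "Error", type: "Query" } }
];
var second = [
  { name: "tidb_server_query_total", value: "150", labels: { result: "OK", type: "Query" } },
  { name: "tidb_server_query_total", value: "30", labels: { result: "OK", type: "Ping" } },
  { name: "tidb_server_query_total", value: "3", labels: { result: "Error", type: "Query" } }
];

var okBefore = sumMetric(first, "tidb_server_query_total", { result: "OK" });
var okAfter = sumMetric(second, "tidb_server_query_total", { result: "OK" });
var okRate = changePerSecond(okBefore, 0, okAfter, 60);
```

Items that use `.first()` instead of `.sum()` take a single matching sample, which is why the `[{#TYPE}]` prototypes add a `type` label to the filter.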

Triggers

Name Description Expression Severity Dependencies and additional info
TiDB: Instance is not responding

-

last(/TiDB by HTTP/tidb.status)=0 AVERAGE
TiDB: Too many region related errors

-

min(/TiDB by HTTP/tidb.tikvclient_region_err.rate,5m)>{$TIDB.REGION_ERROR.MAX.WARN} AVERAGE
TiDB: Too many DDL waiting jobs

-

min(/TiDB by HTTP/tidb.ddl_waiting_jobs,5m)>{$TIDB.DDL.WAITING.MAX.WARN} WARNING
TiDB: Too many schema lease errors

-

min(/TiDB by HTTP/tidb.domain_load_schema.failed.rate,5m)>{$TIDB.SCHEMA_LOAD_ERRORS.MAX.WARN} AVERAGE
TiDB: Too many schema lease errors

The latest schema information is not reloaded in TiDB within one lease.

min(/TiDB by HTTP/tidb.session_schema_lease_error.outdate.rate,5m)>{$TIDB.SCHEMA_LEASE_ERRORS.MAX.WARN} AVERAGE
TiDB: Too few keep alive operations

Indicates whether the TiDB process still exists. If tidb_monitor_keep_alive_total increases fewer than 10 times per minute, the TiDB process might have already exited, and an alert is triggered.

max(/TiDB by HTTP/tidb.monitor_keep_alive.rate,5m)<{$TIDB.MONITOR_KEEP_ALIVE.MAX.WARN} AVERAGE
TiDB: Heap memory usage is too high

-

min(/TiDB by HTTP/tidb.heap_bytes,5m)>{$TIDB.HEAP.USAGE.MAX.WARN} WARNING
TiDB: Current number of open files is too high

Heavy file descriptor usage (i.e., near the process's file descriptor limit) indicates a potential file descriptor exhaustion issue.

min(/TiDB by HTTP/tidb.process_open_fds,5m)/last(/TiDB by HTTP/tidb.process_max_fds)*100>{$TIDB.OPEN.FDS.MAX.WARN} WARNING
TiDB: has been restarted

Uptime is less than 10 minutes

last(/TiDB by HTTP/tidb.uptime)<10m INFO

Manual close: YES

TiDB: Version has changed

TiDB version has changed. Ack to close.

last(/TiDB by HTTP/tidb.version,#1)<>last(/TiDB by HTTP/tidb.version,#2) and length(last(/TiDB by HTTP/tidb.version))>0 INFO

Manual close: YES

TiDB: Too many time jump backs

-

min(/TiDB by HTTP/tidb.monitor_time_jump_back.rate,5m)>{$TIDB.TIME_JUMP_BACK.MAX.WARN} WARNING
TiDB: There are panicked TiDB threads

When a panic occurs, an alert is triggered. The thread is often recovered; otherwise, TiDB will frequently restart.

last(/TiDB by HTTP/tidb.tidb_server_panic_total.rate)>0 AVERAGE
TiDB: Too many failed GC-related operations

-

min(/TiDB by HTTP/tidb.tikvclient_gc_action.rate[{#TYPE}],5m)>{$TIDB.GC_ACTIONS.ERRORS.MAX.WARN} WARNING
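Several triggers use min(...,5m) so that a single spike does not raise a problem: the condition must hold for every sample in the window. The sketch below models the open file descriptors trigger this way. The function name `fdsTriggerFires` and the sample values are hypothetical, and it only approximates how Zabbix evaluates the expression.

```javascript
// Sketch (assumption): model of
//   min(open_fds,5m) / last(max_fds) * 100 > {$TIDB.OPEN.FDS.MAX.WARN}
function fdsTriggerFires(openFdsSamples, maxFds, warnPercent) {
  // Lowest value seen in the window; the trigger fires only if even this exceeds the limit.
  var minOpen = Math.min.apply(null, openFdsSamples);
  return (minOpen / maxFds) * 100 > warnPercent;
}

// Usage stays above 90% for the whole window: the trigger fires.
var sustained = fdsTriggerFires([950, 960, 940], 1000, 90);

// One sample dips to 88%: the trigger does not fire.
var dipped = fdsTriggerFires([950, 880, 960], 1000, 90);
```

The keep alive trigger applies the same idea in the other direction, using max(...,5m) with a "less than" comparison.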

Feedback

Please report any issues with the template at https://support.zabbix.com

You can also provide feedback, discuss the template or ask for help with it at ZABBIX forums.
