TiDB by HTTP
Overview
This template monitors the TiDB server of a TiDB cluster with Zabbix and works without any external scripts. Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.
The template TiDB by HTTP collects metrics with the HTTP agent from the TiDB /metrics endpoint and from the monitoring API.
See https://docs.pingcap.com/tidb/stable/tidb-monitoring-api.
Requirements
Zabbix version: 7.0 and higher.
Tested versions
This template has been tested on:
- TiDB cluster 4.0.10, 6.5.1
Configuration
Zabbix should be configured according to the instructions in the Templates out of the box section.
Setup
This template works with the TiDB server of a TiDB cluster. Internal service metrics are collected from the TiDB /metrics endpoint and from the monitoring API (see https://docs.pingcap.com/tidb/stable/tidb-monitoring-api). Set the macros {$TIDB.URL} and {$TIDB.PORT} to match your TiDB instance. The Macros section lists the macros used to set trigger values.
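
As a quick pre-flight check before linking the template, the two endpoints can be probed by hand. The sketch below is a minimal Python example and is not part of the template. It assumes that {$TIDB.URL} holds a host name (as the default localhost suggests), that the scheme is plain http, and that the monitoring API status path is /status, as described on the PingCAP page referenced in this section. The function names endpoint_urls and probe are hypothetical.

```python
import json
import urllib.request


def endpoint_urls(host="localhost", port=10080, scheme="http"):
    """Build the two URLs the HTTP agent items are assumed to poll.

    host and port mirror the {$TIDB.URL} and {$TIDB.PORT} macro defaults;
    the scheme and the /status path are assumptions of this sketch.
    """
    base = f"{scheme}://{host}:{port}"
    return {"metrics": f"{base}/metrics", "status": f"{base}/status"}


def probe(host="localhost", port=10080, timeout=3.0):
    """Fetch both endpoints once and return (status_document, metrics_line_count)."""
    urls = endpoint_urls(host, port)
    # The status endpoint is expected to return a small JSON document.
    with urllib.request.urlopen(urls["status"], timeout=timeout) as resp:
        status = json.loads(resp.read().decode("utf-8"))
    # The metrics endpoint is expected to return Prometheus text format.
    with urllib.request.urlopen(urls["metrics"], timeout=timeout) as resp:
        metrics_lines = resp.read().decode("utf-8").splitlines()
    return status, len(metrics_lines)
```

Calling probe("localhost", 10080) against a running TiDB server should return the parsed status document and a non-zero line count. A connection error usually means that the host or port macro needs adjusting.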
Macros used
Name | Description | Default |
---|---|---|
{$TIDB.PORT} | The port of TiDB server metrics web endpoint | 10080 |
{$TIDB.URL} | TiDB server URL | localhost |
{$TIDB.OPEN.FDS.MAX.WARN} | Maximum percentage of used file descriptors | 90 |
{$TIDB.HEAP.USAGE.MAX.WARN} | Maximum heap memory used | 10G |
{$TIDB.DDL.WAITING.MAX.WARN} | Maximum number of DDL tasks that are waiting | 5 |
{$TIDB.TIME_JUMP_BACK.MAX.WARN} | Maximum number of times that the operating system rewinds every second | 1 |
{$TIDB.SCHEMA_LEASE_ERRORS.MAX.WARN} | Maximum number of schema lease errors | 0 |
{$TIDB.SCHEMA_LOAD_ERRORS.MAX.WARN} | Maximum number of load schema errors | 1 |
{$TIDB.GC_ACTIONS.ERRORS.MAX.WARN} | Maximum number of GC-related operations failures | 1 |
{$TIDB.REGION_ERROR.MAX.WARN} | Maximum number of region related errors | 50 |
{$TIDB.MONITOR_KEEP_ALIVE.MAX.WARN} | Minimum number of keep alive operations | 10 |
Items
Name | Description | Type | Key and additional info |
---|---|---|---|
Get instance metrics | Get TiDB instance metrics. | HTTP agent | tidb.get_metrics |
Get instance status | Get TiDB instance status info. | HTTP agent | tidb.get_status |
Status | Status of the TiDB instance. | Dependent item | tidb.status |
Get total server query metrics | Get information about server queries. | Dependent item | tidb.server_query.get_metrics |
Total "error" server query, rate | The number of queries per second on the TiDB instance whose command execution failed. | Dependent item | tidb.server_query.error.rate |
Total "ok" server query, rate | The number of queries per second on the TiDB instance whose command execution succeeded. | Dependent item | tidb.server_query.ok.rate |
Total server query, rate | The number of queries per second on the TiDB instance. | Dependent item | tidb.server_query.rate |
Get SQL statements metrics | Get SQL statements metrics. | Dependent item | tidb.statement_total.get_metrics |
SQL statements, rate | The total number of SQL statements executed per second. | Dependent item | tidb.statement_total.rate |
Failed Query, rate | The number of errors per second that occurred when executing SQL statements (such as syntax errors and primary key conflicts). | Dependent item | tidb.execute_error.rate |
Get TiKV client metrics | Get TiKV client metrics. | Dependent item | tidb.tikvclient.get_metrics |
KV commands, rate | The number of executed KV commands per second. | Dependent item | tidb.tikvclient_txn.rate |
PD TSO commands, rate | The number of TSO commands that TiDB obtains from PD per second. | Dependent item | tidb.pd_tso_cmd.rate |
PD TSO requests, rate | The number of TSO requests that TiDB obtains from PD per second. | Dependent item | tidb.pd_tso_request.rate |
TiClient region errors, rate | The number of region-related errors returned by TiKV per second. | Dependent item | tidb.tikvclient_region_err.rate |
Lock resolves, rate | The number of TiDB operations that resolve locks per second. When a TiDB read or write request encounters a lock, it tries to resolve the lock. | Dependent item | tidb.tikvclient_lock_resolver_action.rate |
DDL waiting jobs | The number of DDL tasks that are waiting. | Dependent item | tidb.ddl_waiting_jobs |
Load schema total, rate | The statistics of the schemas that TiDB obtains from TiKV per second. | Dependent item | tidb.domain_load_schema.rate |
Load schema failed, rate | The total number of failures to reload the latest schema information in TiDB per second. | Dependent item | tidb.domain_load_schema.failed.rate |
Schema lease "outdate" errors, rate | The number of schema lease errors per second. An "outdate" error means that the schema cannot be updated, which is a more serious error and triggers an alert. | Dependent item | tidb.session_schema_lease_error.outdate.rate |
Schema lease "change" errors, rate | The number of schema lease errors per second. A "change" error means that the schema has changed. | Dependent item | tidb.session_schema_lease_error.change.rate |
KV backoff, rate | The number of errors returned by TiKV. | Dependent item | tidb.tikvclient_backoff.rate |
Keep alive, rate | The number of times that the metrics are refreshed on the TiDB instance per minute. | Dependent item | tidb.monitor_keep_alive.rate |
Server connections | The number of connections on the current TiDB instance. | Dependent item | tidb.tidb_server_connections |
Heap memory usage | Number of heap bytes that are in use. | Dependent item | tidb.heap_bytes |
RSS memory usage | Resident memory size in bytes. | Dependent item | tidb.rss_bytes |
Goroutine count | The number of goroutines on the TiDB instance. | Dependent item | tidb.goroutines |
Open file descriptors | Number of open file descriptors. | Dependent item | tidb.process_open_fds |
Open file descriptors, max | Maximum number of open file descriptors. | Dependent item | tidb.process_max_fds |
CPU | Total user and system CPU usage ratio. | Dependent item | tidb.cpu.util |
Uptime | The runtime of each TiDB instance. | Dependent item | tidb.uptime |
Version | Version of the TiDB instance. | Dependent item | tidb.version |
Time jump back, rate | The number of times that the operating system rewinds every second. | Dependent item | tidb.monitor_time_jump_back.rate |
Server critical error, rate | The number of critical errors that occurred in TiDB per second. | Dependent item | tidb.tidb_server_critical_error_total.rate |
Server panic, rate | The number of panics that occurred in TiDB per second. | Dependent item | tidb.tidb_server_panic_total.rate |
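
Most of the dependent items in this table are per-second rates derived from counters in the /metrics output. The sketch below illustrates that derivation in Python: it parses Prometheus text format, sums a counter over matching labels, and computes the change per second between two scrapes. It is an illustration only. The template's real preprocessing steps are not listed in this document and may differ, and the metric name tidb_server_query_total with result and type labels is an assumption, as are the function names.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Parse Prometheus text format into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def counter_total(samples, name, **labels):
    """Sum every sample of `name` whose labels include all given pairs."""
    return sum(
        value
        for sample_name, sample_labels, value in samples
        if sample_name == name
        and all(sample_labels.get(k) == v for k, v in labels.items())
    )


def change_per_second(prev_value, prev_ts, cur_value, cur_ts):
    """Rate between two scrapes; None when the counter was reset."""
    if cur_ts <= prev_ts or cur_value < prev_value:
        return None  # a smaller value usually means the process restarted
    return (cur_value - prev_value) / (cur_ts - prev_ts)
```

For example, if the summed "OK" counter is 120 in one scrape and 240 in the next scrape 60 seconds later, change_per_second(120, 0, 240, 60) gives 2.0 queries per second. Returning None on a decreasing counter avoids reporting a negative rate after a TiDB restart.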
Triggers
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
TiDB: Instance is not responding | | last(/TiDB by HTTP/tidb.status)=0 | Average | |
TiDB: Too many region related errors | | min(/TiDB by HTTP/tidb.tikvclient_region_err.rate,5m)>{$TIDB.REGION_ERROR.MAX.WARN} | Average | |
TiDB: Too many DDL waiting jobs | | min(/TiDB by HTTP/tidb.ddl_waiting_jobs,5m)>{$TIDB.DDL.WAITING.MAX.WARN} | Warning | |
TiDB: Too many schema lease errors | | min(/TiDB by HTTP/tidb.domain_load_schema.failed.rate,5m)>{$TIDB.SCHEMA_LOAD_ERRORS.MAX.WARN} | Average | |
TiDB: Too many schema lease errors | The latest schema information is not reloaded in TiDB within one lease. | min(/TiDB by HTTP/tidb.session_schema_lease_error.outdate.rate,5m)>{$TIDB.SCHEMA_LEASE_ERRORS.MAX.WARN} | Average | |
TiDB: Too few keep alive operations | Indicates whether the TiDB process still exists. If tidb_monitor_keep_alive_total increases by less than 10 per minute, the TiDB process might have already exited and an alert is triggered. | max(/TiDB by HTTP/tidb.monitor_keep_alive.rate,5m)<{$TIDB.MONITOR_KEEP_ALIVE.MAX.WARN} | Average | |
TiDB: Heap memory usage is too high | | min(/TiDB by HTTP/tidb.heap_bytes,5m)>{$TIDB.HEAP.USAGE.MAX.WARN} | Warning | |
TiDB: Current number of open files is too high | Heavy file descriptor usage (i.e., near the process's file descriptor limit) indicates a potential file descriptor exhaustion issue. | min(/TiDB by HTTP/tidb.process_open_fds,5m)/last(/TiDB by HTTP/tidb.process_max_fds)*100>{$TIDB.OPEN.FDS.MAX.WARN} | Warning | |
TiDB: has been restarted | Uptime is less than 10 minutes. | last(/TiDB by HTTP/tidb.uptime)<10m | Info | Manual close: Yes |
TiDB: Version has changed | TiDB version has changed. Acknowledge to close the problem manually. | last(/TiDB by HTTP/tidb.version,#1)<>last(/TiDB by HTTP/tidb.version,#2) and length(last(/TiDB by HTTP/tidb.version))>0 | Info | Manual close: Yes |
TiDB: Too many time jump backs | | min(/TiDB by HTTP/tidb.monitor_time_jump_back.rate,5m)>{$TIDB.TIME_JUMP_BACK.MAX.WARN} | Warning | |
TiDB: There are panicked TiDB threads | When a panic occurs, an alert is triggered. The thread is often recovered; otherwise, TiDB will restart frequently. | last(/TiDB by HTTP/tidb.tidb_server_panic_total.rate)>0 | Average | |
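
Most "too many" or "too high" expressions in this table apply min() over a 5-minute window, so a problem is raised only when every value in the window exceeds the threshold, which filters out short spikes. The keep-alive trigger uses max() with a lower bound for the same reason. The following Python sketch mirrors the arithmetic of two of the expressions. It is not how Zabbix evaluates triggers internally; the function names are hypothetical and the default thresholds are copied from the macro defaults.

```python
def fd_usage_problem(open_fds_5m, max_fds, warn_percent=90.0):
    """Mirror of min(open_fds,5m)/last(max_fds)*100 > {$TIDB.OPEN.FDS.MAX.WARN}.

    open_fds_5m: values collected during the last 5 minutes.
    max_fds: latest value of the descriptor limit.
    """
    if not open_fds_5m or max_fds <= 0:
        return False  # no data: this sketch simply reports no problem
    return min(open_fds_5m) / max_fds * 100 > warn_percent


def keep_alive_problem(rates_5m, min_expected=10.0):
    """Mirror of max(keep_alive.rate,5m) < {$TIDB.MONITOR_KEEP_ALIVE.MAX.WARN}."""
    if not rates_5m:
        return False
    # Even the best value in the window is below the expected minimum.
    return max(rates_5m) < min_expected
```

With a limit of 1024 descriptors, the window [950, 960, 940] is a problem because the lowest value is still about 91.8 percent of the limit, while [950, 900, 960] is not, because one value dropped below 90 percent.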
LLD rule QPS metrics discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
QPS metrics discovery | Discovers QPS-specific metrics. | Dependent item | tidb.qps.discovery |
Item prototypes for QPS metrics discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Get QPS metrics: {#TYPE} | Get QPS metrics of {#TYPE}. | Dependent item | tidb.qps.get_metrics[{#TYPE}] |
Server query "OK": {#TYPE}, rate | The number of queries per second on the TiDB instance whose command execution succeeded. | Dependent item | tidb.server_query.ok.rate[{#TYPE}] |
Server query "Error": {#TYPE}, rate | The number of queries per second on the TiDB instance whose command execution failed. | Dependent item | tidb.server_query.error.rate[{#TYPE}] |
LLD rule Statement metrics discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Statement metrics discovery | Discovers statement-specific metrics. | Dependent item | tidb.statement.discover |
Item prototypes for Statement metrics discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
SQL statements: {#TYPE}, rate | The number of SQL statements executed per second. | Dependent item | tidb.statement.rate[{#TYPE}] |
LLD rule KV metrics discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
KV metrics discovery | Discovers KV-specific metrics. | Dependent item | tidb.kv_ops.discovery |
Item prototypes for KV metrics discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
KV Commands: {#TYPE}, rate | The number of executed KV commands per second. | Dependent item | tidb.tikvclient_txn.rate[{#TYPE}] |
LLD rule Lock resolves discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Lock resolves discovery | Discovers metrics specific to lock resolves. | Dependent item | tidb.tikvclient_lock_resolver_action.discovery |
Item prototypes for Lock resolves discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Lock resolves: {#TYPE}, rate | The number of TiDB operations that resolve locks per second. When a TiDB read or write request encounters a lock, it tries to resolve the lock. | Dependent item | tidb.tikvclient_lock_resolver_action.rate[{#TYPE}] |
LLD rule KV backoff discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
KV backoff discovery | Discovers metrics specific to KV backoff. | Dependent item | tidb.tikvclient_backoff.discovery |
Item prototypes for KV backoff discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
KV backoff: {#TYPE}, rate | The number of errors of type {#TYPE} returned by TiKV. | Dependent item | tidb.tikvclient_backoff.rate[{#TYPE}] |
LLD rule GC action results discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
GC action results discovery | Discovers GC action results metrics. | Dependent item | tidb.tikvclient_gc_action.discovery |
Item prototypes for GC action results discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
GC action result: {#TYPE}, rate | The number of results of GC-related operations per second. | Dependent item | tidb.tikvclient_gc_action.rate[{#TYPE}] |
Trigger prototypes for GC action results discovery
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
TiDB: Too many failed GC-related operations | | min(/TiDB by HTTP/tidb.tikvclient_gc_action.rate[{#TYPE}],5m)>{$TIDB.GC_ACTIONS.ERRORS.MAX.WARN} | Warning | |
Feedback
Please report any issues with the template at https://support.zabbix.com
You can also provide feedback, discuss the template, or ask for help on the Zabbix forums.