Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/hadoop_http?at=release/7.4
Hadoop by HTTP
Overview
This template monitors Hadoop over HTTP and works without any external scripts. It collects metrics by polling the Hadoop API remotely, using an HTTP agent and JSONPath preprocessing. The Zabbix server (or proxy) executes direct requests to the ResourceManager, NodeManager, NameNode, and DataNode APIs. All metrics are collected at once, thanks to Zabbix bulk data collection.
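For illustration, the collection pattern is easy to reproduce outside Zabbix. Below is a minimal Python sketch, assuming the standard Hadoop /jmx metrics endpoint on the default Web-UI ports (8088 for the ResourceManager, 9870 for the NameNode); the hostnames and bean names reflect a stock Hadoop 3.x deployment and are placeholders, not values taken from the template:

```python
# Minimal sketch of what the template's HTTP agent items do: fetch the bulk
# JMX payload in one request, then pick individual metrics out of it.
import json
from urllib.request import urlopen

def fetch_jmx(host: str, port: int) -> dict:
    """Fetch all JMX beans from a Hadoop daemon's web UI in one request."""
    with urlopen(f"http://{host}:{port}/jmx", timeout=10) as resp:
        return json.load(resp)

def bean(payload: dict, name: str) -> dict:
    """Return the first JMX bean with the given name (empty dict if absent)."""
    return next((b for b in payload["beans"] if b.get("name") == name), {})

rm = fetch_jmx("resourcemanager.example.com", 8088)  # {$HADOOP.RESOURCEMANAGER.*}
nn = fetch_jmx("namenode.example.com", 9870)         # {$HADOOP.NAMENODE.*}

# Roughly the values the template exposes as dependent items:
print(bean(rm, "Hadoop:service=ResourceManager,name=ClusterMetrics").get("NumActiveNMs"))
print(bean(nn, "Hadoop:service=NameNode,name=FSNamesystem").get("MissingBlocks"))
```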
Requirements
Zabbix version: 7.4 and higher.
Tested versions
This template has been tested on:
- Hadoop 3.1 and later
Configuration
Zabbix should be configured according to the instructions in the Templates out of the box section.
Setup
You should define the IP address (or FQDN) and Web-UI port of the ResourceManager in the {$HADOOP.RESOURCEMANAGER.HOST} and {$HADOOP.RESOURCEMANAGER.PORT} macros, and of the NameNode in the {$HADOOP.NAMENODE.HOST} and {$HADOOP.NAMENODE.PORT} macros, respectively. Macros can be set in the template or overridden at the host level.
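If you manage hosts through the Zabbix API rather than the frontend, the same host-level override can be created with the usermacro.create method. A minimal sketch, assuming API token authentication (available since Zabbix 6.4); the URL, token, and host ID are placeholders:

```python
# Minimal sketch: override {$HADOOP.RESOURCEMANAGER.HOST} on one host via the
# Zabbix JSON-RPC API. The URL, token, and hostid below are placeholders.
import json
from urllib.request import Request, urlopen

API_URL = "https://zabbix.example.com/api_jsonrpc.php"
API_TOKEN = "replace-with-a-real-api-token"

payload = {
    "jsonrpc": "2.0",
    "method": "usermacro.create",
    "params": {
        "hostid": "10084",  # ID of the host the template is linked to
        "macro": "{$HADOOP.RESOURCEMANAGER.HOST}",
        "value": "rm1.hadoop.example.com",
    },
    "id": 1,
}
req = Request(
    API_URL,
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json-rpc",
        "Authorization": f"Bearer {API_TOKEN}",
    },
)
with urlopen(req) as resp:
    print(json.load(resp))  # returns the new hostmacroid on success
```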
Macros used
Name | Description | Default |
---|---|---|
{$HADOOP.RESOURCEMANAGER.HOST} | The Hadoop ResourceManager host IP address or FQDN. | ResourceManager |
{$HADOOP.RESOURCEMANAGER.PORT} | The Hadoop ResourceManager Web-UI port. | 8088 |
{$HADOOP.RESOURCEMANAGER.RESPONSE_TIME.MAX.WARN} | The Hadoop ResourceManager API page maximum response time in seconds for trigger expression. | 10s |
{$HADOOP.NAMENODE.HOST} | The Hadoop NameNode host IP address or FQDN. | NameNode |
{$HADOOP.NAMENODE.PORT} | The Hadoop NameNode Web-UI port. | 9870 |
{$HADOOP.NAMENODE.RESPONSE_TIME.MAX.WARN} | The Hadoop NameNode API page maximum response time in seconds for trigger expression. | 10s |
{$HADOOP.CAPACITY_REMAINING.MIN.WARN} | The Hadoop cluster capacity remaining percent for trigger expression. | 20 |
Items
Name | Description | Type | Key and additional info |
---|---|---|---|
ResourceManager: Service status | Hadoop ResourceManager API port availability. | Simple check | net.tcp.service["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"] |
ResourceManager: Service response time | Hadoop ResourceManager API performance. | Simple check | net.tcp.service.perf["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"] |
Get ResourceManager stats | | HTTP agent | hadoop.resourcemanager.get |
ResourceManager: Uptime | | Dependent item | hadoop.resourcemanager.uptime |
ResourceManager: Get info | | Dependent item | hadoop.resourcemanager.info |
ResourceManager: RPC queue & processing time | Average time spent on processing RPC requests. | Dependent item | hadoop.resourcemanager.rpc_processing_time_avg |
ResourceManager: Active NMs | Number of Active NodeManagers. | Dependent item | hadoop.resourcemanager.num_active_nm |
ResourceManager: Decommissioning NMs | Number of Decommissioning NodeManagers. | Dependent item | hadoop.resourcemanager.num_decommissioning_nm |
ResourceManager: Decommissioned NMs | Number of Decommissioned NodeManagers. | Dependent item | hadoop.resourcemanager.num_decommissioned_nm |
ResourceManager: Lost NMs | Number of Lost NodeManagers. | Dependent item | hadoop.resourcemanager.num_lost_nm |
ResourceManager: Unhealthy NMs | Number of Unhealthy NodeManagers. | Dependent item | hadoop.resourcemanager.num_unhealthy_nm |
ResourceManager: Rebooted NMs | Number of Rebooted NodeManagers. | Dependent item | hadoop.resourcemanager.num_rebooted_nm |
ResourceManager: Shutdown NMs | Number of Shutdown NodeManagers. | Dependent item | hadoop.resourcemanager.num_shutdown_nm |
NameNode: Service status | Hadoop NameNode API port availability. | Simple check | net.tcp.service["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"] |
NameNode: Service response time | Hadoop NameNode API performance. | Simple check | net.tcp.service.perf["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"] |
Get NameNode stats | | HTTP agent | hadoop.namenode.get |
NameNode: Uptime | | Dependent item | hadoop.namenode.uptime |
NameNode: Get info | | Dependent item | hadoop.namenode.info |
NameNode: RPC queue & processing time | Average time spent on processing RPC requests. | Dependent item | hadoop.namenode.rpc_processing_time_avg |
NameNode: Block Pool Renaming | | Dependent item | hadoop.namenode.percent_block_pool_used |
NameNode: Transactions since last checkpoint | Total number of transactions since last checkpoint. | Dependent item | hadoop.namenode.transactions_since_last_checkpoint |
NameNode: Percent capacity remaining | Available capacity in percent. | Dependent item | hadoop.namenode.percent_remaining |
NameNode: Capacity remaining | Available capacity. | Dependent item | hadoop.namenode.capacity_remaining |
NameNode: Corrupt blocks | Number of corrupt blocks. | Dependent item | hadoop.namenode.corrupt_blocks |
NameNode: Missing blocks | Number of missing blocks. | Dependent item | hadoop.namenode.missing_blocks |
NameNode: Failed volumes | Number of failed volumes. | Dependent item | hadoop.namenode.volume_failures_total |
NameNode: Alive DataNodes | Count of alive DataNodes. | Dependent item | hadoop.namenode.num_live_data_nodes |
NameNode: Dead DataNodes | Count of dead DataNodes. | Dependent item | hadoop.namenode.num_dead_data_nodes |
NameNode: Stale DataNodes | DataNodes that do not send a heartbeat within 30 seconds are marked as "stale". | Dependent item | hadoop.namenode.num_stale_data_nodes |
NameNode: Total files | Total count of files tracked by the NameNode. | Dependent item | hadoop.namenode.files_total |
NameNode: Total load | The current number of concurrent file accesses (read/write) across all DataNodes. | Dependent item | hadoop.namenode.total_load |
NameNode: Blocks allocable | Maximum number of blocks allocable. | Dependent item | hadoop.namenode.block_capacity |
NameNode: Total blocks | Count of blocks tracked by NameNode. | Dependent item | hadoop.namenode.blocks_total |
NameNode: Under-replicated blocks | The number of blocks with insufficient replication. | Dependent item | hadoop.namenode.under_replicated_blocks |
Get NodeManagers states | | HTTP agent | hadoop.nodemanagers.get |
Get DataNodes states | | HTTP agent | hadoop.datanodes.get |
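Most of the items above are dependent items cut out of the bulk master items ("Get ResourceManager stats", "Get NameNode stats") by JSONPath preprocessing, which is why one HTTP request feeds many metrics. As an illustration, ResourceManager: Active NMs would typically be extracted with a JSONPath along the lines of $.beans[?(@.name=='Hadoop:service=ResourceManager,name=ClusterMetrics')].NumActiveNMs.first(); the exact path is an assumption here, not copied from the template. The Python below performs the equivalent extraction on a trimmed-down payload:

```python
# Equivalent of the JSONPath preprocessing step, applied to a trimmed-down
# example of the ResourceManager JMX payload (structure as in Hadoop 3.x).
sample = {
    "beans": [
        {"name": "Hadoop:service=ResourceManager,name=ClusterMetrics",
         "NumActiveNMs": 3, "NumUnhealthyNMs": 0, "NumLostNMs": 1},
    ]
}

# JSONPath equivalent:
# $.beans[?(@.name=='Hadoop:service=ResourceManager,name=ClusterMetrics')].NumActiveNMs.first()
metrics = next(b for b in sample["beans"]
               if b["name"] == "Hadoop:service=ResourceManager,name=ClusterMetrics")
print(metrics["NumActiveNMs"])  # -> 3, the value behind "ResourceManager: Active NMs"
```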
Triggers
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
Hadoop: ResourceManager: Service is unavailable | | last(/Hadoop by HTTP/net.tcp.service["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"])=0 | Average | Manual close: Yes |
Hadoop: ResourceManager: Service response time is too high | | min(/Hadoop by HTTP/net.tcp.service.perf["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"],5m)>{$HADOOP.RESOURCEMANAGER.RESPONSE_TIME.MAX.WARN} | Warning | Manual close: Yes |
Hadoop: ResourceManager: Service has been restarted | Uptime is less than 10 minutes. | last(/Hadoop by HTTP/hadoop.resourcemanager.uptime)<10m | Info | Manual close: Yes |
Hadoop: ResourceManager: Failed to fetch ResourceManager API page | Zabbix has not received any data for items for the last 30 minutes. | nodata(/Hadoop by HTTP/hadoop.resourcemanager.uptime,30m)=1 | Warning | Manual close: Yes |
Hadoop: ResourceManager: Cluster has no active NodeManagers | The cluster is unable to execute any jobs without at least one NodeManager. | max(/Hadoop by HTTP/hadoop.resourcemanager.num_active_nm,5m)=0 | High | |
Hadoop: ResourceManager: Cluster has unhealthy NodeManagers | YARN considers any node with disk utilization exceeding the value specified under the property yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage (in yarn-site.xml) to be unhealthy. Ample disk space is critical to ensure uninterrupted operation of a Hadoop cluster, and large numbers of unhealthy nodes (the number to alert on depends on the size of your cluster) should be quickly investigated and resolved. | min(/Hadoop by HTTP/hadoop.resourcemanager.num_unhealthy_nm,15m)>0 | Average | |
Hadoop: NameNode: Service is unavailable | | last(/Hadoop by HTTP/net.tcp.service["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"])=0 | Average | Manual close: Yes |
Hadoop: NameNode: Service response time is too high | | min(/Hadoop by HTTP/net.tcp.service.perf["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"],5m)>{$HADOOP.NAMENODE.RESPONSE_TIME.MAX.WARN} | Warning | Manual close: Yes |
Hadoop: NameNode: Service has been restarted | Uptime is less than 10 minutes. | last(/Hadoop by HTTP/hadoop.namenode.uptime)<10m | Info | Manual close: Yes |
Hadoop: NameNode: Failed to fetch NameNode API page | Zabbix has not received any data for items for the last 30 minutes. | nodata(/Hadoop by HTTP/hadoop.namenode.uptime,30m)=1 | Warning | Manual close: Yes |
Hadoop: NameNode: Cluster capacity remaining is low | A good practice is to ensure that disk use never exceeds 80 percent capacity. | max(/Hadoop by HTTP/hadoop.namenode.percent_remaining,15m)<{$HADOOP.CAPACITY_REMAINING.MIN.WARN} | Warning | |
Hadoop: NameNode: Cluster has missing blocks | A missing block is far worse than a corrupt block, because a missing block cannot be recovered by copying a replica. | min(/Hadoop by HTTP/hadoop.namenode.missing_blocks,15m)>0 | Average | |
Hadoop: NameNode: Cluster has volume failures | HDFS allows disks to fail in place without affecting DataNode operations, until a threshold value is reached. The threshold is set on each DataNode via the dfs.datanode.failed.volumes.tolerated property; it defaults to 0, meaning that any volume failure will shut down the DataNode. On a production cluster where DataNodes typically have 6, 8, or 12 disks, a value of 1 or 2 is the usual best practice. | min(/Hadoop by HTTP/hadoop.namenode.volume_failures_total,15m)>0 | Average | |
Hadoop: NameNode: Cluster has DataNodes in Dead state | The death of a DataNode causes a flurry of network activity, as the NameNode initiates replication of blocks lost on the dead nodes. | min(/Hadoop by HTTP/hadoop.namenode.num_dead_data_nodes,5m)>0 | Average | |
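When tuning these triggers, note how the aggregate functions encode a hold time. Two expressions taken verbatim from the table above:

```
min(/Hadoop by HTTP/hadoop.namenode.missing_blocks,15m)>0
max(/Hadoop by HTTP/hadoop.namenode.percent_remaining,15m)<{$HADOOP.CAPACITY_REMAINING.MIN.WARN}
```

The first fires only if every value over the last 15 minutes was above zero; the second only if remaining capacity stayed below the macro threshold for the whole window. A single transient reading therefore does not raise an alert, and sensitivity is adjusted by overriding the macro (for example, setting {$HADOOP.CAPACITY_REMAINING.MIN.WARN} to 10 on a host).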
LLD rule Node manager discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Node manager discovery | | HTTP agent | hadoop.nodemanager.discovery |
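The rule emits one row per NodeManager, exposing the node's hostname as the {#HOSTNAME} macro that all prototypes below are keyed on; the hostname list is typically derived from the same NodeManagers payload the template already collects. An illustrative sketch of the LLD output (hostnames are placeholders):

```
[
    {"{#HOSTNAME}": "nm1.hadoop.example.com"},
    {"{#HOSTNAME}": "nm2.hadoop.example.com"}
]
```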
Item prototypes for Node manager discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Hadoop NodeManager {#HOSTNAME}: Get stats | | HTTP agent | hadoop.nodemanager.get[{#HOSTNAME}] |
{#HOSTNAME}: RPC queue & processing time | Average time spent on processing RPC requests. | Dependent item | hadoop.nodemanager.rpc_processing_time_avg[{#HOSTNAME}] |
{#HOSTNAME}: Container launch avg duration | | Dependent item | hadoop.nodemanager.container_launch_duration_avg[{#HOSTNAME}] |
{#HOSTNAME}: JVM Threads | The number of JVM threads. | Dependent item | hadoop.nodemanager.jvm.threads[{#HOSTNAME}] |
{#HOSTNAME}: JVM Garbage collection time | The JVM garbage collection time in milliseconds. | Dependent item | hadoop.nodemanager.jvm.gc_time[{#HOSTNAME}] |
{#HOSTNAME}: JVM Heap usage | The JVM heap usage in MBytes. | Dependent item | hadoop.nodemanager.jvm.mem_heap_used[{#HOSTNAME}] |
{#HOSTNAME}: Uptime | | Dependent item | hadoop.nodemanager.uptime[{#HOSTNAME}] |
Hadoop NodeManager {#HOSTNAME}: Get raw info | | Dependent item | hadoop.nodemanager.raw_info[{#HOSTNAME}] |
{#HOSTNAME}: State | State of the node - valid values are: NEW, RUNNING, UNHEALTHY, DECOMMISSIONING, DECOMMISSIONED, LOST, REBOOTED, SHUTDOWN. | Dependent item | hadoop.nodemanager.state[{#HOSTNAME}] |
{#HOSTNAME}: Version | | Dependent item | hadoop.nodemanager.version[{#HOSTNAME}] |
{#HOSTNAME}: Number of containers | | Dependent item | hadoop.nodemanager.numcontainers[{#HOSTNAME}] |
{#HOSTNAME}: Used memory | | Dependent item | hadoop.nodemanager.usedmemory[{#HOSTNAME}] |
{#HOSTNAME}: Available memory | | Dependent item | hadoop.nodemanager.availablememory[{#HOSTNAME}] |
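Each discovered node is then polled directly by its own HTTP agent master item (hadoop.nodemanager.get[{#HOSTNAME}]), from which the dependent prototypes are cut. A minimal sketch of that per-node request, assuming the NodeManager web UI on its default port 8042 and the standard /jmx endpoint (both assumptions, not values copied from the template); the DataNode prototypes further below follow the same pattern against each DataNode's web UI:

```python
# Per-node polling sketch. Port 8042 is the NodeManager web UI default in
# stock Hadoop 3.x, but deployments often change it -- treat it as an assumption.
import json
from urllib.request import urlopen

def fetch_nm_jmx(hostname: str, port: int = 8042) -> dict:
    with urlopen(f"http://{hostname}:{port}/jmx", timeout=10) as resp:
        return json.load(resp)

nm = fetch_nm_jmx("nm1.hadoop.example.com")  # {#HOSTNAME} from discovery
jvm = next(b for b in nm["beans"]
           if b.get("name") == "Hadoop:service=NodeManager,name=JvmMetrics")
print(jvm.get("ThreadsRunnable"))  # one input behind "{#HOSTNAME}: JVM Threads"
```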
Trigger prototypes for Node manager discovery
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
Hadoop: {#HOSTNAME}: Service has been restarted | Uptime is less than 10 minutes. | last(/Hadoop by HTTP/hadoop.nodemanager.uptime[{#HOSTNAME}])<10m | Info | Manual close: Yes |
Hadoop: {#HOSTNAME}: Failed to fetch NodeManager API page | Zabbix has not received any data for items for the last 30 minutes. | nodata(/Hadoop by HTTP/hadoop.nodemanager.uptime[{#HOSTNAME}],30m)=1 | Warning | Manual close: Yes |
Hadoop: {#HOSTNAME}: NodeManager has state {ITEM.VALUE}. | The state is different from normal. | last(/Hadoop by HTTP/hadoop.nodemanager.state[{#HOSTNAME}])<>"RUNNING" | Average | |
LLD rule Data node discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Data node discovery | | HTTP agent | hadoop.datanode.discovery |
Item prototypes for Data node discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Hadoop DataNode {#HOSTNAME}: Get stats | | HTTP agent | hadoop.datanode.get[{#HOSTNAME}] |
{#HOSTNAME}: Remaining | Remaining disk space. | Dependent item | hadoop.datanode.remaining[{#HOSTNAME}] |
{#HOSTNAME}: Used | Used disk space. | Dependent item | hadoop.datanode.dfs_used[{#HOSTNAME}] |
{#HOSTNAME}: Number of failed volumes | Number of failed storage volumes. | Dependent item | hadoop.datanode.numfailedvolumes[{#HOSTNAME}] |
{#HOSTNAME}: JVM Threads | The number of JVM threads. | Dependent item | hadoop.datanode.jvm.threads[{#HOSTNAME}] |
{#HOSTNAME}: JVM Garbage collection time | The JVM garbage collection time in milliseconds. | Dependent item | hadoop.datanode.jvm.gc_time[{#HOSTNAME}] |
{#HOSTNAME}: JVM Heap usage | The JVM heap usage in MBytes. | Dependent item | hadoop.datanode.jvm.mem_heap_used[{#HOSTNAME}] |
{#HOSTNAME}: Uptime | | Dependent item | hadoop.datanode.uptime[{#HOSTNAME}] |
Hadoop DataNode {#HOSTNAME}: Get raw info | | Dependent item | hadoop.datanode.raw_info[{#HOSTNAME}] |
{#HOSTNAME}: Version | DataNode software version. | Dependent item | hadoop.datanode.version[{#HOSTNAME}] |
{#HOSTNAME}: Admin state | Administrative state. | Dependent item | hadoop.datanode.admin_state[{#HOSTNAME}] |
{#HOSTNAME}: Oper state | Operational state. | Dependent item | hadoop.datanode.oper_state[{#HOSTNAME}] |
Trigger prototypes for Data node discovery
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
Hadoop: {#HOSTNAME}: Service has been restarted | Uptime is less than 10 minutes. | last(/Hadoop by HTTP/hadoop.datanode.uptime[{#HOSTNAME}])<10m | Info | Manual close: Yes |
Hadoop: {#HOSTNAME}: Failed to fetch DataNode API page | Zabbix has not received any data for items for the last 30 minutes. | nodata(/Hadoop by HTTP/hadoop.datanode.uptime[{#HOSTNAME}],30m)=1 | Warning | Manual close: Yes |
Hadoop: {#HOSTNAME}: DataNode has state {ITEM.VALUE}. | The state is different from normal. | last(/Hadoop by HTTP/hadoop.datanode.oper_state[{#HOSTNAME}])<>"Live" | Average | |
Feedback
Please report any issues with the template at https://support.zabbix.com
You can also provide feedback, discuss the template, or ask for help at the ZABBIX forums.
Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/hadoop_http?at=release/6.4
Hadoop by HTTP
Overview
The template for monitoring Hadoop over HTTP that works without any external scripts. It collects metrics by polling the Hadoop API remotely using an HTTP agent and JSONPath preprocessing. Zabbix server (or proxy) execute direct requests to ResourceManager, NodeManagers, NameNode, DataNodes APIs. All metrics are collected at once, thanks to the Zabbix bulk data collection.
Requirements
Zabbix version: 6.4 and higher.
Tested versions
This template has been tested on:
- Hadoop 3.1 and later
Configuration
Zabbix should be configured according to the instructions in the Templates out of the box section.
Setup
You should define the IP address (or FQDN) and Web-UI port for the ResourceManager in {$HADOOP.RESOURCEMANAGER.HOST} and {$HADOOP.RESOURCEMANAGER.PORT} macros and for the NameNode in {$HADOOP.NAMENODE.HOST} and {$HADOOP.NAMENODE.PORT} macros respectively. Macros can be set in the template or overridden at the host level.
Macros used
Name | Description | Default |
---|---|---|
{$HADOOP.RESOURCEMANAGER.HOST} | The Hadoop ResourceManager host IP address or FQDN. |
ResourceManager |
{$HADOOP.RESOURCEMANAGER.PORT} | The Hadoop ResourceManager Web-UI port. |
8088 |
{$HADOOP.RESOURCEMANAGER.RESPONSE_TIME.MAX.WARN} | The Hadoop ResourceManager API page maximum response time in seconds for trigger expression. |
10s |
{$HADOOP.NAMENODE.HOST} | The Hadoop NameNode host IP address or FQDN. |
NameNode |
{$HADOOP.NAMENODE.PORT} | The Hadoop NameNode Web-UI port. |
9870 |
{$HADOOP.NAMENODE.RESPONSE_TIME.MAX.WARN} | The Hadoop NameNode API page maximum response time in seconds for trigger expression. |
10s |
{$HADOOP.CAPACITY_REMAINING.MIN.WARN} | The Hadoop cluster capacity remaining percent for trigger expression. |
20 |
Items
Name | Description | Type | Key and additional info |
---|---|---|---|
ResourceManager: Service status | Hadoop ResourceManager API port availability. |
Simple check | net.tcp.service["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"] Preprocessing
|
ResourceManager: Service response time | Hadoop ResourceManager API performance. |
Simple check | net.tcp.service.perf["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"] |
Hadoop: Get ResourceManager stats | HTTP agent | hadoop.resourcemanager.get | |
ResourceManager: Uptime | Dependent item | hadoop.resourcemanager.uptime Preprocessing
|
|
ResourceManager: Get info | Dependent item | hadoop.resourcemanager.info Preprocessing
|
|
ResourceManager: RPC queue & processing time | Average time spent on processing RPC requests. |
Dependent item | hadoop.resourcemanager.rpc_processing_time_avg Preprocessing
|
ResourceManager: Active NMs | Number of Active NodeManagers. |
Dependent item | hadoop.resourcemanager.num_active_nm Preprocessing
|
ResourceManager: Decommissioning NMs | Number of Decommissioning NodeManagers. |
Dependent item | hadoop.resourcemanager.num_decommissioning_nm Preprocessing
|
ResourceManager: Decommissioned NMs | Number of Decommissioned NodeManagers. |
Dependent item | hadoop.resourcemanager.num_decommissioned_nm Preprocessing
|
ResourceManager: Lost NMs | Number of Lost NodeManagers. |
Dependent item | hadoop.resourcemanager.num_lost_nm Preprocessing
|
ResourceManager: Unhealthy NMs | Number of Unhealthy NodeManagers. |
Dependent item | hadoop.resourcemanager.num_unhealthy_nm Preprocessing
|
ResourceManager: Rebooted NMs | Number of Rebooted NodeManagers. |
Dependent item | hadoop.resourcemanager.num_rebooted_nm Preprocessing
|
ResourceManager: Shutdown NMs | Number of Shutdown NodeManagers. |
Dependent item | hadoop.resourcemanager.num_shutdown_nm Preprocessing
|
NameNode: Service status | Hadoop NameNode API port availability. |
Simple check | net.tcp.service["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"] Preprocessing
|
NameNode: Service response time | Hadoop NameNode API performance. |
Simple check | net.tcp.service.perf["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"] |
Hadoop: Get NameNode stats | HTTP agent | hadoop.namenode.get | |
NameNode: Uptime | Dependent item | hadoop.namenode.uptime Preprocessing
|
|
NameNode: Get info | Dependent item | hadoop.namenode.info Preprocessing
|
|
NameNode: RPC queue & processing time | Average time spent on processing RPC requests. |
Dependent item | hadoop.namenode.rpc_processing_time_avg Preprocessing
|
NameNode: Block Pool Renaming | Dependent item | hadoop.namenode.percent_block_pool_used Preprocessing
|
|
NameNode: Transactions since last checkpoint | Total number of transactions since last checkpoint. |
Dependent item | hadoop.namenode.transactions_since_last_checkpoint Preprocessing
|
NameNode: Percent capacity remaining | Available capacity in percent. |
Dependent item | hadoop.namenode.percent_remaining Preprocessing
|
NameNode: Capacity remaining | Available capacity. |
Dependent item | hadoop.namenode.capacity_remaining Preprocessing
|
NameNode: Corrupt blocks | Number of corrupt blocks. |
Dependent item | hadoop.namenode.corrupt_blocks Preprocessing
|
NameNode: Missing blocks | Number of missing blocks. |
Dependent item | hadoop.namenode.missing_blocks Preprocessing
|
NameNode: Failed volumes | Number of failed volumes. |
Dependent item | hadoop.namenode.volume_failures_total Preprocessing
|
NameNode: Alive DataNodes | Count of alive DataNodes. |
Dependent item | hadoop.namenode.num_live_data_nodes Preprocessing
|
NameNode: Dead DataNodes | Count of dead DataNodes. |
Dependent item | hadoop.namenode.num_dead_data_nodes Preprocessing
|
NameNode: Stale DataNodes | DataNodes that do not send a heartbeat within 30 seconds are marked as "stale". |
Dependent item | hadoop.namenode.num_stale_data_nodes Preprocessing
|
NameNode: Total files | Total count of files tracked by the NameNode. |
Dependent item | hadoop.namenode.files_total Preprocessing
|
NameNode: Total load | The current number of concurrent file accesses (read/write) across all DataNodes. |
Dependent item | hadoop.namenode.total_load Preprocessing
|
NameNode: Blocks allocable | Maximum number of blocks allocable. |
Dependent item | hadoop.namenode.block_capacity Preprocessing
|
NameNode: Total blocks | Count of blocks tracked by NameNode. |
Dependent item | hadoop.namenode.blocks_total Preprocessing
|
NameNode: Under-replicated blocks | The number of blocks with insufficient replication. |
Dependent item | hadoop.namenode.under_replicated_blocks Preprocessing
|
Hadoop: Get NodeManagers states | HTTP agent | hadoop.nodemanagers.get Preprocessing
|
|
Hadoop: Get DataNodes states | HTTP agent | hadoop.datanodes.get Preprocessing
|
Triggers
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
ResourceManager: Service is unavailable | last(/Hadoop by HTTP/net.tcp.service["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"])=0 |
Average | Manual close: Yes | |
ResourceManager: Service response time is too high | min(/Hadoop by HTTP/net.tcp.service.perf["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"],5m)>{$HADOOP.RESOURCEMANAGER.RESPONSE_TIME.MAX.WARN} |
Warning | Manual close: Yes Depends on:
|
|
ResourceManager: Service has been restarted | Uptime is less than 10 minutes. |
last(/Hadoop by HTTP/hadoop.resourcemanager.uptime)<10m |
Info | Manual close: Yes |
ResourceManager: Failed to fetch ResourceManager API page | Zabbix has not received any data for items for the last 30 minutes. |
nodata(/Hadoop by HTTP/hadoop.resourcemanager.uptime,30m)=1 |
Warning | Manual close: Yes Depends on:
|
ResourceManager: Cluster has no active NodeManagers | Cluster is unable to execute any jobs without at least one NodeManager. |
max(/Hadoop by HTTP/hadoop.resourcemanager.num_active_nm,5m)=0 |
High | |
ResourceManager: Cluster has unhealthy NodeManagers | YARN considers any node with disk utilization exceeding the value specified under the property yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage (in yarn-site.xml) to be unhealthy. Ample disk space is critical to ensure uninterrupted operation of a Hadoop cluster, and large numbers of unhealthyNodes (the number to alert on depends on the size of your cluster) should be quickly investigated and resolved. |
min(/Hadoop by HTTP/hadoop.resourcemanager.num_unhealthy_nm,15m)>0 |
Average | |
NameNode: Service is unavailable | last(/Hadoop by HTTP/net.tcp.service["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"])=0 |
Average | Manual close: Yes | |
NameNode: Service response time is too high | min(/Hadoop by HTTP/net.tcp.service.perf["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"],5m)>{$HADOOP.NAMENODE.RESPONSE_TIME.MAX.WARN} |
Warning | Manual close: Yes Depends on:
|
|
NameNode: Service has been restarted | Uptime is less than 10 minutes. |
last(/Hadoop by HTTP/hadoop.namenode.uptime)<10m |
Info | Manual close: Yes |
NameNode: Failed to fetch NameNode API page | Zabbix has not received any data for items for the last 30 minutes. |
nodata(/Hadoop by HTTP/hadoop.namenode.uptime,30m)=1 |
Warning | Manual close: Yes Depends on:
|
NameNode: Cluster capacity remaining is low | A good practice is to ensure that disk use never exceeds 80 percent capacity. |
max(/Hadoop by HTTP/hadoop.namenode.percent_remaining,15m)<{$HADOOP.CAPACITY_REMAINING.MIN.WARN} |
Warning | |
NameNode: Cluster has missing blocks | A missing block is far worse than a corrupt block, because a missing block cannot be recovered by copying a replica. |
min(/Hadoop by HTTP/hadoop.namenode.missing_blocks,15m)>0 |
Average | |
NameNode: Cluster has volume failures | HDFS allows disks to fail in place without affecting DataNode operations, until a threshold value is reached. That threshold is set on each DataNode via the dfs.datanode.failed.volumes.tolerated property in hdfs-site.xml. It defaults to 0, meaning that any volume failure will shut down the DataNode. On a production cluster where DataNodes typically have 6, 8, or 12 disks, setting this parameter to 1 or 2 is usually best practice. |
min(/Hadoop by HTTP/hadoop.namenode.volume_failures_total,15m)>0 |
Average | |
NameNode: Cluster has DataNodes in Dead state | The death of a DataNode causes a flurry of network activity, as the NameNode initiates replication of blocks lost on the dead nodes. |
min(/Hadoop by HTTP/hadoop.namenode.num_dead_data_nodes,5m)>0 |
Average |
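The expressions above use Zabbix history functions over sliding time windows: min(...,15m)>0 fires only if the value has been non-zero for the whole 15 minutes, max(...,5m)=0 only if it has been zero for the whole 5 minutes, and nodata(...,30m)=1 when no value has arrived at all within the window. A minimal Python sketch of these window semantics (an illustration only; Zabbix evaluates triggers server-side):

```python
import time

# `history` is a list of (unix_timestamp, value) samples for one item.

def window(history, seconds, now=None):
    now = now or time.time()
    return [v for t, v in history if now - t <= seconds]

def no_active_nm(history):        # max(...num_active_nm,5m)=0
    vals = window(history, 300)
    return bool(vals) and max(vals) == 0

def unhealthy_nm(history):        # min(...num_unhealthy_nm,15m)>0
    vals = window(history, 900)
    return bool(vals) and min(vals) > 0

def no_data(history):             # nodata(...uptime,30m)=1
    return len(window(history, 1800)) == 0

# A NodeManager that turned unhealthy 14 minutes ago does not fire yet:
now = time.time()
history = [(now - 60 * m, 1 if m < 14 else 0) for m in range(20)]
print(unhealthy_nm(history))  # False: it was still healthy ~15m ago

# Hadoop-side context (assumptions, not part of the template): the
# "unhealthy" threshold is
# yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage
# in yarn-site.xml, and the tolerated volume-failure count is
# dfs.datanode.failed.volumes.tolerated in hdfs-site.xml.
```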
LLD rule Node manager discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Node manager discovery | HTTP agent | hadoop.nodemanager.discovery Preprocessing
|
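The discovery item's JavaScript preprocessing (parameters not shown in this rendering) converts the NodeManager list into low-level discovery rows carrying the {#HOSTNAME} macro. A rough Python equivalent, assuming the stock YARN REST endpoint /ws/v1/cluster/nodes on the ResourceManager:

```python
import json
import urllib.request

# Substitute the values of {$HADOOP.RESOURCEMANAGER.HOST} and
# {$HADOOP.RESOURCEMANAGER.PORT} configured on your host.
RM = "http://ResourceManager:8088"

with urllib.request.urlopen(f"{RM}/ws/v1/cluster/nodes") as resp:
    nodes = json.load(resp)["nodes"]["node"]

# One LLD row per NodeManager; each row spawns one copy of the item
# and trigger prototypes below.
lld = [{"{#HOSTNAME}": n["nodeHostName"]} for n in nodes]
print(json.dumps(lld, indent=2))
```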
Item prototypes for Node manager discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Hadoop NodeManager {#HOSTNAME}: Get stats | HTTP agent | hadoop.nodemanager.get[{#HOSTNAME}] | |
{#HOSTNAME}: RPC queue & processing time | Average time spent on processing RPC requests. |
Dependent item | hadoop.nodemanager.rpc_processing_time_avg[{#HOSTNAME}] Preprocessing
|
{#HOSTNAME}: Container launch avg duration | Dependent item | hadoop.nodemanager.container_launch_duration_avg[{#HOSTNAME}] Preprocessing
|
|
{#HOSTNAME}: JVM Threads | The number of JVM threads. |
Dependent item | hadoop.nodemanager.jvm.threads[{#HOSTNAME}] Preprocessing
|
{#HOSTNAME}: JVM Garbage collection time | The JVM garbage collection time in milliseconds. |
Dependent item | hadoop.nodemanager.jvm.gc_time[{#HOSTNAME}] Preprocessing
|
{#HOSTNAME}: JVM Heap usage | The JVM heap usage in MBytes. |
Dependent item | hadoop.nodemanager.jvm.mem_heap_used[{#HOSTNAME}] Preprocessing
|
{#HOSTNAME}: Uptime | Dependent item | hadoop.nodemanager.uptime[{#HOSTNAME}] Preprocessing
|
|
Hadoop NodeManager {#HOSTNAME}: Get raw info | Dependent item | hadoop.nodemanager.raw_info[{#HOSTNAME}] Preprocessing
|
|
{#HOSTNAME}: State | State of the node - valid values are: NEW, RUNNING, UNHEALTHY, DECOMMISSIONING, DECOMMISSIONED, LOST, REBOOTED, SHUTDOWN. |
Dependent item | hadoop.nodemanager.state[{#HOSTNAME}] Preprocessing
|
{#HOSTNAME}: Version | Dependent item | hadoop.nodemanager.version[{#HOSTNAME}] Preprocessing
|
|
{#HOSTNAME}: Number of containers | Dependent item | hadoop.nodemanager.numcontainers[{#HOSTNAME}] Preprocessing
|
|
{#HOSTNAME}: Used memory | Dependent item | hadoop.nodemanager.usedmemory[{#HOSTNAME}] Preprocessing
|
|
{#HOSTNAME}: Available memory | Dependent item | hadoop.nodemanager.availablememory[{#HOSTNAME}] Preprocessing
|
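The per-node JVM items above ({#HOSTNAME}: JVM Threads, JVM Garbage collection time, JVM Heap usage) come from each NodeManager's own metrics endpoint. A hedged sketch of reading them directly, assuming the stock JvmMetrics JMX bean and the default NodeManager web port 8042 (neither value is taken from the template):

```python
import json
import urllib.request

# Hypothetical web address of one discovered NodeManager.
NM = "http://nm1.example.com:8042"

qry = "Hadoop:service=NodeManager,name=JvmMetrics"
with urllib.request.urlopen(f"{NM}/jmx?qry={qry}") as resp:
    jvm = json.load(resp)["beans"][0]

# Attribute names assumed from Hadoop's stock JvmMetrics bean.
print("Heap used (MB):", jvm["MemHeapUsedM"])
print("GC time (ms):  ", jvm["GcTimeMillis"])
print("JVM threads:   ", sum(jvm[k] for k in (
    "ThreadsNew", "ThreadsRunnable", "ThreadsBlocked",
    "ThreadsWaiting", "ThreadsTimedWaiting", "ThreadsTerminated")))
```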
Trigger prototypes for Node manager discovery
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
{#HOSTNAME}: Service has been restarted | Uptime is less than 10 minutes. |
last(/Hadoop by HTTP/hadoop.nodemanager.uptime[{#HOSTNAME}])<10m |
Info | Manual close: Yes |
{#HOSTNAME}: Failed to fetch NodeManager API page | Zabbix has not received any data for items for the last 30 minutes. |
nodata(/Hadoop by HTTP/hadoop.nodemanager.uptime[{#HOSTNAME}],30m)=1 |
Warning | Manual close: Yes Depends on:
|
{#HOSTNAME}: NodeManager has state {ITEM.VALUE}. | The state is different from normal. |
last(/Hadoop by HTTP/hadoop.nodemanager.state[{#HOSTNAME}])<>"RUNNING" |
Average |
LLD rule Data node discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Data node discovery | HTTP agent | hadoop.datanode.discovery Preprocessing
|
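DataNode discovery works against the NameNode rather than the ResourceManager. A sketch under the assumption that it uses the stock NameNodeInfo JMX bean, whose LiveNodes attribute is itself a JSON document encoded as a string:

```python
import json
import urllib.request

NAMENODE = "http://NameNode:9870"  # substitute your macro values

qry = "Hadoop:service=NameNode,name=NameNodeInfo"
with urllib.request.urlopen(f"{NAMENODE}/jmx?qry={qry}") as resp:
    info = json.load(resp)["beans"][0]

# LiveNodes is a JSON string keyed by DataNode name (often "host:port"),
# hence the second json.loads.
live = json.loads(info["LiveNodes"])
lld = [{"{#HOSTNAME}": name.split(":")[0]} for name in live]
print(json.dumps(lld, indent=2))
```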
Item prototypes for Data node discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Hadoop DataNode {#HOSTNAME}: Get stats | HTTP agent | hadoop.datanode.get[{#HOSTNAME}] | |
{#HOSTNAME}: Remaining | Remaining disk space. |
Dependent item | hadoop.datanode.remaining[{#HOSTNAME}] Preprocessing
|
{#HOSTNAME}: Used | Used disk space. |
Dependent item | hadoop.datanode.dfs_used[{#HOSTNAME}] Preprocessing
|
{#HOSTNAME}: Number of failed volumes | Number of failed storage volumes. |
Dependent item | hadoop.datanode.numfailedvolumes[{#HOSTNAME}] Preprocessing
|
{#HOSTNAME}: JVM Threads | The number of JVM threads. |
Dependent item | hadoop.datanode.jvm.threads[{#HOSTNAME}] Preprocessing
|
{#HOSTNAME}: JVM Garbage collection time | The JVM garbage collection time in milliseconds. |
Dependent item | hadoop.datanode.jvm.gc_time[{#HOSTNAME}] Preprocessing
|
{#HOSTNAME}: JVM Heap usage | The JVM heap usage in MBytes. |
Dependent item | hadoop.datanode.jvm.mem_heap_used[{#HOSTNAME}] Preprocessing
|
{#HOSTNAME}: Uptime | Dependent item | hadoop.datanode.uptime[{#HOSTNAME}] Preprocessing
|
|
Hadoop DataNode {#HOSTNAME}: Get raw info | Dependent item | hadoop.datanode.raw_info[{#HOSTNAME}] Preprocessing
|
|
{#HOSTNAME}: Version | DataNode software version. |
Dependent item | hadoop.datanode.version[{#HOSTNAME}] Preprocessing
|
{#HOSTNAME}: Admin state | Administrative state. |
Dependent item | hadoop.datanode.admin_state[{#HOSTNAME}] Preprocessing
|
{#HOSTNAME}: Oper state | Operational state. |
Dependent item | hadoop.datanode.oper_state[{#HOSTNAME}] Preprocessing
|
Trigger prototypes for Data node discovery
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
{#HOSTNAME}: Service has been restarted | Uptime is less than 10 minutes. |
last(/Hadoop by HTTP/hadoop.datanode.uptime[{#HOSTNAME}])<10m |
Info | Manual close: Yes |
{#HOSTNAME}: Failed to fetch DataNode API page | Zabbix has not received any data for items for the last 30 minutes. |
nodata(/Hadoop by HTTP/hadoop.datanode.uptime[{#HOSTNAME}],30m)=1 |
Warning | Manual close: Yes Depends on:
|
{#HOSTNAME}: DataNode has state {ITEM.VALUE}. | The state is different from normal. |
last(/Hadoop by HTTP/hadoop.datanode.oper_state[{#HOSTNAME}])<>"Live" |
Average |
Feedback
Please report any issues with the template at https://support.zabbix.com
You can also provide feedback, discuss the template, or ask for help at ZABBIX forums
Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/hadoop_http?at=release/6.2
Hadoop by HTTP
Overview
For Zabbix version: 6.2 and higher
The template for monitoring Hadoop over HTTP that works without any external scripts.
It collects metrics by polling the Hadoop API remotely using an HTTP agent and JSONPath preprocessing.
Zabbix server (or proxy) executes direct requests to the ResourceManager, NodeManager, NameNode, and DataNode APIs.
All metrics are collected at once, thanks to the Zabbix bulk data collection.
This template was tested on:
- Hadoop, version 3.1 and later
Setup
See Zabbix template operation for basic instructions.
You should define the IP address (or FQDN) and Web-UI port for the ResourceManager in {$HADOOP.RESOURCEMANAGER.HOST} and {$HADOOP.RESOURCEMANAGER.PORT} macros and for the NameNode in {$HADOOP.NAMENODE.HOST} and {$HADOOP.NAMENODE.PORT} macros respectively. Macros can be set in the template or overridden at the host level.
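For example, to override the macros at the host level programmatically, the Zabbix API's usermacro.create method can be used. A minimal sketch; the URL, API token, and host ID are placeholders to replace with your own:

```python
import json
import urllib.request

# Placeholders: replace with your Zabbix front-end URL, an API token,
# and the ID of the host the template is linked to.
ZABBIX_URL = "https://zabbix.example.com/api_jsonrpc.php"
API_TOKEN = "YOUR_API_TOKEN"
HOST_ID = "10084"

def api(method, params):
    payload = json.dumps({"jsonrpc": "2.0", "method": method,
                          "params": params, "auth": API_TOKEN,
                          "id": 1}).encode()
    req = urllib.request.Request(ZABBIX_URL, payload,
                                 {"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["result"]

# Point the template at the real NameNode for this host.
api("usermacro.create", {"hostid": HOST_ID,
                         "macro": "{$HADOOP.NAMENODE.HOST}",
                         "value": "namenode01.example.com"})
```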
Zabbix configuration
No specific Zabbix configuration is required.
Macros used
Name | Description | Default |
---|---|---|
{$HADOOP.CAPACITY_REMAINING.MIN.WARN} | The Hadoop cluster capacity remaining percent for trigger expression. |
20 |
{$HADOOP.NAMENODE.HOST} | The Hadoop NameNode host IP address or FQDN. |
NameNode |
{$HADOOP.NAMENODE.PORT} | The Hadoop NameNode Web-UI port. |
9870 |
{$HADOOP.NAMENODE.RESPONSE_TIME.MAX.WARN} | The Hadoop NameNode API page maximum response time in seconds for trigger expression. |
10s |
{$HADOOP.RESOURCEMANAGER.HOST} | The Hadoop ResourceManager host IP address or FQDN. |
ResourceManager |
{$HADOOP.RESOURCEMANAGER.PORT} | The Hadoop ResourceManager Web-UI port. |
8088 |
{$HADOOP.RESOURCEMANAGER.RESPONSE_TIME.MAX.WARN} | The Hadoop ResourceManager API page maximum response time in seconds for trigger expression. |
10s |
Template links
There are no template links in this template.
Discovery rules
Name | Description | Type | Key and additional info |
---|---|---|---|
Data node discovery | - |
HTTP_AGENT | hadoop.datanode.discovery Preprocessing: - JAVASCRIPT: |
Node manager discovery | - |
HTTP_AGENT | hadoop.nodemanager.discovery Preprocessing: - JAVASCRIPT: |
Items collected
Group | Name | Description | Type | Key and additional info |
---|---|---|---|---|
Hadoop | ResourceManager: Service status | Hadoop ResourceManager API port availability. |
SIMPLE | net.tcp.service["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"] Preprocessing: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | ResourceManager: Service response time | Hadoop ResourceManager API performance. |
SIMPLE | net.tcp.service.perf["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"] |
Hadoop | ResourceManager: Uptime | - |
DEPENDENT | hadoop.resourcemanager.uptime Preprocessing: - JSONPATH: - MULTIPLIER: |
Hadoop | ResourceManager: RPC queue & processing time | Average time spent on processing RPC requests. |
DEPENDENT | hadoop.resourcemanager.rpc_processing_time_avg Preprocessing: - JSONPATH: |
Hadoop | ResourceManager: Active NMs | Number of Active NodeManagers. |
DEPENDENT | hadoop.resourcemanager.num_active_nm Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | ResourceManager: Decommissioning NMs | Number of Decommissioning NodeManagers. |
DEPENDENT | hadoop.resourcemanager.num_decommissioning_nm Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | ResourceManager: Decommissioned NMs | Number of Decommissioned NodeManagers. |
DEPENDENT | hadoop.resourcemanager.num_decommissioned_nm Preprocessing: - JSONPATH: |
Hadoop | ResourceManager: Lost NMs | Number of Lost NodeManagers. |
DEPENDENT | hadoop.resourcemanager.num_lost_nm Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | ResourceManager: Unhealthy NMs | Number of Unhealthy NodeManagers. |
DEPENDENT | hadoop.resourcemanager.num_unhealthy_nm Preprocessing: - JSONPATH: |
Hadoop | ResourceManager: Rebooted NMs | Number of Rebooted NodeManagers. |
DEPENDENT | hadoop.resourcemanager.num_rebooted_nm Preprocessing: - JSONPATH: |
Hadoop | ResourceManager: Shutdown NMs | Number of Shutdown NodeManagers. |
DEPENDENT | hadoop.resourcemanager.num_shutdown_nm Preprocessing: - JSONPATH: |
Hadoop | NameNode: Service status | Hadoop NameNode API port availability. |
SIMPLE | net.tcp.service["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"] Preprocessing: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | NameNode: Service response time | Hadoop NameNode API performance. |
SIMPLE | net.tcp.service.perf["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"] |
Hadoop | NameNode: Uptime | - |
DEPENDENT | hadoop.namenode.uptime Preprocessing: - JSONPATH: - MULTIPLIER: |
Hadoop | NameNode: RPC queue & processing time | Average time spent on processing RPC requests. |
DEPENDENT | hadoop.namenode.rpc_processing_time_avg Preprocessing: - JSONPATH: |
Hadoop | NameNode: Block Pool Renaming | - |
DEPENDENT | hadoop.namenode.percent_block_pool_used Preprocessing: - JSONPATH: |
Hadoop | NameNode: Transactions since last checkpoint | Total number of transactions since last checkpoint. |
DEPENDENT | hadoop.namenode.transactions_since_last_checkpoint Preprocessing: - JSONPATH: |
Hadoop | NameNode: Percent capacity remaining | Available capacity in percent. |
DEPENDENT | hadoop.namenode.percent_remaining Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | NameNode: Capacity remaining | Available capacity. |
DEPENDENT | hadoop.namenode.capacity_remaining Preprocessing: - JSONPATH: |
Hadoop | NameNode: Corrupt blocks | Number of corrupt blocks. |
DEPENDENT | hadoop.namenode.corrupt_blocks Preprocessing: - JSONPATH: |
Hadoop | NameNode: Missing blocks | Number of missing blocks. |
DEPENDENT | hadoop.namenode.missing_blocks Preprocessing: - JSONPATH: |
Hadoop | NameNode: Failed volumes | Number of failed volumes. |
DEPENDENT | hadoop.namenode.volume_failures_total Preprocessing: - JSONPATH: |
Hadoop | NameNode: Alive DataNodes | Count of alive DataNodes. |
DEPENDENT | hadoop.namenode.num_live_data_nodes Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | NameNode: Dead DataNodes | Count of dead DataNodes. |
DEPENDENT | hadoop.namenode.num_dead_data_nodes Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | NameNode: Stale DataNodes | DataNodes that do not send a heartbeat within 30 seconds are marked as "stale". |
DEPENDENT | hadoop.namenode.num_stale_data_nodes Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | NameNode: Total files | Total count of files tracked by the NameNode. |
DEPENDENT | hadoop.namenode.files_total Preprocessing: - JSONPATH: |
Hadoop | NameNode: Total load | The current number of concurrent file accesses (read/write) across all DataNodes. |
DEPENDENT | hadoop.namenode.total_load Preprocessing: - JSONPATH: |
Hadoop | NameNode: Blocks allocable | Maximum number of blocks allocable. |
DEPENDENT | hadoop.namenode.block_capacity Preprocessing: - JSONPATH: |
Hadoop | NameNode: Total blocks | Count of blocks tracked by NameNode. |
DEPENDENT | hadoop.namenode.blocks_total Preprocessing: - JSONPATH: |
Hadoop | NameNode: Under-replicated blocks | The number of blocks with insufficient replication. |
DEPENDENT | hadoop.namenode.under_replicated_blocks Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: RPC queue & processing time | Average time spent on processing RPC requests. |
DEPENDENT | hadoop.nodemanager.rpc_processing_time_avg[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: Container launch avg duration | - |
DEPENDENT | hadoop.nodemanager.container_launch_duration_avg[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: JVM Threads | The number of JVM threads. |
DEPENDENT | hadoop.nodemanager.jvm.threads[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: JVM Garbage collection time | The JVM garbage collection time in milliseconds. |
DEPENDENT | hadoop.nodemanager.jvm.gc_time[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: JVM Heap usage | The JVM heap usage in MBytes. |
DEPENDENT | hadoop.nodemanager.jvm.mem_heap_used[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: Uptime | - |
DEPENDENT | hadoop.nodemanager.uptime[{#HOSTNAME}] Preprocessing: - JSONPATH: - MULTIPLIER: |
Hadoop | {#HOSTNAME}: State | State of the node - valid values are: NEW, RUNNING, UNHEALTHY, DECOMMISSIONING, DECOMMISSIONED, LOST, REBOOTED, SHUTDOWN. |
DEPENDENT | hadoop.nodemanager.state[{#HOSTNAME}] Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | {#HOSTNAME}: Version | - |
DEPENDENT | hadoop.nodemanager.version[{#HOSTNAME}] Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | {#HOSTNAME}: Number of containers | - |
DEPENDENT | hadoop.nodemanager.numcontainers[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: Used memory | - |
DEPENDENT | hadoop.nodemanager.usedmemory[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: Available memory | - |
DEPENDENT | hadoop.nodemanager.availablememory[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: Remaining | Remaining disk space. |
DEPENDENT | hadoop.datanode.remaining[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: Used | Used disk space. |
DEPENDENT | hadoop.datanode.dfs_used[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: Number of failed volumes | Number of failed storage volumes. |
DEPENDENT | hadoop.datanode.numfailedvolumes[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: JVM Threads | The number of JVM threads. |
DEPENDENT | hadoop.datanode.jvm.threads[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: JVM Garbage collection time | The JVM garbage collection time in milliseconds. |
DEPENDENT | hadoop.datanode.jvm.gc_time[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: JVM Heap usage | The JVM heap usage in MBytes. |
DEPENDENT | hadoop.datanode.jvm.mem_heap_used[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: Uptime | - |
DEPENDENT | hadoop.datanode.uptime[{#HOSTNAME}] Preprocessing: - JSONPATH: - MULTIPLIER: |
Hadoop | {#HOSTNAME}: Version | DataNode software version. |
DEPENDENT | hadoop.datanode.version[{#HOSTNAME}] Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | {#HOSTNAME}: Admin state | Administrative state. |
DEPENDENT | hadoop.datanode.admin_state[{#HOSTNAME}] Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | {#HOSTNAME}: Oper state | Operational state. |
DEPENDENT | hadoop.datanode.oper_state[{#HOSTNAME}] Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Zabbix raw items | Get ResourceManager stats | - |
HTTP_AGENT | hadoop.resourcemanager.get |
Zabbix raw items | Get NameNode stats | - |
HTTP_AGENT | hadoop.namenode.get |
Zabbix raw items | Get NodeManagers states | - |
HTTP_AGENT | hadoop.nodemanagers.get Preprocessing: - JAVASCRIPT: |
Zabbix raw items | Get DataNodes states | - |
HTTP_AGENT | hadoop.datanodes.get Preprocessing: - JAVASCRIPT: |
Zabbix raw items | Hadoop NodeManager {#HOSTNAME}: Get stats | - |
HTTP_AGENT | hadoop.nodemanager.get[{#HOSTNAME}] |
Zabbix raw items | Hadoop DataNode {#HOSTNAME}: Get stats | - |
HTTP_AGENT | hadoop.datanode.get[{#HOSTNAME}] |
Triggers
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
ResourceManager: Service is unavailable | - |
last(/Hadoop by HTTP/net.tcp.service["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"])=0 |
AVERAGE | Manual close: YES |
ResourceManager: Service response time is too high | - |
min(/Hadoop by HTTP/net.tcp.service.perf["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"],5m)>{$HADOOP.RESOURCEMANAGER.RESPONSE_TIME.MAX.WARN} |
WARNING | Manual close: YES Depends on: - ResourceManager: Service is unavailable |
ResourceManager: Service has been restarted | Uptime is less than 10 minutes. |
last(/Hadoop by HTTP/hadoop.resourcemanager.uptime)<10m |
INFO | Manual close: YES |
ResourceManager: Failed to fetch ResourceManager API page | Zabbix has not received data for items for the last 30 minutes. |
nodata(/Hadoop by HTTP/hadoop.resourcemanager.uptime,30m)=1 |
WARNING | Manual close: YES Depends on: - ResourceManager: Service is unavailable |
ResourceManager: Cluster has no active NodeManagers | Cluster is unable to execute any jobs without at least one NodeManager. |
max(/Hadoop by HTTP/hadoop.resourcemanager.num_active_nm,5m)=0 |
HIGH | |
ResourceManager: Cluster has unhealthy NodeManagers | YARN considers any node with disk utilization exceeding the value specified under the property yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage (in yarn-site.xml) to be unhealthy. Ample disk space is critical to ensure uninterrupted operation of a Hadoop cluster, and a large number of unhealthy nodes (the number to alert on depends on the size of your cluster) should be quickly investigated and resolved. |
min(/Hadoop by HTTP/hadoop.resourcemanager.num_unhealthy_nm,15m)>0 |
AVERAGE | |
NameNode: Service is unavailable | - |
last(/Hadoop by HTTP/net.tcp.service["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"])=0 |
AVERAGE | Manual close: YES |
NameNode: Service response time is too high | - |
min(/Hadoop by HTTP/net.tcp.service.perf["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"],5m)>{$HADOOP.NAMENODE.RESPONSE_TIME.MAX.WARN} |
WARNING | Manual close: YES Depends on: - NameNode: Service is unavailable |
NameNode: Service has been restarted | Uptime is less than 10 minutes. |
last(/Hadoop by HTTP/hadoop.namenode.uptime)<10m |
INFO | Manual close: YES |
NameNode: Failed to fetch NameNode API page | Zabbix has not received data for items for the last 30 minutes. |
nodata(/Hadoop by HTTP/hadoop.namenode.uptime,30m)=1 |
WARNING | Manual close: YES Depends on: - NameNode: Service is unavailable |
NameNode: Cluster capacity remaining is low | A good practice is to ensure that disk use never exceeds 80 percent capacity. |
max(/Hadoop by HTTP/hadoop.namenode.percent_remaining,15m)<{$HADOOP.CAPACITY_REMAINING.MIN.WARN} |
WARNING | |
NameNode: Cluster has missing blocks | A missing block is far worse than a corrupt block, because a missing block cannot be recovered by copying a replica. |
min(/Hadoop by HTTP/hadoop.namenode.missing_blocks,15m)>0 |
AVERAGE | |
NameNode: Cluster has volume failures | HDFS allows disks to fail in place without affecting DataNode operations, until a threshold value is reached. That threshold is set on each DataNode via the dfs.datanode.failed.volumes.tolerated property in hdfs-site.xml. It defaults to 0, meaning that any volume failure will shut down the DataNode. On a production cluster where DataNodes typically have 6, 8, or 12 disks, setting this parameter to 1 or 2 is usually best practice. |
min(/Hadoop by HTTP/hadoop.namenode.volume_failures_total,15m)>0 |
AVERAGE | |
NameNode: Cluster has DataNodes in Dead state | The death of a DataNode causes a flurry of network activity, as the NameNode initiates replication of blocks lost on the dead nodes. |
min(/Hadoop by HTTP/hadoop.namenode.num_dead_data_nodes,5m)>0 |
AVERAGE | |
{#HOSTNAME}: Service has been restarted | Uptime is less than 10 minutes. |
last(/Hadoop by HTTP/hadoop.nodemanager.uptime[{#HOSTNAME}])<10m |
INFO | Manual close: YES |
{#HOSTNAME}: Failed to fetch NodeManager API page | Zabbix has not received data for items for the last 30 minutes. |
nodata(/Hadoop by HTTP/hadoop.nodemanager.uptime[{#HOSTNAME}],30m)=1 |
WARNING | Manual close: YES Depends on: - {#HOSTNAME}: NodeManager has state {ITEM.VALUE}. |
{#HOSTNAME}: NodeManager has state {ITEM.VALUE}. | The state is different from normal. |
last(/Hadoop by HTTP/hadoop.nodemanager.state[{#HOSTNAME}])<>"RUNNING" |
AVERAGE | |
{#HOSTNAME}: Service has been restarted | Uptime is less than 10 minutes. |
last(/Hadoop by HTTP/hadoop.datanode.uptime[{#HOSTNAME}])<10m |
INFO | Manual close: YES |
{#HOSTNAME}: Failed to fetch DataNode API page | Zabbix has not received data for items for the last 30 minutes. |
nodata(/Hadoop by HTTP/hadoop.datanode.uptime[{#HOSTNAME}],30m)=1 |
WARNING | Manual close: YES Depends on: - {#HOSTNAME}: DataNode has state {ITEM.VALUE}. |
{#HOSTNAME}: DataNode has state {ITEM.VALUE}. | The state is different from normal. |
last(/Hadoop by HTTP/hadoop.datanode.oper_state[{#HOSTNAME}])<>"Live" |
AVERAGE |
Feedback
Please report any issues with the template at https://support.zabbix.com
You can also provide feedback, discuss the template, or ask for help at the ZABBIX forums.
Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/hadoop_http?at=release/6.0
Hadoop by HTTP
Overview
The template for monitoring Hadoop over HTTP that works without any external scripts. It collects metrics by polling the Hadoop API remotely using an HTTP agent and JSONPath preprocessing. Zabbix server (or proxy) executes direct requests to the ResourceManager, NodeManager, NameNode, and DataNode APIs. All metrics are collected at once, thanks to the Zabbix bulk data collection.
Requirements
Zabbix version: 6.0 and higher.
Tested versions
This template has been tested on:
- Hadoop 3.1 and later
Configuration
Zabbix should be configured according to the instructions in the Templates out of the box section.
Setup
You should define the IP address (or FQDN) and Web-UI port for the ResourceManager in {$HADOOP.RESOURCEMANAGER.HOST} and {$HADOOP.RESOURCEMANAGER.PORT} macros and for the NameNode in {$HADOOP.NAMENODE.HOST} and {$HADOOP.NAMENODE.PORT} macros respectively. Macros can be set in the template or overridden at the host level.
Macros used
Name | Description | Default |
---|---|---|
{$HADOOP.RESOURCEMANAGER.HOST} | The Hadoop ResourceManager host IP address or FQDN. |
ResourceManager |
{$HADOOP.RESOURCEMANAGER.PORT} | The Hadoop ResourceManager Web-UI port. |
8088 |
{$HADOOP.RESOURCEMANAGER.RESPONSE_TIME.MAX.WARN} | The Hadoop ResourceManager API page maximum response time in seconds for trigger expression. |
10s |
{$HADOOP.NAMENODE.HOST} | The Hadoop NameNode host IP address or FQDN. |
NameNode |
{$HADOOP.NAMENODE.PORT} | The Hadoop NameNode Web-UI port. |
9870 |
{$HADOOP.NAMENODE.RESPONSE_TIME.MAX.WARN} | The Hadoop NameNode API page maximum response time in seconds for trigger expression. |
10s |
{$HADOOP.CAPACITY_REMAINING.MIN.WARN} | The Hadoop cluster capacity remaining percent for trigger expression. |
20 |
Items
Name | Description | Type | Key and additional info |
---|---|---|---|
ResourceManager: Service status | Hadoop ResourceManager API port availability. |
Simple check | net.tcp.service["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"] Preprocessing
|
ResourceManager: Service response time | Hadoop ResourceManager API performance. |
Simple check | net.tcp.service.perf["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"] |
Hadoop: Get ResourceManager stats | HTTP agent | hadoop.resourcemanager.get | |
ResourceManager: Uptime | Dependent item | hadoop.resourcemanager.uptime Preprocessing
|
|
ResourceManager: Get info | Dependent item | hadoop.resourcemanager.info Preprocessing
|
|
ResourceManager: RPC queue & processing time | Average time spent on processing RPC requests. |
Dependent item | hadoop.resourcemanager.rpc_processing_time_avg Preprocessing
|
ResourceManager: Active NMs | Number of Active NodeManagers. |
Dependent item | hadoop.resourcemanager.num_active_nm Preprocessing
|
ResourceManager: Decommissioning NMs | Number of Decommissioning NodeManagers. |
Dependent item | hadoop.resourcemanager.num_decommissioning_nm Preprocessing
|
ResourceManager: Decommissioned NMs | Number of Decommissioned NodeManagers. |
Dependent item | hadoop.resourcemanager.num_decommissioned_nm Preprocessing
|
ResourceManager: Lost NMs | Number of Lost NodeManagers. |
Dependent item | hadoop.resourcemanager.num_lost_nm Preprocessing
|
ResourceManager: Unhealthy NMs | Number of Unhealthy NodeManagers. |
Dependent item | hadoop.resourcemanager.num_unhealthy_nm Preprocessing
|
ResourceManager: Rebooted NMs | Number of Rebooted NodeManagers. |
Dependent item | hadoop.resourcemanager.num_rebooted_nm Preprocessing
|
ResourceManager: Shutdown NMs | Number of Shutdown NodeManagers. |
Dependent item | hadoop.resourcemanager.num_shutdown_nm Preprocessing
|
NameNode: Service status | Hadoop NameNode API port availability. |
Simple check | net.tcp.service["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"] Preprocessing
|
NameNode: Service response time | Hadoop NameNode API performance. |
Simple check | net.tcp.service.perf["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"] |
Hadoop: Get NameNode stats | HTTP agent | hadoop.namenode.get | |
NameNode: Uptime | Dependent item | hadoop.namenode.uptime Preprocessing
|
|
NameNode: Get info | Dependent item | hadoop.namenode.info Preprocessing
|
|
NameNode: RPC queue & processing time | Average time spent on processing RPC requests. |
Dependent item | hadoop.namenode.rpc_processing_time_avg Preprocessing
|
NameNode: Block Pool Renaming | Dependent item | hadoop.namenode.percent_block_pool_used Preprocessing
|
|
NameNode: Transactions since last checkpoint | Total number of transactions since last checkpoint. |
Dependent item | hadoop.namenode.transactions_since_last_checkpoint Preprocessing
|
NameNode: Percent capacity remaining | Available capacity in percent. |
Dependent item | hadoop.namenode.percent_remaining Preprocessing
|
NameNode: Capacity remaining | Available capacity. |
Dependent item | hadoop.namenode.capacity_remaining Preprocessing
|
NameNode: Corrupt blocks | Number of corrupt blocks. |
Dependent item | hadoop.namenode.corrupt_blocks Preprocessing
|
NameNode: Missing blocks | Number of missing blocks. |
Dependent item | hadoop.namenode.missing_blocks Preprocessing
|
NameNode: Failed volumes | Number of failed volumes. |
Dependent item | hadoop.namenode.volume_failures_total Preprocessing
|
NameNode: Alive DataNodes | Count of alive DataNodes. |
Dependent item | hadoop.namenode.num_live_data_nodes Preprocessing
|
NameNode: Dead DataNodes | Count of dead DataNodes. |
Dependent item | hadoop.namenode.num_dead_data_nodes Preprocessing
|
NameNode: Stale DataNodes | DataNodes that do not send a heartbeat within 30 seconds are marked as "stale". |
Dependent item | hadoop.namenode.num_stale_data_nodes Preprocessing
|
NameNode: Total files | Total count of files tracked by the NameNode. |
Dependent item | hadoop.namenode.files_total Preprocessing
|
NameNode: Total load | The current number of concurrent file accesses (read/write) across all DataNodes. |
Dependent item | hadoop.namenode.total_load Preprocessing
|
NameNode: Blocks allocable | Maximum number of blocks allocable. |
Dependent item | hadoop.namenode.block_capacity Preprocessing
|
NameNode: Total blocks | Count of blocks tracked by NameNode. |
Dependent item | hadoop.namenode.blocks_total Preprocessing
|
NameNode: Under-replicated blocks | The number of blocks with insufficient replication. |
Dependent item | hadoop.namenode.under_replicated_blocks Preprocessing
|
Hadoop: Get NodeManagers states | HTTP agent | hadoop.nodemanagers.get Preprocessing
|
|
Hadoop: Get DataNodes states | HTTP agent | hadoop.datanodes.get Preprocessing
|
Triggers
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
ResourceManager: Service is unavailable | last(/Hadoop by HTTP/net.tcp.service["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"])=0 |
Average | Manual close: Yes | |
ResourceManager: Service response time is too high | min(/Hadoop by HTTP/net.tcp.service.perf["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"],5m)>{$HADOOP.RESOURCEMANAGER.RESPONSE_TIME.MAX.WARN} |
Warning | Manual close: Yes Depends on:
|
|
ResourceManager: Service has been restarted | Uptime is less than 10 minutes. |
last(/Hadoop by HTTP/hadoop.resourcemanager.uptime)<10m |
Info | Manual close: Yes |
ResourceManager: Failed to fetch ResourceManager API page | Zabbix has not received any data for items for the last 30 minutes. |
nodata(/Hadoop by HTTP/hadoop.resourcemanager.uptime,30m)=1 |
Warning | Manual close: Yes Depends on:
|
ResourceManager: Cluster has no active NodeManagers | Cluster is unable to execute any jobs without at least one NodeManager. |
max(/Hadoop by HTTP/hadoop.resourcemanager.num_active_nm,5m)=0 |
High | |
ResourceManager: Cluster has unhealthy NodeManagers | YARN considers any node with disk utilization exceeding the value specified under the property yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage (in yarn-site.xml) to be unhealthy. Ample disk space is critical to ensure uninterrupted operation of a Hadoop cluster, and a large number of unhealthy nodes (the number to alert on depends on the size of your cluster) should be quickly investigated and resolved. |
min(/Hadoop by HTTP/hadoop.resourcemanager.num_unhealthy_nm,15m)>0 |
Average | |
NameNode: Service is unavailable | last(/Hadoop by HTTP/net.tcp.service["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"])=0 |
Average | Manual close: Yes | |
NameNode: Service response time is too high | min(/Hadoop by HTTP/net.tcp.service.perf["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"],5m)>{$HADOOP.NAMENODE.RESPONSE_TIME.MAX.WARN} |
Warning | Manual close: Yes Depends on:
|
|
NameNode: Service has been restarted | Uptime is less than 10 minutes. |
last(/Hadoop by HTTP/hadoop.namenode.uptime)<10m |
Info | Manual close: Yes |
NameNode: Failed to fetch NameNode API page | Zabbix has not received any data for items for the last 30 minutes. |
nodata(/Hadoop by HTTP/hadoop.namenode.uptime,30m)=1 |
Warning | Manual close: Yes Depends on:
|
NameNode: Cluster capacity remaining is low | A good practice is to ensure that disk use never exceeds 80 percent capacity. |
max(/Hadoop by HTTP/hadoop.namenode.percent_remaining,15m)<{$HADOOP.CAPACITY_REMAINING.MIN.WARN} |
Warning | |
NameNode: Cluster has missing blocks | A missing block is far worse than a corrupt block, because a missing block cannot be recovered by copying a replica. |
min(/Hadoop by HTTP/hadoop.namenode.missing_blocks,15m)>0 |
Average | |
NameNode: Cluster has volume failures | HDFS allows disks to fail in place without affecting DataNode operations, until a threshold value is reached. That threshold is set on each DataNode via the dfs.datanode.failed.volumes.tolerated property in hdfs-site.xml. It defaults to 0, meaning that any volume failure will shut down the DataNode. On a production cluster where DataNodes typically have 6, 8, or 12 disks, setting this parameter to 1 or 2 is usually best practice. |
min(/Hadoop by HTTP/hadoop.namenode.volume_failures_total,15m)>0 |
Average | |
NameNode: Cluster has DataNodes in Dead state | The death of a DataNode causes a flurry of network activity, as the NameNode initiates replication of blocks lost on the dead nodes. |
min(/Hadoop by HTTP/hadoop.namenode.num_dead_data_nodes,5m)>0 |
Average |
LLD rule Node manager discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Node manager discovery | HTTP agent | hadoop.nodemanager.discovery Preprocessing
|
Item prototypes for Node manager discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Hadoop NodeManager {#HOSTNAME}: Get stats | HTTP agent | hadoop.nodemanager.get[{#HOSTNAME}] | |
{#HOSTNAME}: RPC queue & processing time | Average time spent on processing RPC requests. |
Dependent item | hadoop.nodemanager.rpc_processing_time_avg[{#HOSTNAME}] Preprocessing
|
{#HOSTNAME}: Container launch avg duration | Dependent item | hadoop.nodemanager.container_launch_duration_avg[{#HOSTNAME}] Preprocessing
|
|
{#HOSTNAME}: JVM Threads | The number of JVM threads. |
Dependent item | hadoop.nodemanager.jvm.threads[{#HOSTNAME}] Preprocessing
|
{#HOSTNAME}: JVM Garbage collection time | The JVM garbage collection time in milliseconds. |
Dependent item | hadoop.nodemanager.jvm.gc_time[{#HOSTNAME}] Preprocessing
|
{#HOSTNAME}: JVM Heap usage | The JVM heap usage in MBytes. |
Dependent item | hadoop.nodemanager.jvm.mem_heap_used[{#HOSTNAME}] Preprocessing
|
{#HOSTNAME}: Uptime | Dependent item | hadoop.nodemanager.uptime[{#HOSTNAME}] Preprocessing
|
|
Hadoop NodeManager {#HOSTNAME}: Get raw info | Dependent item | hadoop.nodemanager.raw_info[{#HOSTNAME}] Preprocessing
|
|
{#HOSTNAME}: State | State of the node - valid values are: NEW, RUNNING, UNHEALTHY, DECOMMISSIONING, DECOMMISSIONED, LOST, REBOOTED, SHUTDOWN. |
Dependent item | hadoop.nodemanager.state[{#HOSTNAME}] Preprocessing
|
{#HOSTNAME}: Version | Dependent item | hadoop.nodemanager.version[{#HOSTNAME}] Preprocessing
|
|
{#HOSTNAME}: Number of containers | Dependent item | hadoop.nodemanager.numcontainers[{#HOSTNAME}] Preprocessing
|
|
{#HOSTNAME}: Used memory | Dependent item | hadoop.nodemanager.usedmemory[{#HOSTNAME}] Preprocessing
|
|
{#HOSTNAME}: Available memory | Dependent item | hadoop.nodemanager.availablememory[{#HOSTNAME}] Preprocessing
|
Trigger prototypes for Node manager discovery
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
{#HOSTNAME}: Service has been restarted | Uptime is less than 10 minutes. |
last(/Hadoop by HTTP/hadoop.nodemanager.uptime[{#HOSTNAME}])<10m |
Info | Manual close: Yes |
{#HOSTNAME}: Failed to fetch NodeManager API page | Zabbix has not received any data for items for the last 30 minutes. |
nodata(/Hadoop by HTTP/hadoop.nodemanager.uptime[{#HOSTNAME}],30m)=1 |
Warning | Manual close: Yes Depends on:
|
{#HOSTNAME}: NodeManager has state {ITEM.VALUE}. | The state is different from normal. |
last(/Hadoop by HTTP/hadoop.nodemanager.state[{#HOSTNAME}])<>"RUNNING" |
Average |
LLD rule Data node discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Data node discovery | HTTP agent | hadoop.datanode.discovery Preprocessing
|
Item prototypes for Data node discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Hadoop DataNode {#HOSTNAME}: Get stats | HTTP agent | hadoop.datanode.get[{#HOSTNAME}] | |
{#HOSTNAME}: Remaining | Remaining disk space. |
Dependent item | hadoop.datanode.remaining[{#HOSTNAME}] Preprocessing
|
{#HOSTNAME}: Used | Used disk space. |
Dependent item | hadoop.datanode.dfs_used[{#HOSTNAME}] Preprocessing
|
{#HOSTNAME}: Number of failed volumes | Number of failed storage volumes. |
Dependent item | hadoop.datanode.numfailedvolumes[{#HOSTNAME}] Preprocessing
|
{#HOSTNAME}: JVM Threads | The number of JVM threads. |
Dependent item | hadoop.datanode.jvm.threads[{#HOSTNAME}] Preprocessing
|
{#HOSTNAME}: JVM Garbage collection time | The JVM garbage collection time in milliseconds. |
Dependent item | hadoop.datanode.jvm.gc_time[{#HOSTNAME}] Preprocessing
|
{#HOSTNAME}: JVM Heap usage | The JVM heap usage in MBytes. |
Dependent item | hadoop.datanode.jvm.mem_heap_used[{#HOSTNAME}] Preprocessing
|
{#HOSTNAME}: Uptime | Dependent item | hadoop.datanode.uptime[{#HOSTNAME}] Preprocessing
|
|
Hadoop DataNode {#HOSTNAME}: Get raw info | Dependent item | hadoop.datanode.raw_info[{#HOSTNAME}] Preprocessing
|
|
{#HOSTNAME}: Version | DataNode software version. |
Dependent item | hadoop.datanode.version[{#HOSTNAME}] Preprocessing
|
{#HOSTNAME}: Admin state | Administrative state. |
Dependent item | hadoop.datanode.admin_state[{#HOSTNAME}] Preprocessing
|
{#HOSTNAME}: Oper state | Operational state. |
Dependent item | hadoop.datanode.oper_state[{#HOSTNAME}] Preprocessing
|
Trigger prototypes for Data node discovery
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
{#HOSTNAME}: Service has been restarted | Uptime is less than 10 minutes. |
last(/Hadoop by HTTP/hadoop.datanode.uptime[{#HOSTNAME}])<10m |
Info | Manual close: Yes |
{#HOSTNAME}: Failed to fetch DataNode API page | Zabbix has not received any data for items for the last 30 minutes. |
nodata(/Hadoop by HTTP/hadoop.datanode.uptime[{#HOSTNAME}],30m)=1 |
Warning | Manual close: Yes Depends on:
|
{#HOSTNAME}: DataNode has state {ITEM.VALUE}. | The state is different from normal. |
last(/Hadoop by HTTP/hadoop.datanode.oper_state[{#HOSTNAME}])<>"Live" |
Average |
Feedback
Please report any issues with the template at https://support.zabbix.com
You can also provide feedback, discuss the template, or ask for help at ZABBIX forums
Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/hadoop_http?at=release/5.4
Hadoop by HTTP
Overview
For Zabbix version: 5.4 and higher
The template for monitoring Hadoop over HTTP that works without any external scripts.
It collects metrics by polling the Hadoop API remotely using an HTTP agent and JSONPath preprocessing.
Zabbix server (or proxy) executes direct requests to the ResourceManager, NodeManager, NameNode, and DataNode APIs.
All metrics are collected at once, thanks to the Zabbix bulk data collection.
This template was tested on:
- Hadoop, version 3.1 and later
Setup
See Zabbix template operation for basic instructions.
You should define the IP address (or FQDN) and Web-UI port for the ResourceManager in {$HADOOP.RESOURCEMANAGER.HOST} and {$HADOOP.RESOURCEMANAGER.PORT} macros and for the NameNode in {$HADOOP.NAMENODE.HOST} and {$HADOOP.NAMENODE.PORT} macros respectively. Macros can be set in the template or overridden at the host level.
Zabbix configuration
No specific Zabbix configuration is required.
Macros used
Name | Description | Default |
---|---|---|
{$HADOOP.CAPACITY_REMAINING.MIN.WARN} | The Hadoop cluster capacity remaining percent for trigger expression. |
20 |
{$HADOOP.NAMENODE.HOST} | The Hadoop NameNode host IP address or FQDN. |
NameNode |
{$HADOOP.NAMENODE.PORT} | The Hadoop NameNode Web-UI port. |
9870 |
{$HADOOP.NAMENODE.RESPONSE_TIME.MAX.WARN} | The Hadoop NameNode API page maximum response time in seconds for trigger expression. |
10s |
{$HADOOP.RESOURCEMANAGER.HOST} | The Hadoop ResourceManager host IP address or FQDN. |
ResourceManager |
{$HADOOP.RESOURCEMANAGER.PORT} | The Hadoop ResourceManager Web-UI port. |
8088 |
{$HADOOP.RESOURCEMANAGER.RESPONSE_TIME.MAX.WARN} | The Hadoop ResourceManager API page maximum response time in seconds for trigger expression. |
10s |
Template links
There are no template links in this template.
Discovery rules
Name | Description | Type | Key and additional info |
---|---|---|---|
Node manager discovery | - |
HTTP_AGENT | hadoop.nodemanager.discovery Preprocessing: - JAVASCRIPT: |
Data node discovery | - |
HTTP_AGENT | hadoop.datanode.discovery Preprocessing: - JAVASCRIPT: |
Items collected
Group | Name | Description | Type | Key and additional info |
---|---|---|---|---|
Hadoop | ResourceManager: Service status | Hadoop ResourceManager API port availability. |
SIMPLE | net.tcp.service["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"] Preprocessing: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | ResourceManager: Service response time | Hadoop ResourceManager API performance. |
SIMPLE | net.tcp.service.perf["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"] |
Hadoop | ResourceManager: Uptime | - |
DEPENDENT | hadoop.resourcemanager.uptime Preprocessing: - JSONPATH: - MULTIPLIER: |
Hadoop | ResourceManager: RPC queue & processing time | Average time spent on processing RPC requests. |
DEPENDENT | hadoop.resourcemanager.rpc_processing_time_avg Preprocessing: - JSONPATH: |
Hadoop | ResourceManager: Active NMs | Number of Active NodeManagers. |
DEPENDENT | hadoop.resourcemanager.num_active_nm Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | ResourceManager: Decommissioning NMs | Number of Decommissioning NodeManagers. |
DEPENDENT | hadoop.resourcemanager.num_decommissioning_nm Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | ResourceManager: Decommissioned NMs | Number of Decommissioned NodeManagers. |
DEPENDENT | hadoop.resourcemanager.num_decommissioned_nm Preprocessing: - JSONPATH: |
Hadoop | ResourceManager: Lost NMs | Number of Lost NodeManagers. |
DEPENDENT | hadoop.resourcemanager.num_lost_nm Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | ResourceManager: Unhealthy NMs | Number of Unhealthy NodeManagers. |
DEPENDENT | hadoop.resourcemanager.num_unhealthy_nm Preprocessing: - JSONPATH: |
Hadoop | ResourceManager: Rebooted NMs | Number of Rebooted NodeManagers. |
DEPENDENT | hadoop.resourcemanager.num_rebooted_nm Preprocessing: - JSONPATH: |
Hadoop | ResourceManager: Shutdown NMs | Number of Shutdown NodeManagers. |
DEPENDENT | hadoop.resourcemanager.num_shutdown_nm Preprocessing: - JSONPATH: |
Hadoop | NameNode: Service status | Hadoop NameNode API port availability. |
SIMPLE | net.tcp.service["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"] Preprocessing: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | NameNode: Service response time | Hadoop NameNode API performance. |
SIMPLE | net.tcp.service.perf["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"] |
Hadoop | NameNode: Uptime | - |
DEPENDENT | hadoop.namenode.uptime Preprocessing: - JSONPATH: - MULTIPLIER: |
Hadoop | NameNode: RPC queue & processing time | Average time spent on processing RPC requests. |
DEPENDENT | hadoop.namenode.rpc_processing_time_avg Preprocessing: - JSONPATH: |
Hadoop | NameNode: Block Pool Renaming | - |
DEPENDENT | hadoop.namenode.percent_block_pool_used Preprocessing: - JSONPATH: |
Hadoop | NameNode: Transactions since last checkpoint | Total number of transactions since last checkpoint. |
DEPENDENT | hadoop.namenode.transactions_since_last_checkpoint Preprocessing: - JSONPATH: |
Hadoop | NameNode: Percent capacity remaining | Available capacity in percent. |
DEPENDENT | hadoop.namenode.percent_remaining Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | NameNode: Capacity remaining | Available capacity. |
DEPENDENT | hadoop.namenode.capacity_remaining Preprocessing: - JSONPATH: |
Hadoop | NameNode: Corrupt blocks | Number of corrupt blocks. |
DEPENDENT | hadoop.namenode.corrupt_blocks Preprocessing: - JSONPATH: |
Hadoop | NameNode: Missing blocks | Number of missing blocks. |
DEPENDENT | hadoop.namenode.missing_blocks Preprocessing: - JSONPATH: |
Hadoop | NameNode: Failed volumes | Number of failed volumes. |
DEPENDENT | hadoop.namenode.volume_failures_total Preprocessing: - JSONPATH: |
Hadoop | NameNode: Alive DataNodes | Count of alive DataNodes. |
DEPENDENT | hadoop.namenode.num_live_data_nodes Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | NameNode: Dead DataNodes | Count of dead DataNodes. |
DEPENDENT | hadoop.namenode.num_dead_data_nodes Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | NameNode: Stale DataNodes | DataNodes that do not send a heartbeat within 30 seconds are marked as "stale". |
DEPENDENT | hadoop.namenode.num_stale_data_nodes Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | NameNode: Total files | Total count of files tracked by the NameNode. |
DEPENDENT | hadoop.namenode.files_total Preprocessing: - JSONPATH: |
Hadoop | NameNode: Total load | The current number of concurrent file accesses (read/write) across all DataNodes. |
DEPENDENT | hadoop.namenode.total_load Preprocessing: - JSONPATH: |
Hadoop | NameNode: Blocks allocable | Maximum number of blocks allocable. |
DEPENDENT | hadoop.namenode.block_capacity Preprocessing: - JSONPATH: |
Hadoop | NameNode: Total blocks | Count of blocks tracked by NameNode. |
DEPENDENT | hadoop.namenode.blocks_total Preprocessing: - JSONPATH: |
Hadoop | NameNode: Under-replicated blocks | The number of blocks with insufficient replication. |
DEPENDENT | hadoop.namenode.under_replicated_blocks Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: RPC queue & processing time | Average time spent on processing RPC requests. |
DEPENDENT | hadoop.nodemanager.rpc_processing_time_avg[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: Container launch avg duration | - |
DEPENDENT | hadoop.nodemanager.container_launch_duration_avg[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: JVM Threads | The number of JVM threads. |
DEPENDENT | hadoop.nodemanager.jvm.threads[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: JVM Garbage collection time | The JVM garbage collection time in milliseconds. |
DEPENDENT | hadoop.nodemanager.jvm.gc_time[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: JVM Heap usage | The JVM heap usage in MBytes. |
DEPENDENT | hadoop.nodemanager.jvm.mem_heap_used[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: Uptime | - |
DEPENDENT | hadoop.nodemanager.uptime[{#HOSTNAME}] Preprocessing: - JSONPATH: - MULTIPLIER: |
Hadoop | {#HOSTNAME}: State | State of the node - valid values are: NEW, RUNNING, UNHEALTHY, DECOMMISSIONING, DECOMMISSIONED, LOST, REBOOTED, SHUTDOWN. |
DEPENDENT | hadoop.nodemanager.state[{#HOSTNAME}] Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | {#HOSTNAME}: Version | - |
DEPENDENT | hadoop.nodemanager.version[{#HOSTNAME}] Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | {#HOSTNAME}: Number of containers | - |
DEPENDENT | hadoop.nodemanager.numcontainers[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: Used memory | - |
DEPENDENT | hadoop.nodemanager.usedmemory[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: Available memory | - |
DEPENDENT | hadoop.nodemanager.availablememory[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: Remaining | Remaining disk space. |
DEPENDENT | hadoop.datanode.remaining[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: Used | Used disk space. |
DEPENDENT | hadoop.datanode.dfs_used[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: Number of failed volumes | Number of failed storage volumes. |
DEPENDENT | hadoop.datanode.numfailedvolumes[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: JVM Threads | The number of JVM threads. |
DEPENDENT | hadoop.datanode.jvm.threads[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: JVM Garbage collection time | The JVM garbage collection time in milliseconds. |
DEPENDENT | hadoop.datanode.jvm.gc_time[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: JVM Heap usage | The JVM heap usage in MBytes. |
DEPENDENT | hadoop.datanode.jvm.mem_heap_used[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: Uptime | - |
DEPENDENT | hadoop.datanode.uptime[{#HOSTNAME}] Preprocessing: - JSONPATH: - MULTIPLIER: |
Hadoop | {#HOSTNAME}: Version | DataNode software version. |
DEPENDENT | hadoop.datanode.version[{#HOSTNAME}] Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | {#HOSTNAME}: Admin state | Administrative state. |
DEPENDENT | hadoop.datanode.admin_state[{#HOSTNAME}] Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | {#HOSTNAME}: Oper state | Operational state. |
DEPENDENT | hadoop.datanode.oper_state[{#HOSTNAME}] Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Zabbix_raw_items | Get ResourceManager stats | - |
HTTP_AGENT | hadoop.resourcemanager.get |
Zabbix_raw_items | Get NameNode stats | - |
HTTP_AGENT | hadoop.namenode.get |
Zabbix_raw_items | Get NodeManagers states | - |
HTTP_AGENT | hadoop.nodemanagers.get Preprocessing: - JAVASCRIPT: |
Zabbix_raw_items | Get DataNodes states | - |
HTTP_AGENT | hadoop.datanodes.get Preprocessing: - JAVASCRIPT: |
Zabbix_raw_items | Hadoop NodeManager {#HOSTNAME}: Get stats | - |
HTTP_AGENT | hadoop.nodemanager.get[{#HOSTNAME}] |
Zabbix_raw_items | Hadoop DataNode {#HOSTNAME}: Get stats | - |
HTTP_AGENT | hadoop.datanode.get[{#HOSTNAME}] |
Triggers
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
ResourceManager: Service is unavailable | - |
last(/Hadoop by HTTP/net.tcp.service["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"])=0 |
AVERAGE | Manual close: YES |
ResourceManager: Service response time is too high (over {$HADOOP.RESOURCEMANAGER.RESPONSE_TIME.MAX.WARN} for 5m) | - |
min(/Hadoop by HTTP/net.tcp.service.perf["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"],5m)>{$HADOOP.RESOURCEMANAGER.RESPONSE_TIME.MAX.WARN} |
WARNING | Manual close: YES Depends on: - ResourceManager: Service is unavailable |
ResourceManager: Service has been restarted (uptime < 10m) | Uptime is less than 10 minutes. |
last(/Hadoop by HTTP/hadoop.resourcemanager.uptime)<10m |
INFO | Manual close: YES |
ResourceManager: Failed to fetch ResourceManager API page (or no data for 30m) | Zabbix has not received data for items for the last 30 minutes. |
nodata(/Hadoop by HTTP/hadoop.resourcemanager.uptime,30m)=1 |
WARNING | Manual close: YES Depends on: - ResourceManager: Service is unavailable |
ResourceManager: Cluster has no active NodeManagers | Cluster is unable to execute any jobs without at least one NodeManager. |
max(/Hadoop by HTTP/hadoop.resourcemanager.num_active_nm,5m)=0 |
HIGH | |
ResourceManager: Cluster has unhealthy NodeManagers | YARN considers any node with disk utilization exceeding the value specified under the property yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage (in yarn-site.xml) to be unhealthy. Ample disk space is critical to ensure uninterrupted operation of a Hadoop cluster, and a large number of unhealthy nodes (the number to alert on depends on the size of your cluster) should be quickly investigated and resolved. |
min(/Hadoop by HTTP/hadoop.resourcemanager.num_unhealthy_nm,15m)>0 |
AVERAGE | |
NameNode: Service is unavailable | - |
last(/Hadoop by HTTP/net.tcp.service["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"])=0 |
AVERAGE | Manual close: YES |
NameNode: Service response time is too high (over {$HADOOP.NAMENODE.RESPONSE_TIME.MAX.WARN} for 5m) | - |
min(/Hadoop by HTTP/net.tcp.service.perf["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"],5m)>{$HADOOP.NAMENODE.RESPONSE_TIME.MAX.WARN} |
WARNING | Manual close: YES Depends on: - NameNode: Service is unavailable |
NameNode: Service has been restarted (uptime < 10m) | Uptime is less than 10 minutes. |
last(/Hadoop by HTTP/hadoop.namenode.uptime)<10m |
INFO | Manual close: YES |
NameNode: Failed to fetch NameNode API page (or no data for 30m) | Zabbix has not received data for items for the last 30 minutes. |
nodata(/Hadoop by HTTP/hadoop.namenode.uptime,30m)=1 |
WARNING | Manual close: YES Depends on: - NameNode: Service is unavailable |
NameNode: Cluster capacity remaining is low (below {$HADOOP.CAPACITY_REMAINING.MIN.WARN}% for 15m) | A good practice is to ensure that disk use never exceeds 80 percent capacity. |
max(/Hadoop by HTTP/hadoop.namenode.percent_remaining,15m)<{$HADOOP.CAPACITY_REMAINING.MIN.WARN} |
WARNING | |
NameNode: Cluster has missing blocks | A missing block is far worse than a corrupt block, because a missing block cannot be recovered by copying a replica. |
min(/Hadoop by HTTP/hadoop.namenode.missing_blocks,15m)>0 |
AVERAGE | |
NameNode: Cluster has volume failures | HDFS allows disks to fail in place without affecting DataNode operations, until a threshold value is reached. That threshold is set on each DataNode via the dfs.datanode.failed.volumes.tolerated property in hdfs-site.xml. It defaults to 0, meaning that any volume failure will shut down the DataNode. On a production cluster where DataNodes typically have 6, 8, or 12 disks, setting this parameter to 1 or 2 is usually best practice. |
min(/Hadoop by HTTP/hadoop.namenode.volume_failures_total,15m)>0 |
AVERAGE | |
NameNode: Cluster has DataNodes in Dead state | The death of a DataNode causes a flurry of network activity, as the NameNode initiates replication of blocks lost on the dead nodes. |
min(/Hadoop by HTTP/hadoop.namenode.num_dead_data_nodes,5m)>0 |
AVERAGE | |
{#HOSTNAME}: Service has been restarted (uptime < 10m) | Uptime is less than 10 minutes. |
last(/Hadoop by HTTP/hadoop.nodemanager.uptime[{#HOSTNAME}])<10m |
INFO | Manual close: YES |
{#HOSTNAME}: Failed to fetch NodeManager API page (or no data for 30m) | Zabbix has not received data for items for the last 30 minutes. |
nodata(/Hadoop by HTTP/hadoop.nodemanager.uptime[{#HOSTNAME}],30m)=1 |
WARNING | Manual close: YES Depends on: - {#HOSTNAME}: NodeManager has state {ITEM.VALUE}. |
{#HOSTNAME}: NodeManager has state {ITEM.VALUE}. | The state is different from normal. |
last(/Hadoop by HTTP/hadoop.nodemanager.state[{#HOSTNAME}])<>"RUNNING" |
AVERAGE | |
{#HOSTNAME}: Service has been restarted (uptime < 10m) | Uptime is less than 10 minutes. |
last(/Hadoop by HTTP/hadoop.datanode.uptime[{#HOSTNAME}])<10m |
INFO | Manual close: YES |
{#HOSTNAME}: Failed to fetch DataNode API page (or no data for 30m) | Zabbix has not received data for items for the last 30 minutes. |
nodata(/Hadoop by HTTP/hadoop.datanode.uptime[{#HOSTNAME}],30m)=1 |
WARNING | Manual close: YES Depends on: - {#HOSTNAME}: DataNode has state {ITEM.VALUE}. |
{#HOSTNAME}: DataNode has state {ITEM.VALUE}. | The state is different from normal. |
last(/Hadoop by HTTP/hadoop.datanode.oper_state[{#HOSTNAME}])<>"Live" |
AVERAGE |
Feedback
Please report any issues with the template at https://support.zabbix.com
You can also provide feedback, discuss the template, or ask for help at the ZABBIX forums.
Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/hadoop_http?at=release/5.0
Template App Hadoop by HTTP
Overview
For Zabbix version: 5.0 and higher
The template for monitoring Hadoop over HTTP that works without any external scripts.
It collects metrics by polling the Hadoop API remotely using an HTTP agent and JSONPath preprocessing.
Zabbix server (or proxy) execute direct requests to ResourceManager, NodeManagers, NameNode, DataNodes APIs.
All metrics are collected at once, thanks to the Zabbix bulk data collection.
This template was tested on:
- Zabbix, version 5.0 and later
- Hadoop, version 3.1 and later
Setup
See Zabbix template operation for basic instructions.
You should define the IP address (or FQDN) and Web-UI port for the ResourceManager in {$HADOOP.RESOURCEMANAGER.HOST} and {$HADOOP.RESOURCEMANAGER.PORT} macros and for the NameNode in {$HADOOP.NAMENODE.HOST} and {$HADOOP.NAMENODE.PORT} macros respectively. Macros can be set in the template or overridden at the host level.
Zabbix configuration
No specific Zabbix configuration is required.
Macros used
Name | Description | Default |
---|---|---|
{$HADOOP.CAPACITY_REMAINING.MIN.WARN} | The Hadoop cluster capacity remaining percent for trigger expression. |
20 |
{$HADOOP.NAMENODE.HOST} | The Hadoop NameNode host IP address or FQDN. |
NameNode |
{$HADOOP.NAMENODE.PORT} | The Hadoop NameNode Web-UI port. |
9870 |
{$HADOOP.NAMENODE.RESPONSE_TIME.MAX.WARN} | The Hadoop NameNode API page maximum response time in seconds for trigger expression. |
10s |
{$HADOOP.RESOURCEMANAGER.HOST} | The Hadoop ResourceManager host IP address or FQDN. |
ResourceManager |
{$HADOOP.RESOURCEMANAGER.PORT} | The Hadoop ResourceManager Web-UI port. |
8088 |
{$HADOOP.RESOURCEMANAGER.RESPONSE_TIME.MAX.WARN} | The Hadoop ResourceManager API page maximum response time in seconds for trigger expression. |
10s |
Template links
There are no template links in this template.
Discovery rules
Name | Description | Type | Key and additional info |
---|---|---|---|
Node manager discovery | - |
HTTP_AGENT | hadoop.nodemanager.discovery Preprocessing: - JAVASCRIPT: |
Data node discovery | - |
HTTP_AGENT | hadoop.datanode.discovery Preprocessing: - JAVASCRIPT: |
Items collected
Group | Name | Description | Type | Key and additional info |
---|---|---|---|---|
Hadoop | ResourceManager: Service status | Hadoop ResourceManager API port availability. |
SIMPLE | net.tcp.service["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"] Preprocessing: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | ResourceManager: Service response time | Hadoop ResourceManager API performance. |
SIMPLE | net.tcp.service.perf["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"] |
Hadoop | ResourceManager: Uptime | DEPENDENT | hadoop.resourcemanager.uptime Preprocessing: - JSONPATH: - MULTIPLIER: |
|
Hadoop | ResourceManager: RPC queue & processing time | Average time spent on processing RPC requests. |
DEPENDENT | hadoop.resourcemanager.rpc_processing_time_avg Preprocessing: - JSONPATH: |
Hadoop | ResourceManager: Active NMs | Number of Active NodeManagers. |
DEPENDENT | hadoop.resourcemanager.num_active_nm Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | ResourceManager: Decommissioning NMs | Number of Decommissioning NodeManagers. |
DEPENDENT | hadoop.resourcemanager.num_decommissioning_nm Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | ResourceManager: Decommissioned NMs | Number of Decommissioned NodeManagers. |
DEPENDENT | hadoop.resourcemanager.num_decommissioned_nm Preprocessing: - JSONPATH: |
Hadoop | ResourceManager: Lost NMs | Number of Lost NodeManagers. |
DEPENDENT | hadoop.resourcemanager.num_lost_nm Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | ResourceManager: Unhealthy NMs | Number of Unhealthy NodeManagers. |
DEPENDENT | hadoop.resourcemanager.num_unhealthy_nm Preprocessing: - JSONPATH: |
Hadoop | ResourceManager: Rebooted NMs | Number of Rebooted NodeManagers. |
DEPENDENT | hadoop.resourcemanager.num_rebooted_nm Preprocessing: - JSONPATH: |
Hadoop | ResourceManager: Shutdown NMs | Number of Shutdown NodeManagers. |
DEPENDENT | hadoop.resourcemanager.num_shutdown_nm Preprocessing: - JSONPATH: |
Hadoop | NameNode: Service status | Hadoop NameNode API port availability. |
SIMPLE | net.tcp.service["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"] Preprocessing: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | NameNode: Service response time | Hadoop NameNode API performance. |
SIMPLE | net.tcp.service.perf["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"] |
Hadoop | NameNode: Uptime | DEPENDENT | hadoop.namenode.uptime Preprocessing: - JSONPATH: - MULTIPLIER: |
|
Hadoop | NameNode: RPC queue & processing time | Average time spent on processing RPC requests. |
DEPENDENT | hadoop.namenode.rpc_processing_time_avg Preprocessing: - JSONPATH: |
Hadoop | NameNode: Block Pool Renaming | DEPENDENT | hadoop.namenode.percent_block_pool_used Preprocessing: - JSONPATH: |
|
Hadoop | NameNode: Transactions since last checkpoint | Total number of transactions since last checkpoint. |
DEPENDENT | hadoop.namenode.transactions_since_last_checkpoint Preprocessing: - JSONPATH: |
Hadoop | NameNode: Percent capacity remaining | Available capacity in percent. |
DEPENDENT | hadoop.namenode.percent_remaining Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | NameNode: Capacity remaining | Available capacity. |
DEPENDENT | hadoop.namenode.capacity_remaining Preprocessing: - JSONPATH: |
Hadoop | NameNode: Corrupt blocks | Number of corrupt blocks. |
DEPENDENT | hadoop.namenode.corrupt_blocks Preprocessing: - JSONPATH: |
Hadoop | NameNode: Missing blocks | Number of missing blocks. |
DEPENDENT | hadoop.namenode.missing_blocks Preprocessing: - JSONPATH: |
Hadoop | NameNode: Failed volumes | Number of failed volumes. |
DEPENDENT | hadoop.namenode.volume_failures_total Preprocessing: - JSONPATH: |
Hadoop | NameNode: Alive DataNodes | Count of alive DataNodes. |
DEPENDENT | hadoop.namenode.num_live_data_nodes Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | NameNode: Dead DataNodes | Count of dead DataNodes. |
DEPENDENT | hadoop.namenode.num_dead_data_nodes Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | NameNode: Stale DataNodes | DataNodes that do not send a heartbeat within 30 seconds are marked as "stale". |
DEPENDENT | hadoop.namenode.num_stale_data_nodes Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | NameNode: Total files | Total count of files tracked by the NameNode. |
DEPENDENT | hadoop.namenode.files_total Preprocessing: - JSONPATH: |
Hadoop | NameNode: Total load | The current number of concurrent file accesses (read/write) across all DataNodes. |
DEPENDENT | hadoop.namenode.total_load Preprocessing: - JSONPATH: |
Hadoop | NameNode: Blocks allocable | Maximum number of blocks allocable. |
DEPENDENT | hadoop.namenode.block_capacity Preprocessing: - JSONPATH: |
Hadoop | NameNode: Total blocks | Count of blocks tracked by NameNode. |
DEPENDENT | hadoop.namenode.blocks_total Preprocessing: - JSONPATH: |
Hadoop | NameNode: Under-replicated blocks | The number of blocks with insufficient replication. |
DEPENDENT | hadoop.namenode.under_replicated_blocks Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: RPC queue & processing time | Average time spent on processing RPC requests. |
DEPENDENT | hadoop.nodemanager.rpc_processing_time_avg[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: Container launch avg duration | DEPENDENT | hadoop.nodemanager.container_launch_duration_avg[{#HOSTNAME}] Preprocessing: - JSONPATH: |
|
Hadoop | {#HOSTNAME}: JVM Threads | The number of JVM threads. |
DEPENDENT | hadoop.nodemanager.jvm.threads[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: JVM Garbage collection time | The JVM garbage collection time in milliseconds. |
DEPENDENT | hadoop.nodemanager.jvm.gc_time[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: JVM Heap usage | The JVM heap usage in MBytes. |
DEPENDENT | hadoop.nodemanager.jvm.mem_heap_used[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: Uptime | DEPENDENT | hadoop.nodemanager.uptime[{#HOSTNAME}] Preprocessing: - JSONPATH: - MULTIPLIER: |
|
Hadoop | {#HOSTNAME}: State | State of the node - valid values are: NEW, RUNNING, UNHEALTHY, DECOMMISSIONING, DECOMMISSIONED, LOST, REBOOTED, SHUTDOWN. |
DEPENDENT | hadoop.nodemanager.state[{#HOSTNAME}] Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | {#HOSTNAME}: Version | DEPENDENT | hadoop.nodemanager.version[{#HOSTNAME}] Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
|
Hadoop | {#HOSTNAME}: Number of containers | DEPENDENT | hadoop.nodemanager.numcontainers[{#HOSTNAME}] Preprocessing: - JSONPATH: |
|
Hadoop | {#HOSTNAME}: Used memory | DEPENDENT | hadoop.nodemanager.usedmemory[{#HOSTNAME}] Preprocessing: - JSONPATH: |
|
Hadoop | {#HOSTNAME}: Available memory | DEPENDENT | hadoop.nodemanager.availablememory[{#HOSTNAME}] Preprocessing: - JSONPATH: |
|
Hadoop | {#HOSTNAME}: Remaining | Remaining disk space. |
DEPENDENT | hadoop.datanode.remaining[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: Used | Used disk space. |
DEPENDENT | hadoop.datanode.dfs_used[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: Number of failed volumes | Number of failed storage volumes. |
DEPENDENT | hadoop.datanode.numfailedvolumes[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: JVM Threads | The number of JVM threads. |
DEPENDENT | hadoop.datanode.jvm.threads[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: JVM Garbage collection time | The JVM garbage collection time in milliseconds. |
DEPENDENT | hadoop.datanode.jvm.gc_time[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: JVM Heap usage | The JVM heap usage in MBytes. |
DEPENDENT | hadoop.datanode.jvm.mem_heap_used[{#HOSTNAME}] Preprocessing: - JSONPATH: |
Hadoop | {#HOSTNAME}: Uptime | DEPENDENT | hadoop.datanode.uptime[{#HOSTNAME}] Preprocessing: - JSONPATH: - MULTIPLIER: |
|
Hadoop | {#HOSTNAME}: Version | DataNode software version. |
DEPENDENT | hadoop.datanode.version[{#HOSTNAME}] Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | {#HOSTNAME}: Admin state | Administrative state. |
DEPENDENT | hadoop.datanode.admin_state[{#HOSTNAME}] Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Hadoop | {#HOSTNAME}: Oper state | Operational state. |
DEPENDENT | hadoop.datanode.oper_state[{#HOSTNAME}] Preprocessing: - JSONPATH: - DISCARD_UNCHANGED_HEARTBEAT: |
Zabbix_raw_items | Get ResourceManager stats | - |
HTTP_AGENT | hadoop.resourcemanager.get |
Zabbix_raw_items | Get NameNode stats | - |
HTTP_AGENT | hadoop.namenode.get |
Zabbix_raw_items | Get NodeManagers states | - |
HTTP_AGENT | hadoop.nodemanagers.get Preprocessing: - JAVASCRIPT: |
Zabbix_raw_items | Get DataNodes states | - |
HTTP_AGENT | hadoop.datanodes.get Preprocessing: - JAVASCRIPT: |
Zabbix_raw_items | Hadoop NodeManager {#HOSTNAME}: Get stats | HTTP_AGENT | hadoop.nodemanager.get[{#HOSTNAME}] | |
Zabbix_raw_items | Hadoop DataNode {#HOSTNAME}: Get stats | HTTP_AGENT | hadoop.datanode.get[{#HOSTNAME}] |
Triggers
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
ResourceManager: Service is unavailable | - |
{TEMPLATE_NAME:net.tcp.service["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"].last()}=0 |
AVERAGE | Manual close: YES |
ResourceManager: Service response time is too high (over {$HADOOP.RESOURCEMANAGER.RESPONSE_TIME.MAX.WARN} for 5m) | - |
{TEMPLATE_NAME:net.tcp.service.perf["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"].min(5m)}>{$HADOOP.RESOURCEMANAGER.RESPONSE_TIME.MAX.WARN} |
WARNING | Manual close: YES Depends on: - ResourceManager: Service is unavailable |
ResourceManager: Service has been restarted (uptime < 10m) | Uptime is less than 10 minutes |
{TEMPLATE_NAME:hadoop.resourcemanager.uptime.last()}<10m |
INFO | Manual close: YES |
ResourceManager: Failed to fetch ResourceManager API page (or no data for 30m) | Zabbix has not received data for items for the last 30 minutes. |
{TEMPLATE_NAME:hadoop.resourcemanager.uptime.nodata(30m)}=1 |
WARNING | Manual close: YES Depends on: - ResourceManager: Service is unavailable |
ResourceManager: Cluster has no active NodeManagers | Cluster is unable to execute any jobs without at least one NodeManager. |
{TEMPLATE_NAME:hadoop.resourcemanager.num_active_nm.max(5m)}=0 |
HIGH | |
ResourceManager: Cluster has unhealthy NodeManagers | YARN considers any node with disk utilization exceeding the value specified under the property yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage (in yarn-site.xml) to be unhealthy. Ample disk space is critical to ensure uninterrupted operation of a Hadoop cluster, and large numbers of unhealthyNodes (the number to alert on depends on the size of your cluster) should be quickly investigated and resolved. |
{TEMPLATE_NAME:hadoop.resourcemanager.num_unhealthy_nm.min(15m)}>0 |
AVERAGE | |
NameNode: Service is unavailable | - |
{TEMPLATE_NAME:net.tcp.service["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"].last()}=0 |
AVERAGE | Manual close: YES |
NameNode: Service response time is too high (over {$HADOOP.NAMENODE.RESPONSE_TIME.MAX.WARN} for 5m) | - |
{TEMPLATE_NAME:net.tcp.service.perf["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"].min(5m)}>{$HADOOP.NAMENODE.RESPONSE_TIME.MAX.WARN} |
WARNING | Manual close: YES Depends on: - NameNode: Service is unavailable |
NameNode: Service has been restarted (uptime < 10m) | Uptime is less than 10 minutes |
{TEMPLATE_NAME:hadoop.namenode.uptime.last()}<10m |
INFO | Manual close: YES |
NameNode: Failed to fetch NameNode API page (or no data for 30m) | Zabbix has not received data for items for the last 30 minutes. |
{TEMPLATE_NAME:hadoop.namenode.uptime.nodata(30m)}=1 |
WARNING | Manual close: YES Depends on: - NameNode: Service is unavailable |
NameNode: Cluster capacity remaining is low (below {$HADOOP.CAPACITY_REMAINING.MIN.WARN}% for 15m) | A good practice is to ensure that disk use never exceeds 80 percent capacity. |
{TEMPLATE_NAME:hadoop.namenode.percent_remaining.max(15m)}<{$HADOOP.CAPACITY_REMAINING.MIN.WARN} |
WARNING | |
NameNode: Cluster has missing blocks | A missing block is far worse than a corrupt block, because a missing block cannot be recovered by copying a replica. |
{TEMPLATE_NAME:hadoop.namenode.missing_blocks.min(15m)}>0 |
AVERAGE | |
NameNode: Cluster has volume failures | HDFS now allows for disks to fail in place, without affecting DataNode operations, until a threshold value is reached. This is set on each DataNode via the dfs.datanode.failed.volumes.tolerated property; it defaults to 0, meaning that any volume failure will shut down the DataNode; on a production cluster where DataNodes typically have 6, 8, or 12 disks, setting this parameter to 1 or 2 is typically the best practice. |
{TEMPLATE_NAME:hadoop.namenode.volume_failures_total.min(15m)}>0 |
AVERAGE | |
NameNode: Cluster has DataNodes in Dead state | The death of a DataNode causes a flurry of network activity, as the NameNode initiates replication of blocks lost on the dead nodes. |
{TEMPLATE_NAME:hadoop.namenode.num_dead_data_nodes.min(5m)}>0 |
AVERAGE | |
{#HOSTNAME}: Service has been restarted (uptime < 10m) | Uptime is less than 10 minutes |
{TEMPLATE_NAME:hadoop.nodemanager.uptime[{#HOSTNAME}].last()}<10m |
INFO | Manual close: YES |
{#HOSTNAME}: Failed to fetch NodeManager API page (or no data for 30m) | Zabbix has not received data for items for the last 30 minutes. |
{TEMPLATE_NAME:hadoop.nodemanager.uptime[{#HOSTNAME}].nodata(30m)}=1 |
WARNING | Manual close: YES Depends on: - {#HOSTNAME}: NodeManager has state {ITEM.VALUE}. |
{#HOSTNAME}: NodeManager has state {ITEM.VALUE}. | The state is different from normal. |
{TEMPLATE_NAME:hadoop.nodemanager.state[{#HOSTNAME}].last()}<>"RUNNING" |
AVERAGE | |
{#HOSTNAME}: Service has been restarted (uptime < 10m) | Uptime is less than 10 minutes |
{TEMPLATE_NAME:hadoop.datanode.uptime[{#HOSTNAME}].last()}<10m |
INFO | Manual close: YES |
{#HOSTNAME}: Failed to fetch DataNode API page (or no data for 30m) | Zabbix has not received data for items for the last 30 minutes. |
{TEMPLATE_NAME:hadoop.datanode.uptime[{#HOSTNAME}].nodata(30m)}=1 |
WARNING | Manual close: YES Depends on: - {#HOSTNAME}: DataNode has state {ITEM.VALUE}. |
{#HOSTNAME}: DataNode has state {ITEM.VALUE}. | The state is different from normal. |
{TEMPLATE_NAME:hadoop.datanode.oper_state[{#HOSTNAME}].last()}<>"Live" |
AVERAGE |
Feedback
Please report any issues with the template at https://support.zabbix.com
You can also provide a feedback, discuss the template or ask for help with it at ZABBIX forums.