Hadoop

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.

Available solutions

This template is for Zabbix version: 5.4
Also available for: 5.0

Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/hadoop_http?at=release/5.4

Hadoop by HTTP

Overview

For Zabbix version: 5.4 and higher
The template monitors Hadoop over HTTP and works without any external scripts.
It collects metrics by polling the Hadoop API remotely using an HTTP agent and JSONPath preprocessing.
The Zabbix server (or proxy) sends requests directly to the ResourceManager, NodeManager, NameNode, and DataNode APIs.
All metrics are collected at once, thanks to Zabbix bulk data collection.
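
In practice, each "Get ... stats" HTTP agent item requests the daemon's JMX JSON servlet once, and every dependent item picks its own attribute out of the returned beans array with JSONPath. The snippet below is only a minimal sketch of that flow (it is not part of the template); the hostname and port are placeholders, and the /jmx path is an assumption based on the stock Hadoop web UI, so adjust both for your deployment.

    // Minimal sketch (Node.js 18+): poll the ResourceManager JMX servlet once and
    // read one attribute the same way the template's JSONPath step
    // $.beans[?(@.name=='java.lang:type=Runtime')].Uptime.first() does.
    const RM_URL = 'http://resourcemanager.example.com:8088/jmx'; // placeholder host and port

    async function main() {
      const response = await fetch(RM_URL);          // one bulk request per daemon
      const jmx = await response.json();             // { "beans": [ { "name": ..., ... }, ... ] }

      const runtime = jmx.beans.find(b => b.name === 'java.lang:type=Runtime');
      if (!runtime) throw new Error('Runtime bean not found in the JMX payload');

      const uptimeSeconds = runtime.Uptime * 0.001;  // mirrors the MULTIPLIER: 0.001 step
      console.log('ResourceManager uptime, s:', uptimeSeconds);
    }

    main().catch(console.error);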

This template was tested on:

  • Zabbix, version 5.0 and later
  • Hadoop, version 3.1 and later

Setup

See Zabbix template operation for basic instructions.

Define the IP address (or FQDN) and the Web-UI port of the ResourceManager in the {$HADOOP.RESOURCEMANAGER.HOST} and {$HADOOP.RESOURCEMANAGER.PORT} macros, and of the NameNode in the {$HADOOP.NAMENODE.HOST} and {$HADOOP.NAMENODE.PORT} macros. Macros can be set in the template or overridden at the host level.
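
For example, the macro values for a cluster whose ResourceManager web UI listens on rm1.example.com:8088 and whose NameNode web UI listens on nn1.example.com:9870 could look like this (the hostnames are placeholders; the ports shown are the Hadoop 3.x defaults and match the template's default macro values):

    {$HADOOP.RESOURCEMANAGER.HOST} = rm1.example.com
    {$HADOOP.RESOURCEMANAGER.PORT} = 8088
    {$HADOOP.NAMENODE.HOST}        = nn1.example.com
    {$HADOOP.NAMENODE.PORT}        = 9870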

Zabbix configuration

No specific Zabbix configuration is required.

Macros used

Name | Description | Default

{$HADOOP.CAPACITY_REMAINING.MIN.WARN}

The minimum remaining Hadoop cluster capacity, in percent, used in the trigger expression.

Default: 20

{$HADOOP.NAMENODE.HOST}

The Hadoop NameNode host IP address or FQDN.

Default: NameNode

{$HADOOP.NAMENODE.PORT}

The Hadoop NameNode Web-UI port.

Default: 9870

{$HADOOP.NAMENODE.RESPONSE_TIME.MAX.WARN}

The maximum acceptable Hadoop NameNode API page response time, in seconds, used in the trigger expression.

Default: 10s

{$HADOOP.RESOURCEMANAGER.HOST}

The Hadoop ResourceManager host IP address or FQDN.

Default: ResourceManager

{$HADOOP.RESOURCEMANAGER.PORT}

The Hadoop ResourceManager Web-UI port.

Default: 8088

{$HADOOP.RESOURCEMANAGER.RESPONSE_TIME.MAX.WARN}

The maximum acceptable Hadoop ResourceManager API page response time, in seconds, used in the trigger expression.

Default: 10s

Template links

There are no template links in this template.

Discovery rules

Name | Description | Type | Key and additional info
Node manager discovery

-

HTTP_AGENT hadoop.nodemanager.discovery

Preprocessing:

- JAVASCRIPT: Text is too long. Please see the template.

Data node discovery

-

HTTP_AGENT hadoop.datanode.discovery

Preprocessing:

- JAVASCRIPT: Text is too long. Please see the template.
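
The discovery scripts themselves ship with the template (their text is truncated in the table above). Conceptually, they turn the node lists returned by the raw "Get NodeManagers states" and "Get DataNodes states" items into Zabbix low-level discovery JSON with a {#HOSTNAME} macro per node. A rough, hypothetical sketch of that transformation (not the template's actual script, and with an invented sample input) looks like this:

    // Rough sketch: build LLD rows with a {#HOSTNAME} macro from an array of node objects.
    function buildDiscovery(value) {
      const nodes = JSON.parse(value);   // e.g. the already-extracted node list
      return JSON.stringify(nodes.map(node => ({ '{#HOSTNAME}': node.HostName })));
    }

    // Invented example input shaped like the node objects referenced by the items below.
    const input = JSON.stringify([
      { HostName: 'worker-1.example.com', State: 'RUNNING' },
      { HostName: 'worker-2.example.com', State: 'RUNNING' }
    ]);

    console.log(buildDiscovery(input));
    // [{"{#HOSTNAME}":"worker-1.example.com"},{"{#HOSTNAME}":"worker-2.example.com"}]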

Items collected

Group | Name | Description | Type | Key and additional info
Hadoop ResourceManager: Service status

Hadoop ResourceManager API port availability.

SIMPLE net.tcp.service["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"]

Preprocessing:

- DISCARD_UNCHANGED_HEARTBEAT: 10m

Hadoop ResourceManager: Service response time

Hadoop ResourceManager API performance.

SIMPLE net.tcp.service.perf["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"]

Hadoop ResourceManager: Uptime

DEPENDENT hadoop.resourcemanager.uptime

Preprocessing:

- JSONPATH: $.beans[?(@.name=='java.lang:type=Runtime')].Uptime.first()

- MULTIPLIER: 0.001

Hadoop ResourceManager: RPC queue & processing time

Average time spent on processing RPC requests.

DEPENDENT hadoop.resourcemanager.rpc_processing_time_avg

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=ResourceManager,name=RpcActivityForPort8031')].RpcProcessingTimeAvgTime.first()

Hadoop ResourceManager: Active NMs

Number of Active NodeManagers.

DEPENDENT hadoop.resourcemanager.num_active_nm

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=ResourceManager,name=ClusterMetrics')].NumActiveNMs.first()

- DISCARD_UNCHANGED_HEARTBEAT: 6h

Hadoop ResourceManager: Decommissioning NMs

Number of Decommissioning NodeManagers.

DEPENDENT hadoop.resourcemanager.num_decommissioning_nm

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=ResourceManager,name=ClusterMetrics')].NumDecommissioningNMs.first()

- DISCARD_UNCHANGED_HEARTBEAT: 6h

Hadoop ResourceManager: Decommissioned NMs

Number of Decommissioned NodeManagers.

DEPENDENT hadoop.resourcemanager.num_decommissioned_nm

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=ResourceManager,name=ClusterMetrics')].NumDecommissionedNMs.first()

Hadoop ResourceManager: Lost NMs

Number of Lost NodeManagers.

DEPENDENT hadoop.resourcemanager.num_lost_nm

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=ResourceManager,name=ClusterMetrics')].NumLostNMs.first()

- DISCARD_UNCHANGED_HEARTBEAT: 6h

Hadoop ResourceManager: Unhealthy NMs

Number of Unhealthy NodeManagers.

DEPENDENT hadoop.resourcemanager.num_unhealthy_nm

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=ResourceManager,name=ClusterMetrics')].NumUnhealthyNMs.first()

Hadoop ResourceManager: Rebooted NMs

Number of Rebooted NodeManagers.

DEPENDENT hadoop.resourcemanager.num_rebooted_nm

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=ResourceManager,name=ClusterMetrics')].NumRebootedNMs.first()

Hadoop ResourceManager: Shutdown NMs

Number of Shutdown NodeManagers.

DEPENDENT hadoop.resourcemanager.num_shutdown_nm

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=ResourceManager,name=ClusterMetrics')].NumShutdownNMs.first()

Hadoop NameNode: Service status

Hadoop NameNode API port availability.

SIMPLE net.tcp.service["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"]

Preprocessing:

- DISCARD_UNCHANGED_HEARTBEAT: 10m

Hadoop NameNode: Service response time

Hadoop NameNode API performance.

SIMPLE net.tcp.service.perf["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"]

Hadoop NameNode: Uptime

DEPENDENT hadoop.namenode.uptime

Preprocessing:

- JSONPATH: $.beans[?(@.name=='java.lang:type=Runtime')].Uptime.first()

- MULTIPLIER: 0.001

Hadoop NameNode: RPC queue & processing time

Average time spent on processing RPC requests.

DEPENDENT hadoop.namenode.rpc_processing_time_avg

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=NameNode,name=RpcActivityForPort9000')].RpcProcessingTimeAvgTime.first()

Hadoop NameNode: Block Pool Renaming

DEPENDENT hadoop.namenode.percent_block_pool_used

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=NameNode,name=NameNodeInfo')].PercentBlockPoolUsed.first()

Hadoop NameNode: Transactions since last checkpoint

Total number of transactions since last checkpoint.

DEPENDENT hadoop.namenode.transactions_since_last_checkpoint

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=NameNode,name=FSNamesystem')].TransactionsSinceLastCheckpoint.first()

Hadoop NameNode: Percent capacity remaining

Available capacity in percent.

DEPENDENT hadoop.namenode.percent_remaining

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=NameNode,name=NameNodeInfo')].PercentRemaining.first()

- DISCARD_UNCHANGED_HEARTBEAT: 6h

Hadoop NameNode: Capacity remaining

Available capacity.

DEPENDENT hadoop.namenode.capacity_remaining

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=NameNode,name=FSNamesystem')].CapacityRemaining.first()

Hadoop NameNode: Corrupt blocks

Number of corrupt blocks.

DEPENDENT hadoop.namenode.corrupt_blocks

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=NameNode,name=FSNamesystem')].CorruptBlocks.first()

Hadoop NameNode: Missing blocks

Number of missing blocks.

DEPENDENT hadoop.namenode.missing_blocks

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=NameNode,name=FSNamesystem')].MissingBlocks.first()

Hadoop NameNode: Failed volumes

Number of failed volumes.

DEPENDENT hadoop.namenode.volume_failures_total

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=NameNode,name=FSNamesystem')].VolumeFailuresTotal.first()

Hadoop NameNode: Alive DataNodes

Count of alive DataNodes.

DEPENDENT hadoop.namenode.num_live_data_nodes

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=NameNode,name=FSNamesystem')].NumLiveDataNodes.first()

- DISCARD_UNCHANGED_HEARTBEAT: 6h

Hadoop NameNode: Dead DataNodes

Count of dead DataNodes.

DEPENDENT hadoop.namenode.num_dead_data_nodes

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=NameNode,name=FSNamesystem')].NumDeadDataNodes.first()

- DISCARD_UNCHANGED_HEARTBEAT: 6h

Hadoop NameNode: Stale DataNodes

DataNodes that do not send a heartbeat within 30 seconds are marked as "stale".

DEPENDENT hadoop.namenode.num_stale_data_nodes

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=NameNode,name=FSNamesystem')].StaleDataNodes.first()

- DISCARD_UNCHANGED_HEARTBEAT: 6h

Hadoop NameNode: Total files

Total count of files tracked by the NameNode.

DEPENDENT hadoop.namenode.files_total

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=NameNode,name=FSNamesystem')].FilesTotal.first()

Hadoop NameNode: Total load

The current number of concurrent file accesses (read/write) across all DataNodes.

DEPENDENT hadoop.namenode.total_load

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=NameNode,name=FSNamesystem')].TotalLoad.first()

Hadoop NameNode: Blocks allocable

Maximum number of blocks allocable.

DEPENDENT hadoop.namenode.block_capacity

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=NameNode,name=FSNamesystem')].BlockCapacity.first()

Hadoop NameNode: Total blocks

Count of blocks tracked by NameNode.

DEPENDENT hadoop.namenode.blocks_total

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=NameNode,name=FSNamesystem')].BlocksTotal.first()

Hadoop NameNode: Under-replicated blocks

The number of blocks with insufficient replication.

DEPENDENT hadoop.namenode.under_replicated_blocks

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=NameNode,name=FSNamesystem')].UnderReplicatedBlocks.first()

Hadoop {#HOSTNAME}: RPC queue & processing time

Average time spent on processing RPC requests.

DEPENDENT hadoop.nodemanager.rpc_processing_time_avg[{#HOSTNAME}]

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=NodeManager,name=RpcActivityForPort8040')].RpcProcessingTimeAvgTime.first()

Hadoop {#HOSTNAME}: Container launch avg duration

DEPENDENT hadoop.nodemanager.container_launch_duration_avg[{#HOSTNAME}]

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=NodeManager,name=NodeManagerMetrics')].ContainerLaunchDurationAvgTime.first()

Hadoop {#HOSTNAME}: JVM Threads

The number of JVM threads.

DEPENDENT hadoop.nodemanager.jvm.threads[{#HOSTNAME}]

Preprocessing:

- JSONPATH: $.beans[?(@.name=='java.lang:type=Threading')].ThreadCount.first()

Hadoop {#HOSTNAME}: JVM Garbage collection time

The JVM garbage collection time in milliseconds.

DEPENDENT hadoop.nodemanager.jvm.gc_time[{#HOSTNAME}]

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=NodeManager,name=JvmMetrics')].GcTimeMillis.first()

Hadoop {#HOSTNAME}: JVM Heap usage

The JVM heap usage in MBytes.

DEPENDENT hadoop.nodemanager.jvm.mem_heap_used[{#HOSTNAME}]

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=NodeManager,name=JvmMetrics')].MemHeapUsedM.first()

Hadoop {#HOSTNAME}: Uptime

DEPENDENT hadoop.nodemanager.uptime[{#HOSTNAME}]

Preprocessing:

- JSONPATH: $.beans[?(@.name=='java.lang:type=Runtime')].Uptime.first()

- MULTIPLIER: 0.001

Hadoop {#HOSTNAME}: State

State of the node - valid values are: NEW, RUNNING, UNHEALTHY, DECOMMISSIONING, DECOMMISSIONED, LOST, REBOOTED, SHUTDOWN.

DEPENDENT hadoop.nodemanager.state[{#HOSTNAME}]

Preprocessing:

- JSONPATH: $[?(@.HostName=='{#HOSTNAME}')].State.first()

- DISCARD_UNCHANGED_HEARTBEAT: 6h

Hadoop {#HOSTNAME}: Version

DEPENDENT hadoop.nodemanager.version[{#HOSTNAME}]

Preprocessing:

- JSONPATH: $[?(@.HostName=='{#HOSTNAME}')].NodeManagerVersion.first()

- DISCARD_UNCHANGED_HEARTBEAT: 6h

Hadoop {#HOSTNAME}: Number of containers

DEPENDENT hadoop.nodemanager.numcontainers[{#HOSTNAME}]

Preprocessing:

- JSONPATH: $[?(@.HostName=='{#HOSTNAME}')].NumContainers.first()

Hadoop {#HOSTNAME}: Used memory

DEPENDENT hadoop.nodemanager.usedmemory[{#HOSTNAME}]

Preprocessing:

- JSONPATH: $[?(@.HostName=='{#HOSTNAME}')].UsedMemoryMB.first()

Hadoop {#HOSTNAME}: Available memory

DEPENDENT hadoop.nodemanager.availablememory[{#HOSTNAME}]

Preprocessing:

- JSONPATH: $[?(@.HostName=='{#HOSTNAME}')].AvailableMemoryMB.first()

Hadoop {#HOSTNAME}: Remaining

Remaining disk space.

DEPENDENT hadoop.datanode.remaining[{#HOSTNAME}]

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=DataNode,name=FSDatasetState')].Remaining.first()

Hadoop {#HOSTNAME}: Used

Used disk space.

DEPENDENT hadoop.datanode.dfs_used[{#HOSTNAME}]

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=DataNode,name=FSDatasetState')].DfsUsed.first()

Hadoop {#HOSTNAME}: Number of failed volumes

Number of failed storage volumes.

DEPENDENT hadoop.datanode.numfailedvolumes[{#HOSTNAME}]

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=DataNode,name=FSDatasetState')].NumFailedVolumes.first()

Hadoop {#HOSTNAME}: JVM Threads

The number of JVM threads.

DEPENDENT hadoop.datanode.jvm.threads[{#HOSTNAME}]

Preprocessing:

- JSONPATH: $.beans[?(@.name=='java.lang:type=Threading')].ThreadCount.first()

Hadoop {#HOSTNAME}: JVM Garbage collection time

The JVM garbage collection time in milliseconds.

DEPENDENT hadoop.datanode.jvm.gc_time[{#HOSTNAME}]

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=DataNode,name=JvmMetrics')].GcTimeMillis.first()

Hadoop {#HOSTNAME}: JVM Heap usage

The JVM heap usage in MBytes.

DEPENDENT hadoop.datanode.jvm.mem_heap_used[{#HOSTNAME}]

Preprocessing:

- JSONPATH: $.beans[?(@.name=='Hadoop:service=DataNode,name=JvmMetrics')].MemHeapUsedM.first()

Hadoop {#HOSTNAME}: Uptime

DEPENDENT hadoop.datanode.uptime[{#HOSTNAME}]

Preprocessing:

- JSONPATH: $.beans[?(@.name=='java.lang:type=Runtime')].Uptime.first()

- MULTIPLIER: 0.001

Hadoop {#HOSTNAME}: Version

DataNode software version.

DEPENDENT hadoop.datanode.version[{#HOSTNAME}]

Preprocessing:

- JSONPATH: $.[?(@.HostName=='{#HOSTNAME}')].version.first()

- DISCARD_UNCHANGED_HEARTBEAT: 6h

Hadoop {#HOSTNAME}: Admin state

Administrative state.

DEPENDENT hadoop.datanode.admin_state[{#HOSTNAME}]

Preprocessing:

- JSONPATH: $.[?(@.HostName=='{#HOSTNAME}')].adminState.first()

- DISCARD_UNCHANGED_HEARTBEAT: 6h

Hadoop {#HOSTNAME}: Oper state

Operational state.

DEPENDENT hadoop.datanode.oper_state[{#HOSTNAME}]

Preprocessing:

- JSONPATH: $.[?(@.HostName=='{#HOSTNAME}')].operState.first()

- DISCARD_UNCHANGED_HEARTBEAT: 6h

Zabbix_raw_items: Get ResourceManager stats

-

HTTP_AGENT hadoop.resourcemanager.get

Zabbix_raw_items: Get NameNode stats

-

HTTP_AGENT hadoop.namenode.get

Zabbix_raw_items: Get NodeManagers states

-

HTTP_AGENT hadoop.nodemanagers.get

Preprocessing:

- JAVASCRIPT: return JSON.stringify(JSON.parse(JSON.parse(value).beans[0].LiveNodeManagers))

Zabbix_raw_items: Get DataNodes states

-

HTTP_AGENT hadoop.datanodes.get

Preprocessing:

- JAVASCRIPT: Text is too long. Please see the template.

Zabbix_raw_items: Hadoop NodeManager {#HOSTNAME}: Get stats

HTTP_AGENT hadoop.nodemanager.get[{#HOSTNAME}]

Zabbix_raw_items: Hadoop DataNode {#HOSTNAME}: Get stats

HTTP_AGENT hadoop.datanode.get[{#HOSTNAME}]
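
To make the dependent-item pattern above concrete, the following sketch emulates one JSONPath step, the "Active NMs" extraction, in plain JavaScript, the way a preprocessing step sees the raw "Get ResourceManager stats" payload in its value argument. The sample payload is invented for illustration and only mimics the shape of the JMX output.

    // Sketch only: equivalent of the JSONPath
    // $.beans[?(@.name=='Hadoop:service=ResourceManager,name=ClusterMetrics')].NumActiveNMs.first()
    function extractActiveNMs(value) {
      const jmx = JSON.parse(value);
      const clusterMetrics = jmx.beans.find(
        b => b.name === 'Hadoop:service=ResourceManager,name=ClusterMetrics'
      );
      return clusterMetrics ? clusterMetrics.NumActiveNMs : null;
    }

    // Invented sample payload mimicking the ResourceManager JMX output shape.
    const sample = JSON.stringify({
      beans: [
        { name: 'Hadoop:service=ResourceManager,name=ClusterMetrics', NumActiveNMs: 3, NumUnhealthyNMs: 0 }
      ]
    });

    console.log(extractActiveNMs(sample)); // 3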

Triggers

Name | Description | Expression | Severity | Dependencies and additional info
ResourceManager: Service is unavailable

-

{TEMPLATE_NAME:net.tcp.service["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"].last()}=0 AVERAGE

Manual close: YES

ResourceManager: Service response time is too high (over {$HADOOP.RESOURCEMANAGER.RESPONSE_TIME.MAX.WARN} for 5m)

-

{TEMPLATE_NAME:net.tcp.service.perf["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"].min(5m)}>{$HADOOP.RESOURCEMANAGER.RESPONSE_TIME.MAX.WARN} WARNING

Manual close: YES

Depends on:

- ResourceManager: Service is unavailable

ResourceManager: Service has been restarted (uptime < 10m)

Uptime is less than 10 minutes.

{TEMPLATE_NAME:hadoop.resourcemanager.uptime.last()}<10m INFO

Manual close: YES

ResourceManager: Failed to fetch ResourceManager API page (or no data for 30m)

Zabbix has not received data for items for the last 30 minutes.

{TEMPLATE_NAME:hadoop.resourcemanager.uptime.nodata(30m)}=1 WARNING

Manual close: YES

Depends on:

- ResourceManager: Service is unavailable

ResourceManager: Cluster has no active NodeManagers

Cluster is unable to execute any jobs without at least one NodeManager.

{TEMPLATE_NAME:hadoop.resourcemanager.num_active_nm.max(5m)}=0 HIGH
ResourceManager: Cluster has unhealthy NodeManagers

YARN considers any node whose disk utilization exceeds the value of the yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage property (in yarn-site.xml) to be unhealthy. Ample disk space is critical for uninterrupted operation of a Hadoop cluster, so a large number of unhealthy nodes (the number worth alerting on depends on the size of your cluster) should be investigated and resolved quickly.

{TEMPLATE_NAME:hadoop.resourcemanager.num_unhealthy_nm.min(15m)}>0 AVERAGE
NameNode: Service is unavailable

-

{TEMPLATE_NAME:net.tcp.service["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"].last()}=0 AVERAGE

Manual close: YES

NameNode: Service response time is too high (over {$HADOOP.NAMENODE.RESPONSE_TIME.MAX.WARN} for 5m)

-

{TEMPLATE_NAME:net.tcp.service.perf["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"].min(5m)}>{$HADOOP.NAMENODE.RESPONSE_TIME.MAX.WARN} WARNING

Manual close: YES

Depends on:

- NameNode: Service is unavailable

NameNode: Service has been restarted (uptime < 10m)

Uptime is less than 10 minutes.

{TEMPLATE_NAME:hadoop.namenode.uptime.last()}<10m INFO

Manual close: YES

NameNode: Failed to fetch NameNode API page (or no data for 30m)

Zabbix has not received data for items for the last 30 minutes.

{TEMPLATE_NAME:hadoop.namenode.uptime.nodata(30m)}=1 WARNING

Manual close: YES

Depends on:

- NameNode: Service is unavailable

NameNode: Cluster capacity remaining is low (below {$HADOOP.CAPACITY_REMAINING.MIN.WARN}% for 15m)

A good practice is to ensure that disk use never exceeds 80 percent capacity.

{TEMPLATE_NAME:hadoop.namenode.percent_remaining.max(15m)}<{$HADOOP.CAPACITY_REMAINING.MIN.WARN} WARNING
NameNode: Cluster has missing blocks

A missing block is far worse than a corrupt block, because a missing block cannot be recovered by copying a replica.

{TEMPLATE_NAME:hadoop.namenode.missing_blocks.min(15m)}>0 AVERAGE
NameNode: Cluster has volume failures

HDFS allows disks to fail in place, without affecting DataNode operations, until a threshold is reached. The threshold is set on each DataNode via the dfs.datanode.failed.volumes.tolerated property; it defaults to 0, meaning that any volume failure will shut down the DataNode. On a production cluster where DataNodes typically have 6, 8, or 12 disks, setting this parameter to 1 or 2 is usually best practice.

{TEMPLATE_NAME:hadoop.namenode.volume_failures_total.min(15m)}>0 AVERAGE
NameNode: Cluster has DataNodes in Dead state

The death of a DataNode causes a flurry of network activity, as the NameNode initiates replication of blocks lost on the dead nodes.

{TEMPLATE_NAME:hadoop.namenode.num_dead_data_nodes.min(5m)}>0 AVERAGE
{#HOSTNAME}: Service has been restarted (uptime < 10m)

Uptime is less than 10 minutes.

{TEMPLATE_NAME:hadoop.nodemanager.uptime[{#HOSTNAME}].last()}<10m INFO

Manual close: YES

{#HOSTNAME}: Failed to fetch NodeManager API page (or no data for 30m)

Zabbix has not received data for items for the last 30 minutes.

{TEMPLATE_NAME:hadoop.nodemanager.uptime[{#HOSTNAME}].nodata(30m)}=1 WARNING

Manual close: YES

Depends on:

- {#HOSTNAME}: NodeManager has state {ITEM.VALUE}.

{#HOSTNAME}: NodeManager has state {ITEM.VALUE}.

The state is different from normal.

{TEMPLATE_NAME:hadoop.nodemanager.state[{#HOSTNAME}].last()}<>"RUNNING" AVERAGE
{#HOSTNAME}: Service has been restarted (uptime < 10m)

Uptime is less than 10 minutes.

{TEMPLATE_NAME:hadoop.datanode.uptime[{#HOSTNAME}].last()}<10m INFO

Manual close: YES

{#HOSTNAME}: Failed to fetch DataNode API page (or no data for 30m)

Zabbix has not received data for items for the last 30 minutes.

{TEMPLATE_NAME:hadoop.datanode.uptime[{#HOSTNAME}].nodata(30m)}=1 WARNING

Manual close: YES

Depends on:

- {#HOSTNAME}: DataNode has state {ITEM.VALUE}.

{#HOSTNAME}: DataNode has state {ITEM.VALUE}.

The state is different from normal.

{TEMPLATE_NAME:hadoop.datanode.oper_state[{#HOSTNAME}].last()}<>"Live" AVERAGE

Feedback

Please report any issues with the template at https://support.zabbix.com

You can also provide feedback, discuss the template, or ask for help with it at the ZABBIX forums.

References

https://hadoop.apache.org/docs/current/
