Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/kafka_jmx?at=release/7.0
Apache Kafka by JMX
Overview
This template is designed for the effortless deployment of Apache Kafka monitoring by Zabbix via JMX and doesn't require any external scripts.
Requirements
Zabbix version: 7.0 and higher.
Tested versions
This template has been tested on:
- Apache Kafka 2.6.0
Configuration
Zabbix should be configured according to the instructions in the Templates out of the box section.
Setup
Metrics are collected by JMX.
- Enable and configure JMX access to Apache Kafka. See documentation for instructions.
- Set the user name and password in host macros {$KAFKA.USER} and {$KAFKA.PASSWORD}.
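As a rough illustration of the first step, the sketch below assembles the standard JVM remote-JMX system properties into a value for the `KAFKA_JMX_OPTS` environment variable, which Kafka's stock start scripts read. The port number and file paths are hypothetical, SSL is disabled only for brevity, and the helper name `build_kafka_jmx_opts` is not part of Kafka or Zabbix. The credentials stored in the password file would be the ones placed in {$KAFKA.USER} and {$KAFKA.PASSWORD}.

```python
def build_kafka_jmx_opts(port, password_file=None, access_file=None):
    """Build a KAFKA_JMX_OPTS string that enables remote JMX on `port`.

    Authentication is switched on only when a password file is given.
    SSL is disabled here purely to keep the sketch short.
    """
    authenticate = password_file is not None
    opts = [
        "-Dcom.sun.management.jmxremote",
        f"-Dcom.sun.management.jmxremote.port={port}",
        f"-Dcom.sun.management.jmxremote.rmi.port={port}",
        f"-Dcom.sun.management.jmxremote.authenticate={str(authenticate).lower()}",
        "-Dcom.sun.management.jmxremote.ssl=false",
    ]
    if password_file is not None:
        opts.append(f"-Dcom.sun.management.jmxremote.password.file={password_file}")
    if access_file is not None:
        opts.append(f"-Dcom.sun.management.jmxremote.access.file={access_file}")
    return " ".join(opts)


# Hypothetical port and file locations; adjust them to the actual broker host.
jmx_opts = build_kafka_jmx_opts(
    9999,
    password_file="/etc/kafka/jmxremote.password",
    access_file="/etc/kafka/jmxremote.access",
)
print(f'export KAFKA_JMX_OPTS="{jmx_opts}"')
```

The printed line can be added to the environment of the broker process before it starts. The Zabbix Java gateway must then be able to reach the chosen port on the broker host.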
Macros used
Name | Description | Default |
---|---|---|
{$KAFKA.USER} | | zabbix |
{$KAFKA.PASSWORD} | | zabbix |
{$KAFKA.TOPIC.MATCHES} | Filter of discoverable topics | .* |
{$KAFKA.TOPIC.NOT_MATCHES} | Filter to exclude discovered topics | __consumer_offsets |
{$KAFKA.NET_PROC_AVG_IDLE.MIN.WARN} | The minimum Network processor average idle percent for trigger expression. | 30 |
{$KAFKA.REQUEST_HANDLER_AVG_IDLE.MIN.WARN} | The minimum Request handler average idle percent for trigger expression. | 30 |
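The two topic macros act as an include filter and an exclude filter. The sketch below mimics that behavior with Python's `re` module, assuming the discovery rules apply {$KAFKA.TOPIC.MATCHES} as a "matches" condition and {$KAFKA.TOPIC.NOT_MATCHES} as a "does not match" condition on the topic name, and that both are unanchored regular expressions. Zabbix itself uses PCRE, so this is only an approximation for simple patterns. The topic names are made up.

```python
import re

# Default values of the template macros.
TOPIC_MATCHES = r".*"
TOPIC_NOT_MATCHES = r"__consumer_offsets"


def is_discovered(topic, matches=TOPIC_MATCHES, not_matches=TOPIC_NOT_MATCHES):
    """Return True if a topic passes both the include and the exclude filter."""
    included = re.search(matches, topic) is not None
    excluded = re.search(not_matches, topic) is not None
    return included and not excluded


# Hypothetical topic names as they might be reported by the broker.
topics = ["orders", "__consumer_offsets", "payments"]
discovered = [t for t in topics if is_discovered(t)]
print(discovered)
```

Narrowing discovery to a subset of topics would then be a matter of overriding {$KAFKA.TOPIC.MATCHES} on the host, for example with a pattern such as `^orders`.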
Items
Name | Description | Type | Key and additional info |
---|---|---|---|
Leader election per second | Number of leader elections per second. | JMX agent | jmx["kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs","Count"] |
Unclean leader election per second | Number of “unclean” elections per second. | JMX agent | jmx["kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec","Count"] Preprocessing |
Controller state on broker | A value of one indicates that the broker is the controller for the cluster. | JMX agent | jmx["kafka.controller:type=KafkaController,name=ActiveControllerCount","Value"] Preprocessing |
Ineligible pending replica deletes | The number of ineligible pending replica deletes. | JMX agent | jmx["kafka.controller:type=KafkaController,name=ReplicasIneligibleToDeleteCount","Value"] |
Pending replica deletes | The number of pending replica deletes. | JMX agent | jmx["kafka.controller:type=KafkaController,name=ReplicasToDeleteCount","Value"] |
Ineligible pending topic deletes | The number of ineligible pending topic deletes. | JMX agent | jmx["kafka.controller:type=KafkaController,name=TopicsIneligibleToDeleteCount","Value"] |
Pending topic deletes | The number of pending topic deletes. | JMX agent | jmx["kafka.controller:type=KafkaController,name=TopicsToDeleteCount","Value"] |
Offline log directory count | The number of offline log directories (for example, after a hardware failure). | JMX agent | jmx["kafka.log:type=LogManager,name=OfflineLogDirectoryCount","Value"] |
Offline partitions count | Number of partitions that don't have an active leader. | JMX agent | jmx["kafka.controller:type=KafkaController,name=OfflinePartitionsCount","Value"] |
Bytes out per second | The rate at which data is fetched and read from the broker by consumers. | JMX agent | jmx["kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec","Count"] Preprocessing |
Bytes in per second | The rate at which data sent from producers is consumed by the broker. | JMX agent | jmx["kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec","Count"] Preprocessing |
Messages in per second | The rate at which individual messages are consumed by the broker. | JMX agent | jmx["kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec","Count"] Preprocessing |
Bytes rejected per second | The rate at which bytes are rejected by the broker. | JMX agent | jmx["kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec","Count"] Preprocessing |
Client fetch request failed per second | Number of client fetch request failures per second. | JMX agent | jmx["kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec","Count"] Preprocessing |
Produce requests failed per second | Number of failed produce requests per second. | JMX agent | jmx["kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec","Count"] Preprocessing |
Request handler average idle percent | Indicates the percentage of time that the request handler (IO) threads are not in use. | JMX agent | jmx["kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent","OneMinuteRate"] Preprocessing |
Fetch-Consumer response send time, mean | Average time taken, in milliseconds, to send the response. | JMX agent | jmx["kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=FetchConsumer","Mean"] |
Fetch-Consumer response send time, p95 | The 95th percentile of the time taken, in milliseconds, to send the response. | JMX agent | jmx["kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=FetchConsumer","95thPercentile"] |
Fetch-Consumer response send time, p99 | The 99th percentile of the time taken, in milliseconds, to send the response. | JMX agent | jmx["kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=FetchConsumer","99thPercentile"] |
Fetch-Follower response send time, mean | Average time taken, in milliseconds, to send the response. | JMX agent | jmx["kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=FetchFollower","Mean"] |
Fetch-Follower response send time, p95 | The 95th percentile of the time taken, in milliseconds, to send the response. | JMX agent | jmx["kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=FetchFollower","95thPercentile"] |
Fetch-Follower response send time, p99 | The 99th percentile of the time taken, in milliseconds, to send the response. | JMX agent | jmx["kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=FetchFollower","99thPercentile"] |
Produce response send time, mean | Average time taken, in milliseconds, to send the response. | JMX agent | jmx["kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=Produce","Mean"] |
Produce response send time, p95 | The 95th percentile of the time taken, in milliseconds, to send the response. | JMX agent | jmx["kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=Produce","95thPercentile"] |
Produce response send time, p99 | The 99th percentile of the time taken, in milliseconds, to send the response. | JMX agent | jmx["kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=Produce","99thPercentile"] |
Fetch-Consumer request total time, mean | Average time in ms to serve the Fetch-Consumer request. | JMX agent | jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer","Mean"] |
Fetch-Consumer request total time, p95 | The 95th percentile of the time in ms to serve the Fetch-Consumer request. | JMX agent | jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer","95thPercentile"] |
Fetch-Consumer request total time, p99 | The 99th percentile of the time in ms to serve the Fetch-Consumer request. | JMX agent | jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer","99thPercentile"] |
Fetch-Follower request total time, mean | Average time in ms to serve the Fetch-Follower request. | JMX agent | jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower","Mean"] |
Fetch-Follower request total time, p95 | The 95th percentile of the time in ms to serve the Fetch-Follower request. | JMX agent | jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower","95thPercentile"] |
Fetch-Follower request total time, p99 | The 99th percentile of the time in ms to serve the Fetch-Follower request. | JMX agent | jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower","99thPercentile"] |
Produce request total time, mean | Average time in ms to serve the Produce request. | JMX agent | jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce","Mean"] |
Produce request total time, p95 | The 95th percentile of the time in ms to serve the Produce request. | JMX agent | jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce","95thPercentile"] |
Produce request total time, p99 | The 99th percentile of the time in ms to serve the Produce request. | JMX agent | jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce","99thPercentile"] |
UpdateMetadata request total time, mean | Average time for a request to update metadata. | JMX agent | jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=UpdateMetadata","Mean"] |
UpdateMetadata request total time, p95 | The 95th percentile of the time for a request to update metadata. | JMX agent | jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=UpdateMetadata","95thPercentile"] |
UpdateMetadata request total time, p99 | The 99th percentile of the time for a request to update metadata. | JMX agent | jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=UpdateMetadata","99thPercentile"] |
Temporary memory size in bytes (Fetch), max | The maximum of temporary memory used for converting message formats and decompressing messages. | JMX agent | jmx["kafka.network:type=RequestMetrics,name=TemporaryMemoryBytes,request=Fetch","Max"] |
Temporary memory size in bytes (Fetch), min | The minimum of temporary memory used for converting message formats and decompressing messages. | JMX agent | jmx["kafka.network:type=RequestMetrics,name=TemporaryMemoryBytes,request=Fetch","Mean"] |
Temporary memory size in bytes (Produce), max | The maximum of temporary memory used for converting message formats and decompressing messages. | JMX agent | jmx["kafka.network:type=RequestMetrics,name=TemporaryMemoryBytes,request=Produce","Max"] |
Temporary memory size in bytes (Produce), avg | The average amount of temporary memory used for converting message formats and decompressing messages. | JMX agent | jmx["kafka.network:type=RequestMetrics,name=TemporaryMemoryBytes,request=Produce","Mean"] |
Temporary memory size in bytes (Produce), min | The minimum of temporary memory used for converting message formats and decompressing messages. | JMX agent | jmx["kafka.network:type=RequestMetrics,name=TemporaryMemoryBytes,request=Produce","Min"] |
Network processor average idle percent | The average percentage of time that the network processors are idle. | JMX agent | jmx["kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent","Value"] Preprocessing |
Requests in producer purgatory | Number of requests waiting in producer purgatory. | JMX agent | jmx["kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Fetch","Value"] |
Requests in fetch purgatory | Number of requests waiting in fetch purgatory. | JMX agent | jmx["kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Produce","Value"] |
Replication maximum lag | The maximum lag between the time that messages are received by the leader replica and by the follower replicas. | JMX agent | jmx["kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica","Value"] |
Under minimum ISR partition count | The number of partitions under the minimum In-Sync Replica (ISR) count. | JMX agent | jmx["kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount","Value"] |
Under replicated partitions | The number of partitions that have not been fully replicated in the follower replicas (the number of non-reassigning replicas - the number of ISR > 0). | JMX agent | jmx["kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions","Value"] |
ISR expands per second | The rate at which the number of ISRs in the broker increases. | JMX agent | jmx["kafka.server:type=ReplicaManager,name=IsrExpandsPerSec","Count"] Preprocessing |
ISR shrink per second | Rate of replicas leaving the ISR pool. | JMX agent | jmx["kafka.server:type=ReplicaManager,name=IsrShrinksPerSec","Count"] Preprocessing |
Leader count | The number of replicas for which this broker is the leader. | JMX agent | jmx["kafka.server:type=ReplicaManager,name=LeaderCount","Value"] |
Partition count | The number of partitions in the broker. | JMX agent | jmx["kafka.server:type=ReplicaManager,name=PartitionCount","Value"] |
Number of reassigning partitions | The number of reassigning leader partitions on a broker. | JMX agent | jmx["kafka.server:type=ReplicaManager,name=ReassigningPartitions","Value"] |
Request queue size | The size of the delay queue. | JMX agent | jmx["kafka.server:type=Request","queue-size"] |
Version | Current version of broker. | JMX agent | jmx["kafka.server:type=app-info","version"] Preprocessing |
Uptime | The service uptime expressed in seconds. | JMX agent | jmx["kafka.server:type=app-info","start-time-ms"] Preprocessing |
ZooKeeper client request latency | Latency in milliseconds for ZooKeeper requests from broker. | JMX agent | jmx["kafka.server:type=ZooKeeperClientMetrics,name=ZooKeeperRequestLatencyMs","Count"] |
ZooKeeper connection status | Connection status of broker's ZooKeeper session. | JMX agent | jmx["kafka.server:type=SessionExpireListener,name=SessionState","Value"] Preprocessing |
ZooKeeper disconnect rate | ZooKeeper client disconnects per second. | JMX agent | jmx["kafka.server:type=SessionExpireListener,name=ZooKeeperDisconnectsPerSec","Count"] Preprocessing |
ZooKeeper session expiration rate | ZooKeeper client session expirations per second. | JMX agent | jmx["kafka.server:type=SessionExpireListener,name=ZooKeeperExpiresPerSec","Count"] Preprocessing |
ZooKeeper readonly rate | ZooKeeper client read-only connects per second. | JMX agent | jmx["kafka.server:type=SessionExpireListener,name=ZooKeeperReadOnlyConnectsPerSec","Count"] Preprocessing |
ZooKeeper sync rate | ZooKeeper client sync connects per second. | JMX agent | jmx["kafka.server:type=SessionExpireListener,name=ZooKeeperSyncConnectsPerSec","Count"] Preprocessing |
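Many of the items above read a cumulative `Count` attribute yet are named "per second", and Uptime reads `start-time-ms` yet is expressed in seconds. The section does not spell out the preprocessing steps, so the sketch below only illustrates one plausible reading: a change-per-second calculation for counters and a subtraction from the current time for uptime. Both function names are hypothetical, and the sample numbers are made up.

```python
def change_per_second(prev_value, prev_ts, value, ts):
    """Turn two samples of a cumulative counter into a per-second rate.

    Timestamps are in seconds. Returns None when the counter went backwards
    (for example after a broker restart), so no rate is stored for that sample.
    """
    elapsed = ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    if value < prev_value:
        return None
    return (value - prev_value) / elapsed


def uptime_seconds(start_time_ms, now_ms):
    """Convert a start timestamp in milliseconds into whole seconds of uptime."""
    return (now_ms - start_time_ms) // 1000


# Two hypothetical samples of MessagesInPerSec "Count", taken 60 seconds apart.
rate = change_per_second(prev_value=1000, prev_ts=0, value=1600, ts=60)
print(rate)

# Hypothetical start time and "now", 90.5 seconds apart.
uptime = uptime_seconds(1_700_000_000_000, 1_700_000_090_500)
print(uptime)
```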
Triggers
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
Unclean leader election detected | Unclean leader elections occur when there is no qualified partition leader among Kafka brokers. If Kafka is configured to allow an unclean leader election, a leader is chosen from the out-of-sync replicas, and any messages that were not synced prior to the loss of the former leader are lost forever. Essentially, unclean leader elections sacrifice consistency for availability. | last(/Apache Kafka by JMX/jmx["kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec","Count"])>0 | Average | |
There are offline log directories | The offline log directory count metric indicates the number of log directories that are offline (for example, due to a hardware failure), so the broker can no longer store incoming messages. | last(/Apache Kafka by JMX/jmx["kafka.log:type=LogManager,name=OfflineLogDirectoryCount","Value"]) > 0 | Warning | |
One or more partitions have no leader | Any partition without an active leader will be completely inaccessible, and both consumers and producers of that partition will be blocked until a leader becomes available. | last(/Apache Kafka by JMX/jmx["kafka.controller:type=KafkaController,name=OfflinePartitionsCount","Value"]) > 0 | Warning | |
Request handler average idle percent is too low | The request handler idle ratio metric indicates the percentage of time the request handlers are not in use. The lower this number, the more loaded the broker is. | max(/Apache Kafka by JMX/jmx["kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent","OneMinuteRate"],15m)<{$KAFKA.REQUEST_HANDLER_AVG_IDLE.MIN.WARN} | Average | |
Network processor average idle percent is too low | The network processor idle ratio metric indicates the percentage of time the network processors are not in use. The lower this number, the more loaded the broker is. | max(/Apache Kafka by JMX/jmx["kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent","Value"],15m)<{$KAFKA.NET_PROC_AVG_IDLE.MIN.WARN} | Average | |
Failed to fetch info data | Zabbix has not received data for items for the last 15 minutes. | nodata(/Apache Kafka by JMX/jmx["kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent","Value"],15m)=1 | Warning | |
There are partitions under the min ISR | The Under min ISR partitions metric displays the number of partitions where the number of In-Sync Replicas (ISR) is less than the configured minimum number of in-sync replicas. The two most common causes of under-min ISR partitions are that one or more brokers are unresponsive, or that the cluster is experiencing performance issues and one or more brokers are falling behind. | last(/Apache Kafka by JMX/jmx["kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount","Value"])>0 | Average | |
There are under replicated partitions | The Under replicated partitions metric displays the number of partitions that do not have enough replicas to meet the desired replication factor. A partition is also considered under-replicated if the correct number of replicas exists, but one or more of the replicas have fallen significantly behind the partition leader. The two most common causes of under-replicated partitions are that one or more brokers are unresponsive, or that the cluster is experiencing performance issues and one or more brokers have fallen behind. | last(/Apache Kafka by JMX/jmx["kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions","Value"])>0 | Average | |
Version has changed | The Kafka version has changed. Acknowledge to close the problem manually. | last(/Apache Kafka by JMX/jmx["kafka.server:type=app-info","version"],#1)<>last(/Apache Kafka by JMX/jmx["kafka.server:type=app-info","version"],#2) and length(last(/Apache Kafka by JMX/jmx["kafka.server:type=app-info","version"]))>0 | Info | Manual close: Yes |
Kafka service has been restarted | Uptime is less than 10 minutes. | last(/Apache Kafka by JMX/jmx["kafka.server:type=app-info","start-time-ms"])<10m | Info | Manual close: Yes |
Broker is not connected to ZooKeeper | | find(/Apache Kafka by JMX/jmx["kafka.server:type=SessionExpireListener,name=SessionState","Value"],,"regexp","CONNECTED")=0 | Average | |
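The two idle-percent triggers compare the maximum over a 15-minute window with a threshold, which means they fire only when every sample in the window is below it. The "Version has changed" trigger compares the two most recent values and ignores an empty latest value. The sketch below restates that logic in plain Python as a reading aid; the function names are hypothetical, and the samples are assumed to be already expressed as percentages.

```python
def idle_too_low(samples, threshold=30.0):
    """Mimic max(item, 15m) < threshold for the samples collected in the window.

    Because the maximum is compared, a single sample at or above the
    threshold is enough to keep the trigger from firing.
    """
    if not samples:
        return False  # nothing to evaluate in this sketch
    return max(samples) < threshold


def version_changed(latest, previous):
    """Mimic last(#1) <> last(#2) and length(last()) > 0."""
    return latest != previous and len(latest) > 0


# Hypothetical idle-percent samples from one 15-minute window.
print(idle_too_low([12.0, 25.5, 29.9]))  # every sample is below 30
print(idle_too_low([12.0, 45.0, 20.0]))  # one sample recovered above 30
print(version_changed("2.6.1", "2.6.0"))
```

Using the maximum rather than the average or last value makes these triggers tolerant of short spikes in load: a brief dip in idle time does not raise a problem unless the broker stays loaded for the whole window.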
LLD rule Topic Metrics (write)
Name | Description | Type | Key and additional info |
---|---|---|---|
Topic Metrics (write) | | JMX agent | jmx.discovery[beans,"kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=*"] |
Item prototypes for Topic Metrics (write)
Name | Description | Type | Key and additional info |
---|---|---|---|
Kafka {#JMXTOPIC}: Messages in per second | The rate at which individual messages are consumed by topic. | JMX agent | jmx["kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic={#JMXTOPIC}","Count"] Preprocessing |
Kafka {#JMXTOPIC}: Bytes in per second | The rate at which data sent from producers is consumed by topic. | JMX agent | jmx["kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic={#JMXTOPIC}","Count"] Preprocessing |
LLD rule Topic Metrics (read)
Name | Description | Type | Key and additional info |
---|---|---|---|
Topic Metrics (read) | | JMX agent | jmx.discovery[beans,"kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec,topic=*"] |
Item prototypes for Topic Metrics (read)
Name | Description | Type | Key and additional info |
---|---|---|---|
Kafka {#JMXTOPIC}: Bytes out per second | The rate at which data is fetched and read from the broker by consumers (by topic). | JMX agent | jmx["kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec,topic={#JMXTOPIC}","Count"] Preprocessing |
LLD rule Topic Metrics (errors)
Name | Description | Type | Key and additional info |
---|---|---|---|
Topic Metrics (errors) | | JMX agent | jmx.discovery[beans,"kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec,topic=*"] |
Item prototypes for Topic Metrics (errors)
Name | Description | Type | Key and additional info |
---|---|---|---|
Kafka {#JMXTOPIC}: Bytes rejected per second | Rejected bytes rate by topic. | JMX agent | jmx["kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec,topic={#JMXTOPIC}","Count"] Preprocessing |
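The discovery rules above query MBeans whose object name carries a `topic` property, and the item prototypes then substitute {#JMXTOPIC} into the key. The sketch below shows, under simplifying assumptions, how an MBean object name could be mapped to discovery macros and how a prototype key would expand for one topic. The parser is deliberately naive (it does not handle quoted property values that contain commas), the function names are hypothetical, and the topic name `orders` is made up.

```python
def bean_to_lld_macros(object_name):
    """Map an MBean object name to {#JMX...} style discovery macros.

    Simplified: splits "domain:key=value,key=value" on plain commas and
    upper-cases each property key to form the macro name.
    """
    domain, _, props = object_name.partition(":")
    macros = {"{#JMXOBJ}": object_name, "{#JMXDOMAIN}": domain}
    for pair in props.split(","):
        key, _, value = pair.partition("=")
        macros["{#JMX" + key.upper() + "}"] = value
    return macros


def expand_prototype(key_template, macros):
    """Substitute discovery macros into an item prototype key."""
    for name, value in macros.items():
        key_template = key_template.replace(name, value)
    return key_template


# Hypothetical bean as the broker might expose it for a topic called "orders".
bean = "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=orders"
macros = bean_to_lld_macros(bean)

prototype = 'jmx["kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic={#JMXTOPIC}","Count"]'
item_key = expand_prototype(prototype, macros)
print(macros["{#JMXTOPIC}"])
print(item_key)
```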
Feedback
Please report any issues with the template at https://support.zabbix.com
You can also provide feedback, discuss the template, or ask for help at ZABBIX forums