Kafka

Apache Kafka is an open-source stream-processing software platform developed by the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.

This template is for Zabbix version: 7.0

Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/kafka_jmx?at=release/7.0

Apache Kafka by JMX

Overview

This template is designed for the effortless deployment of Apache Kafka monitoring by Zabbix via JMX and doesn't require any external scripts.

Requirements

Zabbix version: 7.0 and higher.

Tested versions

This template has been tested on:

  • Apache Kafka 2.6.0

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

Metrics are collected by JMX.

  1. Enable and configure JMX access to Apache Kafka. See documentation for instructions.
  2. Set the user name and password in host macros {$KAFKA.USER} and {$KAFKA.PASSWORD}.

Macros used

Name Description Default
{$KAFKA.USER} zabbix
{$KAFKA.PASSWORD} zabbix
{$KAFKA.TOPIC.MATCHES}

Filter of discoverable topics

.*
{$KAFKA.TOPIC.NOT_MATCHES}

Filter to exclude discovered topics

__consumer_offsets
{$KAFKA.NET_PROC_AVG_IDLE.MIN.WARN}

The minimum Network processor average idle percent for trigger expression.

30
{$KAFKA.REQUEST_HANDLER_AVG_IDLE.MIN.WARN}

The minimum Request handler average idle percent for trigger expression.

30
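
The two topic filter macros work as a pair: a topic is discovered only if it matches {$KAFKA.TOPIC.MATCHES} and does not match {$KAFKA.TOPIC.NOT_MATCHES}. The following is a minimal sketch of that logic using the default values from the table above. It assumes that a pattern matches when it is found anywhere in the topic name and uses Python's re module as a stand-in for the regular expression engine used by Zabbix; the topic names and the is_discoverable helper are hypothetical.

```python
import re

# Default macro values from the table above.
TOPIC_MATCHES = r".*"
TOPIC_NOT_MATCHES = r"__consumer_offsets"


def is_discoverable(topic, matches=TOPIC_MATCHES, not_matches=TOPIC_NOT_MATCHES):
    """Return True if the topic passes both filters.

    Assumption: a pattern matches when it is found anywhere in the name
    (re.search); Python's re module only approximates Zabbix's regex engine.
    """
    return bool(re.search(matches, topic)) and not re.search(not_matches, topic)


topics = ["orders", "payments", "__consumer_offsets"]  # hypothetical topic names
kept = [t for t in topics if is_discoverable(t)]
print(kept)
```

With the defaults, only the internal __consumer_offsets topic is excluded. Narrowing {$KAFKA.TOPIC.MATCHES} (for example to ^orders) would limit discovery to topics with that prefix.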

Items

Name Description Type Key and additional info
Leader election per second

Number of leader elections per second.

JMX agent jmx["kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs","Count"]
Unclean leader election per second

Number of “unclean” elections per second.

JMX agent jmx["kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec","Count"]

Preprocessing

  • Change per second
Controller state on broker

One indicates that the broker is the controller for the cluster.

JMX agent jmx["kafka.controller:type=KafkaController,name=ActiveControllerCount","Value"]

Preprocessing

  • Discard unchanged with heartbeat: 1h

Ineligible pending replica deletes

The number of ineligible pending replica deletes.

JMX agent jmx["kafka.controller:type=KafkaController,name=ReplicasIneligibleToDeleteCount","Value"]
Pending replica deletes

The number of pending replica deletes.

JMX agent jmx["kafka.controller:type=KafkaController,name=ReplicasToDeleteCount","Value"]
Ineligible pending topic deletes

The number of ineligible pending topic deletes.

JMX agent jmx["kafka.controller:type=KafkaController,name=TopicsIneligibleToDeleteCount","Value"]
Pending topic deletes

The number of pending topic deletes.

JMX agent jmx["kafka.controller:type=KafkaController,name=TopicsToDeleteCount","Value"]
Offline log directory count

The number of offline log directories (for example, after a hardware failure).

JMX agent jmx["kafka.log:type=LogManager,name=OfflineLogDirectoryCount","Value"]
Offline partitions count

Number of partitions that don't have an active leader.

JMX agent jmx["kafka.controller:type=KafkaController,name=OfflinePartitionsCount","Value"]
Bytes out per second

The rate at which data is fetched and read from the broker by consumers.

JMX agent jmx["kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec","Count"]

Preprocessing

  • Change per second
Bytes in per second

The rate at which data sent from producers is consumed by the broker.

JMX agent jmx["kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec","Count"]

Preprocessing

  • Change per second
Messages in per second

The rate at which individual messages are consumed by the broker.

JMX agent jmx["kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec","Count"]

Preprocessing

  • Change per second
Bytes rejected per second

The rate at which bytes are rejected by the broker.

JMX agent jmx["kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec","Count"]

Preprocessing

  • Change per second
Client fetch request failed per second

Number of client fetch request failures per second.

JMX agent jmx["kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec","Count"]

Preprocessing

  • Change per second
Produce requests failed per second

Number of failed produce requests per second.

JMX agent jmx["kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec","Count"]

Preprocessing

  • Change per second
Request handler average idle percent

Indicates the percentage of time that the request handler (IO) threads are not in use.

JMX agent jmx["kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent","OneMinuteRate"]

Preprocessing

  • Custom multiplier: 100

Fetch-Consumer response send time, mean

Average time taken, in milliseconds, to send the response.

JMX agent jmx["kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=FetchConsumer","Mean"]
Fetch-Consumer response send time, p95

The time taken, in milliseconds, to send the response for 95th percentile.

JMX agent jmx["kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=FetchConsumer","95thPercentile"]
Fetch-Consumer response send time, p99

The time taken, in milliseconds, to send the response for 99th percentile.

JMX agent jmx["kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=FetchConsumer","99thPercentile"]
Fetch-Follower response send time, mean

Average time taken, in milliseconds, to send the response.

JMX agent jmx["kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=FetchFollower","Mean"]
Fetch-Follower response send time, p95

The time taken, in milliseconds, to send the response for 95th percentile.

JMX agent jmx["kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=FetchFollower","95thPercentile"]
Fetch-Follower response send time, p99

The time taken, in milliseconds, to send the response for 99th percentile.

JMX agent jmx["kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=FetchFollower","99thPercentile"]
Produce response send time, mean

Average time taken, in milliseconds, to send the response.

JMX agent jmx["kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=Produce","Mean"]
Produce response send time, p95

The time taken, in milliseconds, to send the response for 95th percentile.

JMX agent jmx["kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=Produce","95thPercentile"]
Produce response send time, p99

The time taken, in milliseconds, to send the response for 99th percentile.

JMX agent jmx["kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=Produce","99thPercentile"]
Fetch-Consumer request total time, mean

Average time in ms to serve the Fetch-Consumer request.

JMX agent jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer","Mean"]
Fetch-Consumer request total time, p95

Time in ms to serve the Fetch-Consumer request for 95th percentile.

JMX agent jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer","95thPercentile"]
Fetch-Consumer request total time, p99

Time in ms to serve the Fetch-Consumer request for 99th percentile.

JMX agent jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer","99thPercentile"]
Fetch-Follower request total time, mean

Average time in ms to serve the Fetch-Follower request.

JMX agent jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower","Mean"]
Fetch-Follower request total time, p95

Time in ms to serve the Fetch-Follower request for 95th percentile.

JMX agent jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower","95thPercentile"]
Fetch-Follower request total time, p99

Time in ms to serve the Fetch-Follower request for 99th percentile.

JMX agent jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower","99thPercentile"]
Produce request total time, mean

Average time in ms to serve the Produce request.

JMX agent jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce","Mean"]
Produce request total time, p95

Time in ms to serve the Produce requests for 95th percentile.

JMX agent jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce","95thPercentile"]
Produce request total time, p99

Time in ms to serve the Produce requests for 99th percentile.

JMX agent jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce","99thPercentile"]
UpdateMetadata request total time, mean

Average time for a request to update metadata.

JMX agent jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=UpdateMetadata","Mean"]
UpdateMetadata request total time, p95

Time for update metadata requests for 95th percentile.

JMX agent jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=UpdateMetadata","95thPercentile"]
UpdateMetadata request total time, p99

Time for update metadata requests for 99th percentile.

JMX agent jmx["kafka.network:type=RequestMetrics,name=TotalTimeMs,request=UpdateMetadata","99thPercentile"]
Temporary memory size in bytes (Fetch), max

The maximum of temporary memory used for converting message formats and decompressing messages.

JMX agent jmx["kafka.network:type=RequestMetrics,name=TemporaryMemoryBytes,request=Fetch","Max"]
Temporary memory size in bytes (Fetch), avg

The average amount of temporary memory used for converting message formats and decompressing messages.

JMX agent jmx["kafka.network:type=RequestMetrics,name=TemporaryMemoryBytes,request=Fetch","Mean"]
Temporary memory size in bytes (Produce), max

The maximum of temporary memory used for converting message formats and decompressing messages.

JMX agent jmx["kafka.network:type=RequestMetrics,name=TemporaryMemoryBytes,request=Produce","Max"]
Temporary memory size in bytes (Produce), avg

The amount of temporary memory used for converting message formats and decompressing messages.

JMX agent jmx["kafka.network:type=RequestMetrics,name=TemporaryMemoryBytes,request=Produce","Mean"]
Temporary memory size in bytes (Produce), min

The minimum of temporary memory used for converting message formats and decompressing messages.

JMX agent jmx["kafka.network:type=RequestMetrics,name=TemporaryMemoryBytes,request=Produce","Min"]
Network processor average idle percent

The average percentage of time that the network processors are idle.

JMX agent jmx["kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent","Value"]

Preprocessing

  • Custom multiplier: 100

Requests in fetch purgatory

Number of requests waiting in fetch purgatory.

JMX agent jmx["kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Fetch","Value"]
Requests in producer purgatory

Number of requests waiting in producer purgatory.

JMX agent jmx["kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Produce","Value"]
Replication maximum lag

The maximum lag between the time that messages are received by the leader replica and by the follower replicas.

JMX agent jmx["kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica","Value"]
Under minimum ISR partition count

The number of partitions under the minimum In-Sync Replica (ISR) count.

JMX agent jmx["kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount","Value"]
Under replicated partitions

The number of partitions that have not been fully replicated in the follower replicas (the number of non-reassigning replicas - the number of ISR > 0).

JMX agent jmx["kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions","Value"]
ISR expands per second

The rate at which the number of ISRs in the broker increases.

JMX agent jmx["kafka.server:type=ReplicaManager,name=IsrExpandsPerSec","Count"]

Preprocessing

  • Change per second
ISR shrink per second

Rate of replicas leaving the ISR pool.

JMX agent jmx["kafka.server:type=ReplicaManager,name=IsrShrinksPerSec","Count"]

Preprocessing

  • Change per second
Leader count

The number of replicas for which this broker is the leader.

JMX agent jmx["kafka.server:type=ReplicaManager,name=LeaderCount","Value"]
Partition count

The number of partitions in the broker.

JMX agent jmx["kafka.server:type=ReplicaManager,name=PartitionCount","Value"]
Number of reassigning partitions

The number of reassigning leader partitions on a broker.

JMX agent jmx["kafka.server:type=ReplicaManager,name=ReassigningPartitions","Value"]
Request queue size

The size of the delay queue.

JMX agent jmx["kafka.server:type=Request","queue-size"]
Version

Current version of broker.

JMX agent jmx["kafka.server:type=app-info","version"]

Preprocessing

  • Discard unchanged with heartbeat: 1h

Uptime

The service uptime expressed in seconds.

JMX agent jmx["kafka.server:type=app-info","start-time-ms"]

Preprocessing

  • JavaScript: The text is too long. Please see the template.

ZooKeeper client request latency

Latency in milliseconds for ZooKeeper requests from broker.

JMX agent jmx["kafka.server:type=ZooKeeperClientMetrics,name=ZooKeeperRequestLatencyMs","Count"]
ZooKeeper connection status

Connection status of broker's ZooKeeper session.

JMX agent jmx["kafka.server:type=SessionExpireListener,name=SessionState","Value"]

Preprocessing

  • Discard unchanged with heartbeat: 1h

ZooKeeper disconnect rate

Number of ZooKeeper client disconnects per second.

JMX agent jmx["kafka.server:type=SessionExpireListener,name=ZooKeeperDisconnectsPerSec","Count"]

Preprocessing

  • Change per second
ZooKeeper session expiration rate

Number of ZooKeeper client session expirations per second.

JMX agent jmx["kafka.server:type=SessionExpireListener,name=ZooKeeperExpiresPerSec","Count"]

Preprocessing

  • Change per second
ZooKeeper readonly rate

Number of ZooKeeper client read-only connections per second.

JMX agent jmx["kafka.server:type=SessionExpireListener,name=ZooKeeperReadOnlyConnectsPerSec","Count"]

Preprocessing

  • Change per second
ZooKeeper sync rate

Number of ZooKeeper client sync connections per second.

JMX agent jmx["kafka.server:type=SessionExpireListener,name=ZooKeeperSyncConnectsPerSec","Count"]

Preprocessing

  • Change per second
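
The preprocessing steps named in the table above can be sketched as small functions. This is an illustrative approximation, not the template's code: the function names are hypothetical, and the uptime calculation is an assumption about what the JavaScript step does with start-time-ms, since the table does not show the script.

```python
import time


def change_per_second(prev_value, prev_ts, value, ts):
    """Rate between two counter samples: (value - prev_value) / (ts - prev_ts).

    Returns None when the counter went backwards (for example, after a
    broker restart) or when no time has elapsed, so no rate is produced.
    """
    elapsed = ts - prev_ts
    if elapsed <= 0 or value < prev_value:
        return None
    return (value - prev_value) / elapsed


def custom_multiplier(value, multiplier=100):
    """Scale a 0..1 idle ratio into a percentage."""
    return value * multiplier


def uptime_seconds(start_time_ms, now_ms=None):
    """Assumed uptime logic: current time minus broker start time, in seconds."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return (now_ms - start_time_ms) // 1000


# A "Count" attribute that grew from 1000 to 1600 over 60 seconds.
print(change_per_second(1000, 0, 1600, 60))
# An idle ratio of 0.25 expressed as a percentage.
print(custom_multiplier(0.25))
# A broker that started 600 seconds before the reference time.
print(uptime_seconds(1_000_000, now_ms=1_600_000))
```

This is why items that read a cumulative "Count" attribute are paired with "Change per second": the raw JMX value only ever grows, and the per-second rate is derived from consecutive samples.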

Triggers

Name Description Expression Severity Dependencies and additional info
Unclean leader election detected

Unclean leader elections occur when there is no qualified partition leader among Kafka brokers. If Kafka is configured to allow an unclean leader election, a leader is chosen from the out-of-sync replicas, and any messages that were not synced prior to the loss of the former leader are lost forever. Essentially, unclean leader elections sacrifice consistency for availability.

last(/Apache Kafka by JMX/jmx["kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec","Count"])>0 Average
There are offline log directories

The offline log directory count metric indicates the number of log directories that are offline (for example, due to a hardware failure), so the broker can no longer store incoming messages in them.

last(/Apache Kafka by JMX/jmx["kafka.log:type=LogManager,name=OfflineLogDirectoryCount","Value"]) > 0 Warning
One or more partitions have no leader

Any partition without an active leader will be completely inaccessible, and both consumers and producers of that partition will be blocked until a leader becomes available.

last(/Apache Kafka by JMX/jmx["kafka.controller:type=KafkaController,name=OfflinePartitionsCount","Value"]) > 0 Warning
Request handler average idle percent is too low

The request handler idle ratio metric indicates the percentage of time the request handlers are not in use. The lower this number, the more loaded the broker is.

max(/Apache Kafka by JMX/jmx["kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent","OneMinuteRate"],15m)<{$KAFKA.REQUEST_HANDLER_AVG_IDLE.MIN.WARN} Average
Network processor average idle percent is too low

The network processor idle ratio metric indicates the percentage of time the network processors are not in use. The lower this number, the more loaded the broker is.

max(/Apache Kafka by JMX/jmx["kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent","Value"],15m)<{$KAFKA.NET_PROC_AVG_IDLE.MIN.WARN} Average
Failed to fetch info data

Zabbix has not received data for items for the last 15 minutes.

nodata(/Apache Kafka by JMX/jmx["kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent","Value"],15m)=1 Warning
There are partitions under the min ISR

The Under min ISR partitions metric displays the number of partitions where the number of In-Sync Replicas (ISR) is less than the configured minimum number of in-sync replicas. The two most common causes of under-min ISR partitions are that one or more brokers are unresponsive, or the cluster is experiencing performance issues and one or more brokers are falling behind.

last(/Apache Kafka by JMX/jmx["kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount","Value"])>0 Average
There are under replicated partitions

The Under replicated partitions metric displays the number of partitions that do not have enough replicas to meet the desired replication factor. A partition is also considered under-replicated if the correct number of replicas exists, but one or more of the replicas have fallen significantly behind the partition leader. The two most common causes of under-replicated partitions are that one or more brokers are unresponsive, or the cluster is experiencing performance issues and one or more brokers have fallen behind.

last(/Apache Kafka by JMX/jmx["kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions","Value"])>0 Average
Version has changed

The Kafka version has changed. Acknowledge to close the problem manually.

last(/Apache Kafka by JMX/jmx["kafka.server:type=app-info","version"],#1)<>last(/Apache Kafka by JMX/jmx["kafka.server:type=app-info","version"],#2) and length(last(/Apache Kafka by JMX/jmx["kafka.server:type=app-info","version"]))>0 Info Manual close: Yes
Kafka service has been restarted

Uptime is less than 10 minutes.

last(/Apache Kafka by JMX/jmx["kafka.server:type=app-info","start-time-ms"])<10m Info Manual close: Yes
Broker is not connected to ZooKeeper

find(/Apache Kafka by JMX/jmx["kafka.server:type=SessionExpireListener,name=SessionState","Value"],,"regexp","CONNECTED")=0 Average
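
To make the trigger logic above easier to reason about, here is a rough sketch of three of the expressions as plain functions. It only mirrors the comparisons written in the expressions; the function names and the way samples are passed in are hypothetical, and real evaluation happens inside the Zabbix server.

```python
def idle_too_low(samples, threshold=30):
    """Mirror of max(...,15m) < threshold.

    samples: idle-percent values collected within the last 15 minutes.
    Because max() is used, the condition holds only if every sample in
    the window is below the threshold.
    """
    return bool(samples) and max(samples) < threshold


def version_changed(history):
    """Mirror of last(#1) <> last(#2) and length(last()) > 0.

    history: version strings ordered oldest to newest.
    """
    if len(history) < 2:
        return False
    latest, previous = history[-1], history[-2]
    return latest != previous and len(latest) > 0


def recently_restarted(uptime_s):
    """Mirror of last(uptime) < 10m, with 10 minutes expressed as 600 seconds."""
    return uptime_s < 600


print(idle_too_low([12, 25, 29]))           # every sample below 30
print(idle_too_low([12, 45, 29]))           # one sample recovered above 30
print(version_changed(["2.6.0", "2.7.0"]))  # newest value differs from the previous one
print(recently_restarted(120))              # broker has been up for two minutes
```

Using max over the window rather than the last value means a single healthy sample within 15 minutes keeps the idle-percent triggers from firing, which filters out short load spikes.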

LLD rule Topic Metrics (write)

Name Description Type Key and additional info
Topic Metrics (write) JMX agent jmx.discovery[beans,"kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=*"]
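
The discovery rule above returns one entry per matching MBean, and the item prototypes then substitute {#JMXTOPIC} into their keys. The sketch below shows, under stated assumptions, how an object name could map to LLD macros and how a prototype key expands. It assumes each key property of the object name becomes a {#JMX<PROPERTY>} macro in upper case, alongside {#JMXDOMAIN} and {#JMXOBJ}; the helper names and the topic "orders" are hypothetical, and quoted property values are not handled.

```python
def lld_macros(object_name):
    """Split a JMX object name into LLD-style macros.

    Assumption: key properties map to {#JMX<KEY>} in upper case. Quoted
    values containing commas or equals signs are not supported here.
    """
    domain, _, props = object_name.partition(":")
    row = {"{#JMXDOMAIN}": domain, "{#JMXOBJ}": object_name}
    for pair in props.split(","):
        key, _, value = pair.partition("=")
        row["{#JMX" + key.upper() + "}"] = value
    return row


def item_key(topic, metric="MessagesInPerSec"):
    """Expand an item prototype key for one discovered topic."""
    return (
        'jmx["kafka.server:type=BrokerTopicMetrics,'
        f'name={metric},topic={topic}","Count"]'
    )


row = lld_macros(
    "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=orders"
)
print(row["{#JMXTOPIC}"])
print(item_key(row["{#JMXTOPIC}"]))
```

Each discovered topic therefore yields its own set of items, with the same key as the broker-level item plus a topic= property.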

Item prototypes for Topic Metrics (write)

Name Description Type Key and additional info
Kafka {#JMXTOPIC}: Messages in per second

The rate at which individual messages are consumed by topic.

JMX agent jmx["kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic={#JMXTOPIC}","Count"]

Preprocessing

  • Change per second
Kafka {#JMXTOPIC}: Bytes in per second

The rate at which data sent from producers is consumed by topic.

JMX agent jmx["kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic={#JMXTOPIC}","Count"]

Preprocessing

  • Change per second

LLD rule Topic Metrics (read)

Name Description Type Key and additional info
Topic Metrics (read) JMX agent jmx.discovery[beans,"kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec,topic=*"]

Item prototypes for Topic Metrics (read)

Name Description Type Key and additional info
Kafka {#JMXTOPIC}: Bytes out per second

The rate at which data is fetched and read from the broker by consumers (by topic).

JMX agent jmx["kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec,topic={#JMXTOPIC}","Count"]

Preprocessing

  • Change per second

LLD rule Topic Metrics (errors)

Name Description Type Key and additional info
Topic Metrics (errors) JMX agent jmx.discovery[beans,"kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec,topic=*"]

Item prototypes for Topic Metrics (errors)

Name Description Type Key and additional info
Kafka {#JMXTOPIC}: Bytes rejected per second

Rejected bytes rate by topic.

JMX agent jmx["kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec,topic={#JMXTOPIC}","Count"]

Preprocessing

  • Change per second

Feedback

Please report any issues with the template at https://support.zabbix.com

You can also provide feedback, discuss the template, or ask for help at the Zabbix forums.
