Status
Current state: Accepted
Vote thread: https://lists.apache.org/thread/kyrvq8ps35d83xco9xr4j9m0qx794t7t
Discussion thread: https://lists.apache.org/thread/8ky7t8xybgy2omkqld1fbtk16op9p5qo
JIRA: KAFKA-19467
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
The controller thread is responsible for critical cluster coordination tasks, including leadership elections, topic metadata updates, partition assignments, and Raft-based state changes. Two metrics already cover event queue and processing durations: kafka.controller:type=ControllerEventManager,name=EventQueueTimeMs tracks the time an event waits in the queue before processing, and kafka.controller:type=ControllerEventManager,name=EventQueueProcessingTimeMs measures the time taken to process an event after it has been retrieved from the event queue. However, there is no metric that measures how much time the controller thread spends idle, either waiting for new events or waiting for Raft callbacks during processing. This lack of visibility makes it challenging to assess the controller's performance, detect potential bottlenecks, or optimize resource allocation in high-load scenarios.
This KIP proposes introducing a new metric to monitor the controller thread’s idleness ratio.
Public Interfaces
Monitoring
| Name | Type | Description |
|---|---|---|
| kafka.controller:type=ControllerEventManager,name=AvgIdleRatio | TimeRatio | The proportion of time the controller thread is not actively processing an event. It is computed with the existing org.apache.kafka.raft.internals.TimeRatio, which tracks idleness over the actual measurement interval. The value ranges from 0 to 1: 0 indicates the controller thread is constantly processing without breaks, and values close to 1 indicate the controller spends most of its time waiting for work. Formula: controller idle ratio = idle_time / (active_time + idle_time), where idle_time is the time the thread is not actively processing an event and active_time is the time it spends processing events. |
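The formula in the table can be illustrated with a minimal Java sketch. This is not the org.apache.kafka.raft.internals.TimeRatio implementation; the class IdleRatioSketch and its methods recordIdle, recordActive, and measureAndReset are hypothetical names introduced only for illustration. The sketch also assumes that an interval with no recorded time should report 1.0 (fully idle), which the KIP does not specify.

```java
// Hypothetical illustration of idle_time / (active_time + idle_time).
// Not Kafka's TimeRatio; all names here are assumptions for this sketch.
final class IdleRatioSketch {
    private long idleNanos = 0L;    // time spent waiting for work
    private long activeNanos = 0L;  // time spent processing events

    void recordIdle(long nanos) {
        idleNanos += nanos;
    }

    void recordActive(long nanos) {
        activeNanos += nanos;
    }

    // Returns the idle ratio for the interval since the last call,
    // then resets both counters so the next reading covers a fresh interval.
    double measureAndReset() {
        long total = idleNanos + activeNanos;
        // Assumption: with nothing recorded, treat the thread as fully idle.
        double ratio = (total == 0L) ? 1.0 : (double) idleNanos / total;
        idleNanos = 0L;
        activeNanos = 0L;
        return ratio;
    }

    public static void main(String[] args) {
        IdleRatioSketch sketch = new IdleRatioSketch();
        sketch.recordIdle(750L);    // waited 750 ns for the next event
        sketch.recordActive(250L);  // spent 250 ns processing it
        System.out.println(sketch.measureAndReset()); // prints 0.75
    }
}
```

In this example the thread was idle for 750 of 1000 recorded nanoseconds, so the ratio is 0.75. Resetting on each read mirrors the idea of measuring over the actual interval between readings rather than over the lifetime of the process.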
Compatibility, Deprecation, and Migration Plan
This KIP introduces a single new metric, with no changes to existing behavior, APIs, or protocols. There is no impact on existing Kafka versions.
Test Plan
We will add unit tests to verify the correctness of the metric.
Rejected Alternatives