Status

Current state: Accepted

Vote thread: https://lists.apache.org/thread/kyrvq8ps35d83xco9xr4j9m0qx794t7t

Discussion thread: https://lists.apache.org/thread/8ky7t8xybgy2omkqld1fbtk16op9p5qo

JIRA: KAFKA-19467 

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

The controller thread is responsible for critical cluster coordination tasks, including leadership elections, topic metadata updates, partition assignments, and Raft-based state changes. We already have metrics for event queueing and processing durations: kafka.controller:type=ControllerEventManager,name=EventQueueTimeMs tracks the time an event waits in the queue before processing, and kafka.controller:type=ControllerEventManager,name=EventQueueProcessingTimeMs measures the time an event takes to be processed after it has been retrieved from the event queue. However, there is no metric that measures how much time the controller thread is idle, either waiting for new events or waiting for Raft callbacks during processing. This lack of visibility makes it challenging to assess the controller’s performance, detect potential bottlenecks, or optimize resource allocation in high-load scenarios.

This KIP proposes introducing a new metric to monitor the controller thread’s idleness ratio.

Public Interfaces

Monitoring

Name: kafka.controller:type=ControllerEventManager,name=AvgIdleRatio

Type: TimeRatio

Description: The idle ratio measures the proportion of time the controller thread is not actively processing an event, using the existing metric org.apache.kafka.raft.internals.TimeRatio, which tracks idleness over the actual measurement interval.
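As an illustration of how the metric might be consumed, the sketch below looks up the MBean over JMX. It assumes the metric is exposed as a gauge with a single numeric attribute named Value; this KIP does not specify the attribute name. The class AvgIdleRatioReader and its methods are hypothetical names used only for this example.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.util.OptionalDouble;
import javax.management.JMException;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

/** Hypothetical reader for the proposed metric; not part of Kafka. */
final class AvgIdleRatioReader {
    // MBean name proposed by this KIP.
    static final String MBEAN =
        "kafka.controller:type=ControllerEventManager,name=AvgIdleRatio";
    // Assumption: the gauge exposes its reading through an attribute called "Value".
    static final String ATTRIBUTE = "Value";

    /** Parses the MBean name, converting the checked exception to an unchecked one. */
    static ObjectName objectName() {
        try {
            return new ObjectName(MBEAN);
        } catch (JMException e) {
            throw new IllegalStateException("Invalid MBean name: " + MBEAN, e);
        }
    }

    /** Returns the current idle ratio, or empty if the MBean is not registered. */
    static OptionalDouble read(MBeanServerConnection connection) {
        try {
            ObjectName name = objectName();
            if (!connection.isRegistered(name)) {
                return OptionalDouble.empty();
            }
            Object value = connection.getAttribute(name, ATTRIBUTE);
            return value instanceof Number
                ? OptionalDouble.of(((Number) value).doubleValue())
                : OptionalDouble.empty();
        } catch (JMException | IOException e) {
            throw new IllegalStateException("Failed to read " + MBEAN, e);
        }
    }

    public static void main(String[] args) {
        // Queries the local JVM; inside a controller process the MBean would be registered.
        OptionalDouble ratio = read(ManagementFactory.getPlatformMBeanServer());
        System.out.println(ratio.isPresent()
            ? "AvgIdleRatio = " + ratio.getAsDouble()
            : "AvgIdleRatio is not registered in this JVM");
    }
}
```

To monitor a remote controller, the same read method could be given a connection obtained from a remote JMX connector instead of the platform MBean server.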

The value ranges from 0 to 1, where 0 indicates the controller thread is constantly processing without breaks and a value close to 1 indicates the controller thread spends most of its time waiting for work. The metric is calculated using the following formula:


                      controller idle ratio = idle_time/(active_time+idle_time)


The components are defined as follows:

  • active_time: The total time the controller thread spends actively processing events or performing computation, as opposed to waiting for callbacks or external operations.
  • idle_time: The total time the controller thread is not executing any instructions because it is waiting either for new events to appear in the queue or for a Raft callback.
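The formula and the two components above can be illustrated with a minimal sketch. This is not the implementation of org.apache.kafka.raft.internals.TimeRatio; the class IdleRatioSketch, its methods, and the defaultRatio fallback for an empty interval are assumptions made only for this example.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical accumulator for idle_time and active_time; not Kafka's TimeRatio. */
final class IdleRatioSketch {
    private final double defaultRatio; // assumption: value reported before anything is recorded
    private long idleNanos;
    private long activeNanos;

    IdleRatioSketch(double defaultRatio) {
        if (defaultRatio < 0.0 || defaultRatio > 1.0) {
            throw new IllegalArgumentException("defaultRatio must be in [0, 1]");
        }
        this.defaultRatio = defaultRatio;
    }

    /** Adds time spent waiting for new events or for a Raft callback. */
    void recordIdle(long nanos) {
        if (nanos < 0) throw new IllegalArgumentException("nanos must be non-negative");
        idleNanos += nanos;
    }

    /** Adds time spent actively processing events. */
    void recordActive(long nanos) {
        if (nanos < 0) throw new IllegalArgumentException("nanos must be non-negative");
        activeNanos += nanos;
    }

    /** Computes idle_time / (active_time + idle_time). */
    double idleRatio() {
        long total = idleNanos + activeNanos;
        if (total == 0) {
            return defaultRatio; // nothing measured yet
        }
        return (double) idleNanos / total;
    }

    public static void main(String[] args) {
        IdleRatioSketch sketch = new IdleRatioSketch(1.0);
        sketch.recordIdle(TimeUnit.MILLISECONDS.toNanos(750));   // waiting on the queue
        sketch.recordActive(TimeUnit.MILLISECONDS.toNanos(250)); // processing events
        System.out.println(sketch.idleRatio()); // prints 0.75
    }
}
```

In this example the thread is idle for 750 ms out of a 1000 ms interval, so the ratio is 0.75. A thread that never waits would report 0.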


Compatibility, Deprecation, and Migration Plan

This KIP introduces a single new metric, with no changes to existing behavior, APIs, or protocols. There will be no impact on existing Kafka versions.

Test Plan

We will add unit tests to verify the correctness of the metric.

Rejected Alternatives

