Status

Current state: Accepted

Discussion thread: here

Vote thread: here

JIRA: KAFKA-19606

Motivation

The JMX metric RequestHandlerAvgIdlePercent reports a value close to 2 in KRaft combined mode, but it is expected to be between 0 and 1.

The issue is specific to combined mode. There, the controller and the broker share the same Meter object, defined in RequestThreadIdleMeter#requestThreadIdleMeter, but each uses its own KafkaRequestHandlerPool object, and each pool's threadPoolSize == KafkaConfig.numIoThreads. When calculating idle time, each pool divides by its own numIoThreads value before reporting to the shared meter, and RequestHandlerAvgIdlePercent produces the final result by accumulating the values reported by all threads. Since 2 × numIoThreads threads actually contribute to the metric, the denominator should be doubled to get the correct average.
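The arithmetic behind the value of 2 can be illustrated with a small model. This is a hypothetical sketch, not Kafka code: the class and method names are invented for illustration, and it assumes every thread is idle for the same fraction of the window.

```java
/** Hypothetical model of the shared-meter arithmetic; not Kafka code. */
final class SharedMeterDoubleCount {

    /**
     * Ratio reported when every thread in every pool marks
     * idleNanos / threadsPerPool on one shared meter (the current behavior).
     */
    static double reportedIdleRatio(int pools, int threadsPerPool,
                                    double idleFraction, long windowNanos) {
        double marked = 0.0;
        for (int p = 0; p < pools; p++) {
            for (int t = 0; t < threadsPerPool; t++) {
                double idleNanos = idleFraction * windowNanos;
                marked += idleNanos / threadsPerPool; // each pool divides by its own size only
            }
        }
        return marked / windowNanos;
    }

    /** Ratio when each thread divides by the total thread count across all pools. */
    static double correctedIdleRatio(int pools, int threadsPerPool,
                                     double idleFraction, long windowNanos) {
        int totalThreads = pools * threadsPerPool;
        double marked = 0.0;
        for (int i = 0; i < totalThreads; i++) {
            marked += (idleFraction * windowNanos) / totalThreads;
        }
        return marked / windowNanos;
    }

    public static void main(String[] args) {
        long window = 1_000_000_000L; // one second, in nanoseconds
        // Two pools (broker + controller) of 8 fully idle threads each.
        System.out.println(reportedIdleRatio(2, 8, 1.0, window));
        System.out.println(correctedIdleRatio(2, 8, 1.0, window));
    }
}
```

With two fully idle pools, the first method yields a ratio near 2 and the second a ratio near 1; with a single pool the two agree.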

Public Interfaces

Proposed Changes

To fix this, we keep the aggregate metric with a corrected calculation and add two per-pool metrics, one for the broker's request handler pool and one for the controller's:

  1. RequestHandlerAvgIdlePercent, which reports the thread idle ratio across all of the request handler pools (the existing metric, with a corrected calculation).
  2. BrokerRequestHandlerAvgIdlePercent, which reports the thread idle ratio for the broker request handler pool.
  3. ControllerRequestHandlerAvgIdlePercent, which reports the thread idle ratio for the controller request handler pool.

All three metrics measure the idle ratio the same way. For a given time window, ratio ≈ idleTimeNanos / (elapsedTimeNanos × totalRequestThreads), where totalRequestThreads is the number of threads contributing to that metric.

For each handler thread, per select loop: idleTimeNanos = endTimeNanos - startSelectTimeNanos

The meter is marked with idleTimeNanos / normalizationDenominator, where the denominator depends on the metric:

  1. Broker pool metric: normalizationDenominator = brokerPoolThreadCount.
  2. Controller pool metric: normalizationDenominator = controllerPoolThreadCount.
  3. Combined aggregate metric:
    Combined mode: normalizationDenominator = brokerPoolThreadCount + controllerPoolThreadCount (sum across pools).
    Isolated mode: normalizationDenominator = poolThreadCount (only the single pool in the process).
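The denominator rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual implementation: the class and method names are hypothetical, and an absent pool is represented here by a thread count of 0.

```java
/** Hypothetical sketch of the normalization rules; names are not Kafka APIs. */
final class IdleNormalization {

    /** Per-pool metrics (broker or controller) normalize by that pool's thread count. */
    static int perPoolDenominator(int poolThreadCount) {
        requirePositive(poolThreadCount);
        return poolThreadCount;
    }

    /**
     * Aggregate metric: sum of the pools present in the process.
     * Combined mode passes both counts; isolated mode passes 0 for the absent pool.
     */
    static int aggregateDenominator(int brokerPoolThreadCount, int controllerPoolThreadCount) {
        int total = brokerPoolThreadCount + controllerPoolThreadCount;
        requirePositive(total);
        return total;
    }

    /** Value a handler thread marks on a meter after one select loop. */
    static double markValue(long startSelectTimeNanos, long endTimeNanos, int normalizationDenominator) {
        requirePositive(normalizationDenominator);
        long idleTimeNanos = endTimeNanos - startSelectTimeNanos;
        return (double) idleTimeNanos / normalizationDenominator;
    }

    private static void requirePositive(int threads) {
        if (threads <= 0) {
            throw new IllegalArgumentException("thread count must be positive: " + threads);
        }
    }

    public static void main(String[] args) {
        int broker = 8;
        int controller = 8;
        // Combined mode: the aggregate metric divides by 16, per-pool metrics by 8.
        System.out.println(markValue(100L, 500L, aggregateDenominator(broker, controller)));
        System.out.println(markValue(100L, 500L, perPoolDenominator(broker)));
    }
}
```

In this sketch a broker thread would mark two meters per loop, the broker meter with the per-pool denominator and the aggregate meter with the summed denominator.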

Monitor

Name | Type | Description
kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent | Double (meter rate) | Aggregate idle ratio across all request handler pools on the node. In KRaft combined mode, the denominator uses the sum of threads across broker and controller pools. In isolated mode, this equals the single pool’s ratio on that node.
kafka.server:type=KafkaRequestHandlerPool,name=BrokerRequestHandlerAvgIdlePercent | Double (meter rate) | Idle ratio for the broker request handler pool only. This metric is only created and exposed on brokers if the broker request handler pool exists. Brokers must not expose the controller request handler idle metric.
kafka.server:type=KafkaRequestHandlerPool,name=ControllerRequestHandlerAvgIdlePercent | Double (meter rate) | Idle ratio for the controller request handler pool only. This metric is only created and exposed on controllers if the controller request handler pool exists. Controllers must not expose the broker request handler idle metric.
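As an illustration of how these MBeans could be looked up, the sketch below builds the three object names and queries the platform MBean server of the current JVM. The class and method names are hypothetical. The attribute name OneMinuteRate is an assumption based on how Yammer meter MBeans are usually exposed; the sketch only finds the metrics when it runs inside the Kafka process, and a remote check would need a JMX connector instead.

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import javax.management.MBeanServer;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

/** Hypothetical helper for looking up the idle-ratio MBeans; not part of Kafka. */
final class IdleMetricNames {

    static final List<String> METRIC_NAMES = List.of(
        "RequestHandlerAvgIdlePercent",
        "BrokerRequestHandlerAvgIdlePercent",
        "ControllerRequestHandlerAvgIdlePercent");

    /** Builds the JMX object name listed in the table for the given metric. */
    static ObjectName objectNameFor(String metricName) {
        try {
            return new ObjectName("kafka.server:type=KafkaRequestHandlerPool,name=" + metricName);
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("invalid metric name: " + metricName, e);
        }
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        for (String metricName : METRIC_NAMES) {
            ObjectName objectName = objectNameFor(metricName);
            if (server.isRegistered(objectName)) {
                // Assumed attribute name; adjust if the meter exposes a different one.
                Object rate = server.getAttribute(objectName, "OneMinuteRate");
                System.out.println(metricName + " = " + rate);
            } else {
                System.out.println(metricName + " is not registered in this JVM");
            }
        }
    }
}
```

On an isolated broker, only the aggregate and broker metrics would be expected to be registered; on an isolated controller, only the aggregate and controller metrics.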


Compatibility, Deprecation, and Migration Plan

The impact is observability-only. In KRaft combined mode, RequestHandlerAvgIdlePercent will be normalized by the sum of broker and controller threads, so values will generally be between 0 and 1 as expected. Isolated mode is effectively unchanged. We will add BrokerRequestHandlerAvgIdlePercent and ControllerRequestHandlerAvgIdlePercent for per-pool visibility.

Test Plan

We will add unit tests to verify the new metrics, and we will monitor the RequestHandlerAvgIdlePercent metric to make sure its value stays between 0 and 1.

Alternatives

Only fix the existing RequestHandlerAvgIdlePercent

In this alternative, we do not add new metrics and only fix the original RequestHandlerAvgIdlePercent metric. We keep track of the number of threads in each pool and the number of threads across all of the request handler pools. The metric would use the global count when computing the average thread idle ratio.

PR: https://github.com/apache/kafka/pull/20356/files
Result when testing locally:


