Status

Current state: Accepted

Discussion thread: https://lists.apache.org/thread/3syt5jqcg6f4kq2kzxjv3s0hf5xfbcxb

JIRA: KAFKA-19691 - Getting issue details... STATUS

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

When Kafka Streams uses the classic consumer rebalance protocol, it also uses the consumer rebalance listener, which provides metrics to understand the latency of handling lost, revoked and assigned partitions. With the move to a dedicated rebalance protocol (KIP-1071), Kafka Streams does not use the consumer rebalance listener anymore, but instead uses a streams-specific implementation. This KIP proposes to add equivalent metrics for the streams-specific rebalance listener.

During rebalancing, Kafka Streams executes several critical callbacks:

  • Tasks Revoked: When partitions are revoked from a consumer
  • Tasks Assigned: When new partitions are assigned to a consumer
  • Tasks Lost: When a consumer loses tasks unexpectedly

This KIP proposes adding latency metrics for each rebalance callback to provide operators with the observability needed to effectively monitor and optimize Kafka Streams applications in production environments.

Public Interfaces

New Metrics

This KIP introduces the following new metrics at the thread level:

1. tasks-revoked-latency
  • Metric Nametasks-revoked-latency-avg and tasks-revoked-latency-max
  • Level: Thread
  • Groupstream-metrics
  • Description:
    • avg: The average time taken for tasks-revoked rebalance listener callback
    • max: The max time taken for tasks-revoked rebalance listener callback
  • Recording Level: INFO
  • Unit: milliseconds
2. tasks-assigned-latency
  • Metric Nametasks-assigned-latency-avg and tasks-assigned-latency-max
  • Level: Thread
  • Groupstream-metrics
  • Description:
    • avg: The average time taken for tasks-assigned rebalance listener callback
    • max: The max time taken for tasks-assigned rebalance listener callback
  • Recording Level: INFO
  • Unit: milliseconds
3. tasks-lost-latency
  • Metric Nametasks-lost-latency-avg and tasks-lost-latency-max
  • Level: Thread
  • Groupstream-metrics
  • Description:
    • avg: The average time taken for tasks-lost rebalance listener callback
    • max: The max time taken for tasks-lost rebalance listener callback
  • Recording Level: INFO
  • Unit: milliseconds

Proposed Changes


The implementation adds latency metrics to the DefaultStreamsRebalanceListener class by wrapping each rebalance callback (onTasksRevokedonTasksAssignedonTasksLost) with timing logic. A new RebalanceMetrics utility class encapsulates the creation of the metric sensors. 

All the metrics are recorded at the thread level to provide per-thread visibility into rebalance performance.

Compatibility, Deprecation, and Migration Plan

This change is fully backward compatible. Users who depend on consumer rebalance metrics and switch to the new protocol (KIP-1071) will need to update their alerts and dashboards to use the new streams-specific metrics. We will include a corresponding notice in the upgrade guide.

Test Plan

The changes will be covered with unit tests that verify metric recording during rebalance operations and validate the accuracy of recorded latencies:

  1. DefaultStreamsRebalanceListenerTest
    • Existed
  2. RebalanceMetricsTest
    • New

Rejected Alternatives

None.

  • No labels