Status
Current state: Accepted
Discussion thread: https://lists.apache.org/thread/3syt5jqcg6f4kq2kzxjv3s0hf5xfbcxb
JIRA:
KAFKA-19691
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
When Kafka Streams uses the classic consumer rebalance protocol, it also uses the consumer rebalance listener, which provides metrics to understand the latency of handling lost, revoked and assigned partitions. With the move to a dedicated rebalance protocol (KIP-1071), Kafka Streams does not use the consumer rebalance listener anymore, but instead uses a streams-specific implementation. This KIP proposes to add equivalent metrics for the streams-specific rebalance listener.
During rebalancing, Kafka Streams executes several critical callbacks:
- Tasks Revoked: When partitions are revoked from a consumer
- Tasks Assigned: When new partitions are assigned to a consumer
- Tasks Lost: When a consumer loses tasks unexpectedly
This KIP proposes adding latency metrics for each rebalance callback to provide operators with the observability needed to effectively monitor and optimize Kafka Streams applications in production environments.
Public Interfaces
New Metrics
This KIP introduces the following new metrics at the thread level:
1. tasks-revoked-latency
- Metric Name: tasks-revoked-latency-avg and tasks-revoked-latency-max
- Level: Thread
- Group: stream-metrics
- Description (avg): The average time taken for tasks-revoked rebalance listener callback
- Description (max): The max time taken for tasks-revoked rebalance listener callback
- Recording Level: INFO
- Unit: milliseconds
2. tasks-assigned-latency
- Metric Name: tasks-assigned-latency-avg and tasks-assigned-latency-max
- Level: Thread
- Group: stream-metrics
- Description (avg): The average time taken for tasks-assigned rebalance listener callback
- Description (max): The max time taken for tasks-assigned rebalance listener callback
- Recording Level: INFO
- Unit: milliseconds
3. tasks-lost-latency
- Metric Name: tasks-lost-latency-avg and tasks-lost-latency-max
- Level: Thread
- Group: stream-metrics
- Description (avg): The average time taken for tasks-lost rebalance listener callback
- Description (max): The max time taken for tasks-lost rebalance listener callback
- Recording Level: INFO
- Unit: milliseconds
Proposed Changes
The implementation adds latency metrics to the DefaultStreamsRebalanceListener class by wrapping each rebalance callback (onTasksRevoked, onTasksAssigned, onTasksLost) with timing logic. A new RebalanceMetrics utility class encapsulates the creation of the metric sensors.
All the metrics are recorded at the thread level to provide per-thread visibility into rebalance performance.
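The wrapping described above can be illustrated with a minimal, self-contained sketch. It does not use the Kafka metrics API: LatencySensor and TimedCallback are hypothetical names introduced only for illustration, standing in for the avg/max sensors and for the timing logic around each callback. Recording the latency in a finally block (so that a failing callback is still measured) is an assumption of this sketch, not something the KIP specifies.

```java
import java.util.function.LongSupplier;

/** Hypothetical stand-in for an avg/max latency sensor; NOT the Kafka Sensor API. */
final class LatencySensor {
    private long count;
    private long totalMs;
    private long maxMs;

    /** Records one latency sample, in milliseconds. */
    synchronized void record(long latencyMs) {
        count++;
        totalMs += latencyMs;
        maxMs = Math.max(maxMs, latencyMs);
    }

    /** Average of all recorded samples, or NaN if nothing was recorded yet. */
    synchronized double avg() {
        return count == 0 ? Double.NaN : (double) totalMs / count;
    }

    /** Largest recorded sample, or 0 if nothing was recorded yet. */
    synchronized long max() {
        return maxMs;
    }
}

/** Hypothetical helper that times a callback and records the elapsed milliseconds. */
final class TimedCallback {
    private final LongSupplier clockMs; // injectable clock so tests can control time
    private final LatencySensor sensor;

    TimedCallback(LongSupplier clockMs, LatencySensor sensor) {
        this.clockMs = clockMs;
        this.sensor = sensor;
    }

    /** Runs the callback and records its latency, even if it throws (an assumption). */
    void run(Runnable callback) {
        long startMs = clockMs.getAsLong();
        try {
            callback.run();
        } finally {
            sensor.record(clockMs.getAsLong() - startMs);
        }
    }
}
```

In this sketch, one TimedCallback per callback type (revoked, assigned, lost) would be created per stream thread, for example new TimedCallback(System::currentTimeMillis, new LatencySensor()). Injecting the clock keeps the timing logic testable without sleeping.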
Compatibility, Deprecation, and Migration Plan
This change is fully backward compatible. Users who depend on consumer rebalance metrics and switch to the new protocol (KIP-1071) will need to update their alerts and dashboards to use the new streams-specific metrics. We will include a corresponding notice in the upgrade guide.
Test Plan
The changes will be covered with unit tests that verify metric recording during rebalance operations and validate the accuracy of recorded latencies:
- DefaultStreamsRebalanceListenerTest (existing)
- RebalanceMetricsTest (new)
Rejected Alternatives
None.