Status

Current state:  Discussion

Discussion thread: https://lists.apache.org/thread/7xn8bd6obobnr1j284d13r993cb4nlt8

Vote thread:  here

JIRA: https://issues.apache.org/jira/browse/KAFKA-20402

PR:  https://github.com/apache/kafka/pull/21961

Motivation


Consumer group rebalance is one of the most critical lifecycle events for Kafka consumers because it directly affects how users consume messages. At the same time, Kafka recommends and is moving toward server-side rebalance.

Kafka clients already provide rebalance-related callbacks, but relying on client-side behavior is operationally fragile. In practice, callback availability and behavior depend on SDK adoption and upgrade cycles, which are often difficult to control. Even when teams implement the callback purely for logging, users may reduce or disable those logs by mistake or for other reasons. This creates a common failure mode: when incidents occur, critical rebalance evidence is missing on the client side, which makes diagnosis slow and uncertain and forces operators to query the logs on the Kafka broker side.

Today, however, server-side observability is limited: coordinator logs can indicate rebalance activity, but logs alone are not a reliable integration surface for automated handling, and existing metrics do not carry the most actionable identifier (the consumer group id). As a result, operators can observe aggregate rebalance trends but have no lightweight way to attach custom processing to specific groups without log parsing or invasive client changes.

This KIP proposes a lightweight broker-side rebalance callback that exposes key rebalance context (including the consumer group id) on the Kafka broker.

In short, the goal is to provide a stable, centrally controlled extension point for observability and operational automation, without forcing client SDK upgrades and without introducing high-cardinality metric tags. 

This interface also enables new use cases, for example AI-assisted troubleshooting of consumption issues.

Public Interfaces

This KIP introduces one new broker-side extension interface and one new broker configuration.

New Interface

/**
 * A callback interface that users can implement to get notified when a consumer group rebalance occurs.
 * <p>
 * This listener is invoked on the broker side (GroupCoordinator) when a consumer group's epoch is bumped,
 * which indicates that a rebalance has occurred.
 * <p>
 * Implementations of this interface can be used for monitoring, alerting, or logging purposes
 * to track which consumer groups are experiencing rebalances.
 */
public interface ConsumerGroupRebalanceListener {

    /**
     * Called when a consumer group rebalance occurs.
     *
     * @param groupId The ID of the consumer group that rebalanced
     * @param groupType The type of group that rebalanced
     * @param reason The rebalance reason
     * @param eventTimeMs The event timestamp in milliseconds
     */
    void onConsumerGroupRebalance(String groupId, String groupType, String reason, long eventTimeMs);
}
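
As an illustration of how the interface might be implemented, the following is a minimal sketch of a listener that logs each event and keeps a per-group count. It is not taken from the PR. The class name LoggingRebalanceListener and the rebalanceCount helper are hypothetical, and the interface is redeclared locally only so that the sketch compiles on its own (its package is not specified in this KIP). The sketch also assumes the callback may be invoked from more than one broker thread, so it uses concurrent data structures.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Redeclared from this KIP so the sketch is self-contained; the real interface
// would be provided by the broker.
interface ConsumerGroupRebalanceListener {
    void onConsumerGroupRebalance(String groupId, String groupType, String reason, long eventTimeMs);
}

// Hypothetical implementation: logs every rebalance and counts rebalances per group.
class LoggingRebalanceListener implements ConsumerGroupRebalanceListener {

    // Assumption: callbacks may arrive concurrently, so use thread-safe counters.
    private final ConcurrentHashMap<String, LongAdder> counts = new ConcurrentHashMap<>();

    @Override
    public void onConsumerGroupRebalance(String groupId, String groupType, String reason, long eventTimeMs) {
        counts.computeIfAbsent(groupId, id -> new LongAdder()).increment();
        // Keep the work here cheap; a slow listener would sit on the broker's rebalance path.
        System.out.printf("rebalance group=%s type=%s reason=%s timeMs=%d%n",
                groupId, groupType, reason, eventTimeMs);
    }

    // Hypothetical accessor, e.g. for an alerting hook; returns 0 for unknown groups.
    long rebalanceCount(String groupId) {
        LongAdder adder = counts.get(groupId);
        return adder == null ? 0L : adder.sum();
    }
}
```

The group id is kept inside the listener rather than published as a metric tag, which is in line with the goal of avoiding high-cardinality metrics.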

New Broker Configuration

group.consumer.rebalance.listener.classes (LIST, default: empty)

  • A list of fully qualified class names implementing ConsumerGroupRebalanceListener.
  • Classes are instantiated and configured by the broker at startup.
  • An empty value means the callback feature is disabled.


Note:  Dynamic updates are out of scope for now, so that we can keep the KIP small and low-risk and validate real usage first.

If adoption proves strong, we can add dynamic reconfiguration later without changing the callback interface.

Proposed Changes

Add a lightweight broker-side callback path for consumer-group rebalance events, without changing client APIs and without adding high-cardinality metrics.

At startup, the broker loads the classes listed in group.consumer.rebalance.listener.classes. It then invokes the callback whenever a rebalance happens, that is, whenever a consumer group's epoch is bumped.

Refer to the PR for an example implementation.

Note:  This covers all non-classic consumer groups, because the classic consumer group will be deprecated soon.
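
The startup loading step described in this section can be sketched roughly as follows. This is an illustrative approximation, not the code in the PR: RebalanceListenerLoader and NoOpRebalanceListener are hypothetical names, the interface is redeclared locally so the sketch compiles on its own, and plain reflection is used for simplicity. Kafka's config classes have their own helpers for pluggable classes (for example AbstractConfig.getConfiguredInstances); whether the PR uses them is not stated here.

```java
import java.util.ArrayList;
import java.util.List;

// Redeclared from this KIP so the sketch is self-contained.
interface ConsumerGroupRebalanceListener {
    void onConsumerGroupRebalance(String groupId, String groupType, String reason, long eventTimeMs);
}

// Hypothetical listener that exists only so the loader has something to instantiate.
class NoOpRebalanceListener implements ConsumerGroupRebalanceListener {
    @Override
    public void onConsumerGroupRebalance(String groupId, String groupType, String reason, long eventTimeMs) {
        // intentionally empty
    }
}

// Hypothetical helper: turns the value of group.consumer.rebalance.listener.classes
// into listener instances.
final class RebalanceListenerLoader {

    private RebalanceListenerLoader() { }

    static List<ConsumerGroupRebalanceListener> load(String classList) {
        List<ConsumerGroupRebalanceListener> listeners = new ArrayList<>();
        if (classList == null || classList.trim().isEmpty()) {
            return listeners; // empty value: the callback feature is disabled
        }
        for (String entry : classList.split(",")) {
            String className = entry.trim();
            if (className.isEmpty()) {
                continue;
            }
            try {
                // asSubclass rejects classes that do not implement the interface.
                Class<? extends ConsumerGroupRebalanceListener> clazz =
                        Class.forName(className).asSubclass(ConsumerGroupRebalanceListener.class);
                listeners.add(clazz.getDeclaredConstructor().newInstance());
            } catch (ReflectiveOperationException | ClassCastException e) {
                // Assumption: a bad class name should fail startup loudly rather than be skipped.
                throw new IllegalStateException("Cannot load rebalance listener " + className, e);
            }
        }
        return listeners;
    }
}
```

Failing fast on an invalid class name is a design choice of this sketch; it surfaces misconfiguration at startup, when the setting is read, instead of at the first rebalance.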

Compatibility, Deprecation, and Migration Plan 

  • Backward Compatibility
    This KIP only adds a new interface, so there is no backward compatibility impact.

  • Deprecation
    N/A
  • Migration for Existing Deployments
    N/A. The feature is optional.

Test Plan

We can use the following tests to cover the change:

Integration tests:  Add a logging implementation of the interface and test it in a deployment.

Unit tests:  Cover the exception-handling path.
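
To make the exception-handling case concrete, here is a minimal sketch of the behavior a unit test could check: a listener that throws should not stop the remaining listeners from being called. This assumes the broker isolates listener failures in this way, which this KIP does not spell out. RebalanceListenerDispatcher and its dispatch method are hypothetical names, and the interface is redeclared locally so the sketch compiles on its own.

```java
import java.util.ArrayList;
import java.util.List;

// Redeclared from this KIP so the sketch is self-contained.
interface ConsumerGroupRebalanceListener {
    void onConsumerGroupRebalance(String groupId, String groupType, String reason, long eventTimeMs);
}

// Hypothetical dispatcher: calls every listener and isolates failures.
final class RebalanceListenerDispatcher {

    private final List<ConsumerGroupRebalanceListener> listeners;

    RebalanceListenerDispatcher(List<ConsumerGroupRebalanceListener> listeners) {
        this.listeners = new ArrayList<>(listeners); // defensive copy
    }

    // Returns how many listeners threw, so a test can assert on it.
    int dispatch(String groupId, String groupType, String reason, long eventTimeMs) {
        int failures = 0;
        for (ConsumerGroupRebalanceListener listener : listeners) {
            try {
                listener.onConsumerGroupRebalance(groupId, groupType, reason, eventTimeMs);
            } catch (RuntimeException e) {
                // Assumption: a faulty plugin must not break the rebalance itself
                // or prevent later listeners from running.
                failures++;
                System.err.println("Rebalance listener " + listener.getClass().getName()
                        + " failed for group " + groupId + ": " + e);
            }
        }
        return failures;
    }
}
```

A test along these lines would register one listener that always throws and one that records its invocation, dispatch a single event, and assert that exactly one failure is reported and that the second listener was still called.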

Rejected Alternatives

Alternative 1:  Use the client-side consumer rebalance listener
Relying on client-side behavior is operationally fragile. In practice, callback availability and behavior depend on SDK adoption and upgrade cycles, which are often difficult to control. Even when teams implement the callback purely for logging, users may reduce or disable those logs by mistake or for other reasons. This creates a common failure mode: when incidents occur, critical rebalance evidence is missing on the client side, which makes diagnosis slow and uncertain.

Alternative 2:  Enhance the existing server-side metrics

Enhancing existing coordinator metrics is not sufficient for this use case because the key troubleshooting dimension is group.id, while server metrics are intentionally aggregate. Adding group.id as a metric label/tag would introduce high-cardinality time series (potentially thousands of groups), which increases memory cost on the broker and can degrade Kafka performance.