JoinGroupRequest => GroupId SessionTimeout MemberId ProtocolsProtocolType GroupProtocols GroupId => String SessionTimeout => int32 MemberId => String ProtocolType => String GroupProtocols => [Protocol MemberMetadata] Protocol => String MemberMetadata => bytes JoinGroupResponse => ErrorCode GroupGenerationId GroupLeaderId MemberId Members ErrorCode => int16 GroupGenerationId => int32 GroupLeaderIdGroupProtocol => String GroupProtocolGroupLeaderId => String MemberId => String Members => [MemberId MemberMetadata] MemberId => String MemberMetadata => bytes
SyncGroupRequest => GroupId GroupGenerationId MemberId
GroupId => String
GroupGenerationId => int32
SyncErrorCode => int16
GroupState => [MemberId MemberState]
MemberId => String
MemberState => bytes
SyncGroupResponse => ErrorCode MemberState
ErrorCode => int16
MemberState => bytes
Above we outlined the generalized JoinGroup protocol that the consumer will use. Next we show how we will implement consumer semantics on top of this protocol. Other use cases for the join group protocol would be implemented similarly.
The two phases of the group protocol correspond to subscription and assignment for the new consumer. Each member of the group submits their subscription as member metadata. The leader of the group collects all subscription in its
JoinGroup response and sends the assignment as member state in
SyncGroup. There are several advantages to having a single assignor:
- Since the leader makes the assignment for the full group, it is the single source of truth for the metadata used in its decision making. This avoids the need to synchronize metadata among all members that is required in a multi-assignor approach.
- The leader of the group can enforce its own policy for controlling the rate of rebalancing. It doesn't have to rebalance after every metadata change, but can "batch" changes together to reduce the impact of metadata churn.
- The leader is the only member that needs to receive the metadata from all members of the group. This reduces the overhead of the protocol.
The group protocol used by the consumer in the JoinGroup request corresponds to the assignment strategy that the leader will use to determine partition assignment. This allows the consumer to upgrade from one assignment strategy to another without downtime. The metadata corresponding to the assignment strategy can be strategy-specific, but generally it will include the group subscriptions for the member. The state returned to members in the SyncGroup will include the partitions assigned to that member.
For all assignment strategies, group members provide their subscriptions as an array of strings. This subscription can either be a list of topics or regular expressions (TODO: do we need distinguisher field to tell the difference? how about regex compatibility?). Partition assignments are provided in the SyncGroup response as an array of topics and partitions. The protocol supports custom data in both the subscription and assignment as a generic array of bytes to allow for custom assignor implementations. For example, a rack-aware assignor will generally need to propagate the rackId of each member to the leader in its subscription so that it can take it into account for assignment.
ProtocolType => "consumer"
GroupProtocol => AssignmentStrategy
AssignmentStrategy => String
MemberMetadata => Version Subscription AssignmentStrategies
Version => int16
Subscription => Topics UserData
Topics => [String]
UserData => Bytes
MemberState => Version Assignment
Version => int16
Assignment => TopicPartitions UserData
TopicPartitions => [Topic Partitions]
Topic => String
Partitions => [int32]
UserData => Bytes
Protocol: Briefly, this is how the protocol works for the consumer.
- Members JoinGroup with their respective subscriptions.
- The leader collects member subscriptions from its JoinGroup response and performs the group assignment.
- All members (including the leader) send SyncGroup to find their assignment.
- Once created, there are two cases which can trigger reassignment:
- Topic metadata changes which have no impact on subscriptions cause resync. The leader computes the new assignment and sends SyncGroup.
- Membership or subscription changes cause rejoin.
Rolling Upgrades: To support rolling upgrades without downtime, there are two cases to consider:
- Changes affecting subscription: the protocol directly supports differing subscriptions, so there is no need for special handling. Members will only be assigned partitions compatible with their subscription.
- Assignment strategy changes: to support a change to the assignment strategy, new versions must enable support both for the old assignment strategy and the new one. The coordinator will choose the old assignment strategy until all members have been updated. Then it will choose the new strategy. This preference is implicit in the order of the strategies in the JoinGroup request.
Handling Coordinator Failures: This proposal largely shares the coordinator failure cases and recovery mechanism from the initial protocol documented in Kafka 0.9 Consumer Rewrite Design. The recovery process depends on whether group state is persisted (e.g. in Zookeeper). With no persistence, then group members will generally have to rejoin the group when the new coordinator becomes active. Below, we assume persistence and show how failures are treated at the various stages of the protocol.
- All members send JoinGroup.
- Generation metadata is persisted.
- All members send SyncGroup.
- Sync metadata is persisted.
Coordinator failures at these steps are handled in the following ways:
- If the coordinator fails before all members have joined the group or before group metadata has been persisted, then all members will resend their JoinGroup requests once the new coordinator is stable.
- The coordinator may fail after group metadata has been persisted, but before all members of the group have received a response to their JoinGroup requests. The problem here is that once the new coordinator is stable, the leader may try to immediately synchronize while other members are still trying to join. However, when the coordinator receives a JoinGroup request from any member, it must abort any active synchronization and force all members to rejoin.
- If the coordinator fails before a pending synchronization has been persisted, then all members will re-initiate the SyncGroup once the new coordinator is ready.
- If the coordinator fails after the metadata has been persisted, but before all members have received the SyncGroup response, then those members will initiate SyncGroup upon failover. Assuming synchronized state is persisted, then the coordinator can return that member's state immediately without forcing other members to resync. If it is not persisted, then a full group resync is required.
Other Interesting Cases:
- Leader Failures: The leader of each group is responsible for initiating group synchronization when topic metadata changes. A leader failure is detected by the coordinator through the expiration of its session timeout. The coordinator will respond by forcing all members to rejoin, which will allow a new leader to be elected.
- Assignment Failure: As mentioned above, there are several ways that the synchronization/assignment phase can fail. Generally, they are handled by having group members rejoin the group. The most interesting case is when the leader encounters an unrecoverable error when it computes the group's assignment. This could happen, for example, if group members don't agree on the assignment strategy to use. In this case, the assignment failure is forwarded to the broker which can then propagate it to awaiting members.
- Subscription Change: If a member changes its subscription, then it must force the group to be recreated by sending a JoinGroup request to the coordinator. This will cause the coordinator to reply to the other member's heartbeats with an error indicating that rejoin is needed, which will cause them to also send JoinGroup requests.
- Topic Metadata Change: The leader is responsible for detecting topic metadata changes which affect the group's subscription. When it finds a change, it can immediately compute the new assignment and initiate a SyncGroup with the coordinator.
To support client-side assignment, we'd have to make the following changes:
- Migrate existing assignment strategies from the broker to the client. Since the assignment interface is nearly the same, this should be straightforward.
- Modify client/server for the new join group protocol. Since we're not really changing the protocol (just the information that is passed through it), this should also be straightforward.
- Remove offset validation from the consumer coordinator. Just a couple lines to remove for this.
- Add support for assignment versioning (if we decide we need it). Depending on what we do, may or may not be trivial.