...
We add a new quorum state, Prospective. Note: This adds a new invariant that only the Prospective state can transition to the Candidate state.
New ProspectiveState
A follower will now transition to Prospective instead of Candidate when its fetch timeout expires. Servers will only be able to transition to the Candidate state from the Prospective state.
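The transition rule above can be sketched as a small state model. This is an illustrative Python model only, not Kafka's implementation; the enum and the function names `on_fetch_timeout` and `can_transition_to_candidate` are hypothetical names introduced for this sketch.

```python
from enum import Enum


class QuorumState(Enum):
    # Only the states needed for this sketch are modelled here.
    UNATTACHED = "Unattached"
    FOLLOWER = "Follower"
    PROSPECTIVE = "Prospective"
    CANDIDATE = "Candidate"


def on_fetch_timeout(current: QuorumState) -> QuorumState:
    """A follower whose fetch timeout expired moves to Prospective, not Candidate."""
    if current is QuorumState.FOLLOWER:
        return QuorumState.PROSPECTIVE
    return current


def can_transition_to_candidate(current: QuorumState) -> bool:
    """Invariant from the text: only Prospective may transition to Candidate."""
    return current is QuorumState.PROSPECTIVE
```

Under this model, a Follower can never reach Candidate in a single step; it has to pass through Prospective first.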
...
- nothing changes and the replica is unable to receive enough vote responses from the quorum before randomElectionTimeoutMs, the replica won't increase its epoch.
- PreVote is rejected, the replica won't increase its epoch and will transition to Unattached or Follower in an attempt to reach the leader.
- PreVote is granted (which indicates the replica is able to communicate with at least a majority of the quorum) and the replica transitions to Candidate with a disruptive epoch bump. We cannot assume the new election will be granted, but we had a good indication that the replica had a chance of being able to communicate with a majority of the quorum, and that the majority would grant the vote.
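The three outcomes listed above can be summarised in a short decision function. This is a hedged sketch rather than Kafka's code: `PreVoteOutcome`, `after_pre_vote`, and the `leader_known` flag are hypothetical. It also assumes that the epoch bump is exactly `epoch + 1`, and that the choice between Follower and Unattached depends on whether a leader is known. The next state in the timed-out case is left as `None` because the list above only says that the epoch is unchanged.

```python
from enum import Enum, auto
from typing import Optional, Tuple


class PreVoteOutcome(Enum):
    TIMED_OUT = auto()  # not enough responses before randomElectionTimeoutMs
    REJECTED = auto()   # PreVote rejected
    GRANTED = auto()    # PreVote granted


def after_pre_vote(
    outcome: PreVoteOutcome, epoch: int, leader_known: bool
) -> Tuple[Optional[str], int]:
    """Return (next_state, next_epoch) for a Prospective replica."""
    if outcome is PreVoteOutcome.GRANTED:
        # The only path that bumps the epoch (assumed here to be epoch + 1).
        return "Candidate", epoch + 1
    if outcome is PreVoteOutcome.REJECTED:
        # Assumption: Follower when a leader is known, otherwise Unattached.
        return ("Follower" if leader_known else "Unattached"), epoch
    # Timed out: epoch is unchanged; the next state is not modelled here.
    return None, epoch
```

The point of the sketch is that the epoch only changes on the granted path, so a replica that cannot reach a majority never disrupts the quorum with a higher epoch.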
For the scenario where a majority of votes are rejected, it also makes sense for the Candidate state to have a backoff or to wait the remainder of the random election timeout (as suggested by the Raft paper). However, we arguably do not need an exponentially increasing backoff. Candidate will transition to Prospective on loss of the election, which provides a buffer against another disruptive epoch increase. Keeping the exponential backoff behavior adds bloat to the Prospective state and unneeded complexity (e.g. tracking the number of times a replica has transitioned back and forth between the Candidate and Prospective states; the exponential calculation is also hard to read). However, we will treat changing the backoff behavior in this scenario as out of scope, as it is not immediately obvious what would be a better alternative (e.g. a smaller uniformly random election backoff, which means deprecating max election timeout ms, or waiting out the rest of the random election timeout, which means potentially longer unavailability of the quorum).
FollowerState changes
Followers now track votedKey. This change is not a required feature of the KIP, but we should not drop persisted state during quorum state transitions in the same epoch. (In the past, we would lose this information on transitions from Unattached with votedKey to Follower in the same epoch.) Now, a transition from Prospective with votedKey to Follower in the same epoch is also possible.
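A minimal sketch of carrying votedKey across a same-epoch transition follows. It is an illustrative Python model, not Kafka's implementation: the classes `ReplicaKey`, `VotingState`, and `FollowerState` and the function `transition_to_follower` are hypothetical, and the sketch assumes a vote is only meaningful within the epoch in which it was cast.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReplicaKey:
    id: int
    directory_id: str


@dataclass(frozen=True)
class VotingState:
    """Stands in for Unattached or Prospective, both of which may hold a vote."""
    epoch: int
    voted_key: Optional[ReplicaKey]


@dataclass(frozen=True)
class FollowerState:
    epoch: int
    leader_id: int
    voted_key: Optional[ReplicaKey]


def transition_to_follower(prev: VotingState, epoch: int, leader_id: int) -> FollowerState:
    # Keep the persisted vote only when the epoch is unchanged
    # (assumption: a vote does not carry over to a different epoch).
    voted_key = prev.voted_key if prev.epoch == epoch else None
    return FollowerState(epoch, leader_id, voted_key)
```

With this shape, a replica that voted in epoch 5 and then discovers the leader of epoch 5 still remembers whom it voted for.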
ResignedState changes
Resigned voters used to transition directly to Candidate after waiting an election timeout (observers would transition to UnattachedState with epoch + 1). If we simply replace transitionToCandidate with transitionToProspective, a cordoned leader in epoch 5 could resign in epoch 5, transition to Prospective in epoch 5 (with leaderId=localId), fail the election, and then attempt to become a follower of itself in epoch 5. To address this, Resigned must increase its epoch when it transitions.
We can simplify the transition further by having Resigned always transition to Unattached with epoch + 1 after the election timeout (regardless of whether it is a voter or an observer), and by having transitionToUnattached initialize the new electionTimeoutMs to the resignedState's remainingElectionTimeoutMs if it is a voter. This effectively causes Resigned voters to transition immediately to Prospective after an election timeout.
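The simplified Resigned transition described above might look like the following. This is a sketch under stated assumptions, not Kafka's code: the dataclasses and `resigned_to_unattached` are hypothetical, and the timeout used for observers (a fresh random election timeout passed in by the caller) is an assumption, since the text only specifies the voter case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResignedState:
    epoch: int
    remaining_election_timeout_ms: int


@dataclass(frozen=True)
class UnattachedState:
    epoch: int
    election_timeout_ms: int


def resigned_to_unattached(
    resigned: ResignedState, is_voter: bool, random_election_timeout_ms: int
) -> UnattachedState:
    if is_voter:
        # Voters inherit whatever is left of the Resigned election timeout.
        timeout_ms = resigned.remaining_election_timeout_ms
    else:
        # Assumption: observers start with a fresh random timeout.
        timeout_ms = random_election_timeout_ms
    # The epoch always increases, so the replica can never end up
    # following itself in the epoch it resigned from.
    return UnattachedState(resigned.epoch + 1, timeout_ms)
```

Because the transition happens after the election timeout has expired, a voter's remaining timeout is zero, so its new Unattached state times out immediately and it moves on to Prospective.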
(For more discussion about alternatives and why this option was chosen, see https://github.com/apache/kafka/pull/18240#discussion_r1899341945)
Observers
Similar to how Observers cannot transition to Candidate, they cannot transition to Prospective.
...