...


We add a new quorum state, Prospective, for servers that are sending Pre-Vote requests, along with new state transitions. The original states are listed first and the new states second for comparison.

Note: This adds a new invariant that only the Prospective state can transition to the Candidate state.

Original states:

 * Unattached|Resigned transitions to:
 *    Unattached: After learning of a new election with a higher epoch
 *    Voted: After granting a vote to a candidate
 *    Candidate: After expiration of the election timeout
 *    Follower: After discovering a leader with an equal or larger epoch
 *
 * Candidate transitions to:
 *    Unattached: After learning of a new election with a higher epoch
 *    Candidate: After expiration of the election timeout
 *    Leader: After receiving a majority of votes
 *
 * Leader transitions to:
 *    Unattached: After learning of a new election with a higher epoch
 *    Resigned: When shutting down gracefully
 *
 * Follower transitions to:
 *    Unattached: After learning of a new election with a higher epoch
 *    Candidate: After expiration of the fetch timeout
 *    Follower: After discovering a leader with a larger epoch

New states:

 * Resigned transitions to:
 *    Unattached: After learning of an election with a higher epoch, or expiration of the election timeout
 *    Follower: After discovering a leader with a larger epoch
 *
 * Unattached transitions to:
 *    Unattached: After learning of an election with a higher epoch
 *    Voted: After granting a standard vote to a candidate
 *    Prospective: After expiration of the election timeout
 *    Follower: After discovering a leader with an equal or larger epoch
 *
 * Prospective transitions to:
 *    Unattached: After learning of an election with a higher epoch, or node did not have last known leader and loses/times out election
 *    Candidate: After receiving a majority of pre-votes
 *    Follower: After discovering a leader with a larger epoch, or node had a last known leader and loses/times out election
 *
 * Candidate transitions to:
 *    Unattached: After learning of a candidate with a higher epoch
 *    Prospective: After expiration of the election timeout or loss of election
 *    Leader: After receiving a majority of standard votes
 *    Follower: After discovering a leader with an equal or larger epoch (missed in original docs)
 *
 * Leader transitions to:
 *    Unattached: After learning of a candidate with a higher epoch
 *    Resigned: When shutting down gracefully
 *
 * Follower transitions to:
 *    Unattached: After learning of a candidate with a higher epoch
 *    Prospective: After expiration of the fetch timeout
 *    Follower: After discovering a leader with a larger epoch
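The new transitions can also be written down as a small lookup, which makes the invariant (only Prospective may move to Candidate) easy to check mechanically. This is a minimal sketch, not the Kafka implementation: `QuorumStateKind`, `targets`, `canTransitionTo` and `sourcesOf` are hypothetical names, and the Voted target is left out because the list above does not give its outgoing transitions.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical names for illustration only; these are not the Kafka classes.
enum QuorumStateKind {
    RESIGNED, UNATTACHED, PROSPECTIVE, CANDIDATE, LEADER, FOLLOWER;

    // Targets each state may move to, following the new-state list (Voted omitted).
    Set<QuorumStateKind> targets() {
        switch (this) {
            case RESIGNED:    return EnumSet.of(UNATTACHED, FOLLOWER);
            case UNATTACHED:  return EnumSet.of(UNATTACHED, PROSPECTIVE, FOLLOWER);
            case PROSPECTIVE: return EnumSet.of(UNATTACHED, CANDIDATE, FOLLOWER);
            case CANDIDATE:   return EnumSet.of(UNATTACHED, PROSPECTIVE, LEADER, FOLLOWER);
            case LEADER:      return EnumSet.of(UNATTACHED, RESIGNED);
            case FOLLOWER:    return EnumSet.of(UNATTACHED, PROSPECTIVE, FOLLOWER);
            default:          throw new IllegalStateException("unknown state " + this);
        }
    }

    boolean canTransitionTo(QuorumStateKind target) {
        return targets().contains(target);
    }

    // All states that list the given state as a target.
    static Set<QuorumStateKind> sourcesOf(QuorumStateKind target) {
        Set<QuorumStateKind> sources = EnumSet.noneOf(QuorumStateKind.class);
        for (QuorumStateKind state : values()) {
            if (state.canTransitionTo(target)) {
                sources.add(state);
            }
        }
        return sources;
    }
}
```

With this table, `QuorumStateKind.sourcesOf(QuorumStateKind.CANDIDATE)` contains only `PROSPECTIVE`, which is the invariant stated in the note.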


New ProspectiveState

A follower will now transition to Prospective instead of Candidate when its fetch timeout expires. Servers will only be able to transition to Candidate state from the Prospective state.

...

  • nothing changes and the replica is unable to receive enough vote responses from the quorum before randomElectionTimeoutMs, the replica won't increase its epoch.
  • PreVote is rejected, the replica won't increase its epoch and will transition to Unattached or Follower in an attempt to reach the leader.
  • PreVote is granted (which indicates the replica is able to communicate with at least a majority of the quorum), the replica transitions to Candidate with a disruptive epoch bump. We cannot assume the new election will be won, but we have a good indication that the replica can communicate with a majority of the quorum and that the majority would grant the vote.
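The three outcomes above can be summarized in a short sketch. It assumes, following the Prospective entry in the transition list, that a lost or timed-out Pre-Vote leads to Follower when the replica had a last known leader and to Unattached otherwise, and that the epoch bump on becoming Candidate is exactly one. `PreVoteOutcomeSketch` and its members are hypothetical names, not Kafka APIs.

```java
// Hypothetical names for illustration only; not the Kafka implementation.
final class PreVoteOutcomeSketch {
    enum Outcome { TIMED_OUT, REJECTED, GRANTED }
    enum Next { UNATTACHED, FOLLOWER, CANDIDATE }

    // Where a Prospective replica goes once its Pre-Vote round is decided.
    static Next nextState(Outcome outcome, boolean hasLastKnownLeader) {
        if (outcome == Outcome.GRANTED) {
            return Next.CANDIDATE;
        }
        // Rejected or timed out: fall back toward the leader if one is known.
        return hasLastKnownLeader ? Next.FOLLOWER : Next.UNATTACHED;
    }

    // Only a granted Pre-Vote leads to the disruptive epoch bump
    // (assumed here to be +1).
    static int nextEpoch(Outcome outcome, int currentEpoch) {
        return outcome == Outcome.GRANTED ? currentEpoch + 1 : currentEpoch;
    }
}
```

The point of the split is that the epoch only changes in the one branch where the replica has evidence it could win a real election.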

For the scenario of receiving a majority of rejected votes, it also makes sense for the Candidate state to have a backoff or to wait out the remainder of the random election timeout (as suggested by the Raft paper). However, we arguably do not need an exponentially increasing backoff. Candidate will transition to Prospective on loss of the election, which provides a buffer against another disruptive epoch increase. Keeping the exponential backoff behavior adds bloat to the Prospective state and unneeded complexity (e.g. tracking the number of times a replica has transitioned back and forth between Candidate and Prospective, and an exponential calculation that is hard to read). However, we treat changing the backoff behavior in this scenario as out of scope, as it is not immediately obvious what a better alternative would be (e.g. a smaller uniformly random election backoff, which means deprecating max election timeout ms, or waiting out the rest of the random election timeout, which means potentially longer unavailability of the quorum).

FollowerState changes


Followers now track votedKey. This change is not a required feature of the KIP, but we should not drop persisted state during quorum state transitions within the same epoch. (In the past, we would lose this information on transitions from Unattached with votedKey to Follower in the same epoch.) Now, a transition from Prospective with votedKey to Follower in the same epoch is also possible.
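A minimal sketch of the intended behavior: when a replica becomes a Follower in the same epoch, the vote it already recorded is carried over rather than discarded. `FollowerSketch` and `fromPrevious` are hypothetical names, the key is simplified to a string, and clearing the vote when the epoch increases is an assumption based on votes being scoped to a single epoch.

```java
import java.util.Optional;

// Hypothetical names for illustration only; not the Kafka FollowerState class.
final class FollowerSketch {
    final int epoch;
    final int leaderId;
    final Optional<String> votedKey;

    private FollowerSketch(int epoch, int leaderId, Optional<String> votedKey) {
        this.epoch = epoch;
        this.leaderId = leaderId;
        this.votedKey = votedKey;
    }

    // Build the Follower state from the previous state's epoch and vote.
    static FollowerSketch fromPrevious(int previousEpoch,
                                       Optional<String> previousVotedKey,
                                       int newEpoch,
                                       int leaderId) {
        // Same epoch: keep the persisted vote. Higher epoch: assumed to reset.
        Optional<String> carried =
            newEpoch == previousEpoch ? previousVotedKey : Optional.<String>empty();
        return new FollowerSketch(newEpoch, leaderId, carried);
    }
}
```

The same rule covers both source states mentioned above, Unattached with votedKey and Prospective with votedKey.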

ResignedState changes

Resigned voters used to transition directly to Candidate after waiting an election timeout (observers would transition to UnattachedState with epoch + 1). If we simply replace transitionToCandidate with transitionToProspective, a cordoned leader in epoch 5 could resign in epoch 5, transition to Prospective in epoch 5 (with leaderId=localId), fail the election, and then attempt to become a follower of itself in epoch 5. To address this, Resigned must increase its epoch when it transitions.

We can simplify the transition further to have Resigned always transition to Unattached with epoch + 1 after the election timeout (no matter if it is a voter or observer), and have transitionToUnattached initialize the new electionTimeoutMs to the resignedState's remainingElectionTimeoutMs if it is a voter. This effectively causes Resigned voters to transition immediately to Prospective after an election timeout. 

(For more discussion about alternatives and why this option was chosen, see https://github.com/apache/kafka/pull/18240#discussion_r1899341945)
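The simplified Resigned transition can be sketched as follows. `ResignedSketch`, `UnattachedSketch` and `onElectionTimeout` are hypothetical names, and the fresh timeout used for observers is an assumption, since the text only specifies the voter case.

```java
// Hypothetical names for illustration only; not the Kafka implementation.
final class ResignedSketch {
    static final class UnattachedSketch {
        final int epoch;
        final long electionTimeoutMs;

        UnattachedSketch(int epoch, long electionTimeoutMs) {
            this.epoch = epoch;
            this.electionTimeoutMs = electionTimeoutMs;
        }
    }

    // Called once the Resigned state's election timeout has expired.
    static UnattachedSketch onElectionTimeout(int epoch,
                                              boolean isVoter,
                                              long remainingElectionTimeoutMs,
                                              long freshElectionTimeoutMs) {
        // Voters inherit the remaining (already expired) timeout, so they move
        // on to Prospective right away. Observers get a fresh timeout (assumed).
        long timeoutMs = isVoter ? remainingElectionTimeoutMs : freshElectionTimeoutMs;
        // The epoch always increases, so a former leader can never end up
        // following itself in the epoch it resigned in.
        return new UnattachedSketch(epoch + 1, timeoutMs);
    }
}
```

For the cordoned leader example, a voter resigning in epoch 5 becomes Unattached in epoch 6 with no election timeout left to wait.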

Observers

Just as Observers cannot transition to Candidate, they cannot transition to Prospective.
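One way to express this restriction is a guard at the start of the transition. This is a sketch under assumptions: `ObserverGuardSketch` and its `transitionToProspective` method are hypothetical, and raising `IllegalStateException` is an illustrative choice rather than confirmed behavior.

```java
// Hypothetical names for illustration only; not the Kafka implementation.
final class ObserverGuardSketch {
    // Returns the name of the new state, or throws if the replica is an observer.
    static String transitionToProspective(boolean isVoter) {
        if (!isVoter) {
            throw new IllegalStateException(
                "Cannot transition to Prospective: the local replica is an observer");
        }
        return "Prospective";
    }
}
```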

...