Status
Current state: Accepted
Discussion thread: https://lists.apache.org/thread/ko478l71jf9hqhhg328tcdr46nj2wcz9
Vote thread: https://lists.apache.org/thread/1bhgq12dpc4s20pfos88t1wgz3l4y7lk
JIRA: KAFKA-19400
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
This KIP provides a simple solution to an availability issue exposed by the interaction between the auto-join feature proposed in KIP-853 and the way the AddRaftVoterRequest RPC is currently handled. The problem arises because the active controller does not send a response to the AddRaftVoterRequest until after the new voter set is committed, and because KRaft (and Kafka in general) supports only one in-flight request to a node. Consider the following scenario:
Some controller A that is automatically joining by sending the AddRaftVoterRequest RPC is the same controller whose Fetch is needed to commit the new voter set. A clear example of this is when bootstrapping with --standalone and having controllers auto-join, as the first controller to auto-join will increase the voter set size from 1 to 2. The active controller needs controller A to complete a Fetch RPC to complete the AddRaftVoterRequest RPC, but controller A cannot send a FetchRequest until its AddRaftVoterRequest returns or times out. The general case being described here is going from a voter set of size X to size X + 1, where a minority of X + 1 nodes from the new voter set are unavailable.
The reason this scenario causes unavailability is as follows: the current in-flight request for controller A is the AddRaftVoterRequest RPC, which cannot complete until controller A has replicated the new voter set and sent another Fetch to the leader. Because only one request can be in flight, controller A cannot send that Fetch while the AddRaftVoterRequest is pending. This state will expire the leader's checkQuorumTimer and cause the leader to resign, since the majority of nodes (which includes controller A) will not be able to fetch in time. Thus, the current implementation of AddRaftVoterRequest causes an unnecessary leadership failover and election when running the auto-join feature.
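The majority arithmetic behind this scenario can be sketched as follows. This is a minimal illustration, not Kafka code: the class and method names are hypothetical, and the sketch assumes only that the leader counts toward replication of its own log and that a commit needs a majority of the new voter set.

```java
// Hypothetical helper, not part of Kafka: models the majority arithmetic only.
final class VoterSetCommitSketch {

    /** Smallest number of voters that forms a majority of a voter set. */
    static int majority(int voterSetSize) {
        return voterSetSize / 2 + 1;
    }

    /**
     * Returns true if the new voter set can be committed when the leader plus
     * the given number of fetching followers have replicated it.
     */
    static boolean canCommit(int newVoterSetSize, int followersFetching) {
        int replicas = 1 + followersFetching; // the leader plus fetching followers
        return replicas >= majority(newVoterSetSize);
    }
}
```

With a standalone leader and one joining controller, `canCommit(2, 0)` is false: the new voter set of size 2 needs both voters, so the leader cannot commit until the joining controller fetches.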
It is important to note that this scenario does not apply when handling the RemoveRaftVoterRequest RPC as part of auto-join, because committing the new VotersRecord requires a majority of the new voter set, and the new voter set does not include the voter being removed.
Additionally, this unavailability scenario does not apply when adding controllers to the voter set manually (e.g. via an AdminClient command). This is because the controller node being added (i.e. controller A from the example above) is not the one sending the AddRaftVoterRequest RPC in this situation, so the controller is free to fetch and commit the new voter set.
Proposed Changes
The proposed change is to update the AddRaftVoterRequest RPC with a boolean flag that tells the active controller when to send a response for the RPC: after the new voter set is committed, or after the new voter set is written locally.
When the request is coming from another controller as a part of auto-join, the active controller will send a response after it appends the new voter set to its local log, rather than after that voter set is committed. This allows the "joining" replica to actually fetch the new voter set in time.
This change is sufficient because the main reason for not completing the RPC until the new voter set was committed was to give the operator an intuitive UX, since adding voters was done manually. The observer controllers that send AddRaftVoterRequest as a part of auto-join do not care whether the new voter set was committed, since they will retry the request on a timer until it completes successfully.
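The response rule described above can be sketched as a small decision function. This is only an illustration under stated assumptions: the class, enum, and method names are hypothetical and do not correspond to the actual KRaft implementation; only the flag semantics are taken from this KIP.

```java
// Hypothetical sketch, not the KRaft implementation: decides when the active
// controller may answer an AddRaftVoterRequest.
final class AddVoterAckSketch {

    /** Progress of the new voter set on the active controller (assumed states). */
    enum VoterSetState { NOT_APPENDED, APPENDED_LOCALLY, COMMITTED }

    /**
     * With ackWhenCommitted = true, respond only after the voter set commits.
     * With ackWhenCommitted = false, respond once it is in the leader's local log.
     */
    static boolean readyToRespond(boolean ackWhenCommitted, VoterSetState state) {
        if (ackWhenCommitted) {
            return state == VoterSetState.COMMITTED;
        }
        return state != VoterSetState.NOT_APPENDED;
    }
}
```

In this sketch an auto-joining controller would pass `false`, so its request completes once the voter set is appended locally and its connection is free for the next Fetch.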
Public Interfaces
Introduce a new version 1 to the AddRaftVoterRequest RPC:
--- a/clients/src/main/resources/common/message/AddRaftVoterRequest.json
+++ b/clients/src/main/resources/common/message/AddRaftVoterRequest.json
@@ -18,7 +18,8 @@
"type": "request",
"listeners": ["controller", "broker"],
"name": "AddRaftVoterRequest",
- "validVersions": "0",
+ // Version 1 adds the AckWhenCommitted field.
+ "validVersions": "0-1",
"flexibleVersions": "0+",
"fields": [
{ "name": "ClusterId", "type": "string", "versions": "0+", "nullableVersions": "0+",
@@ -37,6 +38,8 @@
"about": "The hostname." },
{ "name": "Port", "type": "uint16", "versions": "0+",
"about": "The port." }
- ]}
+ ]},
+ { "name": "AckWhenCommitted", "type": "bool", "versions": "1+", "default": "true",
+ "about": "When true, return a response after the new voter set is committed. Otherwise, return after the leader writes the changes locally." }
]
The default value of AckWhenCommitted will be true to preserve the intuitive UX of the RPC for the operator. When AckWhenCommitted is set to false, the AddRaftVoterRequest RPC does not guarantee that the new voter set has been committed upon receiving a response, and may require retries.
Compatibility, Deprecation, and Migration Plan
The main compatibility case to handle is a follower sending this request with version 1 to a leader which only supports version 0.
The follower should not send the AddRaftVoter request if the leader doesn't support version 1, because we do not want to cause the unavailability scenario described above. Therefore, the new field should not be ignorable, and the sending replica should handle the resulting unsupported version exception and error. We used a similar mechanism when designing and implementing KIP-996: Pre-Vote.
This means that both the leader and the follower need to support version 1 of AddRaftVoter for the auto-join feature to work. The advantage of this solution is that it doesn't cause any unavailability in the KRaft partition (the cluster metadata partition).
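The version gate described in this section can be sketched as below. The names are hypothetical and the sketch assumes the follower already knows the highest AddRaftVoter version the leader supports; how that value is obtained is outside the scope of this illustration.

```java
// Hypothetical sketch, not Kafka code: gates auto-join on the leader's
// supported AddRaftVoter version.
final class AddRaftVoterVersionSketch {

    /** First request version that carries the AckWhenCommitted field. */
    static final short ACK_WHEN_COMMITTED_MIN_VERSION = 1;

    /**
     * Returns true if an auto-joining controller may send AddRaftVoter.
     * leaderMaxSupportedVersion is assumed to be known to the caller.
     */
    static boolean canAutoJoin(short leaderMaxSupportedVersion) {
        return leaderMaxSupportedVersion >= ACK_WHEN_COMMITTED_MIN_VERSION;
    }
}
```

When `canAutoJoin` returns false, the follower in this sketch would skip the request and keep fetching as an observer rather than risk the unavailability scenario.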
Test Plan
Add a unit test to test for compatibility.
Rejected Alternatives
We explored trying to get KRaft to support multiple in-flight requests, but there were some significant issues with this approach:
- Kafka doesn't support multiple in-flight requests on the "server" side. What this means exactly is that the receiver of a request mutes the connection of the socket it reads from, making it unable to process another in-flight request on that connection until it sends a response back for the first.
- Adding support for this has much larger implications outside of KRaft and would require another KIP.
- Because of the point above, KRaft would need to establish two connections: one for AddRaftVoterRequest and one for essentially everything else. We found this solution to be overkill for auto-join, and felt that hardcoding two connections is bad design.
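The single in-flight constraint discussed in the list above can be modeled with a toy channel. This is an illustrative sketch only; the class is hypothetical and does not reflect Kafka's network layer beyond the one-request-at-a-time behavior described here.

```java
// Hypothetical model, not Kafka's network code: a connection that admits
// only one in-flight request at a time.
final class SingleInFlightChannelSketch {

    private String inFlight; // name of the pending request, or null when idle

    /** Sends the request if the channel is idle; returns false if one is already pending. */
    boolean trySend(String apiName) {
        if (inFlight != null) {
            return false;
        }
        inFlight = apiName;
        return true;
    }

    /** Marks the pending request as answered (or timed out), freeing the channel. */
    void completeInFlight() {
        inFlight = null;
    }
}
```

In this model, once `trySend("AddRaftVoter")` succeeds, `trySend("Fetch")` returns false until `completeInFlight()` is called, which mirrors why the joining controller cannot fetch while its AddRaftVoterRequest is pending.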
For compatibility:
To make this change backwards compatible, we could make this field ignorable. This would ensure compatibility between a controller that is adding itself with the new AddRaftVoterRequest via auto-join, and an old active controller that does not understand this field (i.e. does not have the auto-join feature). However, in that case the unavailability issue described above is still possible.