Status

Current state: Accepted

Discussion thread: https://lists.apache.org/thread/ko478l71jf9hqhhg328tcdr46nj2wcz9

Vote thread: https://lists.apache.org/thread/1bhgq12dpc4s20pfos88t1wgz3l4y7lk

JIRA: KAFKA-19400 - Getting issue details... STATUS

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

This KIP provides a simple solution to an availability issue exposed by the auto-join feature proposed in KIP-853 and how AddRaftVoterRequest  RPC is currently handled. The problem is because the active controller does not send a response to complete the AddRaftVoterRequest until after the new voter set is committed, and that KRaft (and Kafka in general) only support one in-flight request to a node. Consider the following scenario:

Some controller A that is automatically joining by sending the AddRaftVoterRequest  RPC is the same controller whose Fetch is needed to commit the new voter set. A clear example of this is when bootstrapping with --standalone and having controllers auto-join, as the first controller to auto-join will increase the voter set size from 1 to 2. The active controller needs controller A to complete a Fetch  RPC to complete the AddRaftVoterRequest  RPC, but controller A cannot send a FetchRequest  until its AddRaftVoterRequest  returns or times out. The general case being described here is going from a voter set of size X to size X + 1, where a minority of X + 1 nodes from the new voter set are unavailable.

The reason this scenario causes unavailability is as follows: the current in-flight request for controller A is the AddRaftVoterRequest  RPC, which cannot complete until after controller A first replicates the new voter set, and sends another fetch to the leader. This state will expire the leader's checkQuorumTimer and cause it to resign, since the majority of nodes (which includes controller A) will not be able to fetch in time. Thus, the current implementation of AddRaftVoterRequest  will cause an unnecessary leadership failover and election when running the auto-join feature.

It is important to note that this scenario does not apply when handling the RemoveRaftVoterRequest  RPC as part of auto-join, because we need a majority of the new voter set to commit the new VotersRecord , which does not include the voter being removed.

Additionally, this unavailability scenario does not apply when adding controllers to the voter set manually (e.g. via AdminClient command). This is because the controller node being added (i.e. controller A from the example above) is not the one sending AddRaftVoterRequest  RPC in this situation, so the controller is free to fetch and commit the new voter set.

Proposed Changes

The proposed change is to update the AddRaftVoterRequest  RPC with a boolean flag that tells the active controller when to send a response for the RPC: after the new voter set is committed, or after the new voter set is written locally.

When the request is coming from another controller as a part of auto-join, the active controller will send a response after it appends the new voter set to its local log, rather than after that voter set is committed. This allows the "joining" replica to actually fetch the new voter set in time.

This change is sufficient because the main motivation behind not completing the RPC until the new voter set was committed was for an intuitive UX for the operator, since adding voters was done manually. The observer controllers that send AddRaftVoterRequest  as a part of auto-join do not care if the new voter set was committed, since they will retry the request on a timer until it completes successfully.

Public Interfaces

Introduce a new version 1 to the AddRaftVoterRequest  RPC:

--- a/clients/src/main/resources/common/message/AddRaftVoterRequest.json
+++ b/clients/src/main/resources/common/message/AddRaftVoterRequest.json
@@ -18,7 +18,8 @@
   "type": "request",
   "listeners": ["controller", "broker"],
   "name": "AddRaftVoterRequest",
-  "validVersions": "0",
+  // Version 1 adds the AckWhenCommitted field.
+  "validVersions": "0-1",
   "flexibleVersions": "0+",
   "fields": [
     { "name": "ClusterId", "type": "string", "versions": "0+", "nullableVersions": "0+",
@@ -37,6 +38,8 @@
         "about": "The hostname." },
       { "name": "Port", "type": "uint16", "versions": "0+",
         "about": "The port." }
-    ]}
+    ]},
+    { "name": "AckWhenCommitted", "type": "bool", "versions": "1+", "default": "true",
+      "about": "When true, return a response after the new voter set is committed. Otherwise, return after the leader writes the changes locally." }
 ]

The default value of AckWhenCommitted  will be true to preserve the intuitive UX of the RPC for the operator. When AckWhenCommitted  is set to false , the AddRaftVoterRequest  RPC does not guarantee that the new voter set has been committed upon receiving a response, and may require retries.

Compatibility, Deprecation, and Migration Plan

The main compatibility case to handle is a follower sending this request with version 1 to a leader which only supports version 0.

The follower should not send the AddRaftVoter request if the leader doesn't support the version, because we do not want to cause the unavailability scenario described above. Therefore, the new field should not be ignored and the sending replica should handle the unsupported version exception and error. We used a similar mechanism when designing and implementing the KIP-996: Pre-Vote.

This means that both the leader and the follower need to support version 1 of AddRaftVoter for the auto-join feature to work. The advantage of this solution is that it doesn't cause any unavailability in the KRaft partition (the cluster metadata partition).

Test Plan

Add a unit test to test for compatibility.

Rejected Alternatives

We explored trying to get KRaft to support multiple in-flight requests, but there were some significant issues with this appraoch:

  • Kafka doesn't support multiple in-flight requests on the "server" side. What this means exactly is that the receiver of a request mutes the connection of the socket it reads from, making it unable to process another in-flight request on that connection until it sends a response back for the first.
    • Adding support for this has much larger implications outside of KRaft and would require another KIP.
  • Because of the point above, this means KRaft would need to establish 2 connections. One for AddRaftVoterRequest and one for essentially everything else. We found this solution to be overkill for our auto-join, and felt that hardcoding two connections is bad design.

For compatibility:
To make this change backwards compatible, we can make this field ignorable. This ensures compatibility between a controller that is adding itself with the new AddRaftVoterRequest via auto-join, and an old active controller that does not understand this field (i.e. does not have the auto-join feature). In that case, the unavailability issue described above is possible.

  • No labels