Status
Current state: "Accepted"
Discussion thread: https://lists.apache.org/thread/sx5z18qrw5j2bmfbzgobxxycnrxngmdc
Vote Thread: https://lists.apache.org/thread/yglpr9nnvrcxk0knf5xvzhqrdfdzg3f7
JIRA: KAFKA-19541
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
There is a liveness problem in the FetchSnapshot and Fetch requests. If a new controller is added to a cluster over a reliable but "slow" network connection to the active controller, the new controller may never be able to become part of the quorum. If the full amount of data in a FetchSnapshot response cannot be transmitted before the fetch timeout (controller.quorum.fetch.timeout.ms), the KRaft protocol dictates that the request must be cancelled and the controller must transition to the Candidate/Prospective state (see KIP-996: Pre-Vote). The new controller, unable to successfully complete a FetchSnapshot request, continuously cycles between the Unattached, Follower and Prospective states, which violates a liveness property: it will never be able to become part of the quorum without manual intervention (increasing the fetch timeout). Furthermore, by becoming a candidate and increasing the epoch, the new controller triggers frequent leadership elections which it can never win, causing further disruption in the cluster. This problem is the cause of KAFKA-19541. Currently, there is an 8MiB default byte size limit for FetchSnapshot (and Fetch) which governs the amount of data sent for each Fetch and FetchSnapshot request.
A controller becoming "stuck" is theoretically also possible with Fetch, although it is less likely than with FetchSnapshot since Fetch is incremental and typically transmits less data per request.
Public Interfaces
Senders of FetchSnapshot requests will retrieve at most the configured number of bytes per request.
controller.quorum.fetch.snapshot.max.bytes
  Description: Maximum amount of data to retrieve for each FetchSnapshot request to the controller.
  Type: Int
  Default: 1048576 (1 mebibyte)
  Valid values: [1,...]
Fetch is similar but with a slight variation. FetchSnapshot retrieves unaligned memory records, which means it can fetch an exact number of bytes. However, for correctness, Fetch will always retrieve at least one record, even if doing so exceeds the configured limit.
controller.quorum.fetch.max.bytes
  Description: Maximum amount of data to retrieve for each Fetch request. Always returns at least one record if a new one is available.
  Type: Int
  Default: 1048576 (1 mebibyte)
  Valid values: [1,...]
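For reference, a sketch of how the new settings might be overridden in a controller's server.properties, alongside the other controller.quorum.* settings (the values here are illustrative, not recommendations):

```properties
# Raise the per-request FetchSnapshot limit back to 8MiB for a fast network
controller.quorum.fetch.snapshot.max.bytes=8388608
# Keep the Fetch limit at the proposed 1MiB default
controller.quorum.fetch.max.bytes=1048576
```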
Proposed Changes
Since both Fetch and FetchSnapshot already support retrieving records using multiple requests (with each request fetching a maximum of 8MiB), the behavioural change in this KIP is primarily reducing the current 8MiB limit to a 1MiB default, specified by the new configurations. The maximum size of a Fetch is currently governed by an internal configuration called internal.max.size.bytes; we will remove this configuration and replace it with the public controller.quorum.fetch.max.bytes. The MaxBytes parameter of FetchSnapshot is currently hard-coded in the KafkaRaftClient; the hard-coded value will instead be set by configuration. The primary implementation changes involve plumbing the configurations into the correct places and adding adequate automated testing.
The default value of 1MiB will likely make fetches slightly slower for deployments with high-speed networks. It effectively limits the amount of data which can be in flight at any given time; as a result, controllers will not be able to make the most efficient use of the underlying TCP connections. However, the 1MiB default is low enough that a fairly low-performance (e.g. between data centres) or degraded network connection will allow requests to complete normally.
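As a rough back-of-the-envelope illustration (assuming the current default controller.quorum.fetch.timeout.ms of 2000ms, and ignoring latency and protocol overhead), the sustained throughput needed for a single response to arrive within the fetch timeout can be estimated as:

```python
# Minimum sustained throughput needed so a single FetchSnapshot response
# completes within the fetch timeout. The 2000 ms timeout is the current
# default of controller.quorum.fetch.timeout.ms; the byte sizes are the
# old (8MiB) and proposed (1MiB) per-request limits.
FETCH_TIMEOUT_MS = 2000

def min_throughput_kib_per_s(max_bytes: int, timeout_ms: int = FETCH_TIMEOUT_MS) -> float:
    """Bytes that must arrive per second, expressed in KiB/s."""
    return max_bytes / (timeout_ms / 1000) / 1024

print(f"8MiB limit needs ~{min_throughput_kib_per_s(8 * 1024 * 1024):.0f} KiB/s")  # ~4096 KiB/s
print(f"1MiB limit needs ~{min_throughput_kib_per_s(1 * 1024 * 1024):.0f} KiB/s")  # ~512 KiB/s
```

This is why the 8MiB limit effectively excludes connections slower than a few MiB/s, while the 1MiB default tolerates connections roughly an order of magnitude slower.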
Compatibility, Deprecation, and Migration Plan
Changing the defaults is unlikely to cause significant performance problems for existing users. The default sizes of Fetch and FetchSnapshot are reduced 8x, but this does not imply an 8x reduction in overall network throughput. FetchSnapshot happens rarely, usually when a new controller or broker joins the cluster, and is unlikely to cause availability problems for the quorum; there may be a delay in the new controller becoming a Voter, but this has no effect on availability. Metadata snapshots are typically on the order of < 10MiB, so the delays are unlikely to be significant over the larger life of the quorum. Users with larger metadata snapshots can, of course, increase the default value.
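To put the snapshot delay in perspective, the number of FetchSnapshot round trips for a 10MiB snapshot under each limit can be sketched as follows (a simplification that ignores network latency and concurrent snapshot growth):

```python
import math

def fetch_snapshot_requests(snapshot_bytes: int, max_bytes: int) -> int:
    """Number of FetchSnapshot round trips needed to transfer a snapshot
    when each response carries at most max_bytes of unaligned snapshot data."""
    return math.ceil(snapshot_bytes / max_bytes)

TEN_MIB = 10 * 1024 * 1024
print(fetch_snapshot_requests(TEN_MIB, 1 * 1024 * 1024))  # 10 requests at the proposed 1MiB default
print(fetch_snapshot_requests(TEN_MIB, 8 * 1024 * 1024))  # 2 requests at the current 8MiB limit
```

Ten small round trips instead of two large ones adds some latency but, crucially, each request can now complete within the fetch timeout on a slow link.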
Fetch requests have a caveat: they must send aligned records rather than the unaligned bytes sent by FetchSnapshot. In other words, "whole" records must be transmitted rather than raw data. In the current implementation, Fetch requests are always responded to with at least one record. As such, even with controller.quorum.fetch.max.bytes=1, it is not possible for the RaftClient to get "stuck" receiving zero records in response.
The change is backwards compatible for both Fetch and FetchSnapshot. Two RaftClients having different values for the configurations will not affect correctness. For Fetch, the RaftClient will handle whatever number of records is returned regardless of what was requested. For FetchSnapshot, requests will continue to be sent until the entire snapshot is transferred.
Test Plan
We can add unit tests which verify that the configured values are used in requests.
Rejected Alternatives
- Add no new configurations. Currently KAFKA-19541 can be mitigated by increasing the fetch timeout. In practice it is quite difficult to identify occurrences of KAFKA-19541, and increasing the fetch timeout can lead to larger availability problems than the proposed new configurations.
- Add only a configuration for FetchSnapshot. Fetch is less likely to encounter KAFKA-19541, but there is also a benefit to explicitly specifying these limits as part of the KRaft protocol rather than leaving them as implicit implementation details.
- Add configurations but also keep the default values at 8MiB. The proposed 1MiB limit means that network connections as slow as roughly 500KiB/s (enough to transfer 1MiB within the default 2 second fetch timeout) would be supported.