Status

Current state: Draft

Discussion thread: here

JIRA: here

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

The controller quorum fetch loop relies on a timing invariant for correctness and liveness: since KAFKA-16926 (Optimize BeginQuorumEpoch heartbeat), controller.quorum.fetch.timeout.ms must be significantly larger than the Raft maximum fetch wait time (currently hardcoded to 500 ms).

In practice, the algorithm makes progress because:

raft_max_fetch_wait_time (500 ms) * 2  <=  controller.quorum.fetch.timeout.ms 

If operators set controller.quorum.fetch.timeout.ms to an unusually small value (less than raft_max_fetch_wait_time * 2), this invariant is violated, risking premature timeouts, spurious retries, and degraded stability. To preserve the invariant across configurations and upgrades, we propose enforcing a lower bound of 1000 ms on controller.quorum.fetch.timeout.ms.
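The invariant can be written as a small check. The following is a minimal sketch and not Kafka code: the class FetchTimeoutInvariant and its methods are hypothetical names introduced for illustration, and the 500 ms constant simply mirrors the hardcoded Raft maximum fetch wait described above.

```java
// Hypothetical helper, not part of Kafka: illustrates the timing invariant only.
final class FetchTimeoutInvariant {
    // Assumed to mirror the hardcoded Raft maximum fetch wait (500 ms) described in this KIP.
    static final int RAFT_MAX_FETCH_WAIT_MS = 500;

    // Smallest fetch timeout that satisfies: maxFetchWait * 2 <= fetchTimeout.
    static int minimumFetchTimeoutMs(int maxFetchWaitMs) {
        return Math.multiplyExact(2, maxFetchWaitMs);
    }

    // True when the configured fetch timeout respects the invariant.
    static boolean holds(int maxFetchWaitMs, int fetchTimeoutMs) {
        return fetchTimeoutMs >= minimumFetchTimeoutMs(maxFetchWaitMs);
    }

    public static void main(String[] args) {
        System.out.println(minimumFetchTimeoutMs(RAFT_MAX_FETCH_WAIT_MS)); // 1000
        System.out.println(holds(RAFT_MAX_FETCH_WAIT_MS, 2000));           // true (the default)
        System.out.println(holds(RAFT_MAX_FETCH_WAIT_MS, 800));            // false
    }
}
```

With the fetch wait fixed at 500 ms, the smallest timeout that satisfies the check is 1000 ms, which is where the proposed lower bound comes from.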

For example:

Assume a network latency of 100 ms.

T=0ms:   The leader sends a BeginQuorumEpoch request to the follower.

T=100ms: The follower starts fetching: it sends a fetch request and starts the 500 ms *fetchTimer*.

T=600ms: The leader has not received a fetch request within RAFT_MAX_FETCH_WAIT_MS (e.g. 500 ms).
         Two things happen:
         1) The leader sends a BeginQuorumEpoch request to the follower again.
         2) The follower sends a fetch request to the leader and starts the *fetchTimer*.

T=700ms: four cases
         1) The follower receives the BeginQuorumEpoch request and responds to it (100 ms network latency) -> the leader still works.
         2) The follower does not receive the BeginQuorumEpoch request -> the leader may not be working.
         3) The leader receives the fetch request and responds to it (100 ms network latency) -> the follower still works.
         4) The leader does not receive the fetch request -> the follower may not be working.

T=800ms: The follower's fetch timer expires.

The follower then transitions to the *Prospective state* and triggers an election (wasting resources even though the leader is working fine).

In the flow above, the leader and the follower each get two chances to check the other's state: one through BeginQuorumEpoch and one through fetch, but the two start at different points in time.

Public Interfaces

Configuration change (validation only):

  • controller.quorum.fetch.timeout.ms

    • Type: int (ms)

    • Lower bound: 1000 ms (enforced)

    • Default: unchanged (2000ms)

    • Behavior: Values < 1000 ms are invalid and will be rejected with a clear error message.

Proposed Changes

  • Config validation:

    • Update the config definition for controller.quorum.fetch.timeout.ms to use atLeast(1000).

    • Apply the same validation path for both static startup configs and any dynamic/config-provider paths that may set this value.

  • Error messaging:

    • On invalid values, throw a ConfigException with actionable guidance, e.g.:

      “controller.quorum.fetch.timeout.ms must be ≥ 1000 ms to ensure the controller fetch loop makes progress without violating timing assumptions.”

  • Documentation:

    • Update the config reference to document the lower bound and the rationale (ties to fetch-wait timing and controller liveness).
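The validation and error messaging described in this section could look roughly like the sketch below. It is a dependency-free stand-in, not the actual Kafka change: the class FetchTimeoutValidator and its method are hypothetical, and it throws IllegalArgumentException instead of Kafka's ConfigException so that it compiles without Kafka on the classpath. In Kafka itself the bound would be expressed through the atLeast(1000) validator on the config definition, as proposed above.

```java
// Hypothetical stand-in for the proposed validation; not Kafka's actual implementation.
final class FetchTimeoutValidator {
    static final String CONFIG_NAME = "controller.quorum.fetch.timeout.ms";
    // Proposed lower bound from this KIP.
    static final int MIN_FETCH_TIMEOUT_MS = 1000;

    // Returns the value unchanged when valid; otherwise rejects it with an actionable message.
    // IllegalArgumentException is used here only to avoid a Kafka dependency.
    static int validate(int fetchTimeoutMs) {
        if (fetchTimeoutMs < MIN_FETCH_TIMEOUT_MS) {
            throw new IllegalArgumentException(
                CONFIG_NAME + " must be >= " + MIN_FETCH_TIMEOUT_MS
                    + " ms to ensure the controller fetch loop makes progress, but was "
                    + fetchTimeoutMs + " ms");
        }
        return fetchTimeoutMs;
    }

    public static void main(String[] args) {
        // The default value passes through unchanged.
        System.out.println(validate(2000));
        try {
            validate(500);
        } catch (IllegalArgumentException e) {
            // A sub-1000 ms value is rejected with the message built above.
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Including the offending value in the message is a design choice of this sketch; it lets an operator see both the configured value and the required minimum in a single log line.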

Compatibility, Deprecation, and Migration Plan

  • Backward compatibility:

    • Deployments already using values ≥ 1000 ms are unaffected.

    • Deployments with < 1000 ms will fail fast at startup with a clear error.

  • No deprecations; this is a validation tightening to ensure correctness.

  • Migration:

    • Operators must raise any sub-1000 ms values to ≥ 1000 ms before/with the upgrade.

Test Plan

  • Unit tests for config validation.

Rejected Alternatives

  1. Leave the current validation unchanged:
    • A misconfigured value could leave the system unavailable.

  2. Make raft max fetch wait configurable instead:

    • Out of scope here. Even if exposed, we would still need a minimum for the timeout to maintain a robust ratio.
