Status
Current state: Under Discussion
Discussion thread: here
JIRA: here
Motivation
KRaft relies on two critical timeouts to maintain cluster health: fetch timeouts (a follower hasn't received a timely response from the leader) and election timeouts (a candidate/unattached/prospective node hasn't completed an election in time). Both trigger state transitions that are essential for leader election and fault recovery.
Currently, no metrics track how often these timeouts occur. Operators therefore have no direct way to observe timeout frequency, which makes network instability in KRaft difficult to detect.
Adding cumulative counters for these events gives immediate visibility into KRaft cluster stability.
Public Interfaces
A new tagged metric will be added to the kafka.raft:type=raft-metrics metric group:
| Metric Name | Tag | Type | Description |
|---|---|---|---|
| timeout-expiration-count | timer-name=fetch | Integer | The total number of fetch timeouts that have occurred on this node. A fetch timeout happens when a follower does not receive a timely fetch response from the leader. Registered only when the node is in Follower state. |
| timeout-expiration-count | timer-name=election | Integer | The total number of election timeouts that have occurred on this node. An election timeout happens when a node in Unattached, Prospective, or Candidate state does not complete an election in time and transitions to the next state (the Resigned state is not included). Registered only when the node is in Unattached, Prospective, or Candidate state. |
The metric uses a single name timeout-expiration-count with a timer-name tag to distinguish between timeout types.
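As a rough illustration of how the tagged metric might be addressed over JMX, the sketch below builds an MBean name from the group and the timer-name tag. It assumes the usual Kafka JMX naming pattern of `domain:type=group,tag=value` with the metric name as the attribute; the class `TimeoutMetricNames` and its method are hypothetical helpers, not part of Kafka.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper: builds the JMX name under which the proposed metric
// would be expected to appear, assuming the "domain:type=group,tag=value" pattern.
final class TimeoutMetricNames {
    // Attribute name on the MBean (the metric name from this proposal).
    static final String ATTRIBUTE = "timeout-expiration-count";

    static ObjectName objectNameFor(String timerName) {
        try {
            return new ObjectName("kafka.raft:type=raft-metrics,timer-name=" + timerName);
        } catch (MalformedObjectNameException e) {
            // Only reachable if timerName contains characters illegal in an ObjectName.
            throw new IllegalArgumentException("invalid timer name: " + timerName, e);
        }
    }
}
```

A monitoring client would then read the `timeout-expiration-count` attribute from `TimeoutMetricNames.objectNameFor("fetch")` or `objectNameFor("election")`, depending on which timer it is interested in.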
State-Based Metric Registration
Timeout metrics are registered and unregistered based on EpochState transitions:
| EpochState | Metrics Registered |
|---|---|
| Follower | timeout-expiration-count{timer-name=fetch} |
| Unattached | timeout-expiration-count{timer-name=election} |
| Prospective | timeout-expiration-count{timer-name=election} |
| Candidate | timeout-expiration-count{timer-name=election} |
When a metric is unregistered during a state transition, the internal counter is not reset. The metric resumes from its previous value when re-registered in a future state transition, capturing the cumulative timeout history of the node.
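The registration rules above can be sketched in plain Java. This is a minimal model, not the actual KafkaRaftMetrics code: `GaugeRegistry`, `NodeState`, and `TimeoutMetricsSketch` are hypothetical stand-ins, and increments are assumed to come from a single thread. The point it illustrates is that the counters live outside the registry, so unregistering a gauge does not reset its value.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical stand-in for a metrics registry (not Kafka's Metrics class).
final class GaugeRegistry {
    private final Map<String, Supplier<Integer>> gauges = new ConcurrentHashMap<>();

    void register(String name, Supplier<Integer> gauge) { gauges.put(name, gauge); }
    void unregister(String name) { gauges.remove(name); }
    boolean isRegistered(String name) { return gauges.containsKey(name); }

    // Returns null when the gauge is not currently registered.
    Integer read(String name) {
        Supplier<Integer> gauge = gauges.get(name);
        return gauge == null ? null : gauge.get();
    }
}

// Simplified, hypothetical view of the node states relevant to this proposal.
enum NodeState { FOLLOWER, UNATTACHED, PROSPECTIVE, CANDIDATE, OTHER }

final class TimeoutMetricsSketch {
    static final String FETCH = "timeout-expiration-count{timer-name=fetch}";
    static final String ELECTION = "timeout-expiration-count{timer-name=election}";

    private final GaugeRegistry registry;
    // Counters outlive gauge registration, so values survive state changes.
    private volatile int fetchTimeouts;
    private volatile int electionTimeouts;

    TimeoutMetricsSketch(GaugeRegistry registry) { this.registry = registry; }

    // Re-evaluates which gauge is visible after a state transition.
    void onStateChange(NodeState state) {
        registry.unregister(FETCH);
        registry.unregister(ELECTION);
        switch (state) {
            case FOLLOWER:
                registry.register(FETCH, () -> fetchTimeouts);
                break;
            case UNATTACHED:
            case PROSPECTIVE:
            case CANDIDATE:
                registry.register(ELECTION, () -> electionTimeouts);
                break;
            default:
                break; // no timeout gauge in other states
        }
    }

    // Assumed to be called from a single thread, so ++ on a volatile is safe here.
    void recordFetchTimeout() { fetchTimeouts++; }
    void recordElectionTimeout() { electionTimeouts++; }
}
```

In this model, a node that records one fetch timeout as a follower, moves to Candidate, and later becomes a follower again will see the fetch gauge reappear with the value 1 rather than 0.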
Proposed Changes
Add two volatile int counters to KafkaRaftMetrics, each exposed as a Gauge<Integer>:
- fetch-timeout-expiration: incremented in KafkaRaftClient.pollFollowerAsVoter() when FollowerState.hasFetchTimeoutExpired() returns true.
- election-timeout-expiration: incremented in KafkaRaftClient at places where hasElectionTimeoutExpired() triggers a state transition.
Two new methods will be added to KafkaRaftMetrics:
- updateFetchTimeoutExpiration(): increments the fetch timeout counter.
- updateElectionTimeoutExpiration(): increments the election timeout counter.
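The call-site change could look roughly like the following. This is a self-contained sketch, not the real KafkaRaftClient: `RaftMetricsSketch`, `FollowerTimerSketch`, and `PollSketch` are hypothetical classes, and only the method names `updateFetchTimeoutExpiration()`, `updateElectionTimeoutExpiration()`, `hasFetchTimeoutExpired()`, and `pollFollowerAsVoter()` are taken from this proposal. Their signatures here are assumptions.

```java
// Hypothetical model of the two counters and update methods proposed above.
final class RaftMetricsSketch {
    private volatile int fetchTimeoutExpirations;
    private volatile int electionTimeoutExpirations;

    // Assumed to be called only from the single polling thread.
    void updateFetchTimeoutExpiration() { fetchTimeoutExpirations++; }
    void updateElectionTimeoutExpiration() { electionTimeoutExpirations++; }

    int fetchTimeoutExpirations() { return fetchTimeoutExpirations; }
    int electionTimeoutExpirations() { return electionTimeoutExpirations; }
}

// Hypothetical follower-side timer; the deadline moves forward on each reset.
final class FollowerTimerSketch {
    private final long fetchTimeoutMs;
    private long deadlineMs;

    FollowerTimerSketch(long nowMs, long fetchTimeoutMs) {
        this.fetchTimeoutMs = fetchTimeoutMs;
        this.deadlineMs = nowMs + fetchTimeoutMs;
    }

    boolean hasFetchTimeoutExpired(long nowMs) { return nowMs >= deadlineMs; }

    // Would be called when a fetch response arrives from the leader.
    void resetFetchTimeout(long nowMs) { deadlineMs = nowMs + fetchTimeoutMs; }
}

final class PollSketch {
    // Returns true when the timeout fired and the caller should change state.
    static boolean pollFollowerAsVoter(FollowerTimerSketch follower,
                                       RaftMetricsSketch metrics,
                                       long nowMs) {
        if (follower.hasFetchTimeoutExpired(nowMs)) {
            // Count the expiration before the state transition happens.
            metrics.updateFetchTimeoutExpiration();
            return true;
        }
        return false;
    }
}
```

The election counter would follow the same pattern at each place where an expired election timer causes a transition: increment first, then transition, so that every counted expiration corresponds to exactly one transition.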
Compatibility, Deprecation, and Migration Plan
- New metrics only, no changes to existing behavior or interfaces.
- No deprecations. No migration needed.
Test Plan
Unit tests to verify that fetch timeout and election timeout scenarios correctly increment the respective metric counters.