Status
Current state: Under Discussion
Discussion thread: here
JIRA: here
Motivation
KRaft relies on two critical timeouts to maintain cluster health: fetch timeouts (a follower hasn't received a timely response from the leader) and election timeouts (a candidate/unattached/prospective node hasn't completed an election in time). Both trigger state transitions that are essential for leader election and fault recovery.
Currently, no metrics track how often these timeouts occur. There is no direct way to observe timeout frequency, which makes it difficult to detect network instability in KRaft.
Adding cumulative counters for these events gives operators immediate visibility into KRaft cluster stability.
Public Interfaces
A new tagged metric will be added to the kafka.raft:type=raft-metrics metric group:
| MBean | Type | Description |
|---|---|---|
| kafka.raft:type=raft-metrics,name=timeout-expirations,timer-name=fetch | Integer | This metric tracks the total number of fetch timeouts that have occurred on this node. A fetch timeout happens when a follower does not receive a timely fetch response from the leader. This metric is always registered regardless of the node's current state. |
| kafka.raft:type=raft-metrics,name=timeout-expirations,timer-name=election | Integer | This metric tracks the total number of election timeouts that have occurred on this node. An election timeout happens when a node in Unattached, Prospective, or Candidate state does not complete an election in time and transitions to the next state. This metric is always registered regardless of the node's current state. |
The metric uses a single name timeout-expirations with a timer-name tag to distinguish between timeout types.
The metric increments when its corresponding timer transitions from unexpired to expired during a poll of the raft client. It does not increment if the timer remains in an already-expired state across consecutive polls.
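The counting rule above can be illustrated with a minimal sketch. Every name in it (SketchTimer, FollowerPollSketch, the two-state enum) is a hypothetical stand-in rather than Kafka's actual code, and it assumes that a fetch timeout moves the node out of the follower state immediately, so the same expiration is never observed by a later poll.

```java
// All names here are hypothetical stand-ins, not Kafka's actual classes.
final class SketchTimer {
    private long deadlineMs;

    SketchTimer(long nowMs, long timeoutMs) {
        reset(nowMs, timeoutMs);
    }

    void reset(long nowMs, long timeoutMs) {
        deadlineMs = nowMs + timeoutMs;
    }

    boolean isExpired(long nowMs) {
        return nowMs >= deadlineMs;
    }
}

final class FollowerPollSketch {
    enum State { FOLLOWER, PROSPECTIVE }

    private final long fetchTimeoutMs;
    private final SketchTimer fetchTimer;
    private State state = State.FOLLOWER;
    // Assumption: only the polling thread writes this field; volatile lets
    // a metrics-reading thread see the latest value.
    private volatile int fetchTimeoutExpirations = 0;

    FollowerPollSketch(long nowMs, long fetchTimeoutMs) {
        this.fetchTimeoutMs = fetchTimeoutMs;
        this.fetchTimer = new SketchTimer(nowMs, fetchTimeoutMs);
    }

    // Called on every poll of the (hypothetical) raft client.
    void poll(long nowMs) {
        if (state == State.FOLLOWER && fetchTimer.isExpired(nowMs)) {
            fetchTimeoutExpirations++;
            // The expiration triggers an immediate state transition, so later
            // polls no longer check this timer and cannot count it twice.
            state = State.PROSPECTIVE;
        }
    }

    // A timely fetch response restarts the fetch timer while following.
    void onFetchResponse(long nowMs) {
        if (state == State.FOLLOWER) {
            fetchTimer.reset(nowMs, fetchTimeoutMs);
        }
    }

    int fetchTimeoutExpirations() {
        return fetchTimeoutExpirations;
    }

    public static void main(String[] args) {
        FollowerPollSketch node = new FollowerPollSketch(0L, 2000L);
        node.poll(1000L); // timer not yet expired
        node.poll(2000L); // unexpired -> expired: counted once
        node.poll(5000L); // node already left the follower state: not counted
        System.out.println(node.fetchTimeoutExpirations()); // prints 1
    }
}
```

In this sketch the counter increments once, on the poll where the timer goes from unexpired to expired, and stays unchanged on subsequent polls.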
Proposed Changes
Add two cumulative counters to KafkaRaftMetrics, each exposed as a Gauge<Integer>:
- timeout-expirations{timer-name=fetch}: tracks the number of fetch timeout expirations.
- timeout-expirations{timer-name=election}: tracks the number of election timeout expirations.
Both metrics are registered at startup and remain registered for the lifetime of the raft client.
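As a rough illustration of this shape, the sketch below exposes two volatile int counters through gauges that are registered once in the constructor and never removed. SketchGauge, TimeoutMetricsSketch, and the map-based registry are hypothetical simplifications introduced for this example; they are not the KafkaRaftMetrics implementation or Kafka's metrics API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for a gauge interface.
interface SketchGauge<T> {
    T value();
}

// Hypothetical holder for the two counters; not the real KafkaRaftMetrics.
final class TimeoutMetricsSketch {
    // Assumption: a single raft thread writes, metric reporters only read.
    private volatile int fetchTimeoutExpirations = 0;
    private volatile int electionTimeoutExpirations = 0;

    // Stand-in for a metrics registry, keyed by metric name plus tag.
    private final Map<String, SketchGauge<Integer>> registry = new LinkedHashMap<>();

    TimeoutMetricsSketch() {
        // Both gauges are registered at construction and never unregistered.
        registry.put("timeout-expirations{timer-name=fetch}",
            () -> fetchTimeoutExpirations);
        registry.put("timeout-expirations{timer-name=election}",
            () -> electionTimeoutExpirations);
    }

    void recordFetchTimeoutExpiration() {
        fetchTimeoutExpirations++;
    }

    void recordElectionTimeoutExpiration() {
        electionTimeoutExpirations++;
    }

    // Reads the current value the way a reporter would, through the gauge.
    int read(String key) {
        return registry.get(key).value();
    }

    public static void main(String[] args) {
        TimeoutMetricsSketch metrics = new TimeoutMetricsSketch();
        metrics.recordFetchTimeoutExpiration();
        metrics.recordElectionTimeoutExpiration();
        metrics.recordElectionTimeoutExpiration();
        System.out.println(metrics.read("timeout-expirations{timer-name=fetch}"));    // prints 1
        System.out.println(metrics.read("timeout-expirations{timer-name=election}")); // prints 2
    }
}
```

Because the gauges read the fields lazily, a value of 0 is visible from startup, before any timeout has occurred.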
Compatibility, Deprecation, and Migration Plan
- New metrics only, no changes to existing behavior or interfaces.
- No deprecations. No migration needed.
Test Plan
Unit tests to verify that fetch timeout and election timeout scenarios correctly increment the respective metric counters.
Rejected Alternatives
1. Sensor with CumulativeCount instead of Gauge
We only need a simple cumulative counter. A Gauge backed by a volatile int is more straightforward than a Sensor with CumulativeCount, which is designed for more complex aggregation scenarios.
2. State-based metric registration/unregistration
We considered only registering timeout metrics during relevant states (e.g., fetch timeout only in FollowerState). The problem is that timeouts happen right at the end of a state, which means the counter increments and the node immediately transitions out, so operators would miss the updated value. Always registering is simpler and more useful.
3. Having the timer itself detect the transition from unexpired to expired
We considered having the timer itself detect the transition and increment the counter automatically. This is not trivial to implement and not necessary. The simpler approach of incrementing in the poll methods when isExpired() returns true produces the same result since expirations always trigger immediate state transitions. This KIP assumes KAFKA-20514 is resolved first.