Status

Current state: Under Discussion

Discussion thread: here

JIRA: here

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

The replica fetcher threads handle multiple partitions. In case a partition fails, the replica fetcher thread associated with that partition terminates. The partitions that have caught up and are running well are also left untracked with termination of the thread which leads to under-replicated partitions. A better approach would be, whenever a partition crashes, the concerned thread should stop tracking the crashed partition and continue handling rest of the partitions.

Public Interfaces

New metrics:

FailedPartitionsCount - Count of partitions that have failed. Instead of separate metrics, clientId is used as a tag to distinguish between Replica and ReplicaAlterLogDir fetchers. Keeping it consistent with some other metrics like MaxLag.

Proposed Changes

In case a partition fails, the replica fetcher thread would stop tracking the failed partition. Instead of throwing an exception which ends up terminating the thread, an error message will be logged and the partition will be added to the failedPartitions set. The thread would continue monitoring rest of the partitions which are lost in the current scenario.

Until the next leader epoch, the partition would remain in the failedPartitions set. At the leader epoch, the failed partitions would be removed from fetcherLagStats and partitionStates, and would be marked as un-failed by removing from the set for failed partitions. Hereafter, the controller can choose the partition as leader or follower and would follow the usual behavior.

Since the two replica fetchers (ReplicaFetcherThread and ReplicaAlterLogDirsThread) are quite similar in behavior and are extended from the same class, probably should not make one deviate much from the other.

Some other potential problems that can be addressed -

Handling exceptions raised during truncating

Compatibility, Deprecation, and Migration Plan

The metric FailedPartitionCount would keep track of the failed partitions. It's a newly added metric which would handle partition failure in a better way. It would avoid losing several healthy partitions in case partition failure occurs.

Rejected Alternatives

Retries - The thread can make attempts to connect to the failed partition which would mostly hit the same problem.
Shutting down the broker - If more than 50% partitions on a broker have failed, the broker can be shut down

Space shortcuts

Child pages

Status

Motivation

Public Interfaces

Proposed Changes

Compatibility, Deprecation, and Migration Plan

Rejected Alternatives

Space shortcuts

Child pages

KIP-461 - Improve Replica Fetcher behavior at handling partition failure

Status

Motivation

Public Interfaces

Proposed Changes

Compatibility, Deprecation, and Migration Plan

Rejected Alternatives