Master KIP: KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum (Accepted)
Status
Current state: Accepted
Discussion thread: here
Voting thread: here
JIRA: KAFKA-9837
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
...
For the KIP-500 bridge release (version 2.6.0 as of the time of this proposal), brokers will be allowed to read from ZooKeeper, but only the controller will be allowed to write. Since we will not be able to use ZooKeeper as an event bus between brokers, we must come up with an alternate strategy.
With this KIP, we propose to add a new RPC that allows a broker to directly communicate state changes of a replica to the controller. This will replace the ZooKeeper-based notification for log dir failures and could potentially replace the existing controlled shutdown RPC. Since this RPC is somewhat generic, it could also be used to mark a replica as "online" following some kind of log dir recovery procedure (out of scope for this proposal).
Public Interfaces
We will add a new RPC named AlterReplicaState, which requires CLUSTER_ACTION permissions.
AlterReplicaStateRequest => BrokerId BrokerEpoch NewState Reason [Topic [PartitionId LeaderEpoch]]
  BrokerId => Int32
  BrokerEpoch => Int64
  NewState => Int8
  Reason => String
  Topic => String
  PartitionId => Int32
  LeaderEpoch => Int32

AlterReplicaStateResponse => ErrorCode [Topic [PartitionId ErrorCode]]
  ErrorCode => Int32
  Topic => String
  PartitionId => Int32
...
- CLUSTER_AUTHORIZATION_FAILED
- STALE_BROKER_EPOCH
- NOT_CONTROLLER
- UNKNOWN_REPLICA_EVENT_TYPE (new)
Partition-level errors:
- INVALID_REQUEST: The update was rejected due to some internal inconsistency (e.g. invalid replicas specified in the ISR)
- UNKNOWN_TOPIC_OR_PARTITION: The topic no longer exists.
...
Upon encountering errors relating to the log dirs, the broker will now send an RPC to the controller indicating that one or more replicas need to be marked offline. Since these types of errors are likely to occur for a group of replicas at once, we will continue to use a background thread in ReplicaManager to allow these errors to accumulate before sending a message to the controller. The controller will synchronously perform basic validation of the request (permissions, check if topics exist, etc.) and asynchronously perform the necessary actions to process the replica state changes.
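The accumulate-then-send behavior described above could be sketched roughly as follows. The class and method names here are illustrative, not part of the KIP or of Kafka's actual ReplicaManager:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: failed replicas are queued as they are detected,
// and a background task drains the whole batch into a single request
// to the controller instead of sending one RPC per replica.
public class OfflineReplicaAccumulator {
    private final Set<String> pending = ConcurrentHashMap.newKeySet();

    // Called from the log-dir failure path for each affected replica.
    public void markOffline(String topicPartition) {
        pending.add(topicPartition);
    }

    // Called periodically by the background thread; returns the batch
    // that would go into one AlterReplicaState request.
    public Set<String> drainBatch() {
        Set<String> batch = new HashSet<>(pending);
        pending.removeAll(batch);
        return batch;
    }
}
```

Batching this way keeps a correlated failure (e.g. a whole disk going bad) from producing a storm of per-partition requests to the controller.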
Previously, the broker would only send its ID to the controller using the ZK watch mechanism. Since only the broker ID was sent (and not the broker epoch), the controller needed to probe the broker to learn its true state. Theoretically, a broker could have been repaired and restarted before the controller had a chance to react to the event. The controller would probe the brokers using LeaderAndIsr requests to learn what state the replicas were in. In this proposal, we now include both the broker ID and epoch in the request, so the controller can safely update its internal replica state based on the request data. If the broker had, in fact, been restarted since sending the AlterReplicaState request, the controller would be gated by the broker epoch and would not take any action.
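The epoch fencing described above can be illustrated with a minimal check (names are hypothetical; the real controller logic is more involved):

```java
// Hypothetical illustration of broker-epoch fencing: the controller only
// applies a state change if the epoch in the request matches its current
// view of that broker. A restarted broker registers with a newer epoch,
// so any request carrying an older epoch is stale and must be dropped.
public class BrokerEpochFence {
    public static boolean shouldApply(long requestBrokerEpoch, long currentBrokerEpoch) {
        return requestBrokerEpoch == currentBrokerEpoch;
    }
}
```

A mismatch here is what produces the STALE_BROKER_EPOCH top-level error listed earlier.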
RPC semantics
- The NewState field in the request will only support the value 0x1, which will represent "offline"
- The Reason field is an optional textual description of why the event is being sent
- If no Topic is given, it is implied that all topics on this broker are being indicated
- If a Topic and no partitions are given, it is implied that all partitions of this topic are being indicated
- LeaderEpoch is optional and is not validated for the "offline" event type
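The scoping rules above (no topics means every replica on the broker; a topic with no partitions means every partition of that topic) could be sketched like this. The helper and its signature are invented for illustration and are not part of the KIP:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical resolution of which replicas an AlterReplicaState request
// covers, given the broker's replica assignment (topic -> partition ids).
public class ReplicaScope {
    public static List<String> resolve(Map<String, List<Integer>> hosted,
                                       Map<String, List<Integer>> requested) {
        List<String> out = new ArrayList<>();
        // No topics given: all replicas on this broker are indicated.
        Map<String, List<Integer>> scope = requested.isEmpty() ? hosted : requested;
        for (Map.Entry<String, List<Integer>> e : scope.entrySet()) {
            // Topic given with no partitions: all partitions of that topic.
            List<Integer> parts = e.getValue().isEmpty()
                    ? hosted.getOrDefault(e.getKey(), List.of())
                    : e.getValue();
            for (int p : parts) out.add(e.getKey() + "-" + p);
        }
        Collections.sort(out);
        return out;
    }
}
```

For example, a broker losing an entire log dir could send a single topic-less request rather than enumerating every affected partition.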
Failure Modes
Like the ZK watch before it, the AlterReplicaState RPC from the broker to the controller is best-effort. We are not guaranteed to be able to send a message to the controller to indicate a replica state change. Also, since the processing of this RPC by the controller is asynchronous, we are not guaranteed that the subsequent handling of the state change will happen. The following failure scenarios are possible:
- If the broker cannot send the AlterReplicaState request to the controller and the offline replica is a follower, it will eventually be removed from the ISR by the leader.
- If the broker cannot send the AlterReplicaState request to the controller and the offline replica is a leader, it will remain the leader until the broker is restarted or the controller learns about the replica state through a LeaderAndIsr request (this is the current behavior).
- If the controller reads the AlterReplicaState request and encounters a fatal error before handling the subsequent ControllerEvent, a new controller will eventually be elected and the state of all replicas will become known to the new controller.
For the broker-side failures, we should implement retries for the AlterReplicaState messages.
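A minimal sketch of such a broker-side retry loop, assuming a fixed backoff (all names here are illustrative stand-ins, not proposed API):

```java
// Hypothetical retry loop for broker-side failures: keep resending the
// batch until the controller acknowledges it or attempts are exhausted.
public class AlterReplicaStateRetry {
    // Functional stand-in for the network send; true means the controller
    // accepted the request.
    public interface Sender { boolean send(); }

    // Returns the attempt number that succeeded, or -1 if all attempts failed.
    public static int sendWithRetries(Sender sender, int maxAttempts, long backoffMs) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (sender.send()) return attempt;
            try {
                Thread.sleep(backoffMs); // back off before the next attempt
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return -1;
            }
        }
        return -1;
    }
}
```

In practice the retry would also need to refresh controller metadata on NOT_CONTROLLER errors, since the controller may have moved between attempts.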
Compatibility, Deprecation, and Migration Plan
Since this change is between the brokers, we will use the inter-broker protocol version as a flag for using this new RPC or not. Once the IBP has been increased, the brokers will not write out ZK messages for log dir failures and the controller will not set a watch on the parent znode. They will instead use the new RPC detailed here.
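The IBP gate could look roughly like this; the version constant and method are stand-ins for illustration, not actual Kafka code:

```java
// Hypothetical version gate mirroring the migration rule above: below the
// required inter-broker protocol version the broker keeps writing the
// ZooKeeper notification znode, at or above it the new RPC is used.
public class LogDirFailureNotifier {
    // Illustrative stand-in for the minimum IBP that enables the new RPC.
    static final int ALTER_REPLICA_STATE_IBP = 26;

    public static String mechanismFor(int interBrokerProtocolVersion) {
        return interBrokerProtocolVersion >= ALTER_REPLICA_STATE_IBP
                ? "AlterReplicaState RPC"
                : "ZooKeeper notification znode";
    }
}
```

Gating on the IBP ensures a mixed-version cluster never has a broker sending a request its current controller cannot handle.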
It's also worth mentioning that this KIP does not propose to be the exclusive mechanism for marking replicas offline. The controller will still mark replicas offline if it receives certain errors from a LeaderAndIsr response.
Rejected Alternatives
AlterIsr RPC
...
Another rejected approach was to add an RPC which mirrored the JSON payload used by the ZK workflow currently implemented. This was rejected in favor of a more generic RPC that could be used for other purposes in the future. It was also rejected to prevent "leaking" the notion of a log dir to the public API and to the Controller.
Future Work
We have intentionally designed this KIP to accommodate future use cases involving the replica state. For example, we could potentially mark a replica as "online" following some kind of log dir recovery procedure.
This RPC is quite similar to the existing ControlledShutdown RPC. If we extend this RPC to define a null list of topics to mean all topics, we could subsume the ControlledShutdown RPC with the AlterReplicaState RPC. After we implement this new RPC, we can consider if we want to move forward and consolidate ControlledShutdown into AlterReplicaState as a future KIP.