You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 4 Next »

Status

Current state: Under Discussion

Discussion thread: here

JIRA: KAFKA-6359

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

The current kafka-reassign-partitons.sh tool imposes the limitation that only a single batch of partition reassignments can be in-flight, and it is not possible to cancel a reassignment that is in-flight. This has a number of consequences:

  1. If users need to do a large-scale reassignment they end up having to do the reassignment in smaller batches, so they can abort the overall reassignment sooner, if operationally necessary
  2. Each batch of reassignments takes as long as the slowest partition; this slowest partition prevents other reassignments from happening.
  3. The ZooKeeper-imposed limit of 1MB on znode size places an upper limit on the number of reassignments that can be done at a given time.

This change would enable

  • Adding more partition reassignments, while some are still in-flight.
  • Cancelling individual partition reassignments (by reverting the reassignment to the old set of brokers)
  • Development of an AdminClient API which supported the above features.

To illustrate the last bullet, consider an AdminClient API for partition reassignment that returns a KafkaFuture providing access to some ReassignmentPartitionsResult which (implicitly) includes the identity of each partitions Reassignment. Further AdminClient APIs could then be added to:

  • query the status of a particular Reassignment
  • list all current Reassignments
  • change a current Reassignment
  • scope a throttle to the duration of a Reassignment

These things are hard to do currently because individual partition reassignments don't have a stable identity.

Public Interfaces

Strictly speaking this is not a change that would affect any public interfaces (since ZooKeeper is not considered a public interface, and it can be made in a backward compatible way), however since some users are known to operate on the /admin/reassign_partitions znode directly I felt it was worthwhile using the KIP process for this change.

Proposed Changes

The main idea is to move away from using the single /admin/reassign_partitions znode to control reassignment.

Instead a protocol similar to that used for config changes will be used. To start a reassignment (of a single partition, or a collection of partitions) a persistent sequenced znode will be created in /admin/reassignment_requests. The controller will be informed of this new request via a ZooKeeper watch. It will initiate the reassignment, create a znode /admin/reassignments/$topic-$partition and then delete the /admin/reassignment_requests/request_NNNN znode.

A client can subsequently create more requests involving the same partitions (by creating a /admin/reassignment_requests/request_MMMM znode), and the controller would re-initiate reassignment as the assignment changes. This means it is possible to "cancel" the reassignment of a single partition by changing the reassignment to the old assigned replicas. Note, however, that the controller itself doesn't remember the old assigned replicas – it is up to the client to record the old state prior to starting a reassignment if that client needs to support cancellation.

When the ISR includes all the newly assigned replicas (given as the contents of the reassignment znode), the controller would remove the relevant child of /admin/reassignments (i.e. at the same time the controller currently updates/removes the /admin/reassign_partitions znode).

In this way the child nodes of /admin/reassignments are correspond precisely to the replicas currently being reassigned. Moreover the ZooKeeper czxid of a child node of /admin/reassignments identifies a reassignment uniquely. It is then possible to determine whether that exact reassignment is still on-going, or (by comparing the czxid with the mzxid) has been changed.

The following sections describe, in detail, the algorithms used to mange the various znodes

The supporting the legacy znode protocol (/admin/reassign_partitions)

  1. Some ZK client creates /admin/reassign_partitions with the same JSON content as currently
  2. ZK triggers watch on /admin/reassign_partitions, controller notified
  3. Controller creates /admin/reassignment_requests/request_xxx (for xxx the znode sequence number)

The JSON in step 3 would look like:

{
    "version":1,
    "assignments":{
        "my-topic": {"0":[1,2,3]}
        "your-topic": {"3":[4,5,6]}
    }
    "legacy":true
}

The "legacy" key is used to track that this request originates from /admin/reassign_partitions.

The new znode protocol

  1. Some ZK client creates /admin/reassignment_requests/request_xxx (without "legacy" key mentioned above, unless it was created via the legacy protocol described above)
  2. ZK triggers watch on /admin/reassignment_requests/request_xxx, controller notified

  3. For each /admin/reassignment_requests/request_xxx the controller:

    1. Reads the znode to find the partitions being reassigned and the newly assigned brokers.

    2. Checks whether a reassignment for the same partition exists:
      1. If it does, and one is a legacy request and the other is not, then it logs a warning that the request is being ignored.
      2. Otherwise:

        1. The controller initiates reassignment

        2. The controller creates or updates  /admin/reassignments/$topic/$partition corresponding to those partitions
    3. The controller deletes /admin/reassignment_requests/request_xxx

The check in 3.b.i prevents the same partition being reassigned via different routes.

Revised protocol when a partition "catches up

  1. If the reassignment was via the legacy path:
    1. The controller removes partition from /admin/reassign_partitions, or if this is the last partition being reassigned by a change to /admin/reassign_partitions, then /admin/reassign_partitions is deleted.
  2. The controller removes /admin/reassignment/$topic/$partition

On controller failover

When failing over, the new controller needs to cope with three cases:

  1. Creation of /admin/reassign_partitions while there was no active controller
  2. Creation of childen of /admin/reassignment_requests while there was no active controller
  3. Restarting reassignment of all the in-flight reassignments in /admin/reassignments

The controller context's partitionsBeingReassigned map can be updated as follows:

  1. Add a ReassignedPartitionsContext for each reassignment in /admin/reassignments
  2. Add ReassignedPartitionsContexts for each child of /admin/reassignment_requests, but only if the context being added has the same legacy flag as any current context for the partition.
  3. Add ReassignedPartitionsContexts for partition in /admin/reassign_partitions, but only if the context being added has the same legacy flag as any current context for the partition.

Compatibility, Deprecation, and Migration Plan

As described above, compatibility with /admin/reassign_partitions is maintained, so existing software will continue working and the only difference to a client that operates on /admin/reassign_partitions would observe would be a slight increase in latency due to the round trips needed to create the new znodes.

This compatibility behaviour could be dropped in some future version of Kafka, if that was desirable.

Rejected Alternatives

  1. A similar protocol based on just /admin/reassignments without the /admin/reassignment_requests was initially considered, but that required a ZK watch per reassignment, which would not scale well. This proposal
    requires only 1 more watch than the current version of the broker. 
  • No labels