KIP-236: Interruptible Partition Reassignment

Current state: Under Discussion

Discussion thread: here

JIRA: here

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).


Motivation

The current partition reassignment tool allows only a single batch of partition reassignments to be in flight at a time, and an in-flight reassignment cannot be cancelled. This has a number of consequences:

  1. Users who need to do a large-scale reassignment end up splitting it into smaller batches, so that they can abort the overall reassignment sooner if operationally necessary.
  2. Each batch of reassignments takes as long as its slowest partition; this slowest partition prevents other reassignments from happening.

This change would enable:

  • Adding more partition reassignments while some are still in flight.
  • Cancelling individual partition reassignments (by reverting the reassignment to the old set of brokers).
  • Development of an AdminClient API that supports the above features.

To illustrate the last bullet, consider an AdminClient API for partition reassignment that returns a Future providing access to a ReassignmentPartitionsResult which (implicitly) includes the identity of each partition's reassignment. Further AdminClient APIs can then be added to:

  • query the status of a particular reassignment
  • list all current reassignments
  • change a current reassignment
  • scope a throttle to the duration of a reassignment

Public Interfaces

Strictly speaking, this change would not affect any public interfaces: ZooKeeper is not considered a public interface, and the change can be made in a backward-compatible way. However, since some users are known to operate on the /admin/reassign_partitions znode directly, I felt it was worthwhile to use the KIP process for this change.

Proposed Changes

The main idea is to move away from using the single /admin/reassign_partitions znode to control reassignment. Instead, each partition currently being reassigned will have a child znode of /admin/reassignments. The controller will listen for changes to the children of /admin/reassignments and initiate reassignment when a new child is added. For example, to reassign partition 42 of mytopic to the brokers [10,11,15]:

/admin/reassignments/mytopic-42 would be created with contents {"version":1,"assignment":[10,11,15]}
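
As a rough illustration of the naming scheme and payload above, the following Python sketch builds the znode path and contents for one partition. The helper names (reassignment_znode, parse_child_name) are hypothetical and not part of Kafka; the sketch also assumes the child name is always the topic followed by a hyphen and the partition number, as in the example.

```python
import json

REASSIGNMENTS_ROOT = "/admin/reassignments"  # parent znode proposed in this KIP


def reassignment_znode(topic, partition, replicas):
    """Return the (path, data) pair for one partition's reassignment znode.

    Hypothetical helper: the name and signature are illustrative only.
    """
    path = f"{REASSIGNMENTS_ROOT}/{topic}-{partition}"
    payload = json.dumps(
        {"version": 1, "assignment": list(replicas)},
        separators=(",", ":"),  # compact form, matching the example contents
    )
    return path, payload.encode("utf-8")


def parse_child_name(child):
    """Split a child name such as 'mytopic-42' into (topic, partition).

    Topic names may themselves contain hyphens, so split on the last one.
    """
    topic, _, partition = child.rpartition("-")
    return topic, int(partition)


path, data = reassignment_znode("mytopic", 42, [10, 11, 15])
```

Splitting on the last hyphen matters because a topic such as my-topic would otherwise be parsed incorrectly.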

The controller will also watch for changes in the contents of these znodes, and re-initiate reassignment if the content changes. This means it is possible to "cancel" the reassignment of a single partition by changing the reassignment to the old assigned replicas. Note, however, that the controller itself doesn't remember the old assigned replicas; it is up to the client to record the old state prior to starting a reassignment if cancellation is necessary for that client.
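
Because the controller does not remember the old replicas, a client that wants to be able to cancel has to keep them itself. The sketch below shows one way that bookkeeping might look. It uses a plain dict as a stand-in for ZooKeeper, and the ReassignmentClient class and its methods are assumptions made for illustration, not an existing Kafka API.

```python
import json


def encode(replicas):
    """Serialise a replica list in the znode format proposed above."""
    return json.dumps({"version": 1, "assignment": list(replicas)}).encode("utf-8")


class ReassignmentClient:
    """Hypothetical client that records old replicas so it can cancel later."""

    def __init__(self, store):
        self.store = store    # dict standing in for ZooKeeper: path -> bytes
        self.previous = {}    # path -> replicas before the first reassignment

    def start(self, topic, partition, current_replicas, target_replicas):
        path = f"/admin/reassignments/{topic}-{partition}"
        # Keep the replicas from before the *first* change, so that repeated
        # changes to the same reassignment do not overwrite the original state.
        self.previous.setdefault(path, list(current_replicas))
        self.store[path] = encode(target_replicas)
        return path

    def cancel(self, path):
        # Cancelling is just another reassignment, back to the recorded replicas.
        self.store[path] = encode(self.previous[path])


store = {}
client = ReassignmentClient(store)
znode = client.start("mytopic", 42, [1, 2, 3], [10, 11, 15])
client.cancel(znode)
```

With a real ZooKeeper client, cancel would have to create the znode again if the controller had already removed it on completion; the dict hides that distinction.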

When the ISR includes all the replicas given in the contents of the reassignment znode ([10,11,15] in the example), the controller would remove the relevant child of /admin/reassignments (i.e. at the same point at which the controller currently updates or removes the /admin/reassign_partitions znode).
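
The completion condition described above amounts to a subset check. A minimal sketch, assuming the ISR and the target assignment are both available as lists of broker ids (the function name is hypothetical):

```python
def reassignment_complete(isr, target_replicas):
    """Return True once every target replica is in the in-sync replica set."""
    return set(target_replicas) <= set(isr)


# Broker 15 has not caught up yet, so the znode would be kept.
still_running = not reassignment_complete([1, 2, 10, 11], [10, 11, 15])

# All of [10, 11, 15] are in sync, so the controller could remove the child znode.
done = reassignment_complete([1, 10, 11, 15], [10, 11, 15])
```

The ISR may still contain old replicas at this point; only the presence of all target replicas is checked.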

In this way, the child nodes of /admin/reassignments correspond precisely to the partitions currently being reassigned. Moreover, the ZooKeeper czxid of a child node identifies a reassignment uniquely. It is then possible to determine whether that exact reassignment is still ongoing or (using the mzxid) has been changed.
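
The czxid and mzxid comparison could be sketched as follows. ZnodeStat here is a minimal stand-in for the stat structure ZooKeeper returns for a znode, reduced to the two fields that matter; the function name and the returned labels are assumptions for illustration.

```python
from collections import namedtuple

# Stand-in for ZooKeeper's znode stat, keeping only the two relevant fields.
ZnodeStat = namedtuple("ZnodeStat", ["czxid", "mzxid"])


def reassignment_state(original_czxid, original_mzxid, current_stat):
    """Classify a reassignment the client observed earlier.

    current_stat is None when the znode no longer exists.
    """
    if current_stat is None or current_stat.czxid != original_czxid:
        # The znode was deleted, or deleted and recreated as a new reassignment.
        return "gone"
    if current_stat.mzxid != original_mzxid:
        # Same znode, but its contents were rewritten since we looked.
        return "changed"
    return "ongoing"


state = reassignment_state(100, 100, ZnodeStat(czxid=100, mzxid=120))
```

A client would record the czxid and mzxid when it creates or last modifies the znode, then compare them against a fresh stat when it wants to check on that reassignment.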

Compatibility, Deprecation, and Migration Plan

In order to retain compatibility with existing software that understands /admin/reassign_partitions, the controller would be changed to create the child nodes of /admin/reassignments when the /admin/reassign_partitions znode is created. The code to update and delete /admin/reassign_partitions would also be retained, so the only difference to a client that operates on /admin/reassign_partitions would be a slight increase in latency due to the need to create the new znodes. This compatibility behaviour could be dropped in some future version of Kafka if desirable.
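
The translation from the legacy znode to the new child znodes might look roughly like the sketch below. It assumes the legacy /admin/reassign_partitions contents have the form {"version":1,"partitions":[{"topic":...,"partition":...,"replicas":[...]}]}; the function name is hypothetical and the real controller logic would be written in Kafka's own codebase.

```python
import json


def legacy_to_children(legacy_data):
    """Map legacy reassign_partitions contents to per-partition child znodes.

    Returns a dict of child path -> znode contents (bytes).
    """
    legacy = json.loads(legacy_data)
    children = {}
    for entry in legacy["partitions"]:
        path = f"/admin/reassignments/{entry['topic']}-{entry['partition']}"
        payload = json.dumps(
            {"version": 1, "assignment": entry["replicas"]},
            separators=(",", ":"),
        )
        children[path] = payload.encode("utf-8")
    return children


legacy = b'{"version":1,"partitions":[{"topic":"mytopic","partition":42,"replicas":[10,11,15]}]}'
children = legacy_to_children(legacy)
```

Each entry in the legacy batch becomes an independent child znode, which is what allows the partitions in a batch to complete or be cancelled individually.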

Rejected Alternatives

If there are alternative ways of accomplishing the same thing, what were they? The purpose of this section is to motivate why the design is the way it is and not some other way.
