Problem

Potential Data Loss

When we take a bookie out of the cluster, we have no idea of whether doing so would cause any quorum loss or not.

For example, if we've already lost or removed a few bookies, the auto rereplication might still be working to rereplicate the ledgers for the lost bookies or even fail to replicate some ledgers. Now we want to remove another bookie, and to make sure we don't lose any data, we need to know whether the ledgers on the lost bookies which shares a quorum with this bookie have all been replicated or not. However, there's no bookie lifecycle state which could indicate whether the bookie is still being replicating or already replicated. So removing a bunch of bookies too quickly would potentially cause some quorum loss.

Situations like this are listed but not limited to:

Scale down the cluster
OS migration (need reinstall)
Move bookies to a different cluster
Frequent bad disks

Expensive Operation & Lack of Automation

Currently, to remove a few bookies out of cluster, we have to do a lot of manual operations:

Kill the bookie
Run the manual rereplicator / auto rereplicator
Wait for rereplication done for that bookie
Remove the cookie for that bookie
Remove the bookie out of the cluster
Start killing the next bookie

The manual operation for a single bookie is not that bad, but when we are dealing with hundreds and thousands of bookies, the operations will become a pain. And there's currently no easy way to automate the whole process because we need to do a lot of manual checks to prevent any data loss.

If we can add a HTTP management interface and automate the whole lifecycle of bookkeeper operation, we would get a lot of benefits:

Bookkeeper on cloud can be auto-scaled
Save engineer time doing operations
HTTP endpoint would be easier to interact with
Provide the possibility to add an admin dashboard for Bookkeeper cluster

Design

Bookkeeper Lifecycle State

Active
Draining
Draining Failed
Drained

Lifecycle States

Active

An active bookie is a bookie that has been registered in the cluster and is ready to serve the traffic at any time. A bookie that is in active state doesn’t necessary mean that the bookie is running. An active bookie could be a running bookie serving read and write traffic, a read only bookie, or even a bookie registered but not running the server.

Draining

A draining bookie is a bookie that we're going to retire or remove from the cluster. If a bookie is in draining state, it should either serve in read-only mode or do not serve any traffic. Bookie can only transition from active state to draining state. When a bookie transition to draining state, Auditor should detect / be notified about the state change and start the rereplication for that bookie.

Draining-Failed

If the bookie is in draining state, the draining process would either succeed or failed. When we tried as best as we could but still not able to replicate all the ledgers on the draining bookie, the bookie should transition into draining-failed state. Once a bookie is in draining-failed state, auto rereplicators should stop replicating the bookie. A draining-failed bookie can still serve read traffic but no write traffic. Usually a draining-failed bookie should not transition into any other state automatically, because it indicates that some ledgers on that bookie can't be replicated, thus removing that bookie might potentially cause data loss.

For those draining-failed bookies, we should manually take care of them, we can move the state into the drained state only after fixing the "bad" ledgers. But there's a concern that if we have a lot of draining-failed bookie, there would be a lot of manual work, which would be conflicted with the goal of automating the whole lifecycle process. So in the future, we might need a better solution or add some strict rules to auto-fix the "bad" ledgers.

Drained

A bookie will turn into drained state from draining state once all its ledgers have been replicated successfully. Drained state indicates that the bookie is now safe to be removed from the cluster since we guarantee that all ledgers on that bookie has been replicated successfully and removing the bookie wouldn't cause any data loss.

Where do we persist lifecycle state?

There're basically two options:

In Zookeeper (Metadata Store)
On Ledger disk / Journal disk

We prefer to store the lifecycle state on Zookeeper because we need to access and manage the lifecycle state even when the bookie host is not available. If we store the state on disk, when the bookie have bad disk or is somehow not reachable, we would lose the capability to access the lifecycle state of that bookie.

In Zookeeper, we can create a znode for the lifecycle state under the cookie znode to store the state data: cookies/<ip:port>/state

How do we manage lifecycle state?

Default: Active

Active -> Draining: We can build a tool to modify the state in Zookeeper, or we can set it through an HTTP endpoint once we have it

Draining -> Draining-Failed: Auditor change the state in ZK when it detects not all ledgers can be replicated successfully

Draining-Failed -> Drained: We manually set the state using a tool or through HTTP endpoint, or we can try to make auditor smart enough to handle all kinds of replication failures

Draining -> Drained: Auditor change the state in ZK when it detects all ledgers for the bookie have been replicated successfully

How to listen and react to lifecycle state change?

Auto Rereplicator on bookie

What do we do with a drained bookie?

A drained bookie is safe to be removed from the cluster at any time. We need to delete the cookie and clear the lifecycle state before removing the bookie out of the cluster, otherwise, when the bookie comes back, it will still be in draining state, which means it can only serve in read-only mode.

Bookkeeper HTTP Management Interface

Manage Lifecycle State
Manage Serving state
Manage Cookie
Register callback for state change
Report bad ledgers
Report

Actions

TBA

Child pages

BP-4 - BookKeeper Lifecycle Management