One property BookKeeper satisfies is persistence: if an entry e is added to a ledger and the client receives a confirmation, then the last entry of the ledger once it is closed is at least e.
The ledger recovery procedure alone does not prevent a violation of the persistence property. Consider the following scenario. We have an application with a ledger writer and a ledger reader, where the reader waits until the writer closes the ledger before reading. The ledger reader, however, incorrectly suspects that the ledger writer has crashed and proceeds to open the ledger and recover it. The reader reads as many entries as it can and writes the identifier of the last confirmed entry to the metadata store. The ledger writer concurrently writes more entries to the ledger successfully, even though the ledger has been closed already.
This case violates persistence. To prevent such cases, we implement a fencing mechanism. Fencing prevents two clients from modifying a ledger concurrently, either by adding entries or by closing it. To implement fencing, we embed into the recovery protocol a mechanism to fence off bookies. With this mechanism, a client that needs to close a ledger notifies the bookies in the ledger's ensemble. Once a bookie receives such a notification, it marks its ledger fragment as fenced and errors out any request to add an entry to the ledger. To implement this feature, we use a recovery flag in messages from clients to bookies. All messages from a ledger reader have the flag on while it is recovering the ledger.
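The bookie-side behaviour described above can be illustrated with a minimal in-memory sketch. It is not the BookKeeper bookie implementation: the class FencedBookieSketch, its methods, and the Result enum are hypothetical names chosen for illustration. It assumes that a read carrying the recovery flag is what fences the ledger, and that adds carrying the recovery flag (issued by the recovering client) are still accepted.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy in-memory bookie. All names are hypothetical, not BookKeeper classes. */
final class FencedBookieSketch {
    enum Result { OK, FENCED }

    // Ledgers that a recovering client has fenced on this bookie.
    private final Set<Long> fencedLedgers = new HashSet<>();
    // ledger id -> (entry id -> payload)
    private final Map<Long, Map<Long, String>> entries = new HashMap<>();

    /** A read carrying the recovery flag fences the ledger on this bookie. */
    String read(long ledgerId, long entryId, boolean recoveryFlag) {
        if (recoveryFlag) {
            fencedLedgers.add(ledgerId);
        }
        Map<Long, String> ledger = entries.get(ledgerId);
        return ledger == null ? null : ledger.get(entryId);
    }

    /** Adds without the recovery flag are rejected once the ledger is fenced. */
    Result add(long ledgerId, long entryId, String payload, boolean recoveryFlag) {
        if (fencedLedgers.contains(ledgerId) && !recoveryFlag) {
            return Result.FENCED;
        }
        entries.computeIfAbsent(ledgerId, id -> new HashMap<>()).put(entryId, payload);
        return Result.OK;
    }
}
```

In this sketch fencing is per ledger, so a writer of a different ledger on the same bookie is unaffected.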
This flag guarantees that every bookie with which the ledger reader exchanges messages during recovery will no longer process add requests for the ledger that do not carry the flag. Since during recovery a client needs to contact at least one bookie in every ack quorum, in each ack quorum there is at least one bookie that has been fenced off during recovery. An add is confirmed only once a full ack quorum has acknowledged it, so we guarantee that the ledger writer is not able to add more entries successfully once enough bookies have been fenced. To prevent ensemble changes during recovery, we first note in the ledger metadata that recovery has started.
We show an example in the attached figure. In the example, we have an ensemble of three bookies and we replicate each entry on two bookies. The ledger writer successfully adds entries up to entry 11 and writes entry 12 only partially. The reader initiates recovery by first obtaining the last confirmed add from each bookie, and starts reading from the entry following the latest one, which corresponds to entry 12. The reader reads entry 12 and ensures that it is replicated in a quorum by writing back entry 12. The next entry, entry 13, does not exist, since it is not present in any of the bookies that would have it, had it been written. The reader finally declares the ledger closed and writes this to the metadata store, using CAS. When the ledger writer tries to complete the add of entry 12, the bookie rejects the request because it has been fenced.
To fence a ledger from the client API, you can simply open it using BookKeeper#asyncOpenLedger or BookKeeper#openLedger. If the ledger you are opening is currently being written to by another client, it will automatically be fenced, and the writing client will receive a BKWriteException for any subsequent calls to LedgerHandle#addEntry/LedgerHandle#asyncAddEntry.
However, there are cases where you do not want to fence a ledger when you open it. For example, if you have a standby node tailing the WAL to maintain an up-to-date state for warm failover, then you do not want to impede the writer. In this case, you can open the ledger with BookKeeper#asyncOpenLedgerNoRecovery or BookKeeper#openLedgerNoRecovery. These methods do not guarantee that you will read the entire ledger: if there is a concurrent writer, you will only be able to read the prefix of entries that had been confirmed at the time the ledger was opened. In these cases, you must open the ledger again with BookKeeper#asyncOpenLedger or BookKeeper#openLedger before failing over to your warm backup.
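The warm standby pattern can be sketched as follows. To keep the example self-contained, it does not call the BookKeeper client directly: LedgerOpener is a hypothetical interface whose two methods stand in for BookKeeper#openLedgerNoRecovery and BookKeeper#openLedger, and each returns the list of entries readable through the resulting handle. WarmStandbySketch and its methods are likewise assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of a standby that tails without fencing and fences only on failover. */
final class WarmStandbySketch {
    /** Hypothetical stand-in for the two open modes of the client API. */
    interface LedgerOpener {
        // Would wrap BookKeeper#openLedgerNoRecovery: does not fence the writer.
        List<String> openNoRecovery(long ledgerId);
        // Would wrap BookKeeper#openLedger: fences the writer and recovers the ledger.
        List<String> openWithRecovery(long ledgerId);
    }

    private final LedgerOpener opener;
    private final long ledgerId;
    private final List<String> applied = new ArrayList<>();

    WarmStandbySketch(LedgerOpener opener, long ledgerId) {
        this.opener = opener;
        this.ledgerId = ledgerId;
    }

    /** Tail the ledger: sees only the prefix confirmed at open time. */
    void tail() {
        applyNew(opener.openNoRecovery(ledgerId));
    }

    /** Fail over: reopen with recovery so that the whole ledger is read. */
    void failOver() {
        applyNew(opener.openWithRecovery(ledgerId));
    }

    /** Apply only the entries that have not been applied yet. */
    private void applyNew(List<String> visible) {
        for (int i = applied.size(); i < visible.size(); i++) {
            applied.add(visible.get(i));
        }
    }

    List<String> state() {
        return applied;
    }
}
```

The standby may call tail() repeatedly while the writer is alive; only failOver() impedes the writer.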
The full API doc is available at http://zookeeper.apache.org/bookkeeper/docs/r4.0.0/apidocs/org/apache/bookkeeper/client/BookKeeper.html.
The following are the steps that the client takes to fence a ledger. This happens in tandem with recovery, so some details of recovery are also included. Multiple clients can run the sequence of steps concurrently; the result will be the same and correct for all clients.
Step 1 (flag the ledger as in recovery) The client updates the ZooKeeper metadata for the ledger. This prevents the writer from changing the ensemble to continue writing. If the writer tries to change the ensemble, it will see that the version of the metadata it has is out of date, and will have to read the new metadata before rebuilding the ensemble. On reading it, the writer will see that the ledger is in recovery, so it will try to close the ledger.
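The version check in Step 1 can be sketched with a small in-memory stand-in for the metadata store. This is an illustration under assumptions, not the ZooKeeper or BookKeeper API: LedgerMetadataSketch, its State enum, and compareAndSet are hypothetical names, and the store holds a single ledger's metadata.

```java
/** Toy versioned metadata store for one ledger. All names are hypothetical. */
final class LedgerMetadataSketch {
    enum State { OPEN, IN_RECOVERY, CLOSED }

    /** Immutable snapshot of the metadata together with its version. */
    static final class Versioned {
        final State state;
        final int version;

        Versioned(State state, int version) {
            this.state = state;
            this.version = version;
        }
    }

    private Versioned current = new Versioned(State.OPEN, 0);

    synchronized Versioned read() {
        return current;
    }

    /** Conditional update: succeeds only if the caller holds the current version. */
    synchronized boolean compareAndSet(int expectedVersion, State newState) {
        if (current.version != expectedVersion) {
            return false; // stale view: the caller must re-read the metadata
        }
        current = new Versioned(newState, expectedVersion + 1);
        return true;
    }
}
```

A writer holding version 0 that attempts an update after a recovering client has moved the ledger to IN_RECOVERY gets false back, re-reads, and observes the recovery state.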
Step 2 (fence the bookies) The client sends a read request to all bookies in the ensemble. It waits until it has received a response from at least one bookie in each possible ack quorum. For the RoundRobinDistributionSchedule (which is the only one at the moment), there are as many possible quorums as there are bookies in the ensemble. An individual bookie will usually be in more than one of these quorums. RRQuorumCoverageSet handles the logic for ensuring that all the quorums are covered. Once this step is complete, the ledger is fenced. No more entries can be added.
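The coverage condition in Step 2 can be sketched as below. This is not the code of RRQuorumCoverageSet; it is a minimal illustration that assumes the round-robin quorum starting at bookie index i consists of the quorumSize consecutive bookies i, i+1, ... (modulo the ensemble size). The class and method names are hypothetical.

```java
import java.util.Set;

/** Sketch of the quorum coverage check for a round-robin schedule. */
final class QuorumCoverageSketch {
    /**
     * Returns true if every possible quorum contains at least one bookie
     * (identified by its index in the ensemble) that has responded.
     */
    static boolean allQuorumsCovered(int ensembleSize, int quorumSize, Set<Integer> responded) {
        // One possible quorum per starting position in the ensemble.
        for (int start = 0; start < ensembleSize; start++) {
            boolean covered = false;
            for (int j = 0; j < quorumSize; j++) {
                if (responded.contains((start + j) % ensembleSize)) {
                    covered = true;
                    break;
                }
            }
            if (!covered) {
                return false; // this quorum could still accept an add
            }
        }
        return true;
    }
}
```

With an ensemble of three bookies and a quorum of two, a response from bookie 0 alone is not enough, because the quorum made of bookies 1 and 2 is untouched; responses from bookies 0 and 1 cover all three quorums.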
Step 3 (ledger is recovered) At this point, the client can start recovering the ledger. It has read the last confirmed entry during Step 2. It starts reading forward, one entry at a time, until it gets a NoSuchEntry response from at least one bookie in each quorum for a particular entry id. The last entry in the ledger is, therefore, the preceding entry. The ledger is closed with this as its last entry.
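The read-forward loop of Step 3 can be sketched with in-memory maps standing in for bookies. This is a simplified illustration, not the BookKeeper recovery code: it assumes every bookie responds, that entry e is stored on the quorumSize consecutive bookies starting at index e modulo the ensemble size, and that recovery writes back each entry it finds. The names LedgerRecoverySketch and recover are hypothetical.

```java
import java.util.List;
import java.util.Map;

/** Sketch of reading forward from the last confirmed entry and closing the ledger. */
final class LedgerRecoverySketch {
    /**
     * bookies.get(i) maps entry id to payload for bookie i.
     * Returns the id of the last entry of the recovered ledger.
     */
    static long recover(List<Map<Long, String>> bookies, int quorumSize, long lastConfirmed) {
        int ensembleSize = bookies.size();
        long entryId = lastConfirmed + 1;
        while (true) {
            // Look for the entry on every bookie that would hold it.
            String payload = null;
            for (int j = 0; j < quorumSize; j++) {
                Map<Long, String> bookie = bookies.get((int) ((entryId + j) % ensembleSize));
                if (bookie.containsKey(entryId)) {
                    payload = bookie.get(entryId);
                }
            }
            if (payload == null) {
                // No bookie has the entry: the preceding entry is the last one.
                return entryId - 1;
            }
            // Write the entry back so that it is replicated on its whole quorum.
            for (int j = 0; j < quorumSize; j++) {
                bookies.get((int) ((entryId + j) % ensembleSize)).put(entryId, payload);
            }
            entryId++;
        }
    }
}
```

Applied to the earlier example (three bookies, two replicas, entries up to 11 fully written, entry 12 on a single bookie), the loop writes back entry 12, finds no entry 13, and closes the ledger at entry 12.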