
This page discusses fsck-style checking and autorecovery of ledgers and bookies. The goal is to agree on a clear story of what needs to be done, from which we can then derive a list of JIRAs.

The problem can be split into two parts, detection and recovery.

Detection

Currently, we have no automated mechanism to check whether a bookie contains all the ledger entries it should, which can leave ledgers underreplicated across the system. We need a mechanism to ensure that a bookie contains the entries which ZooKeeper says it does.

The brute-force approach would be for each bookie to get the list of ledger fragments it should have, read every entry in each fragment, and check that each checksum is correct. A lighter approach would be to check only the first and last entry of each fragment. Even the lighter approach could be expensive on systems with many small ledgers, though.
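To make the trade-off between the two approaches concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `Fragment`, `LocalStore`, and `check_fragment` are hypothetical names, the store is an in-memory dict, and CRC32 stands in for whatever checksum the bookie really uses. It is not BookKeeper's API.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Fragment:
    """Hypothetical fragment: a contiguous range of entries in one ledger."""
    ledger_id: int
    first_entry: int
    last_entry: int


@dataclass
class LocalStore:
    """Hypothetical stand-in for a bookie's local storage.

    Maps (ledger_id, entry_id) to (payload, stored_checksum).
    """
    entries: dict = field(default_factory=dict)

    def put(self, ledger_id: int, entry_id: int, payload: bytes) -> None:
        self.entries[(ledger_id, entry_id)] = (payload, zlib.crc32(payload))


def entry_ok(store: LocalStore, ledger_id: int, entry_id: int) -> bool:
    """True if the entry exists locally and its checksum matches."""
    record = store.entries.get((ledger_id, entry_id))
    if record is None:
        return False
    payload, stored_crc = record
    return zlib.crc32(payload) == stored_crc


def check_fragment(store: LocalStore, frag: Fragment, full: bool = True) -> list:
    """Return the entry ids in the fragment that are missing or corrupt.

    full=True  reads every entry (brute force).
    full=False reads only the first and last entry (lighter check).
    """
    if full:
        ids = range(frag.first_entry, frag.last_entry + 1)
    else:
        ids = sorted({frag.first_entry, frag.last_entry})
    return [e for e in ids if not entry_ok(store, frag.ledger_id, e)]
```

Note that the lighter check cannot see a hole in the middle of a fragment, and for a fragment of one or two entries it does the same work as the full check, which is why many small ledgers remain costly.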

What about the case where a whole bookie disappears?

Open questions

  • Who triggers detection?
  • What do we do when we find a fragment is missing?

Recovery

Once we detect that a fragment is underreplicated, who should run the process to recover it? How do we prevent two actors from attempting to recover the same fragment at the same time and potentially overloading the system?
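One possible answer to the second question is a claim table: an actor must claim a fragment before recovering it, and the number of concurrent claims is capped. The sketch below is only an illustration under stated assumptions. `FragmentClaims`, `try_claim`, `release`, and `max_in_flight` are hypothetical names, and the table lives in one process; in a real deployment the claims would need to live in a shared coordination service (ZooKeeper would be a natural candidate), which this sketch does not attempt.

```python
import threading


class FragmentClaims:
    """Hypothetical in-process claim table for fragment recovery.

    At most one worker may hold a claim on a given fragment, and at most
    max_in_flight fragments may be claimed at once to limit recovery load.
    """

    def __init__(self, max_in_flight: int):
        self._lock = threading.Lock()
        self._owners = {}  # fragment_key -> worker_id
        self._max_in_flight = max_in_flight

    def try_claim(self, fragment_key, worker_id) -> bool:
        """Claim the fragment for worker_id; False if taken or over the cap."""
        with self._lock:
            if fragment_key in self._owners:
                return False
            if len(self._owners) >= self._max_in_flight:
                return False
            self._owners[fragment_key] = worker_id
            return True

    def release(self, fragment_key, worker_id) -> bool:
        """Release the claim; only the current owner may release it."""
        with self._lock:
            if self._owners.get(fragment_key) == worker_id:
                del self._owners[fragment_key]
                return True
            return False
```

A worker would call `try_claim` before starting recovery and `release` when it finishes. The sketch does not handle a worker that dies while holding a claim; some expiry or liveness mechanism would be needed for that case.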
