This page is to talk about Fscking and autorecovery of ledgers and bookies, so that we can discuss and get a clear story of what needs to be done, from which we can then derive a list of JIRAs.
The problem can be split into two parts, detection and recovery.
Detection
Currently, we have no automated mechanism to check whether a bookie contains all the ledger entries it should, which can potentially lead to underreplication in the whole system. We need a mechanism to ensure that a bookie contains the entries which zookeeper says it does.
The brute force mechanism here would be for each bookie to get a list of ledger fragments it should have, and then read all entries in the fragment and check that the checksum is correct. A lighter approach would be to only check the first and last entry of a fragment. This could be expensive on systems which had many small ledgers though.
What about the case where a whole bookie disappears?
Open questions
- Who triggers detection?
- What do we do when we find a segment is missing?
Recovery
Once we detect that a fragment is underreplicated, who should run the process to recover it. How do we prevent two actors from attempting to recovery a fragment at the same time and potentially overload the system?