
Problem

BookKeeper is not able to reclaim disk space when disks are full

If all disks are full or almost full, both major and minor compaction are suspended, and only GC keeps running. In the current design this is the right thing to do: when disks are full, the EntryLogger cannot allocate any new entry logs, and suspending compaction prevents disk usage from growing further.

However, the problem is that if every entry log contains a mix of short-lived and long-lived ledgers, then when disks are full GC cannot delete any entry logs, and with compaction suspended the bookie can no longer reclaim any disk space by itself.

Compaction might keep generating duplicated data, which can fill up the disk

Currently, compaction is not a transactional operation. In the current CompactionScannerFactory, if flushing the entry log file or the ledger cache fails, the data that has already been flushed is not deleted, and the entry log being compacted is retried on the next scan, which generates duplicated data.

Moreover, if the entry log being compacted contains long-lived data and the compaction keeps failing for some reason (e.g. a corrupted entry or a corrupted index), the bookie's disk usage keeps growing until either the entry log can be garbage collected or the disk is full.

Proposal

Use a separate log for compaction

In order to address the first issue, we can use a separate log file for compaction and have separate allocation logic for it. To allocate a compaction log file, we don't have to choose from the writable ledger directories, which are determined by diskWarnThreshold. In fact, as long as some ledger directory has enough disk space for the next compaction log (we can use the entry log size limit as the bound), we can allocate there. This is safe because when disks are full the bookie must be running in read-only mode, so compaction is the only writer to the ledger disks. A sketch of this relaxed rule follows.
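A minimal sketch of this relaxed allocation rule, in Java; the class and method names (CompactionLogAllocator, pickDirForCompactionLog) are illustrative assumptions, not a final API:

    import java.io.File;
    import java.io.IOException;
    import java.util.List;

    class CompactionLogAllocator {
        // Any ledger directory with room for one compaction log qualifies,
        // even if it is already above diskWarnThreshold and therefore not
        // "writable" for regular entry logs.
        File pickDirForCompactionLog(List<File> allLedgerDirs, long logSizeLimit)
                throws IOException {
            for (File dir : allLedgerDirs) {
                if (dir.getUsableSpace() > logSizeLimit) {
                    return dir;
                }
            }
            throw new IOException("no ledger directory has enough space for a compaction log");
        }
    }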

Add transactional phases for compaction

Once we separate the log file for compaction, we can achieve a transactional compaction operation. By "transactional", we mean that if anything fails at any phase during compaction, we can roll the current compaction back properly; a failed compaction can still be retried on the next scan, and rolling back cleans up the duplicated data.

Add recovery for compaction

If for any reason we succeed in flushing the compaction log but then fail to update the index or flush the ledger cache, we end up with a partially updated index. In this case, we can't simply roll back by deleting the old entry log or the compaction log, because the partially updated index file might already point some ledgers at the new compaction log. So we need a way to recover the partially updated index for the compaction log file, as sketched below.
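One way to make this recoverable, assuming the compaction log is marked as complete (e.g. renamed with a ".compacted" suffix) only after it has been fully flushed: on restart, a marked log is rolled forward by replaying the index update, while an unmarked log is discarded. The marking scheme and helper names below are assumptions for illustration:

    import java.io.File;
    import java.io.IOException;
    import java.util.List;

    class CompactionRecovery {
        // On bookie startup, finish or discard any compaction that was
        // interrupted by a crash.
        void recoverCompactionLogs(List<File> compactionLogs) throws IOException {
            for (File log : compactionLogs) {
                if (log.getName().endsWith(".compacted")) {
                    // Fully flushed before the crash: the index may already
                    // point at this log, so roll forward by replaying the
                    // index update (which must be idempotent).
                    replayIndexUpdate(log);
                } else {
                    // Never fully flushed: no index entry can reference it
                    // yet, so it is safe to delete and retry later.
                    if (!log.delete()) {
                        throw new IOException("failed to delete stale compaction log " + log);
                    }
                }
            }
        }

        // Hypothetical helper: scan the compaction log and point the ledger
        // index at the entries it contains.
        void replayIndexUpdate(File compactedLog) throws IOException {
            // ...
        }
    }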

Design

Create a separate file for compaction but share the same allocation logic

The idea is to reuse the current allocation logic but use a different file suffix for compaction logs. In this way, we make minimal changes while still achieving the goal of using a separate file for compaction, for example as sketched below.
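A small sketch of the naming idea; the exact suffixes here are illustrative assumptions, not the final naming:

    import java.io.File;

    class CompactionLogNaming {
        // Same id-allocation path as regular entry logs, different suffix,
        // so GC and the entry logger can always tell a compaction log apart
        // from a regular entry log.
        static final String ENTRY_LOG_SUFFIX  = ".log";
        static final String COMPACTING_SUFFIX = ".log.compacting"; // still being written
        static final String COMPACTED_SUFFIX  = ".log.compacted";  // flushed, index update pending

        static File newCompactionLogFile(File ledgerDir, long logId) {
            return new File(ledgerDir, Long.toHexString(logId) + COMPACTING_SUFFIX);
        }
    }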

Introduce Compaction Transactional Phases

We can add a separate class, CompactionWorker, to keep all the compaction logic in one place. The main compact function would be like this:
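(The original code block was not preserved on this page; the following is a sketch. The phase names ScanEntryLogPhase, FlushCompactionLogPhase and UpdateIndexPhase, and the removeEntryLog helper, are assumptions for illustration.)

    import java.util.Arrays;
    import java.util.List;

    class CompactionWorker {
        // Run the phases in order; if any phase fails it rolls itself back,
        // and the next GC scan will retry this entry log from scratch.
        void compact(EntryLogMetadata metadata) {
            List<CompactionPhase> phases = Arrays.asList(
                new ScanEntryLogPhase(metadata),   // copy live entries into the compaction log
                new FlushCompactionLogPhase(),     // flush and mark the compaction log complete
                new UpdateIndexPhase());           // point the ledger index at the new locations
            for (CompactionPhase phase : phases) {
                if (!phase.run()) {
                    return; // aborted; any duplicated data was rolled back
                }
            }
            // All phases succeeded: the old entry log is now garbage-collectable.
            removeEntryLog(metadata.getEntryLogId());
        }
    }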

 

And this is the abstract class for all compaction phases.
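(Again a sketch, as the original block was lost; the start/complete/abort split is one possible contract for per-phase rollback.)

    import java.io.IOException;

    // Each phase does its work in start()/complete() and must be able to
    // undo any partial work in abort(), so a failed compaction leaves no
    // duplicated data behind.
    abstract class CompactionPhase {
        private final String name;

        CompactionPhase(String name) {
            this.name = name;
        }

        // Template method: attempt the phase, rolling back on any failure.
        boolean run() {
            try {
                start();
                return complete();
            } catch (IOException ioe) {
                abort();
            }
            return false;
        }

        abstract void start() throws IOException;       // do the phase's work
        abstract boolean complete() throws IOException; // make the work durable/visible
        abstract void abort();                          // undo partial work so retry is safe

        @Override
        public String toString() {
            return name;
        }
    }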
