Star Deltas

Introduction

FSFS currently uses xdelta to store different versions of the same node efficiently. Basically, we represent a node as

  x_i = x_{i-1} + \delta_i

and store x_0 plus the incremental \delta information. x_N gets reconstructed by starting with x_0 and iteratively applying all deltas. Assuming that size(x_i) is roughly proportional to i and that the deltas average around some constant size, this approach has the following properties:

  storage size:          O(N)
  reconstruction time:   O(N^2)

with N being either the size of the node or the number of revisions to it. The reconstruction time is quadratic because applying \delta_i requires processing roughly size(x_i) ~ i bytes, and 1 + 2 + ... + N = O(N^2). To mitigate the quadratic runtime behavior, we use skip-deltas:

  x_i = x_{f(i)} + \delta_i

with f(i) being the next "rounder" binary number, i.e. i with its lowest set bit removed.
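
To make that chain concrete, here is a minimal sketch (not taken from the FSFS sources; the helper name is made up) that clears the lowest set bit and prints the sequence of delta bases a reconstruction of revision 11 would have to walk:

  /* Sketch only, not FSFS code: the skip-delta base of i is i with its
     lowest set bit cleared. */
  #include <stdio.h>

  static unsigned int skip_delta_base(unsigned int i)
  {
    return i & (i - 1);
  }

  int main(void)
  {
    unsigned int rev = 11;          /* 1011 in binary */

    printf("%u", rev);
    while (rev > 0)
      {
        rev = skip_delta_base(rev);
        printf(" -> %u", rev);      /* 11 -> 10 -> 8 -> 0 */
      }
    printf("\n");
    return 0;
  }

Reconstructing revision 11 therefore applies only three deltas, one per set bit in its binary representation, instead of eleven.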

Since we skip intermediate representations, we will repeat the respective change information (approx. 0.5 log N times). Storage size and reconstruction time are now

  storage size:          O(N log N)
  reconstruction time:   O(N log N)

Please note that the actual implementation uses a hybrid scheme.

Observation. This approach does not cover forked node histories, which are common with branches and merges. The same change can be merged into many branches but may not result in the same node content (i.e. rep caching may not kick in). However, those changes themselves are very similar and often even identical.

A potential solution might be to deltify deltas, i.e. to use a second-order deltification scheme. Once that is done, even higher orders might be used. It is unclear how this could be implemented in a meaningful way. Also, its effectiveness depends heavily on the order in which branched deltas get processed.

Basic Ideas

We would like to be able to deltify everything against everything else. A star delta no longer considers individual pairs of texts but rather deltifies each new text against all previously processed texts. For each new string we only need to store those parts that could not be found in any of the previous strings, plus a set of instructions on how to combine those parts.

Note that only the genuinely new parts of a string add to the already existing data.
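
A made-up illustration (the strings here are invented and not taken from the original text): assume "Hello World\n" and "Goodbye World\n" have already been processed. A new string "Hello World again\n" then splits into a fragment that already exists somewhere and a fragment that has to be stored as new data, which this sketch checks with plain substring search:

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
    /* Everything processed so far, conceptually one big search space. */
    const char *previous = "Hello World\n" "Goodbye World\n";

    /* Fragments of the new string "Hello World again\n": */
    const char *reused = "Hello World";  /* found in the previous texts */
    const char *fresh = " again\n";      /* nowhere in the previous texts,
                                            so it must be stored as new data */

    printf("reused fragment found: %s\n",
           strstr(previous, reused) ? "yes" : "no");
    printf("new fragment found: %s\n",
           strstr(previous, fresh) ? "yes" : "no");
    return 0;
  }

The reused fragment only costs an instruction; only " again\n" would be appended to the stored data.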

Expected storage size, runtime memory usage, and write and reconstruction times should all be O(N).

Note. Deltification is only a means of reducing storage size. We do not attach any semantics to the deltification result. It is neither minimal nor related to "svn delta" and friends.

Data structure

The core data structure consists of two elements: the text buffer and the instructions array. The text buffer is a single byte array containing the various string fragments that will be combined into fulltexts according to the instructions. The latter is a single array of (offset, count) pairs. The offset may be prefixed (see below), but in its simplest form an instruction copies COUNT bytes from the text buffer starting at text buffer offset OFFSET.

Since we want to store more than a single fulltext ("string"), another array maps each string index to its first instruction in the instructions array.
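
A minimal sketch of these three arrays (all names and stored strings are invented for illustration, and the optional offset prefixes are ignored), reusing the example texts from above and showing how a fulltext is rebuilt by walking its instruction run:

  #include <stdio.h>
  #include <string.h>

  /* Single byte array holding every stored fragment. */
  static const char text_buffer[] = "Hello World\nGoodbye  again\n";

  /* Copy COUNT bytes from TEXT_BUFFER starting at OFFSET. */
  struct instruction { unsigned int offset, count; };

  static const struct instruction instructions[] = {
    { 0, 12 },             /* string 0: "Hello World\n"            */
    { 12, 8 }, { 6, 6 },   /* string 1: "Goodbye " + "World\n"     */
    { 0, 11 }, { 20, 7 }   /* string 2: "Hello World" + " again\n" */
  };

  /* Maps a string index to its first instruction; the extra last entry
     marks the end of the final string's instruction run. */
  static const unsigned int first_instruction[] = { 0, 1, 3, 5 };

  /* Rebuild string S by concatenating the fragments it refers to. */
  static void reconstruct(unsigned int s, char *result)
  {
    unsigned int i;

    *result = '\0';
    for (i = first_instruction[s]; i < first_instruction[s + 1]; i++)
      strncat(result, text_buffer + instructions[i].offset,
              instructions[i].count);
  }

  int main(void)
  {
    char fulltext[64];
    unsigned int s;

    for (s = 0; s < 3; s++)
      {
        reconstruct(s, fulltext);
        fputs(fulltext, stdout);   /* prints the three original texts */
      }
    return 0;
  }

Note how strings 1 and 2 only add "Goodbye " and " again\n" to the text buffer; everything else is expressed as copies of data that is already there.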

Preliminary results

Test data set: first revisions of fs_fs.c

  • 644 files
  • 142 MB total
  • 379 kB max. file size

Data sizes:

  • 766 kB text body
  • 259k instructions (1.01 MB) without common sequence optimization
  • 357 kB on disk (using quick_pack)

Execution times:

  • 94 ms creation (1510 MB / s)
  • 60 ms optimization (2370 MB / s)
  • 4.1 ms compression
  • 158 ms total write time (900 MB/s data in, 2.2 MB/s out to disk)
  • 4 ms unpack
  • 19 ms reconstruct (7,200 MB/s)
  • 23 ms total read time (6,200 MB/s data out, 15 MB/s in from disk)

When imported into an empty 1.8 repository:

  • 1.3 MB revs
  • 780 ms for svnadmin dump