Status

Current state: Accepted

Discussion thread: https://lists.apache.org/thread/8xf5245tclf1mb18055px47b982rdg4b

JIRA: CASSANDRA-18397

Released: 5.0.0

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Making compaction choices in Cassandra is currently a very difficult task, with compaction strategies serving specific needs or coming with hard-to-understand sets of shortcomings. As it is rather difficult to switch strategies later, users are often left with suboptimal choices, resulting in wasted resources and poorer-than-necessary performance.

We are looking to create a compaction strategy that makes for a good default, can cover the applications of the existing strategies and solves their problems, and is easy to switch to and reconfigure as necessary. Ultimately we aim to make the strategy adaptive, so that it can take the task of making suitable compaction choices from the user.

Audience

Cassandra users and developers.

Goals

A new “Unified Compaction Strategy” (UCS) is to be introduced. The strategy will be a back-to-basics approach to compaction, which:

  • Covers the applications of levelled, tiered and time-windowed compaction strategies, including combinations of levelled and tiered in different levels of the compaction hierarchy.
  • Can perform multiple compactions on the same level in parallel.
  • Has an easy-to-understand read vs. write amplification tradeoff and satisfies tight bounds for both.
  • Does not recompact on reconfiguration and fully takes advantage of all prior compaction work.
  • Understands the existing layouts of STCS and LCS and can be switched to without requiring extensive recompaction.
  • Drastically reduces the size of sstables (and required space overhead) compared to STCS.
  • Reduces the complexity of the strategy and the amount of out-of-spec write amplification compared to LCS. Does not suffer from L0-to-L1 compactions blocking all other L0 or L1 compactions.
  • Maintains time order and performs whole-sstable expiration, making a separate time series compaction strategy unnecessary.

As a longer term goal we plan to introduce a compaction controller that can switch compaction parameters automatically to fit the client workload.

Non-Goals

This work does not aim to rewrite or refactor any of the compaction process itself.

Additionally, at this time there is no intention to integrate support for tiered storage where multiple types of storage technology are used.

Proposed Changes

The basic design of the strategy is an evolution of the straightforward size-tiered compaction strategy, with the following key features:

  • Tiering by density
    • UCS lays out sstables in levels based on their “density”, defined as the size of an sstable divided by the spanned token range (adjusting for any holes in the locally-owned ranges).
    • Levels are defined as bands on the logarithmic scale of the density starting from an estimated “flush size”.
  • Sharding
    • Every level of the compaction hierarchy is sharded, i.e. split into multiple token ranges that form independent compaction spaces.
    • The number of shards for a given level is determined by the level’s density band.
    • On writing an sstable (both flush and compaction), an expected output level is determined and the output is split into one sstable per shard.
  • Configuration space
    • Each level of the compaction hierarchy has an associated integer “scaling parameter” w, which specifies levelled vs. tiered compaction together with the required fan factor.
    • The fan factor f is set to 2 + |w|. That is, the density band for this level covers densities between the previous level’s maximum and f times that maximum (a sketch of the resulting level and trigger computation follows this list).
    • When w < 0:
      • Compaction is triggered whenever there are 2 overlapping sstables in any “bucket” (i.e. shard of a level).
      • This has the effect of compacting data on this level up to f-1 times until it becomes dense enough to fall in the next level of the hierarchy.
      • At rest, at most 1 sstable will be present on this level for any key.
      • This acts as “levelled” compaction. We will print w<0 as Lf (e.g. we will print and read w = L10 to mean w = -8).
    • When w > 0:
      • Compaction is triggered whenever there are f overlapping sstables in any bucket.
      • This has the effect of compacting data once on this level to make it dense enough to fall in the next.
      • At rest, at most f-1 sstables will be present on this level for any key.
      • This acts as “tiered” compaction. We will print w>0 as Tf (e.g. we will print and read w = T4 to mean w = 2).
    • w = 0 is a common ground for both levelled and tiered:
      • Both of the descriptions above are valid and simplify to the same thing.
      • We will print w=0 as N, and we will also permit users to specify this as T2 or L2.
    • (As is common in the compaction literature, compaction is described assuming no overwrites or deletions, as this is the worst-case scenario.)
  • Time order
    • The strategy preserves time order, i.e. whenever it chooses sstables to compact, it groups sstables with the closest timestamps.
    • This avoids unnecessarily mixing new data with older sstables, and keeps higher-level sstables free of new data, enabling whole-sstable expiration.
  • Reserved threads
    • Under sustained load the compaction workload needs to be split among the levels of the compaction hierarchy proportionally to each level’s write amplification.
    • To facilitate this, the strategy assigns a number of compaction threads to each compaction level. Reserved threads do not start compactions on other levels, so that they can react quickly (e.g. when a flush creates a compaction opportunity on L0).
    • Non-reserved threads pick the level to work on randomly.
  • Level-skipping
    • If compaction is late, i.e. the number of overlapping sstables in a bucket is much higher than f, a set of sstables to compact is selected in a manner that is expected to create an sstable suitable for a higher level of the hierarchy.
    • Such a compaction is carried out using threads reserved for the target compaction level.
    • The compaction result is thus expected to “skip levels”, reducing the number of compactions that need to be carried out and giving compaction a chance to “catch up”.

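To make the mechanics above concrete, the following minimal sketch shows how an sstable’s density could determine its level and how the level’s scaling parameter determines the compaction trigger. It is an illustration of the rules described in this section; the class and member names (and the flush-density estimate) are assumptions made for the example, not the actual UnifiedCompactionStrategy code.

    // Illustrative sketch only; names and the flush-density estimate are assumptions.
    final class UnifiedSketch
    {
        final double flushDensity;      // estimated density of a freshly flushed sstable
        final int[] scalingParameters;  // scaling parameter w per level, lowest level first

        UnifiedSketch(double flushDensity, int... scalingParameters)
        {
            this.flushDensity = flushDensity;
            this.scalingParameters = scalingParameters;
        }

        // The last configured value is repeated for all higher levels.
        int scalingParameter(int level)
        {
            return scalingParameters[Math.min(level, scalingParameters.length - 1)];
        }

        int fanFactor(int level)
        {
            return 2 + Math.abs(scalingParameter(level));   // f = 2 + |w|
        }

        // Density: sstable size divided by the fraction of the token space it spans
        // (adjusted for holes in the locally-owned ranges).
        static double density(long onDiskSize, double spannedTokenFraction)
        {
            return onDiskSize / spannedTokenFraction;
        }

        // Levels are consecutive density bands on a logarithmic scale starting at the
        // flush density; the band of level i ends at fanFactor(i) times the end of the
        // band of level i - 1.
        int levelFor(double density)
        {
            int level = 0;
            double bandMax = flushDensity * fanFactor(0);
            while (density >= bandMax)
                bandMax *= fanFactor(++level);
            return level;
        }

        // Number of overlapping sstables in a bucket that triggers a compaction:
        // 2 when w < 0 (levelled), f when w > 0 (tiered); the two coincide at w = 0.
        int compactionThreshold(int level)
        {
            return scalingParameter(level) < 0 ? 2 : fanFactor(level);
        }
    }

For example, with T4 on every level (w = 2, the proposed default) compactionThreshold returns 4 on every level, matching STCS’s default threshold; with L10 (w = -8) it returns 2, and data is compacted up to 9 times within a level before its density reaches the next band.
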
This compaction strategy can work in modes similar to STCS (with w = T4 matching STCS’s default threshold of 4) and LCS (with w = L10 matching LCS’s default fan factor of 10), and can also work well enough for time-series workloads when used with a large tiered fan factor (e.g. w = T20). Read-heavy workloads, especially ones that cannot benefit from bloom filters or time order (i.e. wide-partition non-time-series workloads), are best served by levelled configurations. Write-heavy, time-series or key-value workloads are best served by tiered ones.

Generally, increasing w improves write amplification (WA) at the expense of read amplification (RA). Decreasing it improves RA at the expense of WA. Very low values of w (i.e. levelled compaction with a large factor, e.g. L1000) operate like a sorted array, where every write is very costly, but reads are quick. Very high values of w (i.e. tiered compaction with a large factor, e.g. T256) act like an unsorted log, where writes are very easy, but finding a piece of data is very complex and slow.
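
To quantify this tradeoff: from the per-level behaviour described earlier (data written once per level in tiered mode and up to f-1 times per level in levelled mode; up to f-1 overlapping sstables per key and level in tiered mode and at most 1 in levelled mode), the following back-of-the-envelope bounds hold for a uniform fan factor f, total data density D and flush density D_0. They are illustrative, following the standard analysis, rather than quoted from the implementation:

    \mathrm{levels} \approx \log_f (D / D_0)
    \text{tiered } (w > 0):\quad \mathrm{WA} \approx \log_f (D / D_0), \qquad \mathrm{RA} \approx (f - 1)\,\log_f (D / D_0)
    \text{levelled } (w < 0):\quad \mathrm{WA} \approx (f - 1)\,\log_f (D / D_0), \qquad \mathrm{RA} \approx \log_f (D / D_0)

Increasing the fan factor in tiered mode reduces the number of levels and hence WA while RA grows roughly linearly with f; increasing it in levelled mode does the reverse, which is exactly the tradeoff stated above.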

In tiered mode, sharding on upper levels of the hierarchy ensures that the space overhead is very limited (a small multiple of a selected “target sstable size”). As shard boundaries are preset, compaction can start and progress independently on any bucket, which provides for much higher levels of concurrency. This is even more pronounced in levelled scenarios where no sstables are shared between compactions of different levels.

The design also permits hybrid compaction modes (e.g. tiered on the lower levels, levelled on the higher) as proposed by e.g. the Dostoevsky paper.

Because all selected compactions increase the density of the resulting sstables, work done for any compaction configuration is also useful and treated as such by other configurations, which makes it possible to switch to a different configuration without losing the advantage of all prior work done. This includes switching from legacy STCS or LCS.


Further documentation of the current state of the proposal can be found in the WIP implementation branch.

New or Changed Public Interfaces

The new compaction strategy class, “UnifiedCompactionStrategy”, can be selected in a table’s compaction configuration. At minimum it will accept the following parameters:

  • scaling_parameters: this takes a list of scaling parameter designations (e.g. L4, T8, N) and specifies the fan factor and compaction method to use for each level of the hierarchy. If the list is shorter than the number of levels, the last value is repeated for all higher levels.
    Default value: T4 (i.e. tiered compaction with fan factor 4 for all levels of the hierarchy).
  • target_sstable_size: a desirable size of sstables on disk. This determines the number of shards that higher levels of the hierarchy are split into. The strategy will calculate the number of shards so that it expects to split data into sstables of size between target_sstable_size / √2 and target_sstable_size * √2. Lower values increase the overall number of sstables on disk, which improves streaming efficiency and compaction parallelism and reduces the disk space overhead, at the expense of higher memory usage.
    To ensure that the split points of lower levels are also split points of higher ones, the number of shards is chosen to be a power of two multiple of the base shard count.
    Default value: tbd
  • base_shard_count: a minimum number of shards, used when the target size would cause the level to be split into fewer shards. This determines a minimum concurrency level for L0 and other lower levels of the hierarchy, which can otherwise become bottlenecks.
    Default value: 4 (1 for system tables)
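
As an illustration of how target_sstable_size and base_shard_count interact, the sketch below derives a shard count under the rules stated above. It is a sketch, not the actual implementation: the exact rounding may differ, and expectedLocalSize is an assumed name for the amount of data a flush or compaction is expected to write over the locally-owned token range.

    // Illustrative sketch of the shard-count selection described above.
    static int shardCount(long expectedLocalSize, long targetSstableSize, int baseShardCount)
    {
        double idealShards = (double) expectedLocalSize / targetSstableSize;
        if (idealShards <= baseShardCount)
            return baseShardCount;   // never fewer shards than the configured minimum

        // Round to the nearest power-of-two multiple of the base count (in log space),
        // so that split points of lower levels remain split points of higher levels and
        // the resulting sstable sizes stay within a factor of sqrt(2) of the target.
        int exponent = (int) Math.round(Math.log(idealShards / baseShardCount) / Math.log(2));
        return baseShardCount << exponent;
    }

Rounding the exponent in log space keeps the resulting sstable sizes within a factor of √2 of the target, and restricting the count to power-of-two multiples of the base keeps lower-level split points aligned with those of higher levels.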


Example:

ALTER TABLE … WITH compaction = {'class': 'UnifiedCompactionStrategy', 'scaling_parameters': 'T8, T4, N, L4'};
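
In this example (assuming, as described for scaling_parameters above, that the list is applied from the lowest level up), L0 would use tiered compaction with fan factor 8, L1 tiered with fan factor 4, L2 fan factor 2 (where tiered and levelled coincide), and L3 and all higher levels levelled compaction with fan factor 4, i.e. a hybrid configuration of the kind mentioned under Proposed Changes.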

Compatibility, Deprecation, and Migration Plan

As for any compaction strategy, it is always possible to switch a table to and from the new strategy. Switching from UCS to a legacy strategy would be painful and require a significant amount of recompaction, because the legacy strategies are unable to understand the layout of UCS sstables. Switching from legacy strategies, when a matching configuration is selected (e.g. T4 for default STCS), should be easy and not require any additional compaction.

After a period of testing, the new strategy should be made default, using settings corresponding to the current STCS default. The legacy LCS and STCS compaction strategies will remain present for at least one major Cassandra version. At the end of this period, provided that this strategy proves to be able to convert from legacy strategies painlessly, an automatic upgrade mechanism will be implemented to deprecate LCS and STCS in favour of equivalent UCS configurations.

Deprecation of TWCS will be assessed separately; as it may still have advantages over UCS in some scenarios, it is likely that it will remain in use longer.

Test Plan

Unit tests and microbenchmarks will accompany the patch. Stress tests with large data sizes (1 and 10 TB of data) will be performed with key-value, wide-partition and time-series workloads, along with fuzz testing to look for unexpected behaviour.

Rejected Alternatives

An alternative approach to solving the space amplification problem of STCS is to flush sstables into non-overlapping runs split when a certain size threshold is reached, similarly to the approach taken by LCS. Since the boundaries between individual sstables in this approach vary between runs, one must either compact a whole level at a time, or compact some sstables that only partially overlap. The former makes it quite hard to perform concurrent compactions on the same level, and the latter increases the write amplification, which is quite undesirable for tiered levels, and may introduce oddities in densities as well as time order.

We also tried a mixed approach, where shard boundaries are fixed, but we only split sstables when they reach a certain minimal size, combined with a single fixed shard count for all levels of the hierarchy. This works well for the upper levels of the hierarchy (when sstables become big enough to be split in all shards), but has similar problems to the above for the lower ones: it either requires serial full-level compactions, which become a bottleneck for sustained throughput, or causes lower-level sstables to remain stranded, adding to the effective read amplification and violating time order (old data stays in the low levels of the hierarchy). Moreover, the fixed shard count is difficult to manage when the total data size is not known in advance or changes with time.

In levelled mode, UCS as described does not perform compactions into the next level like LCS. At first glance it may appear that this makes it less efficient; in fact it does the same thing, with slightly better precision. The trigger for LCS to compact is a level exceeding its size threshold, in which case it selects an sstable from the run to compact into the next level. If we consider UCS working with the same set of sstables, in such a case there would be at least one sstable whose density is above the band for the level, which means that it would be treated by the strategy as already belonging to the next level, immediately triggering compaction if another sstable run already resides there. In other words, it has the same effect and efficiency, but is much simpler, does not let one level restrict compactions initiated on another (e.g. L0-to-L1 and L1-to-L2), and does not need explicit level designations.

Specifically for time-series workloads, we also considered a mode of operation where the mechanism of grouping sstables into buckets is replaced by (or extended to include) a time component. This would make the strategy able to match TWCS. At this point we consider this modification unnecessary, as levels of the hierarchy naturally become time buckets, as long as time order is maintained when compactions are chosen. 

