Status

Current state: Committed

Discussion thread: here

Released: Unreleased

Motivation

One of the pain points of Cassandra is garbage collection pauses. Garbage collectors are great at handling short-lived objects and good with near persistent ones, but unfortunately our memtables do not fall in either of these categories. This forces user to choose between:

long garbage collection pauses
small memtable size

We tend to opt for the latter, which entails lower query performance, but also has serious implications on the efficiency of compaction, including in the number of compactions data has to go through (i.e. the database’s write amplification).

An alternative memtable implementation that does not keep its indexing structures on the heap can help with garbage collection efficiency and improve the amount of data we can keep in memory, which in turn should provide tangible improvements to performance, both in tail latencies and in throughput. To further improve efficiency, it would be best if we can offer solutions tailored to specific problems, at the very least avoid having to be great at legacy features and use cases.

Another interesting application of memtable pluggability is the option to use specialized memtable implementations like Intel’s persistent memory solution (see CASSANDRA-13981). This effort was initially implemented against Cassandra pluggable storage engine API (CASSANDRA-13475) and can be easily adapted to a pluggable memtable.

Audience

Cassandra operators who want to configure the database for optimal performance, as well as operators with specialized hardware.

Developers of the Cassandra database and hardware vendors.

Goals

To provide an interface that permits users to select a memtable implementation suitable for their workload and hardware, including long-lived and persistent memtables.

To provide a mechanism for testing and proving alternative memtable implementations.

To provide a means of supporting deprecated legacy features without burdening new alternative memtable implementations.

Non-Goals

We are not aiming to fully replace Cassandra’s storage engine, or to change any of the data presentation interfaces (e.g. rows, tombstones, iterators).

Proposed Changes

The proposed design consists of a new memtable option in the server configuration yaml as well as a per table setting, the extraction of a memtable interface, and the addition of options for controlling flushing and commit log behaviour.

The memtable option will specify an implementation class and any options needed to configure it. This is to be configurable both in the table schema (to enable the selection of a memtable more suited to its access patterns, performance requirements, resource assignment or legacy features), as well as specified in the node yaml (mainly to enable changing the default, but possibly also switching the mechanism for individual nodes e.g. to facilitate up/downgrades).

The memtable interface (detailed in the next section) encompasses mutation, retrieval and flushing functionality, plus diagnostics and lifecycle management. It will be decoupled from the column family store (CFS) and any database state beyond the table metadata. The memtable option selects a memtable factory that will be used to create memtable instances.

Several new additions will be made to the column family store functionality:

Every flush request will be given a “reason” that is logged and also passed on to the memtable. The memtable can decide to refuse a flush request, if it has a different method of handling the reason (for example, handling schema changes internally).
The memtable implementation will be consulted before data is sent to the commit log. It can declare that the data in the memtable is already persistent (i.e. it does not need to be preserved in the commit log, but it may be sent to the log for the purposes of point-in-time restore or changed-data-capture) or that it can bypass the commit log altogether.
The streaming code will be enhanced with an option to stream from a memtable (by flushing to a temporary sstable) and stream to a memtable (disabling zero-copy streaming). These options can be chosen by the memtable implementation to facilitate long-lived and persistent solutions.

Tight coupling between CFS and memtable will be reduced: flushing functionality is to be extracted, controlling memtable memory and period expiration will be handled by the memtable.

New or Changed Public Interfaces

Database users

(Note: the final version ended up with a somewhat different scheme described in the bundled documentation.)

The primary user-facing change is the addition of a new table parameter “memtable”, specified in table definitions like this:

CREATE TABLE ... WITH memtable = { 'class' : 'HashOrderedMapMemtable' };

At minimum, a non-empty memtable parameter must specify a memtable “class”. The specific class may have other configurable parameters.

Tables that do not specify the memtable parameter will use the default configuration, which can be specified in the “memtable” option in cassandra.yaml:

memtable: class: SkipListMemtable

Alternatively, to support the need for per-node configuration (e.g. to better optimize heterogeneous deployments, or to test a memtable implementation switch on a small set of nodes), the API can be configured using templates specified in cassandra.yaml, for example:

memtable_templates:
trie:
class: TrieMemtable
shards: 16
skiplist:
class: SkipListMemtable
memtable:
template: skiplist

(which defines two templates and specifies the default memtable implementation to use). The template can then be specified in the table definition like this:

CREATE TABLE ... WITH memtable = { 'template' : 'trie' }.

The initial implementation may leave existing memtable configuration parameters (memory mode, memory size limits and flush thresholds, expiration time) as top-level parameters in the yaml, with the expectation that they will be gradually replaced with memtable-specific ones and deprecated.

Memtable implementers

The main interface that needs to be implemented is extracted from the current memtable code and includes:

Write and read operations: put, getPartition and makePartitionIterator.
Statistics and features, including partition counts, data size, encoding stats, written columns.
Memory usage tracking, including methods of retrieval and of adding extra allocated space (used by non-CFS secondary indexes).
Flush functionality, preparing the set of partitions to flush for given ranges with relevant statistics.
Lifecycle management, i.e. operations that prepare and execute switch to a different memtable, together with ways of tracking the affected commit log spans.

Beyond thread-safety, the concurrency constraints of the memtable are intentionally left unspecified. Current implementations provide atomicity and sequential consistency of operations on individual partitions, but some use cases may benefit from memtables with relaxed consistency models.

The memtable class must also implement a static
Memtable.Factory factory(Map<String, String> furtherOptions)
method, which is used to parse the parameters given with the table definition and construct a factory, which in turn is used to create the individual memtables.

If the implementation has no configurable parameters (as e.g. current SkipListMemtable uses top-level cassandra.yaml parameters), it can instead declare a static FACTORY field containing a singleton factory instance.

Some features of the memtable implementation may be needed before an instance is created or where the memtable instance is not accessible. To make working with them more straightforward, the following memtable-controlled options are implemented on the factory:

boolean writesAreDurable() should return true if the memtable has its own persistence mechanism and does not want the commitlog to provide persistence. In this case the commit log can still store the writes for CDC or PITR, but it need not keep them for replay until the memtable is flushed.
boolean writesShouldSkipCommitLog() should return true if the memtable does not want the commit log to store any of its data. The expectation for this flag is that a persistent memtable will take a configuration parameter to turn this option on to improve performance, which would make configuration for specific persistent-memory-backed tables easier than the pre-existing method through the non-durable-writes keyspace option whose semantic doesn’t quite match the intent of this.
boolean streamToMemtable() and boolean streamFromMemtable() should return true if the memtable is long-lived and cannot flush to facilitate streaming. In this case the streaming code will implement the process in a way that retrieves data in the memtable before sending, and applies received data in the memtable instead of directly creating an sstable.

The full details of the proposed memtable api can be found in the current implementation here: https://github.com/datastax/cassandra/blob/memtable-api/src/java/org/apache/cassandra/db/memtable/Memtable.java

An initial implementation of the proposal can be seen in this branch: https://github.com/datastax/cassandra/tree/memtable-api

The following branch adds work-in-progress versions of two more memtable implementations, TrieMemtable (from DSE 6.8) and HashOrderedMapMemtable (CASSANDRA-7282): https://github.com/datastax/cassandra/tree/memtable-api-nbhom

Compatibility, Deprecation, and Migration Plan

There should be no effect on existing users that do not explicitly select a new memtable. Eventually we may decide to change the default memtable implementation to a different non-persistent one. As non-persistent memtables do not outlive the run time of the Cassandra process, such a switch should not require any migration. Beyond performance, there should be no visible effect of such a change.

The same applies to tables that are configured to use a different non-persistent implementation: no effect other than performance differences should be visible and any new selection would apply as soon as the next memtable flush or process restart. The commit log and any sstables will look and work exactly as before, point-in-time-restore and change-data-capture (if they are used) will continue to work unchanged.

Persistent memtables may store data differently and will need migration. To introduce persistent memory, we expect users to either start new tables or new datacentres, and the standard bootstrapping process should work well with streaming the data into the memtable. The inverse process, turning off persistent memory memtables by changing the table’s selected memtable in the schema, will automatically trigger flushing of the data to sstables.

Users of point-in-time-restore or changed-data-capture must configure persistent memtables to allow data to be sent to the commit log (i.e. specify writesAreDurable but not writesShouldSkipCommitLog). If this is done, the commit log will have an unchanged format and the two options will work without any changes. (Note: point-in-time-restore also requires a memtable-specific method of capturing and restoring snapshots.)

Test Plan

For any new memtable implementation we want to run the full battery of Cassandra tests with it, thus a method of selecting the memtable implementation for unit and dtests must be provided. To test the new long-lived/non-flushing functionality, a PersistentMemoryMemtable mock based on SkipListMemtable will be made available, and a subset of the tests will be run with it as the selected implementation.

Rejected Alternatives

One of the options to implement alternative memtables would be by replacing the whole of the storage system, using a variation of CASSANDRA-13475. To discuss why this was rejected, consider the two main uses of this proposal separately:

For non-persistent memtables, this alternative would require duplication of the on-disk format, streaming and compaction implementations, with all associated risk.
For persistent memtables, our approach provides the option to spill to disk, i.e. to use a hybrid system where persistent memory is a part of the storage system and is combined with slower but larger disk, at no extra cost. Crucially, this also makes it trivial to migrate from persistent memory, and appears to greatly simplify implementing features like streaming and repair.

While we intentionally scoped this work to only implement a pluggable memtable API, one can easily argue that persistent memtables are in fact a mechanism for plugging in different storage engines — a memtable that never flushes has little practical difference from an alternate storage system. It may be worthwhile to investigate if this solution can offer enough flexibility to serve the purposes for which alternate storage engines (CASSANDRA-13475) are being considered.

A second option that was considered, specifically for non-persistent memtable implementations, was to simply replace the current implementation with something better. This is the method DataStax used in DSE 6.8, with the advantage that only one memtable implementation needs to be supported in the future. However, this cannot address all of the goals of this proposal, namely:

Support for persistent storage.
Not burdening new solutions with supporting legacy features.

In terms of the support burden, we would argue that providing the future option as a pluggable module is no worse because older versions of the database are supported for a long time, and implementing a fix for the legacy option in a new version would be a trivial port in code that hasn’t changed much (or at all).

A third choice that was considered was to package the commit log with the memtable. Conceptually these two pieces of the storage engine form one component — the LSM buffer of Cassandra, and as such it makes a lot of sense to bundle them together. This is even more evident in the case of memtables that operate on persistent memory, where a commit log is not necessary at all. Additionally, it makes sense to consider other singular structures that serve both purposes, e.g. append-only B-Trees, to use space more efficiently.

A bundled commit log is probably a better overall solution. Unfortunately, developments for CASSANDRA-13475 have added extra layers of separation between memtables and commit log which cannot be preserved, while features such as change-data-capture require a commit log to function. This approach may be possible, but would be significantly more difficult and greatly increase the scope of this change.

Space shortcuts

Page tree

Status

Motivation

Audience

Goals

Non-Goals

Proposed Changes

New or Changed Public Interfaces

Database users

Memtable implementers

Compatibility, Deprecation, and Migration Plan

Test Plan

Rejected Alternatives

Space shortcuts

Page tree

CEP-11: Pluggable memtable implementations

Status

Motivation

Audience

Goals

Non-Goals

Proposed Changes

New or Changed Public Interfaces

Database users

Memtable implementers

Compatibility, Deprecation, and Migration Plan

Test Plan

Rejected Alternatives