Status
Current state: Accepted
Discussion thread: here
JIRA: CASSANDRA-17240
Released: 5.0.0
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
Memtables in Cassandra are large, have a complex on-heap structure, and live relatively long, which makes them a pain point for memory management and garbage collection.
We would like to propose an alternative memtable implementation based on tries. It provides the following advantages:
- More compact storage with shared key prefixes.
- Can be placed off-heap and is garbage-collection-friendly when used on-heap.
- Fast lookups, resulting in faster writes and memtable reads.
Our internal testing has shown that this can provide dramatic performance improvements, including close to double the sustained write throughput, combined with drastically reduced tail latencies under load.
The implementation is already battle-tested in production as the memtable format in DSE 6.8.
Audience
Cassandra developers
Goals
- Eliminated or drastically reduced garbage collection pause times.
- Reduced memory footprint of the memtable for a given amount of data.
- A feasible memtable size several times larger than currently practical.
- Larger flushed sstables, leading to better compaction efficiency.
- Higher write throughput and lower local write and read latency.
Non-Goals
The proposal has potential extensions, including more efficient methods of collating data from multiple sources (i.e. replacing MergeIterator hierarchies), sharing data between memtables and the commit log, and more efficient storage formats based on the trie interface detailed here. These are out of scope for the current proposal.
Proposed Changes
An alternative memtable implementation is provided alongside the legacy skip-list solution. The implementation is made active using the memtable API (CEP-11 / CASSANDRA-17034).
The initial version of the memtable will replace the concurrent skip-list partition map with a sharded single-writer trie. To maintain partition order, all keys are mapped to their byte-comparable representations using the translation from CASSANDRA-6936. To minimize the size of the structure, the keys are stored only in the trie paths and converted back to the standard format on retrieval.
In later iterations this will be expanded to include the partition-to-row maps, forming a direct map to rows and doing away with most of the complexity and overhead of maintaining separate partition maps.
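As a rough illustration of the byte-comparable idea, and not the actual CASSANDRA-6936 translation (which covers all Cassandra key types), the sketch below encodes a signed integer so that unsigned lexicographic byte order matches numeric order, and decodes it back the way a key would be materialized from a trie path:

import java.util.Arrays;

public class ByteComparableSketch
{
    // Encode a signed int so that unsigned lexicographic byte order equals
    // numeric order: flip the sign bit, then emit big-endian bytes.
    static byte[] encodeInt(int value)
    {
        int v = value ^ 0x80000000;
        return new byte[] { (byte) (v >>> 24), (byte) (v >>> 16), (byte) (v >>> 8), (byte) v };
    }

    // Reverse the transformation, the way a key would be rebuilt from a trie path.
    static int decodeInt(byte[] path)
    {
        int v = ((path[0] & 0xFF) << 24) | ((path[1] & 0xFF) << 16) | ((path[2] & 0xFF) << 8) | (path[3] & 0xFF);
        return v ^ 0x80000000;
    }

    public static void main(String[] args)
    {
        byte[] a = encodeInt(-5);
        byte[] b = encodeInt(3);
        // -5 sorts before 3, and the encoded forms compare the same way byte by byte.
        System.out.println(Arrays.compareUnsigned(a, b) < 0);  // true
        System.out.println(decodeInt(a));                      // -5
    }
}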
The memtable utilizes a custom trie implementation, which stores its intermediate structure in on- or off-heap byte buffers. To bring complexity down, the trie only supports a single writer, but multiple reads can proceed concurrently with a write. To enable higher write concurrency, the memtable shards the local ranges into multiple tries, each of which can execute a write independently and serve any number of concurrent reads.
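The following is a minimal sketch of the sharding scheme under the assumptions stated in its comments: ShardTrie is a stand-in for the actual single-writer trie, writes to a shard are serialized by a per-shard lock while reads never take it, and hash-based shard selection is a simplification of splitting the locally-owned ranges.

import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.locks.ReentrantLock;

public class ShardedMemtableSketch
{
    // Stand-in for the single-writer trie: one writer at a time, readers at any time.
    static final class ShardTrie
    {
        private final ConcurrentSkipListMap<String, String> data = new ConcurrentSkipListMap<>();
        void put(String key, String value) { data.put(key, value); }  // called by the single writer
        String get(String key) { return data.get(key); }              // safe under concurrent reads
    }

    private final ShardTrie[] shards;
    private final ReentrantLock[] writeLocks;

    ShardedMemtableSketch(int shardCount)
    {
        shards = new ShardTrie[shardCount];
        writeLocks = new ReentrantLock[shardCount];
        for (int i = 0; i < shardCount; i++)
        {
            shards[i] = new ShardTrie();
            writeLocks[i] = new ReentrantLock();
        }
    }

    // The real memtable splits the locally-owned token ranges between shards;
    // hashing the key is a simplification for this sketch.
    private int shardFor(String key)
    {
        return Math.floorMod(key.hashCode(), shards.length);
    }

    void write(String key, String value)
    {
        int s = shardFor(key);
        writeLocks[s].lock();  // serialize writers per shard
        try
        {
            shards[s].put(key, value);
        }
        finally
        {
            writeLocks[s].unlock();
        }
    }

    String read(String key)
    {
        return shards[shardFor(key)].get(key);  // reads never take the lock
    }
}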
Internally, the trie stores data in blocks of fixed size and utilizes several different node types corresponding to common patterns in a trie’s structure. A full description of the storage format and its operation is given in the bundled MemtableTrie.md.
The implementation also includes a generic Trie interface that provides an efficient method of traversing tries as well as merging and slicing them. The interface utilizes a “cursor” concept to drastically reduce the number of intermediate objects required for traversing and transforming deep hierarchies, and can be used as a basis for improvements in Cassandra’s data presentation layers (i.e. UnfilteredPartitionIterator) and compaction. Detailed documentation is provided in the bundled Trie.md.
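To make the cursor concept concrete, here is an illustrative interface (the method names mirror the idea described in Trie.md but are not necessarily the exact shipped signatures), together with a traversal that rebuilds keys from the depth/transition stream using one reusable buffer instead of per-node objects:

public interface TrieCursorSketch<T>
{
    // Advance to the next node in depth-first order;
    // returns the new depth, or -1 when the traversal is exhausted.
    int advance();

    // The byte labelling the edge that leads into the current node.
    int incomingTransition();

    // Content attached to the current node, or null for intermediate nodes.
    T content();

    // A consumer rebuilds full keys from the depth/transition stream using a single
    // reusable buffer (assumed large enough here), allocating no per-level objects.
    static <T> void dump(TrieCursorSketch<T> cursor)
    {
        byte[] path = new byte[1024];
        int depth;
        while ((depth = cursor.advance()) != -1)
        {
            path[depth - 1] = (byte) cursor.incomingTransition();
            T content = cursor.content();
            if (content != null)
                System.out.println(new String(path, 0, depth) + " -> " + content);
        }
    }
}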
New or Changed Public Interfaces
The proposal provides a new memtable implementation, selected using the “memtable” table option:
CREATE TABLE … WITH memtable = {'class': 'TrieMemtable', 'shards': 16};
The “shards” option is optional; by default the number of shards is set to the number of CPU threads available to the process.
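A small sketch of how that default could be resolved follows; the option parsing shown is illustrative rather than the actual TrieMemtable factory code.

import java.util.Map;

public class ShardOptionSketch
{
    // If the "shards" option is absent, fall back to the number of CPU threads
    // visible to the JVM, as described above.
    static int shardCount(Map<String, String> options)
    {
        String shards = options.get("shards");
        return shards != null ? Integer.parseInt(shards)
                              : Runtime.getRuntime().availableProcessors();
    }
}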
Compatibility, Deprecation, and Migration Plan
Initially, existing users will not be impacted unless they choose the new implementation. If they do, there can be no effects other than performance changes, because memtables only live as long as the running process. Migration onto and off of the new implementation should be completely transparent to users.
We expect the default memtable format to be switched to the new one after a period of testing. The legacy skip-list version can be deprecated as soon as the community is satisfied that the new solution works at least as well for all workloads.
Test Plan
In addition to unit tests of the provided implementations, the full Cassandra test suite will also be run with trie memtables (either in the form of a specific additional test target, or, to keep costs down, by changing one of the existing targets that does not already modify memtable behaviour).
Rejected Alternatives
Why not a B-Tree?
A B-Tree is the default choice for database maps, but it is a structure that relies solely on comparisons of keys. A trie, on the other hand, can make use of the lexicographic / byte-by-byte-comparable order of the keys (provided by the CASSANDRA-6936 translation) and thus be qualitatively more efficient in both storage size and the performance of lookups and unions/intersections.
While it is possible to customize B-Trees to also rely on and utilize byte order, a structure that is directly built for that is inherently simpler.
Why not use an existing trie implementation?
The state of the art is the adaptive radix tree (ART), which shares many ideas with our implementation as well as with prior work on sstable indexes. Unfortunately, no known implementation addresses our main objectives: avoiding complex long-lived structures on the heap and allowing reads to execute concurrently with a write.
Why not a fully concurrent structure?
Support for concurrent writes drastically increases the complexity of any structure, to an extent that would make the features of the current implementation infeasible. In practice, we have also seen that sharded memtables scale much better as the number of CPU threads on a machine increases.
The downside is that with a per-shard lock there is a higher chance of hot-spotting due to different partitions hitting the same shard. Since Cassandra is built with the premise that hashing distributes load evenly across partitions, this is a good tradeoff for all applications that use a hashing partitioner. This includes all normal workloads with the exception of legacy secondary indexes.
Hot individual partitions are a more realistic problem, which is also an issue for concurrent solutions due to the need to provide atomicity and consistency guarantees for all mutations on the same partition. The proposed solution is slightly worse in this respect (because the lock periods also cover the map lookup and mutation) but not prohibitively so, and there are ways to improve this if it turns out to be a real problem in practice.