Current state: Accepted
Discussion thread: https://lists.apache.org/thread/3dpdg6dgm3rqxj96cyhn58b50g415dyh
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
The sstable primary index in Cassandra is a quite dated design targeted at slow spinning-disk drives and suffers from a range of performance problems. Both the partition and row indexes rely on comparisons of typed multi-component keys, requiring deserialization of keys from disk or cache, and perform inefficient linear searches in blocks containing hundreds of keys. Moreover, the partition index
- maintains a user-managed in-memory summary, which can often grow too large and require manual adjustment;
- to compensate for poor lookup performance maintains a dedicated key cache in memory, with the associated code and management complexity;
- both of the above structures are long-lived complex objects with non-permanent lifetimes, complicating garbage collection.
Additionally, the existing per-partition row index:
- resides between keys in the partition index, adding to the amount of data transferred from disk to find relevant partition keys;
- is either fully fetched from disk, or requires binary block lookup, which can require a page fetch from disk per comparison for large partitions;
- stores full keys that very often include repeated prefixes;
- due to the points above, necessarily has a pretty high block granularity and requires long linear searches inside indexing blocks to find specific rows;
- deserializes indexing blocks fully for reversed queries, which is slow and generates huge amounts of garbage.
Both index structures can be improved dramatically by taking advantage of byte-ordered representations (CASSANDRA-6936) and tries.
Cassandra users and developers
- Improved sstable query performance
- Better handling of wide partitions
- Reduced in-memory and cache footprint of sstable indexes
- Improved garbage collection efficiency
- Reduced operational complexity
Changes to any data storage format or read/write paths outside the scope of primary indexing. Providing alternative bloom filter implementations.
We propose to introduce a new SSTable format called Big Trie-Indexed (BTI) that addresses the problems mentioned earlier. The BTI sstable format shares all components except the primary index and its summary with the legacy BIG format. This makes it trivial to stream, including to older Cassandra nodes that have no knowledge of the new format.
The BTI primary index is split in two components, each residing in its own index file. The partition index file (...-Partitions.db) maps unique prefixes of decorated partition keys to data file locations, or, in the case of wide partitions indexed in the row index file, to locations in the row index file. The row index file (...-Rows.db) only contains entries for partitions that contain more than one row and are bigger than one index block. For all such partitions, it stores a copy of the partition key, a partition header, and an index of row block separators, which map each row key into the first block where any content with equal or higher row key can be found.
Both indexes are structured as tries on disk, which are used directly in their on-disk representation presented in memory-mapped or cached buffers. They take advantage of several optimizations to make them compact and efficient:
- Typed nodes are used to improve size and lookup efficiency by using separate node types for leaf, single-child, sparse and dense nodes.
- Sized pointers encoding distances between a parent and child node are used to reduce the space requirements for storing pointers to children.
- Page-packing places related nodes together in disk pages, which reduces the size of pointers and makes it possible to take multiple steps in the trie before requesting another page from disk.
Partition indexes store only the unique prefix of each partition key and compare the rest of the key as provided by the data or row index file. They also include a single hash check byte which is checked before going to the full key to avoid fetching the relevant disk page on mismatch with high probability. This results in an index which is noticeably smaller even for short keys (e.g. int) and can be many times smaller for longer and/or repetitious keys as it takes advantage of prefix compression and does not store bytes beyond unique prefixes.
This index does not use a separate summary or key cache. The top levels of the trie are consulted often and thus usually remain in the page or chunk cache; their size is typically smaller than the legacy index summary and will be moved in or out of memory automatically by the page or chunk cache to react to memory pressure. When the relevant disk pages are still in cache, lookup in the on-disk index is as fast as a dedicated in-memory key cache.
Like the legacy row index, the proposal still organises rows in row index blocks. Instead of storing both minimum and maximum key for each block, for the trie key it uses block separators, which are greater than all entries of the previous block and smaller than all entries in the next. For any key this allows us to find the block with the closest greater (for reverse iteration, smaller) key, and iterate the blocks in increasing (resp. decreasing) order from there. With each block we store information about the position of the block, together with any deletion that may be active at the start of the block (this is also the deletion active at the end of the previous block for reverse iteration).
The new row index is several times more compact than the legacy, and has no efficiency issues with keeping track of millions of index blocks. To take further advantage of this, the default row index block granularity is changed from 64 to 16kb and we recommend users decrease it further to 4, 1 or even 0kb (index every row) if they need good row lookup performance at the expense of a bigger row index file.
Full description of the sstable index format can be found in this document: Trie-based indices
New or Changed Public Interfaces
The new sstable format will be selected using CEP-17, SSTable format API. This will affect all sstables created by the database after the switching time, including flushed, compacted, upgraded or scrubbed sstables. For any sstables in the new format, the key cache will not be used and all index summary settings will be ignored; however, the key cache will still be instantiated to handle pre-existing sstables in BIG format.
Compatibility, Deprecation, and Migration Plan
Versions of Cassandra that include the new format will be able to read and write BTI as well as the older BIG format of sstables. This will include sstables created by DataStax Enterprise 6, 6.7 and 6.8. Any sstables created after switching to the new format will not be readable by older versions of Cassandra; however, as the data file format does not change, sstable content will still be streamable to older nodes.
One can upgrade to the new version by running
sstableupgrade after switching to the new format. The same operation can be used to downgrade, if BIG is selected as the sstable format in use.
We expect the new format to become default after a period of testing; the ability to read and write BIG format sstables will remain for at least one more major version, and it is likely that the ability to read them will be preserved for longer to enable sstable-upgrade-free upgrades from versions of Cassandra that do not support BTI.
A new test target will be added, or, to save costs, an existing one (e.g.
utests-trie) will be modified, to make sure the full suite of tests is run with the BTI format. Long fuzz tests will also be performed.
The format has been in production use in DataStax Enterprise since version 6, which also attests to its stability.
B-Tree indexes such as Birch still suffer from the inefficiency and complexity of using comparison-based structures, especially if the stored keys can be long. Taking advantage of byte order (e.g. storing only bits beyond a common prefix) can be bolted on to a B-Tree, but results in even more complex code to look suspiciously similar to a trie.
Succinct tries like SuRF require external structures for lookup, which either need to be forced to memory or increase the number of page fetches required to take steps along the trie. It is also highly non-trivial to imagine a page-friendly variation of the succinct structures.
ART, which is an in-memory key-value (i.e. partition) index, is very similar to the ideas of this solution, and shares some of its features (typed nodes, unique prefixes and check bytes). In a sense this proposal can be seen as further development of ART, targeting on-disk indexes.