Current state: Adopted

Discussion thread: here


Released: -

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).


Access to sstables currently goes through implementations of the abstract SSTableReader and SSTableWriter classes. Cassandra has only one sstable implementation, BigTable, with its own reader and writer. This design should allow alternative sstable implementations to be plugged in optionally and individual implementations to be tested in isolation.

Unfortunately, the code is tightly coupled to the current BigTable implementation, and many of its components are cross-referenced with other components of the system. This makes it very difficult to add a new implementation or to develop tests that verify a single component in isolation.

Alternative implementations could include (but are not limited to) a better indexing mechanism, a different data organization, different optimizations, and other improvements or changes.

Even without adding a completely new implementation, introducing breaking changes to the existing sstable format is problematic with the current state of the codebase. Right now it involves updating the version, but the same classes have to keep supporting all previous versions in a single implementation, which pollutes the code with many version-dependent conditional blocks.


  • Cassandra developers who wish to see SSTableReader and SSTableWriter more modular than they are today,
  • Cassandra developers who want to develop and publish different file format implementations.

Proposed Changes

The intended way to create a BigTableReader, the only implementation of SSTableReader, is via an SSTableReader.Factory instance obtained from the desired SSTableFormat instance. The assumption is that a given SSTableFormat instance returns the appropriate factory for creating SSTableReader instances for that specific format.

The factory in fact has only one #open method, which takes just an SSTableReaderBuilder instance as an argument and creates a BigTableReader instance by passing that builder to the constructor. However, this is not the commonly used way to create an SSTableReader instance. In most cases, it is created through one of the many static #open methods in SSTableReader. Those methods internally load metadata, run some loading logic, and then use SSTableReaderBuilder, which is responsible for opening files, loading indexes, loading or recreating bloom filters and index summaries, and for some validation. Eventually, once everything is set up, the SSTableReaderBuilder instance is passed to the constructor of BigTableReader.

BigTableWriter instances are also created via static #open methods in SSTableWriter. However, those methods internally use an SSTableWriter.Factory instance to create the appropriate instance of SSTableWriter.

With SSTableWriter there is also the case of ZeroCopyBigTableWriter, which is not an implementation of SSTableWriter but rather another direct implementation of the SSTable class (the parent of both SSTableReader and SSTableWriter). It is currently created by calling its constructor directly.

Another thing we will have to refactor is how sstable components are defined. The current implementation hardcodes component types as an enum with predefined component instances, but this cannot be retained because some of the components are very specific to the implementation. For example, an sstable does not necessarily need to have index summaries or indexes at all, or it may have multiple index components of different kinds.
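As a rough illustration of format-defined components, the sketch below replaces a fixed enum with a per-format catalogue. Every type here (ComponentType, ComponentCatalogue, ComponentCatalogueDemo) is hypothetical and not part of the Cassandra codebase. The component names in the second format are invented, and the file-name matching rule is an assumption made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical type: a component is described by data instead of a fixed enum constant.
final class ComponentType {
    final String name;        // logical name, e.g. "Data"
    final String fileSuffix;  // trailing part of the file name, e.g. "Data.db"

    ComponentType(String name, String fileSuffix) {
        this.name = name;
        this.fileSuffix = fileSuffix;
    }
}

// Hypothetical per-format catalogue: each format registers only the components it uses.
final class ComponentCatalogue {
    private final Map<String, ComponentType> bySuffix = new LinkedHashMap<>();

    ComponentCatalogue with(String name, String fileSuffix) {
        ComponentType previous = bySuffix.putIfAbsent(fileSuffix, new ComponentType(name, fileSuffix));
        if (previous != null)
            throw new IllegalArgumentException("Duplicate component suffix: " + fileSuffix);
        return this;
    }

    // Maps a file name such as "nb-1-big-Data.db" to a component type, if this format defines one.
    Optional<ComponentType> resolve(String fileName) {
        return bySuffix.values().stream()
                       .filter(type -> fileName.endsWith("-" + type.fileSuffix))
                       .findFirst();
    }

    int size() { return bySuffix.size(); }
}

final class ComponentCatalogueDemo {
    // A subset of components resembling the existing big table format (illustrative only).
    static ComponentCatalogue bigLike() {
        return new ComponentCatalogue()
               .with("Data", "Data.db")
               .with("PrimaryIndex", "Index.db")
               .with("Summary", "Summary.db")
               .with("Filter", "Filter.db");
    }

    // An invented format with two index files and no index summary.
    static ComponentCatalogue twoIndexFormat() {
        return new ComponentCatalogue()
               .with("Data", "Data.db")
               .with("PartitionIndex", "PartIdx.db")
               .with("RowIndex", "RowIdx.db");
    }

    public static void main(String[] args) {
        String file = "nb-1-big-Summary.db";
        System.out.println(bigLike().resolve(file).isPresent());
        System.out.println(twoIndexFormat().resolve(file).isPresent());
    }
}
```

Code outside a format can enumerate or resolve components through the catalogue, but it cannot assume that a summary or an index exists, because no such constant is defined globally.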

Table metrics are currently defined in the TableMetrics class as final instance fields. The problem is that some sstable-related metrics can be specific to certain sstable implementations.

Also, the current implementation of the key cache is tightly coupled with the big table primary index implementation, so this will have to be modified too.

Scrubber and Verifier are implemented for the big table format. They make several assumptions about what is included in the index file and what should be verified or scrubbed, while another format may not even have an index file or may have multiple index files.

Summary of the changes we propose:

  • have a single factory for creating both readers and writers for a particular implementation of sstable and use it consistently - no direct creation of any reader / writer
  • make sstable components and component types flexible and internal to the sstable format implementation - code external to the implementation may operate on components but should not assume any semantics for them; only the sstable implementation should understand the meaning of a given component
  • move the metrics related to the sstable format out of the TableMetrics class and tie them to a specific sstable implementation
  • refactor the caching service so that caching is optional, depends on whether the given sstable implementation supports it, and is sstable implementation agnostic
  • refactor sstable iterators so that they are sstable format implementation agnostic
  • extract interfaces for scrubber and verifier and make the current implementation specific to the big table format; scrubber and verifier should be created by the sstable factory
  • if it is possible, make the SSTableReader interface consistent with memtable so that the same methods and pattern can be used in read commands to read from both sources (refer to Memtable API CEP)
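The first and sixth points above can be sketched as a single per-format factory. This is a minimal illustration under stated assumptions: all names (SketchReader, SketchWriter, SketchScrubber, SketchFormatFactory and the in-memory classes) are hypothetical, and the real interfaces would carry far more context such as descriptors, metadata and lifecycle handling.

```java
import java.util.ArrayList;
import java.util.List;

// All names below are hypothetical; they only illustrate the proposed shape.
interface SketchReader {
    String format();
    boolean contains(String key);
}

interface SketchWriter {
    void append(String key);
    SketchReader finish();
}

interface SketchScrubber {
    boolean scrub();
}

// The single entry point for one format: readers, writers and scrubbers all come from here.
interface SketchFormatFactory {
    String format();
    SketchReader openReader(List<String> keys);
    SketchWriter newWriter();
    SketchScrubber newScrubber(SketchReader reader);
}

final class InMemoryReader implements SketchReader {
    private final String format;
    private final List<String> keys;

    InMemoryReader(String format, List<String> keys) {
        this.format = format;
        this.keys = new ArrayList<>(keys); // defensive copy
    }

    public String format() { return format; }
    public boolean contains(String key) { return keys.contains(key); }
}

final class InMemoryFormatFactory implements SketchFormatFactory {
    public String format() { return "in-memory"; }

    public SketchReader openReader(List<String> keys) {
        return new InMemoryReader(format(), keys);
    }

    public SketchWriter newWriter() {
        final List<String> buffered = new ArrayList<>();
        return new SketchWriter() {
            public void append(String key) { buffered.add(key); }
            // Finishing a writer goes back through the factory; callers never invoke a constructor.
            public SketchReader finish() { return openReader(buffered); }
        };
    }

    public SketchScrubber newScrubber(SketchReader reader) {
        // The format decides what a scrub checks; this toy check only confirms the reader's format.
        return () -> format().equals(reader.format());
    }
}

final class FormatFactoryDemo {
    public static void main(String[] args) {
        SketchFormatFactory factory = new InMemoryFormatFactory();
        SketchWriter writer = factory.newWriter();
        writer.append("k1");
        SketchReader reader = writer.finish();
        System.out.println(reader.contains("k1") + " " + factory.newScrubber(reader).scrub());
    }
}
```

Callers in this sketch depend only on the interfaces, so a second format could be added by supplying another SketchFormatFactory without touching the calling code.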

Compatibility, Deprecation, and Migration Plan

There should be no migration issues, since no behaviour changes under the default configuration. Although the sstable format can be changed at the configuration level, the setting applies only to newly created sstables (in effect, it changes the default format). As long as the libraries containing the implementation of a format are present on the classpath, Cassandra is able to load files in that format regardless of whether it is the default format. Therefore, even after switching to a new sstable implementation at some point, we are still able to handle the old sstables.
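The distinction between the default format for new sstables and the format of existing files can be sketched as follows. FormatRegistrySketch is a hypothetical class. The example assumes file names shaped like "nb-1-big-Data.db" (version, generation, format name, component) and uses explicit registration as a stand-in for however formats on the classpath would actually be discovered.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical registry; format discovery and configuration are simplified assumptions.
final class FormatRegistrySketch {
    private final Set<String> registered = new LinkedHashSet<>();
    private String defaultFormat;

    void register(String formatName) {
        registered.add(formatName);
        if (defaultFormat == null)
            defaultFormat = formatName; // first registered format is the initial default
    }

    // Changing the default affects only sstables created afterwards.
    void setDefault(String formatName) {
        requireRegistered(formatName);
        defaultFormat = formatName;
    }

    String formatForNewSSTable() {
        if (defaultFormat == null)
            throw new IllegalStateException("No format registered");
        return defaultFormat;
    }

    // Assumes names shaped like "<version>-<generation>-<format>-<component>".
    String formatForExistingFile(String fileName) {
        String[] parts = fileName.split("-");
        if (parts.length != 4)
            throw new IllegalArgumentException("Unexpected sstable file name: " + fileName);
        requireRegistered(parts[2]);
        return parts[2];
    }

    private void requireRegistered(String formatName) {
        if (!registered.contains(formatName))
            throw new IllegalStateException("Format not available: " + formatName);
    }
}
```

In this sketch, an existing file is always resolved by the format name it carries, so switching the default from "big" to another registered format leaves old files readable as long as "big" remains registered.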

We don’t expect compatibility issues from changing the sstable API, as the current API doesn’t really allow for providing a custom implementation.

Rejected Alternatives

It may seem that the most obvious alternative is CASSANDRA-13475 (pluggable storage engine). However, it is not a real alternative, as the scope of that ticket is much broader. The scope of this CEP is to keep the current C* mechanics while gaining the ability to switch how the data is stored and indexed - its format.

Another alternative for supporting different sstable implementations is to use the already existing versioning system. It could work; however, it would make things much more difficult to maintain. We can see two major problems with that alternative:

  • bringing in a new sstable implementation would require many additional conditions in the current code, making it even more complex, harder to follow and more error prone
  • new sstable implementations would have to be introduced only in the main line of Cassandra code to keep compatibility. Imagine company X developed a custom implementation and gave it their own version tag, say OA, as it would be the next major version after Cassandra 4.0; at the same time, the community brought OA to the main line with a completely different implementation. It would be very difficult for X to reintegrate with the new Cassandra version, as the same sstable version would denote something completely different in their fork and in the main line. By introducing a custom sstable implementation, each provider can interpret versioning in their own way
