Audience: All Cassandra Users and Developers
User Impact: Improved LWT performance, particularly in WAN or contended scenarios

Released: 4.1

Motivation

  • LWTs suffer from poor performance, particularly in a WAN setting, and particularly under contention.
  • LWTs have never guaranteed linearizability across range movements, which is a significant problem for a mechanism intended to offer strong consistency.
  • CASSANDRA-12126 has introduced significant performance regressions in order to resolve long-standing correctness issues. This may result in users being unable to use LWTs where they could previously, or else having to accept a poorly-documented correctness trade-off in order to keep the lights on.

This work aims to address these shortcomings with various improvements to the performance of our Paxos implementation, without fundamentally altering its behaviour.

Goals

  • Durable writes in no more than two round-trips when uncontended
  • Linearizable reads in no more than one round-trip when uncontended
  • Reduced contention
  • Linearizability across range movements

Description of Approach

Paxos Repair
We will introduce a new repair mechanism, that can be run with or without regular repair. This mechanism will:

  • Track, per-replica, transactions that have been witnessed as initiated but have not been seen to complete
  • For a majority of replicas complete (either by invalidating, completing, or witnessing something newer) all operations they have witnessed as incomplete prior to the intiation of repair
  • Globally invalidate all promises issued prior to the most recent paxos repair

We will deprecate system.paxos TTLs, and instead expunge records that are older than the most recent paxos repair for any given range/table.

This mechanism alone will permit safely returning success prior to performing the COMMIT step of our Paxos implementation. Users will still have to opt-in to this behaviour by providing commit consistency of ANY, or ONE, or perhaps LOCAL_QUORUM, depending on their preference. However this can be recommended as safe and preferred once this mechanism is in place, taking us from four to three round-trips.

Paxos Optimisations
Several optimisations to our paxos implementation will be introduced, including

  • Combine promise+read before a proposal: if the proposal is successful, the read will have been linearized along with the write, taking us to two round-trips.
  • Optimistic reads: if a majority of promises witnessed consistent state when promising and performing their read, the majority read can be returned to the client without waiting to issue an empty proposal. This takes us to one round-trip on read.
  • Preventing read/read competition: promises will be issued separately for reads and writes, with read promises invalidating write promises, and write promises invalidating read promises, but read promises will not invalidate each other, or prevent the above optimistic read optimisation.
  • Bounding re-proposals: incomplete commands that are re-proposed will not continue to be re-proposed if the original command has been committed (specifically, we track separately the ballot of the original proposal and the re-proposal, so that if the original proposal reaches the commit state as part of the original proposal, or any re-proposal, all re-proposals can instead go straight to commit).
  • Coordinators will not self-compete for operations on the same partition
  • Coordinators will cache PaxosState to limit dependence on performance of system.paxos


Correctness Improvements

  • Safety under range movements: we will introduce strong consistency for LWTs under range movements, by reconciling the replica-set amongst the replicas we contact for voting. If they differ, we will update our internal ring state with that of all replicas that disagreed.
  • We will introduce mechanisms to spot and log linearizability violations for the user to file as bug reports


Other Improvements

  • Introduce a dedicated TimeUUID class to prevent mistakes mixing ballot with sstable timestamps

Upgrade / Migration

  • Upgrade will likely be initially optional, and the mechanism TBD. At minimum there will be JMX endpoints to enable/disable the new mechanisms.

Test Plan

  • The Cluster Simulation CEP introduces significant testing for the correctness of this system, which will be expanded to ensure coverage of new functionality
  • Extensive real-world testing will be conducted on synthetic and live traffic
  • Unit tests for the new subsystems will also accompany the patch
  • No labels