The classic C* topology for achieving multi-region high availability uses 3 data centers. In this topology, any single data center can be lost without sacrificing data integrity or availability. Unfortunately, this topology requires paying for 3 full size DCs, and limits the achievable latencies to the distance between the 2 closest data centers. This tradeoff can be especially painful for single DC workloads making use of paxos operations.
For applications that need to partition themselves geographically to put data and processing as close to users as possible to minimize latency, this HA topology is often not a practical approach. In many cases, these applications will run out of a single DC with no backup in the case of a DC loss, or will have a secondary, transactionally inconsistent DC and run queries at LOCAL QUORUM out of the primary.
This CEP proposes an alternative topology that uses considerably fewer nodes than three full data centers, while still providing transactionally consistent HA. It does this by introducing the concept of a satellite data center. Satellite data centers are data centers in a separate failure zone from the primary, while being close enough to provide a low latency connection. They are also much smaller than a full DC, containing 1/15th as many nodes as a full DC, possibly less. So instead of requiring 3 full DCs for HA, users could have an HA cluster with 2 full DCs, and 1 satellite.
This preserves the application semantics of a single DC cluster, but in the event of a datacenter loss, enables operators to perform a transactionally consistent failover to the secondary DC with no data loss, and minimal loss of availability. It also details a topology variant with 2 satellites (1 per full DC) that enables switching the primary status between the 2 full size data centers while preserving the ability to perform a transactionally consistent failover back to the opposite DC.
Goals
- provide an HA option to single DC workloads that doesn't require running 3 full DCs
- preserve single DC application semantics
- allow fast and correct C* orchestrated failover with minimal loss of availability
- guarantee failover/failback correctness
- degrade gracefully in face of non-primary DC loss
- minimize size of satellite datacenter
Non-goals
- Replace classic multi-dc replication
Key Concepts
If we start with a single DC cluster, we can read and write against it at quorum and everything will remain consistent. However, if that datacenter becomes unavailable, our entire application goes down. If we add a second DC, we can now read and write against our primary DC at local quorum; things remain consistent in the primary DC and our latency doesn’t increase. In this configuration, if the primary goes down, we could switch application traffic to the secondary, and things would be mostly up to date, but there will be some inconsistencies since we weren’t reading and writing to both DCs at quorum.
This is the use case we’re trying to solve here without having to pay for a third full DC. If we add a third DC that we mirror writes to, but is only responsible for storing these writes until we know they’ve been replicated to both the primary and secondary DCs, then the third DC can be very small. We can then use that small delta of writes that were acked to the client but not yet the secondary DC to write a transactionally safe failover mechanism to transfer application traffic to the secondary DC without losing any data, and with a small availability loss.
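The idea of the satellite as a small, temporary write buffer can be sketched as follows. This is a hypothetical illustration of the concept described above, not Cassandra code; all names (SatelliteLog, record, ack, truncate_replicated) are invented for this example.

```python
# Hypothetical sketch: the satellite holds a mutation only until both full
# DCs have acknowledged it, which is why it can stay very small.

class SatelliteLog:
    """Holds mutations only until both full DCs have acknowledged them."""

    def __init__(self):
        self.pending = {}  # mutation_id -> set of DCs that have acked it

    def record(self, mutation_id):
        self.pending.setdefault(mutation_id, set())

    def ack(self, mutation_id, dc):
        self.pending.setdefault(mutation_id, set()).add(dc)

    def truncate_replicated(self):
        # Mutations seen by both full DCs no longer need satellite storage.
        done = [m for m, dcs in self.pending.items()
                if {"primary", "secondary"} <= dcs]
        for m in done:
            del self.pending[m]
        return done

log = SatelliteLog()
log.record("m1"); log.record("m2")
log.ack("m1", "primary"); log.ack("m1", "secondary")
log.ack("m2", "primary")
print(log.truncate_replicated())  # ['m1'] - fully replicated, can be dropped
print(sorted(log.pending))        # ['m2'] - secondary hasn't acked yet
```

The set of mutations still pending after truncation is exactly the "small delta" a failover would need to replay to the secondary.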
FoundationDB[1] and DynamoDB[2] both have concepts of satellite DCs as well (MRSC in DynamoDB).
Proposed Changes
Satellite data center support will be implemented on top of witness replicas and mutation tracking to create what's effectively a witness data center. Since CEP-46 will complete the Witness Replica feature using Mutation Tracking as the replication mechanism for witness replicas, this means that the satellite replicas will not maintain an LSM, only temporary log data. This saves satellite replicas the overhead of running a full storage layer. The code that executes memtable writes, memtable flushes, data (sstable) reads, and compaction will be dormant (except for local system tables), which is how we can target a satellite DC size of 1/15th the full DC size, since the DC is just handling temporary mutation journal data.
Failover coordination process
This is the complicated part. Failover is coordinated by Cassandra via nodetool failover (or something similar). It is orchestrated as a TCM sequence, outlined in detail in the next section. Although Cassandra will orchestrate the failover process internally, it will not decide to fail over by itself; failover must be initiated by the operator or external tooling.
Failover Replication Strategy
Satellite datacenter replication is configured at the keyspace level by a replication strategy. The failover replication strategy will control how queries are executed, specifically what consistency level the query will be executed at and whether the query can be coordinated from the local data center or needs to be forwarded to another DC. The replication strategy is informed by the TCM failover state.
It’s worth pointing out that aside from the datacenter being configured as a satellite by one or more keyspace replication strategies, there’s nothing else about a satellite datacenter that’s special.
A note on consistency levels: The failover replication strategy presents a single logical DC to the client, so quorum/local_quorum/quorum_of_quorum are all interpreted as quorum in the single logical DC (same with serial/local_serial). Internally, the failover mechanism guarantees that reads and writes are executed at the appropriate internal consistency levels between the appropriate DCs needed to maintain correctness. These ‘internal’ CLs, and how they change during failover and primary datacenter reconfiguration, are covered in detail below.
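The client-facing CL normalization described above can be sketched as a small function. This is illustrative only; the function name and string values are invented for this example and are not a Cassandra API.

```python
# Hypothetical sketch: the failover replication strategy presents one logical
# DC to the client, so DC-scoped CL variants collapse to their global form.

def normalize_client_cl(cl: str) -> str:
    data_cls = {"QUORUM", "LOCAL_QUORUM", "QUORUM_OF_QUORUMS"}
    serial_cls = {"SERIAL", "LOCAL_SERIAL"}
    if cl in data_cls:
        return "QUORUM"   # one logical DC: all quorum variants are equivalent
    if cl in serial_cls:
        return "SERIAL"   # same collapse for serial/local_serial
    raise ValueError(f"unsupported consistency level: {cl}")

print(normalize_client_cl("LOCAL_QUORUM"))  # QUORUM
print(normalize_client_cl("LOCAL_SERIAL"))  # SERIAL
```

The internal CLs the mechanism actually executes at are a separate concern, covered per state below.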
Query Forwarding
If the client executes a query against a datacenter that's not the primary, it needs to be forwarded to the primary DC.
Satellite Failover States & Processes
Below is a description of the normal and failed over states, the processes for failing over and back, how the primary role is transferred in a 2 satellite topology, how to disable the secondary and satellites, and the rationale for the operations and states. The processes described below are executed by Cassandra; no operator intervention is required beyond telling Cassandra to start the state change.
A note about consensus (paxos/lwt) operations
To minimize network latencies and load on the satellite datacenter, consensus operations are coordinated at LOCAL_SERIAL from the primary DC. The data write portion (commit) of the operation is performed at QUORUM_OF_QUORUMS, and the read portion (prepare) is performed at LOCAL_QUORUM. Accord will not be supported by the initial implementation, but will be addressed as a follow-up.
Normal Operation
All client traffic directed to primary DC, secondary DC doesn't accept client traffic. If secondary has a satellite, it's dormant, not participating in reads or writes.
Read/Write CL: QUORUM_OF_QUORUMS
Serial Read/Write CL: LOCAL_SERIAL / LOCAL_QUORUM (read) / QUORUM_OF_QUORUMS (commit)
Basic reads and writes are performed at QUORUM_OF_QUORUMS. Typically the primary and satellite will be used to satisfy the CL, since they will have the lowest latency between them. However, if enough satellite nodes become unavailable, nodes in the secondary DC can be used to satisfy the CL, albeit with a latency penalty.
Writes will only return successfully to the client once they have been replicated to the primary and a quorum of either the satellite or secondary, and would therefore be visible after the loss of the primary dc and failover. Additionally, reads only return to the client once the primary and either the satellite or secondary DC agree on the data being read.
Paxos CAS reads and writes are performed at LOCAL_SERIAL, but the read and write stages use the same consistency criteria as basic reads and writes. This way, the consensus process and metadata is isolated to the primary DC, but the results are replicated to the satellite and/or secondary dc before returning a success to the client.
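The write-visibility rule described above (primary quorum plus a quorum of either the satellite or the secondary) can be sketched as a predicate. This is an illustrative model of the rule, not Cassandra code; function names and replica counts are invented for this example.

```python
# Hypothetical sketch of the QUORUM_OF_QUORUMS write rule: a write is acked
# only once the primary quorum AND a quorum of either the satellite or the
# secondary have acknowledged it, so it survives loss of the primary DC.

def quorum(acks: int, replicas: int) -> bool:
    return acks >= replicas // 2 + 1

def qoq_write_ok(primary_acks, primary_rf,
                 sat_acks, sat_rf,
                 sec_acks, sec_rf) -> bool:
    return quorum(primary_acks, primary_rf) and (
        quorum(sat_acks, sat_rf) or quorum(sec_acks, sec_rf)
    )

# Primary quorum + satellite quorum: success via the low-latency path.
print(qoq_write_ok(2, 3, 2, 3, 0, 3))  # True
# Satellite degraded, secondary quorum covers: success, latency penalty.
print(qoq_write_ok(2, 3, 1, 3, 2, 3))  # True
# Primary quorum alone is not enough to survive primary DC loss.
print(qoq_write_ok(3, 3, 1, 3, 1, 3))  # False
```

The third case is the key property: even a fully acked primary write is not reported to the client until a second failure zone holds it.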
Failover Process
The failover process is used when the current primary DC becomes unavailable. This procedure is the same regardless of whether the cluster is using one or two satellites.
Operator procedure
Initiate failover via nodetool failover, then switch client traffic to the secondary DC.
Cassandra process
1. Disable consensus operations to the primary DC / enable on secondary
Since consensus is local to the primary, we can't start executing consensus operations against the secondary until we know that no consensus operations can be executed against the primary. Specifically, no consensus operation on the primary can be allowed to write to the satellite (or secondary) once the secondary has begun running consensus operations. Communication with the primary may not be reliable or possible, but instructing the satellite to stop accepting consensus writes from the primary would guarantee that no in flight writes in the primary could succeed.
2. Switch client request coordination to Secondary DC
Write CL: LOCAL_QUORUM
Read CL: QUORUM_OF_QUORUMS
Serial Read/Write CL: LOCAL_SERIAL / QUORUM_OF_QUORUMS (read) / LOCAL_QUORUM (commit)
With consensus operations disabled on the primary, the application can now begin running queries against the secondary DC. Any successful normal or paxos write previously performed against the primary will have also been replicated to either the satellite or the secondary, so we can start running all queries against the secondary, including LOCAL_SERIAL operations, so long as read operations consult the satellite DC.
3. Stop promoting data to repaired/reconciled
Reconciliations performed after / as part of failover should not mix data with reconciliations performed before, to support failback.
4. Reconcile satellite data to secondary DC
Catches the secondary DC up with the result of any writes it may have missed before failover
5. Reconfigure C* cluster to stop replicating to the satellite and the primary, and stop promoting fully repaired/reconciled data
Write CL: LOCAL_QUORUM
Read CL: LOCAL_QUORUM
Serial Read/Write CL: LOCAL_SERIAL / LOCAL_QUORUM (read & commit)
See the "Offline Status" part of C* changes needed for more details.
6. Operate in "Failed over" state
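The six steps above can be sketched as an ordered sequence that only moves forward, in the spirit of a TCM sequence. This is a toy model for illustration; the step names paraphrase the CEP and nothing here is a real Cassandra API.

```python
# Hypothetical sketch: failover as a strictly ordered sequence. The ordering
# matters - e.g. the secondary may not coordinate client requests before
# consensus has been fenced off on the primary (step 1 before step 2).

FAILOVER_STEPS = [
    "disable_consensus_on_primary_enable_on_secondary",
    "switch_coordination_to_secondary",
    "stop_promoting_reconciled_data",
    "reconcile_satellite_to_secondary",
    "stop_replicating_to_satellite_and_primary",
    "failed_over",
]

class FailoverSequence:
    def __init__(self):
        self.index = 0

    @property
    def state(self):
        return FAILOVER_STEPS[self.index]

    def advance(self):
        # Steps must complete in order; there is no skipping ahead.
        if self.index < len(FAILOVER_STEPS) - 1:
            self.index += 1
        return self.state

seq = FailoverSequence()
while seq.state != "failed_over":
    seq.advance()
print(seq.state)  # failed_over
```

Modeling it as a linear sequence also makes the recovery story simple: if the orchestrator dies mid-failover, it resumes from the recorded step rather than guessing cluster state.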
Failed Over Operation
All client traffic is directed to the secondary DC; the primary DC doesn't accept client traffic and forwards any queries executed against it to the secondary. Writes are not forwarded to the primary satellite or the secondary satellite (if the cluster has one).
Read/Write CL: LOCAL_QUORUM
Serial Read/Write CL: LOCAL_SERIAL / LOCAL_QUORUM (read/commit)
Reads and writes performed at LOCAL_QUORUM. CAS reads and writes are also performed at LOCAL_SERIAL. Data is not written to the primary satellite, primary DC, or secondary satellite (if the cluster has one). Although we have effectively become a single DC topology, we do not promote data as fully reconciled. Instead, we accumulate it in the mutation tracking equivalent of unrepaired or pending repair so that we can avoid needing to perform a full rebuild of the primary on failback, and only need to sync unreconciled data back.
Rationale for not writing to the primary satellite: presumably the primary DC is unavailable, and anything written to its satellite would be cleared and re-reconciled during failback anyway, so there isn't a reason to incur a latency penalty in the meantime.
Rationale for not writing to the secondary satellite: although the secondary is now acting as the primary, there's no point in writing to its satellite. Since the primary DC is not available, we can't forward writes to it and the satellite would just fill up with unreconcilable data.
Failback Process
The failback process is used when we're operating in the "Failed Over" state, the primary has become available, and we are ready to transfer the primary role back to it.
Operator procedure
Initiate failback via nodetool failback, then switch client traffic to the primary once the C* process has finished.
Cassandra Process
1. Clear unreconciled data and consensus metadata on primary DC, and all data on satellite nodes
It's likely that writes in flight during failover will have been written to disk on the primary, but not the satellite or secondary. It's also likely that in flight consensus operations may have reached a completable state during failover, but also weren't recorded on the satellite or secondary. Since failover, the application will have moved forward in time without this state, so we need to make sure it doesn't reappear on failback. To achieve this, we clear all consensus metadata and unrepaired/unreconciled data.
2. Reconcile data in Secondary DC
To increase the amount of data we can stream as sstables and minimize the number of individual log entries we need to send to the satellite during read reconciliations.
3. Start replicating to the primary and satellite DC - restrict reads from primary until cohort data has been transmitted
Read/Write CL: QUORUM_OF_QUORUMS
Serial Read/Write CL: LOCAL_SERIAL / QUORUM_OF_QUORUMS (read & commit)
4. Perform global reconciliation
This will transmit the siloed cohort data back to the primary, as well as any recent log entries that were missed
5. Disable consensus operations on the secondary DC / Enable consensus operations on primary
6. Switch client traffic to primary DC
7. Resume operating in "Normal operation" state
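Step 1 of the failback sequence above can be sketched as a per-role clearing rule: what each node role must discard before replication to the old primary resumes. This is an illustrative model only; the role and item names are invented for this example.

```python
# Hypothetical sketch of failback step 1: state that never reached the
# satellite or secondary must not reappear after failback, so it is cleared
# per role before replication to the old primary resumes.

def failback_clear(role: str):
    if role == "primary":
        # In-flight writes and consensus state the rest of the cluster
        # never saw; the application has moved on without them.
        return ["unreconciled_data", "consensus_metadata"]
    if role == "satellite":
        # Satellite log data is fully rebuilt once replication resumes.
        return ["all_log_data"]
    if role == "secondary":
        # The secondary is the source of truth during failback.
        return []
    raise ValueError(f"unknown role: {role}")

print(failback_clear("primary"))  # ['unreconciled_data', 'consensus_metadata']
```

The asymmetry is the point: only the old primary and its satellite clear anything, because only they can hold state that diverged from what clients were acknowledged.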
Primary Transfer Process (2 satellite configuration)
Since both DCs will be primary at different points of the process, we'll be referring to the full DCs as DC1 & DC2, and their respective satellites as SA1 & SA2. In this example, we are transferring the primary role from DC1 to DC2.
This is similar to the failover process described earlier, with a few key differences.
- We need to hand off the primary satellite role from SA1 to SA2, which makes the transitional query consistency a bit more complicated.
- We no longer stop promoting data to reconciled
This is only available to clusters where each full DC has a satellite, and is used when we want to transfer the primary role to the opposite data center without precluding an HA failover back to the original primary. Reasons include no-risk exercising of the failover process, and proactively transferring the primary role when trouble is expected at the current primary (hurricanes, etc).
Operator procedure
Initiate the transfer via nodetool transfer_primary, then switch client traffic to DC2.
Cassandra process
1. Disable consensus operations on DC1 / enable on DC2
As in the failover process, we can't start executing consensus operations against DC2 until we know that no consensus operation on DC1 can still write to SA1 or DC2. Instructing SA1 to stop accepting consensus writes from DC1 guarantees that no in-flight writes on DC1 can succeed.
2. Switch client traffic to DC2
See (*) below for an explanation of the DC set notation.
Write CL: QUORUM_OF_QUORUMS {DC2, SA2, DC1}*
Read CL: QUORUM_OF_QUORUMS {DC2, SA2, DC1}* + {DC1, SA1, DC2}*
Serial Read/Write CL: LOCAL_SERIAL / QUORUM_OF_QUORUMS {DC2, SA2, DC1}* + {DC1, SA1, DC2}* (reads) / QUORUM_OF_QUORUMS {DC2, SA2, DC1}* (commit)
With consensus operations disabled on DC1, the application can now begin running queries against DC2. Any successful normal or paxos write previously performed against DC1 will have also been replicated to either SA1 or DC2, so we can start running all queries against DC2, including LOCAL_SERIAL operations, so long as read operations consult both satellites (SA1 and SA2).
(*) The Read/Write/Serial CLs indicate which sets of DCs/Satellites the CLs need to be satisfied against. In the case of writes, we immediately begin including SA2 in the voting group for writes. In the case of reads, we need to ensure we see all writes made to DC1 & DC2, and so we need to satisfy quorum of quorums for the original DC set {DC1, SA1, DC2} and the new DC set {DC2, SA2, DC1}.
3. Reconcile SA1 data to DC2
Catches DC2 up with the result of any writes made against DC1 that it may have missed before the transfer.
4. Operate in "Normal Operation" state, now with DC2/SA2 as the primary/satellite
Read/Write CL: QUORUM_OF_QUORUMS
Serial Read/Write CL: LOCAL_SERIAL / LOCAL_QUORUM (read) / QUORUM_OF_QUORUMS (commit)
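The transitional read rule from step 2 (*) can be sketched as a predicate over the two voting sets. This is an illustrative model using the CEP's DC set notation; the function names and replica counts are invented for this example.

```python
# Hypothetical sketch of the transitional read rule: during primary transfer
# a read must satisfy QUORUM_OF_QUORUMS against BOTH the old voting set
# {DC1, SA1, DC2} and the new voting set {DC2, SA2, DC1}.

def quorum(acks: int, rf: int) -> bool:
    return acks >= rf // 2 + 1

def qoq(acks, rf, leader, satellite, other) -> bool:
    # Leader quorum plus a quorum of either its satellite or the other full DC.
    return quorum(acks[leader], rf) and (
        quorum(acks[satellite], rf) or quorum(acks[other], rf)
    )

def transitional_read_ok(acks, rf=3) -> bool:
    old_set = qoq(acks, rf, "DC1", "SA1", "DC2")  # sees pre-transfer writes
    new_set = qoq(acks, rf, "DC2", "SA2", "DC1")  # sees post-transfer writes
    return old_set and new_set

# Both voting sets satisfied: the read sees all writes made to DC1 and DC2.
print(transitional_read_ok({"DC1": 2, "SA1": 2, "DC2": 2, "SA2": 2}))  # True
# DC2 short of quorum: old set passes via DC1 + SA1, but the new set fails.
print(transitional_read_ok({"DC1": 2, "SA1": 2, "DC2": 1, "SA2": 2}))  # False
```

Requiring both sets is what makes the handoff safe: neither the old nor the new satellite can be silently dropped from the read path while writes may still exist on only one of them.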
Secondary / Satellite Disable Process
In cases where the satellite is down for an extended period of time, the application will experience a latency penalty, since it now has to wait on responses from the secondary to satisfy its consistency level. Depending on the circumstances, we may want to revert to the current behavior - LOCAL_QUORUM queries against the primary, non-transactionally replicated to the secondary. This will improve query availability and latency, but safe HA failover won't be possible until the satellite is re-enabled.
In cases where the secondary is down for an extended period of time, the satellite will accumulate data that it can't reconcile and the primary will eventually need to stop sending data to it.
A multi-step process isn't needed to disable a secondary or satellite. Once instructed to disable the satellite, C* will stop including the secondary and satellite in its consistency requirements and execute queries at the following consistency levels against the primary:
Read/Write CL: LOCAL_QUORUM
Serial Read/Write CL: LOCAL_SERIAL / LOCAL_QUORUM (read) / LOCAL_QUORUM (commit)
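The internal read/write CLs across the states described so far can be summarized as a small lookup. This is a consolidation of values stated in this CEP, with illustrative state names invented for this example.

```python
# Hypothetical summary table of internal CLs per state, as given in this CEP.
# "failing_over" corresponds to step 2 of the failover process.

INTERNAL_CLS = {
    "normal":             {"read": "QUORUM_OF_QUORUMS", "write": "QUORUM_OF_QUORUMS"},
    "failing_over":       {"read": "QUORUM_OF_QUORUMS", "write": "LOCAL_QUORUM"},
    "failed_over":        {"read": "LOCAL_QUORUM",      "write": "LOCAL_QUORUM"},
    "satellite_disabled": {"read": "LOCAL_QUORUM",      "write": "LOCAL_QUORUM"},
}

def internal_cl(state: str, op: str) -> str:
    return INTERNAL_CLS[state][op]

print(internal_cl("normal", "write"))             # QUORUM_OF_QUORUMS
print(internal_cl("satellite_disabled", "write")) # LOCAL_QUORUM
```

Note that "failed_over" and "satellite_disabled" execute at the same CLs; the difference is whether data continues to be promoted to fully reconciled, not how queries are answered.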
Secondary / Satellite Re-Enable Process
Once the secondary or satellite becomes available, re-enabling is simple:
1. Revert to the "Normal Operation" state
2. Reconcile data between the primary and secondary
Since queries will have been executed against the primary at LOCAL_QUORUM, we need to explicitly reconcile data written while the secondary or satellite was disabled before we can safely perform an HA failover to the secondary. In the case of a re-enabled satellite, this should be very quick, as the primary will have been replicating to the secondary in the background. In the case of a re-enabled secondary, there will be more data to transfer.
[1] https://github.com/apple/foundationdb/wiki/Multi-Region-Replication
[2] https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/V2globaltables_HowItWorks.html