Status

Current state: Committed (to trunk)

...

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

It is an operational oversight that there is no repair scheduling mechanism offered out of the box with Apache Cassandra.  To have and keep data at-rest consistent you must run repairs.  For running clusters, inconsistent data at-rest presents potential problems in different scenarios, from performance overhead, to broken consistency guarantees, and even a loss of data.

...

Migrating[1] to incremental repair (and rolling back from it) has been extremely challenging, especially in a large fleet. One of the design principles is to make it almost touchless from the operator’s point of view.

Users

New Users

Creating a reliable, scalable, and robust repair scheduler is a multi-year journey. Without an out-of-the-box Apache Cassandra solution, newly adopting Apache Cassandra users would have to make significant investments in designing, developing, and scaling the repair scheduling framework before they can use Apache Cassandra. 

Existing Users

From Apache Cassandra 4.0 onwards, multiple repair types must be run for correct database behavior, such as Full, Incremental, and Paxos.

  • Some newly introduced repair types, such as Incremental repair, are complex to manage operationally. For example, migrating[1] to/from incremental repair could jeopardize the live traffic. The token-calculation algorithm for Incremental repair needs to consider the size of unrepaired data. Additionally, incremental repair performs anti-compaction, so if it is not monitored correctly, it may eventually fill the disks completely and cause catastrophic failures in the live traffic.
  • Unlike full repair, incremental repair does not work for all workload types. For some workload types, it poses reliability and availability risks to the Cassandra cluster. Therefore, it is vital to have an automated way to migrate on/off incremental repair without any human involvement and without creating reliability risks to the live production traffic.
  • All existing users would have to invest in their own solutions to make them work across these multiple repair types. When multiple implementations of the same problem are built in silos, cross-team knowledge is often lost and everyone reinvents the wheel. Therefore, existing users also benefit from an out-of-the-box Apache Cassandra solution, as the solution will solidify over time.

Goals

Repair has been a widely discussed topic for a long time, and many folks have developed custom repair solutions. The proposal is to have a single, formally recommended, and supported solution for Repair Scheduling that is part of the project's codebase and available by default to all users. An official solution implicitly lowers the entry bar for newcomers. It also motivates existing users to enrich the solution and eventually migrate to it rather than reinvent the wheel.

...

  1. The solution has to be extremely easy for an operator to manage, so that even an inexperienced user can manage it.
  2. The solution should provide 1) Full Repair and 2) Incremental Repair out-of-the-box.  Moreover, the architecture should be easily extendable to accommodate more future repair types, such as Preview and Paxos repair.
  3. Reduce the incremental repair operational complexity:
    1. Migration: Migration to using incremental repair and its scheduling is very complex, so this solution should provide automation for that.
    2. Ongoing: Once incremental repair runs in the cluster, it must be monitored carefully. It could create catastrophic problems, such as a full disk, so the solution needs enough safeguards to keep incremental repair from jeopardizing the live traffic.
    3. Token: A robust token-split mechanism specific to incremental repair.
  4. The solution should work on a small, medium, and large Cassandra cluster.
  5. The solution should scale and perform without much additional operational overhead. In other words, the operational complexity should not linearly increase with the Cassandra fleet size.

Non-Goals

  1. No human-in-the-loop automated repair

Proposed Changes (Last active: 2024-10-01)

Keeping the above motivation in mind, the proposed design moves repair orchestration inside Apache Cassandra. This fully automated repair scheduler inside Cassandra does not depend on a control plane, significantly reducing operational overhead.

...

The scheduler supports Full and Incremental repair types today with the following features. The Preview repair code is under review. Adding a new repair type, such as Paxos repair or any future repair type, should require minimal development work.

Available Features

  • Capability to run on multiple nodes simultaneously
  • Default implementation and an interface to override the data set being repaired per repair session
  • Ability to extend the token split algorithms with the following two implementations readily available:
    • Split tokens evenly based on the number of splits specified (Default)
    • Split token ranges by capping the size of data repaired in one repair session, with an additional cap at the table level; this is crucial for Incremental Repair
  • Enough observability so an operator can configure alarms depending on their need
  • A new CQL table property to disable repair type at a table level if one wants the scheduler to skip one or more tables
  • Enabling/Disabling scheduler for each repair type dynamically
  • Per Data Center per job configuration
  • Admin
    • Prioritize running repairs on specific nodes over others
    • Force one or more nodes to start immediately, even if it is not their turn to repair organically
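
The two token-split strategies listed above can be illustrated with a minimal Python sketch. It is not the scheduler's actual implementation: the function names, the non-wrapping integer token range, and the assumption that data is spread evenly across the range are all hypothetical and chosen only for illustration.

```python
def split_evenly(start, end, splits):
    """Split the token range (start, end] into `splits` near-equal subranges.

    Assumption: a non-wrapping range (start < end) of integer tokens.
    """
    if splits < 1:
        raise ValueError("splits must be >= 1")
    if end <= start:
        raise ValueError("expected a non-wrapping range with start < end")
    width = end - start
    splits = min(splits, width)  # never produce empty subranges
    bounds = [start + (width * i) // splits for i in range(splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def splits_for_size(estimated_bytes, max_bytes_per_session):
    """Number of subranges needed so that each session stays under the cap.

    Assumption: the data is spread evenly over the range, so cutting the
    range into this many even pieces also caps the bytes per piece.
    """
    if max_bytes_per_session <= 0:
        raise ValueError("max_bytes_per_session must be positive")
    # Integer ceiling division; always at least one session.
    return max(1, -(-estimated_bytes // max_bytes_per_session))
```

For example, `split_evenly(0, 100, 4)` yields four subranges of width 25, and a range estimated at 10 GiB with a hypothetical 4 GiB per-session cap needs `splits_for_size(10 * 2**30, 4 * 2**30)`, i.e. 3, sessions.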

Incremental Repair Specific Features

As highlighted above, onboarding/offboarding[1] incremental repair is quite painful. The challenge grows considerably on a large Cassandra cluster and across an extensive fleet of clusters, and making it smooth and reliable is a painstaking task. So, the design already incorporates the following.

  • A way to onboard/offboard incremental repair
    • No restart required
    • No keyspaces have to be specified
  • Safety Against 100% Disk Full

    Incremental repair can lead to a 100% full disk in a corner-case scenario. A new mechanism has been added to guard against this: if the disks are 80% full, the current and any future incremental repair sessions stop automatically.

  • Safety Against Materialized View and CDC

    It is not recommended (due to existing bugs) to use incremental repair on a table if a Materialized View and/or CDC is enabled. The scheduler additionally checks whether the "write path" is disabled during streaming (CASSANDRA-17666). If the "write path" is not disabled, the scheduler skips incremental repair; otherwise it allows it.
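
The two safeguards above can be sketched as a pair of checks. This is an illustrative Python sketch only, not the scheduler's code: the function and parameter names are hypothetical, and treating the Materialized View/CDC check as "skip only when the table has an MV or CDC and the write path is not disabled" is one reading of the text.

```python
import shutil

DISK_USAGE_LIMIT = 0.80  # from the text: stop when the disks are 80% full


def disk_too_full(data_dirs, limit=DISK_USAGE_LIMIT, probe=shutil.disk_usage):
    """Return True if any data directory's disk is at or above the limit.

    `probe` returns (total, used, free) in bytes; it is injectable so the
    check can be exercised without touching a real filesystem.
    """
    for path in data_dirs:
        total, used, _free = probe(path)
        if total > 0 and used / total >= limit:
            return True
    return False


def may_run_incremental_repair(has_mv, has_cdc, write_path_disabled, disk_full):
    """Decide whether incremental repair may run on a table (hypothetical)."""
    if disk_full:
        return False  # disk safeguard: stop current and future sessions
    if (has_mv or has_cdc) and not write_path_disabled:
        return False  # MV/CDC safeguard: skip this table
    return True
```

Passing the disk probe as a parameter keeps the decision logic separate from the filesystem call, which makes both checks easy to unit test.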

Ongoing Work [2024]

This work will likely be finished by December 2024.

  • Safeguard Full repair against disk protection, i.e., fail current and future full repair sessions if the disk is already at 80% (CASSANDRA-20045)
  • Incorporate Preview Repair (CASSANDRA-20046)
  • Making the incremental repair more reliable by having an unrepaired size-based token splitter (CASSANDRA-20047)

Future Work [2025]

The discussion will commence by December 2024, but the implementation work will happen in early 2025.

  • Repair should be major version forward/backward compatible.
    • [2024] Bring an official alignment through a new ML discussion to ensure everybody is on the same page. Some work is already done as part of CASSANDRA-7530
    • [2025] Add test cases that will catch if compatibility is broken (CASSANDRA-20042)
  • Table-level tracking: If a node is restarted while repair is ongoing, resume the repair from where it left off last time as opposed to from scratch (CASSANDRA-20044)
    • [2024] Design discussion
    • [2024/25] Implementation
  • Auto-delete snapshots at X% disk full (CASSANDRA-20035)
    • [2025] Design discussion
    • [2025] Implementation
  • Stop the scheduler if two major versions are detected [2025] (CASSANDRA-20048)
    • Note: This will not be needed once robust compatibility test coverage (CASSANDRA-20042) is in place, but until then, this is a safeguard

Observability

The scheduler should provide enough metrics for an operator to make informed decisions:

...

Metric Name                       Description
RepairsInProgress                 Whether a repair is in progress on the node
NodeRepairTimeInSec               Time taken to repair the node in seconds
ClusterRepairTimeInSec            Time taken to repair the entire Cassandra cluster in seconds
LongestUnrepairedSec              Time since the last repair ran on the node in seconds
SucceededTokenRangesCount         Number of token ranges successfully repaired on the node
FailedTokenRangesCount            Number of token ranges that failed to repair on the node
SkippedTokenRangesCount           Number of token ranges skipped on the node
TotalMVTablesConsideredForRepair  Number of materialized views considered on the node
TotalDisabledRepairTables         Number of tables on which the repair has been disabled on the node
RepairTurnMyTurn                  The node was repaired naturally because it was its turn (common)
RepairTurnMyTurnDueToPriority     The node was repaired due to a priority set by an admin (uncommon)
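
As an illustration of how an operator might build alarms on these metrics, here is a minimal Python sketch. The metric names come from the table above; everything else is an assumption: the `evaluate_alarms` function, the plain dictionary snapshot, and the operator-chosen `max_unrepaired_sec` threshold are hypothetical, and how the values are collected from a node is out of scope.

```python
def evaluate_alarms(metrics, max_unrepaired_sec):
    """Return the names of metrics that should raise an alarm.

    `metrics` is a snapshot mapping metric name to its current value;
    `max_unrepaired_sec` is a threshold chosen by the operator.
    """
    alarms = []
    if metrics.get("FailedTokenRangesCount", 0) > 0:
        alarms.append("FailedTokenRangesCount")
    if metrics.get("LongestUnrepairedSec", 0) > max_unrepaired_sec:
        alarms.append("LongestUnrepairedSec")
    return alarms
```

A snapshot with two failed token ranges and a recent repair would raise only the `FailedTokenRangesCount` alarm.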


Migration Plan

  • Adopting the scheduler should be extremely easy and require users to make only minor changes, such as tweaking its configuration options. 
  • The scheduler should not modify the integral parts of the anti-entropy algorithm itself.
  • All the configuration options should be disabled by default, making the scheduler opt-in so it does not interfere with Cassandra version upgrades.
    • Opting in should not require any cluster restart; one should be able to opt in simply by enabling the scheduler through nodetool

Test Plan

  • Capability to tune the following important scheduler properties without restarting the Cassandra cluster:
    • Turning on/off
    • Increase/decrease parallelism
    • Allow/Disallowing tables for repair
  • The scheduler should be thoroughly tried and tested under both synthetic and live real-world traffic
  • Unit tests for the new subsystems
  • Unit tests for all configuration
  • DTest (preferably In-JVM dtest) covering the basic scheduling functionality 

Scale 

  • This design has been thoroughly tried and tested on an immense scale (hundreds of unique Cassandra clusters, tens of thousands of Cassandra nodes, with tens of millions of QPS) on top of 4.1 open-source; please see more details here.
  • Another reference

Detailed Design Doc 

Automated Repair in Cassandra

PR

4.1.6

Many folks are currently using 4.1.6 in production. Hence, the following PR on 4.1.6 will make it easier for everybody to review and test the code. If the community decides to merge this CEP, it will land only on trunk; the 4.1.6 PR is for code review and testing purposes only.

https://github.com/apache/cassandra/pull/3367/

Trunk

https://github.com/apache/cassandra/pull/3598

Dtest fix

https://github.com/apache/cassandra-dtest/pull/270

Discussion over Slack

[1][2]

Post-Vote Updates

The following features have been added as post-vote updates:

...

Please follow this epic for continuous updates: CASSANDRA-19918

Other Solutions

A few other ready-made solutions are available, some of which are used in the industry at scale in private forks. These solutions have been evaluated, and they do not look like a good long-term solution for Apache Cassandra for the following reasons:

  • Operational complexity of setting it up, e.g., dependency on a control plane
  • Lack of integration with the DB
  • Lack of out-of-the-box availability
  • Lack of support for multiple repair types
  • Not capable of running incremental repair reliably and at scale
  • Not donated to the foundation

Solution#1 Scheduled Repair by Joey Lynch (Last active: 2018-11-20)

There have been many attempts to automate repair in Cassandra, which makes sense given that it is necessary to give our users eventual consistency. Most recently CASSANDRA-10070, CASSANDRA-8911 and CASSANDRA-13924 have all looked for ways to solve this problem.

...

Detailed design doc: Scheduled Repair

Solution#2 Automatic repair scheduling by Marcus Olsson (Last active: 2016-06-01)

Scheduling and running repairs in a Cassandra cluster is most often a required task, but this can both be hard for new users and it also requires a bit of manual configuration. There are good tools out there that can be used to simplify things, but wouldn't this be a good feature to have inside of Cassandra? To automatically schedule and run repairs, so that when you start up your cluster it basically maintains itself in terms of normal anti-entropy, with the possibility for manual configuration.

...

Detailed design doc: Distributed Repair Scheduling

Solution#3 Reaper by The Last Pickle / Spotify (Last active: TBD)

Reaper is a centralized, stateful, and highly configurable tool for running Apache Cassandra repairs against single or multi-site clusters.

...

Details: https://github.com/thelastpickle/cassandra-reaper

Solution#4 ecChronos by Ericsson (Last active: TBD)

ecChronos is a decentralized scheduling framework primarily focused on performing automatic repairs in Apache Cassandra.

...