Status
Current state: Committed (to trunk)
Discussion thread: https://lists.apache.org/thread/glvmkwknf91rxc5l6w4d4m1kcvlr6mrv
Voting thread: https://lists.apache.org/thread/ywtbvlzm112h2mv0xww6lbq555939hkr
Update thread: https://lists.apache.org/thread/30xjmqt2804pt40sk7tl46n7b6tt7wt6
JIRA: https://issues.apache.org/jira/browse/CASSANDRA-19918
Released: Part of the next release, 5.1
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
It is an operational oversight that Apache Cassandra offers no repair scheduling mechanism out of the box. To keep data at rest consistent, you must run repairs. For running clusters, inconsistent data at rest presents problems in different scenarios, from performance overhead to broken consistency guarantees and even loss of data.
Other cluster maintenance (hint delivery, compaction, flushing of data, etc.) occurs automatically, but there is no mechanism to run repair automatically in the database. This makes the database harder to use, and the proliferation of custom solutions to this problem adds unnecessary complexity to the ecosystem.
Anti-entropy (Apache Cassandra repair) is important for every Apache Cassandra cluster to fix data inconsistencies. Frequent data deletions and downed nodes are common causes of inconsistency. A few open-source orchestration solutions that trigger repair externally are available, and many corporations have already built their own repair solutions. For Cassandra to be a complete solution, repair should be an integral part of the database itself, very much like compaction.
The proposal here is to align on one solution among the existing ones and make it part of core Cassandra itself.
Multiple repair types need to be scheduled inside Cassandra for it to work correctly. A few of the repair types are: 1) Full repair, 2) Incremental repair, 3) Preview repair, and 4) Paxos repair.
The design of the scheduler should be able to accommodate additional repair categories with minimal code change, and all repair types should progress automatically with minimal manual intervention.
Migrating[1] to (and rolling back from) incremental repair has been extremely challenging, especially in a large fleet. One of the design principles is to make this almost touchless from the operator's point of view.
Users
New Users
Creating a reliable, scalable, and robust repair scheduler is a multi-year journey. Without an out-of-the-box Apache Cassandra solution, users newly adopting Apache Cassandra would have to make significant investments in designing, developing, and scaling a repair scheduling framework before they can use Apache Cassandra.
Existing Users
From Apache Cassandra 4.0 onwards, we need to run multiple repair types, such as Full, Incremental, and Paxos, for correct database behavior.
- Some newly introduced repair types, such as incremental repair, are complex to manage operationally; for example, migrating[1] to/from incremental repair could jeopardize live traffic. The token-calculation algorithm for incremental repair needs to consider the size of unrepaired data. Additionally, incremental repair performs anti-compaction, so if it is not monitored correctly it may eventually fill disks to 100% and cause catastrophic failures for live traffic.
- Unlike full repair, incremental repair does not work for all workload types; for some, it poses reliability and availability risks to the Cassandra cluster. Therefore, it is vital to have an automated way to migrate on/off incremental repair without human involvement and without creating reliability risks for live production traffic.
- All existing users would have to invest in their own solutions to make them work across these multiple categories. When multiple implementations are built in silos for the same problem, cross-team knowledge is often lost and everyone reinvents the wheel. Therefore, an out-of-the-box Apache Cassandra solution also benefits existing users, as the solution will solidify over time.
Goals
Since repair has been a widely discussed topic for a long time, and folks have developed their own custom repair solutions, the proposal is to have a single, formally recommended, and supported solution for repair scheduling that is part of the project's codebase and available by default to all users. Once we have an official solution, the entry bar for newcomers will be implicitly lowered. It also motivates existing users to enrich the solution and eventually migrate to the officially blessed one rather than reinvent the wheel.
The solution should have the following properties:
- The solution has to be extremely easy for an operator to manage, so that even a novice user can run it.
- The solution should provide 1) Full Repair and 2) Incremental Repair out-of-the-box. Moreover, the architecture should be easily extendable to accommodate more future repair types, such as Preview and Paxos repair.
- Reduce the incremental repair operational complexity:
- Migration: Migrating to incremental repair and scheduling it is very complex, so this solution should automate that.
- Ongoing: Once incremental repair runs in the cluster, it must be monitored carefully, as it could create catastrophic problems such as a full disk. The solution needs enough safeguards for incremental repair to avoid jeopardizing live traffic.
- Token: A robust token-split mechanism specific to incremental repair.
- The solution should work on small, medium, and large Cassandra clusters.
- The solution should scale and perform without much additional operational overhead. In other words, operational complexity should not increase linearly with the size of the Cassandra fleet.
Non-Goals
- No human-in-the-loop automated repair
Proposed Changes (Last active: 2024-10-01)
Keeping the above motivation in mind, the proposed design embarks on our journey to have repair orchestration inside Apache Cassandra. This fully automated repair scheduler inside Cassandra does not depend on a control plane, significantly reducing operational overhead.
At a high level, a dedicated thread pool is assigned to the repair scheduler. The repair scheduler inside Cassandra maintains a new replicated table under the system_distributed keyspace. This table maintains the repair history for all the nodes, such as when each node was last repaired. The scheduler picks the node(s) that should run repair first and continues orchestration to ensure that every table and all of its token ranges are repaired. The algorithm can also run repairs simultaneously on multiple nodes, and it splits token ranges into subranges with the necessary retries to handle transient failures. Automatic repair runs as soon as a Cassandra cluster starts, like compaction, and does not require a human in the loop.
The scheduler supports the Full and Incremental repair types today with the following features. The Preview repair code is under review. New repair types, such as Paxos repair and any future types, should be addable with minimal dev work.
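To make the turn-taking idea above concrete, here is a minimal, heavily simplified sketch. All names (RepairHistory, AutoRepairSketch, maxParallelNodes, and so on) are illustrative assumptions for this page, not the classes or configuration options used in the actual PR.

```java
import java.util.List;
import java.util.UUID;

// Illustrative stand-in for the replicated repair-history table described above.
interface RepairHistory
{
    long lastRepairFinishedMillis(UUID nodeId, RepairType type);
    List<UUID> nodesCurrentlyRepairing(RepairType type);
    void markStarted(UUID nodeId, RepairType type);
    void markFinished(UUID nodeId, RepairType type);
}

enum RepairType { FULL, INCREMENTAL }

final class AutoRepairSketch
{
    private final RepairHistory history;
    private final UUID localNode;
    private final int maxParallelNodes;   // how many nodes may repair at once
    private final long minIntervalMillis; // do not re-repair a node more often than this

    AutoRepairSketch(RepairHistory history, UUID localNode, int maxParallelNodes, long minIntervalMillis)
    {
        this.history = history;
        this.localNode = localNode;
        this.maxParallelNodes = maxParallelNodes;
        this.minIntervalMillis = minIntervalMillis;
    }

    // Called periodically from the scheduler's dedicated thread pool.
    void maybeRunRepair(RepairType type)
    {
        boolean due = System.currentTimeMillis() - history.lastRepairFinishedMillis(localNode, type) >= minIntervalMillis;
        boolean slotAvailable = history.nodesCurrentlyRepairing(type).size() < maxParallelNodes;
        if (!due || !slotAvailable)
            return; // not this node's turn yet

        history.markStarted(localNode, type);
        try
        {
            // For every eligible table and token subrange owned by this node, submit a
            // repair session and retry transient failures (elided in this sketch).
        }
        finally
        {
            history.markFinished(localNode, type);
        }
    }
}
```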
Available Features
- Capability to run on multiple nodes simultaneously
- Default implementation and an interface to override the data set being repaired per repair session
- Ability to extend the token-split algorithms, with the following two implementations readily available (a minimal sketch of the even-split approach is shown after this feature list):
- Split token ranges evenly based on the number of splits specified (default)
- Split token ranges by putting a max cap on the size of data being repaired as part of one repair session, as well as a max cap at a table level - this is crucial for Incremental Repair
- Enough observability so an operator can configure alarms depending on their need
- A new CQL table property to disable a repair type at the table level if one wants the scheduler to skip one or more tables
- Enabling/disabling the scheduler for each repair type dynamically
- Per Data Center per job configuration
- Admin
- Prioritize running repairs on specific nodes over others
- Force one or more nodes to start immediately, even if it is not their turn to repair organically
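As referenced in the token-split feature above, the default even-split strategy can be illustrated with the following minimal sketch. It assumes Murmur3 tokens modelled as plain longs and a non-wrapping (start, end] range; the class and method names are illustrative, not the PR's implementation.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public final class EvenTokenSplitter
{
    // A simple (start, end] token subrange.
    public static final class Subrange
    {
        public final long start;
        public final long end;

        Subrange(long start, long end)
        {
            this.start = start;
            this.end = end;
        }
    }

    public static List<Subrange> split(long start, long end, int numSplits)
    {
        if (numSplits <= 0 || start >= end)
            throw new IllegalArgumentException("need numSplits > 0 and start < end");

        // Use BigInteger so the range width cannot overflow a long for very wide ranges.
        BigInteger width = BigInteger.valueOf(end).subtract(BigInteger.valueOf(start));
        BigInteger step = width.divide(BigInteger.valueOf(numSplits));

        List<Subrange> result = new ArrayList<>(numSplits);
        BigInteger cursor = BigInteger.valueOf(start);
        for (int i = 0; i < numSplits; i++)
        {
            // The last subrange absorbs any rounding remainder so the union covers (start, end].
            BigInteger next = (i == numSplits - 1) ? BigInteger.valueOf(end) : cursor.add(step);
            result.add(new Subrange(cursor.longValueExact(), next.longValueExact()));
            cursor = next;
        }
        return result;
    }
}
```

The size-capped variant listed above would additionally bound how much data a single repair session (and a single table) touches, which is why it is called out as crucial for incremental repair.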
Incremental Repair Specific Features
As highlighted above, onboarding/offboarding[1] incremental repair is quite painful. The challenge amplifies if we want to do this on a large Cassandra cluster or across an extensive fleet of Cassandra clusters. Making it smooth and reliable is a painstaking task, so the design already incorporates the following:
- A way to onboard/offboard incremental repair
- No restart required
- No keyspaces have to be specified
Safety Against 100% Disk Full
Incremental repair can lead to a 100% full disk in corner-case scenarios. A new mechanism has been added to alleviate disk usage: if the disks are 80% full, then current and future incremental repair sessions will automatically stop.
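As an illustration, the safeguard check could look roughly like the sketch below. The 80% threshold comes from the description above; the class name and the way the data directory is passed in are assumptions made for this sketch, not the PR's actual code.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;

final class DiskUsageGuard
{
    // Illustrative threshold matching the 80% mentioned above.
    static final double MAX_DISK_USAGE_PERCENT = 80.0;

    // Returns true if incremental repair should be skipped because the data volume is too full.
    static boolean shouldSkipRepair(Path dataDirectory) throws IOException
    {
        FileStore store = Files.getFileStore(dataDirectory);
        double usedPercent = 100.0 * (store.getTotalSpace() - store.getUsableSpace()) / store.getTotalSpace();
        return usedPercent >= MAX_DISK_USAGE_PERCENT;
    }
}
```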
Safety Against Materialized View and CDC
It is not recommended (due to existing bugs) to use incremental repair on a table if a Materialized View and/or CDC is enabled. The scheduler additionally confirms whether the "write path" is disabled (CASSANDRA-17666) during streaming; if the "write path" is not disabled, it skips incremental repair for that table, otherwise it proceeds.
Ongoing Work [2024]
This work will likely be finished by December 2024.
- Safeguard full repair with disk protection, i.e., fail current and future full repair sessions if the disk is already at 80% (CASSANDRA-20045)
- Incorporate Preview repair (CASSANDRA-20046)
- Make incremental repair more reliable by adding a token splitter based on unrepaired data size (CASSANDRA-20047)
Future Work [2025]
The discussion will commence by December 2024, but the implementation work will happen in early 2025.
- Repair should be major version forward/backward compatible.
- [2024] Bring an official alignment through a new ML discussion to ensure everybody is on the same page. Some work is already done as part of CASSANDRA-7530.
- [2025] Add test cases that will catch if compatibility is broken (CASSANDRA-20042)
- Table-level tracking: If a node is restarted while repair is ongoing, resume the repair from where it left off last time as opposed to from scratch (CASSANDRA-20044)
- [2024] Design discussion
- [2024/25] Implementation
- Auto-delete snapshots at X% disk full (CASSANDRA-20035)
- [2025] Design discussion
- [2025] Implementation
- Stop the scheduler if two major versions are detected [2025] (CASSANDRA-20048)
- Note: This will not be needed once we have a robust CASSANDRA-20042, but until then it is a safeguard
Observability
The scheduler should provide enough metrics for an operator to make informed decisions:
- Per-node visibility
- Total time taken to repair
- Repair start time
- Repair end time
- Failed/successful/skipped token-ranges
- Is repair currently in progress?
- Skipped Tables
- Skipped Datacenters
Using the above metrics, an operator can generate a global view. For example, the PR emits the following metrics from every single node:
| Metric Name | Description |
|---|---|
| RepairsInProgress | Whether repair is in progress on the node |
| NodeRepairTimeInSec | Time taken to repair the node, in seconds |
| ClusterRepairTimeInSec | Time taken to repair the entire Cassandra cluster, in seconds |
| LongestUnrepairedSec | Time since the last repair ran on the node, in seconds |
| SucceededTokenRangesCount | Number of token ranges successfully repaired on the node |
| FailedTokenRangesCount | Number of token ranges that failed to repair on the node |
| SkippedTokenRangesCount | Number of token ranges skipped on the node |
| TotalMVTablesConsideredForRepair | Number of materialized views considered on the node |
| TotalDisabledRepairTables | Number of tables on which repair has been disabled on the node |
| RepairTurnMyTurn | The node was repaired naturally on its turn (common) |
| RepairTurnMyTurnDueToPriority | The node was repaired due to priority set by the admin (uncommon) |
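As an illustration, per-node gauges like those above could be registered through the Dropwizard metrics library that Cassandra's metrics are built on. The wiring below is a sketch with illustrative names (AutoRepairState and its methods are assumptions), not the PR's code.

```java
import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;

final class AutoRepairMetricsSketch
{
    // Illustrative holder for the scheduler's bookkeeping; not a real Cassandra class.
    interface AutoRepairState
    {
        boolean isRepairInProgress();
        long nodeRepairTimeInSec();
        long failedTokenRangesCount();
    }

    AutoRepairMetricsSketch(MetricRegistry registry, AutoRepairState state)
    {
        // Metric names mirror the table above; the remaining gauges follow the same pattern.
        registry.register("RepairsInProgress", (Gauge<Boolean>) state::isRepairInProgress);
        registry.register("NodeRepairTimeInSec", (Gauge<Long>) state::nodeRepairTimeInSec);
        registry.register("FailedTokenRangesCount", (Gauge<Long>) state::failedTokenRangesCount);
    }
}
```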
Migration Plan
- Adopting the scheduler should be extremely easy and require users to make only minor changes, such as tweaking its configuration options.
- The scheduler should not modify the integral parts of the anti-entropy algorithm itself.
- All the configuration options should be disabled by default, making the scheduler an opt-in so it does not interfere with the Cassandra version upgrades.
- The opt-in should not require any cluster restart; one should be able to opt in simply by activating it through nodetool.
Test Plan
- Capability to tune the following important scheduler properties without restarting the Cassandra cluster:
- Turning on/off
- Increasing/decreasing parallelism
- Allowing/disallowing tables for repair
- The scheduler should be thoroughly tried and tested against real-world synthetic and live traffic
- Unit tests for the new subsystems
- Unit tests for all configuration
- DTest (preferably In-JVM dtest) covering the basic scheduling functionality
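For the in-JVM dtest item above, a skeleton along the following lines could exercise the scheduler end to end. The cluster bootstrap uses the existing in-JVM dtest API; enabling the scheduler and asserting on its progress are left as placeholders, since those hooks are part of the new code under review.

```java
import org.junit.Test;

import org.apache.cassandra.distributed.Cluster;
import org.apache.cassandra.distributed.api.ConsistencyLevel;
import org.apache.cassandra.distributed.api.Feature;

public class AutoRepairSchedulerSmokeTest
{
    @Test
    public void schedulerRepairsAllRangesEventually() throws Exception
    {
        // Three-node in-JVM cluster with gossip and internode networking enabled.
        try (Cluster cluster = Cluster.build(3)
                                      .withConfig(c -> c.with(Feature.GOSSIP, Feature.NETWORK))
                                      .start())
        {
            cluster.schemaChange("CREATE KEYSPACE ks WITH replication = " +
                                 "{'class': 'SimpleStrategy', 'replication_factor': 3}");
            cluster.schemaChange("CREATE TABLE ks.tbl (k int PRIMARY KEY, v int)");
            cluster.coordinator(1).execute("INSERT INTO ks.tbl (k, v) VALUES (?, ?)",
                                           ConsistencyLevel.ALL, 1, 1);

            // Placeholder: enable the scheduler for full repair and poll the repair
            // history table / metrics until every token range reports as repaired.
        }
    }
}
```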
Scale
- This design has been thoroughly tried and tested on an immense scale (hundreds of unique Cassandra clusters, tens of thousands of Cassandra nodes, with tens of millions of QPS) on top of 4.1 open-source; please see more details here.
- Another reference
Detailed Design Doc
PR
4.1.6
Many folks are currently using 4.1.6 in production. Hence, the following PR on 4.1.6 will make it easier for everybody to review the code, test it, etc. If the community decides to merge this CEP, then it will land only on trunk; the 4.1.6 PR is merely for code review and testing purposes.
https://github.com/apache/cassandra/pull/3367/
Trunk
https://github.com/apache/cassandra/pull/3598
Dtest fix
https://github.com/apache/cassandra-dtest/pull/270
Discussion over Slack
Post-Vote Updates
The following features have been added post-vote:
- Reliable incremental repair by splitting token ranges with a max cap on the size of data repaired in one repair session, as well as a max cap at the table level. (Done)
- Introduced a new repair category, "Preview repair". (Done)
- Extended the CQL table property to assign repair priority. With this feature, if priority is set for certain tables, then they will be repaired before others. (Done)
- By default, disable repair on all the tables in system_traces keyspace because the keyspace does not hold critical data. (Done)
- Whether to allow a node to take its turn running repair while one or more replicas run repair for the same repair type. (Done)
- Whether to allow a node to take turns running repair when a replica executes another repair type. For example, incremental repairs for this node are prevented if a replica executes full repairs. (Done)
- Automatically disabling repair (powered by Guardrail) when two different Cassandra versions are detected in the ring. (In Review)
Please follow this epic for continuous updates: CASSANDRA-19918
Other Solutions
We have a few other ready-made solutions available, some of which are being used in the industry at scale in private forks. Overall, these solutions have been evaluated, and the reasons they do not look like a good long-term solution for Apache Cassandra are the following:
- Operational complexity of setting it up, e.g., dependency on a control plane
- Lack of integration with the DB
- Lack of out-of-the-box availability
- Lack of support for multiple repair types
- Not capable of running incremental repair reliably and at scale
- Not having been donated to the Apache Software Foundation
Solution#1 Scheduled Repair by Joey Lynch (Last active: 2018-11-20)
There have been many attempts to automate repair in Cassandra, which makes sense given that it is necessary to give our users eventual consistency. Most recently CASSANDRA-10070, CASSANDRA-8911 and CASSANDRA-13924 have all looked for ways to solve this problem.
At Netflix we've built a scheduled repair service within Priam (our sidecar), which we spoke about last year at NGCC. Given the positive feedback at NGCC we focussed on getting it production ready and have now been using it in production to repair hundreds of clusters, tens of thousands of nodes, and petabytes of data for the past six months. Also based on feedback at NGCC we have invested effort in figuring out how to integrate this natively into Cassandra rather than open sourcing it as an external service (e.g. in Priam).
As such, vinaykumarcse and I would like to re-work and merge our implementation into Cassandra, and have created a design document showing how we plan to make it happen, including the user interface.
As we work on the code migration from Priam to Cassandra, any feedback would be greatly appreciated about the interface or v1 implementation features. I have tried to call out in the document features which we explicitly consider future work (as well as a path forward to implement them in the future) because I would very much like to get this done before the 4.0 merge window closes, and to do that I think aggressively pruning scope is going to be a necessity.
Ticket: CASSANDRA-14346
Detailed design doc: Scheduled Repair
Solution#2 Automatic repair scheduling by Marcus Olsson (Last active: 2016-06-01)
Scheduling and running repairs in a Cassandra cluster is most often a required task, but this can both be hard for new users and it also requires a bit of manual configuration. There are good tools out there that can be used to simplify things, but wouldn't this be a good feature to have inside of Cassandra? To automatically schedule and run repairs, so that when you start up your cluster it basically maintains itself in terms of normal anti-entropy, with the possibility for manual configuration.
Ticket: CASSANDRA-10070
Detailed design doc: Distributed Repair Scheduling
Solution#3 Reaper by The Last Pickle / Spotify (Last active: TBD)
Reaper is a centralized, stateful, and highly configurable tool for running Apache Cassandra repairs against single or multi-site clusters.
The current version supports running Apache Cassandra cluster repairs in a segmented manner, opportunistically running multiple parallel repairs at the same time on different nodes within the cluster. Basic repair scheduling functionality is also supported.
Reaper comes with a GUI, which, if you're running in local mode, can be found at http://localhost:8080/webui/
Please see the Issues section for more information on planned development, and known issues.
Details: https://github.com/thelastpickle/cassandra-reaper
Solution#4 ecChronos by Ericsson (Last active: TBD)
ecChronos is a decentralized scheduling framework primarily focused on performing automatic repairs in Apache Cassandra.
The aim of ecChronos is to provide a simple yet effective scheduler that helps in maintaining a cassandra cluster. It is primarily used to run repairs but can be extended to run all manner of maintenance work as well.
- Automate the process of keeping cassandra repaired.
- Split a table repair job into many smaller subrange repairs
- Expose statistics on how well repair is keeping up with the churn of data
- Flexible through many different plug-in points to customize to your specific use case
ecChronos is a helper application that runs next to each instance of Apache Cassandra. It handles maintenance operations for the local node. The repair tasks make sure that each node runs repair once every interval. The interval is configurable but defaults to seven days.
More details on the underlying infrastructure can be found in ARCHITECTURE.md.
More information on the REST interface of ecChronos is described in REST.md.
Details: https://github.com/Ericsson/ecchronos?tab=readme-ov-file
5 Comments
Michael Semb Wever
Can we avoid content changes to this page post vote please. It's valuable to see in retrospect what was voted on. Ongoing work can be recorded elsewhere I believe, e.g. the ticket(s).
(But can we update the Current state and Discussion fields, with links, in the beginning.)
JAYDEEPKUMAR CHOVATIA
Sorry, I was not aware that we are not supposed to edit the content after passing the VOTE.
Sure, let me revert it.
JAYDEEPKUMAR CHOVATIA
Reverted to yesterday's version and only updated the Current state and Discussion fields.
Michael Semb Wever
Thanks! Where will the new information go? (You can also add a 'Post-Vote Updates' section to this page, if tickets or any other place doesn't work.)
JAYDEEPKUMAR CHOVATIA
That's a pretty good idea. I have just added a new section, "Post-Vote Updates," capturing all the improvements in the last 2 months, and I will continue to keep this up-to-date.
Thank you!