Status

Current state: Merged

Discussion thread: here

JIRA: CASSANDRA-16921

Released: Unreleased

Audience: Cassandra Developers
User Impact: None
Target Release: Q3 2021

Motivation

As in all distributed systems, proving correctness in Cassandra is challenging. We have made great strides in testing with in-jvm dtests, Harry and other approaches. However, these all either require fairly invasive and deliberate perturbations to the normal ordering of events in the system in order to elicit specific conditions, regressions, etc., or must be made immune to changes in the ordering of events in order for the test to run successfully.

Ideally, boundary and unexpected conditions and event orderings could be elicited automatically, without any manual intervention, whether for cluster-level tests or for those at the component or class level. Tests written only to assert internal consistency of state could then explore far more system behaviours for the same level of investment.

This work dovetails well with Harry, which automates the exploration and validation of different data patterns and workloads. We hope to eventually combine the two approaches.

Goals

A facility for simulating a cluster and actions upon it (or simpler unit tests), such that the behaviour is deterministic and repeatable but pseudo-random
Necessary modifications to Cassandra to facilitate this, including improvements to relevant APIs

Proposed Changes

Refactor internal APIs around concurrency to support mock implementations that are able to control execution, including:
- SimpleCondition, Semaphore, CountDownLatch, BlockingQueue, etc.
- Executors, futures, starting threads, etc., including important improvements to consistency of approach in the codebase
- The use of currentTimeMillis and nanoTime

The replacement of java.io.File with a wrapper on java.nio.file.Path providing an ergonomic API, and some improvements to consistency of file handling
Support for alternative streaming implementations
Improvements to the dtest API to support necessary functionality
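As an illustration of the kind of refactor involved for currentTimeMillis and nanoTime, the following is a minimal sketch of routing time reads through an interface so that a test can substitute a controllable implementation. All names here (SketchClock, SystemSketchClock, SimulatedSketchClock, Deadline) are hypothetical and chosen for illustration; they are not the APIs introduced by this work.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical interface: a single injection point for all time reads.
interface SketchClock {
    long nanoTime();
    long currentTimeMillis();
}

// Production-style implementation that simply delegates to the JVM.
final class SystemSketchClock implements SketchClock {
    public long nanoTime() { return System.nanoTime(); }
    public long currentTimeMillis() { return System.currentTimeMillis(); }
}

// Simulated implementation: time only moves when the test advances it.
final class SimulatedSketchClock implements SketchClock {
    private final long epochMillisAtStart;
    private long nanos;

    SimulatedSketchClock(long epochMillisAtStart) {
        this.epochMillisAtStart = epochMillisAtStart;
    }

    public long nanoTime() { return nanos; }

    public long currentTimeMillis() {
        return epochMillisAtStart + TimeUnit.NANOSECONDS.toMillis(nanos);
    }

    void advance(long duration, TimeUnit unit) {
        nanos += unit.toNanos(duration);
    }
}

// Example consumer that depends on the interface rather than on System.* directly.
final class Deadline {
    private final SketchClock clock;
    private final long deadlineNanos;

    Deadline(SketchClock clock, long timeout, TimeUnit unit) {
        this.clock = clock;
        this.deadlineNanos = clock.nanoTime() + unit.toNanos(timeout);
    }

    // Subtraction-based comparison stays correct if nanoTime wraps around.
    boolean expired() {
        return clock.nanoTime() - deadlineNanos >= 0;
    }
}
```

With this shape, a test can construct a Deadline against a SimulatedSketchClock and decide exactly when it expires, instead of sleeping and hoping the wall clock cooperates.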

Introduction of a simulator package, containing:
- Mock implementations of all systems that control event ordering, including those mentioned above, as well as:
  - Object monitors
  - Network messages
- A framework for intercepting events on these mock systems and translating them into events to be scheduled and evaluated in arbitrary order
- A system for orchestrating random modifications to cluster topology that should not affect the correctness of operations on the system (initially this will be quite strict as to how these events occur, given Cassandra’s present weakness in performing these reliably)
- Byte weaving class loaders for modifying execution to:
  - Intercept monitor entry/exit and control when these occur
  - Intercept the invocation of certain global methods whose implementation we mock
  - Replace certain non-deterministic constructs, such as IdentityHashMap, Object.hashCode() and Enum.hashCode(), with deterministic ones
  - Pseudo-randomly pause thread execution either side of important (ordinarily non-blocking) synchronisation events, such as atomic field updating, volatile field access, etc.
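The core idea of scheduling intercepted events in an arbitrary but repeatable order can be sketched with a toy executor whose ordering decisions all come from a seeded java.util.Random. This is only an illustrative assumption of how such a scheduler might look; SeededScheduler and SeededSchedulerDemo are hypothetical names, and the sketch involves no class loading or bytecode modification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical simulated executor: tasks are queued, then run one at a time on the
// calling thread, in an order chosen by a seeded Random.
final class SeededScheduler {
    private final List<Runnable> pending = new ArrayList<>();
    private final Random random;

    SeededScheduler(long seed) {
        this.random = new Random(seed);
    }

    void submit(Runnable task) {
        pending.add(task);
    }

    // Runs until no tasks remain; a running task may submit further tasks.
    void runAll() {
        while (!pending.isEmpty()) {
            Runnable next = pending.remove(random.nextInt(pending.size()));
            next.run();
        }
    }
}

final class SeededSchedulerDemo {
    // Submits five labelled tasks and returns the order in which they ran.
    static List<String> trace(long seed) {
        SeededScheduler scheduler = new SeededScheduler(seed);
        List<String> order = new ArrayList<>();
        for (String name : new String[] {"a", "b", "c", "d", "e"})
            scheduler.submit(() -> order.add(name));
        scheduler.runAll();
        return order;
    }

    public static void main(String[] args) {
        System.out.println(trace(42L));
        System.out.println(trace(42L)); // same seed, so the same order as the line above
        System.out.println(trace(43L)); // different seed, possibly a different order
    }
}
```

Because the seed is the only source of ordering decisions, a failure observed with a given seed can be replayed by re-running with that seed, while sweeping over seeds explores different orderings.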

Introduction of test cases using the new facilities, including:
- A linearizability verifier for LWTs
- A unit test to expose concurrency bugs in an individual class
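To give a flavour of a class-level test of this kind, the sketch below searches seeds for an interleaving that breaks a deliberately racy counter. It is a simplification under stated assumptions: the increment is split into read and write steps by hand, standing in for the pauses that the simulator would inject around such operations, and RacyCounter, IncrementTask and InterleavingSearch are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Class under test: an unsynchronised counter.
final class RacyCounter {
    int value;
}

// One increment, split into a read step and a write step so that the
// scheduler can interleave another task between them.
final class IncrementTask {
    private final RacyCounter counter;
    private int observed;
    private int step;

    IncrementTask(RacyCounter counter) {
        this.counter = counter;
    }

    boolean done() {
        return step == 2;
    }

    void runStep() {
        if (step == 0) observed = counter.value;   // read
        else counter.value = observed + 1;         // write based on the earlier read
        step++;
    }
}

final class InterleavingSearch {
    // Runs two increments, interleaved step by step as dictated by the seed,
    // and returns the final counter value (2 if no update was lost).
    static int run(long seed) {
        Random random = new Random(seed);
        RacyCounter counter = new RacyCounter();
        List<IncrementTask> tasks = new ArrayList<>();
        tasks.add(new IncrementTask(counter));
        tasks.add(new IncrementTask(counter));
        while (!tasks.isEmpty()) {
            IncrementTask task = tasks.get(random.nextInt(tasks.size()));
            task.runStep();
            if (task.done()) tasks.remove(task);
        }
        return counter.value;
    }

    // Returns the first seed in [0, maxSeeds) that loses an update, or -1 if none does.
    static long findFailingSeed(int maxSeeds) {
        for (long seed = 0; seed < maxSeeds; seed++)
            if (run(seed) != 2) return seed;
        return -1;
    }
}
```

The test itself only asserts an internal-consistency property (two increments yield 2); the seed search supplies the event orderings, and any failing seed can be passed back to run to reproduce the lost update.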

Test Plan

API refactors will come with some associated improvements to unit test coverage
The system will be accompanied by unit tests of its own functionality
The system itself and the introduced use cases represent the majority of the introduced testing, and constitute a significant expansion of test coverage for the project.

CEP-10: Cluster and Code Simulations
