Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Chaos testing is a methodology used to proactively identify and address weaknesses in a software system's resilience. It involves deliberately injecting faults, failures, or unexpected behaviors into the system to simulate real-world conditions.

A chaos testing framework is a structured approach or set of tools used to implement chaos engineering practices within software systems. It provides a systematic way to simulate various failures, faults, and unexpected behaviors in order to proactively identify weaknesses and improve the resilience of the system. Key aspects of a chaos testing framework include:

  • Resilience Improvement: Chaos testing helps improve the system's ability to withstand unexpected failures and disruptions by exposing weaknesses that may not be evident under normal conditions.
  • Real-World Simulation: It simulates real-world conditions where systems may experience network outages, hardware failures, or other unexpected events, providing more realistic testing scenarios.
  • Early Detection of Issues: By injecting faults deliberately, chaos testing allows teams to detect and fix potential issues before they impact users in production, reducing downtime and customer dissatisfaction.
  • Continuous Improvement: It promotes a culture of continuous improvement by iteratively testing and refining the system's resilience over time, ensuring that it evolves to meet changing demands and challenges.
  • Confidence in Deployments: Teams gain confidence in deploying updates and changes knowing that their systems have been rigorously tested under adverse conditions, reducing the risk of outages or failures post-deployment.

A Celeborn chaos testing framework is proposed in this CIP to simulate various anomalies and ensure coverage of testing scenarios in distributed environments. The purpose of Celeborn chaos testing is to proactively identify weaknesses and improve the resilience of software systems by intentionally injecting faults and disturbances. The main objectives are:

  • Enhancing System Resilience: By simulating real-world failures and disruptions (such as server crashes, network latency, or database errors), chaos testing aims to uncover weaknesses that might not be apparent under normal operating conditions. This helps organizations build more resilient systems that can withstand unexpected events.
  • Risk Mitigation: Chaos testing helps identify potential issues and vulnerabilities in production systems before they affect end users. By detecting and addressing these weaknesses early, teams can reduce the risk of downtime, service disruptions, and customer dissatisfaction.
  • Validating Fault Tolerance Mechanisms: It verifies the effectiveness of fault tolerance mechanisms and recovery strategies implemented within the system. Chaos testing ensures that these mechanisms can adequately handle and recover from failures without compromising system integrity or performance.
  • Cultural and Organizational Improvement: Implementing chaos testing fosters a culture of continuous improvement and proactive resilience within engineering teams. It encourages collaboration, learning, and the adoption of best practices for designing and maintaining robust software architectures.

Therefore, the Celeborn chaos testing framework aims to make software systems more reliable, resilient, and capable of adapting to unforeseen challenges in dynamic and distributed environments. It is an essential practice for organizations looking to ensure high availability and performance in today's complex technological landscapes.

Public Interfaces

The chaos testing framework is divided into three components: Scheduler, Runner, and CLI. We propose the following component interfaces to support the chaos testing framework.

VerificationPlan

VerificationPlan is designed as the plan interface that serves as the communication bridge and operational entity of the chaos testing framework. A VerificationPlan is composed of Actions, a Trigger, and a Checker, which together allow the plan to be executed by the Scheduler.

VerificationPlan
class VerificationPlan(val actions: List[Action], val trigger: Trigger, val clusterChecker: Checker)
  extends Serializable {}

Operation

Operation is the execution unit of the Runner component; it defines how a chaos testing anomaly is executed on a Runner. The Scheduler submits the Operations corresponding to a VerificationPlan to Runners and verifies their execution results. The execution result of an Operation represents the outcome of simulating the corresponding anomaly action.

Operation
abstract class Operation(
    val target: ActionTarget,
    val updateContextBlock: (RunnerContext => Unit) = ctx => {},
    val interval: Long)
  extends Serializable with Logging {
  def executeOnRunner(context: RunnerContext): OperationResult
}
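
As an illustration of how an Operation might be implemented, the hypothetical sketch below stops a Celeborn Worker by running the configured stop script on the Runner node. The StopWorkerOperation class and the stand-in types are assumptions made for this example only; they are not part of the proposed interfaces.

StopWorkerOperation (Sketch)
import scala.sys.process._

// Simplified stand-ins for types the Operation interface references but this CIP does not spell out.
case class ActionTarget(targetType: String, index: Int)
case class RunnerContext(workerAlive: Boolean = true)
case class OperationResult(success: Boolean, message: String)

// Hypothetical "stop worker" Operation: the Runner invokes the configured stop script
// and reports the exit code back to the Scheduler as the execution result.
class StopWorkerOperation(val target: ActionTarget, val interval: Long, stopScript: String)
  extends Serializable {
  def executeOnRunner(context: RunnerContext): OperationResult = {
    val exitCode = Seq("bash", stopScript).!  // e.g. $CELEBORN_HOME/sbin/stop-worker.sh
    OperationResult(exitCode == 0, s"stop-worker exited with $exitCode on worker ${target.index}")
  }
}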

Selector

Selector is a pluggable interface for selection strategies that choose Celeborn Master nodes, Worker nodes, and disks within a specified interval. The Selector determines the targets of an Action, which are used to generate Operations. The default implementations of Selector support assign and random selection strategies, with target types including master, worker, disk, and runner.

Selector
abstract class Selector(val interval: Long) extends Serializable {
  def select(schedulerContext: SchedulerContext, target: String): List[ActionTarget]
  def getInterval(): Long = this.interval
}
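
For illustration, a random selection strategy could look like the sketch below. The SchedulerContext fields and the ActionTarget shape are assumptions made only for this example.

RandomSelector (Sketch)
import scala.util.Random

// Simplified stand-ins for types used by the Selector interface.
case class ActionTarget(targetType: String, index: Int)
case class SchedulerContext(masterIndices: Seq[Int], workerIndices: Seq[Int])

// Hypothetical random strategy: pick one node of the requested target type at random.
class RandomSelector(interval: Long) extends Serializable {
  def select(schedulerContext: SchedulerContext, target: String): List[ActionTarget] = {
    val candidates = target match {
      case "master" => schedulerContext.masterIndices
      case "worker" => schedulerContext.workerIndices
      case _        => Seq.empty[Int]
    }
    if (candidates.isEmpty) Nil
    else List(ActionTarget(target, candidates(Random.nextInt(candidates.size))))
  }
}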

Action

The Action interface is proposed to represent anomaly behaviors such as killing Celeborn Master and Worker processes, IO hangs, CPU occupation, and metadata or disk corruption. The Scheduler uses the selection strategy of an Action to generate the Operations executed on Runners to simulate anomalies. In addition, an Action deduces the context that would result from its Operations, which the Checker then validates.

Action
abstract class Action(val selector: Selector, val interval: Long)
  extends Serializable with Logging {
  val target: String = ResourceTypeConstants.TARGET_SCHEDULER
  val tendency = ActionTendencyConstant.NEUTUAL
  def identity(): String
  def generateOperations(context: SchedulerContext): List[Operation]
  def deduce(context: SchedulerContext, operations: List[Operation]): SchedulerContext
}
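
As an example of how generateOperations and deduce fit together, the hypothetical sketch below models a "stopworker" Action: the Selector picks the target workers, one Operation is produced per target, and the deduced context marks those workers as down once the Operations run. All types other than the method shapes shown above are simplified stand-ins.

StopWorkerAction (Sketch)
// Simplified stand-ins for types referenced by the Action interface.
case class ActionTarget(targetType: String, index: Int)
case class SchedulerContext(aliveWorkers: Set[Int])
case class Operation(target: ActionTarget, interval: Long)
trait Selector { def select(ctx: SchedulerContext, target: String): List[ActionTarget] }

// Hypothetical "stopworker" Action.
class StopWorkerAction(selector: Selector, interval: Long) extends Serializable {
  val target: String = "worker"

  def identity(): String = "stopworker"

  // One Operation per selected worker.
  def generateOperations(context: SchedulerContext): List[Operation] =
    selector.select(context, target).map(t => Operation(t, interval))

  // Deduce the context the cluster would be in after the Operations execute.
  def deduce(context: SchedulerContext, operations: List[Operation]): SchedulerContext =
    context.copy(aliveWorkers = context.aliveWorkers -- operations.map(_.target.index))
}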

Trigger

The Scheduler supports different policies for triggering an Action to generate the anomaly Operations simulated on Runners. Therefore, Trigger is introduced to define the policy, such as random or sequence, for triggering Actions in the Scheduler.

Trigger
class Trigger(val policy: String, val interval: Interval, val repeat: Int) extends Serializable
  with Logging {}

Checker

The Checker monitors the resource status of Celeborn Masters, Workers, disks, CPU, the Scheduler, and Runners. The Scheduler also uses the Checker to validate, from the deduced context, whether resources have reached the minimum usable set.

Checker
trait Checker extends Serializable with Logging {
  def validate(deducedContext: SchedulerContext): Boolean
  def tendency: Int
  def avaliableTargets: util.HashSet[String]
}
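
For illustration, the "resource" Checker referenced by the verification plans below could validate the minimum usable set described in the Robustness section (a Master quorum of n/2 + 1 plus at least 2 Workers with a usable disk). This is a hypothetical sketch; the SchedulerContext fields are assumptions, and the remaining Checker members are omitted.

ResourceChecker (Sketch)
// Simplified stand-in for the deduced context the Checker inspects.
case class SchedulerContext(totalMasters: Int, aliveMasters: Int, workersWithUsableDisk: Int)

// Hypothetical "resource" Checker: reject further anomalies once the cluster would drop
// below the minimum usable set.
class ResourceChecker extends Serializable {
  def validate(deducedContext: SchedulerContext): Boolean =
    deducedContext.aliveMasters >= deducedContext.totalMasters / 2 + 1 &&
      deducedContext.workersWithUsableDisk >= 2
}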

Command

The Command interface is introduced to define the CLI commands, covering plan operations such as submitting, pausing, and resuming a plan. The CLI component supports SubmitCommand, PausePlanCommand, ResumePlanCommand, ReportCommand, and StopCommand to drive chaos testing.

Command
trait Command extends Logging {
  def init(args: Array[String])
  def execute(commandContext: CommandContext)
}
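
As a simple illustration of the Command interface, the hypothetical sketch below pauses the current plan. The CommandContext fields and the sendPauseRequest helper are assumptions made for this example.

PausePlanCommand (Sketch)
// Simplified stand-in for the context passed to commands.
case class CommandContext(schedulerHost: String, schedulerPort: Int)

// Hypothetical pause command: validate the arguments and ask the Scheduler to pause the plan.
class PausePlanCommand {
  def init(args: Array[String]): Unit =
    require(args.isEmpty, "pause takes no arguments")

  def execute(commandContext: CommandContext): Unit =
    sendPauseRequest(commandContext.schedulerHost, commandContext.schedulerPort)

  // Illustrative stub; a real implementation would call the Scheduler endpoint.
  private def sendPauseRequest(host: String, port: Int): Unit = ()
}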

The Celeborn chaos testing framework uses a separate configuration file, named verifier.conf by default. The proposed configurations in verifier.conf include:

Configuration | Default | Description
verf.runner.test.mode | false | Whether to skip the node resource check process.
verf.scheduler.address | localhost:19097 | The endpoint address of the Scheduler.
verf.scripts.master.start.script | $CELEBORN_HOME/sbin/start-master.sh | The script location for starting the Celeborn Master.
verf.scripts.master.stop.script | $CELEBORN_HOME/sbin/stop-master.sh | The script location for stopping the Celeborn Master.
verf.scripts.worker.start.script | $CELEBORN_HOME/sbin/start-worker.sh | The script location for starting the Celeborn Worker.
verf.scripts.worker.stop.script | $CELEBORN_HOME/sbin/stop-worker.sh | The script location for stopping the Celeborn Worker.
verf.block.bad.inflight.location | /root/badblock/inflight | The inflight location of the bad block.
verf.runner.timeout | 30s | The timeout limit for the Runner.
verf.runner.register.retry.delay | 5s | The registration retry delay for the Runner.
verf.runner.heartbeat.delay | 30s | The initial delay for the Runner heartbeat check.
verf.runner.heartbeat.interval | 30s | The heartbeat interval for the Runner.
verf.plan.action.selector.default.interval | 5s | The default interval for the Action selector.
verf.plan.action.default.interval | 5s | The default interval for all Actions.
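
For reference, a minimal verifier.conf might look like the following. The space-separated key-value layout mirrors celeborn-defaults.conf and is an assumption, and the values are only illustrative.

verifier.conf (Example)
verf.scheduler.address              host1:19097
verf.scripts.master.start.script    /opt/celeborn/sbin/start-master.sh
verf.scripts.master.stop.script     /opt/celeborn/sbin/stop-master.sh
verf.scripts.worker.start.script    /opt/celeborn/sbin/start-worker.sh
verf.scripts.worker.stop.script     /opt/celeborn/sbin/stop-worker.sh
verf.runner.timeout                 60s
verf.runner.heartbeat.interval      15s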

Proposed Changes

Based on the description of component interfaces, here's a breakdown of the Celeborn chaos testing framework components.

Components

  1. Scheduler
    • Executes testing plans.
    • Generates anomaly events. Current anomaly event types include:
      • Kill master process
      • Kill worker process
      • Worker directory not writable
      • Worker disk IO hang
      • High CPU load
      • Master node metadata corruption (cannot be executed in random testing mode; throws an error during test plan parsing).
    • Monitors cluster resource status to determine if resources have reached the minimum usable set.
  2. Runner
    • Monitors node resource status, including whether the master process exists, worker process exists, and if disks are readable/writable and not in an IO hang state.
    • Executes anomaly events.
  3. CLI
    • Queries current status of various resources in the testing environment.
    • Pauses or resumes workflows.
    • Submits testing plans.

These components provide a comprehensive framework for chaos testing that simulates and manages various failure scenarios in distributed environments.

The architecture diagram of the Celeborn chaos testing framework is shown below:

Fig 1. Celeborn Chaos Testing Framework Architecture Diagram

The main interaction process between the Scheduler, Runner, and CLI components is shown in Figure 2:

  1. The Runner registers itself with the Scheduler upon startup.
  2. CLI submits VerificationPlan to the Scheduler via the SubmitCommand. VerificationPlan includes Actions, Trigger and Checker.
  3. The Scheduler executes the received VerificationPlan repeatedly in a cycle (a simplified sketch of one cycle follows Figure 2):
    • The Scheduler triggers Action with the specified policy.
    • Operations are generated via Action.
    • The Action deduces the resulting SchedulerContext from the generated Operations.
    • The Checker validates, from the deduced context, whether resources have reached the minimum usable set.
    • Each Operation is submitted to a Runner thread for execution.
    • The operation runner executes the anomaly Operation received from the Scheduler and relays the execution result back to the Scheduler.
    • The SchedulerContext is updated with the execution result of the Operation.
  4. Node resource statuses, including Celeborn Master and Worker processes, disks, and IO, are monitored and reported to the Scheduler.


Fig 2. Celeborn Chaos Testing Framework Sequence Diagram
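
To make step 3 concrete, the heavily simplified sketch below walks through one Scheduler cycle using the interfaces proposed above. pickAction and submitToRunner are illustrative stubs, not proposed APIs.

Scheduler Cycle (Sketch)
object SchedulerCycleSketch {
  def runOnce(plan: VerificationPlan, context: SchedulerContext): SchedulerContext = {
    val action     = pickAction(plan.actions, plan.trigger)  // trigger an Action per policy
    val operations = action.generateOperations(context)      // generate anomaly Operations
    val deduced    = action.deduce(context, operations)      // deduce the post-action context
    if (!plan.clusterChecker.validate(deduced)) {
      context                                                // below the minimum usable set: skip
    } else {
      operations.foreach(submitToRunner)                     // execute the Operations on Runners
      deduced                                                // updated context for the next cycle
    }
  }

  // Illustrative stubs only.
  private def pickAction(actions: List[Action], trigger: Trigger): Action = actions.head
  private def submitToRunner(operation: Operation): Unit = ()
}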

CLI

We propose the following usage and commands of CLI:

  • Usage
CLI
 schedulerHost schedulerPort cmd args*
  • Commands
Command | Arguments | Description
submit | /xxx/plan.json | Submit the JSON file of the verification plan.
stop | | Stop executing the verification plan without restoring cluster state, and reset the Scheduler state.
pause | | Pause execution of the verification plan without resetting the Scheduler state.
resume | | Resume execution of the verification plan.
report | | Query the overall Scheduler status, including the current execution plan and various resource lists.
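
For example, against a Scheduler listening on localhost:19097, typical invocations might look like the lines below, where <cli> stands for the CLI launcher (which the usage above leaves unnamed):

CLI Examples
<cli> localhost 19097 submit /tmp/plan.json
<cli> localhost 19097 pause
<cli> localhost 19097 resume
<cli> localhost 19097 report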

The verification plan is in JSON format and consists of three elements: actions, trigger, and checker:

  • Format: 
Verification Plan JSON
// Note that all interval values in verification plan are time strings and cannot be entered directly as numbers.
// Trigger has two policies: random and sequence.
// random: a checker must be configured.
{
	"actions": [{
		"id": "xxx", // id represents the anomaly action type.
		"interval": "1s", // The time interval needed after the current action completes.
		"selector": {
			"interval": "", // Some actions may have multiple operations. The interval indicates the interval between operations.
			"indices": [], // Represents specific nodes.
			"device": [] // Represents resources related to nodes; for example, CPU, working directory.
		}
	}],
	"trigger": {
		"policy": "random",
		"interval": { // Represents the interval, with two interval type: range and fix.
			"type": "range",
			"start": "5s",
			"end": "15s"
		}
	},
	"checker": "resource"
}
// sequence: a checker cannot be configured.
{
	"actions": [{
		"id": "xxx", // id represents the anomaly action type.
		"interval": "1s", // The time interval needed after the current action completes.
		"selector": {
			"interval": "", // Some actions may have multiple operations. The interval indicates the interval between operations.
			"indices": [], // Represents specific nodes.
			"device": [] // Represents resources related to nodes; for example, CPU, working directory.
		}
	}],
	"trigger": {
		"policy": "sequence",
		"interval": { // Represents the interval, with two interval type: range and fix.
            "type": "fix",
			"value": "10s"
		},
        "repeat": "5" // Indicates how many times the current plan needs to be executed, defaulting to 1.
	}
}
  • Actions
Id | Description | Restriction
startmaster | Start the Celeborn Master process. |
stopmaster | Stop the Celeborn Master process. |
startworker | Start the Celeborn Worker process. |
stopworker | Stop the Celeborn Worker process. |
hangio | Simulate device IO hang. |
resumedisk | Restore the directory to a writable state. |
resumeio | Remove the device IO hang state. |
occupycpu | Preempt CPU cores. |
corruptdisk | Make the directory unwritable. |
corruptmeta | Delete the Ratis metadata of the Celeborn Master. | Not supported in random testing mode.
  • Example
Verification Plan Example
{
	"actions": [{
			"id": "stopmaster",
			"selector": {
				"indices": [
					1,
					2
				]
			},
			"interval": "60s"
		},
		{
			"id": "startmaster",
			"selector": {
				"indices": [
					1,
					2
				]
			},
			"interval": "60s"
		},
		{
			"id": "stopworker",
			"selector": {
				"indices": [
					0
				]
			}
		},
		{
			"id": "startworker",
			"selector": {
				"indices": [
					0
				]
			}
		},
		{
			"id": "hangio",
			"selector": {
				"indices": [
					1
				],
				"device": [
					0,
					1,
					2,
					3
				],
				"interval": "15s"
			}
		},
		{
			"id": "occupycpu",
			"cores": 4,
			"duration": "45s",
			"selector": {
				"indices": [
					2
				]
			},
			"interval": "15s"
		},
		{
			"id": "resumeio",
			"selector": {
				"indices": [
					1
				],
				"device": [
					0,
					1,
					2,
					3
				],
				"interval": "15s"
			}
		}
	],
	"trigger": {
		"policy": "sequence",
		"interval": {
			"type": "fix",
			"value": "120s"
		},
		"repeat": "3"
	}
}

Robustness

Trigger various anomalies (excluding master metadata corruption) until the current system reaches its minimum resource state. Then, gradually restore the system until all anomalies are resolved, continuing this cycle indefinitely. For an n-node master Celeborn cluster, the minimum usable resources are defined as having (n/2 + 1) master nodes alive and 2 workers with at least 1 disk available. The verification method is to run a 1 TB TPC-DS workload. During this process, TPC-DS should successfully complete execution with correct results. The verification plan for robustness is shown below:

Robustness Plan JSON
{
	"actions": [{
			"id": "startworker",
			"interval": 1,
			"selector": {
				"type": "random"
			}
		},
		{
			"id": "stopworker",
			"interval": 1,
			"selector": {
				"type": "random"
			}
		},
		{
			"id": "startmaster",
			"interval": 1,
			"selector": {
				"type": "random"
			}
		},
		{
			"id": "stopmaster",
			"interval": 1,
			"selector": {
				"type": "random"
			}
		},
		{
			"id": "corruptdisk",
			"interval": 1,
			"selector": {
				"type": "random"
			}
		},
		{
			"id": "hangio",
			"interval": 1,
			"selector": {
				"type": "random"
			}
		},
		{
			"id": "resumedisk",
			"interval": 1,
			"selector": {
				"type": "random"
			}
		},
		{
			"id": "resumeio",
			"interval": 1,
			"selector": {
				"type": "random"
			}
		},
		{
			"id": "occupycpu",
			"duration": "30s",
			"cores": 5,
			"selector": {
				"type": "random"
			}
		}
	],
	"trigger": {
		"policy": "random",
		"interval": {
			"type": "range",
			"start": "5s",
			"end": "15s"
		}
	},
	"checker": "resource"
}

Predefined

  1. Master Node Failure:

    • One or two master nodes fail intermittently for 30 seconds each time.
  2. Worker Node Failure:

    • One or two worker nodes fail intermittently for 30 seconds each time.
  3. Disk Anomalies:

    • Randomly select a worker node:
      • Test one disk, two disks, or all disks to become unwritable.
  4. Disk IO Hang:

    • Randomly select a worker node:
      • Test one disk, two disks, or all disks to experience IO hang.
  5. Master Metadata Anomaly:

    • While one or two master nodes are stopped, corrupt their metadata before restarting them.

The verification method is to run a query that lasts several minutes. In most cases, the query should execute successfully. However, occasional failures are allowed under specific conditions. The corresponding verification plan is as follows:

Predefined Plan JSON
{
	"actions": [{
			"id": "stopmaster",
			"selector": {
				"indices": [
					1,
					2
				]
			},
			"interval": 30
		},
		{
			"id": "startmaster",
			"selector": {
				"indices": [
					1,
					2
				]
			},
			"interval": 30
		},
		{
			"id": "stopmaster",
			"selector": {
				"indices": [
					2,
					3
				]
			},
			"interval": 30
		},
		{
			"id": "startmaster",
			"selector": {
				"indices": [
					2,
					3
				]
			},
			"interval": 30
		},
		{
			"id": "stopmaster",
			"selector": {
				"indices": [
					2,
					3
				]
			},
			"interval": 30
		},
		{
			"id": "corruptmeta",
			"selector": {
				"indices": [
					2,
					3
				]
			},
			"interval": 30
		},
		{
			"id": "startmaster",
			"selector": {
				"indices": [
					2,
					3
				]
			},
			"interval": 30
		},
		{
			"id": "stopworker",
			"selector": {
				"indices": [
					1,
					2
				]
			},
			"interval": 30
		},
		{
			"id": "startworker",
			"selector": {
				"indices": [
					1,
					2
				]
			},
			"interval": 30
		},
		{
			"id": "corruptdisk",
			"selector": {
				"indices": [
					1,
					2
				],
				"device": [
					1,
					2
				]
			},
			"interval": "120s"
		},
		{
			"id": "resumedisk",
			"selector": {
				"indices": [
					1,
					2
				],
				"device": [
					1,
					2
				]
			}
		},
		{
			"id": "hangio",
			"selector": {
				"indices": [
					1,
					2,
					3
				],
				"device": [
					1,
					2,
					3
				]
			},
			"interval": "120s"
		},
		{
			"id": "resumeio",
			"selector": {
				"indices": [
					1,
					2, 3
				],
				"device": [
					1,
					2,
					3
				]
			}
		},
		{
			"id": "occupycpu",
			"duration": "30s",
			"cores": "8",
			"selector": {
				"indices": [
					1,
					2
				]
			}
		}
	],
	"trigger": {
		"policy": "sequence",
		"interval": {
			"type": "fix",
			"value": "10s"
		},
		"repeat": "5"
	}
}

Compatibility, Deprecation, and Migration Plan

The chaos testing framework is a new feature, so no compatibility, deprecation, or migration plan is needed.

Test Plan

  1. Start the Runner and Scheduler, and use CLI commands to submit verification plan files to simulate the following anomalies and verify Celeborn's stability.
    • Kill master process
    • Kill worker process
    • Worker directory not writable
    • Worker disk IO hang
    • High CPU load
    • Master node metadata corruption
  2. Mock the shuffle process and support the implementation of other corner cases to test each stage of shuffle and enrich the workloads.
  3. Provide helm chart to support deployment of chaos testing framework on Kubernetes.

Rejected Alternatives

There are no rejected alternatives for the chaos testing framework.
