...

Trigger various anomalies (excluding master metadata corruption) until the cluster is reduced to its minimum usable resource state, then gradually restore the system until all anomalies are resolved, and repeat this cycle indefinitely. For an n-node master Celeborn cluster, the minimum usable resources are defined as (n/2 + 1) master nodes alive and 2 workers with at least 1 disk available (for example, a 3-node master cluster must keep at least 2 masters alive). The verification method is to run a 1TB TPC-DS workload; throughout this process, TPC-DS should complete successfully and produce correct results. The robustness verification plan is shown below:

Robustness Plan JSON:
{
	"actions": [{
			"id": "startworker",
			"interval": 1,
			"selector": {
				"type": "random"
			}
		},
		{
			"id": "stopworker",
			"interval": 1,
			"selector": {
				"type": "random"
			}
		},
		{
			"id": "startmaster",
			"interval": 1,
			"selector": {
				"type": "random"
			}
		},
		{
			"id": "stopmaster",
			"interval": 1,
			"selector": {
				"type": "random"
			}
		},
		{
			"id": "corruptdisk",
			"interval": 1,
			"selector": {
				"type": "random"
			}
		},
		{
			"id": "hangio",
			"interval": 1,
			"selector": {
				"type": "random"
			}
		},
		{
			"id": "resumedisk",
			"interval": 1,
			"selector": {
				"type": "random"
			}
		},
		{
			"id": "resumeio",
			"interval": 1,
			"selector": {
				"type": "random"
			}
		},
		{
			"id": "occupycpu",
			"duration": "30s",
			"cores": 5,
			"selector": {
				"type": "random"
			}
		}
	],
	"trigger": {
		"policy": "random",
		"interval": {
			"type": "range",
			"start": "5s",
			"end": "15s"
		}
	},
	"checker": "resource"
}

Predefined Anomalies

  1. Master Node Failure:
    • One or two master nodes fail intermittently for 30 seconds each time (a minimal plan sketch follows this list).
  2. Worker Node Failure:
    • One or two worker nodes fail intermittently for 30 seconds each time.
  3. Disk Anomalies:
    • Randomly select a worker node and make one disk, two disks, or all of its disks unwritable.
  4. Disk IO Hang:
    • Randomly select a worker node and make one disk, two disks, or all of its disks experience IO hang.
  5. Master Metadata Anomaly:
    • When injecting master node anomalies, select one or two master nodes and corrupt their metadata.
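To illustrate how these predefined scenarios map onto plan files, a minimal sketch for the master-failure case is shown below. It reuses only the action ids, selector, trigger, and checker values that already appear in the robustness plan above; expressing the exact 30-second outage window or the one-versus-two node choice may require additional fields defined by the framework, so treat this as an illustrative sketch rather than the actual predefined plan.

Master Failure Plan Sketch (illustrative):
{
	"actions": [{
			"id": "stopmaster",
			"interval": 1,
			"selector": {
				"type": "random"
			}
		},
		{
			"id": "startmaster",
			"interval": 1,
			"selector": {
				"type": "random"
			}
		}
	],
	"trigger": {
		"policy": "random",
		"interval": {
			"type": "range",
			"start": "5s",
			"end": "15s"
		}
	},
	"checker": "resource"
}

With the random trigger firing every 5 to 15 seconds, a stopped master is expected to be restarted by a subsequent startmaster action within a comparable window.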

The verification method is to run a query that lasts several minutes. In most cases, the query should execute successfully; occasional failures are allowed under specific conditions. The corresponding verification plan is as follows:

...

Compatibility, Deprecation, and Migration Plan

 The chaos testing framework is a new feature, so there is no compatibility, deprecation, or migration impact on existing users.

Test Plan

  1. Start the Runner and Scheduler, and use CLI commands to submit verification plan files that simulate the following anomalies to verify Celeborn's stability (a minimal example plan is sketched after this list).
    • Kill master process
    • Kill worker process
    • Worker directory not writable
    • Worker disk IO hang
    • High CPU load
    • Master node metadata corruption
  2. Mock the shuffle process and support implementing other corner cases to test each stage of shuffle and enrich the workloads.
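As a concrete illustration of item 1, a plan file covering the worker-disk cases might look like the sketch below. It is assembled strictly from the actions, trigger, and checker already shown in the robustness plan earlier in this document; the real plan files submitted via the CLI may differ, so this is a sketch rather than a definitive example.

Worker Disk Anomaly Plan Sketch (illustrative):
{
	"actions": [{
			"id": "corruptdisk",
			"interval": 1,
			"selector": {
				"type": "random"
			}
		},
		{
			"id": "resumedisk",
			"interval": 1,
			"selector": {
				"type": "random"
			}
		},
		{
			"id": "hangio",
			"interval": 1,
			"selector": {
				"type": "random"
			}
		},
		{
			"id": "resumeio",
			"interval": 1,
			"selector": {
				"type": "random"
			}
		}
	],
	"trigger": {
		"policy": "random",
		"interval": {
			"type": "range",
			"start": "5s",
			"end": "15s"
		}
	},
	"checker": "resource"
}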

Rejected Alternatives

 The chaos testing framework has no rejected alternatives.