Discussion thread: https://lists.apache.org/thread/oy2n093d488nw29thtnwshn9dz22t627
Vote thread: TBD
JIRA: -
Release: -


Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

As of CELEBORN-1227, Celeborn has a dashboard for viewing and managing operations on Celeborn masters and workers, such as managing configurations and viewing master and worker status. The dashboard provides valuable insight into, and observability of, the Celeborn cluster.

To further improve the operability of Celeborn, we propose a CLI for managing the Celeborn cluster. The CLI would complement the existing Celeborn dashboard and benefit users for the following reasons:

  • Better system admin/ops experience: a CLI allows for more automation and scripting.
  • Easy to maintain: little to no setup is required.
  • Multi-platform usage: it works on remote machines where a GUI might not be accessible.

Given this, we propose a CLI for Celeborn.

Public Interfaces

The CLI would use click, a lightweight Python CLI toolkit. The table below lists the proposed CLI commands, based on the current Celeborn REST API:

MASTER/WORKER/BOTH | SUBCOMMAND | REST API
MASTER | --show-masters-info | /masterGroupInfo
MASTER | --show-cluster-apps | /applications
MASTER | --show-cluster-shuffles | /shuffles
MASTER | --show-top-disk-used-apps | /listTopDiskUsedApps
MASTER | --exclude-worker <worker_id> | /exclude
MASTER | --remove-excluded-worker <worker_id> | /exclude
MASTER | --send-worker-event <WorkerEventType> --worker-list <worker_list> | /sendWorkerEvent
MASTER | --show-worker-event-info | /workerEventInfo
MASTER | --show-lost-workers | /listWorkers
MASTER | --show-excluded-workers | /excludedWorkers
MASTER | --show-shutdown-workers | /shutdownWorkers
MASTER | --show-lifecycle-managers | /hostnames
WORKER | --show-apps-on-worker | /applications
WORKER | --show-shuffles-on-worker | /shuffles
WORKER | --show-top-disk-used-apps | /listTopDiskUsedApps
WORKER | --show-partition-location-info | /listPartitionLocationInfo
WORKER | --show-unavailable-peers | /unavailablePeers
WORKER | --is-shutdown | /isShutdown
WORKER | --is-registered | /isRegistered
WORKER | --exit <TYPE> | /exit
BOTH   | --hostport <host:port> | N/A
BOTH   | --output-type <json/csv/text/etc> | N/A
BOTH   | --cert <path/to/cert> | N/A
BOTH   | --key <path/to/key> | N/A
BOTH   | --cacert <path/to/CA/cert/file> | N/A
BOTH   | --verbose | N/A
BOTH   | --show-conf | /conf
BOTH   | --show-dynamic-configs | /listDynamicConfigs
BOTH   | --show-workers-info | /workerInfo
BOTH   | --thread-dump | /threadDump
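
The mapping from subcommands to REST paths could be sketched roughly as follows. This is only an illustration under several assumptions: it uses the standard-library argparse instead of click so that it runs without extra dependencies, it covers just four rows of the table, and the names ENDPOINTS, build_parser, and resolve_url as well as the placeholder default for --hostport are hypothetical and not part of the proposal.

```python
import argparse

# Hypothetical subset of the command table: (role, flag) -> REST path.
ENDPOINTS = {
    ("master", "show_masters_info"): "/masterGroupInfo",
    ("master", "show_cluster_apps"): "/applications",
    ("worker", "show_apps_on_worker"): "/applications",
    ("worker", "is_shutdown"): "/isShutdown",
}


def build_parser():
    """Build a `celeborn <master|worker> --<action>` parser from ENDPOINTS."""
    parser = argparse.ArgumentParser(prog="celeborn")
    roles = parser.add_subparsers(dest="role", required=True)
    for role in ("master", "worker"):
        sub = roles.add_parser(role)
        # Placeholder default; the real default host and port are an assumption.
        sub.add_argument("--hostport", default="localhost:9098")
        # Exactly one action flag must be given per invocation.
        actions = sub.add_mutually_exclusive_group(required=True)
        for (flag_role, flag) in ENDPOINTS:
            if flag_role == role:
                actions.add_argument("--" + flag.replace("_", "-"), action="store_true")
    return parser


def resolve_url(argv, scheme="https"):
    """Translate CLI arguments into the REST URL the CLI would call."""
    args = build_parser().parse_args(argv)
    for (role, flag), path in ENDPOINTS.items():
        if role == args.role and getattr(args, flag):
            return f"{scheme}://{args.hostport}{path}"
    raise SystemExit("no action flag given")
```

Keeping the table as data means a new REST endpoint only needs a new dictionary entry rather than a new handler.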

Sample Usage


# Get the masters info

celeborn master --show-masters-info --output-type console --cert identity.cert --key identity.key --cacert cacert.crt

# Exclude worker1 from the cluster

celeborn master --exclude-worker worker1 --cert identity.cert --key identity.key --cacert cacert.crt

# Show the applications running on the worker

celeborn worker --show-apps-on-worker --hostport celeborn-worker-123.prod.mycompany.com:8888 --cert identity.cert --key identity.key --cacert cacert.crt

# Check if a worker is shutdown

celeborn worker --is-shutdown --hostport celeborn-worker-123.prod.mycompany.com:8888 --cert identity.cert --key identity.key --cacert cacert.crt --output-type console


Proposed Changes

The CLI would be built with click, a lightweight Python CLI toolkit. It would connect to the REST APIs available on both the masters and the workers.
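
A minimal sketch of the plumbing behind such a client is shown below, using only the Python standard library. It assumes the --cert, --key, and --cacert options map onto a TLS client context and that a response has already been parsed into a list of dictionaries; the actual response format of the REST API is not specified here, and the function names make_ssl_context, fetch, and render are hypothetical.

```python
import csv
import io
import json
import ssl
import urllib.request


def make_ssl_context(cert=None, key=None, cacert=None):
    """Build a TLS context from the --cert/--key/--cacert options."""
    ctx = ssl.create_default_context(cafile=cacert)
    if cert:
        # Present a client certificate only when one was supplied.
        ctx.load_cert_chain(certfile=cert, keyfile=key)
    return ctx


def fetch(url, ctx, timeout=10.0):
    """GET a REST endpoint and return the response body as text."""
    with urllib.request.urlopen(url, context=ctx, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def render(rows, output_type="json"):
    """Format rows (assumed: a list of dicts) for --output-type."""
    if output_type == "json":
        return json.dumps(rows, indent=2)
    if output_type == "csv":
        buf = io.StringIO()
        fields = list(rows[0]) if rows else []
        writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
        return buf.getvalue()
    # Fallback: one "key=value" line per row.
    return "\n".join(" ".join(f"{k}={v}" for k, v in row.items()) for row in rows)
```

Separating transport (fetch) from presentation (render) keeps each output type testable without a running cluster.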

REST API/CLI Evolution

Celeborn administrators often need to filter output by a particular attribute. For example, the Celeborn platform team may want to inspect and manage workers based on the rack they reside in (or some other network attribute). Eventually, both the REST API and the CLI should accept query parameters to enable this.
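
One way the filtering could look from the CLI side is sketched below. The query parameter name rack and the helper names with_filters and filter_rows are hypothetical; the REST API does not accept such parameters today, so the second helper shows a client-side fallback over rows that are assumed to be dictionaries.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_filters(url, filters):
    """Append filter attributes to a URL as query parameters."""
    parts = urlsplit(url)
    # Keep any existing parameters, then add the filters in a stable order.
    query = parse_qsl(parts.query) + sorted(filters.items())
    return urlunsplit(parts._replace(query=urlencode(query)))


def filter_rows(rows, filters):
    """Client-side fallback: keep rows whose attributes match every filter."""
    return [
        row
        for row in rows
        if all(str(row.get(key)) == str(value) for key, value in filters.items())
    ]
```

For example, with_filters("https://m1:9098/workerInfo", {"rack": "r1"}) yields a URL ending in /workerInfo?rack=r1.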

Additionally, the Celeborn platform team might support more than one Celeborn cluster or environment. Similar to kubectl, the Celeborn CLI should provide a way to switch contexts between clusters, which would make for a much better user experience.

Both of these can be added in v2 of the CLI. 
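
Context switching could be backed by a small config file, loosely modeled on how kubectl keeps named contexts. The sketch below is only one possible shape: the JSON layout, the file location chosen by the caller, and the function names set_context, use_context, and current_hostport are all assumptions.

```python
import json
from pathlib import Path


def load_config(path):
    """Read the context file, or return an empty config if it does not exist."""
    p = Path(path)
    if not p.exists():
        return {"current": None, "contexts": {}}
    return json.loads(p.read_text())


def set_context(path, name, hostport):
    """Create or update a named context pointing at one cluster."""
    cfg = load_config(path)
    cfg["contexts"][name] = {"hostport": hostport}
    Path(path).write_text(json.dumps(cfg, indent=2))


def use_context(path, name):
    """Mark a known context as the current one."""
    cfg = load_config(path)
    if name not in cfg["contexts"]:
        raise KeyError(f"unknown context: {name}")
    cfg["current"] = name
    Path(path).write_text(json.dumps(cfg, indent=2))


def current_hostport(path):
    """Return the host:port of the current context."""
    cfg = load_config(path)
    return cfg["contexts"][cfg["current"]]["hostport"]
```

With this in place, --hostport could default to the value of the current context, so that switching clusters is a single command.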

Compatibility, Deprecation, and Migration Plan

The CLI will be written in Python 3. Because it is a new tool, no deprecation or migration plan is needed.

Test Plan

The changes will be covered by unit tests, end-to-end tests, and manual tests.

Rejected Alternatives

N/A
