Document the state by adding a label to the CIP page with one of "discussion", "accepted", "released", "rejected".
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
Celeborn has implemented RESTful APIs for monitoring and administrative operations on both master and worker endpoints. These APIs enable tasks such as configuration checks, status viewing of master/worker nodes, worker decommissioning/recommissioning, and more. They provide crucial insights and support for DevOps.
The primary concern with the existing API is the response content type, which is `text/plain` rather than the more widely accepted `application/json`. This mismatch makes integration with DevOps tools challenging, as these tools typically require JSON-formatted responses for seamless parsing and automation.
The Apache Celeborn CLI Proposal also identifies the need to evolve the REST API.
Public Interfaces
We propose the introduction of a new API namespace: `/api/v1`. This approach allows us to maintain the current API for compatibility while offering an improved version.
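As a rough illustration of how a DevOps client might target the new namespace, the sketch below builds (without sending) a request that asks for a JSON response. The master address is hypothetical, and the assumption that the `conf` endpoint resolves to `/api/v1/conf` is ours rather than something this proposal spells out.

```python
from urllib.request import Request

API_PREFIX = "/api/v1"  # namespace proposed in this section


def build_get(base_url: str, path: str) -> Request:
    """Build (but do not send) a GET request that asks for a JSON response."""
    url = base_url.rstrip("/") + API_PREFIX + "/" + path.lstrip("/")
    return Request(url, headers={"Accept": "application/json"}, method="GET")


# Hypothetical master address; replace with a real endpoint before sending.
req = build_get("http://celeborn-master.example:9098", "conf")
print(req.get_method(), req.full_url)
```

Keeping the prefix in one constant means the legacy, un-prefixed endpoints stay reachable by the same client simply by bypassing `build_get`.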
Proposed Changes
I want to take this opportunity to reclassify, and where appropriate rename, the current API endpoints; the API mapping is listed below. We will leverage openapi-generator for this.
Master API Mapping
- /${version}/conf
- mapping: /conf
- method: GET
- params: none
- return: ConfigData
- /${version}/conf/dynamic
- mapping: /listDynamicConfigs
- method: GET
- params: level, tenant, name
- return: Seq[DynamicConfig]
- /${version}/thread_dump
- mapping: /threadDump
- method: GET
- params: none
- return: Seq[ThreadStack]
- /${version}/applications
- mapping: /applications
- method: GET
- params: none
- return: Seq[ApplicationHeartbeatData]
- /${version}/applications/top_disk_usage
- mapping: /listTopDiskUsedApps
- method: GET
- params: none
- return: Seq[AppDiskUsageSnapshotData]
- /${version}/applications/hostnames
- mapping: /hostnames
- method: GET
- params: none
- description: List the LifecycleManager hostnames of all running applications in the cluster
- return: Seq[String]
- /${version}/shuffles
- mapping: /shuffle
- method: GET
- params: none
- return: Seq[String]
- /${version}/masters
- mapping: /masterGroupInfo
- method: GET
- params: none
- return: MasterGroupData
- /${version}/workers
- mapping: /workerInfo
- workers field in response
- method: GET
- params: none
- return: WorkersResponse
- /${version}/workers
- mapping: /lostWorkers
- lostWorkers field in response
- method: GET
- params: none
- return: WorkersResponse
- /${version}/workers
- mapping: /excludeWorkers
- excludedWorkers field in response
- method: GET
- params: none
- return: WorkersResponse
- /${version}/workers
- mapping: /shutdownWorkers
- shutdownWorkers field in response
- method: GET
- params: none
- return: WorkersResponse
- /${version}/workers
- mapping: /decommissionWorkers
- decommissioningWorkers field in response
- method: GET
- params: none
- return: WorkersResponse
- /${version}/workers/events
- mapping: /workerEventInfo
- method: GET
- params: none
- description: List all worker event info of the master
- return: Seq[WorkerEventData]
- /${version}/workers/exclude
- mapping: /exclude
- method: POST
- requestBody: ExcludeWorkerRequest
- description: Manually add workers to, or remove workers from, the master's excluded workers, identified by worker id. The `add` and `remove` parameters specify the workers to add or remove; multiple values are separated by commas.
- return: Seq[WorkerData]
- /${version}/workers/events
- mapping: /sendWorkerEvent
- method: POST
- requestBody: SendWorkerEventRequest
- description: The master (leader) can send worker events to manage workers. Legal event types are 'None', 'Immediately', 'Decommission', 'DecommissionThenIdle', 'Graceful', 'Recommission'; the workers parameter is a comma-separated list.
- return: HandleResponse
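The master mapping above folds several legacy list endpoints into a single `/${version}/workers` response. A minimal sketch of how a client could recover each legacy view from one `WorkersResponse` follows; the sample payload is hypothetical and heavily truncated, and the helper name `legacy_view` is ours.

```python
import json

# Legacy master endpoint -> WorkersResponse field, copied from the mapping above.
LEGACY_TO_FIELD = {
    "/workerInfo": "workers",
    "/lostWorkers": "lostWorkers",
    "/excludeWorkers": "excludedWorkers",
    "/shutdownWorkers": "shutdownWorkers",
    "/decommissionWorkers": "decommissioningWorkers",
}


def legacy_view(workers_response: dict, legacy_endpoint: str) -> list:
    """Return the slice of a WorkersResponse that a legacy endpoint used to expose."""
    return workers_response.get(LEGACY_TO_FIELD[legacy_endpoint], [])


# Hypothetical, truncated payload; real WorkerData carries many more fields.
sample = json.loads("""
{
  "workers": [{"host": "10.0.0.1", "rpcPort": 9097}],
  "lostWorkers": [
    {"worker": {"host": "10.0.0.2", "rpcPort": 9097}, "timestamp": 1700000000000}
  ],
  "excludedWorkers": [],
  "shutdownWorkers": [],
  "decommissioningWorkers": []
}
""")

# lostWorkers entries are WorkerTimestampData, so the worker is nested one level down.
lost_hosts = [item["worker"]["host"] for item in legacy_view(sample, "/lostWorkers")]
print(lost_hosts)
```

One request then serves every view, at the cost of a larger payload than any single legacy endpoint returned.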
Worker API Mapping
- /${version}/conf
- mapping: /conf
- method: GET
- params: none
- return: ConfigData
- /${version}/conf/dynamic
- mapping: /listDynamicConfigs
- method: GET
- params: level, tenant, name
- return: Seq[DynamicConfig]
- /${version}/thread_dump
- mapping: /threadDump
- method: GET
- params: none
- return: Seq[ThreadStack]
- /${version}/shuffles
- mapping: /shuffle
- method: GET
- params: none
- return: Seq[String]
- /${version}/shuffles/partitions
- mapping: /listPartitionLocationInfo
- method: GET
- params: none
- return: ShufflePartitionsResponse
- /${version}/applications
- mapping: /applications
- method: GET
- params: none
- return: Seq[String]
- /${version}/applications/top_disk_usages
- mapping: /listTopDiskUsedApps
- method: GET
- params: none
- return: Seq[AppDiskUsageData]
- /${version}/workers/unavailable_peers
- mapping: /unavailablePeers
- method: GET
- params: none
- return: Seq[WorkerTimestampData]
- /${version}/workers
- mapping: /workerInfo
- method: GET
- params: none
- return: WorkerInfoResponse
- /${version}/workers
- mapping: /isShutdown
- isShutdown field in response
- method: GET
- params: none
- return: WorkerInfoResponse
- /${version}/workers
- mapping: /isDecommissioning
- isDecommissioning field in response
- method: GET
- params: none
- return: WorkerInfoResponse
- /${version}/workers
- mapping: /isRegistered
- isRegistered field in response
- method: GET
- params: none
- return: WorkerInfoResponse
- /${version}/exit
- mapping: /exit
- method: POST
- requestBody: WorkerExitRequest
- return: HandleResponse
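On the worker side, the three legacy boolean endpoints become fields of the single `/${version}/workers` response. The sketch below shows one way a client might read them; the payload is a hypothetical, truncated example, the `workerState` value is made up, and treating a missing field as `False` is our own choice rather than part of the proposal.

```python
import json

# Status fields that replace the legacy boolean endpoints in the worker mapping above.
FLAG_FIELDS = ("isRegistered", "isShutdown", "isDecommissioning")


def worker_flags(worker_info: dict) -> dict:
    """Extract the three status booleans from a WorkerInfoResponse-shaped dict."""
    # Assumption: a missing flag is reported as False instead of raising.
    return {name: bool(worker_info.get(name, False)) for name in FLAG_FIELDS}


# Hypothetical, truncated response body.
sample = json.loads(
    '{"host": "10.0.0.1", "workerState": "Normal", '
    '"isRegistered": true, "isShutdown": false, "isDecommissioning": false}'
)

flags = worker_flags(sample)
print(flags)
```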
Schemas
ConfigData
Name | Type |
name | string |
value | string |
DynamicConfig
Name | Type | Description |
level | string | SYSTEM, TENANT or TENANT_USER |
description | string | additional info, such as tenantId and userIdentifier info |
configs | List[ConfigData] | Config list |
ThreadStack
Name | Type |
threadId | string |
threadName | string |
threadState | string |
stackTrace | Seq[String] |
blockedByThreadId | Option[Long] |
blockedByLock | string |
holdingLocks | List[string] |
ApplicationHeartbeatData
Name | Type |
appId | string |
lastHeartbeatTimestamp | Long |
AppDiskUsageSnapshotData
Name | Type |
start | Long |
end | Long |
topNItems | List[AppDiskUsageData] |
AppDiskUsageData
Name | Type |
appId | String |
estimatedUsage | Long |
estimatedUsageStr | String |
MasterGroupData
Name | Type |
groupId | String |
leader | MasterLeader |
masterCommitInfo | List[MasterCommitData] |
MasterLeader
Name | Type |
id | String |
address | String |
MasterCommitData
Name | Type |
commitIndex | String |
id | String |
address | String |
clientAddress | String |
startUpRole | String |
WorkerData
Name | Type |
host | String |
rpcPort | Int |
pushPort | Int |
fetchPort | Int |
replicatePort | Int |
internalPort | Int |
slotsUsed | Int |
lastHeartbeatTimestamp | Long |
heartbeatElapsedSeconds | Long |
diskInfos | Map[String, String] |
resourceConsumption | Map[String, String] |
workerRef | String |
workerState | String |
workerStateStartTime | Long |
isRegistered | Boolean |
isShutdown | Boolean |
isDecommissioning | Boolean |
WorkersResponse
Name | Type |
workers | List[WorkerData] |
lostWorkers | List[WorkerTimestampData] |
excludedWorkers | List[WorkerData] |
manualExcludedWorkers | List[WorkerData] |
shutdownWorkers | List[WorkerData] |
decommissioningWorkers | List[WorkerData] |
WorkerInfoResponse
Name | Type |
host | String |
rpcPort | Int |
pushPort | Int |
fetchPort | Int |
replicatePort | Int |
internalPort | Int |
slotsUsed | Int |
lastHeartbeatTimestamp | Long |
heartbeatElapsedSeconds | Long |
diskInfos | Map[String, String] |
resourceConsumption | Map[String, String] |
workerRef | String |
workerState | String |
workerStateStartTime | Long |
isRegistered | Boolean |
isShutdown | Boolean |
isDecommissioning | Boolean |
WorkerTimestampData
Name | Type |
worker | WorkerData |
timestamp | Long |
WorkerEventData
Name | Type |
worker | WorkerData |
event | WorkerEventInfoData |
WorkerEventInfoData
Name | Type |
eventType | String |
eventTime | Long |
HandleResponse
Name | Type |
success | Boolean |
message | String |
ShufflePartitionsResponse
Name | Type | Description |
primaryPartitions | Map<String, Map<String, PartitionLocationData>> | Map<shuffleKey, Map<uniqueId, PartitionLocationData>> |
replicaPartitions | Map<String, Map<String, PartitionLocationData>> | Map<shuffleKey, Map<uniqueId, PartitionLocationData>> |
PartitionLocationData
Name | Type |
idEpoch | String |
hostAndPorts | String |
mode | String |
peer | String |
storage | String |
mapIdBitMap | String |
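Because `ShufflePartitionsResponse` nests two levels of maps (shuffle key, then unique id), a short traversal example may help. The sketch below counts partition locations per shuffle key across primary and replica maps; the shuffle keys, unique ids, and field values are hypothetical.

```python
from collections import Counter


def partitions_per_shuffle(resp: dict) -> dict:
    """Count partition locations per shuffle key across primary and replica maps."""
    counts = Counter()
    for role in ("primaryPartitions", "replicaPartitions"):
        # Each role maps shuffleKey -> {uniqueId -> PartitionLocationData}.
        for shuffle_key, by_unique_id in resp.get(role, {}).items():
            counts[shuffle_key] += len(by_unique_id)
    return dict(counts)


# Hypothetical payload with truncated PartitionLocationData entries.
sample = {
    "primaryPartitions": {
        "app-1-0": {"0-0": {"idEpoch": "0-0", "mode": "PRIMARY"},
                    "1-0": {"idEpoch": "1-0", "mode": "PRIMARY"}},
    },
    "replicaPartitions": {
        "app-1-0": {"0-0": {"idEpoch": "0-0", "mode": "REPLICA"}},
        "app-2-0": {"0-0": {"idEpoch": "0-0", "mode": "REPLICA"}},
    },
}

counts = partitions_per_shuffle(sample)
print(counts)
```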
WorkerId
Name | Type |
host | String |
rpcPort | int |
pushPort | int |
fetchPort | int |
replicatePort | int |
ExcludeWorkerRequest
Name | Type |
add | List[WorkerId] |
remove | List[WorkerId] |
SendWorkerEventRequest
Name | Type |
eventType | string |
workers | List[WorkerId] |
WorkerExitRequest
Name | Type | Description |
type | string | Legal exit types are 'Decommission', 'Graceful' and 'Immediately' |
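The three request schemas above could be assembled on the client side roughly as follows. This is a sketch only: the JSON field names follow the schema tables, the legal type lists are taken from the endpoint descriptions, and the client-side validation, helper names, and port values are our assumptions.

```python
import json
from dataclasses import asdict, dataclass

# Legal values as listed in the endpoint descriptions of this proposal.
WORKER_EVENT_TYPES = {"None", "Immediately", "Decommission",
                      "DecommissionThenIdle", "Graceful", "Recommission"}
WORKER_EXIT_TYPES = {"Decommission", "Graceful", "Immediately"}


@dataclass
class WorkerId:
    host: str
    rpcPort: int
    pushPort: int
    fetchPort: int
    replicatePort: int


def exclude_worker_request(add=(), remove=()) -> str:
    """Serialize an ExcludeWorkerRequest body."""
    return json.dumps({"add": [asdict(w) for w in add],
                       "remove": [asdict(w) for w in remove]})


def send_worker_event_request(event_type: str, workers) -> str:
    """Serialize a SendWorkerEventRequest body, rejecting unknown event types."""
    if event_type not in WORKER_EVENT_TYPES:
        raise ValueError(f"illegal worker event type: {event_type}")
    return json.dumps({"eventType": event_type,
                       "workers": [asdict(w) for w in workers]})


def worker_exit_request(exit_type: str) -> str:
    """Serialize a WorkerExitRequest body, rejecting unknown exit types."""
    if exit_type not in WORKER_EXIT_TYPES:
        raise ValueError(f"illegal worker exit type: {exit_type}")
    return json.dumps({"type": exit_type})


# Hypothetical worker; the port numbers are placeholders.
w = WorkerId("10.0.0.1", 9097, 9091, 9092, 9093)
exclude_body = exclude_worker_request(add=[w])
event_body = send_worker_event_request("Decommission", [w])
exit_body = worker_exit_request("Graceful")
print(exclude_body)
```

Validating the type strings before sending avoids a round trip for an obviously invalid request, although the server remains the authority on what it accepts.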
Compatibility, Deprecation, and Migration Plan
We will keep the current API for compatibility, mark it as deprecated, and plan its removal in a future version. We will also write migration docs to help users move to the new API.
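A migration guide could include a lookup from legacy master endpoints to their replacements. The sketch below is an excerpt built from the master mapping in this proposal; the assumption that `${version}` paths sit directly under `/api/v1`, and the helper name `new_path`, are ours.

```python
API_PREFIX = "/api/v1"  # assumed expansion of the versioned namespace

# Excerpt of the master mapping: legacy endpoint -> path under the new namespace.
MASTER_MIGRATION = {
    "/conf": "/conf",
    "/listDynamicConfigs": "/conf/dynamic",
    "/threadDump": "/thread_dump",
    "/applications": "/applications",
    "/listTopDiskUsedApps": "/applications/top_disk_usage",
    "/hostnames": "/applications/hostnames",
    "/shuffle": "/shuffles",
    "/masterGroupInfo": "/masters",
    "/workerInfo": "/workers",
    "/lostWorkers": "/workers",
    "/workerEventInfo": "/workers/events",
    "/exclude": "/workers/exclude",
    "/sendWorkerEvent": "/workers/events",
}


def new_path(legacy_endpoint: str) -> str:
    """Translate a legacy master endpoint to its path in the new namespace."""
    return API_PREFIX + MASTER_MIGRATION[legacy_endpoint]


print(new_path("/listDynamicConfigs"))
```

Several legacy endpoints collapse onto the same new path, so the table is intentionally many-to-one; callers still need the HTTP method and response field from the mapping lists to complete the translation.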
Test Plan
The changes will be covered by unit tests, e2e tests and manual tests.
Rejected Alternatives
Retain the current API's naming convention and merely optimize the request and response formats by moving them under the `/api/v1` namespace.
For example:
- /listDynamicConfigs → /api/v1/listDynamicConfigs
- /listTopDiskUsedApps → /api/v1/listTopDiskUsedApps