Document the state by adding a label to the CIP page with one of "discussion", "accepted", "released", "rejected".


Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Celeborn has implemented RESTful APIs for monitoring and administrative operations on both master and worker endpoints. These APIs enable tasks such as configuration checks, status viewing of master/worker nodes, worker decommissioning/recommissioning, and more. They provide crucial insights and support for DevOps.

The primary concern with the existing API is the response content type, which is `text/plain` rather than the more widely accepted `application/json`. This mismatch makes integration with DevOps tools challenging, as these tools typically require JSON-formatted responses for seamless parsing and automation.

And I also saw the need for REST API evolution in Apache Celeborn CLI Proposal.

Public Interfaces

We propose the introduction of a new API namespace: `/api/v1`. This approach allows us to maintain the current API for compatibility while offering an improved version.

Proposed Changes

I want to take this opportunity to reclassify and potentially rename the current API endpoints, here is the api mapping. Will leverage openapi-generator for this.

Master API Mapping

    • /${version}/conf
      • mapping: /conf
      • method: GET
      • params: none
      • return: ConfigData
    • /${version}/conf/dynamic
      • mapping: /listDynamicConfigs
      • method: GET
      • params: level, tenant, name
      • return: Seq[DynamicConfig] 
    • /${version}/thread_dump
      • mapping: /threadDump
      • method: GET
      • params: none
      • return: Seq[ThreadStack]
    • /${version}/applications
      • mapping: /applications
      • method: GET
      • params: none
      • return: Seq[ApplicationHeartbeatData]
    • /${version}/applications/top_disk_usage
      • mapping: /listTopDiskUsedApps
      • method: GET
      • params: none
      • return: Seq[AppDiskUsageSnapshotData]
    • /${version}/applications/hostnames
      • mapping: /hostnames
      • method: GET
      • params: none
      • description: List all running application's LifecycleManager's hostnames of the cluster
      • return: Seq[String]
    • /${version}/shuffles
      • mapping: /shuffle
      • method: GET
      • params: none
      • return: Seq[String]
    • /${version}/masters
      • mapping: /masterGroupInfo
      • method: GET
      • params: none
      • return: MasterGroupData
    • /${version}/workers
      • mapping:/workerInfo
      • workers field in response
      • method: GET
      • params: none
      • return: WorkersResponse
    • /${version}/workers
      • mapping:/lostWorkers
      • lostWorkers filed in response
      • method: GET
      • params: none
      • return: WorkersResponse
    • /${version}/workers
      • mapping:
        • /excludeWorkers
          • excludedWorkers filed in response
      • method: GET
      • params: none
      • return: WorkersResponse
    • /${version}/workers
      • mapping:/shutdownWorkers
      • shutdownWorkers field in response
      • method: GET
      • params: none
      • return: WorkersResponse
    • /${version}/workers
      • mapping:/decommissionWorkers
      • decommissioningWorkers field in response
      • method: GET
      • params: none
      • return: WorkersResponse
    • /${version}/workers/events
      • mapping: /workerEventInfo
      • method: GET
      • params: none
      • description: List all worker event info of the master
      • return: Seq[WorkerEventData]
    • /${version}/workers/exclude
      • mapping: /exclude
      • method: POST
      • requestBody: ExcludeWorkerRequest
      • description: Excluded workers of the master add or remove the worker manually given worker id. The parameter add or remove specifies the excluded workers to add or remove, which value is separated by commas.
      • return: Seq[WorkerData]
    • /${version}/workers/events
      • mapping: /sendWorkerEvent
      • method: POST
      • requestBody: SendWorkerEventRequest
      • description: For Master(Leader) can send worker events to manager workers. Legal types are 'None', 'Immediately', 'Decommission', 'DecommissionThenIdle', 'Graceful', 'Recommission', and the parameter workers are separated by commas.
      • return: HandleResponse

Worker API Mapping

    • /${version}/conf
      • mapping: /conf
      • method: GET
      • params: none
      • return: ConfigData
    • /${version}/conf/dynamic
      • mapping: /listDynamicConfigs
      • method: GET
      • params: level, tenant, name
      • return: Seq[DynamicConfig]
    • /${version}/thread_dump
      • mapping: /threadDump
      • method: GET
      • params: none
      • return: Seq[ThreadStack]
    • /${version}/shuffles
      • mapping: /shuffle
      • method: GET
      • params: none
      • return: Seq[String]
    • /${version}/shuffles/partitions
      • mapping: /listPartitionLocationInfo
      • method: GET
      • params: none
      • return: ShufflePartitionsResponse
    • /${version}/applications
      • mapping: /applications
      • method: GET
      • params: none
      • return: Seq[String]
    • /${version}/applications/top_disk_usages
      • mapping: /listTopDiskUsedApps
      • method: GET
      • params: none
      • return: Seq[AppDiskUsageData]
    • /${version}/workers/unavailable_peers
      • mapping: /unavailablePeers
      • method: GET
      • params: none
      • return: Seq[WorkerTimeStampData]
    • /${version}/workers
      • mapping:
        • /workerInfo
      • method: GET
      • params: none
      • return: WorkerData  WorkerInfoResponse
    • /${version}/workers
      • mapping:
        •  /isShutdown
          • isShutdown field in `WorkerData`
      • method: GET
      • params: none
      • return: WorkerData  WorkerInfoResponse
    • /${version}/workers
      • mapping:
        •  /isDecomissioning
          • isDecommissioning field in `WorkerData`
      • method: GET
      • params: none
      • return: WorkerData  WorkerInfoResponse
    • /${version}/workers
      • mapping:
        • /isRegistered
          • isRegistered field in `WorkerData`
      • method: GET
      • params: none
      • return: WorkerData  WorkerInfoResponse
    • /${version}/exit
      • mapping: /exit
      • method: POST
      • requestBody: WorkerExitRequest
      • return: HandleResponse

Schemas

ConfigData

Name

Type

name

string

value

string

DynamicConfig

Name

Type

Description

level

string

SYSTEM, TENANT or TENANT_USER

description

string

additional info, such as tenantId and userIdentifier info

configs

List[ConfigData]

Config list

ThreadStack

Name

Type

threadId

string

threadName

string

threadState

string

stackTrace

Seq[String]

blockedByThreadId

Option[Long]

blockedByLock

string

holdingLocks

List[string]





ApplicationHeartbeatData

Name

Type

appId

string

lastHeartbeatTimestamp

Long

AppDiskUsageSnapshotData

Name

Type

start

Long

end

Long

topNItems

List[AppDiskUsageData]

AppDiskUsageData

Name

Type

appId

String

estimatedUsage

Long

estimatedUsageStr

String

MasterGroupData

Name

Type

groupId

String

leader

MasterLeader

masterCommitInfo

List[MasterCommitData]

MasterLeader

Name

Type

id

String

address

String

MasterCommitData

Name

Type

commitIndex

String

id

String

address

String

clientAddress

String

startUpRole

String

WorkerData

Name

Type

host

String

rpcPort

Int

pushPort

Int

fetchPort

Int

replicatePort

Int

internalPort

Int

slotsUsed

Int

lastHeartbeatTimestamp

Long

heartbeatElapsedSeconds

Long

diskInfos

Map[String, String]

resourceConsumption

Map[String, String]

workerRef

String

workerState

String

workerStateStartTime

Long

isRegistered

Boolean

isShutdown

Boolean

isDecommissioning

Boolean

WorkersResponse

Name

Type

workers

List[WorkerData]

lostWorkers

List[WorkerTimestampData]

excludedWorkers

List[WorkerData]

manualExcludedWorkers

List[WorkerData]

shutdownWorkers

List[WorkerData]

decommissioningWorkers

List[WorkerData]

WorkerInfoResponse

Name

Type

host

String

rpcPort

Int

pushPort

Int

fetchPort

Int

replicatePort

Int

internalPort

Int

slotsUsed

Int

lastHeartbeatTimestamp

Long

heartbeatElapsedSeconds

Long

diskInfos

Map[String, String]

resourceConsumption

Map[String, String]

workerRef

String

workerState

String

workerStateStartTime

Long

isRegistered

Boolean

isShutdown

Boolean

isDecommissioning

Boolean

WorkerTimestampData

Name

Type

worker

WorkerData

timestamp

Long

WorkerEventData

Name

Type

worker

WorkerData

event

WorkerEventInfoData

WorkerEventInfoData

Name

Type

eventType

String

eventTime

Long

HandleResponse

Name

Type

success

Boolean

message

String

ShufflePartitionsResponse

Name

Type

Desc

primaryPartitions

Map<String, Map<String, PartitionLocationData>>

Map<shuffleKey, Map<uniqueId, PartitionLocationData>>

replicaPartitions

Map<String, Map<String, PartitionLocationData>>

Map<shuffleKey, Map<uniqueId, PartitionLocationData>>

PartitionLocationData

Name

Type

idEpoch

String

hostAndPorts

String

mode

String

peer

String

storage

String

mapIdBitMap

String

WorkerId

Name

Type

host

String

rpcPort

int

pushPort

int

fetchPort

int

replicatePort

int

ExcludeWorkerRequest

Name

Type

add

List[WorkerId]

remove

List[WorkerId]

SendWorkerEventRequest

Name

Type

eventType

string

workers

List[WorkerId]

WorkerExitRequest

Name

Type

Desc

type

string

Legal exit types are 'Decommission', 'Graceful' and 'Immediately'

Compatibility, Deprecation, and Migration Plan

We will reserve the current API for compatibility, and mark the current API as deprecated and plan deprecation in the future version, and write the migration docs for migration.

Test Plan

The changes will be covered by unit tests, e2e tests and manual tests.

Rejected Alternatives

Retain the current API's naming convention and merely optimize the request and response formats by moving them under the `/api/v1` namespace.

For example:

  • /listDynamicConfigs → /api/v1/listDynamicConfigs
  • /listTopDiskUsedApps → /api/v1/listTopDiskUsedApps.
  • No labels