Status

Current state: Accepted

Discussion thread: here

Vote thread: here

JIRA: here

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

...

To check the readiness and liveness of a distributed worker, or the liveness of a standalone worker, a request can be issued to the GET /connectors/{connector} endpoint. If a response with either a 200 or 404 status code is received, the worker can be considered healthy. This approach has two drawbacks: either a connector must exist with the expected name (which may be inconvenient for cluster administrators to enforce), or any automated tooling that interacts with the endpoint must count a 404 response as "healthy", which is highly abnormal. It also does not sufficiently confirm the readiness of a standalone worker.
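For illustration, the workaround described above might look roughly like the following in a monitoring script. This is a minimal sketch, not part of the proposal: the function names, the base URL parameter, and the 5-second client timeout are all assumptions chosen for the example.

```python
import urllib.error
import urllib.parse
import urllib.request

# Status codes the workaround has to treat as "healthy". Note that 404
# (connector not found) is on the list, which is the awkward part.
HEALTHY_STATUS_CODES = frozenset({200, 404})


def is_healthy_status(status_code: int) -> bool:
    """Return True if the status code counts as healthy under the workaround."""
    return status_code in HEALTHY_STATUS_CODES


def probe_worker(base_url: str, connector: str, timeout_s: float = 5.0) -> bool:
    """Probe GET /connectors/{connector} on a worker (hypothetical helper).

    base_url and timeout_s are assumptions for this sketch; the worker's
    REST listener address depends on the deployment.
    """
    name = urllib.parse.quote(connector, safe="")
    url = f"{base_url}/connectors/{name}"
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return is_healthy_status(response.status)
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses, so the 404 case lands here.
        return is_healthy_status(err.code)
    except OSError:
        # Connection refused, DNS failure, client-side timeout, and so on.
        return False
```

The need to special-case 404 in `HEALTHY_STATUS_CODES` is exactly what most off-the-shelf probe tooling does not support without customization.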

...

If the worker has not yet completed startup, the response will have a 503 status code and its body will include a different message:

GET /health (503)
{
  "status": "starting",
  "message": "Worker is still starting up"
}

If the worker has completed startup but is unable to respond in time, the response will have a 500 status code and its body will include this message:

GET /health (500)
{
  "status": "unhealthy",
  "message": "Worker was unable to handle this request and may be unable to handle other requests"
}
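A client might map the two error responses above onto coarse worker states along these lines. This is a sketch under stated assumptions: the function name and the "unknown" fallback are hypothetical, and treating any 200 response as healthy is an assumption, since the success response is not shown in this excerpt. The sketch keys only on the status code and the "status" field, and never on the "message" text.

```python
import json


def classify_health(status_code: int, body: str) -> str:
    """Map a GET /health response onto a coarse worker state.

    Returns one of "healthy", "starting", "unhealthy", or "unknown".
    """
    try:
        payload = json.loads(body) if body else {}
    except json.JSONDecodeError:
        payload = {}
    status = payload.get("status") if isinstance(payload, dict) else None

    if status_code == 503 and status == "starting":
        return "starting"
    if status_code == 500 and status == "unhealthy":
        return "unhealthy"
    if status_code == 200:
        # Assumption: a 200 means the worker is healthy; the success body
        # is not reproduced in this excerpt, so it is not inspected here.
        return "healthy"
    # Anything else (proxy errors, malformed bodies) is left undecided.
    return "unknown"
```

For example, `classify_health(503, '{"status": "starting", "message": "Worker is still starting up"}')` returns `"starting"`.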

Unlike other endpoints, the timeout for the health check endpoint will not be 90 seconds. If automated tooling requires N consecutive failures from this endpoint before declaring the worker unhealthy, then waiting N * 1.5 minutes for an issue with worker health to be detected is likely to be too long. Instead, the timeout for this endpoint will be 10 seconds. This timeout will apply regardless of whether the worker is currently starting up or has already completed startup; in other words, a request to the health check endpoint during worker startup will block for at least 10 seconds before receiving a 503 response. In the future, the timeout may be made user-configurable if, for example, KIP-882: Kafka Connect REST API configuration validation timeout improvements or something like it is adopted.
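The arithmetic behind this choice can be made concrete with a small sketch. The function name and the example value N = 3 are assumptions for illustration, and the calculation deliberately ignores any pause the tooling inserts between probes, which would add to both figures equally.

```python
LEGACY_TIMEOUT_S = 90        # timeout used by other REST endpoints
HEALTH_CHECK_TIMEOUT_S = 10  # timeout for the health check endpoint


def worst_case_detection_seconds(consecutive_failures: int, timeout_s: int) -> int:
    """Time spent waiting on N back-to-back probes that each run to the timeout."""
    if consecutive_failures < 1:
        raise ValueError("consecutive_failures must be at least 1")
    return consecutive_failures * timeout_s


# Example with N = 3 (a hypothetical failure threshold):
legacy = worst_case_detection_seconds(3, LEGACY_TIMEOUT_S)          # 270 seconds
proposed = worst_case_detection_seconds(3, HEALTH_CHECK_TIMEOUT_S)  # 30 seconds
```

With a threshold of three failures, detection drops from 4.5 minutes to 30 seconds of probe time. Tooling that sets its own client-side timeout would presumably want it slightly above 10 seconds so that the worker's 500 or 503 response is received rather than cut off.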

Note that the HTTP status codes and "status" fields in the JSON response will exactly match the examples above. However, the "message" field may be augmented to include, among other things, more information about the operation(s) the worker could be blocked on (such as was added in REST timeout error messages in KAFKA-15563).

Proposed Changes

Distributed mode

...