Status

Current state: Accepted

Vote thread: https://lists.apache.org/thread/rysj5t3rfpz5pp1rjk0s7w9v7lq7dfpb

Discussion thread: https://lists.apache.org/thread/z8cwnksl6op4jfg7j0nwsg9xxsf8mwhh

JIRA: KAFKA-18666

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

The controller currently exposes metrics for the active and fenced broker counts. However, these counts are not descriptive enough: they do not tell the operator, from the controller side, which broker is in which state. When an alert fires on these metrics, the operator has to look through the metadata log to find which broker(s) are in which states. This KIP proposes adding controller-side metrics to monitor the states of individual brokers.

Public Interfaces

Monitoring

Name | Type | Description
kafka.controller:type=KafkaController,name=ControlledShutdownBrokerCount | Integer | The number of brokers currently in controlled shutdown.
kafka.controller:type=KafkaController,name=BrokerRegistrationState,broker=X | Integer | A per-broker metric which reports one of the following values for the broker's state: 10 when the broker is fenced, 20 when the broker is in controlled shutdown, 30 when the broker is active. This metric is added when a broker registers and removed when it unregisters. The state is derived from the registration records contained in the metadata log.
kafka.controller:type=KafkaController,name=TimeSinceLastHeartbeatReceivedMs,broker=X | Integer | A per-broker metric which reports the time in milliseconds since the controller last received a heartbeat from the broker. The maximum value of this metric is the heartbeat session timeout, because the map whose values are being exposed removes the broker's contact time when the broker gets fenced. The default session timeout is 9 seconds. Only the active controller reports this metric, since it is soft state contained in BrokerHeartbeatTracker. Like the other per-broker metric, this metric is added when a broker registers and removed when it unregisters.
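To illustrate why TimeSinceLastHeartbeatReceivedMs cannot grow past the session timeout, the following is a minimal sketch of a contact-time map that drops a broker's entry when the broker is fenced. The class HeartbeatAgeSketch and its methods are hypothetical and are not the actual BrokerHeartbeatTracker code. What the metric reports for a fenced broker is left open here, so the sketch returns an empty value in that case.

```java
import java.util.Map;
import java.util.OptionalLong;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the controller's heartbeat soft state; not Kafka's BrokerHeartbeatTracker. */
final class HeartbeatAgeSketch {
    // Broker ID -> timestamp (ms) of the last heartbeat seen by the active controller.
    private final Map<Integer, Long> lastContactMs = new ConcurrentHashMap<>();
    private final long sessionTimeoutMs;

    HeartbeatAgeSketch(long sessionTimeoutMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
    }

    /** Records a heartbeat from the given broker. */
    void onHeartbeat(int brokerId, long nowMs) {
        lastContactMs.put(brokerId, nowMs);
    }

    /** Fences brokers whose session has expired by removing their contact time. */
    void maybeFence(long nowMs) {
        lastContactMs.values().removeIf(last -> nowMs - last > sessionTimeoutMs);
    }

    /** Time since the last heartbeat, or empty if no contact time is tracked for the broker. */
    OptionalLong timeSinceLastHeartbeatMs(int brokerId, long nowMs) {
        Long last = lastContactMs.get(brokerId);
        return last == null ? OptionalLong.empty() : OptionalLong.of(nowMs - last);
    }
}
```

Because the entry is removed as soon as the age exceeds the session timeout, any value read from the map after a fencing pass is at most the session timeout (9000 ms with the default).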

We will use tagging for the per-broker metrics where X represents the ID of the broker.
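As an illustration of the tagging scheme and the state encoding, the sketch below builds the MBean name for a given broker ID and maps the integer values back to states. The class BrokerStateMetricSketch, the State enum, and the helper methods are hypothetical names introduced for this example. Only the metric name and the values 10, 20, and 30 come from the table above.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

/** Hypothetical helper for working with the per-broker BrokerRegistrationState metric. */
final class BrokerStateMetricSketch {
    /** Integer encoding of broker states as listed in the Monitoring table. */
    enum State {
        FENCED(10), CONTROLLED_SHUTDOWN(20), ACTIVE(30);

        final int value;

        State(int value) {
            this.value = value;
        }

        static State fromValue(int value) {
            for (State s : values()) {
                if (s.value == value) {
                    return s;
                }
            }
            throw new IllegalArgumentException("Unknown state value: " + value);
        }
    }

    /** Builds the MBean name, substituting the broker ID for X in the broker tag. */
    static ObjectName registrationStateName(int brokerId) {
        try {
            return new ObjectName(
                "kafka.controller:type=KafkaController,name=BrokerRegistrationState,broker=" + brokerId);
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Extracts the broker ID from the broker tag of an MBean name. */
    static int brokerIdOf(ObjectName name) {
        return Integer.parseInt(name.getKeyProperty("broker"));
    }
}
```

Spacing the values by 10 leaves room for intermediate states to be added later without renumbering the existing ones, although the KIP itself does not state this as the reason for the choice.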

Rationale

These metrics would be useful for monitoring the following use cases:

  • When multiple brokers are fenced – In addition to how many brokers are fenced, the operator will now know which brokers are fenced without having to look at the logs. Additionally, information provided by the heartbeat metrics can help the operator narrow down the cause of the fencing.
  • Expanding and shrinking the cluster – When expanding, the brokers being added are initially fenced until they catch up on metadata, so the benefits are similar to above. When shrinking, the operator can monitor that controlled shutdown of the to-be-removed brokers is not taking longer than expected.
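As a rough sketch of how an operator-side tool could use the per-broker metric in these scenarios, the example below takes a snapshot of BrokerRegistrationState values keyed by broker ID and lists the brokers in a given state. The class FencedBrokerTriageSketch and the way the snapshot is collected are assumptions made for illustration; how the values are scraped depends on the monitoring stack in use.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical operator-side helper; the input map is assumed to be scraped from the controller. */
final class FencedBrokerTriageSketch {
    static final int FENCED = 10;
    static final int CONTROLLED_SHUTDOWN = 20;
    static final int ACTIVE = 30;

    /** Returns the IDs of all brokers whose reported state equals the given value, in ascending order. */
    static List<Integer> brokersInState(Map<Integer, Integer> stateByBroker, int state) {
        return stateByBroker.entrySet().stream()
            .filter(e -> e.getValue() == state)
            .map(Map.Entry::getKey)
            .sorted()
            .collect(Collectors.toList());
    }
}
```

For example, calling brokersInState(snapshot, FENCED) after an alert on the fenced broker count identifies the affected brokers directly, and calling it with CONTROLLED_SHUTDOWN while shrinking the cluster shows which brokers have not yet finished shutting down.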

Compatibility, Deprecation, and Migration Plan

These will be newly exposed metrics and there will be no impact on existing Kafka versions.

Test Plan

We will add JUnit tests to verify the new metrics.

Rejected Alternatives

Name | Type | Description
kafka.controller:type=KafkaController,name=LongestPendingStartupTimeMs | Long | The duration, in milliseconds, of the longest pending broker startup.
kafka.controller:type=KafkaController,name=LongestPendingStartupBroker | String | The broker ID of the longest pending broker startup.
kafka.controller:type=KafkaController,name=NumberOfBrokersInStartup | Integer | The number of brokers currently starting up.
kafka.controller:type=KafkaController,name=LongestPendingControlledShutdownTimeMs | Long | The duration, in milliseconds, of the longest pending broker controlled shutdown.
kafka.controller:type=KafkaController,name=LongestPendingControlledShutdownBroker | String | The broker ID of the longest pending broker controlled shutdown.
kafka.controller:type=KafkaController,name=NumberOfBrokersInControlledShutdown | Integer | The number of brokers currently in controlled shutdown.

These duration metrics would rely on soft state, so they would be reset when controller failover occurs. This is a bit confusing, and the scope of these metrics is limited to startup and controlled shutdown scenarios. Instead, more general purpose metrics would be applicable across more use cases.


However, if we want to add another value to BrokerRegistrationState that maps to brokers that are starting up (i.e. never unfenced), this would require adding a boolean to the broker's registration record. Additionally, if we want to track whether a broker has shut down uncleanly, we would likely need to store the broker epoch in its registration record.

