...
The public API changes in this KIP are summarized below.
Restoration metrics
We propose to add metrics at both the thread level (default reporting level is INFO) and the task level (default reporting level is DEBUG).
Note that restoration will be handled by separate threads, and hence their thread ids will differ from those of the stream threads.
Thread-level metric tags are:
- type=stream-state-updater-metrics
- client-id=[clientId]
- thread-id=[threadId]
...
Task-level metric tags are:
- type=stream-task-metrics
- client-id=[clientId]
- thread-id=[threadId]
- task-id=[taskId]
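As an illustration of how the two tag sets above would appear to a JMX client, here is a minimal sketch that builds the corresponding MBean names with the standard `javax.management.ObjectName` class. The class `RestorationMetricNames`, the example client and thread ids, and the use of the `kafka.streams` JMX domain for these new metrics are assumptions made for illustration; they are not taken from the POC.

```java
import java.util.Hashtable;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

/** Hypothetical helper for illustration only; not part of Kafka Streams. */
final class RestorationMetricNames {
    // Assumption: the new metrics are exposed under the usual Kafka Streams JMX domain.
    static final String DOMAIN = "kafka.streams";

    /** Builds the MBean name for the thread-level tags listed in this KIP. */
    static ObjectName threadLevel(String clientId, String threadId) {
        Hashtable<String, String> tags = new Hashtable<>();
        tags.put("type", "stream-state-updater-metrics");
        tags.put("client-id", clientId);
        tags.put("thread-id", threadId);
        return build(tags);
    }

    /** Builds the MBean name for the task-level tags listed in this KIP. */
    static ObjectName taskLevel(String clientId, String threadId, String taskId) {
        Hashtable<String, String> tags = new Hashtable<>();
        tags.put("type", "stream-task-metrics");
        tags.put("client-id", clientId);
        tags.put("thread-id", threadId);
        tags.put("task-id", taskId);
        return build(tags);
    }

    private static ObjectName build(Hashtable<String, String> tags) {
        try {
            return new ObjectName(DOMAIN, tags);
        } catch (MalformedObjectNameException e) {
            // Tag values containing characters such as ',' or '=' would need ObjectName.quote().
            throw new IllegalArgumentException("invalid tag value", e);
        }
    }

    public static void main(String[] args) {
        // The ids below are made-up examples, not names produced by Kafka Streams.
        System.out.println(threadLevel("my-app-client", "my-app-client-StateUpdater-1"));
        System.out.println(taskLevel("my-app-client", "my-app-client-StateUpdater-1", "0_1"));
    }
}
```

The only structural difference between the two levels is the `type` value and the additional `task-id` tag, so a monitoring tool can select all task-level restoration metrics of one thread by matching on `thread-id`.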
The POC implementation of the proposed metrics can be found here: https://github.com/apache/kafka/pull/12391
Metric Name | Level | Type | Description | Notes |
---|---|---|---|---|
active-restoring-tasks | thread / INFO | count | The number of active tasks currently undergoing restoration | |
standby-updating-tasks | thread / INFO | count | The number of standby tasks currently being updated | |
active-paused-tasks | thread / INFO | count | The number of active tasks whose restoration is paused | |
standby-paused-tasks | thread / INFO | count | The number of standby tasks whose updating is paused | |
idle-ratio | thread / INFO | gauge (percentage) | The fraction of time the thread spent idle | idle-ratio + restore-ratio + checkpoint-ratio should be 1 |
restore-ratio | thread / INFO | gauge (percentage) | The fraction of time the thread spent on restoring active or standby tasks | idle-ratio + restore-ratio + checkpoint-ratio should be 1 |
checkpoint-ratio | thread / INFO | gauge (percentage) | The fraction of time the thread spent on checkpointing restored progress | idle-ratio + restore-ratio + checkpoint-ratio should be 1 |
active-restore-records-rate | thread / INFO | rate | The average per-second number of records restored for all active tasks | min(active-restore-records-rate, standby-update-records-rate) == 0 |
standby-update-records-rate | thread / INFO | rate | The average per-second number of records updated for all standby tasks | min(active-restore-records-rate, standby-update-records-rate) == 0 |
restore-call-rate | thread / INFO | rate | The average per-second number of restore calls triggered | |
restore-total | task / DEBUG | count | The total number of records processed during restoration | |
restore-rate | task / DEBUG | rate | The average per-second number of records restored | |
restore-remaining-records-total | task / INFO | count | The number of records remaining to be restored | |
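The two constraints in the Notes column of the table can be read as simple consistency checks. The sketch below shows one way they could be computed and verified for a single measurement window. The class `StateUpdaterRatios`, its fields, and the nanosecond inputs are hypothetical and are not taken from the POC. The rate check reads the note literally: assuming non-negative rates, at most one of the two rates is non-zero at a time.

```java
/** Hypothetical illustration of the ratio and rate constraints; not Kafka Streams code. */
final class StateUpdaterRatios {
    final double idleRatio;
    final double restoreRatio;
    final double checkpointRatio;

    /** Derives the three ratios from the time the thread spent in each state during one window. */
    StateUpdaterRatios(long idleNanos, long restoreNanos, long checkpointNanos) {
        long total = idleNanos + restoreNanos + checkpointNanos;
        if (total <= 0) {
            throw new IllegalArgumentException("window must cover a positive duration");
        }
        this.idleRatio = (double) idleNanos / total;
        this.restoreRatio = (double) restoreNanos / total;
        this.checkpointRatio = (double) checkpointNanos / total;
    }

    /** idle-ratio + restore-ratio + checkpoint-ratio; expected to be 1 up to rounding error. */
    double sum() {
        return idleRatio + restoreRatio + checkpointRatio;
    }

    /** min(active-restore-records-rate, standby-update-records-rate) == 0. */
    static boolean ratesAreConsistent(double activeRestoreRate, double standbyUpdateRate) {
        return Math.min(activeRestoreRate, standbyUpdateRate) == 0.0;
    }

    public static void main(String[] args) {
        // Made-up durations: 600 ms idle, 300 ms restoring, 100 ms checkpointing.
        StateUpdaterRatios r = new StateUpdaterRatios(600_000_000L, 300_000_000L, 100_000_000L);
        System.out.println("ratios sum to 1: " + (Math.abs(r.sum() - 1.0) < 1e-9));
        System.out.println("rates consistent: " + ratesAreConsistent(120.0, 0.0));
    }
}
```

The sum is compared with a small tolerance rather than with `==` because the ratios are floating-point values.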
Along with these new metrics, we would also deprecate the metrics below:
...