Status
Current state: Accepted
JIRA: KAFKA-10199, KAFKA-10575, KAFKA-16567
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
...
The public API changes in this KIP are summarized below.
Restoration metrics
We propose adding metrics at both the thread level (default reporting level is INFO) and the task level (default reporting level is DEBUG).
Note that restoration will be handled by a separate thread, so its thread id will differ from those of the stream threads.
Thread-level metric tags are:
- type=stream-state-updater-metrics
- thread-id=[threadId]
Task-level metric tags are:
- type=stream-task-metrics
- thread-id=[threadId]
- task-id=[taskId]
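As a rough illustration of how these tags could be used to select the new metrics, the sketch below builds JMX name patterns from the two tag sets and matches concrete names against them. It assumes the metrics are exposed over JMX under the `kafka.streams` domain, with the `type` tag as the metric group. The class name `RestorationMetricNames` and the thread and task ids in the usage note are hypothetical and not part of this KIP.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper, not part of the KIP: selects the proposed metrics by their tags.
final class RestorationMetricNames {
    // Assumption: Kafka Streams metrics appear over JMX under this domain.
    static final String DOMAIN = "kafka.streams";

    // Thread-level tags: type=stream-state-updater-metrics, thread-id=[threadId]
    static ObjectName threadLevelPattern() {
        return parse(DOMAIN + ":type=stream-state-updater-metrics,thread-id=*");
    }

    // Task-level tags: type=stream-task-metrics, thread-id=[threadId], task-id=[taskId]
    static ObjectName taskLevelPattern() {
        return parse(DOMAIN + ":type=stream-task-metrics,thread-id=*,task-id=*");
    }

    // True if the concrete MBean name carries exactly the thread-level tag set.
    static boolean isThreadLevel(String mbeanName) {
        return threadLevelPattern().apply(parse(mbeanName));
    }

    // True if the concrete MBean name carries exactly the task-level tag set.
    static boolean isTaskLevel(String mbeanName) {
        return taskLevelPattern().apply(parse(mbeanName));
    }

    // Wraps the checked exception so callers can use the helpers in plain expressions.
    private static ObjectName parse(String name) {
        try {
            return new ObjectName(name);
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Not a valid JMX object name: " + name, e);
        }
    }
}
```

For example, `RestorationMetricNames.isTaskLevel("kafka.streams:type=stream-task-metrics,thread-id=updater-1,task-id=0_1")` would match, while the same name passed to `isThreadLevel` would not, because the key sets differ.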
The POC implementation of the proposed metrics can be found here: https://github.com/apache/kafka/pull/12391
Metric Name | Level | Type | Description | Notes |
---|---|---|---|---|
active-restoring-tasks | thread / INFO | count | The number of active tasks currently undergoing restoration | |
standby-updating-tasks | thread / INFO | count | The number of standby tasks currently being updated | |
active-paused-tasks | thread / INFO | count | The number of active tasks paused restoring | |
standby-paused-tasks | thread / INFO | count | The number of standby tasks paused updating | |
idle-ratio | thread / INFO | gauge (percentage) | The fraction of time the thread spent being idle | idle-ratio + restore/update-ratio + checkpoint-ratio should be 1 |
active-restore-ratio | thread / INFO | gauge (percentage) | The fraction of time the thread spent restoring active tasks | idle-ratio + restore/update-ratio + checkpoint-ratio should be 1; only one of the restore/update-ratio should be non-zero |
standby-update-ratio | thread / INFO | gauge (percentage) | The fraction of time the thread spent updating standby tasks | idle-ratio + restore/update-ratio + checkpoint-ratio should be 1; only one of the restore/update-ratio should be non-zero |
checkpoint-ratio | thread / INFO | gauge (percentage) | The fraction of time the thread spent checkpointing restored progress | idle-ratio + restore/update-ratio + checkpoint-ratio should be 1 |
restore-records-rate | thread / INFO | rate | The average per-second number of records restored/updated for all tasks | |
restore-call-rate | thread / INFO | rate | The average per-second number of restore calls triggered | |
restore-total | task / DEBUG | count | The total number of records restored for active task | the metric would persist even after the task completes restoration, and would be removed only when the task is removed from the thread. |
restore-rate | task / DEBUG | rate | The average per-second number of records restored for active task | the metric would drop to zero when the task completes restoration, and would be removed only when the task is removed from the thread. |
update-total | task / DEBUG | count | The total number of records updated for standby task | same as above |
update-rate | task / DEBUG | rate | The average per-second number of records updated for standby task | same as above |
restore-remaining-records-total | task / INFO | count | The number of records remaining to be restored for active tasks | |
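The Notes column states two invariants for the ratio gauges: the four ratios should add up to 1, and the restore and update ratios should not both be non-zero. The following is a minimal sketch of that arithmetic, assuming the time in a measurement window is fully accounted for by the four phases. The class `StateUpdaterRatios` and its inputs are hypothetical and only illustrate the relationship; this is not how the thread itself records the metrics.

```java
// Hypothetical helper, not part of the KIP: derives the four thread-level
// ratios from the time spent in each phase of one measurement window.
final class StateUpdaterRatios {
    final double idle;
    final double activeRestore;
    final double standbyUpdate;
    final double checkpoint;

    private StateUpdaterRatios(double idle, double activeRestore,
                               double standbyUpdate, double checkpoint) {
        this.idle = idle;
        this.activeRestore = activeRestore;
        this.standbyUpdate = standbyUpdate;
        this.checkpoint = checkpoint;
    }

    // Assumption: the window consists only of these four phases.
    static StateUpdaterRatios fromNanos(long idleNs, long restoreNs,
                                        long updateNs, long checkpointNs) {
        if (idleNs < 0 || restoreNs < 0 || updateNs < 0 || checkpointNs < 0) {
            throw new IllegalArgumentException("durations must be non-negative");
        }
        long total = idleNs + restoreNs + updateNs + checkpointNs;
        if (total <= 0) {
            throw new IllegalArgumentException("window must be non-empty");
        }
        return new StateUpdaterRatios(
            (double) idleNs / total,
            (double) restoreNs / total,
            (double) updateNs / total,
            (double) checkpointNs / total);
    }

    // Checks both invariants from the Notes column: the ratios sum to 1
    // (within epsilon) and at most one of restore/update is non-zero.
    boolean isConsistent(double epsilon) {
        double sum = idle + activeRestore + standbyUpdate + checkpoint;
        boolean sumsToOne = Math.abs(sum - 1.0) <= epsilon;
        boolean atMostOneBusy = activeRestore == 0.0 || standbyUpdate == 0.0;
        return sumsToOne && atMostOneBusy;
    }
}
```

A window with 600 ns idle, 300 ns restoring, 0 ns updating and 100 ns checkpointing passes the check, whereas a window reporting both restore and update time does not.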
Along with these new metrics, we would also deprecate the metrics below:
...
public interface StateRestoreListener {
    void onRestoreStart(final TopicPartition topicPartition,
                        final String storeName,
                        final long startingOffset,
                        final long endingOffset);

    void onRestoreEnd(final TopicPartition topicPartition,
                      final String storeName,
                      final long totalRestored);
    ...
    /**
     * NEW FUNC. Method called when restoring the {@link StateStore} is suspended due to the task being suspended from the host.
     * If the task is resumed after suspension and restoration continues, {@link #onRestoreStart} will be called again.
     */
    default void onRestoreSuspended(final TopicPartition topicPartition,
                                    final String storeName,
                                    final long totalRestored) {
        // do nothing
    }
}
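To illustrate how an application might use the new callback, here is a minimal sketch of a listener that records whether each store's restoration is running, suspended, or finished. So that it compiles without the Kafka jars, `TopicPartition` and `StateRestoreListener` are local stand-ins that mirror only the callbacks shown in the excerpt above. A real application would implement `org.apache.kafka.streams.processor.StateRestoreListener` instead. The `RestoreStateTracker` class and its `Phase` enum are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for org.apache.kafka.common.TopicPartition (assumption: only
// topic and partition matter for this sketch).
final class TopicPartition {
    final String topic;
    final int partition;

    TopicPartition(String topic, int partition) {
        this.topic = topic;
        this.partition = partition;
    }

    @Override
    public String toString() {
        return topic + "-" + partition;
    }
}

// Stand-in mirroring only the callbacks shown in the KIP excerpt.
interface StateRestoreListener {
    void onRestoreStart(TopicPartition topicPartition, String storeName,
                        long startingOffset, long endingOffset);

    void onRestoreEnd(TopicPartition topicPartition, String storeName,
                      long totalRestored);

    default void onRestoreSuspended(TopicPartition topicPartition,
                                    String storeName, long totalRestored) {
        // do nothing
    }
}

// Hypothetical listener: tracks the restoration phase per store and partition.
final class RestoreStateTracker implements StateRestoreListener {
    enum Phase { RESTORING, SUSPENDED, DONE }

    private final Map<String, Phase> phases = new ConcurrentHashMap<>();

    private static String key(TopicPartition tp, String storeName) {
        return storeName + "@" + tp;
    }

    @Override
    public void onRestoreStart(TopicPartition tp, String storeName,
                               long startingOffset, long endingOffset) {
        // Also called again when a suspended task resumes restoration.
        phases.put(key(tp, storeName), Phase.RESTORING);
    }

    @Override
    public void onRestoreSuspended(TopicPartition tp, String storeName,
                                   long totalRestored) {
        phases.put(key(tp, storeName), Phase.SUSPENDED);
    }

    @Override
    public void onRestoreEnd(TopicPartition tp, String storeName,
                             long totalRestored) {
        phases.put(key(tp, storeName), Phase.DONE);
    }

    // Returns null if no callback has been seen for this store and partition.
    Phase phaseOf(TopicPartition tp, String storeName) {
        return phases.get(key(tp, storeName));
    }
}
```

Because `onRestoreSuspended` has a default no-op body, existing listeners that do not override it continue to compile unchanged.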
...