Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Switch iterator-duration metric to nanoseconds

Table of Contents

Status

Current state: Under DiscussionAccepted

Discussion thread:

JIRA:

Jira
serverASF JIRA
serverId5aa69414-a9e9-3523-82ec-879b028fb15b
keyKAFKA-15541

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Iterator leaks are caused when a user opens an Iterator on a StateStore, but never closes it. Despite Iterators being AutoCloseable, it's still possible for them to fail to close, if a user does not allocate them in a try-with-resources block. Iterator leaks can directly cause memory leaks, or other performance problems, depending on the StateStore implementation.

...

We introduce several new metrics to aid users in finding leaked Iterators, as well as identifying the cause of a high number of pinned blocks, or other kinds of memory leaks and performance problems.

Public Interfaces

New Metrics

Each added metric will be on store-level and have the following tags:

...

NameRecording LevelMetric TypeDescription
num-open-iteratorsINFOGaugeThe current number of Iterators on the store that have been created, but not yet closed.
iterator-duration-avgDEBUGAverageThe average time spent between creating an Iterator and closing it, in millisecondsnanoseconds.
iterator-duration-maxDEBUGMaximumThe maximum time spent between creating an Iterator and closing it, in millisecondsnanoseconds.
oldest-iterator-open-since-msDEBUGGaugeThe UNIX timestamp the oldest still open iterator was created, in milliseconds.

Proposed Changes

MeteredKeyValue store, which by default automatically wraps all state stores created by Kafka Streams, already defines a MeteredKeyValueIterator , which automatically wraps all Iterators provided by the underlying StateStore. We will add our new metrics to MeteredKeyValueIterator.

...

The iterator-duration metrics will be updated whenever an Iterator's close() method is called, recording the time since the Iterator was created. The measurement will be conducted using SystemTime.nanoTimenanoseconds() and reported in milliseconds, with a fractional component approximating the sub-millisecond precisionnanoseconds. e.g. 2.337583 2337583, would be approximately 2 milliseconds, 337 microseconds and 583 nanoseconds. Users would want to monitor this metric and take action if:

...

The oldest-iterator-open-since-ms metric will be updated by tracking all currently open Iterators and taking the min(startTime), where startTime is recorded in the constructor by calling SystemTime.currentTimeMillismilliseconds(). Users will want to take action if:

  • The metric continues to climb indefinitely over time, as it indicates that old Iterators are not being closed.
  • The metric is consistently very high, despite not continuing to climb, which indicates an Iterator that takes a long time to complete iteration and close. This could indicate a performance problem in the application, like iterating over every key in a very large StateStore.

Compatibility, Deprecation, and Migration Plan

  • No impact is expected for existing users, as no existing interfaces or behaviour is being changed.
  • The performance overheads of tracking the opening/closing of Iterators is expected to be negligible to the point that it will not be measurable in real-world applications.

Test Plan

MeteredKeyValueStoreTest will be updated with new tests for each metric.

Rejected Alternatives

  • Originally, this KIP focused on providing metrics exclusively for RocksDB Iterators. However, it was determined that it would be useful to have these metrics available for all StateStores. The original implementation involved using RocksDBs internal Tickers to track open/closed Iterators. Our new implementation maintains these counts in Kafka Streams, which enables these metrics to be reported for any StateStore.