Status

Current state: Accepted

Discussion thread: https://lists.apache.org/thread/l3opxt4mhlcxk3jftrf553z3sdwhyolh

Voting thread: https://lists.apache.org/thread/mhomyw91qprn4dwd4br4csmwz25b9yyv

JIRA: FLINK-37410 - Getting issue details... STATUS

Released: 2.1

Motivation

Overtime we accumulated logic for source splits behavior under watermark alignment and idleness. While at the task /
operator levels, users can monitor progress through metrics like currentInput/OutputWatermark and watermarkAlignmentDrift, The state of each source split within a source operator is never
reported.
While the issue is partially addressed by currentInputNWatermark  metric, it makes sense to have a well defined scope for
reporting both watermark and the different states a split can be at, providing fine grained progress visibility at real time.

Public Interfaces


The following new metrics will be introduced on the split level:

Scope: taskmanager.job.task.operator.split.watermark 

metric

type

description

currentWatermark

Gauge

The last watermark this split has received (in milliseconds).

activeTimeMsPerSecond

Gauge

The time (in milliseconds) this split is active (i.e. not paused due to watermark alignment or idle due to idleness detection) per second.

pausedTimeMsPerSecond

Gauge

The time (in milliseconds) this split is paused due to watermark alignment per second.

idleTimeMsPerSecond

Gauge

The time (in milliseconds) this split is marked idle by idleness detection per second.

accumulatedActiveTimeMs

Gauge

Accumulated time (in milliseconds) this split was active since registered

accumulatedPausedTimeMs

Gauge

Accumulated time (in milliseconds) this split was paused since registered

accumulatedIdleTimeMs

Gauge

Accumulated time (in milliseconds) this split was idle since registered


Proposed Changes

A new metric group SourceSplitMetricGroup will be introduced, as a sub-group of OperatorMetricGroup. The group, containing a watermark reference and pauseable timers for states, will be initialized per split by SourceOperator, which already keeps track of splits pause/resume splits. TimestampsAndWatermarks.WatermarkUpdateListener and WatermarkOutputMultiplexer.WatermarkUpdateListener interfaces will be extended to bubble idleness updates on the split level, upstream to SourceOperator. so it can switch the split idle timer on and off as well.

States definitions

  • Idle clock will tick once a split was marked idle by idleness detection, until it emits a watermark (or marked paused)
  • Paused clock logs time since a split was added to pausedSplits list by sourceOperator due to watermark alignment, until it is allowed to resume, (or marked idle)
  • Active time will be the amount of milliseconds the split was neither idle nor paused.

Compatibility, Deprecation, and Migration Plan

No existing behavior will be changed.

Test Plan

The new metric group, as well as state transitions reporting under alignment / idleness will be unit tested.

Rejected Alternatives

Export split - level metrics on the operator scope - same as currentInputNWatermark i.e. currentInputNActiveTimeMsPerSecond, currentInputNIdleTimeMsPerSecond, etc.

Rejected because the dynamic nature of currentInputNWatermark is already a challenge for monitoring systems (i.e. for keeping allow/block lists and writing queries), and also not aligned with most flink metrics scoping practices.

  • No labels