Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

In the first section we have illustrated the current difference between job cancellation and stop, as well as the issue and ambiguity when stopping a job with stateful operators and retained checkpointing. Let’s look deeper about why this happens, analogizing to the mature database system.

Checkpoint 

Both checkpoint in Flink and database are system-controlled process.

In database (take Take MySQL for example, ) there’s also a checkpoint concept [2]. Along with data ingestion, changes are made to data pages which are cached in the buffer pool and written to the data files some time later by a process known as flushing. The checkpoint is a record of the latest changes that have been successfully flushed.

...

  • During normal operation, it performs fuzzy checkpoints, just like the normal checkpointing process in Flink.
  • When the database shuts down, it performs a sharp checkpoint, which is missing in current Flink job stopping implementation.

...

  • process.

Cancel, Stop and Failover

...

  • Resuming a Flink job after cancellation involves source rewinding, just like database resume from killing needs a redo log replay.
  • Resuming a Flink job after stop should be able to load from the latest retained checkpoint without any source rewinding, just like no redo log replay needed for database resume from normal shutdown.

      ...

        • This is missing in current implementation.

      Savepoint

      We could map the Flink savepoint concept to database backups since the use-case of savepoints in Flink [6] includes:

      ...

      In Flink we have long supported cancel with savepoint, and recently in FLIP-34 we have implemented stop with savepoint, both could be mapping to automatically triggering a backup before killing/shutting down the database instance, and completely orthogonal with the fuzzy/sharp checkpoint process.

      To conclude, according to all above analogies, it's clear that we should always (automatically) do a checkpoint (with retained checkpoint configured) when stopping job, just like we (automatically) do a sharp checkpoint during database shutdown. And this is a necessary supplement to reinforce the job stop semantic.

      Proposed Changes

      ...

      FLIP-34 has introduced new required options for stop command including "-s" and "-d" [9], which actually changes user behavior since old command w/o such options cannot work any more.

      After changes for this FLIP, the normal stop command will work again:

      • If retained checkpoint is

      ...

      • enabled,

      ...

      •  we should always do a checkpoint when stopping job, unless there’s an "-s" option in the command which indicates a stop with savepoint.

      ...

      • If retained checkpoint is disabled, we fallback to the old stop process before FLIP-34, which will finish the source task and allow the job to finish processing all inflight data.

      It also makes sense if user issues another cancel command for quick job termination when observing the stop process got stuck, similar to killing the database instance if don’t want to wait for the normal shutdown. And we should make sure the after-stop cancel command could take effect.

      ...