...

Recover from failure
Refresh job environment
- Rescale or change job graph
- Upgrade and state migration
Switch state backend
Import and export (FLIP-43 [7])
Fork job for blue/red deployment

Correspondingly, databases use backups [87] to:

Recover from accidental deletes
Refresh development environments
Migrate databases or switch storage engine [98]
Import and export data
Create database copies for testing, training and demonstrations

Flink has long supported cancel with savepoint, and FLIP-34 [10] recently added stop with savepoint. Both map to automatically triggering a backup before killing or shutting down the database instance, and both are completely orthogonal to the fuzzy/sharp checkpoint process.

...

It also makes sense for a user to issue a cancel command to terminate the job quickly when the stop process appears to be stuck, similar to killing a database instance rather than waiting for a normal shutdown. We should therefore make sure that a cancel command issued after a stop command still takes effect.
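The requirement that a cancel issued after a stop must still take effect can be pictured with a small state machine. This is only a toy sketch, not Flink's scheduler code: the `Job` class, its methods, and the `STOPPING` state label are hypothetical names introduced for illustration.

```python
from enum import Enum


class JobState(Enum):
    RUNNING = "RUNNING"
    STOPPING = "STOPPING"  # hypothetical label for "stop in progress"
    FINISHED = "FINISHED"
    CANCELED = "CANCELED"


class Job:
    """Toy job model (assumption); not Flink's actual implementation."""

    def __init__(self):
        self.state = JobState.RUNNING

    def stop(self):
        # Graceful path: begin the stop procedure.
        if self.state is JobState.RUNNING:
            self.state = JobState.STOPPING

    def complete_stop(self):
        # Called once the stop procedure has run to completion.
        if self.state is JobState.STOPPING:
            self.state = JobState.FINISHED

    def cancel(self):
        # Cancel must still take effect while a stop is pending or stuck.
        if self.state in (JobState.RUNNING, JobState.STOPPING):
            self.state = JobState.CANCELED


job = Job()
job.stop()           # user asks for a graceful stop
job.cancel()         # stop appears stuck, user cancels
job.complete_stop()  # a late completion no longer changes the outcome
print(job.state.name)
```

The point of the sketch is that `cancel()` accepts the in-progress stop state as a valid starting point, so the quick termination path is never blocked by a pending stop.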

Implementation

FLIP-34 introduced two types of job stop:

| Type | Source OPS | Task Status | Job Status |
| --- | --- | --- | --- |
| SUSPEND | Checkpoint Barrier, End Of Stream | Finished | Finished |
| TERMINATE | MAX_WATERMARK, Checkpoint Barrier, End Of Stream | Finished | Finished |

The following steps are needed to perform a checkpoint when stopping a job for which retained checkpoints are configured:

The Job Manager triggers a synchronous checkpoint at the sources, which also indicates whether the stop is a TERMINATE or a SUSPEND.
Sources send a MAX_WATERMARK in the case of TERMINATE; nothing is done in the case of SUSPEND.
The Task Manager executes the checkpoint in a SYNCHRONOUS way, i.e. it blocks until the state is persisted successfully and notifyCheckpointComplete() is executed.
The Task Manager acknowledges the successful persistence of the state for the checkpoint.
The Job Manager sends the notification that the checkpoint is completed.
The Task Manager unblocks the synchronous checkpoint execution.
The job finishes starting from the sources, i.e. they shut down and the EOS message propagates through the job.
The Job Manager waits until the job state goes to FINISHED before declaring the operation successful.
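The steps above can be sketched as an ordered event log. This is a minimal model under stated assumptions, not the implementation in the referenced PR: `StopType`, `stop_with_checkpoint`, and the log strings are hypothetical names, and only the success path described above is modeled.

```python
from enum import Enum


class StopType(Enum):
    SUSPEND = "SUSPEND"
    TERMINATE = "TERMINATE"


def stop_with_checkpoint(stop_type):
    """Return (event log, final job state) for one successful stop (toy model)."""
    log = []
    # Step 1: the Job Manager triggers a synchronous checkpoint carrying the stop type.
    log.append(f"JM: trigger synchronous checkpoint ({stop_type.value})")
    # Step 2: sources emit MAX_WATERMARK only for TERMINATE.
    if stop_type is StopType.TERMINATE:
        log.append("Source: emit MAX_WATERMARK")
    log.append("Source: emit checkpoint barrier")
    # Step 3: the Task Manager blocks while the state is persisted.
    log.append("TM: persist state (blocking)")
    # Step 4: the Task Manager acknowledges the persisted state.
    log.append("TM: acknowledge checkpoint")
    # Step 5: the Job Manager notifies that the checkpoint is completed.
    log.append("JM: notify checkpoint complete")
    # Step 6: notifyCheckpointComplete() has run, so the task unblocks.
    log.append("TM: unblock synchronous checkpoint")
    # Step 7: sources shut down and End Of Stream propagates through the job.
    log.append("Source: shut down, emit End Of Stream")
    # Step 8: the Job Manager waits for FINISHED before reporting success.
    log.append("JM: job FINISHED, stop successful")
    return log, "FINISHED"


events, final_state = stop_with_checkpoint(StopType.TERMINATE)
for event in events:
    print(event)
```

Calling the function with `StopType.SUSPEND` yields the same sequence without the `MAX_WATERMARK` event, which is the only difference between the two stop types in this model.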

For more details, please refer to PR#8617; the steps above follow its current implementation.

Note that the user currently controls the life cycle of retained checkpoint files, and restoring from a retained checkpoint reuses the “flink run -s” command and therefore calls CheckpointCoordinator.restoreSavepoint. It is thus fine to restore from a retained checkpoint multiple times, or for multiple jobs, as long as the user does not delete it.
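The ownership rule can be illustrated with a toy model in which restoring is a read-only operation on files that only the user removes. Everything here is an assumption made for illustration: the JSON file, the file name, and the helper functions do not reflect Flink's real checkpoint format or API.

```python
import json
import tempfile
from pathlib import Path


def write_retained_checkpoint(directory, state):
    """Persist state to a file that only the user will ever delete (toy model)."""
    path = Path(directory) / "retained-checkpoint.json"  # hypothetical file name
    path.write_text(json.dumps(state))
    return path


def restore(path):
    """Read-only restore: the checkpoint file is left in place."""
    return json.loads(Path(path).read_text())


with tempfile.TemporaryDirectory() as tmp:
    ckpt = write_retained_checkpoint(tmp, {"count": 42})
    job_a = restore(ckpt)  # first job restores
    job_b = restore(ckpt)  # a second job restores from the same files
    still_there = ckpt.exists()

print(job_a == job_b, still_there)
```

Because `restore` never modifies or removes the file, any number of restores see the same state until the user deletes the checkpoint.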

...

FLIP-34 introduced new required options for the stop command, including "-s" and "-d" [119]. This changes user-facing behavior, since the old command without these options no longer works.

...

What is the Flink concept that corresponds to database snapshots? Shall we introduce one?
- It may share the checkpoint format and allow differences between backends, but it should be a distinct concept (neither a checkpoint nor a backup in the database sense)
- It should mirror the distinction between a database snapshot and a backup [1210]
TBC