Authors: Yu Li, CongXian Qiu
Status
Current state: "Under Discussion"
Discussion thread: TBD
...
...
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
...
However, for stateful operators with retained checkpointing, the stop call won’t take any checkpoint, so when resuming the job it needs to recover from the latest checkpoint with source rewinding, which makes the wait for all in-flight data to be processed meaningless (it all needs to be processed again). In other words, there’s no real difference between stop and cancel in this case, which creates conceptual ambiguity.
On the other hand, in the latest master branch after FLIP-34, job stop is always accompanied by a savepoint, which has the following problems:
...
- Resuming a Flink job after cancellation involves source rewinding, just like a database resuming after being killed needs a redo-log replay.
- Resuming a Flink job after stop should be able to load from the latest retained checkpoint without any source rewinding, just like no redo-log replay is needed when a database resumes after a normal shutdown. This is missing in the current implementation.
Savepoint
The Flink savepoint concept can be analogized to database backups, since the use cases of savepoints in Flink [6] include:
...
In Flink we have long supported cancel with savepoint, and recently in FLIP-34 we implemented stop with savepoint. Both can be mapped to automatically triggering a backup before killing/shutting down the database instance, and both are completely orthogonal to the fuzzy/sharp checkpoint process.
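For concreteness, a minimal CLI sketch of the two operations (the job ID and target directory are placeholders; the option names are taken from the Flink 1.9 CLI and should be treated as assumptions if they differ in a given release):

```
# Cancel with savepoint: trigger a savepoint, then cancel the job
flink cancel -s hdfs:///flink/savepoints <jobId>

# Stop with savepoint (FLIP-34): gracefully stop the sources,
# take a savepoint, then shut the job down
flink stop -p hdfs:///flink/savepoints <jobId>
```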
Conclusion
To conclude, according to all the above analogies, it’s clear that we should always (automatically) take a checkpoint (when retained checkpoints are configured) when stopping a job, just like a database (automatically) takes a sharp checkpoint during shutdown. This is a necessary supplement to reinforce the job stop semantics.
...
It also makes sense for the user to issue another cancel command for quick job termination when observing that the stop process has got stuck, similar to killing the database instance when one doesn’t want to wait for the normal shutdown. We should make sure that such an after-stop cancel command takes effect.
Note that currently the user controls the life cycle of the retained checkpoint files, and restoring from a retained checkpoint reuses the “flink run -s” command (thus calling CheckpointCoordinator.restoreSavepoint), so it’s totally fine to restore from a retained checkpoint multiple times or for multiple jobs, as long as the user doesn’t delete it.
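A hedged sketch of that escalation path (the job ID is a placeholder):

```
# Graceful stop; may appear stuck while waiting for the savepoint
# and the in-flight data to be processed
flink stop <jobId>

# If the stop does not finish in time, escalate to a hard cancel;
# per this proposal the after-stop cancel must still take effect
flink cancel <jobId>
```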
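A minimal sketch of such a restore, assuming retained (externalized) checkpoints are enabled and using placeholder paths and job jar:

```
# Retained checkpoints live under the configured checkpoint directory,
# e.g. <state.checkpoints.dir>/<jobId>/chk-42/
# Point -s at the retained checkpoint directory (or its _metadata file)
# to resume; internally this goes through the savepoint restore path
flink run -s hdfs:///flink/checkpoints/<jobId>/chk-42 my-job.jar
```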
Implementation
After FLIP-34 we have introduced two different types of job stop (a CLI sketch follows the list below):
...
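As a hedged illustration only (the exact type names are defined in FLIP-34; the -d/--drain flag below is from the Flink 1.9 CLI), the two flavors roughly map to stopping with or without draining the in-flight data:

```
# Stop without draining: sources stop, a savepoint is taken, and the
# job can later be resumed from that savepoint
flink stop <jobId>

# Stop with draining (-d/--drain): additionally flush all in-flight
# data (e.g. by advancing to MAX_WATERMARK) before shutting down
flink stop -d <jobId>
```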
For more details, please refer to PR#8617.
We may also need to refactor the entire stop-processing framework with the mailbox model (FLINK-12477) as a next step.
...