...
For private states, each TM claims a new sub-directory and DirectoryStateHandle under the taskmanager-owned path as long as a job restoration occurs. This is because the old private state files will not be included in the new checkpoint, rendering the closed files for private state useless for further file merging.
The sub-directory path contains information of current subtask/TM, which is enough for TMs to determine whether to reuse the old one or migrate to a new one.
- Path for shared states of each subtask: ${checkpointBaseDir}/shared/subtask-{index}-{parallelism}
- Path for private states of each task manager: ${checkpointBaseDir}/taskmanager-owned/${tmResourceId}
Creating a sub-directory for each granularity unit is a good plan, as it provides a quick way for JM to delete a batch of unclaimed files. Compared to deleting specified physical files, deleting directories does not require JM to know the complicated relationship between checkpoints and physical files, which is maintained by the TMs. Additionally, deleting the directories as a whole helps to remove leftover files (half-written files in the directory that have not been reported to JM), improving the stability of the system.
...
- Failover: The new subtask will take over the previously managed sub-directory and delete the un-referenced files in best effort.
- Checkpoint notification message lost: There is a possibility that the notification message from JM to TM may be lost due to some reason. To address this, TM uses a watermark to keep track of the latest checkpoint that utilizes a file or segment, and releases the file or segment when a later checkpoint is subsumed. This ensures that even if notification is lost, the arrival of later messages will still trigger the file cleaning procedure.later checkpoint is subsumed. This ensures that even if notification is lost, the arrival of later messages will still trigger the file cleaning procedure.
4.9. State ownership comparison with other designs
The FLINK-23342 and its design doc[1] provide several designs of state ownership to overcome the disadvantages of having the state files managed by JM. The state ownership design of this FLIP is pretty much like the option 3 in that doc. The main difference is that this FLIP implies the concept of 'epoch' of that doc in the folder path for each granularity (as described in 4.3). Comparing the JM-TM mixed ownership designs of the FLINK-23342 with this FLIP:
Current Flink | Option 5 from the doc: Split counters | Option 3 from the doc: Epochs | Design of this FLIP | |
Ownership | JM | Mixed | Mixed | Mixed |
Granularity of tracking shared state | File | File | Folder | Folder |
Fragility | Bad | Good | Good | Good |
Reupload | On abort | - | On epoch change | On job restart by user |
Cleanup efficiency | Bad | Medium (files but distributed) | Good (folder at once) | Good (folder at once) |
Cleanup reliability | Bad | Bad | Good | Good |
RPC overhead | Bad | Good | Good | Good |
Dev. time | - | Medium / Bad | Medium | Medium |
Invasive change of shared state registry | - | Bad | Good | Very Good |
5. Public interfaces and User Cases
...
11. Rejected Alternatives
None for now.
\[1\] The design doc of FLINK-23342: https://docs.google.com/document/d/1NJJQ30P27BmUvD7oa4FChvkYxMEgjRPTVdO1dHLl_9I/edit#