Current state: Under Discussion
Discussion thread: here
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Generally speaking, an application may consist of one or more jobs, and those jobs may want to share data with one another. In Flink, the jobs in the same application are independent and share nothing among themselves. If a Flink application involves several sequential steps, each step (as an independent job) has to write its intermediate results to an external sink so that the following step (job) can use them as a source.
Although this works functionally, the programming paradigm has a few shortcomings:
- In order to share a result, a sink must be provided.
- Complicated applications become inefficient due to the large amount of IO on intermediate results.
- The user experience suffers for users of the programming API (SQL users are not affected because the temporary tables are created by the framework).
It turns out that interactive programming support is critical to the user experience on Flink in batch processing scenarios. The following code gives an example:
In the above code, because b is not cached, it will be computed from scratch each time it is referenced later in the program.
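The recomputation problem can be reproduced with a small toy model. The sketch below is written in Python and is not the Flink Table API: `LazyTable`, `select`, `collect` and the `evaluations` counter are hypothetical names introduced only to show that an uncached, lazily evaluated node is recomputed for every downstream consumer.

```python
class LazyTable:
    """Toy stand-in for a lazily evaluated table; not the Flink Table API."""

    def __init__(self, compute, name):
        self._compute = compute   # zero-argument function producing the rows
        self.name = name
        self.evaluations = 0      # how many times this node was computed

    def select(self, fn, name):
        # Deriving a table only records the transformation; nothing runs yet.
        return LazyTable(lambda: [fn(row) for row in self.collect()], name)

    def collect(self):
        # Nothing is saved, so every call walks the whole lineage again.
        self.evaluations += 1
        return self._compute()


a = LazyTable(lambda: [1, 2, 3], "a")
b = a.select(lambda x: x * 10, "b")
c = b.select(lambda x: x + 1, "c")
d = b.select(lambda x: x - 1, "d")

c_rows = c.collect()   # computes a, then b, then c
d_rows = d.collect()   # computes a and b a second time, then d
```

After both calls, `b.evaluations` and `a.evaluations` are each 2, which is the duplicated work that caching b is meant to avoid.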
To address the above issues, we propose to add support for interactive programming in the Flink Table API.
1. Add the following two new methods to the Flink Table class.
2. Add a close method to the TableEnvironment.
3. Add the following configuration to control whether automatic caching is enabled.
The default value is true, i.e. by default auto caching is enabled.
Cache intermediate results
As mentioned in the motivation section, the key idea of the FLIP is to allow intermediate results to be cached, so that later references to a result do not cause duplicate computation. To achieve that, we need to introduce Cached Tables.
The cached tables are tables whose contents are saved by Flink as the user application runs. A cached Table can be created in two ways:
- Users can call the cache() method on a table to explicitly tell Flink to cache it.
  - The cache() method returns a new Table object with a flag set.
  - The cache() method does not execute eagerly. Instead, the table will be cached when the DAG that contains the cached table runs.
- As the application runs, if Flink can save an intermediate result at little cost, it will do so even if users did not call cache() explicitly. This typically occurs at shuffle boundaries.
  - Auto caching will be enabled by default but can be turned off by users.
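The explicit path can be illustrated with a minimal sketch, assuming a toy `Table` and a hypothetical `Env` class that stands in for the environment running the jobs. None of these names are the real Flink classes; the sketch only shows that cache() returns a new object with a flag set and that the result is saved when a job containing the flagged table runs, not when cache() is called.

```python
class Table:
    """Toy table node; a hypothetical model, not the Flink Table class."""

    def __init__(self, name, compute, cache_flag=False):
        self.name = name
        self.compute = compute
        self.cache_flag = cache_flag

    def cache(self):
        # Returns a NEW table with the flag set. The receiver is unchanged
        # and nothing is executed here.
        return Table(self.name, self.compute, cache_flag=True)


class Env:
    """Hypothetical environment that runs tables and keeps flagged results."""

    def __init__(self):
        self.cached = {}        # table name -> saved rows
        self.computations = 0   # number of real computations performed

    def execute(self, table):
        if table.name in self.cached:
            return self.cached[table.name]
        self.computations += 1
        rows = table.compute()
        if table.cache_flag:
            # Caching happens only when a job containing the table runs.
            self.cached[table.name] = rows
        return rows


env = Env()
t = Table("t", lambda: [1, 2, 3])
ct = t.cache()
cached_before_run = dict(env.cached)   # still empty: cache() is lazy
first = env.execute(ct)
second = env.execute(ct)               # served from the saved result
```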
Semantics of the cache() method
The semantics of the cache() method differ slightly depending on whether auto caching is enabled.
When auto caching is enabled (default behavior)
When auto caching is disabled
Scope of the cached result
The cached tables are available to the user application using the same TableEnvironment.
Release the cached results
The cached intermediate results consume resources and need to be released eventually. A cached result is released in two cases.
User application exits
When the TableEnvironment is closed, the resources consumed by the cached tables will also be released. This usually happens when the user application exits.
Explicit invalidateCache() invocation
Sometimes users may want to release the resource used by a cached table before the application exits. In this case, users can call invalidateCache() on a table. This will immediately release the resources used to cache that table.
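The two release paths can be sketched as follows. `CachedTableRegistry` is a hypothetical helper, not a Flink class; it assumes cached results are tracked per environment, that `invalidate_cache` frees one entry immediately, and that `close` frees whatever is left. The `with` block models the application lifetime.

```python
class CachedTableRegistry:
    """Hypothetical registry of cached results owned by one environment."""

    def __init__(self):
        self._results = {}   # table name -> cached rows
        self.closed = False

    def put(self, name, rows):
        self._results[name] = rows

    def is_cached(self, name):
        return name in self._results

    def invalidate_cache(self, name):
        # Releases one cached result immediately; unknown names are ignored.
        self._results.pop(name, None)

    def close(self):
        # Releases everything still cached, as happens at application exit.
        self._results.clear()
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False


with CachedTableRegistry() as registry:
    registry.put("b", [10, 20, 30])
    registry.put("c", [11, 21, 31])
    registry.invalidate_cache("b")            # explicit early release
    b_cached_mid_run = registry.is_cached("b")
    c_cached_mid_run = registry.is_cached("c")

c_cached_after_close = registry.is_cached("c")
```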
Explicitly ignore the cached intermediate result
In some rare cases, users may want to explicitly ignore the cached intermediate result. In this case, the user needs to give an explicit hint, such as:
Flink does not have a hint mechanism yet. Until such a mechanism is available, users cannot explicitly ignore a cached intermediate result.
Cache a stream table
Theoretically speaking, users can also cache a streaming table. The semantics would be to store the result somewhere (potentially with a TTL). However, caching a streaming table is usually not that useful. For simplicity, we will not support stream table caching in the first implementation. When cache() is invoked on a stream table, it will be treated as a no-op. This leaves room to add caching for stream tables in the future without asking users to change their code.
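A minimal sketch of the no-op behavior, assuming a toy immutable `Table` with a hypothetical `is_streaming` attribute (the real Table class does not necessarily expose such a field): calling cache() on a stream table hands back the same table, so user code keeps working unchanged if stream caching is added later.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Table:
    """Toy immutable table descriptor; not the Flink Table class."""
    name: str
    is_streaming: bool
    cache_flag: bool = False

    def cache(self):
        if self.is_streaming:
            # Treated as a no-op in the first implementation.
            return self
        # Batch tables get a new object with the cache flag set.
        return replace(self, cache_flag=True)


batch = Table("orders", is_streaming=False)
stream = Table("clicks", is_streaming=True)

cached_batch = batch.cache()
cached_stream = stream.cache()
```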
To make the feature available out of the box, a default file system based cache service will be provided. This section describes the implementation details of the default table service.
Although the implementation details are transparent to the users, there are some related changes to make the default implementation work.
NOTE: There are a few phases to the default intermediate result storage.
Explicit cache, i.e. Table.cache()
Explicit cache removal, i.e. Table.invalidateCache()
RPC in TaskManager to remove result partitions.
Locality for default intermediate result storage.
Support of locality hint from ShuffleMaster
Pluggable external intermediate result storage.
Auto cache support, i.e. cache at shuffle boundaries.
Old result partition eviction mechanism for ShuffleMaster / ShuffleService
Locality in general for external intermediate result storage and external shuffle service.
Custom locality preference mechanism in Flink
Support of sophisticated optimization (use or not use cache)
Statistics on the intermediate results.
Cross-application intermediate result sharing.
External catalog service
The following section describes phase 1, which does not support pluggable intermediate result storage or auto caching.
Default Intermediate Result Storage (Phase 1)
Intermediate result reuse
The architecture is illustrated below:
Each cached table consists of two pieces of information:
- Table metadata - name, location, etc.
- Table contents - the actual contents of the table
The default table service stores the metadata in the client (e.g. TableEnvironment --> CacheServiceManager) and saves the actual contents in the cache service instances running inside the Task Managers, more specifically in the network stack, which is also used by the shuffle service.
The end to end process is the following:
Step 1: Execute JOB_1 (write cached tables)
- Users call table.cache(), and the client
  - adds a Sink to the cached node in the DAG. The default IntermediateResultStorage creates a BlockingShuffleSink.
  - computes an IntermediateResultId based on the RelNode DIGEST (DIGEST1 in this case).
  - passes that IntermediateResultId all the way from the RelNode down to the Operators in the JobVertex.
  - sets the IntermediateDataSetID (currently random) to the IntermediateResultId.
- The JobGraphGenerator recognizes the BlockingShuffleSink, removes the Sink and sets the result partition type to BLOCKING_PERSISTENT.
- The client submits the job.
- The JobMaster executes the job as usual. After the job finishes, the ShuffleMaster / ShuffleService keeps the BLOCKING_PERSISTENT result partitions instead of deleting them.
- After the job finishes, the JobMaster reports the locations of the persisted ResultPartitions to the JobManager, which then returns the mapping of [IntermediateDataSetID -> Locations] to the client as part of the JobExecutionResult.
- The client maintains a mapping of DIGEST -> (IntermediateDataSetID, [Locations]).
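The client-side bookkeeping in Step 1 can be sketched as below. This is a toy model under stated assumptions: the FLIP only says the id is derived from the RelNode DIGEST, so using the first 16 bytes of a SHA-256 hash is an illustrative choice, and `ClientCacheMetadata` with its method names is hypothetical.

```python
import hashlib
import uuid


def intermediate_result_id(digest: str) -> uuid.UUID:
    # Deterministic: the same digest always yields the same id.
    # The hash function is an assumption made for this sketch.
    return uuid.UUID(bytes=hashlib.sha256(digest.encode("utf-8")).digest()[:16])


class ClientCacheMetadata:
    """Hypothetical client-side map: DIGEST -> (id, [locations])."""

    def __init__(self):
        self._pending = {}     # id -> digest, for jobs not yet finished
        self._by_digest = {}   # digest -> (id, locations)

    def register_cached_node(self, digest):
        # Called while building the job that writes the cached table.
        result_id = intermediate_result_id(digest)
        self._pending[result_id] = digest
        return result_id

    def on_job_finished(self, locations_by_id):
        # locations_by_id models the [id -> locations] part of the job result.
        for result_id, locations in locations_by_id.items():
            digest = self._pending.pop(result_id, None)
            if digest is not None:
                self._by_digest[digest] = (result_id, list(locations))

    def lookup(self, digest):
        return self._by_digest.get(digest)


metadata = ClientCacheMetadata()
rid = metadata.register_cached_node("DIGEST1")
metadata.on_job_finished({rid: ["tm-1", "tm-2"]})
entry = metadata.lookup("DIGEST1")
missing = metadata.lookup("DIGEST2")
```

Deriving the id from the digest rather than generating it randomly is what lets a later job with the same subtree find the persisted partitions.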
Step 2: Execute JOB_2 (read cached tables)
- Later on, when the client submits another job whose DAG contains a node of DIGEST1, the client
  - looks up DIGEST1 in the available intermediate results.
  - creates a Source node from the IntermediateResultStorage with the location information. The default IntermediateResultStorage creates a BlockingShuffleSource.
  - replaces the node with DIGEST1 and its subtree with the newly created source node.
- The JobGraphGenerator sees a BlockingShuffleSource node, replaces it with an ordinary JobVertex, sets isCacheVertex=true and adds an input edge reading from the intermediate result of IRID_1.
- The client submits the job.
- The JobMaster does the following if JobVertex.isCacheVertex() returns true:
  - This assumes the Scheduler understands the result partition locations.
  - Creates an InputGateDeploymentDescriptor (or a ShuffleDeploymentDescriptor once ShuffleMaster is available).
  - Assigns the result partitions to each subtask based on locality.
- Task managers will run the given tasks as usual.
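The subtree replacement performed by the client in Step 2 can be sketched with a toy plan tree. `Node`, its string digest and `substitute_cached` are hypothetical and far simpler than Calcite RelNodes; the sketch only shows the idea of swapping a node whose digest is known for a leaf source that reads the persisted result.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy plan node; the digest is a string built from the whole subtree."""
    op: str
    inputs: list = field(default_factory=list)

    def digest(self):
        return f"{self.op}({','.join(i.digest() for i in self.inputs)})"


def substitute_cached(node, cached):
    """Return a copy of the plan where cached subtrees become source leaves.

    cached maps digest -> locations of the persisted result partitions.
    """
    d = node.digest()
    if d in cached:
        # The node and everything below it are replaced by a single source.
        return Node(op=f"CachedSource[{d}]")
    return Node(node.op, [substitute_cached(i, cached) for i in node.inputs])


scan = Node("scan")
b = Node("filter", [scan])
c = Node("project", [b])

cached = {b.digest(): ["tm-1"]}
rewritten = substitute_cached(c, cached)
```

The rewritten plan keeps the `project` node on top, while the `filter` node and the `scan` below it are gone, so the second job never recomputes them.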
Step 3: Clean up
- When the application exits, all the Task Managers will exit and the intermediate results will be released.
Invalidate intermediate results
- Users invoke Table.invalidateCache()
- Clients remove the intermediate result entry from local metadata.
- Clients send RPC to Dispatcher to delete the corresponding result partition.
- Dispatcher forwards the result partition deletion request to ResourceManager
- ResourceManager sends RPC to each TaskManager hosting the result partitions to release them.
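The invalidation path above can be sketched as a chain of plain method calls standing in for the RPCs. All four classes are toy models with hypothetical method names; real Flink components have different interfaces, and the sketch assumes the client passes the known locations along with the request.

```python
class TaskManager:
    def __init__(self):
        self.partitions = set()   # ids of result partitions hosted here

    def release(self, result_id):
        self.partitions.discard(result_id)


class ResourceManager:
    def __init__(self, task_managers):
        self.task_managers = task_managers   # name -> TaskManager

    def delete_result(self, result_id, locations):
        # Only the Task Managers hosting the result are contacted.
        for name in locations:
            self.task_managers[name].release(result_id)


class Dispatcher:
    def __init__(self, resource_manager):
        self.resource_manager = resource_manager

    def delete_result(self, result_id, locations):
        # The dispatcher simply forwards the request.
        self.resource_manager.delete_result(result_id, locations)


class Client:
    def __init__(self, dispatcher):
        self.dispatcher = dispatcher
        self.metadata = {}   # digest -> (result id, [locations])

    def invalidate_cache(self, digest):
        # Drop local metadata first, then ask the cluster to free the data.
        entry = self.metadata.pop(digest, None)
        if entry is None:
            return False
        result_id, locations = entry
        self.dispatcher.delete_result(result_id, locations)
        return True


tms = {"tm-1": TaskManager(), "tm-2": TaskManager()}
tms["tm-1"].partitions.add("IRID_1")
tms["tm-2"].partitions.update({"IRID_1", "IRID_2"})

client = Client(Dispatcher(ResourceManager(tms)))
client.metadata["DIGEST1"] = ("IRID_1", ["tm-1", "tm-2"])

removed = client.invalidate_cache("DIGEST1")
removed_again = client.invalidate_cache("DIGEST1")
```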
If a Task Manager instance fails, Flink will bring it up again. However, every intermediate result that has a partition on the failed TM will become unavailable.
In this case, the consuming job will throw an exception and the job will fail. As a result, CacheServiceManager invalidates the caches that are impacted. The TableEnvironment will resubmit the original DAG without using the cache. Note that because there is no cache available, the TableEnvironment (planner) will again create a Sink to cache the result that was initially cached, therefore the cache will be recreated after the execution of the original DAG.
The above process is transparent to the users.
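The recovery behavior can be sketched as a fallback loop. `run_with_fallback`, `CachedResultUnavailable` and the dictionary-based cache are hypothetical simplifications: the cache here holds rows directly, whereas the real design stores partition locations and resubmits a DAG.

```python
class CachedResultUnavailable(Exception):
    """Raised by the toy reader when a cached partition was lost."""


def run_with_fallback(digest, cache, read_cached, run_original):
    if digest in cache:
        try:
            return read_cached(cache[digest])
        except CachedResultUnavailable:
            # Invalidate the impacted entry, then fall through to the rerun.
            del cache[digest]
    rows = run_original()
    cache[digest] = rows   # the rerun caches the result again
    return rows


def failing_reader(_rows):
    raise CachedResultUnavailable("partition lost with its Task Manager")


def must_not_run():
    raise AssertionError("original DAG should not be rerun when cache is healthy")


cache = {"DIGEST1": [1, 2, 3]}

# First attempt: the cached partition is gone, so the original DAG is rerun.
first = run_with_fallback("DIGEST1", cache, failing_reader, lambda: [1, 2, 3])

# Second attempt: the cache was recreated and is read successfully.
second = run_with_fallback("DIGEST1", cache, lambda rows: rows, must_not_run)
```

From the caller's point of view both calls return the same rows, which is the sense in which the recovery is transparent.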
In order to implement the default intermediate result storage, the following changes are needed.
- New result partition type: BLOCKING_PERSISTENT
- A BLOCKING_PERSISTENT result partition will not be deleted when job exits.
- JobMaster executes JobGraph without Sink or Source node.
- JobMaster reports IntermediateDataSetID to ResultPartitionDescriptor mapping to JobManager.
- JobManager returns the DIGEST to ResultPartitionDescriptor mapping to clients in JobExecutionResult.
- RPC call in TaskManagers to delete a ResultPartition. (As a part of FLIP-31)
- Replace the random IntermediateDataSetID with one derived from the RelNode DIGEST.
- Add IntermediateDataSetID (derived from DIGEST) to StreamTransformation and Operator
- This is required in order to map ResultPartitionDescriptor to DIGEST.
- Clients store the DIGEST → (IntermediateResultId, IntermediateResultDescriptor) mapping.
- Clients replace the matching digest with a source before optimization.
Impact to optimization
When users explicitly cache a table, the DAG is implicitly changed. This is because the optimizer might have changed the cached node had it not been explicitly cached. As of now, when a node has more than one downstream node in the DAG, that node will be computed multiple times. This could be solved by introducing partial optimization for subgraphs, a feature available in Blink but not yet in Flink. This is an existing problem in Flink and this FLIP does not attempt to address it.
Integration with external shuffle service (FLIP-31)
To achieve auto caching, the cache service needs to be integrated with the shuffle service. This is a natural move, since the external shuffle service and the cache service inherently share many similarities:
Make cache service pluggable
In some cases, users may want to plug in their own cache service. In the future, we could add support for that.
Some API changes will be needed to support a customized cache service. We will start another FLIP to discuss that. The changes should be small. Curious readers can read the Google doc to get an idea.
Add auto cache
Auto cache allows the BLOCKING shuffle boundaries to be persisted for later usage. It further relieves the users from thinking about when to explicitly cache in the program.
Add cache to DataStream API
As of now, DataStream only supports stream processing. There are ideas for supporting both stream and batch (as finite streams) in DataStream. Once that is done, we can add the cache API to DataStream as well.
Compatibility, Deprecation, and Migration Plan
This FLIP proposes a new feature in Flink. It is fully backwards compatible.
Unit tests and integration tests will be added to test the proposed functionality.
The semantics of the cache() / invalidateCache() API have gone through extended discussions. The rejected alternative semantics are documented below:
Rejected API Option 1
Simple and intuitive; users only need to deal with one variable of the Table class.
Side effect: a table may be cached / uncached in a method invocation, while the caller does not know about this.
Rejected API Option 2
No side effect
Optimizer has no chance to kick in.
Users have to distinguish between original table / cached table.
Adding auto cache becomes a backward incompatible change.
Rejected API Option 3
No side effect
Users only deal with the variable.
Easy to add auto caching.
The behavior of t.foo() changes after t.cache(). The concern is that this is considered to “modify” table t, which goes against the immutability principle.