This document proposes the addition of a query results cache to Hive. Caching query results allows a previously computed query result to be re-used in the event that the same query is processed by Hive. This can save both time and resources spent running the cluster tasks required for the query.
Existing behavior for Hive query processing (very simplified):
Hive query compilation takes the query string and produces a QueryPlan. The QueryPlan contains both the list of cluster tasks required for the query, as well as the FetchTask used to fetch results. The cluster tasks are configured to output the final results to a designated temp directory. The FetchTask contains a FetchOperator which is configured to read from the designated temp directory, as well as the input format, column information, and other information.
During query execution (Driver.execute()), Hive will submit any cluster tasks (MR/Tez/Spark) required for the query.
When the cluster tasks are finished the queryPlan’s FetchTask is used to read the query results from the temp directory (Driver.getResults())
Cleanup of the query results is handled by the Driver’s Context object. Several directory paths are cleaned up in this step, including the query results directory and parent directories of the results directory.
Keeps a mapping of query string to cached results entry
Used to lookup cache entries based on the query string
Can receive query results from queries, for use by future queries
Responsible for invalidating cache stale entries, and cleaning up resources used by the invalidated cache entries
Changes to query processing:
During query compilation, check the results cache to see if it already has the query results. If there is a cache hit then the query plan will be set to a FetchTask that will read from the cached location.
During query execution:
If the results cache can be used for this query:
The query will simply be the FetchTask reading from the cached results directory. No cluster tasks will be required.
If the results cache cannot be used:
Run the cluster tasks as normal
Check if the query results that have been computed are eligible to add to the results cache.
If results can be cached, the temporary results generated for the query will be saved to the results cache. Steps may need to be done here to ensure the query results directory is not deleted by query cleanup.
At what level of the query processing should results caching be applied
There are several potential places the cache lookup can be done:
- Before parsing/compilation
- After parsing (with the AST available) but before full query optimization/planning
- After query compilation/planning
The current approach is for cache lookup to occur after AST generation but before full optimization. This can give us a chance to avoid full query optimization/planning. Access to the AST also allows Hive to fully qualify table references (to make sure the current query is using the same tables as the cached query), plus row filtering and column masking rewrites (from Ranger access rules) occur on the AST so this can still work with those.
Matching the current query with entries in the results cache
Hive will need to check if the results for an incoming query are already saved in the results cache. A simple way to match the query with results cache entries would be to compare the incoming query text with the query text of the results cache entry. Many of the existing DBMS results caching solutions seem to cache results based on the query string.
Unqualified table references need to be taken into account - depending on the current database at the time of the query, 2 queries running the exact same query string may access tables from completely different databases. So rather than performing the lookup based on the original SQL string, the query string should have any unqualified references resolved. This can be done using the AST and the UnparseTranslator.
There are similar issues with query string matching that apply to user-level security with row/column masking/filtering.
Is the results cache independent or shared between different Hive instances?
Currently each Hive instance has its own separate results cache. Originally the thinking was the query plan or AST would be part of the information saved in the cache, and might have been problematic to save and share between the different instances. But actually only the modified query string is saved for it which might make it possible to eventually have a shared cache. There are still a number of structures that are saved to properly set up the query (results schema, TableAccessInfo/ColumnAccessInfo, ReadEntities), so those would have to be saved somehow.
Which queries can use the results cache
Non-deterministic functions (rand(), unix_timestamp(), ...)
- Runtime constant functions (current_date, current_timestamp, current_user(), current_database(), ...)
External tables? If Hive relies on modifications to the underlying table to invalidate cached results, then external tables should not be allowed because these tables could be written outside of Hive.
If the query has no cluster tasks (fetch-only query), no need to cache
Saving query results from getting deleted by query cleanup
To keep the query results for possible re-use, we need to make sure that the results directory is not deleted as part of query cleanup. Note that the Driver Context cleans up not only the query results directory but the parent directories of the results directory, which may complicate trying to save the query results directory while also performing the rest of the query cleanup.
The solution chosen will be to move the query results directory to a designated results cache directory to make sure that it does not get removed during query cleanup. This requires no changes to the existing query cleanup, and makes for simpler management of cached directories - all cached results can be under a single cache directory.
Column/Row Masking and Filtering
Due to Ranger column/row masking, the query results from one user may look different that they would for a different user, even if the query text is the same. If results caching is enabled, then Hive will need to make sure that the cache lookup works correctly in this case, or else it will have to be disabled for masking/filtering.
The query rewrites that occur for Hive masking/filtering are done on the AST (SemanticAnalyzer.analyzeInternal()). So as long as the query string used for cache lookup can be derived from the rewritten AST (or the AST can be used during lookup), then query lookups should correctly distinguish cases where the query was modified from the original query text due to row/column filtering and masking.
Invalidating results cache entries
One issue with saving query results in a cache is that over time the entries in the results cache can become stale, if the underlying tables from the query are updated or dropped. A solution for this is to have a mechanism to invalidate the entries saved in the results cache. Some possible mechanisms for this:
Allow entries in the results cache to be valid for only a configured length of time. This is a simpler implementation which does not have to rely having to detect when the underlying tables of a cached query result have been changed. The disadvantage is that when the underlying tables are changed, any stale results would be served by the results cache until the cached result expires.
Add a command to clear the results cache. This can give the user a chance to prevent stale results from being served by the cache if they know that a table has been updated. Disadvantage is this requires manual intervention by the user to prevent stale results from being served.
Expire results cache entries if there are updates to the underlying tables. It may be possible to reuse the same mechanism that Materialized Views use to determine if any of the underlying tables of a cached query have been modified.
Cleanup of cache directories
Each Hive instance can keep track of its cached results directories, and set them to be deleted on process exit. Under normal conditions this may be able to take care of cache directory cleanup, but in the case that the Hive process is terminated without a chance to perform graceful cleanup, these directories may still be left around.
At the start of query execution the Hive Driver will acquire locks for all tables being queried. This behavior should still remain even for queries that are using the cache - the read lock will prevent other users from being able to write to the underlying tables and invalidating the cache at the time the results cache is being checked for the query results.