Query reexecution provides a facility to re-run the query multiple times in case of an unfortunate event happens.
Introduced in Hive 3.0 (HIVE-17626)
Enables to change the hive settings for all reexecutions which will be happening. It works by adding a configuration subtree as an overlay to the actual hive settings(reexec.overlay.*)
Every hive setting which has a prefix of "reexec.overlay" will be set for all reexecutions.
A more real life example would be to disable join auto conversion for all reexecutions:
During query execution; the actual number passing rows in every operator is tracked. This information is reused during re-planning which could result in a better plan.
Situation in which this would be needed:
- missing statististics
- incorrect statistics
- many joins
It's not that easy to craft queries which will lead to OOM situations; but to enable it:
Operator level statistics are matched to the new plan using operator subtree matching this also enables to match the information to a query which have "similar" parts.
reexecution plugins; currently overlay and reoptimize is supported
runtime statistics can be persisted:
number of reexecution that may happen
Enable to gather runtime statistics on all queries.
|hive.query.reexecution.stats.cache.batch.size||-1||If runtime stats are stored in metastore; the maximal batch size per round during load.|
|hive.query.reexecution.stats.cache.size||100 000||Size of the runtime statistics cache. Unit is: OperatorStat entry; a query plan consist ~100.|
|runtime.stats.clean.frequency||3600s||Frequency at which timer task runs to remove outdated runtime stat entries.|
|runtime.stats.max.age||3days||Stat entries which are older than this are removed.|