ID | IEP-37 |
Author | |
Sponsor | |
Created | 06 Sep 2019 |
Status | DRAFT |
Current SQL engine has a number of critical limitations:
The approach to solve the limitations implies more complex execution flow that brings a new abstraction: execution graph.
This graph consists of all execution steps (relational operators) like join, filter, sort, etc. The execution graph may be transformed into another one saving query semantic during query optimisation process using relational algebra and transformation rules. After transformation the graph is split into a set of dependent subgraphs where an output of dependent subgraph is an aggregated input of depending one (in other words we have more than one map-reduce steps).
Optimized query execution graph (or query plan) used to build final execution tasks.
So, the query flow gets a number of additional steps and will consist of next steps:
SELECT t1.name, t2.name as projectName FROM Persons t1, Projects t2 where t1.id == t2.responsiblePersonId
Let's assume there is no collocation and the data placed on different nodes.
Project (t1.name name, t2.name projectName)
Join (t1.id == t2.responsiblePersonId)
Scan (Persons t1)
Scan (Projects t2)
Exchange (SINGLE) // collecting
Project (t1.name name, t2.name projectName)
Join (t1.id == t2.id)
Exchange (HASH t1.id) // repartitioning
Project (t1.name name, t1.id id)
Scan (Persons t1)
Exchange (HASH t2.id) // repartitioning
Project (t2.name name, t2.responsiblePersonId id)
Scan (Projects t2)
1) Executes on a client node:
Receive (id = 1)
2) Executes on an aggregator/aggregators node(s):
Send (targetId = 1)
Project (t1.name name, t2.name projectName)
Join (t1.id == t2.responsiblePersonId)
Receive (id = 2 t1)
Receive (id = 3 t2)
3) Executes on data nodes:
Send (targetId = 2)
Project (t1.name name, t1.id id)
Scan (Persons t1)
4) Executes on data nodes:
Send (targetId = 3)
Project (t2.name name, t2.responsiblePersonId id)
Scan (Projects t2)
Each node may have several roles, intermediate tasks count is unlimited, there may be any count of subsequent joins or sub-selects.
Apache Calcite library is going to be responsible for execution graph building/optimizing.
There are several example of successful Calcite integrations (Apache Drill, Apache Flink, Hive, etc)
The main issue is the new Calcite based engine (the engine) is completely different to current one. At first the engine will available via internal API. We need really good test coverage to make sure the feature works as expected in all possible scenarios.
http://apache-ignite-developers.2346864.n4.nabble.com/New-SQL-execution-engine-td43724.html