ID | IEP-37 |
Authors | |
Sponsor | |
Created | 06 Sep 2019 |
Status | DRAFT |
Ignite SQL engine is used across hundreds of deployments who are accelerating relational databases or use Ignite as a system of records. However, these days we see that many more companies are migrating to distributed databases that speak SQL natively. With a larger adoption comes a higher demand for SQL support that is comparable to RDBMS. For instance, if a couple of years ago 1 out of 10 use cases needed support for multi-joins queries or queries with subselects or efficient memory usage then today there are 5 out of 10 use cases of this kind; in the foreseeable future, it will be a 10 out of 10. So, the evolution and a major adoption of the distributed databases is in progress -- the relational world goes distributed. In result, it's getting time-consuming for both Ignite SQL maintainers (and experts who help to tune it for production usage) to carry on by having a dependency on H2.
Ignite community is willing to work on a prototype of alternate SQL engines and selected Apache Calcite as the first candidate.
Intro to Apache Calcite:
Below you can see a list of the limitations of the current SQL engine outlined technically:
SELECT t1.a, t2.b FROM t1, t2 WHERE t1.id = t2.id
- silently returns wrong results in case of t1 and t2 are not co-locatedSELECT * FROM person p WHERE p.salary > (SELECT avg(salary) from person)
- doesn’t work at all because it cannot be executed in two stepsClassical database query execution is done using roughly the following steps [1], [2]
The key point in the aforementioned plan is to have a relational algebra execution plan which can undergo arbitrary equivalence transformations by the rules of relational algebra. There is a well-studied optimization approach used in many production systems [3] for optimizing query execution plans.
The idea of this IEP is to introduce all missing intermediate steps into the query execution flow and operate on query execution graph.
This graph consists of all execution steps (relational operators) like join, filter, sort, etc. The execution graph may be transformed into another one saving query semantic during query optimisation process using relational algebra and transformation rules. After transformation the graph is split into a set of dependent subgraphs where an output of dependent subgraph is an aggregated input of depending one (in other words we have more than one map-reduce steps).
Optimized query execution graph (or query plan) used to build final execution tasks.
SELECT t1.name, t2.name as projectName FROM Persons t1, Projects t2 where t1.id = t2.responsiblePersonId
Let's assume there is no collocation and the data placed on different nodes.
Project (t1.name name, t2.name projectName)
Join (t1.id == t2.responsiblePersonId)
Scan (Persons t1)
Scan (Projects t2)
Exchange (SINGLE) // collecting
Project (t1.name name, t2.name projectName)
Join (t1.id == t2.id)
Exchange (HASH t1.id) // repartitioning
Project (t1.name name, t1.id id)
Scan (Persons t1)
Exchange (HASH t2.id) // repartitioning
Project (t2.name name, t2.responsiblePersonId id)
Scan (Projects t2)
1) Executes on a client node:
Receive (id = 1)
2) Executes on an aggregator/aggregators node(s):
Send (targetId = 1)
Project (t1.name name, t2.name projectName)
Join (t1.id == t2.responsiblePersonId)
Receive (id = 2 t1)
Receive (id = 3 t2)
3) Executes on data nodes:
Send (targetId = 2)
Project (t1.name name, t1.id id)
Scan (Persons t1)
4) Executes on data nodes:
Send (targetId = 3)
Project (t2.name name, t2.responsiblePersonId id)
Scan (Projects t2)
Each node may have several roles, intermediate tasks count is unlimited, there may be any count of subsequent joins or sub-selects.
Apache Calcite library is going to be responsible for execution graph building/optimizing.
There are several example of successful Calcite integrations (Apache Drill, Apache Flink, Hive, etc)
Generate the same execution plan as H2 for commonly used queries (co-located queries) - only two phases, this means there is no intermediate local task having a Sender on top of execution sub-graph and a Receiver at the bottom for such query (except cases when such behavior is forced by hints - it's helpful to delegate results aggregation to server nodes in case a requesting client have a little free memory).
The list may be increased.
Apache Calcite uses graph based representation of relational operators, each node has node specific meta information (join type, projection columns, sort operation direction) and general outgoing data properties (traits in terms of Calcite). Each node may have limited count of trait types (equal to count of registered Trait definitions for a planner). Trait types are defined for whole graph before its construction. Calcite framework transforms nodes (node specific properties), data traits and even nodes position based on initial nodes meta information, data traits and relative position.
Example transformation (filter push down + projection recalculation):
In scope of the transformation we optimized joined rows count (by putting filter right after scan) and memory usage (cutting unused row columns with additional projection nodes)
In addition to node outgoing data traits, there is a special trait - Convention, it describes an execution approach (how the node produces outgoing data). There are two types of convention: logical - the graph is used for optimization related transformations only, nodes do not produce any data and only show the execution plan, and physical - the graph is used for execution task/tasks building (such graph, for example, may be cast to a cursor type and returned to a user).
Initially Calcite was a planning framework only, it knew nothing about distribution (and was used for non-distributed databases only) and required execution layer implementing. At now it has a number of distribution related rules/converters and may transform an execution graph into an iterator.
As Apache Drill guys did, we are going to introduce
Ignite logical convention (our own implementations for each type of relational nodes, which cannot be executed itself) is needed for next purposes:
Ignite physical convention is needed to have full control on what happens at execution time (is needed to have memory usage control, intermediate results swapping, execution time optimizations (like code generation, batching), concurrency level control, etc).
Intermediate serializable execution graph representation is needed to send an execution plan to actual executor, since a graph is not serializable, has links to a context and a planner it cannot be serialized without an additional transformation.
Ignite distribution trait definition is needed to make right traits conversion. We need an Ignite specific implementation instead of Calcite abstract one because we need some extra data in distribution trait to calculate target nodes list (remote executors list). Also we need a control on how a particular node distribution calculates, moreover, in Calcite terms HASH distribution isn't the same as a HASH distribution in terms of Ignite. So, most of calcite code may be reused, but some additional Ignite specific logic is needed to make distributions work in a proper way.
The execution flow will look like:
Logical to physical convention transformation example:
Lets assume we build a physical graph for the previously transformed logical one (the graph showed above), also lets assume the left table is replicated and the right one is partitioned, the data is co-located and only two phases is needed, after transformation the execution graph becomes:
Here we see an additional relational operator - Single exchange, it says what everything under the exchange block executes on all data nodes, the result is collected on a client node. The execution is the same as current H2 one.
But having both tables partitioned and non-colocated, the final execution will look like:
Here we see two additional blocks, representing a repartitioning operation. Technically Ignite cannot execute such query now because it requires an additional aggregation step.
In step diagram you may see a physical plan optimization step. Distribution is a physical trait and appears in physical plan, despite the fact the plan is already optimized, we can apply several rules to the plan to optimize exchanges. For example one of tables may have small count of rows and it may be cheaper to send all these rows to nodes, containing partitions of huge table, this way one exchange disappears and another one becomes a Broadcast exchange:
The main issue is the new Calcite based engine (the engine) is completely different to current one. At first the engine will available via internal API. We need really good test coverage to make sure the feature works as expected in all possible scenarios.
http://apache-ignite-developers.2346864.n4.nabble.com/New-SQL-execution-engine-td43724.html
[1] https://arxiv.org/pdf/1802.10233.pdf
[2] https://www.youtube.com/playlist?list=PLSE8ODhjZXja7K1hjZ01UTVDnGQdx5v5U
[3] https://cs.uwaterloo.ca/~david/cs848/volcano.pdf
Apache Calcite-powered SQL Engine Roadmap
- IGNITE-12248Getting issue details... STATUS
Tickets caused by H2 limitations:
1 Comment
Evgeny Stanilovsky
https://www.slideshare.net/ssuser9ebf46/the-volcanocascades-optimizer more interesting slides.