ID: IEP-37
Authors:
Sponsor:
Created: 06 Sep 2019
Status: DRAFT


Motivation

The Ignite SQL engine is used across hundreds of deployments, where it accelerates relational databases or serves as a system of record. However, these days we see that many more companies are migrating to distributed databases that speak SQL natively. With larger adoption comes a higher demand for SQL support that is comparable to RDBMSs. For instance, if a couple of years ago 1 out of 10 use cases needed support for multi-join queries, queries with subselects, or efficient memory usage, today there are 5 out of 10 use cases of this kind; in the foreseeable future it will be 10 out of 10. So, the evolution and major adoption of distributed databases is in progress -- the relational world is going distributed. As a result, it is getting time-consuming for both Ignite SQL maintainers and the experts who help tune it for production usage to carry on while depending on H2.

The Ignite community is willing to work on a prototype of an alternate SQL engine and has selected Apache Calcite as the first candidate.


Below is a list of the technical limitations of the current SQL engine:

  • Query execution is hard-coded to a primitive map-reduce flow (a SQL query is split into a 'map query' and a 'reduce query'). This flow is fundamentally incomplete, as it is impossible to execute the following queries:
    • SELECT t1.a, t2.b FROM t1, t2 WHERE t1.id = t2.id - silently returns wrong results if t1 and t2 are not co-located
    • SELECT * FROM person p WHERE p.salary > (SELECT avg(salary) FROM person) - doesn't work at all because it cannot be executed in two steps
    • IGNITE-11448
  • Query execution is done using the H2 query engine, which itself has several issues:
    • Little control over how a query executes internally; as a result, we have limited ability to implement improvements/fixes (H2 is outside of the ASF)
    • H2 is a local database and has no notion of distributed query execution
    • H2 lacks a proper planner that takes into account both data distribution and data statistics
    • The H2 optimizer is very primitive: it can only do predicate push-down, join order selection, and some minor optimizations. It lacks many useful optimizations like the following: IGNITE-6085

Classical database query execution is done in roughly the following steps [1], [2] (a short Calcite-based sketch follows the list):

  1. SQL text is parsed and validated, producing an Abstract Syntax Tree (AST) representing the query
  2. Query rewrite (output: rewritten SQL AST) - needed, for example, to turn DML queries into selects returning the set of rows to change (queries like 'INSERT INTO table1 SELECT * FROM table2')
  3. The AST is transformed into a query plan (output: a query execution graph which is semantically identical to the original query but operates in terms of relational algebra)
  4. The query execution graph is optimized (output: an optimized query execution graph which is semantically identical to the original query)
  5. Query execution (output: a resulting cursor)
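
For illustration, here is a minimal sketch of steps 1 and 3 using Calcite's planning API (the empty schema is an assumption made to keep the example self-contained; an Ignite integration would register its own catalog):

import org.apache.calcite.plan.RelOptUtil;
import org.apache.calcite.rel.RelRoot;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.Planner;

public class PlanningSketch {
    public static void main(String[] args) throws Exception {
        // An empty root schema; a real integration would expose Ignite tables here.
        SchemaPlus rootSchema = Frameworks.createRootSchema(true);
        FrameworkConfig config = Frameworks.newConfigBuilder()
            .defaultSchema(rootSchema)
            .build();
        Planner planner = Frameworks.getPlanner(config);

        // Step 1: parse and validate the SQL text, producing an AST (SqlNode).
        SqlNode ast = planner.parse("SELECT 1 + 1");
        SqlNode validated = planner.validate(ast);

        // Step 3: transform the AST into a relational algebra graph (RelNode tree).
        RelRoot root = planner.rel(validated);
        System.out.println(RelOptUtil.toString(root.rel));
    }
}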

The key point of the aforementioned flow is having a relational algebra execution plan that can undergo arbitrary equivalence transformations under the rules of relational algebra. There is a well-studied optimization approach, used in many production systems [3], for optimizing query execution plans this way.

Description

The idea of this IEP is to introduce all the missing intermediate steps into the query execution flow and to operate on a query execution graph.

This graph consists of all execution steps (relational operators) such as join, filter, sort, etc. During the optimization process the execution graph may be transformed into another, semantically identical one using relational algebra and transformation rules. After the transformation, the graph is split into a set of dependent subgraphs, where the output of one subgraph is an aggregated input of the subgraph that depends on it (in other words, we may have more than one map-reduce step).

The optimized query execution graph (or query plan) is used to build the final execution tasks.

Example:


Initial query:

SELECT t1.name, t2.name AS projectName FROM Persons t1, Projects t2 WHERE t1.id = t2.responsiblePersonId


Let's assume there is no co-location and the data is placed on different nodes.

Initial execution graph:

Project (t1.name name, t2.name projectName)
  Join (t1.id == t2.responsiblePersonId)
    Scan (Persons t1)
    Scan (Projects t2)

Transformed graph:

Exchange (SINGLE) // collecting
  Project (t1.name name, t2.name projectName)
    Join (t1.id == t2.id)
      Exchange (HASH t1.id) // repartitioning
        Project (t1.name name, t1.id id)
          Scan (Persons t1)
      Exchange (HASH t2.id) // repartitioning
        Project (t2.name name, t2.responsiblePersonId id)
          Scan (Projects t2)
Split tasks:

1) Executes on a client node:

Receive (id = 1)

2) Executes on aggregator node(s):

Send (targetId = 1)
  Project (t1.name name, t2.name projectName)
    Join (t1.id == t2.id)
      Receive (id = 2, t1)
      Receive (id = 3, t2)

3) Executes on data nodes:

Send (targetId = 2)
  Project (t1.name name, t1.id id)
    Scan (Persons t1)

4) Executes on data nodes:

Send (targetId = 3)
  Project (t2.name name, t2.responsiblePersonId id)
    Scan (Projects t2)


Each node may play several roles; the number of intermediate tasks is unlimited, and there may be any number of subsequent joins or sub-selects.

The Apache Calcite library is going to be responsible for building and optimizing the execution graph.

There are several examples of successful Calcite integrations (Apache Drill, Apache Flink, Apache Hive, etc.).

Calcite-based SQL engine requirements

  1. Generate the same execution plan as H2 for commonly used (co-located) queries - only two phases, meaning there is no intermediate local task having a Sender on top of the execution sub-graph and a Receiver at the bottom (except when such behavior is forced by hints - delegating result aggregation to server nodes is helpful when the requesting client has little free memory).

  2. Provide the ability to execute any non-recursive non-collocated queries in a reasonable period of time.
  3. Ability to obtain the execution plan in human-readable form (explain plan).
  4. Provide memory management abilities to protect the application from OOM (lazy result retrieval, memory quotas, using disk for intermediate results, etc.).
  5. Provide SQL enhancement abilities (system functions, user-defined functions, hints, etc.).
  6. Generate an optimal execution plan for non-recursive non-collocated queries, taking into consideration two factors: a) the amount of transferred data, b) the execution complexity of each local subtask.
  7. Ability to set up a query timeout and cancel a running query manually.
  8. Leverage primary and secondary indexes.
  9. Provide extension points for future improvements (new transformation rules; support for different source data structure types - indexes and tables initially, prefix trees or spatial indexes in the future; possible column-based storage support in the future; etc.).

The list may be extended.

Implementation details

Apache Calcite uses a graph-based representation of relational operators; each node has node-specific metadata (join type, projection columns, sort direction) and general properties of its outgoing data (traits, in Calcite terms). Each node may have a limited number of trait types (equal to the number of trait definitions registered with the planner). Trait types are defined for the whole graph before its construction. The Calcite framework transforms nodes (their node-specific properties), data traits, and even node positions based on the initial nodes' metadata, data traits, and relative positions.
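
For example, this is how trait definitions are registered with Calcite's cost-based planner before any graph is constructed (the particular planner and trait definitions here are illustrative):

import org.apache.calcite.plan.ConventionTraitDef;
import org.apache.calcite.plan.volcano.VolcanoPlanner;
import org.apache.calcite.rel.RelCollationTraitDef;

public class TraitSetup {
    public static VolcanoPlanner newPlanner() {
        // The set of trait types is fixed before graph construction:
        // every node then carries one trait per registered definition.
        VolcanoPlanner planner = new VolcanoPlanner();
        planner.addRelTraitDef(ConventionTraitDef.INSTANCE);   // execution convention
        planner.addRelTraitDef(RelCollationTraitDef.INSTANCE); // output sort order
        return planner;
    }
}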

Example transformation (filter push-down + projection recalculation):

[Diagram: Nodes transformation]

Within the scope of this transformation we reduce the number of joined rows (by putting the filter right after the scan) and the memory usage (by cutting unused row columns with additional projection nodes). A sketch of how such rules are applied with Calcite follows.
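
A minimal sketch of applying such rewrite rules with Calcite's heuristic planner (the rule names assume a recent Calcite version, where planner rules live in the CoreRules class):

import org.apache.calcite.plan.hep.HepPlanner;
import org.apache.calcite.plan.hep.HepProgram;
import org.apache.calcite.plan.hep.HepProgramBuilder;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.rules.CoreRules;

public class PushDownSketch {
    public static RelNode optimize(RelNode rel) {
        HepProgram program = new HepProgramBuilder()
            .addRuleInstance(CoreRules.FILTER_INTO_JOIN)          // push the filter below the join
            .addRuleInstance(CoreRules.PROJECT_FILTER_TRANSPOSE)  // recalculate projections
            .build();

        HepPlanner planner = new HepPlanner(program);
        planner.setRoot(rel);
        return planner.findBestExp(); // a semantically identical, cheaper graph
    }
}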


In addition to the node's outgoing data traits, there is a special trait, Convention, which describes the execution approach (how the node produces its outgoing data). There are two kinds of convention: logical, where the graph is used for optimization-related transformations only (the nodes do not produce any data and only describe the execution plan), and physical, where the graph is used to build the execution task(s) (such a graph, for example, may be cast to a cursor type and returned to a user).

Initially, Calcite was a planning framework only: it knew nothing about distribution (it was used for non-distributed databases only) and required an execution layer to be implemented. Nowadays it has a number of distribution-related rules/converters and can transform an execution graph into an iterator.

As Apache Drill did, we are going to introduce:

  • two Ignite conventions: logical and physical;
  • an intermediate serializable execution graph representation;
  • a special TraitDef describing the distribution traits of a data node.

The Ignite logical convention (our own implementations of each type of relational node, which cannot be executed by themselves) is needed for the following purposes:

  1. Provide an Ignite-specific cost system.
  2. Provide methods to build an intermediate graph representation in a visitor style (widely used in Calcite).
  3. Provide methods to traverse an execution graph (needed for metadata population/extraction and for pre/post-processing of an execution tree). A sketch of such a convention follows this list.
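
A minimal sketch of how such a convention could be declared (the IgniteRel interface and the visitor type are assumptions, not existing Ignite API):

import org.apache.calcite.plan.Convention;
import org.apache.calcite.rel.RelNode;

// Hypothetical marker interface implemented by every Ignite logical rel node;
// it adds visitor-style traversal on top of the standard RelNode contract.
interface IgniteRel extends RelNode {
    <T> T accept(IgniteRelVisitor<T> visitor);
}

// Hypothetical visitor over Ignite rel nodes (one visit method per node type).
interface IgniteRelVisitor<T> {
    T visit(IgniteRel rel);
}

class IgniteConventions {
    // Nodes in this convention only describe the plan; they cannot execute themselves.
    public static final Convention IGNITE_LOGICAL =
        new Convention.Impl("IGNITE_LOGICAL", IgniteRel.class);
}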

The Ignite physical convention is needed to have full control over what happens at execution time (which is needed for memory usage control, intermediate result swapping, execution-time optimizations such as code generation and batching, concurrency level control, etc.).

The intermediate serializable execution graph representation is needed to send an execution plan to the actual executor: since a graph is not serializable and holds links to a context and a planner, it cannot be sent over the wire without an additional transformation.

The Ignite distribution trait definition is needed to perform the right trait conversions. We need an Ignite-specific implementation instead of Calcite's abstract one because we need extra data in the distribution trait to calculate the list of target nodes (the list of remote executors). We also need control over how the distribution of a particular node is calculated; moreover, a HASH distribution in Calcite terms is not the same as a HASH distribution in Ignite terms. So most of the Calcite code may be reused, but some additional Ignite-specific logic is needed to make distributions work properly. A sketch of such a trait definition follows.
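
A sketch of what an Ignite-specific distribution trait definition might look like (IgniteDistribution - an assumed RelTrait implementation - as well as IgniteDistributions and IgniteExchange are hypothetical types, not existing API):

import org.apache.calcite.plan.RelOptPlanner;
import org.apache.calcite.plan.RelTraitDef;
import org.apache.calcite.rel.RelNode;

public class DistributionTraitDef extends RelTraitDef<IgniteDistribution> {
    public static final DistributionTraitDef INSTANCE = new DistributionTraitDef();

    @Override public Class<IgniteDistribution> getTraitClass() {
        return IgniteDistribution.class;
    }

    @Override public String getSimpleName() {
        return "distribution";
    }

    @Override public IgniteDistribution getDefault() {
        return IgniteDistributions.any(); // "no particular distribution"
    }

    // Converting between distributions means inserting an exchange
    // (repartitioning) node on top of the original one.
    @Override public RelNode convert(RelOptPlanner planner, RelNode rel,
        IgniteDistribution target, boolean allowAbstractConverters) {
        return new IgniteExchange(rel.getCluster(),
            rel.getTraitSet().replace(target), rel, target);
    }

    @Override public boolean canConvert(RelOptPlanner planner,
        IgniteDistribution from, IgniteDistribution to) {
        return true; // any distribution is reachable via an exchange
    }
}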

The execution flow will look like this:

[Diagram: Execution steps]

Logical to physical convention transformation example:

Let's assume we build a physical graph for the previously transformed logical one (the graph shown above). Let's also assume the left table is replicated, the right one is partitioned, the data is co-located, and only two phases are needed. After the transformation the execution graph becomes:

[Diagram: Physical plan]

Here we see an additional relational operator, a Single exchange: it indicates that everything under the exchange block executes on all data nodes, and the result is collected on a client node. The execution is the same as the current H2 flow.

But with both tables partitioned and not co-located, the final execution will look like this:

[Diagram: Three steps query]

Here we see two additional blocks representing a repartitioning operation. Technically, Ignite cannot execute such a query now because it requires an additional aggregation step.


In the steps diagram you may see a physical plan optimization step. Distribution is a physical trait and appears in the physical plan; even though the plan is already optimized, several rules can still be applied to it to optimize exchanges. For example, one of the tables may have a small number of rows, so it may be cheaper to send all of its rows to the nodes containing partitions of the huge table; this way one exchange disappears and the other becomes a Broadcast exchange:

[Diagram: Optimized physical plan]

Expected integration steps

  1. Implement the Ignite logical convention (relational graph nodes, converter rules), so that Calcite uses Ignite's own operation costs and we control which graph variant is preferable (see the converter rule sketch after this list).
  2. Implement index scan rules - Apache Phoenix experience may be reused. Range filters, sorted scans, and some projections transform into index scans.
  3. Implement exchange-related rules (affinity aware) - Apache Drill experience may be reused. SINGLETON, RANDOM, HASH, and BROADCAST distribution types are needed.
  4. Implement the Sender/Receiver infrastructure - each Exchange is rewritten into a pair of a Receiver and a Sender, where the Receiver is a relational node and the Sender is an infrastructure object used to stream the target Exchange subgraph result to a particular remote receiver.
  5. Implement the physical convention - as a starting point we may use one of the conventions provided by Calcite (Bindable, Enumerable, Interpretable), rewriting particular relational nodes and converter/transform rules into our own implementations one by one.
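
For step 1, a converter rule might look roughly like this (IgniteFilter and the IGNITE_LOGICAL convention are the assumed types from the sketches above):

import org.apache.calcite.plan.Convention;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.convert.ConverterRule;
import org.apache.calcite.rel.logical.LogicalFilter;

public class IgniteFilterConverterRule extends ConverterRule {
    public static final IgniteFilterConverterRule INSTANCE = new IgniteFilterConverterRule();

    private IgniteFilterConverterRule() {
        // Match logical filters with no convention assigned yet and
        // convert them into the Ignite logical convention.
        super(LogicalFilter.class, Convention.NONE,
            IgniteConventions.IGNITE_LOGICAL, "IgniteFilterConverterRule");
    }

    @Override public RelNode convert(RelNode rel) {
        LogicalFilter filter = (LogicalFilter) rel;
        // Request the input in the Ignite convention as well.
        RelNode input = convert(filter.getInput(),
            filter.getInput().getTraitSet().replace(IgniteConventions.IGNITE_LOGICAL));
        // IgniteFilter is a hypothetical Ignite implementation of a filter node.
        return new IgniteFilter(rel.getCluster(),
            rel.getTraitSet().replace(IgniteConventions.IGNITE_LOGICAL),
            input, filter.getCondition());
    }
}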




Risks and Assumptions

The main issue is that the new Calcite-based engine (the engine) is completely different from the current one. At first, the engine will be available via an internal API only. We need really good test coverage to make sure the feature works as expected in all possible scenarios.

Discussion Links

http://apache-ignite-developers.2346864.n4.nabble.com/New-SQL-execution-engine-td43724.html

Reference Links

[1] https://arxiv.org/pdf/1802.10233.pdf

[2] https://www.youtube.com/playlist?list=PLSE8ODhjZXja7K1hjZ01UTVDnGQdx5v5U

[3] https://cs.uwaterloo.ca/~david/cs848/volcano.pdf

https://calcite.apache.org/

https://phoenix.apache.org/

https://drill.apache.org/

Apache Calcite-powered SQL Engine Roadmap

Tickets

IGNITE-12248

Tickets caused by H2 limitations:

[Embedded JIRA issue filter]

Related documents

Volcano/Cascades optimizer

Columbia optimizer

