| ID | IEP-37 |
| Authors | |
| Sponsor | |
| Created | 06 Sep 2019 |
| Status | DRAFT |
The current SQL engine has a number of critical limitations. For example:

SELECT t1.a, t2.b FROM t1, t2 WHERE t1.id = t2.id

- silently returns wrong results if t1 and t2 are not co-located;

SELECT * FROM person p WHERE p.salary > (SELECT avg(salary) FROM person)

- does not work at all because it cannot be executed in two steps.

Classical database query execution is done in roughly the following steps [1], [2].
The key point of the flow above is to have a relational algebra execution plan that can undergo arbitrary equivalence-preserving transformations under the rules of relational algebra. There is a well-studied optimization approach, used in many production systems [3], for optimizing query execution plans.
The idea of this IEP is to introduce all the missing intermediate steps into the query execution flow and to operate on a query execution graph.
This graph consists of all execution steps (relational operators) such as join, filter, sort, etc. During query optimization the graph may be transformed into an equivalent one that preserves query semantics, using relational algebra and transformation rules. After transformation the graph is split into a set of dependent subgraphs, where the output of one subgraph is the aggregated input of the subgraph that depends on it (in other words, there is more than one map-reduce step).
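As a rough illustration of such a semantics-preserving transformation (this is not Ignite or Calcite code; all class and method names below are hypothetical), a rewrite rule over an operator tree can be modeled like this — here a filter is pushed below a projection, which is valid when the projection only renames or reorders columns:

```java
import java.util.List;

// Minimal operator-tree sketch; all names are illustrative.
abstract class RelOp {
    final List<RelOp> inputs;
    RelOp(RelOp... inputs) { this.inputs = List.of(inputs); }
    String name() { return getClass().getSimpleName(); }
    // Renders the tree as e.g. "Filter(Project(Scan))" for inspection.
    String render() {
        if (inputs.isEmpty()) return name();
        StringBuilder sb = new StringBuilder(name()).append('(');
        for (int i = 0; i < inputs.size(); i++) {
            if (i > 0) sb.append(", ");
            sb.append(inputs.get(i).render());
        }
        return sb.append(')').toString();
    }
}
class Scan extends RelOp {}
class Project extends RelOp { Project(RelOp in) { super(in); } }
class Filter extends RelOp { Filter(RelOp in) { super(in); } }

public class RewriteDemo {
    // Equivalence rule: Filter(Project(x)) -> Project(Filter(x)).
    // Same semantics when the projection only renames columns, but the
    // filter now runs earlier and reduces the rows the projection sees.
    static RelOp pushFilterBelowProject(RelOp node) {
        if (node instanceof Filter && node.inputs.get(0) instanceof Project) {
            RelOp below = node.inputs.get(0).inputs.get(0);
            return new Project(new Filter(pushFilterBelowProject(below)));
        }
        return node;
    }

    public static void main(String[] args) {
        RelOp plan = new Filter(new Project(new Scan()));
        System.out.println(pushFilterBelowProject(plan).render());
        // prints: Project(Filter(Scan))
    }
}
```

A real optimizer applies a whole library of such rules and uses a cost model to pick among the resulting equivalent plans.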
The optimized query execution graph (or query plan) is used to build the final execution tasks.
SELECT t1.name, t2.name AS projectName FROM Persons t1, Projects t2 WHERE t1.id = t2.responsiblePersonId
Let's assume there is no co-location and the data is placed on different nodes.
The initial logical plan:

Project (t1.name name, t2.name projectName)
  Join (t1.id == t2.responsiblePersonId)
    Scan (Persons t1)
    Scan (Projects t2)
After repartitioning exchanges are injected, the distributed plan becomes:

Exchange (SINGLE) // collecting
  Project (t1.name name, t2.name projectName)
    Join (t1.id == t2.id)
      Exchange (HASH t1.id) // repartitioning
        Project (t1.name name, t1.id id)
          Scan (Persons t1)
      Exchange (HASH t2.id) // repartitioning
        Project (t2.name name, t2.responsiblePersonId id)
          Scan (Projects t2)
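The HASH exchange can be sketched as follows (illustrative code, not the engine's; the class and method names are hypothetical). The property it relies on is that equal join keys always map to the same target node, so after repartitioning the join can run locally on each node:

```java
import java.util.Objects;

// Sketch of HASH repartitioning: every node hashes each row's join key
// to pick a destination node, so rows with equal keys from both tables
// end up on the same node. All names here are illustrative.
public class HashExchangeDemo {
    static int targetNode(Object joinKey, int nodeCount) {
        // Math.floorMod keeps the result non-negative even when
        // hashCode() is negative.
        return Math.floorMod(Objects.hashCode(joinKey), nodeCount);
    }

    public static void main(String[] args) {
        int nodes = 4;
        // A Persons row with id = 42 and a Projects row with
        // responsiblePersonId = 42 are routed to the same node,
        // regardless of which node originally stored them.
        System.out.println(targetNode(42, nodes) == targetNode(42, nodes));
        // prints: true
    }
}
```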
The plan is then split into fragments:

1) Executes on the client node:

Receive (id = 1)

2) Executes on the aggregator node(s):

Send (targetId = 1)
  Project (t1.name name, t2.name projectName)
    Join (t1.id == t2.responsiblePersonId)
      Receive (id = 2, t1)
      Receive (id = 3, t2)

3) Executes on data nodes:

Send (targetId = 2)
  Project (t1.name name, t1.id id)
    Scan (Persons t1)

4) Executes on data nodes:

Send (targetId = 3)
  Project (t2.name name, t2.responsiblePersonId id)
    Scan (Projects t2)
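The Send/Receive pairing in these fragments can be modeled as a mailbox per exchange id: a Send with targetId = N delivers rows to the Receive whose id = N, possibly on another node. The sketch below is a single-process toy with hypothetical names; the real engine streams batches over the network rather than through in-memory queues:

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Toy model of Send/Receive wiring: one queue per exchange id stands in
// for the network channel between fragments. All names are illustrative.
public class ExchangeWiringDemo {
    static final Map<Integer, BlockingQueue<String>> inboxes = new ConcurrentHashMap<>();

    // Executed by the sending fragment (e.g. Send (targetId = 2)).
    static void send(int targetId, String row) {
        inboxes.computeIfAbsent(targetId, k -> new LinkedBlockingQueue<>()).add(row);
    }

    // Executed by the consuming fragment (e.g. Receive (id = 2)).
    static String receive(int id) throws InterruptedException {
        return inboxes.computeIfAbsent(id, k -> new LinkedBlockingQueue<>()).take();
    }

    public static void main(String[] args) throws Exception {
        // Fragment 3 (data node) sends a projected Persons row to
        // Receive (id = 2), which feeds the join in fragment 2.
        send(2, "Person{id=42, name=Alice}");
        System.out.println(receive(2));
        // prints: Person{id=42, name=Alice}
    }
}
```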
Each node may play several roles; the number of intermediate tasks is unbounded, so there may be any number of subsequent joins or sub-selects.
The Apache Calcite library is going to be responsible for building and optimizing the execution graph. There are several examples of successful Calcite integrations (Apache Drill, Apache Flink, Apache Hive, etc.).
For commonly used (co-located) queries the engine has to generate the same execution plan as H2: only two phases, meaning there is no intermediate local task with a Send on top of its execution sub-graph and a Receive at the bottom. The exception is when such behavior is forced by hints; this is helpful for delegating result aggregation to server nodes when the requesting client has little free memory.
This list may grow over time.
The main issue is that the new Calcite-based engine (hereafter, the engine) is completely different from the current one. At first, the engine will be available only via an internal API. We need really good test coverage to make sure the feature works as expected in all possible scenarios.
http://apache-ignite-developers.2346864.n4.nabble.com/New-SQL-execution-engine-td43724.html
[1] https://arxiv.org/pdf/1802.10233.pdf
[2] https://www.youtube.com/playlist?list=PLSE8ODhjZXja7K1hjZ01UTVDnGQdx5v5U
[3] https://cs.uwaterloo.ca/~david/cs848/volcano.pdf