...
The SQL is the most popular API of Apache Flink user that connects so many other engines(i.e. Apache Hive, MySQL, PostgreSQL). Flink would have better compatibility with sql query to the underlying engines it adapters to if it has the built-in support for implicit type coercion.
...
Use Cases of Type Coercion:
...
- the inter-operation within builtin SQL operators, like binary arithmetic operations (PLUS, MINOUS and so on), binary comparison, IN, CASE WHEN, IF, SET operator (UNION, INTERSECT ..)
- the built in functions type coercion for input parameters
- the SQL INSERT source that target connector row type when are are row elements type mismatch
- the compatibility with other popular DB engines, like MySQL/Hive/PostgreSQL and so on
...
Popular DB Type Conversion Rules
...
Proposed Design for Flink SQL
The proposed design is a pluggable process during sql-to-rel conversion (E.G. AST node to relational expression).
The work follow describes below mainly happens in Apache Calcite code.
How the Type Coercion Works
The validator will check the operands/return types of all kinds of operators:
...
...
For some cases, although the validation succeed, we still need the type coercion, i.e. for expression 1 > '1', Calcite will just return false without type coercion, we do type coercion eagerly here and the expression would be coerced to 1 > cast('1' as int) whose result evaluates true.
The graph below illustrates how we coerce the operand types of a SqlCall:
...
The graph below illustrates how we coerce the operand type of a SqlNode with struct type:
Strategies for Finding Common Type
...
SQL Contexts | Expression subtype | Strategies |
Set Operation | union/except/intersect | compare the data type of each branch row to find the common type of each fields pair |
Arithmetic Expression | binary arithmetic: [+, -, &, |, ^, /, %, pmod] | 1. promote string operand to data type of the other numeric operand; 2. two strings would all be coerced to DECIMAL. |
binary comparison: [=, <, <=, >, >=, <>] | 1. promote string and timestamp to timestamp; 2. make 1=true and 0=false always evaluates true; 3. find common type for both operands if there is numeric type operand. | |
IN Expression | with subquery | compare type of LHS and RHS, find the common type, if it is struct type, find wider type for every field |
without subquery | if RHS is a expr list, compare every expr to find the wider type | |
Special AGG Function | promote string all to decimal type | |
Case When Expression | case when expression | find then and else operands common wider type |
Colesce FUNC | same as case when | |
[Date|Timestamp|String] +/- interval | promote string to timestamp | |
Function with Expected Inputs Type | builtin functions | look up the families registered in the operand type checker, find the family default type if rule allows it |
UDF/UDAF | try to coerce based on the argument operands types of eval() func |
Apache Flink SQL follows the behaviors of PostgreSQL and MS-SQL mainly, because their rules are more in line with the SQL standard.
For example, for binary arithmetic with strings, we coerce the string operand to the type of the other numeric operand: '9' / 2 (INT), we coerce '9' to int type, and the result type is also INT. While MySQL and Oracle would coerce all the string operands to DOUBLE.
Type Conversion Matrix
The table below illustrates the implicit type coercion rules for all kinds of engines:
From-To | boolean | tinyint | smallint | int | bigint | decimal | float/real | double | interval | date | time | timestamp | [var]char | [var]binary |
boolean | s | |||||||||||||
tinyint | m s | m s | m s | m s | m s | m s | m | m s | m | |||||
smallint | m s | m s | m s | m s | m s | m s | m | m s | m | |||||
int | m s | m s | m s | m s | m s | m s | m | m s | m | |||||
bigint | m s | m s | m s | m s | m s | m s | o | m | m o s | m | ||||
decimal | m s | m s | m s | m s | m s | m s | m | m s | m | |||||
float/real | m s | m s | m s | m s | m s | m s | m o s | m | ||||||
double | m s | m s | m s | m s | m s | m s | m o s | m | ||||||
interval | o | o | ||||||||||||
date | s | m o s | ||||||||||||
time | m | |||||||||||||
timestamp | m | m | m | m | m | s | m s | m | ||||||
[var]char | m o s | m o s | m o s | m o s | m o s | m o s | m o s | o | m o s | m | s | s | ||
[var]binary | m | m | m | m | m | m | m s |
f: | CalciteFlink | m: MS-SQL | o: Oracle | s: Spark |
Apache Flink SQL follows the type coercion matrix of MS-SQL and Apache Spark for most of the cases(based on the fact that the conversion does not break the semantics), it also supports additional cases like the conversion between INTERVAL and VARCHAR.
See CalciteImplicitCasts for the details.
Appendix
The proposed design is a pluggable process during AST to relational expression conversion (E.G. AST node to relational expression).
The work follow describes below mainly happens in Apache Calcite code.
How the Type Coercion Works
The validator will check the operands/return types of all kinds of operators:
- If the validation passes, the validator will just cache the data type (say RelDataType) for the SqlNode it has validated;
- If the validation fails, the validator will ask for the TypeCoercion component about if we can do an implicit type coercion, if the coercion rules passes, the TypeCoercion component will replace the SqlNode with a coerced one of desired type (the node may be an operand of an operator or a column of selected row);
- Then the TypeCoercion component would update the inferred type for the casted node and the containing operator/row column type;
- If the coercion rule fails again, the validator will just throw the exception as is before.
For some cases, although the validation succeed, we still need the type coercion, i.e. for expression 1 > '1', Calcite will just return false without type coercion, we do type coercion eagerly here and the expression would be coerced to 1 > cast('1' as int) whose result evaluates true.
The graph below illustrates how we coerce the operand types of a SqlCall:
The graph below illustrates how we coerce the operand type of a SqlNode with struct type:
Reference
...