Link: Unresolved issues in storm-sql
Current milestone
Storm SQL Phase II
JIRA link
- STORM-1433Getting issue details... STATUS
Remaining works
- - STORM-1443Getting issue details... STATUS
- - STORM-1446Getting issue details... STATUS
- - STORM-2073Getting issue details... STATUS
Others (non-epic)
How this milestone will help users to play with Storm SQL?
- Users will be able to run Storm SQL Runner without copying dependencies to extlib.
- Users will be able to use GROUP BY and JOIN statement in DML statement
- Note that these feature follows the Trident semantic of aggregation and join
- Aggregation is done within batch
- SQE does stateful aggregation, so Storm SQL might follow later
- Join is done with each of batches
- But respecting Streaming SQL semantic seems correct, which means that we could get rid of these features again and wait for Streaming SQL
- This is what Flink is doing, but Flink have Table API so Flink can support window aggregation and join without Calcite support
- Users will be able to use user defined aggregate function in Trident mode
- User defined function is already supported
- Users will be able to run 'explain' to see query plan for DML statement before submitting Trident topology
- Users will be able to specify parallelism of input data source which will be used unless repartitioning is made
- Users will be able to store rows to Redis
How this milestone will improve Storm SQL internally?
- - STORM-2073Getting issue details... STATUS will reduce redundant multiple Trident steps into one
- - STORM-1446Getting issue details... STATUS will do some query optimizations, and open the way to address next works, like automatic parallelism, pushdown, and so on
-
-
STORM-2125Getting issue details...
STATUS
will enable most of functionalities what Calcite supports.
- Except aggregate functions, but we may find the way to get them
Next works
Below works can be done without waiting some other works. We might want to pick several works from each category and create next milestone.
When building milestone, it would be better to clarify the goal - "How this milestone helps the users to play with Storm SQL?".
Automatic parallelism for input data source with metadata
JIRA link
- STORM-2147Getting issue details... STATUS
Things to do (not filed to issue yet)
- Apply this to Kafka input data source (maybe handled from STORM-2147)
- Do we want to add more input data sources? Then they need to be considered as well.
Schema support on input format and output format
JIRA link
- STORM-2149Getting issue details... STATUS
Things to do (not filed to issues yet)
- CSV
- Avro
- TSV
- Schema Registry
- And more
Supports more functions (scalar and aggregation)
JIRA link
None yet
Things to do (not filed to issues yet)
- DATE / TIMESTAMP related functions
- Calcite lacks here, so may need to refer one of RDBMS - MySQL / Oracle / PostgreSQL
- Functions which SQE supports now
- And more
Expand supporting external components
JIRA link
- STORM-2075Getting issue details... STATUS
Done
- Kafka as Input / Output
- Redis as Output
Remaining works
- - STORM-2082Getting issue details... STATUS
- - STORM-2102Getting issue details... STATUS
- - STORM-2103Getting issue details... STATUS
- And more
- Any external modules which support Trident state can be candidates.
Consideration
- They should be rewritten if we replaces the backend of Storm SQL to higher-level core API
- Need to determine 'Widely used' data sources and only provides them for now
Projection / Filter pushdown to data source
JIRA link
None yet
Note
- Not sure it helps stream data source
- It may help but we should make sure that Spout supports projection (maybe only column referring) / filter
- It definitely helps with input data sources which accepts query (for example, JDBC)
Depends on other works (Future work)
Change backend of SQL to higher-level core API (get rid of Trident)
JIRA link
None yet
Precondition
- Apache Storm adopts higher-level core API
- Higher-level core API supports exactly-once
Note
- Storm SQL may go back to basic feature: no aggregation, no join, no sort
- because higher-level core API cannot support aggregation and join without window
- current Storm SQL's aggregation and join semantic are very different from Streaming SQL
Support Streaming SQL
JIRA link
None yet
Precondition
- Calcite supports Streaming SQL : https://calcite.apache.org/docs/stream.html
- For now this page states that Streaming SQL features are NOT IMPLEMENTED yet
- This still seems to be a design phase.
- Julian Hyde had various talks in various place, but in parallel he posted to Calcite dev mailing list for design doc about join recent days.
Note
- issues 'rowtime' to row automatically
- group by window
- join between stream and table (without support temporal table)
- join between stream and stream
- join between stream and table (with support temporal table or similar)