DISCLAIMER: I (Jungtaek Lim) have been working on improving Storm SQL for some weeks and have both short-term and long-term plans for it, so this analysis could be biased in favor of Storm SQL. Please feel free to comment if you find points that seem biased and want to challenge them.

Plan A: integrate SQE into Storm SQL

Overall

SQE provides useful operators and features, and many of them are relatively easy to apply to Storm SQL; adopting them would take a few days. Some features, including Avro schema support, MapState, and replay filtering, require modifying the core of Storm SQL, but before doing that we need to discuss their worth for general use cases.

From the community's point of view, a few contributors have started contributing or at least shown interest in Storm SQL. If we accelerate the development of Storm SQL with proper documentation and releases, I expect we will get more early adopters and contributors.

From JW Player's point of view, they need to overcome some learning curves and also wait for the merging process to complete before adopting it in production. While the merge is in progress they can still use SQE in production, so this is not a big deal, but it may cause JW Player to lose interest in contributing, which we should be aware of.

Pros.

  • We can keep Storm SQL's design advantage: flexible data sources.

  • The UX of Storm SQL is not affected by the merge.

  • Apart from some features which may need discussion, we can adopt SQE's features very easily.

  • Storm SQL has already addressed many of SQE's planned features, and those not yet addressed are relatively easy to address thanks to Calcite.

  • From the community's point of view, some contributors have started contributing or at least shown interest in Storm SQL.

    • There are open issues and epics that lay out the milestones for the near future, so we are good to go without any significant changes to the milestones.

Cons.

  • It might take some effort to adopt the SQE features that require modifying core Storm SQL.

  • From JW Player's point of view, they would need to become familiar with the Storm SQL codebase before starting to contribute. This adds a learning curve on top of the Calcite learning curve, and they may lose interest.

  • JW Player cannot replace SQE with Storm SQL in production until the merging process is completed. The merge could take anywhere from a few days to several weeks after IP clearance.

  • No one is using Storm SQL in production, whereas SQE has been running in production at JW Player.

Plan B: integrate Storm SQL into SQE

Overall

SQE's greatest strength is that it is running in production at JW Player, so it can be treated as stable. SQE also supports various expressions, which seem to be a sufficient set for JW Player's use cases. However, in several places I felt SQE needs some work to become open-source ready and to handle more general use cases. Furthermore, the effort needed to replace its SQL-like interface with a SQL interface via Calcite is hard to estimate.

From the community's point of view, SQE is a completely new codebase, so potential contributors need to learn it before contributing. It also halts the current development flow and milestone while we wait for IP clearance to finish, and requires setting up a new milestone based on the result of the merge.

From JW Player's point of view, this approach reduces the learning curve and enables iterative development within their current dev/production environment. JW Player does not need to wait until the merging process is done. So this maximizes JW Player's interest in contributing, but they still need to become familiar with Calcite anyway, and might need to learn the Storm SQL codebase since the SQL interface is likely to be adopted from there.

Pros.

  • SQE is running in production at JW Player.

  • It could maximize JW Player's interest in contributing and ease their learning curve. They still need to learn Calcite, but not the whole Storm SQL codebase.

    • This assumes minimal effort is needed for SQE to gain a SQL interface via Calcite. If it requires great effort and huge code changes, it will be another learning curve for them.

  • JW Player can adopt SQE changes at each iteration of the merge/development phase, which could be helpful.

  • There are some great features in SQE (like Avro schema support) which we can enjoy right after IP clearance is done.

Cons.

  • SQE seems to be designed as an in-house solution. It is not flexible about data sources, and it seems to require custom versions of storm-redis and storm-kafka.

    • It might take some time to become open-source ready. (This is similar to why Heron is still in beta even though Twitter uses it in production.)

  • The effort needed to replace the SQL-like interface with a SQL interface via Calcite is hard to estimate.

    • Most of the code in storm-sql-core exists purely to support the SQL interface via Calcite.

    • Even if we drop Storm SQL's 'standalone' mode, a huge part of the storm-sql-core code would still be needed.

  • When we add the SQL interface via Calcite, the SQL-like JSON interface will be removed, and JW Player will need to put in some effort to convert their queries anyway.

  • SQE supports only the limited set of expressions that JW Player needs for their use cases.

  • Storm SQL contributors need to overcome the learning curve of SQE.

  • We can't continue putting effort into this area until IP clearance is done.

Appendix A. Analysis of the effort for integrating SQE into Storm SQL (in detail)

A. Already supported by Calcite but not yet exposed in Storm SQL

Calcite provides implementations of various functions, and it would be fairly easy to call them. In these cases we don't need to adopt the SQE implementation and can just use the Calcite implementation instead.


EvaluateRegularExpression

  • Calcite provides implementations of ‘LIKE’ / ‘SIMILAR TO’.

  • ‘SIMILAR TO’ handles regular expressions, so we can just use this expression.

GetIf

  • It works like the Elvis operator, but the predicate should be just a boolean value.

  • It would be ideal to implement ‘CASE’ instead of using this; see the sketch below.
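
Since both operators map onto standard SQL that Calcite already parses, a couple of illustrative queries show what users would write instead of EvaluateRegularExpression and GetIf. The table and column names below are invented for illustration only:

```java
// Hypothetical Storm SQL statements showing the Calcite-provided replacements:
//   EvaluateRegularExpression -> SIMILAR TO
//   GetIf                     -> CASE WHEN ... THEN ... ELSE ... END
public final class SectionAQueryExamples {
    // regex-style filtering via SIMILAR TO
    static final String SIMILAR_TO_QUERY =
        "SELECT ID, URL FROM EVENTS WHERE URL SIMILAR TO '%(video|audio)%'";

    // conditional value selection via CASE instead of GetIf
    static final String CASE_QUERY =
        "SELECT ID, CASE WHEN BYTES > 1048576 THEN 'large' ELSE 'small' END AS SIZE_CLASS "
        + "FROM EVENTS";
}
```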

B. Easy to adopt

Scalar functions are easy to adopt. Most of the functions in SQE are already supported by Storm SQL or Calcite.

The functions below are in neither Storm SQL nor Calcite, so it would be great to adopt them.


Hash

  • supports only Murmur2

  • can be added to StormSqlFunctions (scalar function)

  • The function name should be made clearer in our implementation

    • might be better to use MURMUR2 as the function name? (see the sketch below)
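
A minimal sketch of such a scalar function, assuming we delegate to Kafka's Murmur2 implementation (kafka-clients on the classpath). SQE's Hash operator may use a different seed or output width, so this is illustrative only:

```java
import java.nio.charset.StandardCharsets;

import org.apache.kafka.common.utils.Utils;

// Sketch: a MURMUR2 scalar function that could be added to StormSqlFunctions.
// Assumption: delegating to Kafka's Murmur2; not guaranteed to be
// bit-compatible with SQE's Hash operator.
public class HashFunctions {
    public static int murmur2(String input) {
        return Utils.murmur2(input.getBytes(StandardCharsets.UTF_8));
    }
}
```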

FormatDate

  • Formats the date value according to the format string.

  • MySQL expr.: DATE_FORMAT(date, date_format)

  • can be added to StormSqlFunctions (scalar function)

  • might be better to use DATE_FORMAT as the function name? (see the sketch below)
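
A minimal sketch of a DATE_FORMAT-style scalar function, assuming Java SimpleDateFormat patterns (e.g. "yyyy-MM-dd") rather than MySQL's %-style specifiers:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

// Sketch of a DATE_FORMAT-style scalar function for StormSqlFunctions.
// Assumption: `format` uses Java SimpleDateFormat patterns, not MySQL's
// "%Y-%m-%d" specifiers.
public class DateFormatFunction {
    public static String dateFormat(Date date, String format) {
        return new SimpleDateFormat(format).format(date);
    }
}
```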

ParseDate

  • Parses the string into a Date according to the format string and returns it.

  • MySQL expr.: STR_TO_DATE(str_date, date_format)

  • GetTime can be implemented with this and UNIX_TIMESTAMP.

  • Might be better to use STR_TO_DATE as the function name? (see the sketch below)
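
A corresponding STR_TO_DATE sketch under the same assumption (Java date patterns):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Sketch of a STR_TO_DATE-style scalar function (Java date patterns assumed).
public class ParseDateFunction {
    public static Date strToDate(String value, String format) {
        try {
            return new SimpleDateFormat(format).parse(value);
        } catch (ParseException e) {
            return null; // NULL on parse failure, mirroring MySQL's behavior
        }
    }
}
```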

GetTime

  • Parses the string into a Date according to the format string, and returns the timestamp (long).

  • MySQL expr.: UNIX_TIMESTAMP(STR_TO_DATE(str_date, date_format))

  • can be added to StormSqlFunctions (scalar function)

  • can also be implemented with STR_TO_DATE and UNIX_TIMESTAMP.

  • might be better to also support UNIX_TIMESTAMP? (see the sketch below)
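
A sketch of UNIX_TIMESTAMP plus the GetTime composition, assuming seconds since the epoch as in MySQL (SQE's GetTime may return milliseconds instead):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Sketch of UNIX_TIMESTAMP and a GetTime-style composition.
public class TimeFunctions {
    public static long unixTimestamp(Date date) {
        return date.getTime() / 1000L; // seconds since the epoch
    }

    // Equivalent of UNIX_TIMESTAMP(STR_TO_DATE(strDate, format))
    public static long getTime(String strDate, String format) throws ParseException {
        return unixTimestamp(new SimpleDateFormat(format).parse(strDate));
    }
}
```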

RoundDate

  • Converts a Date into a UNIX timestamp, rounds it down by an amount of a time unit (for example, 15 minutes), and converts it back to a Date

  • MySQL expr.: FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(date) / (amount * <unit to secs>)) * (amount * <unit to secs>))

    • there might be an easier way to do it

  • can be added to StormSqlFunctions (scalar function)

  • might be better to also support FROM_UNIXTIME? (see the sketch below)
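
A sketch of the rounding arithmetic, with amount and unitInSecs as assumed parameter names (e.g. amount = 15, unitInSecs = 60 for 15-minute buckets):

```java
import java.util.Date;

// Sketch of RoundDate: floor a Date down to a bucket of `amount` time units.
public class RoundDateFunction {
    public static Date roundDate(Date date, long amount, long unitInSecs) {
        long bucketSecs = amount * unitInSecs;
        long epochSecs = date.getTime() / 1000L;
        long flooredSecs = (epochSecs / bucketSecs) * bucketSecs; // FLOOR(ts / bucket) * bucket
        return new Date(flooredSecs * 1000L);
    }
}
```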

C. Need to modify Storm SQL to adopt

Support Avro schema

  • needs to define a table whose fields are the fields of the Avro schema.

    • First of all, how do we define the Avro schema and pass it to the SQL runner / topology?

  • seems not hard to do by adopting GetTupleFromAvro (see the sketch below).
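
A rough sketch of a GetTupleFromAvro-style decoder, assuming binary-encoded Avro records and the org.apache.storm.tuple.Values class; how the schema string actually reaches the data source is exactly the open question above:

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;
import org.apache.storm.tuple.Values;

// Sketch: deserialize an Avro-encoded payload and flatten the schema's fields
// into tuple values, so a table can be declared with the Avro field names.
public class AvroToValues {
    private final Schema schema;
    private final GenericDatumReader<GenericRecord> reader;

    public AvroToValues(String schemaJson) {
        this.schema = new Schema.Parser().parse(schemaJson);
        this.reader = new GenericDatumReader<>(schema);
    }

    public Values decode(byte[] payload) throws IOException {
        GenericRecord record = reader.read(null, DecoderFactory.get().binaryDecoder(payload, null));
        Values values = new Values();
        for (Schema.Field field : schema.getFields()) {
            values.add(record.get(field.name()));
        }
        return values;
    }
}
```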

D. Need some thoughts / ideas to adopt

Stateful, exactly-once aggregation via MapState

  • Normally a query ends up producing rows, with or without aggregated results, from the last Projection stage.

    • it doesn't end up as a grouped stream even when we use a grouping query.

  • SQE seems to re-group rows by key fields (are they the grouping keys?) just before the insert, if aggregation has been applied during the query.

    • Storm SQL might be able to do it in a similar way, and it would not be very hard. I'm just not clear on how it works and why we need it (see the sketch below).
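
For reference, this is roughly what the underlying Trident pattern looks like: group by the key fields and persist the aggregate into a MapState so replayed batches update it exactly once. This is plain Trident with Storm 1.x package names assumed, not SQE's actual code:

```java
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.testing.FixedBatchSpout;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

// Sketch of stateful, exactly-once aggregation via a MapState in Trident.
public class GroupedMapStateSketch {
    public static TridentTopology build() {
        FixedBatchSpout spout = new FixedBatchSpout(new Fields("word"), 3,
                new Values("a"), new Values("b"), new Values("a"));
        spout.setCycle(true);

        TridentTopology topology = new TridentTopology();
        topology.newStream("words", spout)
                .groupBy(new Fields("word"))                       // re-group by key fields
                .persistentAggregate(new MemoryMapState.Factory(), // exactly-once MapState
                        new Count(), new Fields("count"));
        return topology;
    }
}
```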

Replay Filtering

  • https://github.com/jwplayer/SQE/wiki/replay-filtering

  • FilteredTridentKafkaSpout

  • If we can add StreamMetadata to tuples by changing only the data source, I'm OK with adopting it.

  • If we would have to modify the core of Storm SQL to inject StreamMetadata into tuples, we should weigh this idea more carefully.

  • StreamMetadata contains the pid (a hash of the data source information), the partition, and the offset (see the sketch below).

  • All data sources would need to include that information for every tuple.
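
A sketch of StreamMetadata as a plain value class a data source would attach to every tuple; field types are assumptions, and SQE's actual class may differ:

```java
// Sketch of the StreamMetadata described above.
public class StreamMetadata {
    private final int pid;        // hash of the data source information
    private final int partition;  // e.g. Kafka partition id
    private final long offset;    // e.g. Kafka offset within the partition

    public StreamMetadata(int pid, int partition, long offset) {
        this.pid = pid;
        this.partition = partition;
        this.offset = offset;
    }

    public int getPid() { return pid; }
    public int getPartition() { return partition; }
    public long getOffset() { return offset; }
}
```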

Parallelism Hint

  • SQE uses a global static parallelism option, which seems easy to adopt.

  • Thinking about it more, it would be better to receive the max parallelism from the data source (as a kind of data source metadata); see the sketch below.

    • For example, for a Kafka spout it could be the partition count.
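
A sketch of that idea; ParallelismAwareDataSource and maxParallelism() are hypothetical names, not an existing Storm SQL interface:

```java
// Sketch: instead of one global static parallelism setting, let each data
// source report the maximum useful parallelism as metadata.
public interface ParallelismAwareDataSource {
    // e.g. a Kafka-backed source would return its topic's partition count
    int maxParallelism();
}
```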

ExpandKeys

  • I don't think it makes sense to expand one row into N rows in SQL.

ExpandValues

  • I don't think it makes sense to expand one row into N rows in SQL.

E. Need some explanation from JW Player (mainly how these work and their use cases)

Hll(p)Aggregator

GetCardinalityEstimation

KeyType of KafkaStateAdapter (field, messagehash, streammetadata)
