LensDriver is an interface which allows developers to integrate query processing engines into Lens.
Responsibilities of a Lens Driver:
- Provide query estimates
- Execute queries - drivers need to support both blocking and non-blocking execute calls.
- HQL to * mapping - convert HQL to the query language understood by the backend engine.
For example, the JDBC driver has a columnar rewriter which converts HQL to ANSI SQL and does some optimizations required by InfoBright.
- Provide in-memory and persistent result sets
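The responsibilities above can be sketched as a small Java interface. This is only an illustration: SketchDriver and EchoDriver are hypothetical names, and the method names and signatures are simplified assumptions rather than the actual LensDriver API. The toy implementation returns a symbolic cost of zero, performs a placeholder rewrite, and builds the non-blocking call on top of the blocking one.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical, simplified driver shape; NOT the real LensDriver interface.
interface SketchDriver {
    double estimate(String hql);      // query estimate (validation + cost)
    String rewrite(String hql);       // HQL to backend query language
    List<String> execute(String hql); // blocking execute, in-memory result

    // Non-blocking execute, built here on top of the blocking call.
    default CompletableFuture<List<String>> executeAsync(String hql) {
        return CompletableFuture.supplyAsync(() -> execute(hql));
    }
}

// Toy backend that "runs" a query by echoing the rewritten text.
class EchoDriver implements SketchDriver {
    @Override
    public double estimate(String hql) {
        // Stand-in for semantic validation: reject empty queries.
        if (hql == null || hql.trim().isEmpty()) {
            throw new IllegalArgumentException("empty query");
        }
        return 0.0; // symbolic cost
    }

    @Override
    public String rewrite(String hql) {
        // Placeholder: a real driver would translate HQL here.
        return hql.trim();
    }

    @Override
    public List<String> execute(String hql) {
        estimate(hql); // validate before running
        return List.of("ran: " + rewrite(hql));
    }
}
```

A caller can then pick either style: `new EchoDriver().execute(q)` blocks until the result is available, while `executeAsync(q)` returns a future immediately.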
Adding a new driver for a new query processing engine:
When adding a new driver, consider all of the following.
- Estimate implementation - estimate should at least do semantic validation. The actual cost estimate can be symbolic. For example, we currently assume the JDBC cost to be zero and the Hive cost to be 1.
- Does the backend support HQL? If not, you will have to translate HQL to the appropriate query language.
- Does the backend support asynchronous execution of queries? If not, the driver should handle that.
- Is there a get status API?
- Result set implementation -
Is it possible to persist result set to HDFS?
Is it possible to stream the result set instead of loading entirely in Lens memory?
- What happens when the backend engine goes down while Lens is running? Is there a way to recover queries?
- Does the engine support cancelling a query? If not, what would happen to abandoned queries?
- Mapping of driver query statuses to Lens query statuses
- If Lens restarts while the backend engine is still running, is it possible to recover queries if you have a reference to the remote query?
- Do we need to maintain connections per user?
- Do we need to pool connections?
- Do we need a separate pool for estimate queries?
- Connection thread safety?
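For a backend that only offers blocking calls and has no status API, the driver can supply both itself. The following is a minimal sketch under stated assumptions: AsyncAdapter and SketchStatus are hypothetical names, and the four status values are illustrative rather than the actual Lens status enum. The blocking call runs on a thread pool, the caller gets a handle back immediately, and the state of the underlying Future is mapped to a driver-level status.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative status values; not the actual Lens enum.
enum SketchStatus { RUNNING, SUCCESSFUL, FAILED, CANCELED }

// Runs blocking backend calls on a pool and exposes handle-based status checks.
class AsyncAdapter implements AutoCloseable {
    private final ExecutorService pool = Executors.newFixedThreadPool(2);
    private final Map<Long, Future<String>> queries = new ConcurrentHashMap<>();
    private final AtomicLong nextHandle = new AtomicLong();

    // Submit a blocking call and return immediately with a handle.
    long executeAsync(Callable<String> blockingCall) {
        long handle = nextHandle.incrementAndGet();
        queries.put(handle, pool.submit(blockingCall));
        return handle;
    }

    // Map the Future's state to a status. Queued-but-not-started counts as RUNNING.
    SketchStatus status(long handle) {
        Future<String> f = lookup(handle);
        if (!f.isDone()) {
            return SketchStatus.RUNNING;
        }
        try {
            f.get(); // already done, so this does not block
            return SketchStatus.SUCCESSFUL;
        } catch (CancellationException e) {
            return SketchStatus.CANCELED;
        } catch (ExecutionException e) {
            return SketchStatus.FAILED;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return SketchStatus.FAILED;
        }
    }

    // Best-effort cancel; interrupts the worker thread if the call has started.
    boolean cancel(long handle) {
        return lookup(handle).cancel(true);
    }

    private Future<String> lookup(long handle) {
        Future<String> f = queries.get(handle);
        if (f == null) {
            throw new IllegalArgumentException("unknown handle " + handle);
        }
        return f;
    }

    @Override
    public void close() {
        pool.shutdownNow();
    }
}
```

Note that cancelling the Future only interrupts the local thread. Whether the query also stops on the backend depends on the engine, which is exactly the abandoned-query question raised in the list above.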
Driver usage in Lens:
QueryExecutionService in Lens loads drivers at initialization time. The driver lifecycle is managed by the query service.
Driver state is persisted by the query service. Not all drivers need persistent state; it depends on the query engine. If a driver needs to persist state across restarts, it should implement writeExternal and readExternal properly.
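As an illustration of the writeExternal and readExternal pair, here is a minimal sketch using the standard java.io.Externalizable interface. SketchDriverState and its remoteHandles map are hypothetical; which state a real driver needs to persist depends on its backend. The roundTrip helper stands in for a restart by writing the state to a byte array and reading it into a fresh instance.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical driver state: Lens query id -> handle of the remote query.
class SketchDriverState implements Externalizable {
    final Map<String, String> remoteHandles = new LinkedHashMap<>();

    // Externalizable expects a public no-arg constructor for readObject().
    public SketchDriverState() { }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        // Write the entry count first so the reader knows how much to consume.
        out.writeInt(remoteHandles.size());
        for (Map.Entry<String, String> e : remoteHandles.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        // Read fields back in exactly the order they were written.
        remoteHandles.clear();
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            String queryId = in.readUTF();
            remoteHandles.put(queryId, in.readUTF());
        }
    }

    // Simulates a restart: serialize to bytes, then restore into a new object.
    static SketchDriverState roundTrip(SketchDriverState before) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                before.writeExternal(out);
            }
            SketchDriverState after = new SketchDriverState();
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                after.readExternal(in);
            }
            return after;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The key property is symmetry: readExternal must consume the same fields, in the same order, that writeExternal produced. A driver with no state to recover can leave both methods empty.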
Query execution workflow:
1. Submit query to query queue
2. Query submitter thread 'takes' from the queue
- Rewrites the query for each driver
- Asks each driver for a cost estimate
- Selects the driver with the minimum cost (driver selector)
- Calls executeAsync on the selected driver
3. A background thread polls the query status from the driver. The driver sets DriverQueryStatus.
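The rewrite, estimate, and select steps of the workflow can be sketched as follows. CostedDriver, FixedCostDriver, and MinCostSelector are hypothetical names used only for this illustration, and skipping a driver whose rewrite or estimate throws is an assumption of the sketch, not a statement about how the Lens driver selector behaves.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical minimal view of a driver, just enough for selection.
interface CostedDriver {
    String name();
    String rewrite(String hql);        // rewrite the query for this driver
    double estimate(String rewritten); // cost estimate; throws if the query is invalid
}

// Toy driver with a fixed symbolic cost, e.g. 0 for JDBC and 1 for Hive.
class FixedCostDriver implements CostedDriver {
    private final String name;
    private final double cost;
    private final boolean accepts;

    FixedCostDriver(String name, double cost, boolean accepts) {
        this.name = name;
        this.cost = cost;
        this.accepts = accepts;
    }

    @Override public String name() { return name; }

    @Override public String rewrite(String hql) { return hql.trim(); }

    @Override public double estimate(String rewritten) {
        if (!accepts) {
            throw new IllegalArgumentException(name + " cannot run: " + rewritten);
        }
        return cost;
    }
}

class MinCostSelector {
    // Returns the driver with the lowest estimate, or empty if none accepts the query.
    static Optional<CostedDriver> select(List<CostedDriver> drivers, String hql) {
        CostedDriver best = null;
        double bestCost = Double.POSITIVE_INFINITY;
        for (CostedDriver d : drivers) {
            try {
                double cost = d.estimate(d.rewrite(hql));
                if (cost < bestCost) {
                    best = d;
                    bestCost = cost;
                }
            } catch (RuntimeException e) {
                // Assumption of this sketch: a driver that rejects the query is skipped.
            }
        }
        return Optional.ofNullable(best);
    }
}
```

With the symbolic costs mentioned earlier (JDBC at zero, Hive at 1), the selector picks the JDBC driver whenever it accepts the query. The selected driver would then receive the executeAsync call, and its status would be polled in the background.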