The distributed version is the key next milestone after the single node version, and aims to be a high-performance, elastic query process engine.
Its It consists of three components:
- Conductor consists of the parser, the optimizer, Foreman, and Catalog, and BlockLocator.
- BlockLocator caches the location of each block, and is used for exchanging Quickstep blocks between StorageManagers.
- Ensemble, which manages a cluster of Executors, and
- Executor consists of StorageManager and Worker., Shiftboss, and a group of Workers. Shiftboss manages Workers through the TMB in-memory implementation.
It also uses Transactional Message Bus (TMB) as the communication layer between the above components, and relies on HDFS (via libhdfs3) as the persistent storage.
- DistributedCli sends a query to Conductor.
- Conductor optimizes the query, and generates a query execution plan. The execution plan consists of a number of operators, which generates a bag of WorkOrders, each of which corresponds to a Quickstep block.
- Foreman in Conductor manages the query execution, and reports the query completion to the main thread of Conductor.
- Foreman sends a serialized WorkOrder to an Executor (actually Shiftboss) based on some policy (e.g., the data locality, or the minimal loads).
- The Executor notifies Foreman Shiftboss in Executor deserializes the WorkOrder, and dispatches it to a Worker.
- Worker in Executor executes the WorkOrder, including a possible data exchange from a StorageManager peer.
- Worker in Executor notifies Shiftboss regarding the completion of the WorkOrderthe WorkOrder, and Shiftboss in Executor forwards it to Foreman.
- Go to a), unless the query completes.
- The main thread of Conductor replies to DistributedCli regarding the query completion.
- DistributedCli reads the query result, and notifies Conductor regarding the finish of reading the query result.
- Conductor deletes the query result.