What is the Reference Interpreter?
The Reference Interpreter is a quick-and-dirty way to test your Logical Plans directly and verify the expected output for a given data set. It is helpful when you have your implementation ready and want to run your logical plan directly against sample data without depending on other parts of Drill.
Making it Work
Currently we have a logical plan (simple_plan.json) in place which applies a few transformations over the data provided in the donuts.json file. You can run the logical plan with the help of the Reference Interpreter by providing the path of the logical plan via the console. Feel free to explore the logical plan and the Reference Interpreter code to learn more about Drill.
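To make the idea concrete, a logical plan is just a JSON document describing a chain of operators. The fragment below is a hypothetical sketch of a plan with a scan, a filter, and a store; the operator names and field layout here are illustrative assumptions, not the authoritative syntax. Consult the actual simple_plan.json in the repository for the real format.

```json
{
  "head": { "type": "apache_drill_logical_plan", "version": 1 },
  "query": [
    { "op": "scan",   "ref": "donuts", "source": "donuts.json" },
    { "op": "filter", "expr": "donuts.ppu > 0.50" },
    { "op": "store",  "target": "/opt/data/out.json" }
  ]
}
```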
The broader goals of the Reference Interpreter are:
- To provide a simple way to run a Logical Plan against some sample data and get back the expected result.
- Allow work to start on the parsers while we scale up the performance and capabilities of the execution engine and optimizer.
- Allow evaluation work on particular technical approaches such as exploring the impact of hierarchical and schema less data on query evaluation.
These goals do not include performance, memory handling, or efficiency. Currently, the interpreter is a single-node, single-threaded process. This will change shortly so that it can also run as a clustered process.
The entry point is inside the /sandbox/prototype/exec/ref module:
org.apache.drill.exec.ref.ReferenceInterpreter.main(). The example program utilizes two resources, simple_plan.json and donuts.json, and outputs data to /opt/data/out.json.
Some of the things that currently 'work':
- Read/write basic json.
- ROPs (reference operators): Filter, Transform, Group, Aggregate (simple), Order, Union.
- Example aggregate and basic functions including sum, count, multiply, add, compare, equals.
Basic glossary/concepts (we'll get this on the wiki/javadocs):
- LOP: Logical Operator. An implementation agnostic data flow operator utilized by the Logical Plan.
- ROP: Reference Operator. A reference operator implementation that pairs with a LOP.
- FunctionDefinition: A definition of a particular function. Describes a set of aliases, an allowable set of input arguments and an interface that will attempt to determine output type.
- BasicEvaluator: An implementation of a particular non-aggregate expression. Receives a record pointer at creation time. Returns a DataValue.
- AggregateEvaluator: An implementation of a particular aggregating function. Is provided a record pointer at creation time. Expects regular calls to addRecord() followed by a call to eval() which provides the aggregate value.
- DataValue: A pointer to a particular data value. Implementation classes include things like ScalarLong, ScalarBytes, SimpleMapValue and SimpleArrayValue.
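The AggregateEvaluator lifecycle described above (repeated addRecord() calls followed by a single eval()) can be illustrated with a toy example. The class and method names below are invented for illustration and are not Drill's actual API; they only mirror the contract the glossary describes.

```java
// Toy illustration of the AggregateEvaluator lifecycle: addRecord() per record,
// then eval() to produce the aggregate. Names here are hypothetical; Drill's
// real evaluator interfaces live in the /sandbox/prototype/exec/ref module.
public class ToySumEvaluator {
    private long total = 0;

    // Called once per record with that record's value.
    public void addRecord(long value) {
        total += value;
    }

    // Called after all records have been added; returns the aggregate value.
    public long eval() {
        return total;
    }

    // Convenience helper: aggregate a whole batch of values.
    public static long sum(long[] values) {
        ToySumEvaluator evaluator = new ToySumEvaluator();
        for (long v : values) {
            evaluator.addRecord(v);
        }
        return evaluator.eval();
    }
}
```

The split between addRecord() and eval() matters because an aggregate cannot emit a result until its input stream is exhausted, unlike a BasicEvaluator, which can produce a DataValue per record.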
The standard record iterator utilized between each ROP utilizes the org.apache.drill.exec.ref.RecordIterator interface. This is somewhat inspired by the AttributeSource concepts from within the Lucene project. (I'm planning to extend these concepts all the way to the individual DataValues.)
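A pull-based iterator chain of this kind can be sketched in a few lines. The sketch below is a simplified stand-in, not Drill's actual RecordIterator interface (which lives at org.apache.drill.exec.ref.RecordIterator); all type and method names here are hypothetical. It shows the key structural idea: each downstream operator wraps its upstream iterator and pulls records through it on demand.

```java
// Toy sketch of a pull-based record-iterator chain, loosely modeled on the
// RecordIterator idea described above. All names are hypothetical.
import java.util.Iterator;
import java.util.List;

public class IteratorChainDemo {
    // Minimal stand-in for a record iterator over long values.
    interface ToyRecordIterator {
        boolean next();   // advance; returns false when the stream is exhausted
        long current();   // value at the current position
    }

    // Leaf iterator over an in-memory list (plays the role of a scan operator).
    static class ListIterator implements ToyRecordIterator {
        private final Iterator<Long> it;
        private long cur;
        ListIterator(List<Long> data) { this.it = data.iterator(); }
        public boolean next() {
            if (!it.hasNext()) return false;
            cur = it.next();
            return true;
        }
        public long current() { return cur; }
    }

    // A filter operator wraps its upstream iterator and passes matches through.
    static class FilterIterator implements ToyRecordIterator {
        private final ToyRecordIterator upstream;
        private final long threshold;
        FilterIterator(ToyRecordIterator upstream, long threshold) {
            this.upstream = upstream;
            this.threshold = threshold;
        }
        public boolean next() {
            while (upstream.next()) {
                if (upstream.current() > threshold) return true;
            }
            return false;
        }
        public long current() { return upstream.current(); }
    }

    // Drain a two-operator chain and count the surviving records.
    static long countAbove(List<Long> data, long threshold) {
        ToyRecordIterator chain = new FilterIterator(new ListIterator(data), threshold);
        long count = 0;
        while (chain.next()) count++;
        return count;
    }
}
```

Because each operator only pulls what it needs from its upstream neighbor, a chain like this never materializes intermediate results, which is the same shape a chain of ROPs takes in the interpreter.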
What's Next
Add tests, finish adding ROPs, add local and remote exchange nodes (parallelization), add a bunch of documentation, and extract the Execution plan as a separate intermediate representation.
Where Help Is Needed
The interpreter needs a lot more evaluators (as well as the rest of the ROPs) to be a true reference interpreter. The existing ones can be used as prototypes. Anyone interested in ripping through a bunch of additional evaluators and associated FunctionDefinitions?
If so, let us know on the dev mailing list: http://incubator.apache.org/drill/mailing-lists.html