Apache Drill provides query capabilities against a variety of data systems.
By enabling Drill for DFDL-described data, one could immediately query data that has a DFDL schema describing its format.
Metadata Mapping
TBD: does Drill support...
- nullable complex types (a column containing a sub-table, that is itself nullable?)
- date/time/datetime types
- big int, big decimal
- nullable strings (distinguished from empty strings)
- namespaces (of some sort)
TBD: should we be trying to simplify the metadata to make querying easier, or be ruthlessly uniform so that queries will be ugly but at least consistent?
TBD: should we be trying to handle XSD here (all of it) or just DFDL?
TBD: as with namespace distinctions, where we warn when an element is distinguishable only by its namespace (which is not represented in, for example, JSON), we could also warn about anonymous choices or other constructs that make metadata mapping to Drill (or NiFi or ...) harder.
type (of element unless noted) | nillable (yes/no, * = don't care) | dimension (scalar, optional, array, * = don't care) | Drill metadata |
---|---|---|---|
* | * | array | sub-table with an added index column to hold position (note: the name of the index column should not collide with other column names) |
date/time | * | * | TBD: are there corresponding date/time types? If so, use them; if not, use strings in ISO 8601 format |
string | * | * | Must map any DFDL infoset illegal string characters to Drill-allowed characters (analogous to what we do with XML-illegal characters when converting the DFDL infoset to XML). |
string | * | scalar | String (non-nullable). TBD: is the empty string distinguished from the null string in Drill? (ANSI SQL databases distinguish empty strings from null strings, and DFDL also distinguishes these. Some other databases do not.) |
simple type | no | scalar | corresponding Drill type |
simple type | yes | scalar | nullable corresponding Drill type (TBD: no distinction from string. Combine with string if there is no distinction) |
simple type | no | optional | nullable corresponding Drill type (TBD: no distinction from string. Combine with string if there is no distinction) |
simple type | yes | optional | nullable corresponding Drill type (note: the two concepts of optional and nullable are collapsed) (TBD: no distinction from string. Combine with string if there is no distinction) |
simple type | no | array | sub-table with index and non-nullable value column (TBD: no distinction from string. Combine with string if there is no distinction) |
simple type | yes | array | sub-table with index and nullable value column (TBD: no distinction from string. Combine with string if there is no distinction) |
bounded size unsigned integers (excluding unsignedLong) | * | * | next larger size signed integer |
unsignedLong | * | * | TBD: do we have bignum? TBD: should we just restrict this to the range of the signed long type? TBD: just use string? |
integer (unbounded) | * | * | TBD: do we have a corresponding type? (If not, use string) |
decimal | * | * | TBD: do we have a corresponding type? (If not, use string) |
complex (element with sequence or choice) | no | scalar | sub-map (is there such a thing?) |
complex (element with sequence or choice) | no | optional or array | arrayMap |
complex (element with sequence or choice) | yes | scalar, optional, or array | sub-table with index column and a map |
complex sequence | no | scalar | TBD: merge children into parent context? TBD: extend child element names with enclosing element name? TBD: name collisions? TBD: more than one child with same name? (non-array case) |
complex sequence | yes | scalar | sub-table |
complex sequence | * | optional or array | sub-table |
complex choice | | | |
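
Two of the rows above are settled enough to sketch: widening bounded unsigned integers to the next larger signed type, and turning an array into a sub-table with an index column whose name does not collide. The following is only an illustrative sketch, not a Drill or Daffodil API. All function and column names (`widen_unsigned`, `pick_index_column`, `array_to_sub_table`, `index`, `value`) are hypothetical, the target types are given as XSD type names rather than Drill types, and 1-based positions are an assumption.

```python
# Bounded unsigned XSD types mapped to the next larger signed XSD type.
# unsignedLong is deliberately absent: its mapping is still TBD in the table.
WIDENED_SIGNED_TYPE = {
    "unsignedByte": "short",
    "unsignedShort": "int",
    "unsignedInt": "long",
}


def widen_unsigned(xsd_type):
    """Return the next larger signed type, or None when no rule is settled."""
    return WIDENED_SIGNED_TYPE.get(xsd_type)


def pick_index_column(existing_names, base="index"):
    """Choose a position-column name that collides with no existing column."""
    name = base
    suffix = 1
    while name in existing_names:
        name = f"{base}_{suffix}"
        suffix += 1
    return name


def array_to_sub_table(values, value_column="value"):
    """Turn an array of simple values into rows of (position, value).

    Positions start at 1 here; that choice is an assumption of this sketch.
    """
    index_column = pick_index_column({value_column})
    return [
        {index_column: position, value_column: value}
        for position, value in enumerate(values, start=1)
    ]
```

For example, `array_to_sub_table([10, 20])` yields one row per array item, each carrying its position and its value, and `widen_unsigned("unsignedLong")` returns `None` because that row is still open.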
About Other Metadata
Besides Apache Drill, there are other systems with similar metadata organization.
- Apache NiFi Records
- Apache Avro - adds a restriction: unions cannot directly contain other unions.
  - Note that in a purely logical data system there is no need for unions that contain other unions, since these can always be flattened into a single union.
  - In DFDL, nested "choices" are not uncommon when one choice uses discriminators and another uses direct dispatch.
    - This is a common idiom for getting direct dispatch plus a "default" choice branch for when the value is not found as a choice branch key.
    - DFDL may be amended with a default choice branch feature to help with this.
- Apache Pulsar: supports several kinds of complex type structures. Both NiFi and Avro are supported.
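
The note above that nested unions can always be flattened into a single union can be sketched as follows. This is a minimal illustration, not part of Avro or Daffodil: the function name `flatten_union` is hypothetical, a union is modeled as a Python list of type names (mirroring how Avro writes unions as JSON arrays), and dropping duplicate branches is an assumption of the sketch.

```python
def flatten_union(union):
    """Flatten nested unions into one union, keeping first-seen branch order."""
    flat = []
    for branch in union:
        # A nested list stands for a union used directly as a branch.
        members = flatten_union(branch) if isinstance(branch, list) else [branch]
        for member in members:
            # Skip branches already present so each type appears once.
            if member not in flat:
                flat.append(member)
    return flat
```

Applied to a nested DFDL-style choice such as `["null", ["int", ["string", "null"]]]`, this produces a single flat list of the distinct branches, which is the shape Avro requires.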