This page first presents important themes and topics for advancing Apache Daffodil, so that it meets the needs of more users and really helps to kill the data format problem.
Important Ideas for the Future of Daffodil
There are a number of areas that are rather important to one constituency or another of Daffodil users. This section reviews some of these.
This is not a definitive list - the community's input here is most welcome, either as page edits or comments.
These ideas have a longer time frame than any single release cycle.
Tight Integration with Data Processing Frameworks and Tools
A tight integration requires a metadata bridge that projects a DFDL schema into however the processing framework describes data.
As an example, a DFDL schema for use with Apache Spark should be projected into Spark Struct objects and the metadata that describes them, and this projection should work in both directions, to and from the DFDL schema.
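A minimal sketch of one direction of such a bridge follows. It is an illustration under stated assumptions, not Daffodil's or Spark's API: the function name `project_element`, the type mapping, and the choice to treat `minOccurs="0"` as nullable are all hypothetical, and the output is a plain dictionary only loosely modeled on the JSON form of a Spark struct. It uses only the Python standard library and ignores DFDL annotations entirely.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical, deliberately incomplete mapping from XSD simple types
# to Spark-style type names.
SIMPLE_TYPES = {
    "xs:int": "integer",
    "xs:long": "long",
    "xs:string": "string",
    "xs:double": "double",
    "xs:boolean": "boolean",
    "xs:hexBinary": "binary",
}


def project_element(elem):
    """Project one xs:element into a Spark-like field description."""
    type_name = elem.get("type")
    if type_name is not None:
        data_type = SIMPLE_TYPES[type_name]
    else:
        # Anonymous complex type: recurse into the children of its sequence.
        seq = elem.find(f"{XS}complexType/{XS}sequence")
        data_type = {
            "type": "struct",
            "fields": [project_element(c) for c in seq.findall(f"{XS}element")],
        }
    if elem.get("maxOccurs", "1") != "1":
        # Assumption: any recurring element becomes an array of its type.
        data_type = {"type": "array", "elementType": data_type, "containsNull": False}
    return {
        "name": elem.get("name"),
        "type": data_type,
        # Assumption: optional elements are represented as nullable fields.
        "nullable": elem.get("minOccurs", "1") == "0",
        "metadata": {},
    }


EXAMPLE_SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="record">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="id" type="xs:int"/>
        <xs:element name="label" type="xs:string" minOccurs="0"/>
        <xs:element name="sample" type="xs:double" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

root = ET.fromstring(EXAMPLE_SCHEMA)
record = project_element(root.find(f"{XS}element"))
```

A real bridge would also need the reverse direction, and would have to decide how DFDL-specific concepts (choices, hidden groups, calculated values) appear in the framework's metadata, if at all.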
Metadata Repository/Catalog Integration
DFDL schemas should be discoverable, accessible, and associated with data sets in the same way that other schemas (e.g., Apache Avro schemas) are. Some data processing frameworks, such as Apache NiFi, already integrate with these repositories and are a natural place to introduce this integration.
Finally, given DFDL/Daffodil integration, tools like Apache Drill can allow direct query of DFDL-described data using the Apache Drill query language. This goes a long way toward realizing a vision where "every bit of data has a URL".
The Daffodil CLI would go from being an educational and demonstration tool, to a really useful data processing tool if it embedded Apache Drill as a library, allowing one to somehow associate DFDL schemas with various data files, and then just start writing queries that work across the DFDL-described data and any other data.
Bugs, Missing Features, the JIRA Backlog, and Training Examples/Materials
The number of tickets seems to hover around 400. Just driving this count down is important. We often park issues for the future as JIRA tickets, and that is one reason the count stays level. Each fixed issue often results in an idea for further work, so closing a ticket tends to create a new one and the count stays level rather than dropping.
We do need a comprehensive test suite for sequences and separators, as that is one of the most subtle areas of the DFDL specification. Such a test suite should be built to allow cross-testing against IBM DFDL or other DFDL implementations that come along, including our own different runtime backends (we now have a C backend in addition to the original Scala backend).
At the same time, this test suite should be created with the notion that it, or parts of it, can be converted into tutorial materials that explain the concepts, illustrate their usage, and motivate the behaviors.
COBOL and Financial Services Data
There is a very tiny set of features missing from Daffodil before we can support COBOL data fully. This is worth attention as it would enable many applications that use these legacy data formats. DAFFODIL-853 (textNumberPattern 'V' and 'P' symbols) is the only ticket specific to COBOL data left.
Usability - Interactive Debugger, IDE
The daffodil-vscode interactive debugger needs to support the full edit-debug cycle for DFDL schema development, so as to provide an environment for learning DFDL and for creating and maintaining DFDL schemas. See the issue trackers for the daffodil-vscode repository on GitHub.
Performance of Scala Runtime1
This needs attention, as we have observed clear cases of optimizer flaws, where optimizations should be possible but aren't being carried out, resulting in far slower execution than would be expected. In particular for the unparser, which is currently much slower than the parser, these optimizations are highly suspect.
One of the largest overheads affecting applications using XML and JSON is the textualization overhead. There are examples of 11-bit-long messages in dense binary formats which turn into over 4000 bytes of XML text. For users of XML and JSON, use of EXI (a binary representation which is not just for XML now, but also handles JSON) will massively reduce this overhead.
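To put the textualization figures above in proportion, the expansion factor can be worked out directly. This is only arithmetic on the two numbers quoted above (11 bits of binary data, 4000 bytes of XML text); the variable names are illustrative.

```python
binary_bits = 11      # size of the dense binary message, in bits
xml_bytes = 4000      # size of the resulting XML text, in bytes (a lower bound)

xml_bits = xml_bytes * 8
expansion = xml_bits / binary_bits

print(f"XML text is roughly {expansion:.0f} times larger than the binary message")
```

Since the XML is described as "over 4000 bytes", the true factor for such messages is at least this large.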
Refactoring for Separable Runtimes/Back-ends (DAFFODIL-2536)
Daffodil's layering structure is not right. People want to treat the Daffodil schema compiler as a service that starts with a DFDL schema and creates the optimized intermediate objects that a runtime backend then converts into runtime objects (Runtime1), generated code (codegen-c), or other formal artifacts. One should be able to do this from an application outside of Daffodil, without adding things into Daffodil. Today you can't do this.
Fixing this requires flipping the layering structure of Daffodil, so that daffodil-core, which contains the schema compiler, is split out from the API for generating an executable artifact. The schema compiler should be a library called from a higher layer.
The Daffodil API may need to evolve substantially to accommodate this. The notion that the Daffodil Compiler creates a ProcessorFactory which creates a DataProcessor is fundamentally flawed with respect to the way people need to use parts of Daffodil, and to decoupling runtimes/back-ends from the rest of Daffodil.
As an example, there are many ETL and EAI tools for transforming data. They each have their own ad-hoc format description language. A natural thing to do is to use Daffodil's compiler to convert from DFDL into the format description of one of these other ETL tools. This should be doable by reusing Daffodil's schema compiler to do all the checking associated with DFDL schemas, and to lower the representation to the optimized "Gram" objects that effectively represent the compiler output. A runtime, or converter tool, consumes this "Gram" object representation and outputs whatever artifact is needed.
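The shape of this layering can be sketched as follows. Everything here is hypothetical and greatly simplified: the `Gram` class is a toy stand-in for the compiler's optimized output, `compile_schema` is not Daffodil's API, and the two "backends" are invented purely to show that independent consumers can live outside the compiler and share its checking.

```python
from dataclasses import dataclass, field


@dataclass
class Gram:
    """Toy stand-in for one node of the schema compiler's optimized output."""
    kind: str
    name: str
    children: list = field(default_factory=list)


def compile_schema(field_specs):
    """Toy 'schema compiler': does all checking once and returns a Gram tree."""
    for name, kind in field_specs:
        if kind not in ("int32", "string"):
            raise ValueError(f"unsupported kind: {kind}")
    return Gram("sequence", "record", [Gram(kind, name) for name, kind in field_specs])


def c_struct_backend(gram):
    """One external backend: emit C struct text from the Gram tree."""
    c_types = {"int32": "int32_t", "string": "char *"}
    lines = [f"struct {gram.name} {{"]
    for child in gram.children:
        lines.append(f"    {c_types[child.kind]} {child.name};")
    lines.append("};")
    return "\n".join(lines)


def etl_description_backend(gram):
    """Another external backend: a made-up ETL tool's format description."""
    return {
        "record": gram.name,
        "columns": [{"col": c.name, "as": c.kind} for c in gram.children],
    }


gram = compile_schema([("id", "int32"), ("label", "string")])
c_text = c_struct_backend(gram)
etl_desc = etl_description_backend(gram)
```

The point of the sketch is the direction of the dependencies: both backends depend on the compiler's output type, and the compiler knows nothing about either of them.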
There are numerous missing optimizations. For example, if dfdl:initiatedContent='yes', and the terms with initiators all have initiators that are the same length and there is no framing before those initiators, then this should be optimized to avoid backtracking at runtime.
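The idea behind this optimization can be sketched as a contrast between trying each branch in turn and dispatching directly on a fixed-width prefix. This is a simplified illustration, not Daffodil's implementation: initiators are treated as literal byte strings at the current position, and the function names are hypothetical. Real DFDL initiators also involve encodings, character classes, and case-insensitivity, which this ignores.

```python
def parse_with_backtracking(data, branches):
    """Try each branch in order; every failed attempt is work thrown away."""
    attempts = 0
    for initiator, name in branches:
        attempts += 1
        if data.startswith(initiator):
            return name, data[len(initiator):], attempts
    raise ValueError("no branch matched")


def build_dispatch_table(branches):
    """Compile-time step: only valid when all initiators have the same length."""
    lengths = {len(initiator) for initiator, _ in branches}
    if len(lengths) != 1:
        raise ValueError("initiators differ in length; cannot dispatch directly")
    return lengths.pop(), {initiator: name for initiator, name in branches}


def parse_with_dispatch(data, width, table):
    """Read exactly one initiator-width prefix and look the branch up directly."""
    prefix = data[:width]
    if prefix not in table:
        raise ValueError("no branch matched")
    return table[prefix], data[width:]


branches = [(b"AA:", "header"), (b"BB:", "body"), (b"CC:", "trailer")]
data = b"CC:payload"

slow = parse_with_backtracking(data, branches)   # visits all three branches
width, table = build_dispatch_table(branches)    # done once, ahead of time
fast = parse_with_dispatch(data, width, table)   # one lookup, no retries
```

The same-length and no-framing conditions are what make the single fixed-width read safe: the parser knows exactly how many bytes to examine and where they start.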
See DAFFODIL-2831.
XML Feature Enhancements (and JSON)
Developers have a love/hate relationship with all things XML.
There are ideas for those who want ongoing improvement to the XML-oriented features of DFDL/Daffodil. Later there is a separate section for those who want to escape from all things XML.
XML Attribute Support (See: Proposal: Extend DFDL with XML Attribute Support)
An extension to DFDL that enables creation of XML attributes, instead of everything being an XML element, is highly desirable. Some people have created transformers that convert Daffodil's output XML (all elements) to XML containing a mixture of attributes and child elements. This needs to be something enabled within the DFDL schema.
Improve XML by allowing Complex Types as the dfdlx:repType for Simple Types
This solves another XML annoyance called the "value element problem" (see: Proposal: dfdlx:repType to allow Complex Representation of Simple Types)
EXI is a binary standard representation of the XML infoset. It can be used to eliminate all the textualization overhead of XML text, and with EXI's schema-aware and compression features, it has been shown that storing data in EXI can be just as dense as original native dense-binary representations.
EXI engines also support binary JSON, so these same benefits can be had by those preferring JSON for their applications.
Escape from XML - A Non-XSD Syntax for DFDL
Many people want DFDL but are unwilling to deal with all the baggage of XML and XML Schema. XML is not ideal as a data representation for many reasons, and XML Schema has many complexities that are necessary only when it is being used with non-data XML 'documents'. E.g., the notion of a data document with a single root element is an artifact of XML/XSD. There is much to learn about XML and XML Schema that one must know simply so as to avoid its pitfalls.
DFDL needs a neutral syntax that is more natural than XSD.
One notion is XSCS (XML Schema Compact Syntax). This lacks a syntax for standardized annotations but could contribute ideas.
Another idea is to start from the data language of popular data-processing frameworks. E.g., the 'struct' language of Apache Spark. This could be extended to support DFDL, and this would provide a natural audience for the new notation.
A third idea comes from the PLC4X project. They actually considered and rejected DFDL because it was too XML-oriented for their developer community, and so they invented a notation for format specifications. Here's the example for BACNET IP.
It is not necessary that this new notation provide everything that XML Schema provides, nor that it be compatible or inter-convertible to/from DFDL schemas using XSD. As an example, a small and partial list of things that could be unsupported includes element references, elementFormDefault 'qualified', and no-namespace schemas.
DFDL without XML Schema also calls for a non-XML data language, for those just using Daffodil to convert data to a more accessible form. EXI-JSON may be a sufficient data language. There are other contenders here such as SISL (Simple Information Serialization Language).
The number of ad-hoc schema languages we could use as a starting syntax is very large, and includes Apache Spark Struct, Apache Avro schemas, Apache NiFi Records, Google Protocol Buffer syntax, OMG CORBA IDL, ASN.1, C language struct syntax, etc.
DFDL Language Wish List
There is also a DFDL Wish List page.
Machine-Readable DFDL Specification
The official DFDL specification today is a PDF document created using Microsoft Word.
If there is one lesson learned from using DFDL to create large data formats, it is that all large specifications should be machine readable so that they can be processed by applications that generate spec-documents of various kinds, as well as test suites, and parts of DFDL implementations, or DFDL Development Environment implementations. DFDL is no different in this way.
We need a conversion of the DFDL specification into a declarative XML document which can be processed to re-create any form of formal documentation (via docbook, HTML, etc.) or can be used to generate a hyperlinked index to all the DFDL properties (requested in the spec, but not possible to do automatically in MS-Word).
Release Plan (Suggested)
The table below should be updated as new releases come out, or the themes/emphasis of a release change.
Of course, this is all highly subject to change based on what the user community needs, and what community developers choose to work on.
The release numbering is also subject to change.
(could be 4.0.0)
Improved Usability, Debug, Trace
C Backend (aka "codegen-c")
Complete DFDL Implementation including all optional features
Bug Fixes prioritized by Issues Schema Authors are experiencing
Prefixed - fix remaining issues blocking use of lengthKind 'prefixed'
These ideas have been put forward as themes for future releases:
Theme: Finishing feature sets that have been incomplete for a long time
COBOL - fix remaining issues blocking use of Daffodil with COBOL data
Optimizations/Performance
Prefixed - fix remaining issues blocking use of dfdl:lengthKind 'prefixed'
Extending XML Feature Set
Add the small set of features already identified by XML-centric users to make DFDL more friendly to those starting from XSD data, or who are trying to create more XML-user-friendly XML output from DFDL parsing.
The goal here is a release with no functional changes, perhaps even no bug fixes, just changing out infrastructure, such as updating to a more recent Scala and the refactoring required for use with Java modules and OSGi modules.