This page first presents important themes and topics for advancing Apache Daffodil, so that it meets the needs of more users and genuinely helps solve the data format problem. It then suggests a release plan.

Important Ideas for the Future of Daffodil

There are a number of areas that are rather important to one constituency or another of Daffodil users. This section reviews some of these.

This is not a definitive list - the community's input here is most welcome, either as page edits or comments.

These ideas have a longer time frame than any single release cycle.

Tight Integration with Data Processing Frameworks and Tools

A tight integration requires a metadata bridge that projects a DFDL schema into whatever form the processing framework uses to describe data.
As an example, an integration with Apache Spark should derive Spark Struct objects, and the metadata that describes them, from a DFDL schema, and should support that mapping in both directions.
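One direction of such a bridge can be sketched as follows. This is a minimal illustration, not Daffodil or Spark code: the `Element` class is a hypothetical, simplified stand-in for a compiled DFDL element declaration, the type table is an assumed partial mapping, and the output is a type string in the style Spark uses for struct types rather than a real Spark object.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Element:
    """Hypothetical, simplified stand-in for a DFDL element declaration."""
    name: str
    xsd_type: Optional[str] = None          # simple type; None means complex
    children: List["Element"] = field(default_factory=list)
    is_array: bool = False                  # True when the element can repeat


# Assumed (partial) mapping from XSD simple types to Spark SQL type names.
XSD_TO_SPARK = {
    "xs:string": "string",
    "xs:int": "int",
    "xs:long": "bigint",
    "xs:double": "double",
    "xs:boolean": "boolean",
    "xs:hexBinary": "binary",
}


def to_spark_type(elem: Element) -> str:
    """Project one element (recursively) into a Spark-style type string."""
    if elem.xsd_type is not None:
        base = XSD_TO_SPARK[elem.xsd_type]
    else:
        fields = ",".join(f"{c.name}:{to_spark_type(c)}" for c in elem.children)
        base = f"struct<{fields}>"
    return f"array<{base}>" if elem.is_array else base


record = Element("record", children=[
    Element("id", "xs:int"),
    Element("tags", "xs:string", is_array=True),
    Element("pos", children=[
        Element("lat", "xs:double"),
        Element("lon", "xs:double"),
    ]),
])
print(to_spark_type(record))
```

A real bridge would also need the reverse direction, and would have to decide how DFDL-specific notions such as choices, nillable elements, and hidden groups appear on the framework side.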

(Note: Apache Drill integration is in the works as of 2024-05-09)

Metadata Repository/Catalog Integration

DFDL schemas should be discoverable, accessible, and associated with data sets in the same way as other schemas (e.g., Apache Avro schemas). Some data processing frameworks, such as Apache NiFi, already integrate with these repositories and are a natural place to introduce this integration.

Finally, given DFDL/Daffodil integration, tools like Apache Drill can allow direct query of DFDL-described data using the Apache Drill query language. This goes a long way toward realizing a vision where "every bit of data has a URL".

The Daffodil CLI would go from being an educational and demonstration tool to a genuinely useful data processing tool if it embedded Apache Drill as a library. One could then associate DFDL schemas with various data files and start writing queries that work across the DFDL-described data and any other data.

Bugs, Missing Features, the JIRA Backlog, and Training Examples/Materials

The number of tickets seems to hover around 400. Just driving this count down is important. We often park issues for the future as JIRA tickets, and that is one reason the count stays level. Each fixed issue also often leads to an idea for further work, so closing one ticket frequently opens another and the count does not drop.

We do need a comprehensive test suite for sequences and separators, as that is one of the most subtle areas of the DFDL specification. Such a test suite should be built to allow cross-testing against IBM DFDL or other DFDL implementations that come along, including our own different runtime backends (we now have a C backend in addition to the original Scala backend).

At the same time, this test suite should be created with the notion that it, or parts of it, can be converted into tutorial materials that explain the concepts, illustrate their usage, and motivate the behaviors.

Usability - Interactive Debugger, IDE

The daffodil-vscode interactive debugger needs to support the full edit-debug cycle for DFDL schema development, so as to provide an environment for learning DFDL and for creating and maintaining DFDL schemas. See the trackers for the daffodil-vscode repository on GitHub.

Performance of Scala Runtime1

This needs attention. We have observed clear cases of optimizer flaws, where optimizations should be possible but aren't being carried out, resulting in far slower execution than would be expected. The optimizations are particularly suspect in the unparser, which is currently much slower than the parser.

Refactoring for Separable Runtimes/Back-ends (DAFFODIL-2536)

Daffodil's layering structure is not right. People want to treat the Daffodil schema compiler as a service that starts with a DFDL schema and creates the optimized intermediate objects that a runtime backend then converts into runtime objects (Runtime1), generated code (codegen-c), or other formal artifacts. One should be able to do this from an application outside of Daffodil, without adding things into Daffodil. Today you can't do this.

Fixing this requires flipping the layering structure of Daffodil, so that daffodil-core, which contains the schema compiler, is split out from the API for generating an executable artifact. The schema compiler should be a library called from a higher layer.

The Daffodil API may need to evolve substantially to accommodate this. The notion that the Daffodil Compiler creates a ProcessorFactory which creates a DataProcessor is fundamentally flawed with respect to the way people need to use parts of Daffodil, and to decoupling runtimes/back-ends from the rest of Daffodil.

As an example, there are many ETL and EAI tools for transforming data. They each have their own ad-hoc format description language. A natural thing to do is to use Daffodil's compiler to convert from DFDL into the format description of one of these other ETL tools. This should be doable by reusing Daffodil's schema compiler to do all the checking associated with DFDL schemas, and to lower the representation to the optimized "Gram" objects that effectively represent the compiler output. A runtime, or converter tool, consumes this "Gram" object representation and outputs whatever artifact is needed.
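The intended layering can be illustrated with a toy sketch. Nothing here is Daffodil's API: `Gram`, `compile_schema`, and the two backends are hypothetical names, and the "schema" is just a list of field names and types. The point is only that the compiler is called as a library, does its checking once, and that independent backends consume its output.

```python
from dataclasses import dataclass
from typing import Protocol, Tuple


@dataclass(frozen=True)
class Gram:
    """Hypothetical stand-in for the compiler's optimized output objects."""
    name: str
    kind: str                               # "sequence" or a simple type name
    children: Tuple["Gram", ...] = ()


def compile_schema(name, fields):
    """Toy 'schema compiler': validate once, then lower to Gram objects."""
    known = {"int32", "string"}
    for field_name, kind in fields:
        if kind not in known:
            raise ValueError(f"unsupported type {kind!r} for {field_name!r}")
    return Gram(name, "sequence", tuple(Gram(n, k) for n, k in fields))


class Backend(Protocol):
    """Anything that turns compiler output into some artifact."""
    def generate(self, gram: Gram) -> str: ...


class CStructBackend:
    """Emits a C struct declaration (illustrative only)."""
    C_TYPES = {"int32": "int32_t", "string": "char *"}

    def generate(self, gram: Gram) -> str:
        body = "".join(f"  {self.C_TYPES[c.kind]} {c.name};\n" for c in gram.children)
        return f"struct {gram.name} {{\n{body}}};\n"


class EtlBackend:
    """Emits a made-up format description for some other ETL tool."""
    def generate(self, gram: Gram) -> str:
        return ";".join(f"{c.name}={c.kind}" for c in gram.children)


# The application sits above the compiler and picks the backend itself.
gram = compile_schema("rec", [("id", "int32"), ("label", "string")])
for backend in (CStructBackend(), EtlBackend()):
    print(backend.generate(gram))
```

In this arrangement a new backend needs no change to the compiler, which is the property the refactoring is after.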

Runtime Performance

There are numerous missing optimizations. For example (DAFFODIL-2831), if dfdl:initiatedContent='yes', the terms with initiators all have initiators of the same length, and there is no framing before those initiators, then this should be optimized to avoid backtracking at runtime.
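The idea behind the initiator optimization can be sketched as a lookup table: when every initiator has the same length, the parser can read that many bytes once and jump straight to the right branch instead of trying each branch in turn. This is an illustrative sketch only, not Daffodil's implementation; `build_dispatch` and the branch functions are hypothetical, and it assumes initiators are byte strings and that the position already points at the initiator (no framing before it).

```python
from typing import Callable, Dict, Optional

BranchParser = Callable[[bytes, int], object]


def build_dispatch(branches: Dict[bytes, BranchParser]) -> Optional[BranchParser]:
    """Return a table-driven parser, or None if the optimization does not apply."""
    lengths = {len(initiator) for initiator in branches}
    if len(lengths) != 1 or 0 in lengths:
        # Initiators differ in length (or are empty): the caller must fall
        # back to trying the branches one by one, with backtracking.
        return None
    (n,) = lengths

    def parse(data: bytes, pos: int) -> object:
        # One fixed-length read selects the branch; no speculation needed.
        branch = branches.get(data[pos:pos + n])
        if branch is None:
            raise ValueError(f"no branch matches initiator at offset {pos}")
        return branch(data, pos + n)

    return parse


dispatch = build_dispatch({
    b"A:": lambda d, p: ("alpha", d[p:].decode("ascii")),
    b"N:": lambda d, p: ("number", int(d[p:].decode("ascii"))),
})
print(dispatch(b"N:42", 0))
```

A real implementation would work with the schema's text encoding and with DFDL's rules for initiator matching, which this sketch ignores.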

XML Feature Enhancements (and JSON)

Developers have a love/hate relationship with all things XML.

This section covers ideas for those who want ongoing improvement to the XML-oriented features of DFDL/Daffodil. A separate section later addresses those who want to escape from all things XML.

XML Attribute Support (See: Proposal: Extend DFDL with XML Attribute Support)

An extension to DFDL that enables creation of XML attributes, instead of everything being an XML element, is highly desirable. Some people have created transformers that convert Daffodil's output XML (all elements) to XML containing a mixture of attributes and child elements. This needs to be something that can be enabled within the DFDL schema.
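A post-processing transformer of the kind people write today might look like the following sketch. It is not part of Daffodil; the function name and the set of names to promote are hypothetical, and it assumes no-namespace element names and that only leaf elements are promoted.

```python
import xml.etree.ElementTree as ET


def promote_to_attributes(root, names):
    """Turn selected leaf child elements into attributes of their parent."""
    # Snapshot the elements first so the tree is not modified while walking it.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in names and len(child) == 0:
                parent.set(child.tag, child.text or "")
                parent.remove(child)
    return root


doc = ET.fromstring(
    "<msg><id>7</id><kind>ping</kind><body><text>hi</text></body></msg>"
)
promote_to_attributes(doc, {"id", "kind"})
print(ET.tostring(doc, encoding="unicode"))
```

Because the choice of which elements become attributes lives outside the schema here, every consumer has to repeat it, which is why the proposal moves that decision into the DFDL schema itself.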

Improve XML by allowing Complex Types as the dfdlx:repType for Simple Types

This solves another XML annoyance called the "value element problem" (see: Proposal: dfdlx:repType to allow Complex Representation of Simple Types)

Escape from XML - A Non-XSD Syntax for DFDL

Many people want DFDL but are unwilling to deal with all the baggage of XML and XML Schema. XML is not ideal as a data representation for many reasons, and XML Schema has many complexities that are necessary only when it is being used with non-data XML 'documents'. E.g., the notion of a data document with a single root element is an artifact of XML/XSD. There is much to learn about XML and XML Schema that one must know simply so as to avoid its pitfalls.

DFDL needs a neutral syntax that is more natural than XSD.

One notion is XSCS (XML Schema Compact Syntax). This lacks a syntax for standardized annotations but could contribute ideas.

Another idea is to start from the data language of a popular data-processing framework, e.g., the 'struct' language of Apache Spark. This could be extended to support DFDL, and this would provide a natural audience for the new notation.

A third idea comes from the PLC4X project. They actually considered and rejected DFDL because it was too XML-oriented for their developer community, and so they invented their own notation for format specifications. Their format specification for BACnet/IP is one example.

It is not necessary that this new notation provide everything that XML Schema provides, nor that it be compatible or inter-convertible to/from DFDL schemas using XSD. As an example, a small and partial list of things that could be unsupported includes element references, elementFormDefault 'qualified', and no-namespace schemas.

DFDL without XML Schema also calls for a non-XML data language, for those who are just using Daffodil to convert data to a more accessible form. EXI-JSON may be a sufficient data language. There are other contenders here, such as SISL (Simple Information Serialization Language).

The number of ad-hoc schema languages we could use as a starting syntax is very large, and includes Apache Spark Struct, Apache Avro schemas, Apache NiFi Records, Google Protocol Buffer syntax, OMG CORBA IDL, ASN.1, C language struct syntax, etc.

DFDL Language Wish List

There is also a DFDL Wish List page.

Machine-Readable DFDL Specification

The official DFDL specification today is a PDF document created using Microsoft Word.

If there is one lesson learned from using DFDL to create large data formats, it is that all large specifications should be machine readable so that they can be processed by applications that generate spec-documents of various kinds, as well as test suites, and parts of DFDL implementations, or DFDL Development Environment implementations. DFDL is no different in this way.

We need a conversion of the DFDL specification into a declarative XML document which can be processed to re-create any form of formal documentation (via DocBook, HTML, etc.) or can be used to generate a hyperlinked index to all the DFDL properties (requested in the spec, but not possible to do automatically in MS-Word).

Note that as DFDL is now an ISO/IEC Standard (ISO/IEC 23415:2024), the next revision will have to conform to ISO formatting requirements. Other ISO standards groups use XML-based document-creation processes. We need to enable this practice.

Release Plan (Suggested)

The table below should be updated as new releases come out, or the themes/emphasis of a release change.

Of course, this is all highly subject to change based on what the user community needs, and what community developers choose to work on.

The release numbering is also subject to change.

Release

Description

4.0.0

Major release

Scala 2.13 (or 3.x) - drop Scala 2.12

Drop support for Java 8 JVMs. (Support 11, 17, 21)

May change default settings to those that get better performance or improve schemas, but this may require schemas to evolve or add compatibility flags.

3.10

Bug fixing and performance

Bug fixing prioritized by the issues that schema authors are experiencing

  • Diagnostic messaging

These ideas have been put forward as themes for future releases:

Release

Description

5.x


Complete DFDL Implementation including all optional features

  • missing required features
  • optional features (with a few minor exceptions)
  • DFDL v2.0 extension features

Optimizations/Performance

C Backend (aka "codegen-c")

  • Extend to cover strings, arrays, etc. A useful subset of DFDL capabilities.



Release

Description

6.x

Extending XML Feature Set

Add the small set of features already identified by XML-centric users to make DFDL more friendly to those starting from XSD data, or who are trying to create more XML-user-friendly XML output from DFDL parsing. 

  • allow complex type nilValue other than just %ES
  • lengthKind 'valuePattern'
  • new dfdl:lengthKind 'dfdlx:patternMatch' (this one is pretty small work)
  • Extend DFDL with XML Attribute Support

Non-XML Syntax for DFDL schemas

