
...

Describes the feature as-is-built 2018-05-14.

This memo describes a proposed feature for expressing data layering of pre/post processing operations.

...

Where we must distinguish between layers, we will use the terms "underlying layer" and "overlying layer".

Most of the discussion here will use parsing as context, but where the unparsing is not clearly symmetric, unparsing will also be described.

New DFDL schema annotations are shown in the "daf:" namespace to make clear which annotations are standard DFDL and which are the proposed extensions. Our hope is that these extensions will be suitable for inclusion in a revision of the DFDL standard (e.g., DFDL v2.0).

The Layering Properties

The following properties are added to dfdl:sequence (with corresponding short forms):

...

If the length of a layered sequence is needed, for example to store the length of the transformed representation using dfdl:outputValueCalc, then the layered sequence must be enclosed in an element, and the dfdl:contentLength(...) of that element provides the length of the transformed content.

Data Layers as Streams

A data layer is conceptually a stream of bytes. It can be an input layer for parsing or an output layer for unparsing.
Use of the term "stream" here is consistent with Java's use of the term, as in java.io.InputStream and java.io.OutputStream. These are sources and sinks of bytes. To decode characters from them, one must specify the encoding explicitly.
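
As a rough illustration of the byte-stream view (written in Python rather than Java, purely for brevity), the sketch below reads raw bytes from a stream and obtains characters only after an encoding is named explicitly. The sample data and the choice of UTF-8 are assumptions made for the example.

```python
import io

# A layer is a stream of bytes; nothing about it implies a character encoding.
layer = io.BytesIO("naïve".encode("utf-8"))

raw = layer.read()           # bytes, exactly as stored in the layer
text = raw.decode("utf-8")   # characters exist only once an encoding is named

# Decoding the same bytes with a different encoding yields different text.
mis_decoded = raw.decode("latin-1")
```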

A layer transform is a transformation that creates one layer of bytes from another. An underlying layer is encapsulated by a transformation to create an overlying layer.

When parsing, reading from the overlying layer causes data to be read from the underlying layer; that data is then transformed and becomes the bytes of the overlying layer returned from the read.

The layer properties apply to the underlying layer data and indicate how to identify its bounds/length, and if a layer transform is textual, what encoding is used to interpret the underlying bytes.

Some transformations are naturally binary, bytes to bytes. Data compression and decompression are the typical examples. When parsing, the overlying layer's bytes are the result of decompressing the underlying layer's bytes.
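
A minimal sketch of a bytes-to-bytes layer, using Python's standard gzip module as a stand-in for whatever decompression a real implementation would use. The variable names and sample payload are illustrative only.

```python
import gzip
import io

original = b"payload bytes " * 4

# Underlying layer: the compressed representation as it appears in the data.
underlying = io.BytesIO(gzip.compress(original))

# Overlying layer: a stream whose reads pull bytes from the underlying
# layer and return the decompressed result.
overlying = gzip.GzipFile(fileobj=underlying, mode="rb")

first = overlying.read(7)   # this read triggers reads of the underlying stream
rest = overlying.read()     # the remainder of the decompressed bytes
```

The parser working on the overlying layer never sees the compressed bytes; it only sees the stream that the transform produces.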

If a transform requires text, then a dfdl:format encoding must be defined. For example, base64 is a transform that creates bytes from text. Hence, a layer encoding is needed to convert the underlying layer of bytes into text, then the base64 decoding occurs on that text, which produces the bytes of the overlying layer.
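
The two steps described above (bytes to text via the layer encoding, then text to bytes via base64) might be sketched as follows. The choice of US-ASCII as the layer encoding and the sample data are assumptions for the example.

```python
import base64

layer_encoding = "us-ascii"  # assumed layer encoding for this example

# Underlying layer: bytes that, once decoded, are base64 text.
underlying = b"SGVsbG8sIGxheWVycyE="

text = underlying.decode(layer_encoding)   # step 1: bytes -> text
overlying = base64.b64decode(text)         # step 2: text -> bytes of the overlying layer

# Unparsing runs the same two steps in reverse.
re_encoded = base64.b64encode(overlying).decode(layer_encoding).encode(layer_encoding)
```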

We think of some transforms as text-to-text. Line folding/unfolding is one such transform. Lines of text that are too long are wrapped by inserting a line-ending and either a space or tab. As a DFDL layer transform, line folding requires an encoding. The underlying bytes are decoded into characters according to the encoding. Those characters are divided into lines, and the line unfolding (for parsing) is done to create longer lines of data. The resulting data is then encoded from characters back into bytes using the same encoding.
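
A sketch of the parse direction of such a text-to-text transform. It assumes that a fold is a CRLF followed by exactly one space or tab, and that unfolding removes all three characters; other folding conventions keep the whitespace character. The function name unfold is hypothetical.

```python
import re

def unfold(underlying: bytes, encoding: str) -> bytes:
    """Decode, unfold lines, then re-encode using the same encoding."""
    text = underlying.decode(encoding)
    # Assumption: a fold is CRLF plus one space or tab, all of which are removed.
    unfolded = re.sub(r"\r\n[ \t]", "", text)
    return unfolded.encode(encoding)

folded = (
    b"COMMENT:This is a long comm\r\n"
    b" ent that was folded \r\n"
    b"\tacross lines\r\n"
)
result = unfold(folded, "us-ascii")
```

The trailing CRLF is not followed by a space or tab, so it survives as an ordinary line ending.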

(There may be opportunities to optimize/shortcut these transformations if the overlying layer is the data layer for an element with scannable text representation using the same character set encoding. Reconverting to bytes, only to then decode those bytes into characters of the same encoding again, is overhead that can be avoided.)

DFDL can describe a mixture of character set decoding/encoding and binary value parsing/unparsing against the same underlying data representation; hence, the underlying data layer concept is always one of bytes.

(Note: bytes suffice even for MIL-STD-2045, which can hold a compressed VMF payload. This payload element is always byte aligned even in MIL-STD-2045, a very bit-oriented format. As of this writing we have no examples of layer transforms that require bit granularity; hence, this is a byte-oriented proposal.)

Daffodil parsing begins with a default standard data input stream. Unparsing begins with a default standard output stream. These are the ultimate underlying layer.

...

When unparsing, extra data may have to be created (padding/filling) to satisfy the layer unparsing algorithm. The DFDL schema for the xs:sequence content must create this padded/filled extra data. It is an Unparse Error if the data provided to the layer transform's encoding algorithm when unparsing does not satisfy the algorithm's length requirements.
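
To make the division of responsibility concrete, here is a sketch under stated assumptions: a hypothetical transform that only accepts input whose length is a multiple of a block size. The names UnparseError, pad_to_block, and unparse_through_layer are invented for illustration, and an identity function stands in for the real encoding algorithm.

```python
class UnparseError(Exception):
    """Hypothetical stand-in for a DFDL Unparse Error."""

def pad_to_block(content: bytes, block_size: int, fill: bytes = b"\x00") -> bytes:
    """What the schema for the sequence content is responsible for producing."""
    remainder = len(content) % block_size
    if remainder == 0:
        return content
    return content + fill * (block_size - remainder)

def unparse_through_layer(content: bytes, block_size: int) -> bytes:
    """The layer checks its length requirement; it does not pad on its own."""
    if len(content) % block_size != 0:
        raise UnparseError(
            f"layer requires a multiple of {block_size} bytes, got {len(content)}"
        )
    return content  # identity stands in for the real encoding algorithm

unpadded = b"abcde"
padded = pad_to_block(unpadded, 4)
encoded = unparse_through_layer(padded, 4)
```

Passing unpadded directly to unparse_through_layer raises UnparseError, which mirrors the rule that the schema, not the layer, must supply the fill data.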

Examples using Data Layering

When a DFDL schema needs to describe, say, gzip encoding, the DFDL annotations might look like this:

...

In the above, the base64 has been decoded into a long string of "Lorem ipsum" nonsense, and the line-folded comment has been unfolded. This data can be unparsed with the same DFDL schema to get back the data representation shown previously. That is to say, this data "round trips" through parsing and unparsing.

Example of Multi-layer Transformation

Here is some CSV data:

last,first,middle,DOB
smith,robert,brandon,1988-03-24
johnson,john,henry,1986-01-23
jones,arya,cat,1986-02-19

...

This schema will round-trip the data: parse, then unparse, then parse again.

Summary

  • Allows stacking transforms one on top of another, so you can have base64-encoded compressed data as the payload representation of a child element within a larger element.
  • Allows specifying properties of the underlying data layers separately from the properties of the logical data.
  • Scopes the transforms over an xs:sequence body only.
  • Avoids new annotation elements with particulars about scoping.
  • Simple: does not add new functions for layering when the existing dfdl:contentLength already handles it.
  • Complex cases (e.g., an initiator before layered data) are handled by encapsulating the layered sequence in another sequence or element that carries the initiator.
  • Layer annotations concern only determining the length of the layered region and the algorithm for transforming the data.
  • Layer transforms have mandatory layer alignment (1 byte for now).
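
The stacking point in the summary can be sketched with two standard-library transforms: gzip below base64. This is only an illustration of layering order and round-tripping, not of how an implementation would wire layers together; the function names and the US-ASCII layer encoding are assumptions.

```python
import base64
import gzip

def unparse_layers(logical: bytes) -> bytes:
    """Unparse direction: compress first, then base64-encode the compressed bytes."""
    return base64.b64encode(gzip.compress(logical))

def parse_layers(representation: bytes) -> bytes:
    """Parse direction: undo the layers in the opposite order."""
    text = representation.decode("us-ascii")   # layer encoding for the base64 layer
    return gzip.decompress(base64.b64decode(text))

logical = b"last,first,middle,DOB\nsmith,robert,brandon,1988-03-24\n"
representation = unparse_layers(logical)
round_tripped = parse_layers(representation)
```

When checking a round trip, it is safer to compare the logical data than the compressed bytes, because a gzip header can carry a timestamp that differs between runs.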

Open design issues

  • Parameterization of transform algorithms: many algorithms will have variations that can be controlled by parameters. However, it is unclear whether a parameterization method is needed, or whether a large number of individual transforms, each with a specific configuration of parameters, would suffice. Experience with these concepts will be needed before there is enough information to propose ideas here.
  • Debug and trace impact, and how to provide visibility into what is going on when an error occurs in the middle of parsing/unparsing when transforms are in use. E.g., the bit/byte position where a run-time parse error occurs would be in some transformed stream, not the underlying stream. I suspect some experience with these transform concepts will be needed before there is enough information to propose ideas here.


The material below is for the future, once Quoted-Printable has been implemented.

VCalendar Example Using Quoted-Printable

Consider this VCALENDAR data:

...