Comment: updated to use dfdlx, and match spec as of 2021-10-06

Revised per changes on 2021-10-06

Describes the feature as-is-built 2018-05-14.

...

The following properties are added to dfdl:sequence (with corresponding short forms). They require the "dfdlx" (DFDL extension) namespace prefix. 

  • layerTransform (literal string or DFDL expression) - an XSD NCName; all names are currently reserved. In the future this may become extensible, allowing QNames to be used.
  • layerEncoding (literal string or DFDL expression)
  • layerLengthKind - Can be 'implicit', 'explicit', or 'boundaryMark'. Perhaps other values in the future (e.g., 'pattern') 
  • layerBoundaryMark (literal string or DFDL expression) - used with dfdlx:layerLengthKind 'boundaryMark'
  • layerLength (literal string or DFDL expression) - used with dfdlx:layerLengthKind 'explicit'
  • (TBD: layer properties for when dfdlx:layerLengthKind is 'prefixed', at such time as that is supported, if ever.)

The initial transform names and their supported layerLengthKinds are:

  • base64_MIME - layerLengthKind 'boundaryMark' only
  • gzip - layerLengthKind 'explicit' only
  • lineFolded_IMF - layerLengthKind 'boundaryMark' (the layerBoundaryMark property is not used; the boundary is always CRLF), or layerLengthKind 'implicit', which extends to the end of the available data.
  • lineFolded_iCalendar - same as lineFolded_IMF
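
The three layerLengthKind values can be pictured as three ways of cutting the layer's bytes out of the underlying data. The following is a minimal Python sketch, not Daffodil's implementation. The function name `bound_layer` and its parameters are hypothetical, and it assumes that a boundary mark is consumed but is not part of the layer's content.

```python
from typing import Optional, Tuple


def bound_layer(data: bytes, length_kind: str,
                length: Optional[int] = None,
                boundary_mark: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Split data into (layer bytes, bytes remaining after the layer)."""
    if length_kind == "explicit":
        # The layer is exactly `length` bytes long.
        if length is None or length > len(data):
            raise ValueError("explicit length missing or exceeds available data")
        return data[:length], data[length:]
    if length_kind == "boundaryMark":
        # The layer extends up to the first occurrence of the mark.
        if boundary_mark is None:
            raise ValueError("boundaryMark requires a boundary mark")
        idx = data.find(boundary_mark)
        if idx < 0:
            raise ValueError("boundary mark not found")
        # Assumption: the mark is consumed and excluded from the layer.
        return data[:idx], data[idx + len(boundary_mark):]
    if length_kind == "implicit":
        # The layer extends to the end of the available data.
        return data, b""
    raise ValueError("unsupported layerLengthKind: " + length_kind)
```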

An example layer transform is provided also as a test case:

  • aisASCIIArmor - layerLengthKind is assumed to be 'boundaryMark' (the layerLengthKind property is ignored), and the boundary mark is assumed to be "," (comma). This is the ASCII armoring used by the AIS (Automatic Identification System) format for ship identification.
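
As an illustration of what such a transform does when parsing, here is a rough Python sketch of AIS-style de-armoring. It assumes the commonly described 6-bit payload armoring (subtract 48 from each character code, then subtract 8 more if the result exceeds 40). The function name `ais_dearmor` and the zero-padding of the final byte are assumptions and are not taken from the Daffodil test case.

```python
def ais_dearmor(underlying: bytes, boundary_mark: bytes = b",") -> bytes:
    """Decode the armored characters before the boundary mark into bytes."""
    end = underlying.find(boundary_mark)
    if end < 0:
        raise ValueError("boundary mark not found")
    bits = ""
    for ch in underlying[:end]:
        # Valid armor characters are '0'..'W' and '`'..'w'.
        if not (48 <= ch <= 87 or 96 <= ch <= 119):
            raise ValueError("invalid armor character: %r" % chr(ch))
        value = ch - 48
        if value > 40:
            value -= 8
        bits += format(value, "06b")  # each character carries 6 bits
    # Assumption: pad with zero bits to reach a whole number of bytes.
    bits += "0" * (-len(bits) % 8)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
```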

...

Code Block
languagexml
<dfdl:defineFormat name="base64Format">
    <dfdl:format dfdlx:layerTransform="base64_MIME" dfdlx:layerLengthKind="boundaryMark" />
</dfdl:defineFormat>

...

A data layer is conceptually a stream of bytes. It can be an input layer for parsing, or an output layer for unparsing.
Use of the term "stream" here is consistent with Java's use of stream as in java.io.InputStream and java.io.OutputStream. These are sources and sinks of bytes. To decode characters from them, one must specify the encoding explicitly.

A layer transform is a transformation that creates one layer of bytes from another. An underlying layer is encapsulated by a transformation to create an overlying layer.

When parsing, reading from the overlying layer causes reading of data from the underlying layer, which data is then transformed and becomes the bytes of the overlying layer returned from the read.

The layer properties apply to the underlying layer data and indicate how to identify its bounds/length, and if a layer transform is textual, what encoding is used to interpret the underlying bytes.

Some transformations are naturally binary, bytes to bytes. Data decompression/compression is the typical example. When parsing, the overlying layer's bytes are the result of decompressing the underlying layer's bytes.
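
The bytes-to-bytes case can be sketched with Python's standard gzip module. This is only an illustration of the layering idea under an explicit layer length, not how Daffodil implements it. The function name `open_gzip_layer` is hypothetical.

```python
import gzip
import io


def open_gzip_layer(underlying: io.BytesIO, layer_length: int) -> gzip.GzipFile:
    """Return an overlying stream of decompressed bytes.

    The layer is bounded to `layer_length` bytes of the underlying stream,
    so data after the layer stays available to the enclosing parse.
    """
    bounded = io.BytesIO(underlying.read(layer_length))
    return gzip.GzipFile(fileobj=bounded, mode="rb")
```

Reading from the returned stream yields the decompressed bytes, and the underlying stream is left positioned just past the compressed region.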

If a transform requires text, then a dfdl:format encoding must be defined. For example, base64 is a transform that creates bytes from text. Hence, a layer encoding is needed to convert the underlying layer of bytes into text, then the base64 decoding occurs on that text, which produces the bytes of the overlying layer.
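
A small Python sketch of that two-step process follows, using the standard base64 module. The function names are hypothetical and the default encoding is only an example. It shows why the encoding matters: the same base64 text has different underlying bytes in different encodings.

```python
import base64


def base64_layer_parse(underlying: bytes, layer_encoding: str = "iso-8859-1") -> bytes:
    """Underlying bytes -> text (via the layer encoding) -> overlying bytes."""
    text = underlying.decode(layer_encoding)
    return base64.b64decode(text)


def base64_layer_unparse(overlying: bytes, layer_encoding: str = "iso-8859-1") -> bytes:
    """Overlying bytes -> base64 text -> underlying bytes (via the layer encoding)."""
    text = base64.b64encode(overlying).decode("ascii")
    return text.encode(layer_encoding)
```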

We think of some transforms as text-to-text. Line folding/unfolding is one such. Lines of text that are too long are wrapped by inserting a line ending followed by either a space or a tab. As a DFDL layer transform, line folding requires an encoding. The underlying bytes are decoded into characters according to the encoding. Those characters are divided into lines, and line unfolding (for parsing) is done to create longer lines of data. The resulting data is then encoded from characters back into bytes using the same encoding.
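
The decode, unfold, re-encode sequence can be sketched in Python as below. The unfolding rule used here (drop a CRLF that is immediately followed by a space or tab, keeping that whitespace) is an assumption modeled on IMF-style folding. Other conventions, such as iCalendar's, treat the whitespace differently. The function name is hypothetical.

```python
import re


def unfold_lines(underlying: bytes, layer_encoding: str = "iso-8859-1") -> bytes:
    """Parse-direction line unfolding as a bytes-to-bytes layer."""
    # 1. Decode the underlying bytes into characters.
    text = underlying.decode(layer_encoding)
    # 2. Unfold: remove each CRLF that is followed by a space or tab.
    #    (Assumption: the whitespace itself is kept.)
    unfolded = re.sub(r"\r\n(?=[ \t])", "", text)
    # 3. Encode the characters back into bytes with the same encoding.
    return unfolded.encode(layer_encoding)
```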

(There may be opportunities to optimize/shortcut these transformations if the overlying layer is the data layer for an element with scannable text representation using the same character set encoding. The re-conversion back to bytes, only to then decode those bytes into characters of the same encoding again, is overhead that can be avoided.)

DFDL can describe a mixture of character set decoding/encoding and binary value parsing/unparsing against the same underlying data representation; hence, the underlying data layer concept is always one of bytes.

(Note: bytes suffices even for mil-std-2045 which can hold a compressed VMF payload. This payload element is always byte aligned even in mil-std-2045, a very bit-oriented format. As of this writing we have no examples of layer transforms that require bit granularity; hence, this is a byte-oriented proposal.)

Daffodil parsing begins with a default standard data input stream. Unparsing begins with a default standard output stream. These are the ultimate underlying layer.

...

When unparsing, extra data may have to be created (padding/filling) to satisfy the layer unparsing algorithm. The DFDL schema for the xs:sequence content must create this padded/filled extra data. It is an Unparse Error if the data created when unparsing that is provided to the layer transform encoding algorithm does not satisfy its length requirements.

Parameterization and Computed Results (Checksums, CRC, Parity):

Layer transform algorithms can read and write DFDL Variables. Combining use of a layer with dfdl:newVariableInstance allows one to specify parameters to a particular layering transform, as well as to receive values back from the layer transform. This allows computation of things like checksums, CRCs, or parity across the contents of a layer. 
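
As a rough picture of this, the Python sketch below models DFDL variables as a plain dictionary. The layer reads one variable as a parameter and writes its computed result to another. The variable names `checksumModulus` and `checksum`, and the simple byte-sum algorithm, are hypothetical and chosen only for illustration.

```python
def checksum_layer(underlying: bytes, variables: dict) -> bytes:
    """Pass bytes through unchanged while computing a checksum over them."""
    # Parameter supplied to the layer through a variable.
    modulus = variables["checksumModulus"]
    # Result handed back from the layer through another variable.
    variables["checksum"] = sum(underlying) % modulus
    return underlying
```

The schema would then compare the computed value against a checksum element in the data, or use it to populate that element when unparsing.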

Examples using Data Layering

...

Code Block
languagexml
<annotation><appinfo source="http://www.ogf.org/dfdl/">
  <dfdl:defineFormat name="compressed">
    <dfdl:format dfdlx:layerTransform="gzip" dfdlx:layerLengthKind="explicit" />
  </dfdl:defineFormat>
</appinfo></annotation>

<sequence dfdl:ref="tns:compressed">
  <group ref="tns:compressedGroupContents" dfdlx:layerLength="{...}" />
</sequence>

...

Code Block
languagexml
<annotation><appinfo source="http://www.ogf.org/dfdl/">
    <dfdl:defineFormat name="compressed">
      <dfdl:format ref="ex:general" dfdlx:layerTransform="gzip" dfdlx:layerLengthKind="explicit" dfdlx:layerLengthUnits="bytes" />
    </dfdl:defineFormat>
    <dfdl:format ref="ex:general" />
</appinfo></annotation>

         ...
         <xs:sequence>
          <xs:element name="compressedPayloadLength" type="xs:int" dfdl:representation="binary"
            dfdl:outputValueCalc='{ dfdl:contentLength(../compressedPayload, "bytes") }' />

          <xs:element name="compressedPayload">
            <xs:complexType>
              <xs:sequence dfdl:ref="tns:compressed" dfdlx:layerLength="{ ../compressedPayloadLength }">
                <xs:group ref="tns:compressedGroupContents" />
              </xs:sequence>
            </xs:complexType>
          </xs:element>

          <xs:sequence>
            <xs:annotation>
              <xs:appinfo source="http://www.ogf.org/dfdl/">
                <dfdl:assert>{ compressedPayloadLength eq dfdl:contentLength(compressedPayload, "bytes") }</dfdl:assert>
              </xs:appinfo>
            </xs:annotation>
          </xs:sequence>
          <xs:element name="after" type="xs:string" dfdl:lengthKind="delimited" />
        </xs:sequence>
        ...

...

Code Block
languagexml
<annotation><appinfo source="http://www.ogf.org/dfdl/">
 <dfdl:defineFormat name="base64">
      <dfdl:format ref="ex:general" dfdlx:layerTransform="base64_MIME" dfdlx:layerLengthKind="boundaryMark" dfdlx:layerLengthUnits="bytes"
        dfdlx:layerEncoding="iso-8859-1" />
 </dfdl:defineFormat>
 <dfdl:defineFormat name="folded">
      <dfdl:format ref="ex:general" dfdlx:layerTransform="lineFolded_IMF" dfdlx:layerLengthKind="implicit" dfdlx:layerLengthUnits="bytes"
        dfdlx:layerEncoding="iso-8859-1" />
 </dfdl:defineFormat>
</appinfo></annotation>

    <xs:element name="root" dfdl:lengthKind="implicit">
      <xs:complexType>
        <xs:sequence dfdl:ref="tns:folded"> <!-- From here, everything is line-folded -->
          <xs:sequence>
            <xs:element name="marker" type="xs:string"
              dfdl:initiator="boundary=" dfdl:terminator="%CR;%LF;" />
            <xs:element name="contents" dfdl:lengthKind="implicit" 
              dfdl:initiator="{ fn:concat('--', ../marker, '%CR;%LF;') }">
              <xs:complexType>
                <xs:sequence>
                  <xs:element name="comment" type="xs:string" 
                    dfdl:initiator="Comment:%SP;" dfdl:terminator="%CR;%LF;" />
                  <xs:element name="contentTransferEncoding"  type="xs:string"
                    dfdl:initiator="Content-Transfer-Encoding:%SP;"
                    dfdl:terminator="%CR;%LF;" />
                  <xs:element name="body" dfdl:lengthKind="implicit" dfdl:initiator="%CR;%LF;">
                    <xs:complexType>
                      <xs:choice dfdl:choiceDispatchKey="{ ../contentTransferEncoding }">
                        <xs:sequence dfdl:choiceBranchKey="base64">
                          <xs:sequence dfdl:ref="tns:base64"
                            dfdlx:layerBoundaryMark="{ 
                              fn:concat(dfdl:decodeDFDLEntities('%CR;%LF;'),'--', ../../marker, '--')
                             }"> <!-- base64_MIME encoding for this sequence -->
                            <xs:element name="value" type="xs:string" />
                          </xs:sequence> <!-- END base64_MIME encoding --> 
                        </xs:sequence>
                        <!--
                           This is where other choice branches than base64 would go. 
                         -->
                      </xs:choice>
                    </xs:complexType>
                  </xs:element> <!-- END element body --> 
                </xs:sequence>
              </xs:complexType>
            </xs:element> <!-- END element contents -->
          </xs:sequence>
        </xs:sequence> <!-- END line folding -->
      </xs:complexType>
    </xs:element>

...

Code Block
languagexml
    <xs:complexType name="fileType">
      <!--
           first we have the base64 details
       -->
      <xs:sequence dfdl:ref="ex:base64" dfdlx:layerBoundaryMark="--END--">
        <xs:sequence>
          <!--
              now the gzip details, including the 4-byte gzLength element that stores how long
              the gzipped data is.
           -->
          <xs:element name="gzLength" type="xs:int" dfdl:representation="binary" dfdl:lengthKind="implicit"
            dfdl:outputValueCalc="{ dfdl:contentLength( ../data, 'bytes') }" />
          <!--
             this 'data' element is needed only because we have to measure how big it is when unparsing.
             If we were only concerned with parsing, we wouldn't need this extra 'data' element wrapped around
             the contents.
           -->
          <xs:element name="data" dfdl:lengthKind="implicit">
            <xs:complexType>
              <!--
                 now the gzipped layered sequence itself
               -->
              <xs:sequence dfdl:ref="ex:gzip" dfdlx:layerLength="{ ../gzLength }">
                <!--
                  finally, inside that, we have the original fileTypeGroup group reference.
                  -->
                <xs:group ref="ex:fileTypeGroup" />
              </xs:sequence>
            </xs:complexType>
          </xs:element>
        </xs:sequence>
      </xs:sequence>
    </xs:complexType>
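
To make the nesting in the fileType example concrete, here is a Python sketch of the byte-level round trip it describes: a base64 layer bounded by the "--END--" mark, containing a 4-byte length followed by a gzip layer of that length. The big-endian byte order of gzLength and the function names are assumptions. The schema's actual byte order depends on format properties not shown here.

```python
import base64
import gzip
import struct

BOUNDARY = b"--END--"


def unparse_file(content: bytes) -> bytes:
    """Innermost first: gzip the content, prefix its length, then base64 it."""
    gz = gzip.compress(content)
    inner = struct.pack(">i", len(gz)) + gz    # gzLength, then the gzip layer
    return base64.encodebytes(inner) + BOUNDARY  # base64 layer, then the mark


def parse_file(data: bytes) -> bytes:
    """Outermost first: undo base64, read gzLength, then gunzip that many bytes."""
    end = data.index(BOUNDARY)                 # bound the base64 layer
    inner = base64.b64decode(data[:end])
    (gz_length,) = struct.unpack(">i", inner[:4])
    return gzip.decompress(inner[4:4 + gz_length])
```

The base64 alphabet contains no "-" character, so the boundary mark cannot occur inside the encoded region.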

...

Code Block
languagexml
    <dfdl:defineFormat name="general">
      <dfdl:format ref="ex:GeneralFormat" lengthKind="delimited" outputNewLine="%CR;%LF;" dfdlx:layerEncoding="iso-8859-1"
        dfdlx:layerLengthUnits='bytes' />
    </dfdl:defineFormat>

    <dfdl:defineFormat name="base64">
      <dfdl:format ref="ex:general" dfdlx:layerTransform="base64_MIME" dfdlx:layerLengthKind="boundaryMark" />
    </dfdl:defineFormat>

    <dfdl:defineFormat name="gzip">
      <dfdl:format ref="ex:general" dfdlx:layerTransform="gzip" dfdlx:layerLengthKind="explicit" />
    </dfdl:defineFormat>

    <dfdl:format ref="ex:general" />

...

  • Allows stacking transforms one on top of another, so you can have base64-encoded compressed data as the payload representation of
    a child element within a larger element.

  • Allows specifying properties of the underlying data layers separately from the properties of the logical data.

  • Scopes the transforms over an xs:sequence body only.

  • Avoids new annotation elements with particulars about scoping.
  • Simple: doesn't add new functions for layering use when existing dfdl:contentLength will already handle it.
  • Complex cases - e.g., initiator before layered data, are handled by encapsulating the layered sequence in another sequence or element that carries the initiator.
  • Layer annotations concern only how the length of the layered region is determined and which algorithm transforms the data.
  • Layer transforms have mandatory layer alignment (1 byte for now)
  • Layer transforms can read DFDL variables for parameters, and write results to DFDL variables.

Open Design Issues

  • Debug and trace impact, and how to provide visibility to what is going on when an error occurs in the middle of parsing/unparsing when transforms are in use. E.g., the bit/byte position where a run time parse error occurs would be in some transformed stream, not the underlying stream. I suspect some experience with these transform concepts will be needed before there will be enough information to propose ideas here.

...

Notice the CRLFs at the end. The CRs are shown remapped to Private Use Area (PUA) U+E00D characters.

The DFDL schema for this, including the specification of the layering transform behaviors is below. This assumes a hypothetical layerLengthKind of 'pattern'. 

Code Block
<xs:schema ....>

 <dfdl:format separatorPosition="infix" dfdlx:layerLengthKind="boundaryMark" encoding="utf-8"
  occursCountKind="parsed" separator="" sequenceKind="ordered"/>

 <dfdl:defineFormat name="folded">
  <dfdl:format dfdlx:layerTransform="foldedLines" dfdlx:layerLengthKind="boundaryMark" dfdlx:layerEncoding="us-ascii"/>
  <!-- boundaryMark here means to enclosing end-of-data, as no boundary mark delimiter is defined. -->
</dfdl:defineFormat>

<dfdl:defineFormat name="qp">
  <dfdl:format dfdlx:layerTransform="quotedPrintable" dfdlx:layerLengthKind="pattern"
        dfdlx:layerLengthPattern="[^\n]*?(?=(?&lt;!=)\n)"/>
 
 <!-- QPs are terminated by a newline that is not preceded by an =. 
      This final newline is not consumed as part of the content. -->
  
 <!-- Alternatively, the QP transform itself can determine the length 
      by searching for this final newline (but leaving it there).
      In which case the layerLengthKind would be "implicit" -->
</dfdl:defineFormat>

 <xs:element name="VCalendar" dfdl:initiator="BEGIN:VCALENDAR%NL;" dfdl:terminator="END:VCALENDAR%NL; END:VCALENDAR">
  <xs:complexType>
    <xs:sequence dfdl:separator="%NL;" dfdl:sequenceKind="unordered">
      <xs:sequence dfdl:ref="tns:folded">
         <xs:element name="ProdID" type="xs:string" dfdl:initiator="PRODID:" minOccurs="0"/>
      </xs:sequence>
      <xs:element name="Version" type="xs:string" dfdl:initiator="VERSION:" minOccurs="0" />
      <xs:element name="VEvent" maxOccurs="unbounded" minOccurs="0" dfdl:occursCountKind="parsed"
        dfdl:initiator="BEGIN:VEVENT%NL;" dfdl:terminator="END:VEVENT">
        <xs:complexType>
          <xs:sequence dfdl:separator="%NL;" dfdl:sequenceKind="unordered">
            <xs:element name="DTStart" type="xs:string" dfdl:initiator="DTSTART:" />
            <xs:element name="DTEnd" type="xs:string" dfdl:initiator="DTEND:" />
            <!-- 
              content from here could have long lines, so must be folded 
            -->
            <xs:sequence dfdl:ref="tns:folded">
              <xs:element name="Location" type="xs:string" dfdl:initiator="LOCATION:" minOccurs="0"/>
              <xs:element name="UID" type="xs:string" dfdl:initiator="UID:" minOccurs="0"/>
              <xs:element name="Description" dfdl:initiator="DESCRIPTION:" minOccurs="0">
                <xs:complexType>
                  <xs:sequence>              
                   <xs:element name="Encoding" type="xs:string" 
                               dfdl:initiator="ENCODING=" dfdl:terminator=":" minOccurs="0" />
                     <xs:choice dfdl:choiceDispatchKey="{ if (fn:exists(./Encoding)) then ./Encoding else '' }">
                       <!-- 
                         we inspect the value of the Encoding element and decide what branch of the choice
                         based on it 
                        -->
                       <xs:sequence dfdl:choiceBranchKey="QUOTED-PRINTABLE"
                         dfdl:separator="" dfdl:sequenceKind="unordered">
                         <!--
                          Each branch starts with a distinct dummy element to satisfy the UPA rules of XML Schema
                         -->
                         <xs:element name="QP" type="xs:string" dfdl:inputValueCalc="{ '' }" />
                         <!--
                          Here notice that the layerRef for the qp data is scoped to just this inner element.
                         -->
                         <xs:sequence dfdl:ref="tns:qp">
                           <xs:element name="Value" type="xs:string"/>
                         </xs:sequence><!-- end layer quoted printable -->
                       </xs:sequence>
                       <!-- 
                          repeat the above pattern for the choice branches for the various encodings 
                        -->
                    </xs:choice>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>           
              <xs:element name="Summary" type="xs:string"  dfdl:initiator="SUMMARY:" minOccurs="0"/>
              <xs:element name="Priority" type="xs:string" dfdl:initiator="PRIORITY:" minOccurs="0" />
            </xs:sequence><!-- end folded layer -->
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>
</xs:schema>


...