This proposal was implemented in Daffodil 3.4.0.

Motivation

It is not uncommon for data formats to embed other formats. In some cases, this embedded data is XML that must be modeled as an xs:string. For example, a small schema to describe data with a length field and an XML payload might look like this:

<element name="format">
  <complexType>
    <sequence>
      <element name="length" type="xs:int" dfdl:terminator="%NL;" />
      <element name="xmlPayload" type="xs:string" dfdl:lengthKind="explicit" dfdl:length="{ ../length }" />
    </sequence>
  </complexType>
</element>

Using this schema to parse the following data:

52
<foo bar="baz">mixed content<qaz>complex</qaz></foo>

would result in the following infoset:

<format>
  <length>52</length>
  <xmlPayload>&lt;foo bar=&quot;baz&quot;&gt;mixed content&lt;qaz&gt;complex&lt;/qaz&gt;&lt;/foo&gt;</xmlPayload>
</format>

Because the xmlPayload element contains XML characters that are included in the infoset as a simple xs:string type, the characters must be escaped. Not only is this field difficult to read and interpret, it is also difficult to validate against an XML Schema that describes the embedded XML.

This page proposes a new feature to support handling XML data embedded in other data.

Implementation

This feature makes use of the Runtime Properties DFDL extension.

Elements that should be treated as embedded XML should specify their type as "xs:string" and set the "stringAsXml" runtime property to true. For example, the above schema would be changed to this:

<element name="xmlPayload" type="xs:string" dfdlx:runtimeProperties="stringAsXml=true" ... />

Because the type is "xs:string", these elements are parsed and unparsed using normal DFDL string processing logic, with Daffodil's internal infoset containing the string value.

After the element has been parsed, when we project the internal infoset into an XML infoset, elements with the stringAsXml=true runtime property are handled as follows:

  1. The element that normally has simple text content is changed to one that has complex content.
  2. A complex child element with QName "stringAsXml" that declares xmlns="" is added to this element. Setting the default namespace to the empty string helps avoid namespace conflicts between the infoset and the embedded XML in cases where the infoset defines a default namespace.
  3. We check that the string is well-formed XML, parse it as XML, and add the resulting XML tree as a child of the "stringAsXml" element. This also performs normalization such as removing DOCTYPEs, removing the XML declaration, and resolving XML escape sequences.
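
The three steps above can be sketched in a few lines of Python. This is only an illustration built on the standard library's xml.etree.ElementTree, not Daffodil's actual implementation; the function name project_string_as_xml is hypothetical.

```python
import xml.etree.ElementTree as ET


def project_string_as_xml(element_name, value):
    """Project a simple string value into a complex element holding parsed XML.

    Hypothetical helper for illustration only; not part of Daffodil.
    """
    # Step 3 (first half): parsing raises ET.ParseError if the string is not
    # well-formed. This parser also drops any XML declaration and resolves
    # escape sequences.
    embedded_root = ET.fromstring(value)

    # Step 1: the element becomes complex instead of holding text.
    element = ET.Element(element_name)

    # Step 2: add the "stringAsXml" wrapper with the default namespace reset.
    wrapper = ET.SubElement(element, "stringAsXml", {"xmlns": ""})

    # Step 3 (second half): attach the parsed tree under the wrapper.
    wrapper.append(embedded_root)
    return element


payload = '<foo bar="baz">mixed content<qaz>complex</qaz></foo>'
projected = project_string_as_xml("xmlPayload", payload)
print(ET.tostring(projected, encoding="unicode"))
```

Because the well-formedness check and the tree construction share one parse, a malformed payload fails before any element is created.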

With the above schema modified to add the stringAsXml runtime property, the data now parses to this infoset:

<format>
  <length>52</length>
  <xmlPayload>
    <stringAsXml xmlns="">
      <foo bar="baz">mixed content<qaz>complex</qaz></foo>
    </stringAsXml>
  </xmlPayload>
</format>

Not only is this much easier to read, but it is also much easier to validate the xmlPayload using normal XML validation tools.

Note that this embedded XML need not obey the normal limitations of a DFDL infoset--it can contain any well-formed XML, including attributes, mixed content, comments, etc.

For unparsing, we do the reverse:

  1. When projecting the incoming XML infoset to the internal infoset, search for elements that have the stringAsXml=true runtime property
  2. For these elements, find the "stringAsXml" complex child element
  3. Read the children of the "stringAsXml" element and serialize them to a string
  4. Set that string to the value of the simple element in the internal infoset

We then unparse that element using normal string unparse logic.
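
A minimal Python sketch of this reverse projection follows, again using xml.etree.ElementTree rather than Daffodil's own code. The helper name unproject_string_as_xml is hypothetical, and the sketch assumes the "stringAsXml" element holds exactly one element child and that the infoset has no default namespace.

```python
import copy
import xml.etree.ElementTree as ET


def unproject_string_as_xml(element):
    """Recover the string value from an element projected with stringAsXml.

    Hypothetical helper for illustration; assumes exactly one element child.
    """
    wrapper = element.find("stringAsXml")
    if wrapper is None:
        raise ValueError("missing stringAsXml child element")
    children = list(wrapper)
    if len(children) != 1:
        raise ValueError("expected exactly one root element")
    # Copy so the caller's tree is not modified, then drop the tail: it is
    # whitespace between the embedded root and </stringAsXml>, not payload.
    root = copy.deepcopy(children[0])
    root.tail = None
    return ET.tostring(root, encoding="unicode")


infoset = ET.fromstring(
    "<format>"
    "<length>52</length>"
    "<xmlPayload>"
    '<stringAsXml xmlns="">'
    '<foo bar="baz">mixed content<qaz>complex</qaz></foo>'
    "</stringAsXml>"
    "</xmlPayload>"
    "</format>"
)
value = unproject_string_as_xml(infoset.find("xmlPayload"))
print(value)
```

The returned string is what would be stored as the simple element's value in the internal infoset before normal string unparsing takes over.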

Known Limitations

Normalization

Some XML strings may not round trip exactly. This is because projecting a string into an XML infoset may normalize or remove constructs such as DOCTYPEs, XML declarations, escape characters, etc. When unparsing, these removed or modified constructs may not be recoverable, or constructs such as the XML declaration may always be added even if the original XML did not have one.
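
As a small illustration of this effect, the Python snippet below parses and re-serializes a string with the standard library's xml.etree.ElementTree. The exact normalization depends on the XML library in use, so this shows the general behavior and not Daffodil's specific output; the sample string is made up.

```python
import xml.etree.ElementTree as ET

# The original has an XML declaration and a numeric character reference.
original = '<?xml version="1.0"?><msg>&#65; &gt; B</msg>'

# Parsing drops the declaration and resolves &#65; to "A"; serializing
# re-escapes ">" but cannot know that "A" was once written as &#65;.
round_tripped = ET.tostring(ET.fromstring(original), encoding="unicode")

print(round_tripped)
print(round_tripped == original)
```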

Validation

The DFDL schema can no longer be used to validate the resulting XML infoset. This is because the DFDL schema defines the embedded XML element as a simple xs:string, but when projected into an XML infoset it becomes a complex element with complex content. Instead, a separate XSD must be created if validation of the resulting XML infoset is required. However, because the resulting XML is fairly limited and the algorithm well-defined, it is likely one could automate the creation of this alternative schema.

DFDL Expressions

Because the translation from xs:string to XML is done when projecting between the internal and XML infosets, and Daffodil only sees the internal infoset, things like DFDL expressions have no access to the content of this XML. From the viewpoint of expressions, there is only the single embedded XML element and it is only a string.

Backtracking

Because the translation from xs:string to XML is done when projecting between the internal and XML infosets, which may occur long after the element was parsed, there is no way to perform speculative parsing and backtrack if the field is not well-formed XML. Instead, an error is thrown during the projection, which is treated as a SchemaDefinitionError. An additional mechanism is required to enable checking for well-formedness while parsing.

There are likely a number of potential solutions for this. One option is a dfdl:isWellFormedXML(xs:string) function that could be evaluated in a discriminator applied to embedded XML elements. Alternatively, new properties could be added (e.g. dfdl:textStringRep="xml") which would behave just like normal xs:string processing but add a well-formed check immediately after parsing the content. Note that if we have a property like dfdl:textStringRep="xml", that could possibly be used in place of the stringAsXml runtime property. There are likely other approaches that are worth consideration, and could potentially be extended to support additional embedded string types (e.g. JSON).
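
To make the idea of a standalone well-formedness check concrete, here is a hedged Python sketch of the predicate such a function or property might evaluate. The name is_well_formed_xml is hypothetical and only mirrors the proposed function; no such function is claimed to exist in Daffodil.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(value):
    """Return True if value parses as a well-formed XML document.

    Hypothetical stand-in for the proposed check; illustration only.
    """
    try:
        ET.fromstring(value)
    except ET.ParseError:
        return False
    return True


print(is_well_formed_xml('<foo bar="baz">mixed content<qaz>complex</qaz></foo>'))
print(is_well_formed_xml("<foo><qaz></foo>"))
```

A check like this returns a plain boolean at parse time, so a failure could cause backtracking instead of an error during projection.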

Note that because the translation into XML is done when projecting from the internal infoset to the XML infoset, each individual projection type can decide how best to handle elements that specify the stringAsXml runtime property. For example, if we project our internal infoset to JSON, where XML isn't valid, then we can simply ignore the stringAsXml runtime property and treat it as a JSON string like usual. Or if we projected into an infoset of JDOM objects, we could parse the embedded XML into a JDOM tree and add that to the JDOM infoset.
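
The sketch below illustrates this per-projection choice in Python. Both function names and the string_as_xml flag are hypothetical; they stand in for whatever each infoset projection would consult, and are not Daffodil APIs.

```python
import json
import xml.etree.ElementTree as ET


def project_to_json(name, value, string_as_xml):
    # JSON cannot embed XML structure, so the flag is ignored and the
    # value stays an ordinary string.
    return json.dumps({name: value})


def project_to_xml(name, value, string_as_xml):
    element = ET.Element(name)
    if string_as_xml:
        # XML projection honors the flag and embeds the parsed tree.
        wrapper = ET.SubElement(element, "stringAsXml", {"xmlns": ""})
        wrapper.append(ET.fromstring(value))
    else:
        # Without the flag the value is plain text and gets escaped.
        element.text = value
    return ET.tostring(element, encoding="unicode")


print(project_to_json("xmlPayload", "<a>x</a>", True))
print(project_to_xml("xmlPayload", "<a>x</a>", True))
print(project_to_xml("xmlPayload", "<a>x</a>", False))
```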

Note that whatever the mechanism, it is likely important to have separate mechanisms for checking if a string is well-formed versus projecting that data into an infoset. For example, one could imagine extending this capability to support embedded JSON strings. We may want the capability to validate that the field is well-formed JSON but still project it into an XML infoset where it would be treated as a simple xs:string. So while well-formedness checking and projecting to an infoset likely overlap in some implementation details, the two mechanisms are orthogonal--you may only want to do one or the other depending on the type of the embedded data and the infoset type.

1 Comment

  1. In the case where the XML string is not well-formed, the Infoset Outputter that is creating XML could just escape the string for inclusion in XML, and output

    <dafx:notWellFormedXML xmlns:dafx="urn:org.apache.daffodil.dafx">...escaped string ...</dafx:notWellFormedXML>

    This would then be well-formed, but invalid, as this element would not be in the DFDL schema or the XSD schema at all.

    This seems like it could be helpful. Unparsing would also fail on it.