Combining DFDL Schemas Together

This situation comes up often. The technique described here is for DFDL schemas. There's a separate snippet about Combining XML Schemas Together, in case you are working just with XML.

Q: I have many different DFDL schemas for my different message types. These messages are mixed together in files. Each message is prefixed by a 4-byte integer type code which indicates the specific message format. How can I parse these files?

A: You create a combined DFDL schema that accepts ANY of the message types, and distinguishes them by the type code.

Let's assume you have 3 different message types, A, B, and C, each with its own DFDL schema file: a.dfdl.xsd, b.dfdl.xsd, and c.dfdl.xsd.

Each defines a message element and a target namespace for that message.

a.dfdl.xsd

<schema
 xmlns:a="... a ..."
 targetNamespace="... a ...">
 
<element name="a_msg" type="a:a_msg_type"/>

... plus type definitions
</schema>

b.dfdl.xsd

<schema
 xmlns:b="... b ..."
 targetNamespace="... b ...">
 
<element name="b_msg" type="b:b_msg_type"/>

... plus type definitions
</schema>

c.dfdl.xsd

<schema
 xmlns:c="... c ..."
 targetNamespace="... c ...">
 
<element name="c_msg" type="c:c_msg_type"/>

... plus type definitions
</schema>

Now we can "glue" these together into a single combined DFDL schema. It imports each of the A, B, and C schemas, and then defines an envelope element whose type uses the 4-byte type code element to choose among them.

Below I'm leaving out all but the important DFDL properties to make the example more understandable:

combined_abc.dfdl.xsd

<schema
  xmlns:abc="... abc ..."
  targetNamespace="... abc ...">

<import namespace="... a ..." schemaLocation="a.dfdl.xsd"/>
<import namespace="... b ..." schemaLocation="b.dfdl.xsd"/>
<import namespace="... c ..." schemaLocation="c.dfdl.xsd"/>

<element name="envelope" type="abc:envelopeType"/>

<complexType name="envelopeType">
  <sequence>
    <element name="typeCode" type="xs:unsignedInt"/>
    <choice dfdl:choiceDispatchKey='{ xs:string(typeCode) }'>
         <element ref="m:a_msg" dfdl:choiceBranchKey="0" xmlns:m="...a..."/>
         <element ref="m:b_msg" dfdl:choiceBranchKey="1" xmlns:m="...b..."/>
         <element ref="m:c_msg" dfdl:choiceBranchKey="2" xmlns:m="...c..."/>
    </choice>
  </sequence>
</complexType>   

</schema>
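The typeCode element above omits its representation properties. As a rough sketch of what a 4-byte binary integer might carry, the properties could look like the following. The dfdl:byteOrder value is purely an assumption (the data format described here does not state it), and in practice many of these properties would come from a shared dfdl:format definition rather than being written on the element.

```xml
<!-- Sketch only. byteOrder="bigEndian" is an assumption, not taken from the data format. -->
<element name="typeCode" type="xs:unsignedInt"
  dfdl:representation="binary"
  dfdl:binaryNumberRep="binary"
  dfdl:lengthKind="explicit"
  dfdl:length="4"
  dfdl:lengthUnits="bytes"
  dfdl:byteOrder="bigEndian"/>
```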

You now have a single schema that can be used to parse any of your messages. The result of the parse is always wrapped in the envelope element. If the typeCode is 0, the content is parsed as message type A; if it is 1, as B; if it is 2, as C; and so on for any further types you add.
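To illustrate, a message whose type code is 1 would produce an infoset shaped roughly like the sketch below. The namespace strings are the same elided placeholders used in the schemas, the contents of b_msg depend on b.dfdl.xsd and are not shown, and the unprefixed typeCode assumes the combined schema leaves elementFormDefault at its default of unqualified.

```xml
<abc:envelope xmlns:abc="... abc ..." xmlns:b="... b ...">
  <!-- Local element; unprefixed under the elementFormDefault assumption above. -->
  <typeCode>1</typeCode>
  <b:b_msg>
    <!-- children defined by b:b_msg_type in b.dfdl.xsd -->
  </b:b_msg>
</abc:envelope>
```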

There is one other interesting variation worth discussing. What if the type code isn't actually part of the data? What if it is being slapped on the front so you can distinguish the different kinds of data, but you don't want it to show up in the infoset, nor be re-created when you unparse the data?

In that case we use a DFDL feature called hidden groups to hide the type code. We'll also play some tricks to prevent the type code from unparsing at all. Here's how the combined schema changes:

combined_abc.dfdl.xsd

<schema
  xmlns:abc="... abc ..."
  targetNamespace="... abc ...">

<import namespace="... a ..." schemaLocation="a.dfdl.xsd"/>
<import namespace="... b ..." schemaLocation="b.dfdl.xsd"/>
<import namespace="... c ..." schemaLocation="c.dfdl.xsd"/>

<element name="envelope" type="abc:envelopeType"/>

<complexType name="envelopeType">
  <sequence>

    <sequence dfdl:hiddenGroupRef="abc:h_typeCode"/>   <!-- DIFFERENT: Use a HIDDEN GROUP. --> 

    <choice dfdl:choiceDispatchKey='{ xs:string(typeCode) }'>
         <element ref="m:a_msg" dfdl:choiceBranchKey="0" xmlns:m="...a..."/>
         <element ref="m:b_msg" dfdl:choiceBranchKey="1" xmlns:m="...b..."/>
         <element ref="m:c_msg" dfdl:choiceBranchKey="2" xmlns:m="...c..."/>
    </choice>
  </sequence>
</complexType>   

<!--
This hidden group at parse time will populate the typeCode element.
At unparse time the elements in the hidden group are not part of the
infoset, so when unparsing this, the unparser will have nothing
to go on (dfdl:choiceDispatchKey is only evaluated when parsing, not
when unparsing) and will choose the first branch of the choice here, which
contains nothing so nothing gets unparsed. 
-->

<group name="h_typeCode">
  <choice dfdl:choiceDispatchKey='{ "parse" }'>
     <sequence dfdl:choiceBranchKey="unparse">
       <!-- Don't unparse anything. 
            By default, when unparsing a choice, 
            the first branch is taken -->
     </sequence>
     <sequence dfdl:choiceBranchKey="parse">
       <element name="typeCode" type="xs:unsignedInt"
          dfdl:outputValueCalc='{ 0 }'>
         <!--
         The dfdl:outputValueCalc above is only present to satisfy the
         Daffodil schema compiler. All elements in hidden groups must have
         a way to unparse, which means they must be defaultable or have
         a dfdl:outputValueCalc; otherwise the schema will not compile.
         Daffodil's compiler is not quite smart enough to realize this
         element will never be unparsed, so it insists on having
         a dfdl:outputValueCalc even though it will never be used.
         -->
       </element>
     </sequence>
  </choice>
</group>


</schema>

So when parsing data with this schema there will be no typeCode element in the XML infoset. When unparsing, the 4-byte type code will not be written to the output.
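As a sketch of the effect, input carrying type code 1 would yield an infoset shaped roughly like this, with the chosen message directly under the envelope. The namespace strings are the same elided placeholders used in the schemas, and the contents of b_msg depend on b.dfdl.xsd and are not shown.

```xml
<abc:envelope xmlns:abc="... abc ..." xmlns:b="... b ...">
  <!-- No typeCode element here: it was parsed inside the hidden group. -->
  <b:b_msg>
    <!-- children defined by b:b_msg_type in b.dfdl.xsd -->
  </b:b_msg>
</abc:envelope>
```

Unparsing this infoset would write only the bytes of the B message, without the 4-byte prefix.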

This schema is deliberately asymmetric between what it parses and what it unparses. Usually we want what the parser accepts and what the unparser produces to be identical, or nearly so. In this case we want to parse and use the type code, but then drop it on unparse.
