This situation comes up often. The technique described here is for DFDL schemas. There's a separate snippet about Combining XML Schemas Together, in case you are working just with XML or have both DFDL-described and XML data in your system. 

Q: I have many different DFDL schemas for my different message types. These messages are mixed together in files. Each message is prefixed by a 4-byte integer type code which indicates the specific message format. How can I parse these files?

A: You create a combined DFDL schema that accepts ANY of the message types, and distinguishes them by the type code. 

Caveat: For this to work, the schemas being combined must have target namespaces, and they all must be distinct target namespaces.

Issue: DAFFODIL-2916 is a bug (in Daffodil 3.8.0 and prior) which, until fixed, prevents combining from working for schemas with no namespace.

Hint: if DFDL schemas follow the style suggestions in DFDL Schema Style Guide, they are much easier to combine/compose together. 

Let's assume you have 3 different message types, A, B, and C. Each has a DFDL schema file a.dfdl.xsd, b.dfdl.xsd, and c.dfdl.xsd. 

Each defines a message element and a target namespace for that message.  We're not bothering to show DFDL properties in these little examples because they don't matter for purposes of what we're trying to illustrate:

a.dfdl.xsd
<schema    
 xmlns="http://www.w3.org/2001/XMLSchema"
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 xmlns:a="urn:a"
 targetNamespace="urn:a">
 
<element name="a_msg" type="a:a_msg_type"/>

... plus type definitions
</schema>


b.dfdl.xsd
<schema    
 xmlns="http://www.w3.org/2001/XMLSchema"
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 xmlns:b="urn:b"
 targetNamespace="urn:b">
 
<element name="b_msg" type="b:b_msg_type"/>

... plus type definitions
</schema>


c.dfdl.xsd
<schema    
 xmlns="http://www.w3.org/2001/XMLSchema"
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 xmlns:c="urn:c"
 targetNamespace="urn:c">
 
<element name="c_msg" type="c:c_msg_type"/>

... plus type definitions
</schema>


Now we can "glue" these together into a single combined DFDL schema. This imports each of the A, B, and C schemas, and then defines a schema for the envelope message type which uses the 4-byte type code element to distinguish these.

Below I'm leaving out all but the important DFDL properties to make the example more understandable:

combined_abc.dfdl.xsd
<schema
  xmlns="http://www.w3.org/2001/XMLSchema"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:a="urn:a"
  xmlns:b="urn:b"
  xmlns:c="urn:c"
  xmlns:abc="urn:abc"
  xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
  targetNamespace="urn:abc">

<import namespace="urn:a" schemaLocation="a.dfdl.xsd"/>
<import namespace="urn:b" schemaLocation="b.dfdl.xsd"/>
<import namespace="urn:c" schemaLocation="c.dfdl.xsd"/>

<element name="envelope" type="abc:envelopeType"/>

<complexType name="envelopeType">
  <sequence>
    <element name="typeCode" type="xs:unsignedInt"/>
    <choice dfdl:choiceDispatchKey='{ xs:string(typeCode) }'>
      <element ref="a:a_msg" dfdl:choiceBranchKey="0" />
      <element ref="b:b_msg" dfdl:choiceBranchKey="1" />
      <element ref="c:c_msg" dfdl:choiceBranchKey="2" />
    </choice>
  </sequence>
</complexType>   

</schema>
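
The example above omits the properties that describe the physical layout of the type code. As a rough sketch only, a 4-byte binary integer might be described as shown here. The byte order is an assumption (the question does not state it), and in a real schema most of these properties would usually come from a shared dfdl:format default rather than being repeated on the element. The namespace declarations are repeated on the element only so that the fragment stands alone.

```xml
<!-- Sketch: typeCode as a 4-byte binary integer.
     dfdl:byteOrder="bigEndian" is an assumption; use whatever your data requires. -->
<element name="typeCode" type="xs:unsignedInt"
  xmlns="http://www.w3.org/2001/XMLSchema"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
  dfdl:representation="binary"
  dfdl:binaryNumberRep="binary"
  dfdl:lengthKind="explicit"
  dfdl:length="4"
  dfdl:lengthUnits="bytes"
  dfdl:byteOrder="bigEndian"/>
```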

You now have one single schema that can be used to parse any of your messages. The result of the parse is always wrapped in the envelope element: if the typeCode is 0 the payload is parsed as message type A, if 1 as B, if 2 as C, and so on.
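
To make that concrete, here is roughly what the XML infoset could look like for a message whose type code is 1. This is an illustrative sketch, not captured output: the exact prefixes and formatting depend on the tool producing the infoset, and the comment stands in for whatever b:b_msg_type defines, which this snippet does not show. The typeCode element carries no namespace prefix because it is a local element and the combined schema does not set elementFormDefault.

```xml
<abc:envelope xmlns:abc="urn:abc" xmlns:b="urn:b">
  <typeCode>1</typeCode>
  <b:b_msg>
    <!-- contents defined by b:b_msg_type -->
  </b:b_msg>
</abc:envelope>
```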

A Variation: The Type Code isn't "Real": An Asymmetric DFDL Schema

There is one other interesting variation worth discussing. What if the type code isn't actually part of the data? What if it is being slapped on the front so you can distinguish the different kinds of data, but you don't want it to show up in the infoset, nor be re-created when you unparse the data?

In that case we use a DFDL feature called hidden groups to hide the type code. We'll also play some tricks to prevent the type code from unparsing at all. Here's how the combined schema changes:


combined_abc.dfdl.xsd
<schema
  xmlns="http://www.w3.org/2001/XMLSchema"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:a="urn:a"
  xmlns:b="urn:b"
  xmlns:c="urn:c"
  xmlns:abc="urn:abc"
  xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
  targetNamespace="urn:abc">

<import namespace="urn:a" schemaLocation="a.dfdl.xsd"/>
<import namespace="urn:b" schemaLocation="b.dfdl.xsd"/>
<import namespace="urn:c" schemaLocation="c.dfdl.xsd"/>

<element name="envelope" type="abc:envelopeType"/>

<complexType name="envelopeType">
  <sequence>
    <sequence dfdl:hiddenGroupRef="abc:h_typeCode"/>   <!-- DIFFERENT: Use a HIDDEN GROUP. -->

    <choice dfdl:choiceDispatchKey='{ xs:string(typeCode) }'>
      <element ref="a:a_msg" dfdl:choiceBranchKey="0" />
      <element ref="b:b_msg" dfdl:choiceBranchKey="1" />
      <element ref="c:c_msg" dfdl:choiceBranchKey="2" />
    </choice>
  </sequence>
</complexType>   

<!--
Up to this point, only one line of the schema is different: the one that
references the hidden group below.

At parse time, this hidden group populates the typeCode element.

The trick that keeps it from unparsing is this:
Elements in a hidden group are not part of the infoset, so at unparse time
the unparser has no infoset element telling it which branch of the hidden
choice to take. The dfdl:choiceDispatchKey is evaluated only when parsing,
not when unparsing, so it does not help either. The unparser therefore
takes the first branch of the choice, which contains nothing, so nothing
gets unparsed.
-->

<group name="h_typeCode">
  <choice dfdl:choiceDispatchKey='{ "parse" }'>
     <sequence dfdl:choiceBranchKey="unparse">
       <!-- Don't unparse anything. 
            By default, when unparsing a choice, 
            the first branch is taken -->
     </sequence>
     <sequence dfdl:choiceBranchKey="parse">
       <element name="typeCode" type="xs:unsignedInt"
          dfdl:outputValueCalc='{ 0 }'>
         <!--
         And there is this one additional annoying detail:

         The dfdl:outputValueCalc above is only present to satisfy the
         Daffodil schema compiler. All elements in hidden groups must have
         a way to unparse, which means they are defaultable or have
         a dfdl:outputValueCalc, or the schema will not compile.
         Daffodil's compiler is not quite smart enough to realize this
         element will never be unparsed, so it insists on having
         a dfdl:outputValueCalc even though the property will never be used.
         -->
       </element>
     </sequence>
  </choice>
</group>


</schema>

So when parsing data with this schema there will be no typeCode element in the XML infoset. When unparsing, the 4-byte type code will not be written.
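
For illustration, with the hidden-group version of the schema, the XML infoset for a message whose type code is 1 would look roughly like the sketch below. This is a hand-written example, not captured output. The comment stands in for the content defined by b:b_msg_type, which this snippet does not show. The type code is still read and used for dispatch, but no typeCode element appears.

```xml
<abc:envelope xmlns:abc="urn:abc" xmlns:b="urn:b">
  <b:b_msg>
    <!-- contents defined by b:b_msg_type -->
  </b:b_msg>
</abc:envelope>
```

Unparsing this infoset would emit only the bytes of the b_msg payload, with no 4-byte prefix.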

This schema is, on purpose, asymmetric between what it parses and what it unparses. Often we want what the parser accepts and what the unparser produces to be identical or nearly so. In this case we want to parse the type code, use it, and then drop it on unparse.
