This page collects notes on how to create DFDL schemas in a way that keeps you out of various XSD snarls and complexities.

As of this writing (2023-02-13) many of the DFDL Schemas we have created do not follow all these conventions perfectly. We have learned as we have gone along.

This set of notes represents best practices after learning from many debugging exercises. 

Avoid Element References and Global Element Declarations

DFDL Schemas should use elementFormDefault="unqualified" (which is the default for XML Schemas). There's no need for every child element to have a namespace (and hence a prefix) when the tree they are part of has a namespace prefix somewhere closer to the root that makes the identity of those child elements unambiguous.
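As a rough illustration, an XML infoset produced from a schema that uses elementFormDefault="unqualified" might look like the sketch below. The element names (mySchema, header, length, payload), the values, and the namespace URI are assumptions made up for this example. Only the root element carries a prefix; the local child elements are in no namespace.

```xml
<!-- Hypothetical instance document; all names and values are illustrative only. -->
<msns:mySchema xmlns:msns="urn:mySchemaNamespace">
  <!-- Local elements are unqualified: no prefix and no namespace. -->
  <header>
    <length>12</length>
  </header>
  <payload>48656C6C6F</payload>
</msns:mySchema>
```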

Global elements should be defined only as an assistance for testing the schema. 

Those elements should do nothing more than use a complex type definition.

DFDL schemas should not use element references. 

The content of the schema should always be in a complex type definition. This gives the schema user the choice of what they want to call their elements, whether they want a global element, or to use the schema as a child element within a larger structure, without the burden of introducing global namespace prefix management to their schemas. 

Defining only global types and groups, and leaving global elements to the end user of the schema, provides greater flexibility.

Hence, the standard start of a DFDL schema is going to be:

<schema 
  targetNamespace="urn:mySchemaNamespace"
  xmlns:msns="urn:mySchemaNamespace" 
  ... >

... import/include and top level format annotations...

<!-- 
  This one-liner below is the ONLY global element in the entire schema, 
  and schema users can always ignore it and just use the complex type, so 
  they can call the element in their schema whatever they want.

  At the same time this single root allows users to easily 
  test the schema with the daffodil CLI or daffodil-vscode extension, 
  without having to specify a root element in a separate file. 
--> 

<element name="mySchema" type="msns:mySchemaType"/> 

<complexType name="mySchemaType">
     ... the real schema contents are all reachable from here. ...
</complexType>

... other types and groups ...

</schema>

Included files, and imported files that are part of the same schema project, should either have no global elements at all, or one, like the above, to facilitate testing.

But they should always include equivalent complex type or group definitions allowing those global elements to be bypassed/ignored. 

Rationale: This makes schemas more flexible for reuse because the schema imposes no element names that the schema user cannot avoid if they so choose.
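A minimal sketch of what such reuse could look like from the schema user's side follows. The consuming namespace, its env prefix, the element names envelope and body, and the file path are all assumptions for illustration; DFDL format annotations are omitted and assumed to be supplied elsewhere.

```xml
<!-- Hypothetical consuming schema; namespace, prefix, names, and path are assumptions. -->
<schema xmlns="http://www.w3.org/2001/XMLSchema"
  xmlns:msns="urn:mySchemaNamespace"
  xmlns:env="urn:example.com:schema:dfdl:envelope:env"
  targetNamespace="urn:example.com:schema:dfdl:envelope:env"
  elementFormDefault="unqualified">

  <import namespace="urn:mySchemaNamespace"
    schemaLocation="/com/example/schema/dfdl/mySchema.dfdl.xsd"/>

  <!-- Single global element, present only to make testing convenient. -->
  <element name="envelope" type="env:envelopeType"/>

  <complexType name="envelopeType">
    <sequence>
      <!-- Local element declaration: the consumer picks the name "body"
           and bypasses the global element of the imported schema. -->
      <element name="body" type="msns:mySchemaType"/>
    </sequence>
  </complexType>
</schema>
```

The consumer refers only to the complex type, so the name of the imported schema's global element never appears in the consumer's data.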

A second global element can also sometimes be useful for testing against files that contain multiple data items. This second global element would almost always look like:

<element name="mySchemaFile">
  <complexType>
    <sequence>
      <element name="mySchema" type="msns:mySchemaType" maxOccurs="unbounded" dfdl:occursCountKind='implicit'/>
    </sequence>
  </complexType>
</element>

Note how this does not have an element reference in it, but a local element declaration for the mySchema child element. 
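For contrast only, the discouraged form with an element reference would look roughly like the sketch below. It is a fragment; the prefix bindings on the outer element are added just so it stands alone.

```xml
<!-- Discouraged form, shown only for contrast with the local declaration above. -->
<element name="mySchemaFile"
  xmlns="http://www.w3.org/2001/XMLSchema"
  xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
  xmlns:msns="urn:mySchemaNamespace">
  <complexType>
    <sequence>
      <!-- ref= ties the child to the global element, so the child becomes
           namespace-qualified and its name can no longer be chosen locally. -->
      <element ref="msns:mySchema" maxOccurs="unbounded" dfdl:occursCountKind="implicit"/>
    </sequence>
  </complexType>
</element>
```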

Lastly, no other structured data system has anything like element references, so in the interests of being able to use DFDL and transform data into the data models used by other processing fabrics, element references should be avoided. 

Summary: schema files should have zero, one, or at most two global element declarations in them, and those are there for convenient testing, and may be ignored entirely when the schema is reused.

Namespaces, Namespace Prefixes, Import, Include, and the schemaLocation Attribute

Namespaces and namespace prefixes in XSD seem simple enough until you start building a very large DFDL schema from multiple disjoint component schemas that are intended for reuse.

DFDL does not have any namespace features of its own; it simply passes through XML Schema's namespace and prefix system.

(Note however: DFDL does not implement the XML Schema "redefine" construct, but neither do many regular XML Schema software platforms.)

Without following a reasonable set of standard practices, it is quite easy to end up in what we call namespace hell. In this situation you get all sorts of diagnostic messages about symbols not being defined, even though your import/include files seem to be well specified. Debugging this can be difficult, and you typically end up with roughly the arrangement that the guidance below prescribes, only after much work and wasted time.

It's also the case that many DFDL applications do not use XML as their output data format. JSON is very popular also, and direct connectors to other data transformation and processing fabrics are in the works which have their own particular data models. XML's data model, and namespace system, really have no corresponding features in many of these other systems like JSON. (E.g., JSON does not have namespaces.) 

The practices here ensure a DFDL schema's use of namespaces does not prevent parser/unparser creation/consumption of JSON, or other kinds of data output, using a DFDL processor.

Staying out of Namespace Hell

The first set of simple rules for staying out of trouble is this:

  • For every target namespace, choose a unique prefix to use everywhere in your schema to refer to that namespace. 
    • The practice of using xmlns:tns prefix within schemas to refer to "this target namespace" should not be used.
  • Schema definitions should, with few exceptions, have a target namespace.
    • Issue DAFFODIL-2916 (until fixed) means reuse of no-namespace schemas is nearly impossible.
  • A default namespace should be used only for the XML Schema namespace to avoid having to type "xs:" or "xsd:" everywhere. 

Different schema projects can use different prefixes, but within one schema project one namespace should mean one prefix globally across all files. 
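The prefix rules above might translate into a schema header like the following sketch. The type content (a single count element) is an assumption added only to make the example complete; DFDL format annotations are omitted.

```xml
<schema xmlns="http://www.w3.org/2001/XMLSchema"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
  xmlns:msns="urn:mySchemaNamespace"
  targetNamespace="urn:mySchemaNamespace"
  elementFormDefault="unqualified">

  <!-- The default namespace is XML Schema, so schema constructs need no prefix.
       xs is also bound so built-in types can be written as elsewhere on this page. -->

  <!-- msns is the one prefix used for urn:mySchemaNamespace in every file;
       there is no xmlns:tns binding. -->
  <element name="mySchema" type="msns:mySchemaType"/>

  <complexType name="mySchemaType">
    <sequence>
      <!-- Illustrative content only. -->
      <element name="count" type="xs:int"/>
    </sequence>
  </complexType>
</schema>
```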

The most critical guidance rules are these:

  • For every target namespace, one file must be the single distinguished one for that namespace. It is the one-and-only schemaLocation file that is named in an xs:import anywhere that namespace must be imported.
  • That distinguished file must xs:include all the other files that share that target namespace.

Note that cyclic usage between namespaces is allowed. Two schema files can xs:import each other, so long as they have different target namespaces.

However, xs:include relationships cannot be cyclic.
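The two critical rules can be sketched as a single distinguished file for a namespace. The file names mySchemaPart1.dfdl.xsd and mySchemaPart2.dfdl.xsd, the paths, and the type names part1Type and part2Type are hypothetical.

```xml
<!-- Hypothetical distinguished file for urn:mySchemaNamespace, e.g. mySchema.dfdl.xsd. -->
<schema xmlns="http://www.w3.org/2001/XMLSchema"
  xmlns:msns="urn:mySchemaNamespace"
  targetNamespace="urn:mySchemaNamespace"
  elementFormDefault="unqualified">

  <!-- Every other file that shares this target namespace is included here. -->
  <include schemaLocation="/com/example/schema/dfdl/mySchemaPart1.dfdl.xsd"/>
  <include schemaLocation="/com/example/schema/dfdl/mySchemaPart2.dfdl.xsd"/>

  <element name="mySchema" type="msns:mySchemaType"/>

  <complexType name="mySchemaType">
    <sequence>
      <!-- part1Type and part2Type are assumed to be defined in the included files. -->
      <element name="part1" type="msns:part1Type"/>
      <element name="part2" type="msns:part2Type"/>
    </sequence>
  </complexType>
</schema>
```

Under this arrangement, any schema in another namespace that needs urn:mySchemaNamespace would import it with a schemaLocation that names only this distinguished file, never one of the included parts.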

The rest of this section is effectively just providing rationale for the above guidance. 

Things that Don't Work

Sometimes people want to decompose one namespace into several sub-units, and import only the symbols for the features of that namespace they actually use. So they expect they can import a namespace by importing only a specific file that contributes part of the definitions for that namespace.

This does not work. To achieve that sort of modularity you must decompose into different namespaces.
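As a sketch of that decomposition, suppose the header definitions are moved into a namespace of their own. The namespace URIs, the prefixes mshdr and app, the type names, and the file path below are all hypothetical.

```xml
<!-- Hypothetical: the header definitions live in their own namespace
     with their own distinguished file. -->
<schema xmlns="http://www.w3.org/2001/XMLSchema"
  xmlns:mshdr="urn:mySchemaNamespace:header"
  xmlns:app="urn:myApplicationNamespace"
  targetNamespace="urn:myApplicationNamespace"
  elementFormDefault="unqualified">

  <!-- Import only the namespace that is needed, via its one distinguished file. -->
  <import namespace="urn:mySchemaNamespace:header"
    schemaLocation="/com/example/schema/dfdl/mySchemaHeaderNamespace.dfdl.xsd"/>

  <complexType name="appRecordType">
    <sequence>
      <!-- headerType is assumed to be defined in the imported namespace. -->
      <element name="header" type="mshdr:headerType"/>
    </sequence>
  </complexType>
</schema>
```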

The best mental model to understand this is: imagine all the schemaLocation attributes were erased from all xs:import statements. Imagine the namespace URIs are actually being used to retrieve the namespace file. With this erasure you can only have one place where everything is getting that namespace, because that namespace is defined by its URI, and that's also how you retrieve it.

That's how XSD works. One namespace == one source == one file providing its definition. 

Some people actually create schemas this way, without schemaLocation on xs:import statements. Then they use an XML Catalog to provide the 1 to 1 mapping of namespaces to the single distinguished file that provides its definition.
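For readers unfamiliar with the idea, a catalog in the OASIS XML Catalogs format could express that 1 to 1 mapping roughly as follows. The file path is an assumption, and whether a particular tool consults uri entries to resolve namespace imports depends on that tool's resolver.

```xml
<!-- Hypothetical catalog file; the path in the uri attribute is assumed. -->
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- One entry per namespace: the namespace URI, then its single defining file. -->
  <uri name="urn:mySchemaNamespace"
       uri="com/example/schema/dfdl/mySchema.dfdl.xsd"/>
</catalog>
```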

We have not used XML Catalogs much and they are not recommended, as they introduce their own complexities.

Going back to practices for xs:import, adding back in schemaLocation attributes, it should be clear now that all across a schema, there is a 1 to 1 association of namespaces to a specific schemaLocation. So every xs:import anywhere in your schema, for a given namespace X must provide the same exact schemaLocation Y. 

If you have, anywhere in your schema:

<xs:import namespace="ns" schemaLocation="location"/>

then for any specific ns, the location must always be the exact same location. 

What is the problem with the tns prefix?

It often results in bigger XML due to the need to have xmlns:tns="...." rebindings in multiple places in XML instance documents. When these are deep in the element nest they can be hard to find.

It also makes XML instance documents harder for people to interpret. Deep inside an XML document an element may appear as tns:someName, but the binding of the tns prefix is far away (textually, for example many pages earlier, and not necessarily at the start), and so is not clear in that context. Basically, when looking at an XML instance document, a person gets very little information from a tns prefix.
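A made-up instance document shows the effect. The element names, values, and namespace URIs are purely illustrative, and the sketch assumes schemas whose child elements are namespace-qualified (for example, through element references).

```xml
<!-- Hypothetical instance document; all names and URIs are illustrative. -->
<tns:order xmlns:tns="urn:example.com:schema:orders">
  <tns:customer>
    <!-- ...imagine many pages of content here... -->
    <tns:shipment xmlns:tns="urn:example.com:schema:shipping">
      <!-- From here down, tns means the shipping namespace, not the orders one. -->
      <tns:carrier>ACME</tns:carrier>
    </tns:shipment>
  </tns:customer>
</tns:order>
```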

If tns prefixes are used only for type and group references, and never for element references, one might find that this reduces some editing, and as element references are generally frowned upon this should not come up often. However, if the prefix definition xmlns:tns="...." appears on the xs:schema element even when some other prefix is also bound to the same namespace, there is no telling whether a given XSD tool will use tns or the other prefix when identifying the root element in XML instance documents. So even if the schema author only ever uses tns for type and group references, the tns prefix can still show up and cause (albeit minor) confusion in XML instance documents.

Best practice is to avoid this tns convention entirely.

Suggested Conventions to use for Namespace URIs and Prefixes

Suppose you work for example.com, and you have XML Schemas, DFDL Schemas, and JSON schemas. 

Let's suppose you have a DFDL schema for a format named "ebx data".  Suppose there are various versions of this format.

The following is a useful namespace URI and prefix definition for this format:

xmlns:ebx="urn:example.com:schema:dfdl:ebxData:ebx"

The notion here is that the URI is a "URN" which means it is not an address to retrieve from. It is unique to your company, identifies it as a DFDL schema namespace, contains the format name, and the URI explicitly contains the suggested prefix to be used for this (by convention). 

Note also that there is no version information at the end of this URI. This turns out to be a best practice.

Suppose someone sees this namespace URI alone, as in an import statement like this:

<xs:import namespace="urn:example.com:schema:dfdl:ebxData:ebx" 
  schemaLocation="/com/example/schema/dfdl/ebxData.dfdl.xsd"/> 

From this one automatically knows the prefix to use by convention, because it is the last part of the namespace URI.


These conventions for the schemaLocation are also useful as they provide something like the Java package namespaces to avoid name collisions. 


Versioning - In the Infoset/Data, Not the Namespace URI

It's become clear in XML Schemas (not just DFDL) that having version specific namespace URIs causes difficulty. 

One issue is that the path expressions that navigate such elements become version specific even if the elements they are ultimately accessing are common to multiple versions. Such paths are monomorphic to specific versions. It is much nicer if path expressions are as polymorphic across versions as possible. 

Hence, define an element in your schema to hold the version information. Don't append a version number to a namespace URI. 
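A minimal sketch of carrying the version in the data is shown below. The element names version and c, the type name ebxDataType, and the lengths are assumptions; representation and all other DFDL properties are assumed to come from a top-level format annotation that is not shown.

```xml
<!-- Fragment of a hypothetical ebxData schema; prefixes are bound here
     only so the fragment stands alone. -->
<complexType name="ebxDataType"
  xmlns="http://www.w3.org/2001/XMLSchema"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">
  <sequence>
    <!-- The version travels in the infoset, not in the namespace URI. -->
    <element name="version" type="xs:unsignedInt" dfdl:length="2"/>
    <!-- The path to c is the same whichever version the data declares. -->
    <element name="c" type="xs:int" dfdl:length="7"/>
  </sequence>
</complexType>
```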


Avoid Child Elements with the Same Name

XML Schema has a data model with some flexibility needed only for markup languages intended for human authoring. 

DFDL uses XML Schema to describe structured data, where this flexibility is not needed. 

DFDL omits many XML Schema constructs, but DFDL version 1.0 still allows some things that are best avoided to ensure the ability to interoperate with other data models.

One such feature is the ability in XML Schema to have multiple child elements with the same name. So long as it is unambiguous what element declaration is intended, XML Schema allows things like:

...
<element name="foo" ..../>
<element name="bar" ..../>
<element name="foo" ..../>

This is allowed because the element bar separates the two different declarations of the foo element; hence, when parsing XML, the first foo declaration is used until a bar element is encountered, and after that the second foo declaration is used.

That's all interesting and useful for markup languages, but no other structured data system allows this. Hence, it is best avoided to enable DFDL schemas to be interfaced to data systems having other data models. 
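One possible way to avoid the situation, assuming the two foo children can simply be given distinct names, is sketched here. The names recordType, firstFoo, and secondFoo and the xs:int types are hypothetical.

```xml
<!-- Fragment; prefixes are bound here only so it stands alone. -->
<complexType name="recordType"
  xmlns="http://www.w3.org/2001/XMLSchema"
  xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <sequence>
    <!-- Each child of the parent has a name that is unique among its siblings. -->
    <element name="firstFoo" type="xs:int"/>
    <element name="bar" type="xs:int"/>
    <element name="secondFoo" type="xs:int"/>
  </sequence>
</complexType>
```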

Avoid Anonymous Choices

XML Schema allows a choice to be anonymous within the data model of an element. For example:

<element name="myElement">
  <complexType>
    <sequence>
       ... various elements ...
       <choice>
         ... choice branches ...
       </choice>
       ... various more elements
    </sequence>
  </complexType>
</element>

The choice above appears in the middle of a sequence group, with elements and/or other groups before and after it. Note that there is no element name associated with the choice. Rather, in XML data, the choice branches would contain elements and these would appear as direct children of the myElement parent element.

Many other data modeling languages do not have this capability. They require choices to be named. 

Hence, this is to be avoided. Choice groups should always be the model-groups of named elements.  

This is analogous to the DFDL restriction on optional/recurring data: in DFDL, only elements, not sequences/choices, can be optional or recurring/array.

By using only named choices, one ensures one's DFDL schema can be mapped to the data structures of other data systems that do not allow anonymous choices.
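A named version of the earlier anonymous choice might look like the following sketch. The element names before, body, optionA, optionB, and after, and their types, are assumptions for illustration.

```xml
<!-- Fragment; prefixes are bound here only so it stands alone. -->
<element name="myElement"
  xmlns="http://www.w3.org/2001/XMLSchema"
  xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <complexType>
    <sequence>
      <element name="before" type="xs:int"/>
      <!-- The choice is the model group of the named element "body",
           so the chosen branch appears as a child of body. -->
      <element name="body">
        <complexType>
          <choice>
            <element name="optionA" type="xs:int"/>
            <element name="optionB" type="xs:string"/>
          </choice>
        </complexType>
      </element>
      <element name="after" type="xs:int"/>
    </sequence>
  </complexType>
</element>
```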


Versioning and Choices - Using Marker Elements

Given two different versions of a schema, consider:

<choice>
  <element name="v1">
     <complexType>
        <sequence>
           <element name="a" .../>
           <element name="c" type="xs:int" dfdl:length="7"/>
        </sequence>
     </complexType>
  </element>
  <element name="v2">
     <complexType>
        <sequence>
           <element name="b" .../>
           <element name="c" type="xs:int" dfdl:length="6"/>
           <element name="spare" type="xs:unsignedInt" dfdl:length="1"/>
         </sequence>
     </complexType>
  </element> 
</choice>

Note both versions 1 and 2 have a child named 'c' which is an 'xs:int'.

This has the drawback that the path to reach element 'c' must have a parent that is version specific even though element 'c' is common to both versions. The two differ only by a DFDL property (dfdl:length). 

Consider instead using this technique:

<choice>
  <sequence>
    <element name="v1" type="pre:empty"/>
    <element name="a" .../>
    <element name="c" type="xs:int" dfdl:length="7"/>
  </sequence>
  <sequence>
    <element name="v2" type="pre:empty"/>
    <element name="b" .../>
    <element name="c" type="xs:int" dfdl:length="6"/>
    <element name="spare" type="xs:unsignedInt" dfdl:length="1"/>
  </sequence> 
</choice>

This uses a marker element, which will be <v1/> or <v2/>, before the other elements. A path to the 'c' element will have neither a v1 nor a v2 element as a parent.

Such paths are then version polymorphic, which is very much preferable. 

The type pre:empty can be defined to be an unaligned empty sequence so that it has no representation in the data stream.
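One plausible definition of such a type is sketched below. Expressing "unaligned" as a 1-bit alignment on the sequence is an assumption about how to spell it; the marker elements themselves are assumed to pick up a matching alignment from the format in scope.

```xml
<!-- Fragment for the schema that defines the pre namespace; prefixes are
     bound here only so it stands alone. -->
<complexType name="empty"
  xmlns="http://www.w3.org/2001/XMLSchema"
  xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">
  <!-- An empty sequence with 1-bit alignment contributes no bits to the data. -->
  <sequence dfdl:alignment="1" dfdl:alignmentUnits="bits"/>
</complexType>
```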




1 Comment

  1. See also this email about choices with empty branches, for example:

    <xs:choice>
    <xs:element name="foo" type="xs:int" />
    <xs:sequence />
    </xs:choice>

    This is best avoided as it causes incorrect XSD validation in current versions of Xerces C, a popular XML validator library. 

    See issue XERCESC-2243.