This page is a collection of notes on how to create DFDL schemas in a way that really helps keep you out of various XSD snarls and complexities.
As of this writing (2023-02-13) many of the DFDL Schemas we have created do not follow all these conventions perfectly. We have learned as we have gone along.
This set of notes represents best practices after learning from many debugging exercises.
For those familiar with XML Schema (XSD), our schema style is an instance of what is called the "Venetian Blind" pattern (see Balisage2020), in a variation that one might call "Hard Venetian-Blind Type Library".
The "Hard" variation is because we strongly minimize the use of global elements, namespaces, and some other XSD constructs.
The "Type-Library" variation is because we structure DFDL schemas so that there is always the option for a user to use the schema as a component or library within a larger encompassing DFDL schema by referencing a complex type definition provided by the schema.
Below are the details.
Avoid Element References and Global Element Declarations
DFDL Schemas should use elementFormDefault="unqualified" (which is the default for XML Schemas).
Global elements should be defined only as an assistance for testing the schema.
Those elements should do nothing more than type reference a complex type definition.
DFDL schemas should not use element references.
The content of the schema should always be in a complex type definition. This gives the schema user the choice of what they want to call their elements, whether they want a global element, or to use the schema as a child element within a larger structure, without the burden of introducing global namespace prefix management to their schemas.
Defining only global types and groups, and leaving the global elements for testing or for the end user of the schema, provides greater flexibility. All schemas are then available to use as libraries.
Hence, the standard start of a DFDL schema is going to be:
<schema targetNamespace="urn:mySchemaNamespace" xmlns:msns="urn:mySchemaNamespace" ... >
  ... import/include and top level format annotations...
  <complexType name="mySchemaType">
    ... the real schema contents is all reachable from here. ...
  </complexType>
  ... other types and groups ...
</schema>
Included files, and imported files that are part of the same schema project should have no global elements at all.
The only global elements defined should be defined like this:
<schema xmlns:msns="urn:mySchemaNamespace" ... >
  <import namespace="urn:mySchemaNamespace" schemaLocation=".../mySchemaType.dfdl.xsd"/>
  ... a top level dfdl:format declaration ...
  <element name="myRoot" type="msns:mySchemaType"/> <!-- a type reference only -->
</schema>
Rationale: This makes schemas more flexible for reuse because the schema imposes no element names on the schema user; any element names it does declare can be avoided if the user so chooses.
A second global element can also sometimes be useful for testing against files with multiple data items in them. This second global element would almost always look like:
<element name="mySchemaFile">
  <complexType>
    <sequence>
      <element name="mySchema" type="msns:mySchemaType" maxOccurs="unbounded" dfdl:occursCountKind='implicit'/>
    </sequence>
  </complexType>
</element>
Summary: schema files should have zero, one, or at most two global element declarations in them, and those are there for convenient testing, and may be ignored entirely when the schema is reused.
Namespaces, Namespace Prefixes, Import, Include, and the schemaLocation Attribute
Namespaces and namespace prefixes in XSD seem simple enough until you start building a very large DFDL schema from multiple disjoint component schemas that are intended for reuse.
DFDL does not have any namespace features of its own; it simply passes through XML Schema's namespace and prefix system.
(Note however: DFDL does not implement the XML Schema "redefine" construct, but neither do many regular XML Schema software platforms.)
Without following a reasonable set of standard practices it is quite easy to end up in what we call namespace hell. In this situation you get all sorts of diagnostic messages about symbols not being defined, but your import/include files seem to be well specified. Debugging this can be problematic, and you end up with roughly the situation that the guidance below specifies, just after much work and wasted time.
It's also the case that many DFDL applications do not use XML as their output data format. JSON is very popular also, and direct connectors to other data transformation and processing fabrics are in the works which have their own particular data models. XML's data model, and namespace system, really have no corresponding features in many of these other systems like JSON. (E.g., JSON does not have namespaces.)
The practices here ensure a DFDL schema's use of namespaces does not prevent parser/unparser creation/consumption of JSON, or other kinds of data output, using a DFDL processor.
Staying out of Namespace Hell
The first set of simple rules for staying out of trouble is this:
- For every target namespace, choose a unique prefix to use everywhere in your schema to refer to that namespace.
- The practice of using xmlns:tns prefix within schemas to refer to "this target namespace" should not be used.
- Schema type and group definitions should, with few exceptions, have a target namespace.
- A default namespace should be used only for the XML Schema namespace to avoid having to type "xs:" or "xsd:" everywhere.
Different schema projects can use different prefixes, but within one schema project one namespace should mean one prefix globally across all files.
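As an illustration of these rules, a schema file header might look like the following sketch. The namespace URI, the msns prefix, and the type names are placeholders carried over from the earlier examples, and the top-level format annotation is left out.

```xml
<!-- The default namespace is the XML Schema namespace, so schema constructs need no prefix.
     The xs prefix is bound as well, so built-in types can be written as xs:string. -->
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
        xmlns:msns="urn:mySchemaNamespace"
        targetNamespace="urn:mySchemaNamespace"
        elementFormDefault="unqualified">

  <!-- msns is the one prefix used for urn:mySchemaNamespace in every file. There is no tns prefix. -->
  <complexType name="mySchemaType">
    <sequence>
      <element name="field1" type="msns:field1Type"/>
    </sequence>
  </complexType>

  <simpleType name="field1Type">
    <restriction base="xs:string"/>
  </simpleType>

</schema>
```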
The most critical guidance rules are these:
- For every target namespace, one file must be the single distinguished one for that namespace. It is the one-and-only schemaLocation file that is xs:import-ed anywhere one must import that namespace.
- That distinguished file must xs:include all the other files that share that target namespace.
Note that cyclic usage between namespaces is allowed. Two schema files can xs:import each other, so long as they have different target namespaces. However, xs:include relationships cannot be cyclic.
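A sketch of such a distinguished file follows. The file names (msns.dfdl.xsd, msnsHeader.dfdl.xsd, msnsBody.dfdl.xsd, other.dfdl.xsd) and the urn:otherNamespace namespace are hypothetical and chosen only for illustration.

```xml
<!-- File msns.dfdl.xsd (hypothetical name): the single distinguished file for urn:mySchemaNamespace -->
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:msns="urn:mySchemaNamespace"
        xmlns:other="urn:otherNamespace"
        targetNamespace="urn:mySchemaNamespace">

  <!-- Every other file sharing this target namespace is included from here, and only from here. -->
  <include schemaLocation="msnsHeader.dfdl.xsd"/>
  <include schemaLocation="msnsBody.dfdl.xsd"/>

  <!-- A different namespace is imported through its own distinguished file.
       That file may import this one back, since the target namespaces differ. -->
  <import namespace="urn:otherNamespace" schemaLocation="other.dfdl.xsd"/>

</schema>
```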
The rest of this section is effectively just providing rationale for the above guidance.
Things that Don't Work
Sometimes people want to decompose one namespace into several sub-units, and only import the symbols for the features of that namespace they need and are using. So they expect they can import a namespace by importing only a specific file that contributes part of the definitions for that namespace.
This does not work. To achieve that sort of modularity you must decompose to different namespaces.
The best mental model to understand this is: imagine all the schemaLocation attributes were erased from all xs:import statements. Imagine the namespace URIs are actually being used to retrieve the namespace file. With this erasure you can only have one place where everything is getting that namespace because that namespace is defined by its URI, and that's also how you retrieve it.
That's how XSD works. One namespace == one source == one file providing its definition.
Some people actually create schemas this way, without schemaLocation on xs:import statements. Then they use an XML Catalog to provide the 1 to 1 mapping of namespaces to the single distinguished file that provides its definition.
We have not used XML Catalogs much and they are not recommended, as they introduce their own complexities.
Going back to practices for xs:import, adding back in schemaLocation attributes, it should be clear now that all across a schema, there is a 1 to 1 association of namespaces to a specific schemaLocation. So every xs:import anywhere in your schema, for a given namespace X must provide the same exact schemaLocation Y.
If you have, anywhere in your schema....
<xs:import namespace="ns" schemaLocation="location"/>
then for any specific ns, the location must always be the exact same location.
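For instance, any file in a project that needs urn:mySchemaNamespace would carry the same import, spelled the same way. In this sketch the file names and the path are hypothetical.

```xml
<!-- Any file in the project that uses urn:mySchemaNamespace (file names are hypothetical) -->
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:msns="urn:mySchemaNamespace">

  <!-- Always the distinguished file for the namespace, with the identical schemaLocation everywhere. -->
  <import namespace="urn:mySchemaNamespace"
          schemaLocation="/com/example/schema/dfdl/msns.dfdl.xsd"/>

  <!-- Never a part file such as msnsBody.dfdl.xsd, even when only its definitions are needed. -->
  <element name="myRoot" type="msns:mySchemaType"/>

</schema>
```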
What is the problem with the tns prefix?
If you follow this style guide and have no global elements in namespaces then this won't come up, but if you do have global elements in namespaces then using tns for "target namespace" as a prefix causes trouble.
It often results in bigger XML due to the need to have xmlns:tns="...." rebindings in multiple places in XML instance documents. When these are deep in the element nest they can be hard to find.
It also makes XML instance documents harder for people to interpret: deep inside an XML document an element is named tns:someName, but the binding of the tns prefix is far away textually (for example, many pages earlier, though not necessarily at the start), and so is not clear in that context. Basically, when looking at an XML instance document, a person gets very little information from a tns prefix.
If tns prefixes are used only for type and group references, and never for element references, one might find that this reduces some editing, and as element references are generally frowned upon this should not come up often. However, if the prefix definition xmlns:tns="...." appears on the xs:schema element even when some other prefix is also bound to the same namespace, there is no telling whether a given XSD tool will use tns or the other prefix when identifying an element in XML instance documents. So even if the schema author only ever uses tns for type and group references, the tns prefix can still show up and cause (albeit minor) confusion in XML instance documents.
Best practice is to avoid this tns convention entirely, and to avoid having any global elements in namespaces.
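The contrast can be sketched as follows. The namespace and the msns prefix are the placeholders used in the earlier examples, and the discouraged form is shown only inside a comment.

```xml
<!-- Discouraged: tns and msns are both bound to the same namespace,
     so a tool may emit either prefix in XML instance documents.

<schema xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:tns="urn:mySchemaNamespace"
        xmlns:msns="urn:mySchemaNamespace"
        targetNamespace="urn:mySchemaNamespace"> ... </schema>
-->

<!-- Preferred: one descriptive prefix for the namespace, and no global elements in it. -->
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:msns="urn:mySchemaNamespace"
        targetNamespace="urn:mySchemaNamespace">

  <complexType name="mySchemaType">
    <sequence>
      <!-- Type references use the same prefix in every file of the project. -->
      <element name="part" type="msns:partType"/>
    </sequence>
  </complexType>

  <complexType name="partType">
    <sequence/>
  </complexType>

</schema>
```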
Suggested Conventions to use for Namespace URIs and Prefixes
Suppose you work for example.com, and you have XML Schemas, DFDL Schemas, and JSON schemas.
Let's suppose you have a DFDL schema for a format named "ebx data". Suppose there are various versions of this format.
The following is a useful namespace URI and prefix definition for this format:
xmlns:ebx="urn:example.com:schema:dfdl:ebxData:ebx"
The notion here is that the URI is a "URN" which means it is not an address to retrieve from. It is unique to your company, identifies it as a DFDL schema namespace, contains the format name, and the URI explicitly contains the suggested prefix to be used for this (by convention).
Note also that there is no version information at the end of this URI. This turns out to be a best practice.
Consider someone who sees this namespace URI alone, as in an import statement like this:
<xs:import namespace="urn:example.com:schema:dfdl:ebxData:ebx" schemaLocation="/com/example/schema/dfdl/ebxData.dfdl.xsd"/>
From this one automatically knows the prefix to use by convention, because it is the last part of the namespace URI.
These conventions for the schemaLocation are also useful as they provide something like the Java package namespaces to avoid name collisions.
Versioning - In the Infoset/Data, Not the Namespace URI
It's become clear in XML Schemas (not just DFDL) that having version specific namespace URIs causes difficulty.
One issue is that the path expressions that navigate such elements become version specific even if the elements they are ultimately accessing are common to multiple versions. Such paths are monomorphic to specific versions. It is much nicer if path expressions are as polymorphic across versions as possible.
Hence, define an element in your schema to hold the version information. Don't append a version number to a namespace URI.
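A minimal sketch of this idea, reusing the ebx namespace from the example above, is shown here. The type name, the element names, and the length and type of the version field are hypothetical, and the top-level format annotation that would supply the remaining DFDL properties is omitted.

```xml
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
        xmlns:ebx="urn:example.com:schema:dfdl:ebxData:ebx"
        targetNamespace="urn:example.com:schema:dfdl:ebxData:ebx">

  <!-- top-level dfdl:format annotation omitted from this sketch -->

  <complexType name="ebxMessage">
    <sequence>
      <!-- The version travels in the data, so the namespace URI stays the same for every version. -->
      <element name="version" type="xs:unsignedInt"
               dfdl:lengthKind="explicit" dfdl:length="2"/>
      <!-- Remaining content; a path to it does not mention any version. -->
      <element name="payload" type="xs:string"
               dfdl:lengthKind="delimited"/>
    </sequence>
  </complexType>

</schema>
```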
Express DFDL Properties on the Simple Types, not the Elements
Data formats usually are repetitive. The same format properties are often needed repeatedly for many different elements in the overall format.
This is best captured by defining named types and groups. Redundancy is then avoided by sharing use of types for every element having that same format.
One then avoids repetitive DFDL properties by placing the properties on the simple type definitions rather than on the elements having that type.
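As a sketch of this practice, the following fragment states the length properties once on a named simple type and then reuses that type. The type and element names are hypothetical, and a top-level format annotation supplying the other required properties is assumed.

```xml
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
        xmlns:msns="urn:mySchemaNamespace"
        targetNamespace="urn:mySchemaNamespace">

  <!-- top-level dfdl:format annotation omitted from this sketch -->

  <!-- The length properties are stated once, on the type. -->
  <simpleType name="int7" dfdl:lengthKind="explicit" dfdl:length="7">
    <restriction base="xs:int"/>
  </simpleType>

  <complexType name="mySchemaType">
    <sequence>
      <!-- Each element only names the type; no DFDL properties are repeated here. -->
      <element name="first" type="msns:int7"/>
      <element name="second" type="msns:int7"/>
      <element name="third" type="msns:int7"/>
    </sequence>
  </complexType>

</schema>
```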
It would be nice to say this applies for both simple and complex types, but alas the same exact style is not usable on complex type definitions, which do not carry DFDL properties in DFDL version 1.0. To avoid redundant properties on complex types it is suggested that named format definitions are created and used on each complex type variation. This is not quite as clean, but minimizes redundancy within what is allowed.
Note that the DFDL Workgroup is considering adding the ability to put DFDL properties on complex types in a future version of the DFDL standard.
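One reading of the named-format suggestion is sketched below: the shared properties go in a dfdl:defineFormat, and the sequence inside each complex type refers to it with dfdl:ref. The format name, the separator properties, and the type and element names are hypothetical, and a top-level format annotation supplying the other required properties is assumed.

```xml
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
        xmlns:msns="urn:mySchemaNamespace"
        targetNamespace="urn:mySchemaNamespace">

  <annotation>
    <appinfo source="http://www.ogf.org/dfdl/">
      <!-- A named format definition holding the properties shared by several complex types. -->
      <dfdl:defineFormat name="commaSeparated">
        <dfdl:format separator="," separatorPosition="infix"/>
      </dfdl:defineFormat>
    </appinfo>
  </annotation>

  <complexType name="recordA">
    <!-- The properties cannot sit on the complexType, so its sequence refers to the named format. -->
    <sequence dfdl:ref="msns:commaSeparated">
      <element name="x" type="xs:string"/>
      <element name="y" type="xs:string"/>
    </sequence>
  </complexType>

  <complexType name="recordB">
    <sequence dfdl:ref="msns:commaSeparated">
      <element name="p" type="xs:string"/>
      <element name="q" type="xs:string"/>
    </sequence>
  </complexType>

</schema>
```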
Avoid Child Elements with the Same Name
XML Schema has a data model with some flexibility needed only for markup languages intended for human authoring.
DFDL uses XML Schema to describe structured data, where this flexibility is not needed.
DFDL omits many XML Schema constructs, but DFDL version 1.0 still allows some things that are best avoided to ensure the ability to interoperate with other data models.
One such feature is the ability in XML Schema to have multiple child elements with the same name. So long as it is unambiguous what element declaration is intended, XML Schema allows things like:
...
<element name="foo" ..../>
<element name="bar" ..../>
<element name="foo" ..../>
This is allowed because the element bar separates the two different declarations of the foo element; hence, when parsing XML, the first foo declaration is used until a bar element is encountered, and after that the second foo declaration is used.
That's all interesting and useful for markup languages, but no other structured data system allows this. Hence, it is best avoided to enable DFDL schemas to be interfaced to data systems having other data models.
You can see why XML Schema allows this if you think about markup as in HTML. XML is for markup languages and XSD is for describing them. In a markup language you are often going to need lots of the same tag to appear within text repeatedly, separated by other tags at that same level of nesting. The fact that the instance data is XML means the tag-names make it easy to tease apart the document.
DFDL is for describing data that has no tags or specific syntax that the schema language can depend upon. So it provides only a subset of XSD features, and best practice is to avoid things that aren't typical in structured data systems.
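One simple way to stay within this guidance is to give each child a distinct name, as in the following sketch. The names foo1 and foo2 and the element types are hypothetical, and the top-level format annotation is omitted.

```xml
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:msns="urn:mySchemaNamespace"
        targetNamespace="urn:mySchemaNamespace">

  <!-- top-level dfdl:format annotation omitted from this sketch -->

  <complexType name="mySchemaType">
    <sequence>
      <!-- Every child of this type has a unique name, so the structure maps
           onto data models that require distinct member names. -->
      <element name="foo1" type="xs:string"/>
      <element name="bar" type="xs:string"/>
      <element name="foo2" type="xs:string"/>
    </sequence>
  </complexType>

</schema>
```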
Avoid Anonymous Choices
XML Schema allows a choice to be anonymous within the data model of an element. For example:
<element name="myElement">
  <complexType>
    <sequence>
      ... various elements ...
      <choice>
        ... choice branches ...
      </choice>
      ... various more elements
    </sequence>
  </complexType>
</element>
The choice above appears in the middle of a sequence group, with elements and/or other groups before and after it. Note that there is no element name associated with the choice. Rather, in XML data, the choice branches would contain elements and these would appear as direct children of the myElement parent element.
Many other data modeling languages do not have this capability. They require choices to be named.
Hence, this is to be avoided. Choice groups should always be the model-groups of named elements.
This is analogous to a DFDL restriction for optional/recurring data in DFDL. In DFDL, only elements, not sequences/choices, can be optional or recurring/array.
By using only named choices, one ensures one's DFDL schema can be mapped to the data structures of the other data systems which do not allow anonymous choices.
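The anonymous-choice example above could be restated with a named choice roughly as follows. The element names myChoice, before, after, branchA, and branchB are hypothetical, and the top-level format annotation is omitted; myElement is kept as an element declaration only to mirror the earlier example.

```xml
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- top-level dfdl:format annotation omitted from this sketch -->

  <element name="myElement">
    <complexType>
      <sequence>
        <element name="before" type="xs:string"/>
        <!-- The choice is the model group of its own named element. -->
        <element name="myChoice">
          <complexType>
            <choice>
              <element name="branchA" type="xs:string"/>
              <element name="branchB" type="xs:int"/>
            </choice>
          </complexType>
        </element>
        <element name="after" type="xs:string"/>
      </sequence>
    </complexType>
  </element>

</schema>
```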
Versioning and Choices - Using Marker Elements
Given two different versions of a schema, consider:
<choice>
  <element name="v1">
    <complexType>
      <sequence>
        <element name="a" .../>
        <element name="c" type="xs:int" dfdl:length="7"/>
      </sequence>
    </complexType>
  </element>
  <element name="v2">
    <complexType>
      <sequence>
        <element name="b" .../>
        <element name="c" type="xs:int" dfdl:length="6"/>
        <element name="spare" type="xs:unsignedInt" dfdl:length="1"/>
      </sequence>
    </complexType>
  </element>
</choice>
Note both versions 1 and 2 have a child named 'c' which is an 'xs:int'.
This has the drawback that the path to reach element 'c' must have a parent that is version specific even though element 'c' is common to both versions. The two differ only by a DFDL property (dfdl:length).
Consider instead using this technique:
<choice>
  <sequence>
    <element name="v1" type="pre:empty"/>
    <element name="a" .../>
    <element name="c" type="xs:int" dfdl:length="7"/>
  </sequence>
  <sequence>
    <element name="v2" type="pre:empty"/>
    <element name="b" .../>
    <element name="c" type="xs:int" dfdl:length="6"/>
    <element name="spare" type="xs:unsignedInt" dfdl:length="1"/>
  </sequence>
</choice>
This uses a marker element which will be <v1/> or <v2/> before the other elements. A path to the 'c' element will not have a v1 nor v2 element parent.
Such paths are then version polymorphic, which is very much preferable.
The type pre:empty can be defined to be an unaligned empty sequence so that it has no representation in the data stream.
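One possible definition is sketched here. The namespace bound to the pre prefix is hypothetical, the alignment properties are an assumption about what "unaligned" would mean in practice, and the format in scope is assumed to supply every other required property.

```xml
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
        xmlns:pre="urn:myPrefixNamespace"
        targetNamespace="urn:myPrefixNamespace">

  <!-- top-level dfdl:format annotation omitted from this sketch -->

  <complexType name="empty">
    <!-- An empty sequence with 1-bit alignment has no content and needs no alignment fill. -->
    <sequence dfdl:alignment="1" dfdl:alignmentUnits="bits"/>
  </complexType>

</schema>
```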
2 Comments
Mike Beckerle
See also this email about choices with empty branches, for example:
This is best avoided as it causes incorrect XSD validation in current versions of Xerces C, a popular XML validator library.
See issue: XERCESC-2243
Mike Beckerle
A recent (2024-4Q) style improvement has been to define only types and groups in a target namespace for a schema, and to define top-level elements in a no-namespace schema which simply imports the schema defining the types and groups, and which defines only the top-level elements.
This schema will often define exactly two top-level elements. One is for an individual message or record of the data format. The second is for a repeating array of these. The latter allows for tests to parse files containing many data items, not just one, and is mostly for convenience of creating tests.
This has the advantage then that
As an example of this, see the fakeTDL data format example on github here: https://github.com/DFDLSchemas/faketdl/blob/253caa9800717a6a41006354e22f6de52d007bfd/src/fakeTDL.dfdl.xsd