...

These notes capture best practices learned from many debugging exercises.

Avoid Element References and Global Element Declarations

DFDL Schemas should use elementFormDefault="unqualified" (which is the default for XML Schemas).  There's no need for every child element to have a namespace (hence prefix), when the tree they are part of has a namespace prefix somewhere further towards the root which makes the identity of those child elements unambiguous. 
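As a small illustration, here is a sketch of a schema with unqualified local elements. The namespace urn:example, the prefix ex, and the element names are hypothetical, and the DFDL format annotations are omitted for brevity.

```xml
<schema xmlns="http://www.w3.org/2001/XMLSchema"
  xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
  xmlns:ex="urn:example"
  targetNamespace="urn:example"
  elementFormDefault="unqualified">

  <!-- DFDL format annotations omitted in this sketch. -->

  <element name="record">
    <complexType>
      <sequence>
        <!-- Local elements: these are in no namespace. -->
        <element name="id" type="int"/>
        <element name="name" type="string"/>
      </sequence>
    </complexType>
  </element>
</schema>
```

In the corresponding XML, only the root carries a prefix. The children need none, because their position under the root already identifies them:

```xml
<ex:record xmlns:ex="urn:example">
  <id>1</id>
  <name>example</name>
</ex:record>
```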

Global elements should be defined only as an aid for testing the schema. 

...

Code Block
<schema 
  targetNamespace="urn:mySchemaNamespace"
  xmlns:msns="urn:mySchemaNamespace" 
  ... >

... import/include and top level format annotations...

<!-- 
  This one-liner below is the ONLY global element in the entire schema, 
  and schema users can always ignore it and just use the complex type, so 
  they can call the element in their schema whatever they want.

  At the same time this single root allows users to easily 
  test the schema with the daffodil CLI or daffodil-vscode extension, 
  without having to specify a root element in a separate file. 
--> 

<element name="mySchema" type="msns:mySchemaType"/> 

<complexType name="mySchemaType">
     ... the real schema content is all reachable from here. ...
</complexType>

... other types and groups ...

</schema>

Included files, and imported files that are part of the same schema project, should have either no global elements at all or just one, like the above, to facilitate testing. 

But they should always include equivalent complex type or group definitions allowing those global elements to be bypassed/ignored. 

Rationale: This makes schemas more flexible for reuse, because the schema imposes no element names that a schema user cannot avoid if they so choose. 
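For example, a user's schema might reuse the complex type under an element name of its own choosing. This is only a sketch: the namespace urn:userNamespace, the element names message and payload, and the file name mySchema.dfdl.xsd are all hypothetical.

```xml
<schema xmlns="http://www.w3.org/2001/XMLSchema"
  xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
  xmlns:msns="urn:mySchemaNamespace"
  targetNamespace="urn:userNamespace">

  <!-- Hypothetical file name for the reused schema. -->
  <import namespace="urn:mySchemaNamespace" schemaLocation="mySchema.dfdl.xsd"/>

  <!-- DFDL format annotations omitted in this sketch. -->

  <element name="message">
    <complexType>
      <sequence>
        <!-- The user picks the name "payload"; the global element
             mySchema in the reused schema is simply ignored. -->
        <element name="payload" type="msns:mySchemaType"/>
      </sequence>
    </complexType>
  </element>
</schema>
```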

A second global element can also sometimes be useful for testing against files that contain multiple data items. This second global element would almost always look like:

Code Block
<element name="mySchemaFile">
  <complexType>
    <sequence>
      <element name="mySchema" type="msns:mySchemaType" maxOccurs="unbounded" dfdl:occursCountKind='implicit'/>
    </sequence>
  </complexType>
</element>

Note how this does not have an element reference in it, but a local element declaration for the mySchema child element. 
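To make the contrast concrete, the two forms are sketched side by side here, reusing the names from the example above. One visible difference is that a referenced global element is always namespace-qualified, while the local declaration follows elementFormDefault.

```xml
<!-- Avoid: an element reference to the global declaration. -->
<sequence>
  <element ref="msns:mySchema" maxOccurs="unbounded" dfdl:occursCountKind="implicit"/>
</sequence>

<!-- Prefer: a local element declaration that uses the complex type. -->
<sequence>
  <element name="mySchema" type="msns:mySchemaType" maxOccurs="unbounded" dfdl:occursCountKind="implicit"/>
</sequence>
```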

Lastly, no other structured data system has anything like element references, so in the interests of being able to use DFDL and transform data into the data models used by other processing fabrics, element references should be avoided. 

Summary: schema files should have zero, one, or at most two global element declarations in them, and those are there for convenient testing, and may be ignored entirely when the schema is reused.

Namespaces, Namespace Prefixes, Import, Include, and the schemaLocation  Attribute

Namespaces and namespace prefixes in XSD seem simple enough until you start building a very large DFDL schema from multiple disjoint component schemas that are intended for reuse.

DFDL does not have any namespace features of its own; it simply passes through XML Schema's namespace and prefix system. 

(Note however: DFDL does not implement the XML Schema "redefine" construct, but neither do many regular XML Schema software platforms.)

Without a reasonable set of standard practices, it is quite easy to end up in what we call namespace hell. In this situation you get all sorts of diagnostic messages about symbols not being defined, even though your import/include files seem to be well specified. Debugging this can be difficult, and you end up with roughly the arrangement that the guidance below specifies, only after much work and wasted time.

Many DFDL applications also do not use XML as their output data format. JSON is also very popular, and direct connectors to other data transformation and processing fabrics, which have their own data models, are in the works. XML's data model and namespace system have no corresponding features in many of these other systems. (E.g., JSON does not have namespaces.) 

The practices here ensure that a DFDL schema's use of namespaces does not prevent a DFDL processor from parsing to, or unparsing from, JSON or other kinds of data output. 

Staying out of Namespace Hell

The first set of simple rules for staying out of trouble is this:

  • For every target namespace, choose a unique prefix to use everywhere in your schema to refer to that namespace. 
    • The practice of using xmlns:tns prefix within schemas to refer to "this target namespace" should not be used.
  • Schema definitions should, with few exceptions, have a target namespace.
  • A default namespace should be used only for the XML Schema namespace to avoid having to type "xs:" or "xsd:" everywhere. 

Different schema projects can use different prefixes, but within one schema project one namespace should mean one prefix globally across all files. 
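A schema header following these rules might look like the sketch below. The second namespace urn:headerNamespace, its prefix hdr, the type hdr:headerType, and the file name header.dfdl.xsd are hypothetical.

```xml
<!-- The default namespace is XML Schema, so no "xs:" prefixes are needed.
     Each target namespace gets one unique prefix (msns, hdr); there is no tns. -->
<schema xmlns="http://www.w3.org/2001/XMLSchema"
  xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
  xmlns:msns="urn:mySchemaNamespace"
  xmlns:hdr="urn:headerNamespace"
  targetNamespace="urn:mySchemaNamespace">

  <import namespace="urn:headerNamespace" schemaLocation="header.dfdl.xsd"/>

  <!-- DFDL format annotations omitted in this sketch. -->

  <complexType name="mySchemaType">
    <sequence>
      <element name="header" type="hdr:headerType"/>
      <element name="count" type="int"/>
    </sequence>
  </complexType>
</schema>
```

Every other file in the same schema project would then bind msns and hdr to these same two namespaces.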

The most critical guidance rules are these:

  • For every target namespace, one file must be the single distinguished one for that namespace. It is the one and only schemaLocation file named in an xs:import wherever that namespace is imported.
  • That distinguished file must xs:include all the other files that share that target namespace. 
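The two rules above can be sketched as follows. The file names are hypothetical, and the sketch assumes all files sit in the same directory. First, the distinguished file for the namespace, which pulls in every other file that contributes to it:

```xml
<!-- mySchema.dfdl.xsd: the distinguished file for urn:mySchemaNamespace -->
<schema xmlns="http://www.w3.org/2001/XMLSchema"
  xmlns:msns="urn:mySchemaNamespace"
  targetNamespace="urn:mySchemaNamespace">

  <!-- Every other file sharing this target namespace is included here. -->
  <include schemaLocation="mySchemaHeader.dfdl.xsd"/>
  <include schemaLocation="mySchemaBody.dfdl.xsd"/>

</schema>
```

Any schema that needs the namespace then imports that one file, and never mySchemaHeader.dfdl.xsd or mySchemaBody.dfdl.xsd directly:

```xml
<schema xmlns="http://www.w3.org/2001/XMLSchema"
  xmlns:msns="urn:mySchemaNamespace"
  targetNamespace="urn:userNamespace">

  <import namespace="urn:mySchemaNamespace" schemaLocation="mySchema.dfdl.xsd"/>

</schema>
```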

Note that cyclic usage between namespaces is allowed. Two schema files can xs:import each other, so long as they have different target namespaces.

However, xs:include  relationships cannot be cyclic.
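As a sketch of a permitted import cycle, consider two hypothetical files a.dfdl.xsd and b.dfdl.xsd with different target namespaces. Each imports the other and uses one of its definitions. The names here are all illustrative, and the types themselves are deliberately not recursive.

```xml
<!-- a.dfdl.xsd -->
<schema xmlns="http://www.w3.org/2001/XMLSchema"
  xmlns:a="urn:a" xmlns:b="urn:b"
  targetNamespace="urn:a">

  <import namespace="urn:b" schemaLocation="b.dfdl.xsd"/>

  <simpleType name="idType">
    <restriction base="int"/>
  </simpleType>

  <complexType name="aType">
    <sequence>
      <element name="item" type="b:bType"/>
    </sequence>
  </complexType>
</schema>
```

```xml
<!-- b.dfdl.xsd -->
<schema xmlns="http://www.w3.org/2001/XMLSchema"
  xmlns:a="urn:a" xmlns:b="urn:b"
  targetNamespace="urn:b">

  <import namespace="urn:a" schemaLocation="a.dfdl.xsd"/>

  <complexType name="bType">
    <sequence>
      <element name="id" type="a:idType"/>
    </sequence>
  </complexType>
</schema>
```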

The rest of this section is effectively just providing rationale for the above guidance. 

Things that Don't Work

Sometimes people want to decompose one namespace into several sub-units, and only import the symbols for the  features of that namespace they need and are using. So they expect they can import a namespace by importing only a specific file that contributes part of the definitions for that namespace. 

This does not work. To achieve that sort of modularity you must decompose to different namespaces. 
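In schema terms, the tempting but problematic form and the recommended alternative might look like this sketch. The file names and the second namespace URN are hypothetical.

```xml
<!-- Problematic: importing just one of several files that contribute
     definitions to urn:mySchemaNamespace. -->
<import namespace="urn:mySchemaNamespace" schemaLocation="mySchemaHeader.dfdl.xsd"/>

<!-- Instead: give the header definitions a namespace of their own,
     with its own single file, and import that. -->
<import namespace="urn:myHeaderNamespace" schemaLocation="myHeader.dfdl.xsd"/>
```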

The best mental model to understand this is: imagine all the schemaLocation  attributes were erased from all xs:import statements. Imagine the namespace URIs are actually being used to retrieve the namespace file. With this erasure you can only have one place where everything is getting that namespace because that namespace is defined by its URI, and that's also how you retrieve it. 

That's how XSD works. One namespace == one source == one file providing its definition. 

Some people actually create schemas this way, without schemaLocation on xs:import statements. Then they use an XML Catalog to provide the 1 to 1 mapping of namespaces to the single distinguished file that provides its definition.

We have not used XML Catalogs much, and they are not recommended, as they introduce their own complexities. 

Going back to practices for xs:import, and adding the schemaLocation attributes back in, it should now be clear that all across a schema there is a 1 to 1 association of namespaces to a specific schemaLocation. So every xs:import anywhere in your schema, for a given namespace X, must provide the exact same schemaLocation Y. 

If you have, anywhere in your schema....

Code Block
<xs:import namespace="ns" schemaLocation="location"/>

then for any specific ns, the location must always be the exact same location. 

What is the problem with the tns  prefix?

It often results in bigger XML due to the need to have xmlns:tns="...."  rebindings in multiple places in XML instance documents. When these are deep in the element nest they can be hard to find. 
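As an illustration, suppose three schemas each use tns for their own target namespace and contribute namespace-qualified elements to one document. The namespaces and element names below are hypothetical. The instance document might then look like this, with tns meaning something different at each level:

```xml
<tns:order xmlns:tns="urn:orders">
  <tns:customer xmlns:tns="urn:customers">
    <tns:address xmlns:tns="urn:addresses">
      <tns:city>Springfield</tns:city>
    </tns:address>
  </tns:customer>
</tns:order>
```

With one unique prefix per namespace, all the bindings can instead sit once on the root element:

```xml
<ord:order xmlns:ord="urn:orders" xmlns:cust="urn:customers" xmlns:addr="urn:addresses">
  <cust:customer>
    <addr:address>
      <addr:city>Springfield</addr:city>
    </addr:address>
  </cust:customer>
</ord:order>
```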

...