...

Motivation

DFDL needs an extension that allows data much larger than memory to be manipulated.

...

An important use case for DFDL is to expose this metadata for easy use, and to provide access to the large data via a streaming mechanism akin to opening a file, rather than including large chunks of a hexBinary string in the infoset, as is common today.

In relational database systems, BLOB (Binary Large Object) and CLOB (Character Large Object) are the types used when the data row returned from an SQL query does not contain the actual value data, but rather a handle that can be used to open/read/write/close the BLOB or CLOB.

...

This also eliminates the limitation on object size.

Basic Blob Requirements

The basic requirement has almost nothing to do with DFDL.

We want to represent an image file in XML, except the BLOB of compressed image data, which we want to reliably incorporate by reference.

So instead of

Code Block
<?xml version="1.0" ?>
<someImage>
  <lat>44.9090</lat>
  <lon>70.2929</lon>
  <img>
   098fad0965edcab...giant.hexbinary.string...many megs or gigs in size
  </img>
</someImage>

Instead, the bytes corresponding to the image data go in a separate file, "img.dat", and the infoset becomes

Code Block
<?xml version="1.0" ?>
<someImage>
  <lat>44.9090</lat>
  <lon>70.2929</lon>
  ... some way of saying img.dat blob goes here...
</someImage>

A few requirements:

  1. The document must still be validated relative to its DFDL schema/XML Schema - so the BLOB must be content that can be validated. That suggests it is an element. This validation does not have to touch or even verify the existence of the BLOB file.
    1. This means the element must be expressed in DFDL's subset of XML Schema. Hence, it is not an element with attributes, as attributes aren't part of the DFDL schema language.
  2. The BLOB must be able to refer to a region of bytes within a file. This is so that DFDL can be used to identify the location of the BLOB in a file being parsed, without having to copy or bring into memory the BLOB data. Rather, the Infoset can contain a BLOB that identifies the original file, and the location within it.
    1. Note: This is a special case of a general capability for any element in a DFDL schema - users may want to know its exact starting position and length, measured in bits or bytes, if only for trace/debug or verification purposes.
  3. It should not require Daffodil to be used to manipulate these XML files that contain BLOB references. Interpreting the BLOB information should not require information bases that are maintained by Daffodil libraries (e.g., mappings from GUIDs to files).
    1. We may want to provide a convenient Scala/Java library for this; it should not be bundled into the Daffodil libraries, but should be easily isolated.

One concrete suggestion is:

Code Block
<?xml version="1.0" ?>
<someImage>
  <lat>44.9090</lat>
  <lon>70.2929</lon>
  <img><BLOB daf:BLOB="true">../blobs/img.dat?offset0b=0;kind=raw</BLOB></img>
</someImage>

In the above we've introduced an element named BLOB which takes a special URI, absolute or relative, that identifies the blob data. The offset0b is a zero-based byte offset into the file where the BLOB data starts. The suffix "0b" on the name indicates that it is zero-based, to distinguish it from normal XML conventions, which are 1-based. The value of offset0b defaults to 0. An optional length=N attribute would constrain the length of the BLOB data, and kind=raw indicates that the data is not encoded or compressed in any way. (kind=raw would be the default.)

This URI would be parsed by conventional URL libraries. The query (the part after the "?") is a ";"-separated list of keyword=value pairs.
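
A minimal sketch of such decoding, using plain java.net.URI and the parameter names proposed above (offset0b, length, kind); the BlobRef case class and decode helper are hypothetical, purely for illustration:

Code Block
languagescala
import java.net.URI

object BlobUriSketch {
  // Hypothetical holder for the decoded pieces of a BLOB URI.
  final case class BlobRef(path: String, offset0b: Long, length: Option[Long], kind: String)

  def decode(s: String): BlobRef = {
    val uri = new URI(s) // handles both relative and absolute forms
    // The query is a ";"-separated list of keyword=value pairs.
    val params: Map[String, String] =
      Option(uri.getQuery).getOrElse("")
        .split(";").filter(_.nonEmpty)
        .map { kv => val Array(k, v) = kv.split("=", 2); (k, v) }
        .toMap
    BlobRef(
      path = uri.getPath,
      offset0b = params.get("offset0b").map(_.toLong).getOrElse(0L), // defaults to 0
      length = params.get("length").map(_.toLong),                   // optional
      kind = params.getOrElse("kind", "raw")                         // raw is the default
    )
  }
}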

BLOBs as Layers

A DFDL schema using a BLOB would look, for example, like this:

Code Block
<element name="img" >
  <complexType>
    <sequence>
      <element name="BLOB" daf:layerBoundaryMark="[END-IMAGE]"
         type="daf:URI4BLOB" daf:layerTransform="daf:BLOB" daf:layerLengthKind="boundaryMark"/>
    </sequence>
  </complexType>
</element>

A schema containing daf:URI4BLOB would be provided and would contain roughly:

Code Block
<simpleType name="URI4BLOB" dfdl:encoding="utf-8">
  <restriction base="xs:string">
     <pattern value="..regex for these URIs.."/>
  </restriction>
</simpleType>

Here we see that a BLOB is actually created by way of a layering. The BLOB layer implements isolation of the BLOB contents and produces, when parsing, bytes containing the URI in UTF-8 encoding. When unparsing, the layer transform takes the URI and obtains the corresponding bytes by opening the URI to obtain a Java InputStream.
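
As a rough illustration of the unparse direction, here is a minimal sketch of opening a blob URI to obtain a Java InputStream; the base-URI resolution for relative forms like "../blobs/img.dat" is an assumption:

Code Block
languagescala
import java.io.InputStream
import java.net.URI

// Resolve the URI taken from the infoset value and open it as a stream.
def openBlob(blobUri: String, base: URI): InputStream = {
  val resolved = base.resolve(blobUri) // absolute URIs pass through unchanged
  resolved.toURL.openStream()          // standard JDK way to get an InputStream
}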

BLOB Use Cases

There are a few different use cases. The variations have to do with how the BLOB data is accessed, over what time span it is accessible, and when resources can be reclaimed.

Image Filtering In A Process Pipeline

Parser produces an infoset containing a durable blob handle. This blob handle provides access to the blob data even after the parser has terminated, and the process exited.

The blob handle can be opened, to get an input stream, and the bytes in it read like any Java InputStream.

The parser must be run in a BLOB='persistent' mode (API TBD), which tells it to create permanent URIs and never to release/delete the underlying resources automatically.

An API provides the ability to create a blob handle together with a Java OutputStream (the two are created simultaneously), which can then be opened, written, and closed/flushed; the blob handle can then be used as a replacement for an input blob handle.

The notion here is that one opens and reads from the input blob handle, processes the data, and, if the data was modified, supplies a replacement blob handle on output.
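
A minimal sketch of that pattern, using a hypothetical BlobHandle API (openInputStream and createWithOutputStream are illustrative names, not an existing Daffodil API):

Code Block
languagescala
import java.io.{InputStream, OutputStream}

trait BlobHandle { def openInputStream(): InputStream }
object BlobHandle {
  // Creates a new durable blob together with an OutputStream that fills it.
  def createWithOutputStream(): (BlobHandle, OutputStream) = ???
}

// Read from the input handle, transform the data, and supply a
// replacement handle on output.
def filterBlob(in: BlobHandle, transform: (InputStream, OutputStream) => Unit): BlobHandle = {
  val (replacement, out) = BlobHandle.createWithOutputStream()
  val is = in.openInputStream()
  try transform(is, out)
  finally { is.close(); out.close() } // close/flush makes the new handle usable
  replacement
}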

The unparser consumes an infoset containing a blob handle, and reads from it the data, writing that as the "contents" of the corresponding element.

The parser and unparser are independent processes that do not necessarily overlap in time existence. Their only communication is through the blob handle. Hence, the blob objects are allocated at the system level, and are not part of the state of the parser nor unparser. (E.g., they could be files).

A blob handle survives a reboot of the computer system - its state is durable, so that if you write out the infoset from a parse of data as an XML text file, then reboot the computer, you can then read that XML text file, find the BLOB handles within it, and open them.

A blob handle is some opaque URI, supporting the openStream API.

Each BLOB must be explicitly discarded. A convenience API might walk an entire infoset (as XML), and discard each BLOB found.

A non-native attribute daf:BLOB='true' is the XML representation indicating a BLOB-valued element. The blob handle is the VALUE of the element.
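
A minimal sketch of the convenience API suggested above, assuming scala.xml, file-scheme blob URIs, and the daf:BLOB='true' marker; the function is hypothetical:

Code Block
languagescala
import java.io.File
import java.net.URI
import scala.xml.{Elem, Node, XML}

// Walk an infoset (as XML) and discard every BLOB found.
def discardBlobs(infosetFile: File): Unit = {
  def walk(n: Node): Unit = n match {
    case e: Elem =>
      val isBlob = e.attributes.asAttrMap.get("daf:BLOB").contains("true")
      if (isBlob) new File(new URI(e.text.trim)).delete() // the value is the handle
      e.child.foreach(walk)
    case _ => // text nodes, comments, etc. carry no blobs
  }
  walk(XML.loadFile(infosetFile))
}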

The lifetime of the BLOB resources (typically files) is no more controlled here than the lifetime of the original file is.

Single Process, Single Thread, SAX-style Event, Stateless

In this case, a single process with code written in Scala/Java performs parse, transform, and unparse of data. The code is single-threaded.

The parser is generating SAX-style infoset events for the start and end of each element.

BLOBs are processed in a streaming mode (API call to set this TBD).

To process BLOB contents, the application's startElement() method would simply have to check for a blob (by calling the isBLOB() method, which is part of the extended API of an event handler).

(TBD: or we could require the handlers to be special blob-aware handlers with startBLOBElement() and endBLOBElement() methods. This potentially has lower overhead.)

The lifetime of this BLOB input stream is only until the SAX-style event callback returns. At that point the resources/storage can be reclaimed.

So the parser BLOB API is that the parser calls the SAX-style event handler with a BLOB method, handing it an open input stream.

The unparser BLOB API is to be such a SAX-style event handler that implements the BLOB method, reading data from the open input stream and unparsing it.
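
A minimal sketch of the blob-aware handler variant floated in the TBD above; the trait and method names are hypothetical, not an existing Daffodil API:

Code Block
languagescala
import java.io.{InputStream, OutputStream}

trait BlobAwareHandler {
  // Called instead of startElement for a BLOB element. The stream is
  // only valid until this callback returns, after which the parser may
  // reclaim the underlying resources.
  def startBLOBElement(name: String, blobData: InputStream): Unit
  def endBLOBElement(name: String): Unit
}

// Example handler: stream the blob contents through in bounded memory.
class CopyingHandler(out: OutputStream) extends BlobAwareHandler {
  def startBLOBElement(name: String, blobData: InputStream): Unit = {
    val buf = new Array[Byte](64 * 1024)
    var n = blobData.read(buf)
    while (n >= 0) { out.write(buf, 0, n); n = blobData.read(buf) }
  }
  def endBLOBElement(name: String): Unit = ()
}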

In this use case, the DFDL schema element corresponding to the BLOB object must carry an explicit BLOB annotation (extension to DFDL v1.0) indicating that it is to be treated as a BLOB, and that its 'value' is a BLOB handle (which could be a BLOB URI).

However, in this case, the BLOB handle, if output as text (e.g., by printing the resulting XML instead of unparsing it), merely documents that the BLOB was skipped over.

It is possible to parse and unparse an arbitrarily large image file in only finite memory using this API, so long as the image file format is streamable. 

Implementation Note: For unparsing, DirectOrBufferedDataOutputStream may need to grow a special form of BufferedDataOutputStream which is a BLOB. There is no point in double-buffering a BLOB; the BLOB object itself is very much a buffer. We simply need to know how to recombine its data into the DirectDataOutputStream at the right point in time.

Implementation Concerns

...

Requirements

  1. Rather than hexBinary data appearing in the infoset, the infoset should include some unique identifier of a location external to the infoset where the hexBinary byte data can be found.

  2. The unique identifier must be expressed in DFDL's subset of XML Schema (e.g. things like an attribute containing the unique identifier are not allowed, as attributes aren't part of the DFDL schema language).

  3. Accessing, modifying, or creating custom BLOB resources should be possible without the use of Daffodil (e.g. no GUID mappings stored in Daffodil memory).

  4. New API calls or changes to existing API calls should not be required if the BLOB feature is not enabled in a schema (e.g. maintain source backwards compatibility).

  5. Support is only needed for writing BLOBs out to files, one per BLOB. Future enhancements can be made to support alternative BLOB storage media (e.g. databases, offsets into the original file, an API for custom storage) if necessary.

Proposed Changes

DFDL Extension 'blob' Simple Type

A new type is defined internal to Daffodil in the DFDL extension namespace like so:

Code Block
languagexml
titledfdlx.xsd
<xs:schema
  targetNamespace="http://www.ogf.org/dfdl/dfdl-1.0/extensions"
  xmlns:xs="http://www.w3.org/2001/XMLSchema">

  ...

  <xs:simpleType name="blob">
    <xs:restriction base="xs:anyURI">
  </xs:simpleType>

</xs:schema>

This defines a new simple type, dfdlx:blob, that can be used in a DFDL schema to specify that an element should use this new BLOB feature. An element with type dfdlx:blob has exactly the same restrictions and available properties as elements with the xs:hexBinary type. The only difference is that rather than the hexBinary data being output to the infoset, the hexBinary data will be written to a file (as bytes rather than a hexBinary string), and a URI that identifies that file will be inserted into the infoset.

An example of this usage in a DFDL schema may look something like this:

Code Block
languagexml
<xs:schema
  xmlns:dfdlx="http://www.ogf.org/dfdl/dfdl-1.0/extensions"
  xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">

  <xs:element name="data" type="dfdlx:blob" dfdl:lengthKind="explicit" dfdl:lengthUnits="bytes" dfdl:length="1024" />

</xs:schema>

The resulting infoset will look something like this:

Code Block
languagexml
<data>file:///path/to/blob/data</data>

With the 1024 bytes of data being written to a file at location /path/to/blob/data.

Because the dfdlx:blob type is based on the xs:anyURI type, non-DFDL-aware schema validators are still able to validate that the BLOB elements are valid URIs. However, for this initial proposal, the BLOB URI will always use the file scheme. Although this may be a restrictive limitation for some use cases, the flexibility and generality of URIs allows for future enhancements to support different or even custom schemes if needed.

A benefit of this proposal is its simplicity and non-reliance on other DFDL extensions (e.g. one does not need to implement the DFDL layer extension to support this).

Regarding compatibility, because the dfdlx:blob type is defined internal to Daffodil (in dfdlx.xsd), other DFDL implementations are likely to error when this type is used in a schema because it is an unknown type to them. However, because the properties and restrictions are exactly the same as those for xs:hexBinary, some solutions to allow compatibility are:

  1. Implementations or users can define the "blob" DFDL extension simple type with a restriction of xs:hexBinary
  2. DFDL schema authors could change all occurrences of dfdlx:blob in a schema to xs:hexBinary


Each of these solutions provides the current DFDL behavior of outputting hexBinary data to the infoset as is standard.

Daffodil API

With a new simple type defined, some changes to the API are needed to specify where Daffodil should write these new BLOB files. A likely use case is the need to define a different BLOB output directory for each call to parse(). Thus, changes to the API must be made to define the output directory either directly on the parse() function or on a parameter already passed to the parse function. Since the InfosetOutputter is related to parse output, and the BLOB file is a sort of output, it makes the most sense for the definitions that control BLOB file output to be added to the InfosetOutputter.

Two functions are added to the InfosetOutputter.

The first API function allows a way to set the properties used when creating BLOB files, including the output directory, and prefix/suffix for the BLOB file.

Code Block
languagescala
/**
 * Set an output directory and file prefix/suffix to be used when creating
 * files containing BLOB data. These values will only be used when Daffodil
 * parses an element with the dfdlx:blob type to determine which file to write
 * the blob data.
 *
 * The directory, prefix, and suffix parameters will be passed directly to
 * File.createTempFile to create a new empty file.
 *
 * @param directory The output directory to write BLOB files to. If None, any
 *                  attempts to parse dfdlx:blob data will result in a Schema
 *                  Definition Error.
 * @param prefix    The prefix string to be used in generating the BLOB file name.
 *                  Must be at least three characters long.
 * @param suffix    The suffix string to be used in generating the BLOB file name.
 */

final def setBlobOutputFileProperties(directory: Option[File], prefix: String, suffix: String): Unit

The second API function allows a way for the API user to get a list of all BLOB files that were created during parse().

Code Block
languagescala
/**
 * Get the list of all BLOB files created during this call to parse. It is the
 * responsibility of the caller to delete these files when appropriate.
 *
 * @return Sequence of Files containing BLOB data
 */
final def getBlobFiles(): Seq[File]

Note that no changes to the unparse() API are required, since the BLOB URI provides all the necessary information to retrieve files containing BLOB data.
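
For illustration, a minimal usage sketch of the two proposed functions, assuming the org.apache.daffodil.sapi Scala API; diagnostics checking is elided and the method names are those proposed above:

Code Block
languagescala
import java.io.{File, FileInputStream}

import org.apache.daffodil.sapi.Daffodil
import org.apache.daffodil.sapi.infoset.ScalaXMLInfosetOutputter
import org.apache.daffodil.sapi.io.InputSourceDataInputStream

object BlobParseExample extends App {
  val dp = Daffodil.compiler()
    .compileFile(new File("schema.dfdl.xsd"))
    .onPath("/")

  val outputter = new ScalaXMLInfosetOutputter()
  // Tell Daffodil where dfdlx:blob data should be written during parse.
  outputter.setBlobOutputFileProperties(Some(new File("/tmp/blobs")), "blob-", ".bin")

  val input = new InputSourceDataInputStream(new FileInputStream("data.bin"))
  val res = dp.parse(input, outputter)

  // The caller owns the BLOB files and must delete them when appropriate.
  outputter.getBlobFiles().foreach(_.delete())
}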

Schema Compilation

Schema compilation remains mostly the same. Daffodil will treat elements with the special dfdlx:blob type exactly the same as if the type were xs:hexBinary. The only difference is that a new flag will be passed to the existing HexBinary primitives/processors to alter their behavior to output to a file rather than to the infoset.

Parser Runtime

To support BLOBs, the BLOB parse logic follows these steps:

  1. As with hexBinary, determine the starting bitPosition and length of the hexBinary content
  2. Create a new BLOB file using the directory/prefix/suffix information set in the InfosetOutputter. If the directory is None or creation of the temp file fails, throw a Schema Definition Error.
  3. Open the newly created file using a FileOutputStream. If opening the file fails, throw a Schema Definition Error.
  4. Read length bytes of data from the ParseState dataInputStream and write them out to the FileOutputStream, chunking the reads into smaller byte lengths to minimize total memory required and to support >2GB of data (see the sketch after this list). If at any point no more bytes are available, throw a PENotEnoughBits parse error. If there is an IOException, throw a Schema Definition Error.
  5. Close the file stream.
  6. Set the value of the current element to the URI of the File.
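
A minimal sketch of step 4's chunked copy, phrased over plain java.io streams (the real code would read from the ParseState's dataInputStream; the error cases are marked by comments):

Code Block
languagescala
import java.io.{InputStream, OutputStream}

// Chunking keeps memory bounded and lets the total length exceed 2GB.
def copyBlobData(in: InputStream, out: OutputStream, lengthInBytes: Long): Unit = {
  val buf = new Array[Byte](8192)
  var remaining = lengthInBytes
  while (remaining > 0) {
    val want = math.min(remaining, buf.length.toLong).toInt
    val got = in.read(buf, 0, want)
    if (got < 0) ???       // no more bytes available: PENotEnoughBits parse error
    out.write(buf, 0, got) // an IOException here becomes a Schema Definition Error
    remaining -= got
  }
}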

Additionally, logic must be created to remove BLOB files if Daffodil backtracks past an already created BLOB. This involves modifying the InfosetImpl so that when we restore state we inspect all removed children (potentially recursively?) for any that are BLOB types and delete the file specified by the URI. If we fail to delete a file, a Schema Definition Warning should be emitted to alert users. This file should still be returned when getBlobFiles is called, even though the file will not be referenced in the infoset. The caller is responsible for deleting this file since Daffodil could not remove it.

Unparser Runtime

To support BLOBs, the BLOB unparse logic follows these steps:

  1. Get the URI from the infoset and the file length. If the length cannot be determined, throw an UnparseError.
  2. As with hexBinary, determine the length of the hexBinary content and error if the BLOB file length is larger than the content length
  3. Open the File using a FileInputStream. If opening the file fails, throw an UnparseError
  4. Read bytes from the FileInputStream and write them to the UState dataOutputStream. Chunk the reads into smaller byte lengths to minimize total memory required and to support >2GB of data. If at any point there is an IOException, throw an UnparseError.
  5. As with hexBinary, write skip bits if the content length is not filled completely.

Note that we are explicitly not removing files after unparsing their data. It is the responsibility of the API user to determine when files are no longer needed and remove them.

TBD: We may need to add a feature so that if we unparse to a data output stream that is not direct (i.e. the backing OutputStream is a ByteArrayOutputStream with a 2GB limit), we split off a new buffered output stream and continue writing to that.

DFDL Expression

There are going to be cases where expressions may want to reference elements with type dfdlx:blob.

This proposal adds the restriction that expression access to the data of a BLOB element is not allowed. This limitation is really for practical purposes. Presumably, the dfdlx:blob type is only used because the data is very large or meaningless, and so accessing the data is unnecessary. This restriction minimizes complexity since expressions do not need to worry about converting blobs to byte arrays. If it is later determined that such a feature is needed, this restriction may be lifted. Any access to the data of a BLOB will result in a Schema Definition Error during schema compilation.

This proposal does allow access to the length of a BLOB element. This is almost certainly needed since it is very common in data formats to include both a BLOB payload and the length of that payload. On unparse, we almost certainly need the ability to calculate the length of the BLOB data so that the value can be output in a length field in the data. Fortunately, the content/valueLength functions do not actually query the data, but instead query bit positions stored in the infoset. Thus, no changes should be necessary to support this.

Minimal changes may be necessary to make the expression language aware of the dfdlx:blob type and act accordingly.

TDML Runner

Because the parsing of BLOBs results in a random URI in the infoset, this poses challenges to the TDML Runner's ability to compare expected and actual infosets. To resolve this, the TDML Runner will be modified in the following ways:

  1. Use the new API to specify a temp directory for BLOBs to be stored
      
  2. Perform type-aware comparisons for the dfdlx:blob type, similar to what we do now for xs:date, xs:dateTime, and xs:time. Type awareness will be enabled by using the xsi:type attribute on the expected infoset, since Daffodil does not currently support adding xsi:type information to the actual infoset. An example looks something like:

    Code Block
    languagexml
    <tdml:dfdlInfoset>
      <data xsi:type="dfdlx:blob">file:///path/to/blob/data</data>
    </tdml:dfdlInfoset>

    During type-aware comparisons, the TDML Runner will extract and modify the path (e.g. treat it as not absolute) to be suitable for use in logic similar to finding files via the type="file" attribute for expected infosets. Once the expected file is found, it will compare the contents of that file with the contents of the URI specified in the actual infoset and report any differences as usual (see the sketch after this list).
      

  3. After a test completes, delete all BLOB files listed in the InfosetOutputter
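
A minimal sketch of the comparison in item 2 above, over java.nio; the TDML Runner's file-lookup logic is elided and the function name is illustrative:

Code Block
languagescala
import java.io.File
import java.net.URI
import java.nio.file.{Files, Paths}

// Compare the expected blob file's bytes against the bytes at the URI
// found in the actual infoset (test data is assumed small enough to
// read fully into memory).
def blobContentsMatch(expectedFile: File, actualBlobUri: String): Boolean = {
  val expected = Files.readAllBytes(expectedFile.toPath)
  val actual = Files.readAllBytes(Paths.get(new URI(actualBlobUri)))
  java.util.Arrays.equals(expected, actual)
}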

Command Line Interface

A new -b/--blob-dir option will be added to specify a custom blob directory, defaulting to "daffodil-blobs" in the current working directory if not specified. The directory should only be created when blobs are created.

The CLI will be modified to use the new BLOB API on the InfosetOutputter to set the BLOB directory appropriately.

The CLI will never delete blob files.  

...