
Info

This proposal was implemented as part of Daffodil 2.5.0

This page is linked from https://s.apache.org/daffodil-blob-feature. If this page content moves, please update that link from https://s.apache.org.

Motivation

DFDL needs an extension that allows data much larger than memory to be manipulated.

...

  1. Rather than hexBinary data appearing in the infoset, the infoset should include some unique identifier to a location external to the infoset where the hexBinary byte data can be found.
     

  2. The unique identifier must be expressed in DFDL's subset of XML Schema (e.g. things like an attribute containing the unique identifier is not allowed, as attributes aren't part of the DFDL schema language).
      
  3. Accessing, modifying, or creating custom BLOB resources should be possible without the use of Daffodil (e.g. no GUID mappings stored in Daffodil memory).
      
  4. New API calls or changes to existing API calls should not be required if the BLOB feature is not enabled in a schema (e.g. maintain source backwards compatibility).
     
  5. Support is only needed for writing BLOBs out to files, one per BLOB. Future enhancements can be made to support alternative BLOB storage mediums (e.g. databases, offsets into the original file, an API for custom storage) if necessary.

Proposed Changes

DFDL Extension: xs:anyURI Type and dfdlx:objectKind

...

DFDL is extended to allow simple types to have the xs:anyURI type. Elements with this type will be treated as BLOB or CLOB objects.  The dfdlx:objectKind property is added to define what type of object it is. Valid values for this property are "bytes" for binary large objects and "characters" for character large objects.

An example of this usage in a DFDL schema may look something like this:

Code Block
languagexml
<xs:schema
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:dfdlx="http://www.ogf.org/dfdl/dfdl-1.0/extensions"
  xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">

  <xs:element name="data" type="xs:anyURI" dfdlx:objectKind="bytes" dfdl:lengthUnits="bytes" dfdl:length="1024" />

</xs:schema>

...

With the 1024 bytes of data being written to a file at location /path/to/blob/data.

Because BLOB elements use the xs:anyURI type, non DFDL-aware schema validators are still able to validate that BLOB elements contain a valid URI. For this initial proposal, the BLOB URI will always use the file scheme and must be absolute. Although this may be a restrictive limitation for some use cases, the flexibility and generality of URIs allows for future enhancements to support different or even custom schemes if needed.

One benefit of this proposal is its simplicity and non-reliance on other DFDL extensions (e.g. one does not need to implement the DFDL layer extension to support this).

Regarding compatibility, any DFDL implementations that do not support this extension will likely error on the unsupported xs:anyURI type. However, because the syntax and behavior are very similar to those of xs:hexBinary, some solutions to allow compatibility:

...

With each of these solutions, the modifications needed to switch from xs:anyURI to xs:hexBinary should be minimal.

Daffodil API

With a new simple type defined, some changes to the API are needed to specify where Daffodil should write these new BLOB files. A likely use case is the need to define a different BLOB output directory for each call to parse(). Thus, the API must allow the output directory to be defined either directly on the parse() function or on a parameter already passed to it. Since the InfosetOutputter is related to parse output, and the BLOB file is a kind of output, it makes the most sense for the settings that control BLOB file output to be added to the InfosetOutputter.

...

Code Block
languagescala
/**
 * Set the attributes for how to create blob files.
 *
 * The directory, prefix, and suffix parameters will be passed directly to
 * File.createTempFile to create a new empty file.
 *
 * @param dir    the Path blob files are written to. If the directory does
 *               not exist, Daffodil will attempt to create it before
 *               writing a blob.
 * @param prefix the prefix string to be used in generating a blob file name.
 *               Must be at least three characters long.
 * @param suffix the suffix string to be used in generating a blob file name.
 */

final def setBlobAttributes(dir: Path, prefix: String, suffix: String)

...

Code Block
languagescala
/**
 * Get the list of blob paths that were output in the infoset.
 *
 * This is the same as what would be found by iterating over the infoset.
 */
final def getBlobFiles(): Seq[Path]

Note that no changes to the unparse() API are required, since the BLOB URI provides all the necessary information to retrieve files containing BLOB data.

Schema Compilation

Schema compilation remains mostly the same. Daffodil will treat elements with the xs:anyURI type similar to a primitive type (e.g. xs:hexBinary), except output will be written to a file in an efficient manner.

Parser Runtime

To support BLOBs, the BLOB parse logic follows these steps:

  1. As with hexBinary, determine the starting bitPosition and length of the hexBinary content.
  2. Create a new BLOB file using the directory/prefix/suffix information set in the InfosetOutputter. If the directory has not been set or creation of the file fails, throw a Schema Definition Error.
  3. Open the newly created file using a FileOutputStream. If opening the file fails, throw a Schema Definition Error.
  4. Read length bytes of data from the ParseState dataInputStream and write them out to the FileOutputStream. Chunk the reads into smaller byte lengths to minimize total memory required and to support >2GB of data. If at any point no more bytes are available, throw a PENotEnoughBits parse error. If there is an IOException, throw a Schema Definition Error.
  5. Close the file stream.
  6. Set the value of the current element to the URI of the File.
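The chunked copy in step 4 can be sketched as follows. This is a minimal, self-contained illustration using plain java.io streams; the stream types and the RuntimeException are simplified stand-ins for Daffodil's ParseState dataInputStream and its PENotEnoughBits/Schema Definition Error machinery:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream, OutputStream}

// Copy exactly `length` bytes from `in` to `out` in fixed-size chunks so
// that memory use stays bounded and lengths may exceed 2GB (hence Long).
// Throws if the input ends before `length` bytes are read, analogous to
// the PENotEnoughBits parse error described in step 4.
def copyChunked(in: InputStream, out: OutputStream, length: Long, chunkSize: Int = 4096): Unit = {
  val buf = new Array[Byte](chunkSize)
  var remaining = length
  while (remaining > 0) {
    val want = math.min(remaining, chunkSize.toLong).toInt
    val got = in.read(buf, 0, want)
    if (got < 0) throw new RuntimeException(s"not enough bytes: $remaining still needed")
    out.write(buf, 0, got)
    remaining -= got
  }
}
```

The same loop works for the unparse direction, with the file as the source and the dataOutputStream as the sink.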

Additionally, logic must be created to remove BLOB files if Daffodil backtracks past an already created BLOB. This can be handled by storing the list of BLOB files in the PState, and deleting the appropriate files in the list before resetting back to an earlier state.
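The PState bookkeeping for backtracking might look something like the following sketch. The BlobTracker class and its method names are hypothetical, not part of Daffodil; it only illustrates the mark/delete/reset pattern:

```scala
import java.nio.file.{Files, Path}
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch: tracks blob files created during a parse. A mark
// records how many blobs existed at a point of uncertainty; resetting to
// that mark deletes any blob files created after it, mirroring how the
// PState would discard state when Daffodil backtracks.
class BlobTracker {
  private val blobs = ArrayBuffer[Path]()

  def add(p: Path): Unit = blobs += p

  // Capture the current position, e.g. when a point of uncertainty begins
  def mark: Int = blobs.length

  // Backtrack: delete blob files created after the mark and forget them
  def resetTo(m: Int): Unit = {
    blobs.drop(m).foreach(p => Files.deleteIfExists(p))
    blobs.remove(m, blobs.length - m)
  }

  def paths: Seq[Path] = blobs.toSeq
}
```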

Unparser Runtime

To support BLOBs, the BLOB unparse logic follows these steps:

  1. Get the URI from the infoset and the file length. If the length cannot be determined, throw an UnparseError.
  2. As with hexBinary, determine the length of the hexBinary content and error if the BLOB file length is larger than the content length.
  3. Open the file using a FileInputStream. If opening the file fails, throw an UnparseError.
  4. Read bytes from the FileInputStream and write them to the UState dataOutputStream. Chunk the reads into smaller byte lengths to minimize total memory required and to support >2GB of data. If at any point there is an IOException, throw an UnparseError.
  5. As with hexBinary, write skip bits if the content length is not filled completely.
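The unparse steps above can be sketched as follows. This is a simplified, self-contained illustration: lengths are in whole bytes rather than bits, skip bits become fill bytes, and RuntimeException stands in for UnparseError:

```scala
import java.io.{ByteArrayOutputStream, FileInputStream, OutputStream}
import java.nio.file.{Files, Path}

// Stream a blob file into an output in bounded-size chunks, then pad with
// fill bytes up to the declared content length. Errors if the file is
// larger than the content length, as in step 2 above.
def unparseBlob(blob: Path, out: OutputStream, contentLength: Long, fill: Byte = 0): Unit = {
  val fileLen = Files.size(blob)
  if (fileLen > contentLength)
    throw new RuntimeException(s"blob ($fileLen bytes) exceeds content length ($contentLength)")
  val in = new FileInputStream(blob.toFile)
  try {
    val buf = new Array[Byte](4096)
    var got = in.read(buf)
    while (got >= 0) { out.write(buf, 0, got); got = in.read(buf) }
  } finally in.close()
  var pad = contentLength - fileLen   // remaining bytes become fill/skip bytes
  while (pad > 0) { out.write(fill.toInt); pad -= 1 }
}
```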

Note that we are explicitly not removing files after unparsing their data. It is the responsibility of the API user to determine when files are no longer needed and remove them.

TBD: We may need to add a feature so that if we unparse to a data output stream that is not direct (i.e. the backing OutputStream is a ByteArrayOutputStream with a 2GB limit), we split off a new buffered output stream and continue writing to that.

DFDL Expression

There are going to be cases where expressions may want to reference elements with type xs:anyURI.

This proposal adds the restriction that expression access to the data of a BLOB element is not allowed. This limitation is really for practical purposes. Presumably, the xs:anyURI type is only used because the data is very large or meaningless, and so accessing the data is unnecessary. This restriction minimizes complexity since expressions do not need to worry about converting BLOBs to byte arrays or something else. If it is later determined that such a feature is needed, this restriction may be lifted. Any access to the data of a BLOB will result in a Schema Definition Error during schema compilation.

This proposal does allow access to the length of a BLOB element. This is almost certainly needed since it is very common in data formats to include both a BLOB payload and the length of that payload. On unparse, we almost certainly need the ability to calculate the length of the BLOB data so that the value can be output in a length field in the data. Fortunately, the content/valueLength functions do not actually query the data, but instead query bitPositions stored in the infoset. Thus, no changes should be necessary to support this.
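As a small illustration of why no data access is needed, a content length in bytes can be derived purely from stored bit positions. The function name and 0-based positions below are illustrative, not Daffodil's actual API:

```scala
// Illustrative only: Daffodil records the start and end bit positions of an
// element's content in the infoset, so dfdl:contentLength in bytes is just
// the difference divided by 8 -- the blob file itself is never read.
def contentLengthInBytes(startBitPos0b: Long, endBitPos0b: Long): Long = {
  val bits = endBitPos0b - startBitPos0b
  require(bits >= 0 && bits % 8 == 0, "blob content must be byte-aligned")
  bits / 8
}
```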

Minimal changes may be necessary to make the expression language aware of the xs:anyURI type and act accordingly.

...

  1. Use the new API to specify a temp directory for BLOBs to be stored
      
  2. Perform type aware comparisons for the xs:anyURI type, similar to what we do now for xs:date, xs:dateTime, and xs:time. Type awareness will be enabled by using the xsi:type attribute on the expected infoset, since Daffodil does not currently support adding xsi:type information to the actual infoset. An example looks something like:

    Code Block
    languagexml
    <tdml:dfdlInfoset>
      <data xsi:type="xs:anyURI">path/to/blob/data</data>
    </tdml:dfdlInfoset>

    During type aware comparisons, the TDML Runner will treat the path as not absolute, find the file using logic similar to that for the type="file" attribute on expected infosets, and convert it to an absolute URI. Once the expected file is found, it will compare the contents of that file with the contents of the URI specified in the actual infoset and report any differences as usual.
      

  3. After a test completes, delete all BLOB files listed in the InfosetOutputter

Command Line Interface

A new -b/--blob-dir option will be added to specify a custom blob directory, defaulting to "daffodil-blobs" in the current working directory if not specified. The directory should only be created when blobs are created.

The CLI will never delete blob files.