Some document formats are stored as packages: collections of files, typically contained in a zip file. Each format uses its own file extension, for example .docx, .xlsx, and .pptx in OOXML, and .odt, .ods, and .odp in ODF.

Similarly, HTML documents rely on multiple files: images, stylesheets, and scripts are generally separate from the main content of the document, and are referenced from particular elements and attributes in a .html file. Unlike the formats above, however, they are generally stored in directories on the filesystem. The original reason is that when the directory is served by a web server, clients can retrieve each resource independently.

During conversion, editing, and other processes, we often want to keep data in memory only, rather than in a zip file or a directory. An in-memory representation of a document typically consists of multiple named streams of data, corresponding to the files or zip entries that came from or are destined for the filesystem.

The purpose of DFStorage is to provide an abstraction layer over all three ways of storing data: zip files, directories, and memory. It acts as a virtual filesystem layer, with the usual functions for reading, writing, and deleting files, and for obtaining a listing.

The DFStorage API differs from a typical filesystem API such as POSIX in two key ways. The first is that there is no inherent concept of directories: all files are stored in a "flat" collection, although they are identified by path names, which can contain slashes. A directory hierarchy exposed via DFStorage appears as a single list of all descendants of the storage root, with path names indicating each file's location relative to that root. Similarly, for a zip file the list corresponds to the central directory listing.
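The flat naming scheme can be illustrated with a short Python sketch. It is not part of DFStorage; the function names and the sample file names are assumptions chosen for illustration. It lists the same three files once from a directory tree and once from a zip archive, and both come out as the same flat list of slash-separated paths.

```python
import io
import tempfile
import zipfile
from pathlib import Path
from typing import List, Tuple


def list_directory_flat(root: Path) -> List[str]:
    """List every file below root as a slash-separated path relative to root."""
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


def list_zip_flat(data: bytes) -> List[str]:
    """List the file entries recorded in a zip archive's central directory."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        # Names ending in "/" are explicit directory entries, not files.
        return sorted(n for n in archive.namelist() if not n.endswith("/"))


def demo() -> Tuple[List[str], List[str]]:
    """Store the same files as a directory and as a zip, then list both."""
    files = {
        "content.xml": b"<content/>",
        "styles.xml": b"<styles/>",
        "Pictures/image1.png": b"\x89PNG",
    }

    # Directory form: intermediate directories exist on disk but are not listed.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name, data in files.items():
            target = root / name
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(data)
        from_directory = list_directory_flat(root)

    # Zip form: entry names already are flat, slash-separated paths.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    from_zip = list_zip_flat(buffer.getvalue())

    return from_directory, from_zip
```

In this sketch the `Pictures` directory never appears as an item of its own; it survives only as a prefix in the path name of the file it contains.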

The second difference is that read and write operations are done on entire files; there is no streaming option. This is primarily for convenience of implementation, and because in this project we are generally dealing with relatively small amounts of data that easily fit in memory. At most you might have an image consuming several megabytes, which presents no problem for modern hardware. Thus a read operation returns the entire content of a file, and a write operation requires the entire content to be supplied in one go.
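As a rough illustration of what whole-file semantics mean for calling code, here is a minimal Python sketch. The `WholeFileStore` class and the `append_text` helper are hypothetical and are not DFStorage's own names. The point is that even a small change becomes a full read, an edit in memory, and a full write.

```python
from typing import Dict, Optional


class WholeFileStore:
    """Hypothetical in-memory store where each path maps to one complete blob."""

    def __init__(self) -> None:
        self._files: Dict[str, bytes] = {}

    def read(self, path: str) -> Optional[bytes]:
        """Return the entire content of path, or None if it does not exist."""
        return self._files.get(path)

    def write(self, path: str, data: bytes) -> None:
        """Replace the entire content of path; there is no append or seek."""
        self._files[path] = bytes(data)


def append_text(store: WholeFileStore, path: str, text: str) -> None:
    """Append by reading everything, editing in memory, and writing it all back."""
    current = store.read(path) or b""
    store.write(path, current + text.encode("utf-8"))
```

The trade-off is the one described above: the interface stays very small, at the cost of holding each file fully in memory while it is being read or written.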

DFStorage is effectively an interface with several separate implementations, realised through a function table. Multiple constructor functions are available, one for each implementation. A constructor function initialises the object with the appropriate function table, and every publicly exposed function simply calls through to the corresponding entry in that table.
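The function-table pattern can be sketched as follows. Python is used here purely for illustration, and every name (`StorageOps`, `Storage`, `new_memory_storage`, `storage_read`, and so on) is an assumption rather than DFStorage's actual API. Only memory and directory backends are shown; a zip backend would supply a third table in the same way.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Callable, List, Optional


@dataclass(frozen=True)
class StorageOps:
    """Function table: one entry per operation a backend must provide."""
    read: Callable[[Any, str], Optional[bytes]]
    write: Callable[[Any, str, bytes], None]
    delete: Callable[[Any, str], bool]
    list_paths: Callable[[Any], List[str]]


@dataclass
class Storage:
    ops: StorageOps  # which implementation to dispatch to
    state: Any       # backend-specific data: a dict or a root directory


# Memory backend: state is a dict mapping path names to bytes.
def _mem_read(state, path):
    return state.get(path)

def _mem_write(state, path, data):
    state[path] = bytes(data)

def _mem_delete(state, path):
    return state.pop(path, None) is not None

def _mem_list(state):
    return sorted(state)


# Directory backend: state is the Path of the storage root.
def _dir_read(root, path):
    target = root / path
    return target.read_bytes() if target.is_file() else None

def _dir_write(root, path, data):
    target = root / path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)

def _dir_delete(root, path):
    target = root / path
    if not target.is_file():
        return False
    target.unlink()
    return True

def _dir_list(root):
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


MEMORY_OPS = StorageOps(_mem_read, _mem_write, _mem_delete, _mem_list)
DIRECTORY_OPS = StorageOps(_dir_read, _dir_write, _dir_delete, _dir_list)


# Constructors: each one pairs a function table with its initial state.
def new_memory_storage() -> Storage:
    return Storage(MEMORY_OPS, {})

def new_directory_storage(root) -> Storage:
    return Storage(DIRECTORY_OPS, Path(root))


# Public API: every call simply forwards to the table entry.
def storage_read(storage: Storage, path: str) -> Optional[bytes]:
    return storage.ops.read(storage.state, path)

def storage_write(storage: Storage, path: str, data: bytes) -> None:
    storage.ops.write(storage.state, path, data)

def storage_delete(storage: Storage, path: str) -> bool:
    return storage.ops.delete(storage.state, path)

def storage_list(storage: Storage) -> List[str]:
    return storage.ops.list_paths(storage.state)
```

Code written against the four `storage_*` functions does not need to know which constructor produced the object, which is the property the function table is meant to provide.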
