Overview
The file management component of a Catalog and Archive Service (CAS) should provide everything that you need to catalog, archive, and manage files and directories, along with their associated metadata. This capability is largely provided by the existing CAS; however, several capabilities need to be extended, enhanced, or completely rethought and re-engineered to support current and upcoming missions and projects. These capabilities fall into several areas, including:
Desired Capabilities
- Persisting archived files using dynamic metadata and flexible, adaptable policies based on product types, rather than the monolithic and inflexible existing method of ProductTypeRepository/ProductName/ProductVersion/ as the filesystem location to store products for all product types.
- Clearly separating out the Workflow aspects of the File Manager from Product ingestion, and flexibly supporting association of Workflows and their subsequent Tasks with any event, not only ingestion.
- Supporting the separation of Metadata Stores and Data Stores: being able to persist Product and File information to potentially many data stores while, at the same time, persisting Product metadata to potentially many metadata stores (such as Lucene), rather than only supporting database systems.
- Being able to select the most appropriate method of data movement for ingestion and distribution purposes, potentially even using several of them in tandem, rather than the existing approach of selecting a particular middleware implementation technology and then being tied to that medium for data transfer and communication. This applies especially to RMI and CORBA, which are old and all-around ugly. Let's agree that HTTP/REST, SOAP, WebDAV, and XML-RPC are lightweight, pervasive, standard, and all-around cool enough to be the mediums we'll consider.
- Leveraging existing transactional models such as the Java Transaction API (JTA) to support transaction management, rather than building our own API.
- If we do use any database communication, making sure that all of it is handled through a standard, existing connection-pooling API such as commons-dbcp, available from Apache.
- Separating out the Metadata Element Registry aspects of the existing CAS (such as Element Policy management), and moving those pieces to standard or available Metadata Registries (cough, cough: aren't we developing one of those?).
- Clearly separating out the administrative portions of policy management from the existing webapp, and distinguishing what pieces of the webapp are user-centric, and what are administrative-centric.
- Supporting facet-based search and free-text based search from the File Management webapp.
- Supporting RSS-based syndication of Product feeds, and creating RSS channels for new Product Types, for Products overall, etc.
- Supporting hierarchical product structures, such as nested directories that contain many sub-directories, and sub-directories of those sub-directories, with files strewn about at all levels, rather than only supporting the existing method of flat product structures, where all files in a product are at the same tree level.
- Supporting metadata extraction based on product type or MIME type. This could leverage the same type of plugin architecture that search engines use, which allows for greater flexibility and reuse. The plugins drive which metadata gets extracted, and the catalog is only responsible for persisting and searching it. This is the same approach that many search engines, including Nutch, take. One really good thing about this is that there is already a body of MIME-type plugins available to extract metadata from Word, PDF, MP3, JPEG, etc. files. Additionally, once a developer creates a plugin to extract metadata from, say, a PDS Label file, it can easily be reused.
- Supporting dynamic product types. The file management component should not need to know about every product type a priori. Minimally, a file-based metadata extractor can be associated with any ingestion so that standard metadata, such as filename, file size, and last-modified time, can be extracted. This would allow for greater flexibility, and we would finally be able to archive files with just a basic installation, without creating any additional policy.
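To make the first capability in the list more concrete, here is a minimal sketch of how an archive location might be derived from product metadata instead of the fixed ProductTypeRepository/ProductName/ProductVersion/ layout. Everything in it is an assumption made for illustration: the class name `MetadataPathPolicy`, the `[Key]` template syntax, and the idea of one template per product type are hypothetical and are not part of the existing CAS.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical sketch: builds an archive path from a per-product-type
 * template whose [Key] placeholders are filled in from product metadata.
 */
final class MetadataPathPolicy {
    // Matches placeholders such as [ProductType] or [Year].
    private static final Pattern KEY = Pattern.compile("\\[([A-Za-z0-9_]+)\\]");

    private final Map<String, String> templatesByProductType;
    private final String defaultTemplate;

    MetadataPathPolicy(Map<String, String> templatesByProductType, String defaultTemplate) {
        this.templatesByProductType = templatesByProductType;
        this.defaultTemplate = defaultTemplate;
    }

    /** Returns the relative archive path for a product of the given type. */
    String resolve(String productType, Map<String, String> metadata) {
        // Product types without their own template fall back to the default.
        String template = templatesByProductType.getOrDefault(productType, defaultTemplate);
        Matcher m = KEY.matcher(template);
        StringBuilder path = new StringBuilder();
        while (m.find()) {
            String value = metadata.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException("Missing metadata key: " + m.group(1));
            }
            // quoteReplacement keeps '$' and '\' in metadata values literal.
            m.appendReplacement(path, Matcher.quoteReplacement(value));
        }
        m.appendTail(path);
        return path.toString();
    }
}
```

Under these assumptions, a product type configured with the template `[ProductType]/[Year]/[ProductName]` would be stored under a path built from those three metadata values, while a default template of `[ProductType]/[ProductName]/[ProductVersion]` would reproduce today's layout for every other type.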
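The last two capabilities (plugin-based extraction and dynamic product types) could fit together roughly as sketched below. This is only an illustration under stated assumptions: the `MetadataExtractor` interface, the `FileMetadataExtractor` fallback, the `ExtractorRegistry` class, and the metadata key names are all hypothetical, and the sketch uses only the standard `java.nio.file` API rather than any existing CAS or Nutch code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical plugin contract: turns a file into metadata key/value pairs. */
interface MetadataExtractor {
    Map<String, String> extract(Path file) throws IOException;
}

/** Basic file-based extractor that works for any file, with no policy required. */
final class FileMetadataExtractor implements MetadataExtractor {
    @Override
    public Map<String, String> extract(Path file) throws IOException {
        Map<String, String> met = new LinkedHashMap<>();
        met.put("Filename", file.getFileName().toString());
        met.put("FileSize", Long.toString(Files.size(file)));
        met.put("LastModified", Files.getLastModifiedTime(file).toInstant().toString());
        return met;
    }
}

/** Hypothetical registry that picks an extractor plugin by MIME type. */
final class ExtractorRegistry {
    private final Map<String, MetadataExtractor> byMimeType = new ConcurrentHashMap<>();
    private final MetadataExtractor fallback = new FileMetadataExtractor();

    void register(String mimeType, MetadataExtractor extractor) {
        byMimeType.put(mimeType, extractor);
    }

    /**
     * Always extracts the basic file metadata, then adds whatever the
     * MIME-type plugin (if one is registered) contributes on top.
     */
    Map<String, String> extract(String mimeType, Path file) throws IOException {
        Map<String, String> met = new LinkedHashMap<>(fallback.extract(file));
        MetadataExtractor plugin = (mimeType == null) ? null : byMimeType.get(mimeType);
        if (plugin != null) {
            met.putAll(plugin.extract(file));
        }
        return met;
    }
}
```

With this shape, a file of an unknown type can still be ingested using only the fallback metadata, and a plugin written once (for example, for PDS Label files) can be registered under its MIME type and reused by any installation. The catalog side only has to persist and search the resulting map.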