Mime type detection with the AutoDetectProductCrawler

Introduction

The purpose of this page is to discuss technical details relating to how a File Manager may obtain the mime type of a particular file that is about to be ingested when an automatic product detection Crawler is being used.

This article assumes that you have a working knowledge of the File Manager and the Crawler.

Components and Component Configuration

The components involved are the File Manager and the Crawler.

I'm ingesting HDF5 files (extension .h5) as products into my archive.

The File Manager is set up and configured to ingest HDF5 files and knows about the expected metadata for the catalogue.

The Crawler is in "automatic product detection mode". This is set up by setting the CrawlerID to AutoDetectProductCrawler.

The Crawler policy/mimetypes.xml and policy/mime-extractor-map.xml files contain the information that the Auto Detect Product Crawler uses to detect the file type from a regex pattern on the file name, so that it can run the correct metadata extractor. In my case I want to detect files that are named something like 1234567890.h5 and run an external extractor (actually some Python code). The external extractor is configured in the katfile.config file. I've also set a precondition to make sure that the file size is greater than zero.

policy/mimetypes.xml

...
<mime-info>
	<mime-type type="product/hdf5">
		<glob pattern="\d{10}\.h5$" isregex="true"/>
	</mime-type>
</mime-info>
...
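
To see which file names the pattern above would pick up, here is a minimal sketch using java.util.regex. The class and method names are hypothetical, and it assumes the pattern is searched for within the name (find) rather than matched against the whole string; the matching actually performed by the Crawler's mime type repository may differ.

```java
import java.util.regex.Pattern;

final class H5NamePatternSketch {
    // Same regular expression as the glob pattern in policy/mimetypes.xml.
    private static final Pattern H5_NAME = Pattern.compile("\\d{10}\\.h5$");

    // Assumption: the pattern is searched for inside the name (find), so a
    // full path ending in ten digits plus ".h5" also counts as a match.
    static boolean looksLikeHdf5Product(String fileName) {
        return H5_NAME.matcher(fileName).find();
    }

    public static void main(String[] args) {
        String[] names = {"1329472755.h5", "/var/kat/data/1329472755.h5",
                          "1329472755.h5.tmp", "notes.h5"};
        for (String name : names) {
            System.out.println(name + " -> " + looksLikeHdf5Product(name));
        }
    }
}
```

Under this find assumption, a name with more than ten digits before .h5 would also match, because the pattern has no start anchor.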

policy/mime-extractor-map.xml

...
<cas:mimetypemap xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas" magic="true or false" mimeRepo="/var/kat/katconfig/static/oodt/cas-crawler/policy/mimetypes.xml">
	<mime type="product/hdf5">
		<extractor class="org.apache.oodt.cas.metadata.extractors.ExternMetExtractor">
			<config file="/var/kat/katconfig/static/oodt/cas-extractors/katfile/katfile.config"/>
			<preCondComparators>
				<preCondComparator id="CheckThatDataFileSizeIsGreaterThanZero"/>
			</preCondComparators>
		</extractor>
	</mime>
</cas:mimetypemap>
...
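
The precondition referenced above only has to establish that the data file is not empty. A minimal sketch of that kind of check in plain Java follows; the class and helper names are hypothetical and this is not the comparator that the Crawler actually loads for the id CheckThatDataFileSizeIsGreaterThanZero.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class NonEmptyFileCheckSketch {
    // True when the path is a regular file holding at least one byte.
    // Assumption: a file that cannot be read is treated as failing the check.
    static boolean isNonEmptyFile(Path file) {
        try {
            return Files.isRegularFile(file) && Files.size(file) > 0;
        } catch (IOException e) {
            return false;
        }
    }

    // Demo helper: writes the given bytes to a temporary .h5 file.
    static Path writeTempFile(byte[] content) {
        try {
            Path tmp = Files.createTempFile("1329472755", ".h5");
            tmp.toFile().deleteOnExit();
            return Files.write(tmp, content);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isNonEmptyFile(writeTempFile(new byte[0])));
        System.out.println(isNonEmptyFile(writeTempFile(new byte[] {1, 2, 3})));
    }
}
```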

Process Steps

The Crawler extracts metadata from the product and produces a <name>.met file.

The mime type of the file is then detected in org.apache.oodt.cas.filemgr.structs.Reference.java. The relevant code snippet is listed below. The detection itself is performed by the Tika library.

org.apache.oodt.cas.filemgr.structs.Reference.java

        ...
        try {
            this.mimeType = mimeTypeRepository
                    .getMimeType(new URL(origRef));
        } catch (MalformedURLException e) {
            e.printStackTrace();
        }
        ...
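
Note that the snippet above builds a java.net.URL from the original reference, so the reference has to carry a scheme such as file:. If it does not, the constructor throws MalformedURLException and the assignment to mimeType is skipped. A minimal sketch of that behaviour, with a hypothetical helper name and no OODT or Tika classes involved:

```java
import java.net.MalformedURLException;
import java.net.URL;

final class OrigRefUrlSketch {
    // Hypothetical helper with the same shape as the snippet above: returns
    // null in the case where the original code would only print a stack
    // trace and skip the mime type assignment.
    static URL toUrlOrNull(String origRef) {
        try {
            return new URL(origRef);
        } catch (MalformedURLException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // A reference with a scheme parses; a bare path has no protocol.
        System.out.println(toUrlOrNull("file:/var/kat/data/1329472755.h5"));
        System.out.println(toUrlOrNull("/var/kat/data/1329472755.h5"));
    }
}
```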

The Crawler then executes an XML-RPC ingestProduct method call on the File Manager XML-RPC interface. I've captured the methodCall and here is the mime-type member that is passed to the File Manager on a successful ingest:

<member>
    <name>references</name>
       ...
                        <member>
                            <name>mimeType</name>
                            <value>application/octet-stream</value>
                        </member>
                        <member>
                            <name>origReference</name>
                            <value>file:/var/kat/data/1329472755.h5</value>
                        </member>
       ...
</member>

Conclusions

For the Auto Detect Product Crawler, the type of a product file is detected separately in two different places:

1. By the Crawler, using the patterns defined in the policy files, to select the metadata extractor.
2. By Tika, to set the mime type that is passed to the File Manager.

In my case, the current version of Tika does not correctly identify *.h5 files, so it sets the mime-type to application/octet-stream.

This issue will be resolved once JIRA https://issues.apache.org/jira/browse/OODT-385 has been completed.
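
The mismatch between the two detection paths can be illustrated with a minimal, self-contained sketch. Everything in it is an assumption made for illustration: the class and method names are hypothetical, the regex is applied with find, and a small extension table stands in for a generic detector that has no entry for .h5. It does not call Tika or any OODT class.

```java
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

final class TwoStageDetectionSketch {
    static final String FALLBACK = "application/octet-stream";

    // Path 1: the pattern from policy/mimetypes.xml.
    private static final Pattern POLICY_H5 = Pattern.compile("\\d{10}\\.h5$");

    // Path 2: hypothetical extension table with no .h5 entry, standing in
    // for a generic detector that does not know about HDF5 files.
    private static final Map<String, String> GENERIC_TYPES =
            Map.of("txt", "text/plain", "xml", "application/xml");

    // Type used to choose the metadata extractor.
    // Assumption: names that do not match get the fallback type.
    static String detectForExtraction(String name) {
        return POLICY_H5.matcher(name).find() ? "product/hdf5" : FALLBACK;
    }

    // Type reported to the File Manager in this sketch.
    static String detectForIngest(String name) {
        int dot = name.lastIndexOf('.');
        String ext = dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
        return GENERIC_TYPES.getOrDefault(ext, FALLBACK);
    }

    public static void main(String[] args) {
        String name = "1329472755.h5";
        System.out.println("extraction: " + detectForExtraction(name));
        System.out.println("ingest:     " + detectForIngest(name));
    }
}
```

For the same file name the first path yields product/hdf5 and the second falls back to application/octet-stream, which mirrors the value seen in the captured ingestProduct call.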
