MetExtractors for Crawler

In progress

This is a proposal for new classes to aid in the development of MetExtractors for the Crawler (client-side met extraction).

Use Case 1: Ingesting files with information in the filename.

Suppose I have files in the staging area ready to be ingested. These files usually have information encoded in the filename to distinguish the contents of one file from another. For example, book-1234567890.txt might hold the contents of the book with ISBN 1234567890, and page-1234567890-12.txt might hold the text of page 12 of that book.

ProdTypePatternMetExtractor

It would be useful to generate metadata from the information encoded in the filename (think: filename => metadata). The ProdTypePatternMetExtractor allows this in a flexible manner using regular expressions. Let's take a look at the config file for this met extractor.

prod-type-patterns.xml

<config>

  <!-- <element> MUST be defined before <product-type> so their patterns can be resolved -->
  <!-- name MUST be an element defined in elements.xml (also only upper and lower case alpha chars) -->
  <!-- regexp MUST be valid input to java.util.regex.Pattern.compile() -->
  <element name="ISBN" regexp="[0-9]{10}"/>
  <element name="Page" regexp="[0-9]*"/>
  
  <!-- name MUST be a ProductType name defined in product-types.xml -->
  <!-- metadata elements inside brackets MUST be mapped to the ProductType, as defined in product-type-element-map.xml -->
  <product-type name="Book" template="book-[ISBN].txt"/>
  <product-type name="BookPage" template="page-[ISBN]-[Page].txt"/>
  
</config>

This file defines a regular expression for the "ISBN" metadata element: in this case, a 10-digit number. The "Page" metadata element is defined as a sequence of zero or more digits, so an empty string also matches.

Next, the file defines a filename pattern for the "Book" product type. The pattern is compiled into a regular expression, substituting the previously defined regexes as capture groups. For example, "book-[ISBN].txt" compiles to "book-([0-9]{10}).txt", and the ISBN met element is assigned to capture group 1. When a filename matches this pattern, two metadata assignments occur: (1) the ISBN met element is set to the text matched by capture group 1, and (2) the ProductType met element is set to "Book".

Similarly, the second pattern sets ISBN, Page, and ProductType for files matching "page-([0-9]{10})-([0-9]*).txt".
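
To make the compile-and-match step concrete, here is a minimal sketch of how a template could be turned into a regex and applied to a filename. It is not the actual ProdTypePatternMetExtractor code: the class name TemplateSketch and its extract method are hypothetical, and the sketch assumes the literal parts of the template are copied into the regex unchanged (so the "." in ".txt" matches any character, as in the compiled form quoted above).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not an OODT class.
class TemplateSketch {
    // Element names are upper and lower case alpha chars only, inside brackets.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\[([A-Za-z]+)\\]");

    final String productType;
    final Pattern pattern;
    // groupNames.get(i) is the met element assigned to capture group i + 1.
    final List<String> groupNames = new ArrayList<>();

    TemplateSketch(String productType, String template, Map<String, String> elements) {
        this.productType = productType;
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuilder regex = new StringBuilder();
        int last = 0;
        while (m.find()) {
            String name = m.group(1);
            String elementRegex = elements.get(name);
            if (elementRegex == null) {
                throw new IllegalArgumentException("Undefined element: " + name);
            }
            // Copy the literal text before the placeholder as-is (assumption).
            regex.append(template, last, m.start());
            // Wrap the element regex in a capture group. This assumes the
            // element regex has no capture groups of its own.
            regex.append('(').append(elementRegex).append(')');
            groupNames.add(name);
            last = m.end();
        }
        regex.append(template.substring(last));
        this.pattern = Pattern.compile(regex.toString());
    }

    // Returns the extracted metadata, or null if the filename does not match.
    Map<String, String> extract(String filename) {
        Matcher m = pattern.matcher(filename);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> met = new LinkedHashMap<>();
        for (int i = 0; i < groupNames.size(); i++) {
            met.put(groupNames.get(i), m.group(i + 1));
        }
        met.put("ProductType", productType);
        return met;
    }

    public static void main(String[] args) {
        Map<String, String> elements = new LinkedHashMap<>();
        elements.put("ISBN", "[0-9]{10}");
        elements.put("Page", "[0-9]*");
        TemplateSketch page = new TemplateSketch("BookPage", "page-[ISBN]-[Page].txt", elements);
        // prints {ISBN=1234567890, Page=12, ProductType=BookPage}
        System.out.println(page.extract("page-1234567890-12.txt"));
    }
}
```

A real extractor would try each configured template in turn and use the first one that matches; that loop is left out here to keep the sketch short.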

This achieves several things:

Assigns met elements based on regular expressions
Assigns the product type based on an easy-to-understand pattern with met elements clearly indicated
Reuses met element regexes across patterns

Differences from FilenameTokenMetExtractor:

Allows dynamic length metadata (does not rely on offset and length of metadata)
Assigns ProductType

Differences from AutoDetectProductCrawler:

Does not require defining a custom MIME type and a MIME-type regex. The goal is simply to assign a ProductType directly, rather than indirectly through a custom MIME type that maps to a ProductType.

Differences from FilenameRegexMetExtractor:

Assigns ProductType. FilenameRegexMetExtractor runs after ProductType is already determined.
Runs on the client-side (crawler). FilenameRegexMetExtractor runs on the server-side (filemgr).
Different patterns for different ProductTypes. FilenameRegexMetExtractor config applies the same pattern to all files.

Prerequisites:

<element> tag occurs before <product-type> tag
<element> @name attribute MUST be defined in FileManager policy elements.xml
<element> @regexp attribute MUST be valid input to java.util.regex.Pattern.compile()
<product-type> @name attribute MUST be a ProductType name (not ID) defined in product-types.xml
met elements used in <product-type> @template attribute MUST be mapped to the ProductType, as defined in product-type-element-map.xml

Words of Caution

Nested met elements are not supported. Supporting them would probably require assigning capture groups to met elements explicitly, at the cost of reusable element regexes. Maybe something like this?
```
<product-type name="PH" regexp="(([0-9]{3})[0-9]{7})" group1="PhoneNumber" group2="AreaCode"/>
```
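
The group1/group2 idea relies on how java.util.regex numbers nested capture groups: groups are counted by their opening parenthesis, left to right, so the outer group is 1 and the inner group is 2. A small sketch of that behavior follows; the class and method names are hypothetical and the attribute syntax above is still only a suggestion.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of nested capture group numbering.
class NestedGroupSketch {
    private static final Pattern PHONE = Pattern.compile("(([0-9]{3})[0-9]{7})");

    // Returns {PhoneNumber, AreaCode}, or null if the value does not match.
    static String[] phoneParts(String value) {
        Matcher m = PHONE.matcher(value);
        if (!m.matches()) {
            return null;
        }
        // Group 1 is the outer group (whole number), group 2 the inner one.
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = phoneParts("8185551234");
        System.out.println("PhoneNumber=" + parts[0] + " AreaCode=" + parts[1]);
    }
}
```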

Each pattern should map to one product type. Watch out for similar patterns. Don't do this:

  <element name="Page" regexp="[0-9]*"/>
  <element name="Chapter" regexp="[0-9]*"/>

  <product-type name="Page" template="data-[Page].txt"/>
  <product-type name="Chapter" template="data-[Chapter].txt"/>

Instead, encode the product type information into the filename, for example:

  <element name="Page" regexp="[0-9]*"/>
  <element name="Chapter" regexp="[0-9]*"/>

  <product-type name="Page" template="page-[Page].txt"/>
  <product-type name="Chapter" template="chapter-[Chapter].txt"/>
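
To see why the "data-" templates collide, note that both expand to the same regex once the element patterns are substituted, so a single filename matches more than one product type. The sketch below checks this; AmbiguitySketch, toRegex and countMatches are hypothetical names, and the expansion is a simplified stand-in for whatever the extractor really does.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical check for templates that match the same filename.
class AmbiguitySketch {
    // Expand each [Name] placeholder into a capture group (simplified).
    static String toRegex(String template, Map<String, String> elements) {
        String regex = template;
        for (Map.Entry<String, String> e : elements.entrySet()) {
            regex = regex.replace("[" + e.getKey() + "]", "(" + e.getValue() + ")");
        }
        return regex;
    }

    // Counts how many templates fully match the given filename.
    static int countMatches(String filename, List<String> templates, Map<String, String> elements) {
        int count = 0;
        for (String template : templates) {
            if (Pattern.matches(toRegex(template, elements), filename)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Map<String, String> elements = new LinkedHashMap<>();
        elements.put("Page", "[0-9]*");
        elements.put("Chapter", "[0-9]*");
        List<String> ambiguous = Arrays.asList("data-[Page].txt", "data-[Chapter].txt");
        List<String> distinct = Arrays.asList("page-[Page].txt", "chapter-[Chapter].txt");
        System.out.println(countMatches("data-12.txt", ambiguous, elements)); // prints 2
        System.out.println(countMatches("page-12.txt", distinct, elements));  // prints 1
    }
}
```

A check like this could be run over the config at load time to warn about overlapping templates, though it only catches overlaps for the sample filenames it is given.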

Use Case 2: Extracting Metadata in PGE tasks

It is a common use case to ingest the files output by a PGE task, and at the same time generate/extract metadata. PGE tasks use PcsMetFileWriter subclasses to generate a metadata file before ingesting the file+metadata. We should be able to reuse CmdLineMetExtractors (crawler met extractors) in PGE tasks. To accomplish this, we create a generic PcsMetFileWriter wrapper that invokes CmdLineMetExtractors with their accompanying config file.

Is this obsolete? I was looking for "FilenameExtractorWriter" and "PcsMetFileWriter", and they are no longer in OODT. In fact, they last appeared in v0.3. Does the 0.7 PGE task somehow invoke the crawler for ingestion?
