In progress

This is a proposal for new classes to aid in the development of MetExtractors for the Crawler (client-side met extraction).

Use Case 1: Ingesting files with information in the filename.

Suppose I have files in the staging area ready to be ingested. These files usually have information encoded in the filename to distinguish the contents of one file from another. For example, book-1234567890.txt might hold the contents of a book with ISBN 1234567890, and page-1234567890-12.txt might hold the text of page 12 of the book with ISBN 1234567890.

ProdTypePatternMetExtractor

It would be useful to generate metadata from the information encoded in the filename (think: filename => metadata). The ProdTypePatternMetExtractor allows this in a flexible manner using regular expressions. Let's take a look at the config file for this met extractor.

prod-type-patterns.xml
<config>

  <!-- <element> MUST be defined before <product-type> so their patterns can be resolved -->
  <!-- name MUST be an element defined in elements.xml (also only upper and lower case alpha chars) -->
  <!-- regexp MUST be valid input to java.util.regex.Pattern.compile() -->
  <element name="ISBN" regexp="[0-9]{10}"/>
  <element name="Page" regexp="[0-9]*"/>
  
  <!-- name MUST be a ProductType name defined in product-types.xml -->
  <!-- metadata elements inside brackets MUST be mapped to the ProductType, as defined in product-type-element-map.xml -->
  <product-type name="Book" template="book-[ISBN].txt"/>
  <product-type name="BookPage" template="page-[ISBN]-[Page].txt"/>
  
</config>

This file defines a regular expression for the "ISBN" metadata element, in this case, a 10-digit number. Also, the "Page" metadata element is defined as a sequence of 0 or more digits.
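To make the two element definitions concrete, here is a minimal sketch that compiles them with java.util.regex.Pattern and tests a few candidate values. The class and method names (ElementRegexDemo, isIsbn, isPage) are hypothetical and exist only for illustration. Note that "[0-9]*" also accepts the empty string, because "*" means zero or more.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of OODT.
class ElementRegexDemo {
    // The two regexps from prod-type-patterns.xml above.
    static final Pattern ISBN = Pattern.compile("[0-9]{10}");
    static final Pattern PAGE = Pattern.compile("[0-9]*");

    // matches() requires the whole string to match, not just a prefix.
    static boolean isIsbn(String s) { return ISBN.matcher(s).matches(); }
    static boolean isPage(String s) { return PAGE.matcher(s).matches(); }

    public static void main(String[] args) {
        System.out.println(isIsbn("1234567890")); // exactly ten digits
        System.out.println(isIsbn("123456789"));  // nine digits: too short
        System.out.println(isPage("12"));         // one or more digits
        System.out.println(isPage(""));           // zero digits is also accepted
    }
}
```

If an empty page number should be rejected, "[0-9]+" (one or more digits) would be the stricter choice.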

Next, the file defines a filename pattern for the "Book" product type. The pattern is compiled into a regular expression, substituting the previously defined regexes as capture groups. For example, "book-[ISBN].txt" compiles to "book-([0-9]{10}).txt", and the ISBN met element is assigned to capture group 1. When the filename matches this pattern, two metadata assignments occur: (1) the ISBN met element is set to the text matched by capture group 1, and (2) the ProductType met element is set to "Book".

Similarly, the second pattern sets ISBN, Page, and ProductType for files matching "page-([0-9]{10})-([0-9]*).txt".
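The compilation step described above can be sketched as follows. This is an illustration under stated assumptions, not the extractor's actual code: the class TemplateSketch and its methods compile and extract are hypothetical names, element regexps are assumed to contain no capture groups of their own, and literal template text (including the "." in ".txt") is left unescaped so the output matches the compiled patterns quoted above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of template-to-regex compilation, not the OODT implementation.
class TemplateSketch {
    // A bracketed element reference such as [ISBN]; names are alphabetic only.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\[([A-Za-z]+)\\]");

    // Replaces each [Name] with "(" + regexp + ")" and records Name in capture-group order.
    static String compile(String template, Map<String, String> elements, List<String> groupNames) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer regex = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            String elementRegex = elements.get(name);
            if (elementRegex == null) {
                throw new IllegalArgumentException("Undefined element: " + name);
            }
            groupNames.add(name);
            // quoteReplacement keeps '$' and '\' in the element regexp literal.
            m.appendReplacement(regex, Matcher.quoteReplacement("(" + elementRegex + ")"));
        }
        m.appendTail(regex);
        return regex.toString();
    }

    // Returns the metadata for a matching filename, or null when the filename does not match.
    static Map<String, String> extract(String productType, String template,
                                       Map<String, String> elements, String filename) {
        List<String> groupNames = new ArrayList<>();
        Matcher m = Pattern.compile(compile(template, elements, groupNames)).matcher(filename);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> met = new LinkedHashMap<>();
        met.put("ProductType", productType);
        for (int i = 0; i < groupNames.size(); i++) {
            met.put(groupNames.get(i), m.group(i + 1)); // capture groups are 1-based
        }
        return met;
    }

    public static void main(String[] args) {
        Map<String, String> elements = new LinkedHashMap<>();
        elements.put("ISBN", "[0-9]{10}");
        elements.put("Page", "[0-9]*");
        System.out.println(extract("BookPage", "page-[ISBN]-[Page].txt", elements,
                "page-1234567890-12.txt"));
        // prints {ProductType=BookPage, ISBN=1234567890, Page=12}
    }
}
```

Because the "." is not escaped in this sketch, a name such as "book-1234567890xtxt" would also match; a stricter version could escape the literal parts of the template with Pattern.quote().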

This achieves several things:

  1. assigning met elements based on regular expressions
  2. assigning product type based on easy-to-understand pattern with met elements clearly indicated
  3. reuse of met element regexes

Differences from FilenameTokenMetExtractor:

  1. Allows dynamic length metadata (does not rely on offset and length of metadata)
  2. Assigns ProductType

Differences from AutoDetectProductCrawler:

  1. Does not require definition of a custom MIME type and MIME-type regex. Really, all you want is to assign a ProductType directly, rather than indirectly assigning a custom MIME type that maps to a ProductType.

Differences from FilenameRegexMetExtractor:

  1. Assigns ProductType. FilenameRegexMetExtractor runs after ProductType is already determined.
  2. Runs on the client-side (crawler). FilenameRegexMetExtractor runs on the server-side (filemgr).
  3. Different patterns for different ProductTypes. FilenameRegexMetExtractor config applies the same pattern to all files.

Prerequisites:

  1. <element> tags MUST occur before <product-type> tags
  2. <element> @name attribute MUST be defined in FileManager policy elements.xml
  3. <element> @regexp attribute MUST be valid input to java.util.regex.Pattern.compile()
  4. <product-type> @name attribute MUST be a ProductType name (not ID) defined in product-types.xml
  5. met elements used in <product-type> @template attribute MUST be mapped to the ProductType, as defined in product-type-element-map.xml
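Two of these prerequisites can be checked without access to the FileManager policy files: that an element name is purely alphabetic, and that its regexp is accepted by Pattern.compile(). The sketch below shows one way to do that; ElementCheck and problems are hypothetical names, and the checks against elements.xml, product-types.xml, and product-type-element-map.xml are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical pre-flight check for one <element> definition, not part of OODT.
class ElementCheck {
    // Returns a list of human-readable problems; an empty list means both checks passed.
    static List<String> problems(String name, String regexp) {
        List<String> out = new ArrayList<>();
        if (!name.matches("[A-Za-z]+")) {
            out.add("name must contain only alphabetic characters: " + name);
        }
        try {
            Pattern.compile(regexp);
        } catch (PatternSyntaxException e) {
            out.add("invalid regexp: " + e.getDescription());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(problems("ISBN", "[0-9]{10}")); // prints []
        System.out.println(problems("Page2", "[0-9"));     // reports two problems
    }
}
```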

Words of Caution

  • Does not support nested met elements. Supporting them would probably require assigning capture groups to met elements directly, but this loses the reuse of element regexes. Maybe something like this?

    <product-type name="PH" regexp="(([0-9]{3})[0-9]{7})" group1="PhoneNumber" group2="AreaCode"/>
    
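  • Each pattern should map to one product type. Watch out for similar patterns: both templates below compile to the same regex, "data-([0-9]*).txt", so a matching filename cannot be tied to a single product type. Don't do this: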

      <element name="Page" regexp="[0-9]*"/>
      <element name="Chapter" regexp="[0-9]*"/>
    
      <product-type name="Page" template="data-[Page].txt"/>
      <product-type name="Chapter" template="data-[Chapter].txt"/>
    

    Instead, encode the product type information into the filename, for example:

      <element name="Page" regexp="[0-9]*"/>
      <element name="Chapter" regexp="[0-9]*"/>
    
      <product-type name="Page" template="page-[Page].txt"/>
      <product-type name="Chapter" template="chapter-[Chapter].txt"/>
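
The overlap warned about in the second bullet can be detected by testing a sample filename against every compiled pattern and counting the hits. The sketch below assumes the templates have already been compiled to the regexes shown in the comments; AmbiguityCheck and matchingTypes are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical check for overlapping product-type patterns, not part of OODT.
class AmbiguityCheck {
    // Returns the names of all product types whose compiled pattern matches the filename.
    static List<String> matchingTypes(Map<String, String> compiledByType, String filename) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : compiledByType.entrySet()) {
            if (Pattern.matches(e.getValue(), filename)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // "data-[Page].txt" and "data-[Chapter].txt" compile to the same regex.
        Map<String, String> ambiguous = new LinkedHashMap<>();
        ambiguous.put("Page", "data-([0-9]*).txt");
        ambiguous.put("Chapter", "data-([0-9]*).txt");
        System.out.println(matchingTypes(ambiguous, "data-12.txt")); // prints [Page, Chapter]

        // Encoding the product type in the filename removes the overlap.
        Map<String, String> distinct = new LinkedHashMap<>();
        distinct.put("Page", "page-([0-9]*).txt");
        distinct.put("Chapter", "chapter-([0-9]*).txt");
        System.out.println(matchingTypes(distinct, "page-12.txt")); // prints [Page]
    }
}
```

A result with more than one entry signals that the templates need a distinguishing prefix or suffix.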
    

Use Case 2: Extracting Metadata in PGE tasks

It is a common use case to ingest the files output by a PGE task, and at the same time generate/extract metadata. PGE tasks use PcsMetFileWriter subclasses to generate a metadata file before ingesting the file+metadata. We should be able to reuse CmdLineMetExtractors (crawler met extractors) in PGE tasks. To accomplish this, we create a generic PcsMetFileWriter wrapper that invokes CmdLineMetExtractors with their accompanying config file.

Is this obsolete? I was looking for "FilenameExtractorWriter" and "PcsMetFileWriter", and they are no longer in OODT. In fact, they last appeared in v0.3. Does the 0.7 PGE task somehow invoke the crawler for ingestion?

3 Comments

  1. Ricky,

    I am a huge fan of doing the design (and documentation) before the code, and this is an excellent example of doing just that.

    I also think this could have saved Paul R. and me a ton of config work for our Moon Mapping Project, since this is a simpler alternative to declaring custom MIME types.

    Over in your Words of Caution section, I personally would imagine having reusable elements, so instead of this:

    <element name="Page" regexp="[0-9]*"/>
    <element name="Chapter" regexp="[0-9]*"/>
    
    <product-type name="Page" template="page-[Page].txt"/>
    <product-type name="Chapter" template="chapter-[Chapter].txt"/>
    

    I would favor this:

    <element name="Integer" regexp="[0-9]*"/>
    
    <product-type name="Page" template="page-[Integer].txt"/>
    <product-type name="Chapter" template="chapter-[Integer].txt"/>
    

    The best part is with the open flexibility I can do it either way.

    So last comment, do you have a 2nd Use Case in mind?

    Excellent work. I will be watching this page.

    1. Starting a Conversation with myself here

      So I re-read what I wrote earlier (like 1 minute ago) and I see now why you wouldn't want the reusable elements: they are in fact METADATA elements and should describe the data they contain, and Integer is far too vague to be useful metadata post-ingestion.

      So ignore my previous comment about the Integer thing, in retrospect my logic was flawed.

      C-