After the MetadataDiscussion page was created, Jukka Zitting offered an example of how to get to recursive metadata when parsing with an AutoDetectParser, and later updated that example with how to get both text and metadata for nested documents using the AutoDetectParser.
If you parse an archive (zip, tar, etc.) the parsed document contains other documents, and any of those documents could also be archives containing other documents, and so on. The example on this page shows you how to do the following:
- Set up the parse context so nested documents will be parsed.
- Wrap the AutoDetectParser so you can get the text and metadata for each nested document.
Here is the full source for Jukka's example for how to get access to nested metadata and document body text. This example writes the metadata and body text for each nested document to standard output. More details about how Jukka's example works are in subsections below.
Main from Jukka's Example
Setting up Recursive Parsing
The example starts by setting up recursive parsing. If you are parsing text files, word documents, etc. then you'll never notice if recursive parsing is enable or not. If you are parsing containers like zip files and tar.gz files, the only way to get the text for the files contained by the containers is to enable recursive parsing.
The way to enable recursive parsing is to create a ParseContext and add a parser to it as shown on the line
context.set(Parser.class, parser). This is the parser that will be used to parse any nested documents.
Parsing a File
The rest of the main function parses a file. The parser used to parse the root document is the same parser that was added to the ParseContext as the parser to use for nested documents.
Looking at the Tika API (http://tika.apache.org/0.7/api/), I don't see a DefaultHandler class or a TikaInputStream. In the place of DefaultHandler you could use BodyContentHandler, and in the place of TikaInputStream you could use FileInputStream.
Jukka's RecursiveMetadata Parser
The parse method is where you get access to the metadata and the body text. When the parser set in ParseContext is used to parse a nested document, a new Metadata object is created and passed to the parse method. Since the example put a RecursiveMetadataParser in the ParseContext, RecursiveMetadataParser's parse method is called. Before calling
super.parse, the metadata object is empty. After
super.parse returns, the metadata object contains all of the metadata the decorated parser found and
System.out.println(metadata) prints all of the metadata to standard output.
By creating a new BodyContentHandler and passing that to
super.parse, the text for each document is captured without mixing it with text from other documents.
Tracking how far down the Rabbit Hole you have gone
When using the code above, if you have a container format that contains another container, you may wish to keep track of where in the stack you are. To do that, you'd want code something like:
Surprise! Zips Have Text Too!
The great thing about AutoDetectParser is that it can parse and extract text from almost anything. In particular, it can parse zip, tar, tar.bz2, and other archives that contain documents. If you have a zip file with 100 text files in it, using Jukka's example code you can get the text and metadata for each file nested inside of the zip file. What you might not expect is that you also get metadata and body text for the zip file itself.
Maybe this doesn't surprise you at all. My first reaction when I saw both metadata AND text for the zip file itself was "What text could a zip file possibly have?" My naive assumption was that a zip file wouldn't contain any text, and my assumption was wrong.
I was thinking that a zip, tar, or other archive file was simply a container for other files, and so didn't have any text of its own. Tika looks at archives differently; Tika sees an archive as being like a directory in a file system, and the text for an archive is a list of the contents of the archive.
If you have a zip file that contains 100 text files, after using the code on this page to get the text and metadata for each file, you will get the text and metadata for 101 files: 100 text files, and 1 zip file. The text for the zip file will list the names for each of the 100 text files it contains.
If you aren't interested in seeing text and metadata for the zip file itself, you'll want to take a look at
metadata.get(Metadata.CONTENT_TYPE)) for each file Tika parses so you can skip the archives themselves. For a zip file, the content type is "application/zip".
Integration of the RecursiveParserWrapper into Tika
A RecursiveParserWrapper that is based on Jukka and Nick's example above was added to Tika as of 1.7.
The wrapper returns a list of Metadata objects – the first contains the metadata+content for the container document and the rest contain the metadata+content for each embedded document. The content of each document is stored in "X-TIKA:content", and the embedded document's location in the container document is stored in "X-TIKA:embedded_resource_path" (e.g. "embedded-1/embed1.zip/embed2.zip/embed3.pdf").
A downside to the wrapper is that it breaks the Tika goal of streaming output – the wrapper caches all metadata+content in memory. This wrapper must be used with care.
As of Tika 1.7, a JSONified view of this output was integrated into tika-app (the -J option) and tika-server ("/rmeta").
This format serves as the basis for the upcoming tika-eval module that will help with comparisons of the output of different versions of Tika or other content extractors.