Using ParseContext to Control Parsing

The ParseContext is used to configure parsing for a given file.

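The general use is:

ParseContext parseContext = new ParseContext();
// Register the instance under the class that parsers will use to look it up.
parseContext.set(MyClass.class, new MyClass());
parser.parse(inputStream, contentHandler, metadata, parseContext);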
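The pattern above registers an instance under a class, and a parser later asks the context for that class. The following is a minimal sketch of that idea only. `MiniContext` is a hypothetical, simplified stand-in written for illustration; it is not Tika's actual ParseContext implementation, and it uses nothing beyond the JDK.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy class-keyed registry (hypothetical; NOT Tika's real ParseContext). */
public class MiniContext {

    private final Map<Class<?>, Object> entries = new HashMap<>();

    /** Registers value under key; a later set with the same key replaces it. */
    public <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    /** Returns the instance registered under key, or null if none was set. */
    public <T> T get(Class<T> key) {
        return key.cast(entries.get(key));
    }

    /** Returns the registered instance, or defaultValue if none was set. */
    public <T> T get(Class<T> key, T defaultValue) {
        T value = get(key);
        return value != null ? value : defaultValue;
    }

    public static void main(String[] args) {
        MiniContext context = new MiniContext();
        StringBuilder settings = new StringBuilder("custom");
        context.set(StringBuilder.class, settings);

        // A consumer looks up by class and falls back when nothing is registered.
        System.out.println(context.get(StringBuilder.class) == settings);
        System.out.println(context.get(Integer.class, 7));
    }
}
```

Keying by class is what lets each parser pick up only the settings it understands and ignore everything else in the context.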

General

The following uses apply to several parsers:

  1. Handling embedded files
    1a. EmbeddedDocumentExtractor – to control how embedded files are handled, the user can specify a custom EmbeddedDocumentExtractor.
    1b. Parser – if the user does not pass in an EmbeddedDocumentExtractor, the parsers will look for a Parser.class in the ParseContext, and Tika will automatically build a ParsingEmbeddedDocumentExtractor based on that Parser.
    1c. NOTE: As of Tika 1.15, if the user specifies neither an EmbeddedDocumentExtractor.class nor a Parser.class, a ParsingEmbeddedDocumentExtractor with an AutoDetectParser is added automatically. Before Tika 1.15, Tika skipped embedded files in that case.

  2. XML parsing – Users can pass in their own XMLReader (SAX), SAXParser (SAX), SAXParserFactory (SAX) or DocumentBuilder (DOM). Parsers that rely on XML parsing will use these resources.

  3. PasswordProvider – If you know the password for a password-protected file, you can pass in a PasswordProvider via the ParseContext.

  4. ExecutorService – For parsers that use an ExecutorService, users can pass in their own ExecutorService.
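The four items above can be sketched together as one ParseContext setup. This is an illustrative sketch, assuming tika-core is on the classpath. The class names `ParseContextSetup` and `CountingExtractor`, the sample password and the pool size are hypothetical choices made for this example. The extractor here only counts embedded files; a real one would parse or store the stream.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.xml.parsers.SAXParserFactory;

import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.PasswordProvider;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class ParseContextSetup {

    /** Hypothetical extractor: counts embedded files instead of parsing them. */
    static class CountingExtractor implements EmbeddedDocumentExtractor {
        int seen = 0;

        @Override
        public boolean shouldParseEmbedded(Metadata metadata) {
            return true; // ask the parser to hand over every embedded file
        }

        @Override
        public void parseEmbedded(InputStream stream, ContentHandler handler,
                                  Metadata metadata, boolean outputHtml)
                throws SAXException, IOException {
            seen++; // a real extractor would parse or save the stream here
        }
    }

    /** Builds a context that carries one resource for each item in the list. */
    static ParseContext build(EmbeddedDocumentExtractor extractor,
                              final String password,
                              ExecutorService executor) {
        ParseContext context = new ParseContext();

        // Item 1: custom handling of embedded files.
        context.set(EmbeddedDocumentExtractor.class, extractor);

        // Item 2: a caller-configured XML resource (namespace-aware SAX factory).
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        context.set(SAXParserFactory.class, factory);

        // Item 3: password for password-protected files.
        context.set(PasswordProvider.class, new PasswordProvider() {
            @Override
            public String getPassword(Metadata metadata) {
                return password;
            }
        });

        // Item 4: caller-owned thread pool for parsers that use one.
        context.set(ExecutorService.class, executor);
        return context;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        ParseContext context = build(new CountingExtractor(), "s3cret", pool);
        // 'context' would be passed as the fourth argument of parser.parse(...).
        System.out.println(context.get(PasswordProvider.class) != null);
        pool.shutdown(); // in this sketch the caller owns the pool and closes it
    }
}
```

Building the context in one place keeps the per-file `parser.parse(...)` call unchanged; only the resources registered in the context vary.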

Parser Specific

  1. HtmlParser

  2. TesseractOCRParser

  3. PDFParser

  4. Microsoft Parser (as of Tika 1.15)
