Security Warning

NOTE: The tika-pipes modules in combination with tika-server open potential security vulnerabilities if you do not carefully limit access to tika-server.  If the tika-pipes modules are turned on, anyone with access to your tika-server has the read and write permissions of the tika-server process: they will be able to read data from, and forward the parsed results to, whatever you've configured (see, for example: https://en.wikipedia.org/wiki/Server-side_request_forgery).  The tika-pipes modules for tika-server are intended to be run in tightly controlled networks.

DO NOT use tika-pipes if your tika-server is exposed to the internet or if you do not carefully restrict access to tika-server.

Consider adding two-way TLS encryption to your client and server, a beta version of which is available in 2.4.0: TikaServer#SSL(Beta).

Overview

The tika-pipes modules enable fetching data from various sources, running the parse, and then emitting the output to various destinations.  These modules are built around the RecursiveParserWrapper output model (the -J option in tika-app and the /rmeta endpoint in tika-server-standard).  Users can specify the content format (text/html/body) and set limits (number of embedded files, maximum content length) via FetchEmitTuples.  Further, users can add Metadata Filters to select and modify the metadata that is extracted during the parse before the output is emitted.
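For example, a FetchEmitTuple that requests plain text and caps the content length and the number of embedded files might look like the following sketch (the top-level fields and the fetcher/emitter names fsf and fse match the /pipes examples below; the handlerConfig keys are assumptions and may vary by version):

{"fetcher":"fsf","fetchKey":"path2/myfile.pdf","emitter":"fse","emitKey":"path2/myfile.pdf","handlerConfig":{"type":"text","writeLimit":100000,"maxEmbeddedResources":10}}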

We still need to improve how dependencies are added.  Very few of the fetchers/emitters are bundled in tika-app or tika-server-standard.  For now, users can download the required jars from Maven Central; e.g., the S3Emitter is available at: https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-s3/2.1.0/tika-emitter-s3-2.1.0.jar

I JUST WANT EXAMPLES.  SHOW ME THE EXAMPLES!!!

See the tika-app examples below for fully worked examples of using tika-app to fetch from a local file share, parse the files and send the output to Solr or OpenSearch.

Fetchers

Fetchers allow users to specify sources of an InputStream and metadata for the parsing process.  Fetchers are currently enabled in all of the tika-server-standard endpoints and in the async option (-a) in tika-app.

With the exception of the FileSystemFetcher, users have to add the fetcher dependencies to their class path.

FileSystemFetcher

Class name: org.apache.tika.pipes.fetcher.fs.FileSystemFetcher

A FileSystemFetcher allows the user to specify a base directory in tika-config.xml and then at parse time, the user specifies the relative path for a file.  This class is included in tika-core and no external resources are required.

For example, a minimal tika-config.xml file for a FileSystemFetcher would be:

<properties>
  <fetchers>
    <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
      <params>
        <name>fsf</name>
        <basePath>/my/base/path1</basePath>
      </params>
    </fetcher>
  </fetchers>
</properties>

HttpFetcher

Class name: org.apache.tika.pipes.fetcher.http.HttpFetcher

The HttpFetcher requires that this dependency be on your class path: https://mvnrepository.com/artifact/org.apache.tika/tika-fetcher-http

<properties>
  <fetchers>
    <fetcher class="org.apache.tika.pipes.fetcher.http.HttpFetcher">
      <params>
        <name>http</name>
        <!-- these are optional; timeouts are all in milliseconds -->
        <authScheme></authScheme>
        <connectTimeout>30000</connectTimeout>
        <ntDomain></ntDomain>
        <password></password>
        <proxyHost></proxyHost>
        <proxyPort></proxyPort>
        <requestTimeout></requestTimeout>
        <socketTimeout></socketTimeout>
        <userName></userName>
      </params>
    </fetcher>
  </fetchers>
</properties>
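With this config, the fetchKey is the URL to retrieve.  For example, against tika-server's classic endpoints, usage might look like the following sketch (modeled on the header-based examples in the tika-server section below; the URL is illustrative):

curl -X PUT http://localhost:9998/tika --header "fetcherName: http" --header "fetchKey: https://example.com/some-doc.pdf"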


S3Fetcher

Class name: org.apache.tika.pipes.fetcher.s3.S3Fetcher

<properties>
  <fetchers>
    <fetcher class="org.apache.tika.pipes.fetcher.s3.S3Fetcher">
      <params>
        <name>s3f</name>
        <region>us-east-1</region>
        <bucket>my-bucket</bucket>
        <!-- either use the instance as a credential -->
        <credentialsProvider>instance</credentialsProvider>
        <!-- or use a profile -->
        <credentialsProvider>profile</credentialsProvider>
        <profile>myProfile</profile>
        
        <!-- whether to spool the s3 object to a local temp file
             before parsing. Default: true -->
        <spoolToTemp>true</spoolToTemp>

        <!-- these are all optional -->
        <!-- if your pipes iterator is working on a list of files under my-prefix -->
        <prefix>my-prefix</prefix>
        <!-- extract the s3 object user metadata and inject it into the Tika metadata -->
        <extractUserMetadata>false</extractUserMetadata>
        <!-- the s3 api sets a fairly low max. If you are running a heavily concurrent application, you
             may need to bump this. -->
        <maxConnections>100</maxConnections>
      </params>
    </fetcher>
  </fetchers>
</properties>
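With this config, the fetchKey is the object's key within my-bucket (under my-prefix, if set).  For example, against tika-server's classic endpoints, usage might look like this sketch (modeled on the header-based examples in the tika-server section below):

curl -X PUT http://localhost:9998/tika --header "fetcherName: s3f" --header "fetchKey: path/to/myfile.pdf"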

GCSFetcher

Class name: org.apache.tika.pipes.fetcher.gcs.GCSFetcher
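No sample config is shown here yet.  The following is a sketch modeled on the S3Fetcher config above; the parameter names (projectId, bucket, spoolToTemp) are assumptions and should be checked against the tika-fetcher-gcs module:

<properties>
  <fetchers>
    <fetcher class="org.apache.tika.pipes.fetcher.gcs.GCSFetcher">
      <params>
        <name>gcs</name>
        <!-- parameter names below are assumptions modeled on the S3Fetcher -->
        <projectId>my-project</projectId>
        <bucket>my-bucket</bucket>
        <spoolToTemp>true</spoolToTemp>
      </params>
    </fetcher>
  </fetchers>
</properties>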

AZBlobFetcher

Class name: org.apache.tika.pipes.fetcher.azblob.AZBlobFetcher
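No sample config is shown here yet.  The following is a sketch modeled on the other fetcher configs above; the parameter names (endpoint, sasToken, container, spoolToTemp) are assumptions and should be checked against the tika-fetcher-az-blob module:

<properties>
  <fetchers>
    <fetcher class="org.apache.tika.pipes.fetcher.azblob.AZBlobFetcher">
      <params>
        <name>az</name>
        <!-- parameter names below are assumptions -->
        <endpoint>https://myaccount.blob.core.windows.net</endpoint>
        <sasToken>my-sas-token</sasToken>
        <container>my-container</container>
        <spoolToTemp>true</spoolToTemp>
      </params>
    </fetcher>
  </fetchers>
</properties>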

MSGraphFetcher

Class name: org.apache.tika.pipes.fetchers.microsoftgraph.MSGraphFetcher

Introduced in: https://github.com/apache/tika/pull/1698


Emitters

The FileSystemEmitter requires the tika-serialization module and is therefore not included in tika-core; it is, however, bundled with tika-app and tika-server-standard.  For all other emitters, users have to add the emitter dependencies to their class path.

FileSystemEmitter

Class name: org.apache.tika.pipes.emitter.fs.FileSystemEmitter

A FileSystemEmitter allows the user to specify a base directory in tika-config.xml; at parse time, the user specifies the relative path for the emitted .json file.

For example, a minimal tika-config.xml file for a FileSystemEmitter would be:

<properties>
  <emitters>
    <emitter class="org.apache.tika.pipes.emitter.fs.FileSystemEmitter">
      <params>
        <name>fse</name>
        <basePath>/my/base/extracts</basePath>
      </params>
    </emitter>
  </emitters>
</properties>
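The emitted files get a ".json" extension by default (see the note in the /pipes section below).  If you need to override that, the emitter element plausibly looks like the following sketch; the fileExtension parameter name is an assumption:

<emitter class="org.apache.tika.pipes.emitter.fs.FileSystemEmitter">
  <params>
    <name>fse</name>
    <basePath>/my/base/extracts</basePath>
    <!-- optional; "json" is the default. The parameter name is an assumption -->
    <fileExtension>json</fileExtension>
  </params>
</emitter>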

S3Emitter

Class name: org.apache.tika.pipes.emitter.s3.S3Emitter
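No sample config is shown here yet.  The following is a sketch modeled on the S3Fetcher config above; the parameter names are assumptions and should be checked against the tika-emitter-s3 module (jar linked in the Overview):

<properties>
  <emitters>
    <emitter class="org.apache.tika.pipes.emitter.s3.S3Emitter">
      <params>
        <name>s3e</name>
        <!-- parameter names below are assumptions modeled on the S3Fetcher -->
        <region>us-east-1</region>
        <bucket>my-extracts-bucket</bucket>
        <credentialsProvider>profile</credentialsProvider>
        <profile>myProfile</profile>
      </params>
    </emitter>
  </emitters>
</properties>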

OpenSearchEmitter

Class name: org.apache.tika.pipes.emitter.opensearch.OpenSearchEmitter

SolrEmitter

Class name: org.apache.tika.pipes.emitter.solr.SolrEmitter

PipesIterators

tbd
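In the meantime, here is a sketch of the FileSystemPipesIterator that appears in the OpenSearch legacy-mode example below; the fetcherName and emitterName parameters are assumptions based on the fetcher/emitter names used elsewhere on this page:

<pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
  <params>
    <fetcherName>fsf</fetcherName>
    <emitterName>fse</emitterName>
    <basePath>/my/base/path1</basePath>
  </params>
</pipesIterator>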

tika-app examples

From FileShare to FileShare

Process all files in a directory recursively and place the .json extracts in a parallel directory structure.

N.B. For the logging to work correctly in the async pipes parser, you must use Tika >= 2.1.0.

  • Place the tika-app jar and any other dependencies in a bin/ directory
  • Unzip this file (fs-to-fs-config.tgz) and place the config/ directory at the same level as the bin/ directory in the previous step
  • Open config/tika-config-fs-to-fs.xml and update the <basePath/> elements in the fetcher and emitter sections to point to the absolute path of the root directory for the binary documents (fetcher) and the target root directory for the extracts (emitter). Update the <basePath/> element in the pipesiterator section so that it matches what you specified in the fetcher section; a sketch of such a config appears after this list.
  • Commandline:  java -Xmx512m -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-fs-to-fs.xml
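For reference, a config along these lines would plausibly look like the following (a sketch assembled from the fetcher, emitter and pipes iterator snippets elsewhere on this page; the actual file shipped in fs-to-fs-config.tgz may differ):

<properties>
  <fetchers>
    <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
      <params>
        <name>fsf</name>
        <basePath>/path/to/input</basePath>
      </params>
    </fetcher>
  </fetchers>
  <emitters>
    <emitter class="org.apache.tika.pipes.emitter.fs.FileSystemEmitter">
      <params>
        <name>fse</name>
        <basePath>/path/to/extracts</basePath>
      </params>
    </emitter>
  </emitters>
  <pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
    <params>
      <fetcherName>fsf</fetcherName>
      <emitterName>fse</emitterName>
      <basePath>/path/to/input</basePath>
    </params>
  </pipesIterator>
</properties>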

From file list on FileShare to FileShare

The input is a list of relative paths to files (e.g. file-list.txt) on a file share and the output is .json extract files on a file share. 

N.B. For the logging to work correctly in the async pipes parser, you must use Tika >= 2.1.0.

  • Place the tika-app jar and any other dependencies in a bin/ directory
  • Unzip this file (file-list-config.tgz) and place the config/ directory at the same level as the bin/ directory in the previous step and at the same level as file-list.txt
  • Open config/tika-config-filelist.xml and update the <basePath/> elements in the fetcher and emitter sections to point to the absolute path of the root directory for the binary documents (fetcher) and the target root directory for the extracts (emitter); a sketch of the file-list pipes iterator appears after this list.
  • Commandline:  java -Xmx512m -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-filelist.xml
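The pipes iterator in this config reads the relative paths from file-list.txt rather than walking a directory.  A sketch follows; the class and parameter names are assumptions, so check the config shipped in file-list-config.tgz:

<pipesIterator class="org.apache.tika.pipes.pipesiterator.filelist.FileListPipesIterator">
  <params>
    <fetcherName>fsf</fetcherName>
    <emitterName>fse</emitterName>
    <fileList>file-list.txt</fileList>
  </params>
</pipesIterator>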

From Fileshare to Solr

These examples were tested with Solr 8.9.0 on Ubuntu in single core mode (not cloud).  These examples require Tika >= 2.1.0.

Index embedded files in a parent-child relationship

  • Create collection: bin/solr create -c tika-example && bin/solr config -c tika-example -p 8983 -action set-user-property -property update.autoCreateFields -value false
  • Set schema with this file solr-parent-child-schema.json: curl -F 'data=@solr-parent-child-schema.json' http://localhost:8983/solr/tika-example/schema
  • Put the latest tika app jar and tika-emitter-solr-2.1.0.jar in a bin/ directory
  • Unzip this config/ directory solr-parent-child-config.tgz and put it at the same level as the bin/ directory
  • Open config/tika-config-fs-to-solr.xml and update the <basePath> elements in the fetcher AND the pipesiterator to point to the directory that you want to index; a sketch of the emitter section appears after this list
  • Run tika: java -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-fs-to-solr.xml
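For reference, the emitter section of such a config plausibly looks like the following sketch; the parameter names and the attachmentStrategy value are assumptions, so use the config shipped in solr-parent-child-config.tgz as the source of truth:

<emitter class="org.apache.tika.pipes.emitter.solr.SolrEmitter">
  <params>
    <name>solr</name>
    <!-- parameter names and values below are assumptions -->
    <solrUrls>http://localhost:8983/solr</solrUrls>
    <solrCollection>tika-example</solrCollection>
    <attachmentStrategy>PARENT_CHILD</attachmentStrategy>
  </params>
</emitter>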

Treat each embedded file as a separate file

  • Create collection: bin/solr create -c tika-example && bin/solr config -c tika-example -p 8983 -action set-user-property -property update.autoCreateFields -value false
  • Set schema with this file solr-separate-docs-schema.json: curl -F 'data=@solr-separate-docs-schema.json' http://localhost:8983/solr/tika-example/schema
  • Put the latest tika app jar and tika-emitter-solr-2.1.0.jar in a bin/ directory
  • Unzip this config/ directory solr-separate-docs-config.tgz and put it at the same level as the bin/ directory
  • Open config/tika-config-fs-to-solr.xml and update the <basePath> elements in the fetcher AND the pipesiterator to point to the directory that you want to index
  • Run tika: java -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-fs-to-solr.xml

Legacy mode, concatenate content from embedded files

  • Create collection: bin/solr create -c tika-example && bin/solr config -c tika-example -p 8983 -action set-user-property -property update.autoCreateFields -value false
  • Set schema with this file solr-concatenate-schema.json: curl -F 'data=@solr-concatenate-schema.json' http://localhost:8983/solr/tika-example/schema
  • Put the latest tika app jar and tika-emitter-solr-2.1.0.jar in a bin/ directory
  • Unzip this config/ directory solr-concatenate-config.tgz and put it at the same level as the bin/ directory
  • Open config/tika-config-fs-to-solr.xml and update the <basePath> elements in the fetcher AND the pipesiterator to point to the directory that you want to index
  • Run tika: java -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-fs-to-solr.xml

From Fileshare to OpenSearch

The following require Tika >= 2.1.0. They will not work with the 2.0.0 release.  These examples were tested with OpenSearch 1.0.0 running in docker on an Ubuntu host.

Index embedded files in a parent-child relationship

This option requires specification of the parent-child relationship in the mappings file.  The parent relation name is currently hardcoded as container, and the embedded files are embedded.  The OpenSearch emitter flattens relationships so that even for deeply, recursively embedded files, all embedded files are children of the single container/parent file; recursive relationships are not captured in the OpenSearch join relation.  However, the embedded path is stored in the X-TIKA:embedded_resource_path metadata value, and the recursive relations can be reconstructed from that path.
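For reference, the join portion of such a mappings file plausibly looks like the following sketch.  The relation names container and embedded come from the description above, but the join field name (relation_type here) is an assumption, so use the opensearch-parent-child-mappings.json attachment as the source of truth:

{
  "mappings": {
    "properties": {
      "relation_type": {
        "type": "join",
        "relations": { "container": "embedded" }
      }
    }
  }
}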

  • Place the tika-app jar and the tika-emitter-opensearch-2.1.0.jar in the bin/ directory
  • Unzip this file opensearch-parent-child-config.tgz and place the config/ directory at the same level as the bin/ directory
  • Open config/tika-config-fs-to-opensearch.xml and update the <basePath> elements in BOTH the fetcher and the pipesiterator to point to the directory that you want to index
  • Curl these mappings opensearch-parent-child-mappings.json to OpenSearch: curl -k -T opensearch-parent-child-mappings.json -u admin:admin -H "Content-Type:application/json" https://localhost:9200/tika-test
  • Run tika app: java -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-fs-to-opensearch.xml

Treat each embedded file as a separate file

  • Place the tika-app jar and the tika-emitter-opensearch-2.1.0.jar in the bin/ directory
  • Unzip this file opensearch-parent-child-config.tgz and place the config/ directory at the same level as the bin/ directory
  • Open config/tika-config-fs-to-opensearch.xml and update the <basePath> elements in the fetcher and the pipesiterator to point to the directory that you want to index
  • Curl these mappings opensearch-mappings.json to OpenSearch: curl -k -T opensearch-mappings.json -u admin:admin -H "Content-Type: application/json" https://localhost:9200/tika-test
  • Run tika app: java -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-fs-to-opensearch.xml

Legacy mode, concatenate content from embedded files

This emulates the legacy output from tika-app and the /tika endpoint in tika-server-standard.  Note that this option hides exceptions and metadata from embedded files.  The key difference between this config and the "treat each embedded file as a separate file" config is the parseMode element in the pipesIterator:

  <pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
    <params>
      <parseMode>CONCATENATE</parseMode>
  ... 


  • Place the tika-app jar and the tika-emitter-opensearch-2.1.0.jar in the bin/ directory
  • Unzip this file opensearch-concatenate-config.tgz and place the config/ directory at the same level as the bin/ directory
  • Open config/tika-config-fs-to-opensearch.xml and update the <basePath> elements in the fetcher and the pipesiterator to point to the directory that you want to index
  • Curl these mappings opensearch-mappings.json to OpenSearch: curl -k -T opensearch-mappings.json -u admin:admin -H "Content-Type:application/json" https://localhost:9200/tika-test
  • Run tika app: java -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-fs-to-opensearch.xml

tika-server

Fetchers in the classic tika-server endpoints

For the classic tika-server endpoints (/rmeta, /tika, /unpack, /meta), users specify fetcherName and fetchKey in the headers.  This replaces enableFileUrl from tika-1.x. Note that enableUnsecureFeatures must still be set via the tika-config.xml:


<properties>
  <fetchers>
    <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
      <params>
        <name>fsf</name>
        <basePath>/my/base/path1</basePath>
      </params>
    </fetcher>
  </fetchers>
  <server>
    <params>
      <enableUnsecureFeatures>true</enableUnsecureFeatures>
    </params>
  </server>
</properties>


To parse /my/base/path1/path2/myfile.pdf:

curl -X PUT http://localhost:9998/tika --header "fetcherName: fsf" --header "fetchKey: path2/myfile.pdf"

If your file path contains non-ASCII characters, specify the fetcherName and the fetchKey as query parameters in the request instead of in the headers; the fetchKey may be sent raw or percent-encoded:

curl -X PUT 'http://tika:9998/rmeta/text?fetcherName=fsf&fetchKey=中文.txt' 
curl -X PUT 'http://tika:9998/rmeta/text?fetcherName=fsf&fetchKey=%E4%B8%AD%E6%96%87.txt'


The /pipes endpoint

This endpoint requires that at least one fetcher and one emitter be specified in the config file and that enableUnsecureFeatures be set to true. In the following example, we have source documents in /my/base/path1, and we want to write extracts to /my/base/extracts. Unlike with the classic endpoints, users send a json FetchEmitTuple to tika-server. For full documentation of this object see: FetchEmitTuple

<properties>
  <fetchers>
    <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
      <params>
        <name>fsf</name>
        <basePath>/my/base/path1</basePath>
      </params>
    </fetcher>
  </fetchers>
  <emitters>
    <emitter class="org.apache.tika.pipes.emitter.fs.FileSystemEmitter">
      <params>
        <name>fse</name>
        <basePath>/my/base/extracts</basePath>
      </params>
    </emitter>
  </emitters>
  <server>
    <params>
      <enableUnsecureFeatures>true</enableUnsecureFeatures>
    </params>
  </server>
  <pipes>
    <params>
      <tikaConfig>/path/to/tika-config.xml</tikaConfig>
    </params>
  </pipes>
</properties>


To parse /my/base/path1/path2/myfile.pdf:

curl -X POST -H "Content-Type: application/json" -d '{"fetcher":"fsf","fetchKey":"path2/myfile.pdf","emitter":"fse","emitKey":"path2/myfile.pdf"}' http://localhost:9998/pipes

Note, by default, the FileSystemEmitter automatically adds ".json" to the end of the emitKey.

The /async endpoint

The only difference in the /async handler is that you send a list of FetchEmitTuples:

curl -X POST -H "Content-Type: application/json" -d '[{"fetcher":"fsf","fetchKey":"path2/myfile.pdf","emitter":"fse","emitKey":"path2/myfile.pdf"}]' http://localhost:9998/async

Modifying Docker to use the pipes modules

For examples of how to load the pipes modules with Docker see: tika-pipes and Docker.

