Security Warning
NOTE: The tika-pipes modules in combination with tika-server open potential security vulnerabilities if you do not carefully limit access to tika-server. If the tika-pipes modules are turned on, anyone with access to your tika-server has the read and write permissions of the tika-server, and they will be able to read data and to forward the parsed results to whatever you've configured (see, for example: https://en.wikipedia.org/wiki/Server-side_request_forgery). The tika-pipes modules for tika-server are intended to be run in tightly controlled networks.
DO NOT use tika-pipes if your tika-server is exposed to the internet or if you do not carefully restrict access to tika-server.
Consider adding two-way TLS encryption to your client and server, a beta version of which is available in 2.4.0: TikaServer#SSL(Beta).
Overview
The tika-pipes modules enable fetching data from various sources, running the parse and then emitting the output to various destinations. These modules are built around the RecursiveParserWrapper output model (the -J option in tika-app and the /rmeta endpoint in tika-server-standard). Users can specify the content format (text/html/body) and set limits (number of embedded files, maximum content length) via FetchEmitTuples. Further, users can add Metadata Filters to select and modify the metadata that is extracted during the parse before emitting the output.
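For example, a sketch of a metadata filter in tika-config.xml; the FieldNameMappingFilter class name and its excludeUnmapped/mappings params reflect Tika 2.x config conventions, but verify them against the metadata-filters documentation for your version:

```xml
<properties>
  <metadataFilters>
    <!-- Rename selected metadata keys before emitting; drop everything unmapped. -->
    <metadataFilter class="org.apache.tika.metadata.filter.FieldNameMappingFilter">
      <params>
        <excludeUnmapped>true</excludeUnmapped>
        <mappings>
          <mapping from="X-TIKA:content" to="content"/>
          <mapping from="dc:creator" to="author"/>
        </mappings>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>
```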
We need to improve how dependencies are added. Very few of the fetchers/emitters are bundled in tika-app or tika-server-standard. For now, users can download the required jars from Maven Central; for example, the S3Emitter is available at: https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-s3/2.1.0/tika-emitter-s3-2.1.0.jar
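As a sketch of that step (the URL and version are taken from the link above; the download command is left commented so you can substitute curl or a newer version):

```shell
# Stage fetcher/emitter jars in a bin/ directory alongside the tika-app jar.
mkdir -p bin
url=https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-s3/2.1.0/tika-emitter-s3-2.1.0.jar
echo "would fetch $url into bin/"
# wget -P bin "$url"          # or: curl -L -o "bin/${url##*/}" "$url"
```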
I JUST WANT EXAMPLES. SHOW ME THE EXAMPLES!!!
See the tika-app examples below for fully worked examples of using tika-app to fetch from a local file share, parse, and send the output to Solr or OpenSearch.
Fetchers
Fetchers allow users to specify sources of inputstream+metadata for the parsing process. Fetchers are currently enabled in all of tika-server-standard and in the async option (-a) in tika-app.
With the exception of the FileSystemFetcher, users have to add the other fetchers' dependencies to their class path.
FileSystemFetcher
Class name: org.apache.tika.pipes.fetcher.fs.FileSystemFetcher
A FileSystemFetcher allows the user to specify a base directory in tika-config.xml; at parse time, the user specifies the relative path for a file. This class is included in tika-core, and no external resources are required.
For example, a minimal tika-config.xml file for a FileSystemFetcher would be:
<properties>
  <fetchers>
    <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
      <params>
        <name>fsf</name>
        <basePath>/my/base/path1</basePath>
      </params>
    </fetcher>
  </fetchers>
</properties>
HttpFetcher
Class name: org.apache.tika.pipes.fetcher.http.HttpFetcher
The HttpFetcher requires that this dependency be on your class path: https://mvnrepository.com/artifact/org.apache.tika/tika-fetcher-http
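A minimal config sketch, following the same pattern as the FileSystemFetcher example on this page; the fetcher name http is an arbitrary choice, and at fetch time the fetchKey is the URL to retrieve:

```xml
<properties>
  <fetchers>
    <!-- No basePath: the fetchKey supplied at parse time is the full URL. -->
    <fetcher class="org.apache.tika.pipes.fetcher.http.HttpFetcher">
      <params>
        <name>http</name>
      </params>
    </fetcher>
  </fetchers>
</properties>
```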
S3Fetcher
Class name: org.apache.tika.pipes.fetcher.s3.S3Fetcher
GCSFetcher
Class name: org.apache.tika.pipes.fetcher.gcs.GCSFetcher
AZBlobFetcher
Class name: org.apache.tika.pipes.fetcher.azblob.AZBlobFetcher
MSGraphFetcher
Class name: org.apache.tika.pipes.fetchers.microsoftgraph.MSGraphFetcher
Introduced in: https://github.com/apache/tika/pull/1698
Emitters
The FileSystemEmitter requires the tika-serialization module and is not included in tika-core. However, it is bundled with tika-app and tika-server-standard. For the other emitters, users have to add the emitter dependencies to their class path.
FileSystemEmitter
A FileSystemEmitter allows the user to specify a base directory in tika-config.xml; at parse time, the user specifies the relative path for the emitted .json file.
For example, a minimal tika-config.xml file for a FileSystemEmitter would be:
<properties>
  <emitters>
    <emitter class="org.apache.tika.pipes.emitter.fs.FileSystemEmitter">
      <params>
        <name>fse</name>
        <basePath>/my/base/extracts</basePath>
      </params>
    </emitter>
  </emitters>
</properties>
S3Emitter
OpenSearchEmitter
SolrEmitter
PipesIterators
tbd
tika-app examples
From FileShare to FileShare
Process all files in a directory recursively and place the .json extracts in a parallel directory structure.
N.B. For the logging to work correctly in the async pipes parser, you have to use >= 2.1.0
- Place the tika-app jar and any other dependencies in a bin/ directory
- Unzip this file (fs-to-fs-config.tgz) and place the config/ directory at the same level as the bin/ directory in the previous step
- Open config/tika-config-fs-to-fs.xml and update the <basePath/> elements in the fetcher and emitter sections to specify the absolute path to the root directory for the binary documents (fetcher) and to the target root directory for the extracts (emitter). Update the <basePath/> element in the pipesiterator section and make sure that it matches what you specified in the fetcher section.
- Commandline: java -Xmx512m -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-fs-to-fs.xml
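The config in the bundle combines the fetcher, emitter and pipesiterator pieces shown elsewhere on this page. A sketch of its shape (the paths are placeholders, and the fetcherName/emitterName params are our reading of the PipesIterator config; the unzipped file is authoritative):

```xml
<properties>
  <fetchers>
    <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
      <params>
        <name>fsf</name>
        <basePath>/path/to/binary/docs</basePath>
      </params>
    </fetcher>
  </fetchers>
  <emitters>
    <emitter class="org.apache.tika.pipes.emitter.fs.FileSystemEmitter">
      <params>
        <name>fse</name>
        <basePath>/path/to/extracts</basePath>
      </params>
    </emitter>
  </emitters>
  <!-- Walks basePath and pairs each file with the named fetcher and emitter.
       basePath here must match the fetcher's basePath. -->
  <pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
    <params>
      <fetcherName>fsf</fetcherName>
      <emitterName>fse</emitterName>
      <basePath>/path/to/binary/docs</basePath>
    </params>
  </pipesIterator>
</properties>
```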
From file list on FileShare to FileShare
The input is a list of relative paths to files (e.g. file-list.txt) on a file share and the output is .json extract files on a file share.
N.B. For the logging to work correctly in the async pipes parser, you have to use >= 2.1.0.
- Place the tika-app jar and any other dependencies in a bin/ directory
- Unzip this file (file-list-config.tgz) and place the config/ directory at the same level as the bin/ directory in the previous step and at the same level as file-list.txt
- Open config/tika-config-filelist.xml and update the <basePath/> elements in the fetcher and emitter sections to specify the absolute path to the root directory for the binary documents (fetcher) and to the target root directory for the extracts (emitter).
- Commandline: java -Xmx512m -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-filelist.xml
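One way to generate file-list.txt (an illustrative sketch, not from the Tika docs; the paths must be relative to the fetcher's basePath, one per line):

```shell
# Sketch: build file-list.txt from a document root.
# $base stands in for the fetcher's <basePath>; here we fabricate one for demonstration.
base=$(mktemp -d)
mkdir -p "$base/path2"
touch "$base/a.pdf" "$base/path2/b.docx"
# Emit each file path relative to basePath, one per line.
(cd "$base" && find . -type f | sed 's|^\./||' | sort) > file-list.txt
cat file-list.txt
```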
From Fileshare to Solr
These examples were tested with Solr 8.9.0 on Ubuntu in single core mode (not cloud). These examples require Tika >= 2.1.0.
Index embedded files in a parent-child relationship
- Create the collection: bin/solr create -c tika-example && bin/solr config -c tika-example -p 8983 -action set-user-property -property update.autoCreateFields -value false
- Set the schema with this file (solr-parent-child-schema.json): curl -F 'data=@solr-parent-child-schema.json' http://localhost:8983/solr/tika-example/schema
- Put the latest tika-app jar and tika-emitter-solr-2.1.0.jar in a bin/ directory
- Unzip this config/ directory (solr-parent-child-config.tgz) and put it at the same level as the bin/ directory
- Open config/tika-config-fs-to-solr.xml and update the <basePath> elements in the fetcher AND the pipesiterator to point to the directory that you want to index
- Run tika: java -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-fs-to-solr.xml
Treat each embedded file as a separate file
- Create the collection: bin/solr create -c tika-example && bin/solr config -c tika-example -p 8983 -action set-user-property -property update.autoCreateFields -value false
- Set the schema with this file (solr-separate-docs-schema.json): curl -F 'data=@solr-separate-docs-schema.json' http://localhost:8983/solr/tika-example/schema
- Put the latest tika-app jar and tika-emitter-solr-2.1.0.jar in a bin/ directory
- Unzip this config/ directory (solr-separate-docs-config.tgz) and put it at the same level as the bin/ directory
- Open config/tika-config-fs-to-solr.xml and update the <basePath> elements in the fetcher AND the pipesiterator to point to the directory that you want to index
- Run tika: java -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-fs-to-solr.xml
Legacy mode, concatenate content from embedded files
- Create the collection: bin/solr create -c tika-example && bin/solr config -c tika-example -p 8983 -action set-user-property -property update.autoCreateFields -value false
- Set the schema with this file (solr-concatenate-schema.json): curl -F 'data=@solr-concatenate-schema.json' http://localhost:8983/solr/tika-example/schema
- Put the latest tika-app jar and tika-emitter-solr-2.1.0.jar in a bin/ directory
- Unzip this config/ directory (solr-concatenate-config.tgz) and put it at the same level as the bin/ directory
- Open config/tika-config-fs-to-solr.xml and update the <basePath> elements in the fetcher AND the pipesiterator to point to the directory that you want to index
- Run tika: java -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-fs-to-solr.xml
From Fileshare to OpenSearch
The following require Tika >= 2.1.0. They will not work with the 2.0.0 release. These examples were tested with OpenSearch 1.0.0 running in docker on an Ubuntu host.
Index embedded files in a parent-child relationship
This option requires specifying the parent-child relationship in the mappings file. The parent is currently hardcoded to container, and the embedded files are embedded. The OpenSearch emitter flattens relationships so that if there are deeply recursively embedded files, all embedded files are children of the single container/parent file; recursive relationships are not captured in the OpenSearch join relation. However, the embedded path is stored in the X-TIKA:embedded_resource_path metadata value, and the recursive relations can be reconstructed from that path.
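As an illustration (the sample JSON below is fabricated, but the metadata key is the one named above), the path values can be pulled out of an /rmeta-style extract and used to rebuild the nesting:

```shell
# Fabricated /rmeta-style extract: an array of metadata objects,
# one per file, with X-TIKA:embedded_resource_path on embedded files.
cat > extract.json <<'EOF'
[{"X-TIKA:content":"container text"},
 {"X-TIKA:embedded_resource_path":"/attachment.zip"},
 {"X-TIKA:embedded_resource_path":"/attachment.zip/inner.pdf"}]
EOF
# Each path segment is one level of nesting: inner.pdf lives inside attachment.zip.
grep -o '"X-TIKA:embedded_resource_path":"[^"]*"' extract.json
```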
- Place the tika-app jar and the tika-emitter-opensearch-2.1.0.jar in the bin/ directory
- Unzip this file (opensearch-parent-child-config.tgz) and place the config/ directory at the same level as the bin/ directory
- Open config/tika-config-fs-to-opensearch.xml and update the <basePath> elements in BOTH the fetcher and the pipesiterator to point to the directory that you want to index
- Curl these mappings (opensearch-parent-child-mappings.json) to OpenSearch: curl -k -T opensearch-parent-child-mappings.json -u admin:admin -H "Content-Type: application/json" https://localhost:9200/tika-test
- Run tika-app: java -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-fs-to-opensearch.xml
Treat each embedded file as a separate file
- Place the tika-app jar and the tika-emitter-opensearch-2.1.0.jar in the bin/ directory
- Unzip this file (opensearch-parent-child-config.tgz) and place the config/ directory at the same level as the bin/ directory
- Open config/tika-config-fs-to-opensearch.xml and update the <basePath> elements in the fetcher and the pipesiterator to point to the directory that you want to index
- Curl these mappings (opensearch-mappings.json) to OpenSearch: curl -k -T opensearch-mappings.json -u admin:admin -H "Content-Type: application/json" https://localhost:9200/tika-test
- Run tika-app: java -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-fs-to-opensearch.xml
Legacy mode, concatenate content from embedded files
This emulates the legacy output from tika-app and the /tika endpoint in tika-server-standard. Note that this option hides exceptions from embedded files and metadata from embedded files. The key difference between this config and the "treat each embedded file as a separate file" config is the parseMode element in the pipesIterator:
<pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
  <params>
    <parseMode>CONCATENATE</parseMode>
    ...
- Place the tika-app jar and the tika-emitter-opensearch-2.1.0.jar in the bin/ directory
- Unzip this file (opensearch-concatenate-config.tgz) and place the config/ directory at the same level as the bin/ directory
- Open config/tika-config-fs-to-opensearch.xml and update the <basePath> elements in the fetcher and the pipesiterator to point to the directory that you want to index
- Curl these mappings (opensearch-mappings.json) to OpenSearch: curl -k -T opensearch-mappings.json -u admin:admin -H "Content-Type: application/json" https://localhost:9200/tika-test
- Run tika-app: java -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-fs-to-opensearch.xml
tika-server
Fetchers in the classic tika-server endpoints
For the classic tika-server endpoints (/rmeta, /tika, /unpack and /meta), users specify fetcherName and fetchKey in the headers. This replaces enableFileUrl from tika-1.x. Note that enableUnsecureFeatures must still be set via the tika-config.xml:
<properties>
  <fetchers>
    <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
      <params>
        <name>fsf</name>
        <basePath>/my/base/path1</basePath>
      </params>
    </fetcher>
  </fetchers>
  <server>
    <params>
      <enableUnsecureFeatures>true</enableUnsecureFeatures>
    </params>
  </server>
</properties>
To parse /my/base/path1/path2/myfile.pdf:
curl -X PUT http://localhost:9998/tika --header "fetcherName: fsf" --header "fetchKey: path2/myfile.pdf"
If your file path has non-ASCII characters, you should specify the fetcherName and the fetchKey as query parameters in the request instead of in the headers:
curl -X PUT 'http://tika:9998/rmeta/text?fetcherName=fsf&fetchKey=中文.txt'
curl -X PUT 'http://tika:9998/rmeta/text?fetcherName=fsf&fetchKey=%E4%B8%AD%E6%96%87.txt'
The /pipes endpoint
This endpoint requires that at least one fetcher and one emitter be specified in the config file and that enableUnsecureFeatures be set to true. In the following example, we have source documents in /my/base/path1, and we want to write extracts to /my/base/extracts. Unlike with the classic endpoints, users send a JSON FetchEmitTuple to tika-server. For full documentation of this object, see: FetchEmitTuple
<properties>
  <fetchers>
    <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
      <params>
        <name>fsf</name>
        <basePath>/my/base/path1</basePath>
      </params>
    </fetcher>
  </fetchers>
  <emitters>
    <emitter class="org.apache.tika.pipes.emitter.fs.FileSystemEmitter">
      <params>
        <name>fse</name>
        <basePath>/my/base/extracts</basePath>
      </params>
    </emitter>
  </emitters>
  <server>
    <params>
      <enableUnsecureFeatures>true</enableUnsecureFeatures>
    </params>
  </server>
  <pipes>
    <params>
      <tikaConfig>/path/to/tika-config.xml</tikaConfig>
    </params>
  </pipes>
</properties>
To parse /my/base/path1/path2/myfile.pdf:
curl -X POST -H "Content-Type: application/json" -d '{"fetcher":"fsf","fetchKey":"path2/myfile.pdf","emitter":"fse","emitKey":"path2/myfile.pdf"}' http://localhost:9998/pipes
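For longer tuples it can be easier to keep the JSON in a file and post it with -d @ (same fields as the inline example above; the curl is left commented because it needs a running tika-server):

```shell
# Same FetchEmitTuple as above, stored in a file.
cat > tuple.json <<'EOF'
{"fetcher":"fsf","fetchKey":"path2/myfile.pdf","emitter":"fse","emitKey":"path2/myfile.pdf"}
EOF
cat tuple.json
# curl -X POST -H "Content-Type: application/json" -d @tuple.json http://localhost:9998/pipes
```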
Note that, by default, the FileSystemEmitter automatically adds ".json" to the end of the emitKey.
The /async endpoint
The only difference in the /async handler is that you send a list of FetchEmitTuples:
curl -X POST -H "Content-Type: application/json" -d '[{"fetcher":"fsf","fetchKey":"path2/myfile.pdf","emitter":"fse","emitKey":"path2/myfile.pdf"}]' http://localhost:9998/async
Modifying Docker to use the pipes modules
For examples of how to load the pipes modules with Docker see: tika-pipes and Docker.