Security Warning
NOTE: The tika-pipes modules in combination with tika-server open potential security vulnerabilities if you do not carefully limit access to tika-server. If the tika-pipes modules are turned on, anyone with access to your tika-server has the read and write permissions of the tika-server, and they will be able to read data and to forward the parsed results to whatever you've configured (see, for example: https://en.wikipedia.org/wiki/Server-side_request_forgery). The tika-pipes modules for tika-server are intended to be run in tightly controlled networks.
DO NOT use tika-pipes if your tika-server is exposed to the internet or if you do not carefully restrict access to tika-server.
Consider adding two-way TLS encryption to your client and server, a beta version of which is available in 2.4.0: TikaServer#SSL(Beta).
Overview
The tika-pipes modules enable fetching data from various sources, running the parse and then emitting the output to various destinations. These modules are built around the RecursiveParserWrapper output model (-J option in tika-app and /rmeta endpoint in tika-server-standard). Users can specify content format (text/html/body) and set limits (number of embedded files, max content length) via FetchEmitTuples. Further, users can add Metadata Filters to select and modify the metadata that is extracted during the parse before emitting the output.
Dependency management still needs improvement: very few of the fetchers/emitters are bundled in tika-app or tika-server-standard. For now, users can download the required jars from Maven Central; for example, the S3Emitter is available at: https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-s3/2.1.0/tika-emitter-s3-2.1.0.jar
I JUST WANT EXAMPLES. SHOW ME THE EXAMPLES!!!
See below (tika-app) for fully worked examples of using tika-app to fetch from a local file share, parse and send the output to Solr.
Fetchers
Fetchers allow users to specify sources of an InputStream and Metadata for the parsing process. Fetchers are currently enabled in all of the tika-server-standard endpoints and in the async option (-a) in tika-app.
With the exception of the FileSystemFetcher, users have to add each fetcher's dependencies to their class path.
FileSystemFetcher
Class name: org.apache.tika.pipes.fetcher.fs.FileSystemFetcher
A FileSystemFetcher allows the user to specify a base directory in tika-config.xml and then at parse time, the user specifies the relative path for a file. This class is included in tika-core and no external resources are required.
For example, a minimal tika-config.xml file for a FileSystemFetcher would be:
<properties>
<fetchers>
<fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
<params>
<name>fsf</name>
<basePath>/my/base/path1</basePath>
</params>
</fetcher>
</fetchers>
</properties>
HttpFetcher
Class name: org.apache.tika.pipes.fetcher.http.HttpFetcher
The HttpFetcher requires that this dependency be on your class path: https://mvnrepository.com/artifact/org.apache.tika/tika-fetcher-http
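As a sketch, a minimal tika-config.xml entry for the HttpFetcher follows the same pattern as the FileSystemFetcher above; at parse time, the fetchKey is the URL to fetch. The fetcher name ("httpf") is an arbitrary choice for this example:

```xml
<properties>
  <fetchers>
    <fetcher class="org.apache.tika.pipes.fetcher.http.HttpFetcher">
      <params>
        <name>httpf</name>
      </params>
    </fetcher>
  </fetchers>
</properties>
```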
S3Fetcher
Class name: org.apache.tika.pipes.fetcher.s3.S3Fetcher
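A minimal tika-config.xml sketch for an S3Fetcher, following the pattern of the FileSystemFetcher above. The parameter names (region, bucket, credsProvider) are assumptions and should be verified against the tika-fetcher-s3 module; the tika-fetcher-s3 jar must be on the class path:

```xml
<properties>
  <fetchers>
    <fetcher class="org.apache.tika.pipes.fetcher.s3.S3Fetcher">
      <params>
        <name>s3f</name>
        <!-- assumed parameter names; check the tika-fetcher-s3 module -->
        <region>us-east-1</region>
        <bucket>my-input-bucket</bucket>
        <credsProvider>instance</credsProvider>
      </params>
    </fetcher>
  </fetchers>
</properties>
```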
GCSFetcher
Class name: org.apache.tika.pipes.fetcher.gcs.GCSFetcher
AZBlobFetcher
Class name: org.apache.tika.pipes.fetcher.azblob.AZBlobFetcher
MSGraphFetcher
Class name: org.apache.tika.pipes.fetchers.microsoftgraph.MSGraphFetcher
Introduced in: https://github.com/apache/tika/pull/1698
Emitters
The FileSystemEmitter requires the tika-serialization module and is not included in tika-core; it is, however, bundled with tika-app and tika-server-standard. For the other emitters, users have to add each emitter's dependencies to their class path.
FileSystemEmitter
A FileSystemEmitter allows the user to specify a base directory in tika-config.xml and then at parse time, the user specifies the relative path for the emitted .json file.
For example, a minimal tika-config.xml file for a FileSystemEmitter would be:
<properties>
<emitters>
<emitter class="org.apache.tika.pipes.emitter.fs.FileSystemEmitter">
<params>
<name>fse</name>
<basePath>/my/base/extracts</basePath>
</params>
</emitter>
</emitters>
</properties>
S3Emitter
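A minimal tika-config.xml sketch for an S3Emitter, mirroring the FileSystemEmitter example above. The parameter names (region, bucket, fileExtension) are assumptions and should be checked against the tika-emitter-s3 module; the tika-emitter-s3 jar (linked in the Overview) must be on the class path:

```xml
<properties>
  <emitters>
    <emitter class="org.apache.tika.pipes.emitter.s3.S3Emitter">
      <params>
        <name>s3e</name>
        <!-- assumed parameter names; check the tika-emitter-s3 module -->
        <region>us-east-1</region>
        <bucket>my-extracts-bucket</bucket>
        <fileExtension>json</fileExtension>
      </params>
    </emitter>
  </emitters>
</properties>
```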
OpenSearchEmitter
SolrEmitter
PipesIterators
tbd
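Although this section is still to be written, the tika-app examples below rely on the FileSystemPipesIterator. A minimal sketch, assuming the fetcher and emitter names from the config examples on this page (the fetcherName and emitterName parameter names are assumptions):

```xml
<pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
  <params>
    <!-- fetcherName/emitterName are assumed parameter names -->
    <fetcherName>fsf</fetcherName>
    <emitterName>fse</emitterName>
    <basePath>/my/base/path1</basePath>
  </params>
</pipesIterator>
```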
tika-app examples
From FileShare to FileShare
Process all files in a directory recursively and place the .json extracts in a parallel directory structure.
N.B. For the logging to work correctly in the async pipes parser, you have to use >= 2.1.0
- Place the tika-app jar and any other dependencies in a bin/ directory.
- Unzip this file (fs-to-fs-config.tgz) and place the config/ directory at the same level as the bin/ directory from the previous step.
- Open config/tika-config-fs-to-fs.xml and update the <basePath/> elements in the fetcher and emitter sections to specify the absolute path to the root directory for the binary documents (fetcher) and to the target root directory for the extracts (emitter). Update the <basePath/> element in the pipesiterator section and make sure that it matches what you specified in the fetcher section.
- Commandline:
java -Xmx512m -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-fs-to-fs.xml
From file list on FileShare to FileShare
The input is a list of relative paths to files (e.g. file-list.txt) on a file share and the output is .json extract files on a file share.
N.B. For the logging to work correctly in the async pipes parser, you have to use >= 2.1.0.
- Place the tika-app jar and any other dependencies in a bin/ directory.
- Unzip this file (file-list-config.tgz) and place the config/ directory at the same level as the bin/ directory from the previous step and at the same level as file-list.txt.
- Open config/tika-config-filelist.xml and update the <basePath/> elements in the fetcher and emitter sections to specify the absolute path to the root directory for the binary documents (fetcher) and to the target root directory for the extracts (emitter).
- Commandline:
java -Xmx512m -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-filelist.xml
From Fileshare to Solr
These examples were tested with Solr 8.9.0 on Ubuntu in single core mode (not cloud). These examples require Tika >= 2.1.0.
Index embedded files in a parent-child relationship
- Create collection:
bin/solr create -c tika-example && bin/solr config -c tika-example -p 8983 -action set-user-property -property update.autoCreateFields -value false
- Set schema with this file (solr-parent-child-schema.json):
curl -F 'data=@solr-parent-child-schema.json' http://localhost:8983/solr/tika-example/schema
- Put the latest tika-app jar and tika-emitter-solr-2.1.0.jar in a bin/ directory.
- Unzip this config/ directory (solr-parent-child-config.tgz) and put it at the same level as the bin/ directory.
- Open config/tika-config-fs-to-solr.xml and update the <basePath> elements in the fetcher AND the pipesiterator to point to the directory that you want to index.
- Run tika:
java -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-fs-to-solr.xml
Treat each embedded file as a separate file
- Create collection:
bin/solr create -c tika-example && bin/solr config -c tika-example -p 8983 -action set-user-property -property update.autoCreateFields -value false
- Set schema with this file (solr-separate-docs-schema.json):
curl -F 'data=@solr-separate-docs-schema.json' http://localhost:8983/solr/tika-example/schema
- Put the latest tika-app jar and tika-emitter-solr-2.1.0.jar in a bin/ directory.
- Unzip this config/ directory (solr-separate-docs-config.tgz) and put it at the same level as the bin/ directory.
- Open config/tika-config-fs-to-solr.xml and update the <basePath> elements in the fetcher AND the pipesiterator to point to the directory that you want to index.
- Run tika:
java -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-fs-to-solr.xml
Legacy mode, concatenate content from embedded files
- Create collection:
bin/solr create -c tika-example && bin/solr config -c tika-example -p 8983 -action set-user-property -property update.autoCreateFields -value false
- Set schema with this file (solr-concatenate-schema.json):
curl -F 'data=@solr-concatenate-schema.json' http://localhost:8983/solr/tika-example/schema
- Put the latest tika-app jar and tika-emitter-solr-2.1.0.jar in a bin/ directory.
- Unzip this config/ directory (solr-concatenate-config.tgz) and put it at the same level as the bin/ directory.
- Open config/tika-config-fs-to-solr.xml and update the <basePath> elements in the fetcher AND the pipesiterator to point to the directory that you want to index.
- Run tika:
java -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-fs-to-solr.xml
From Fileshare to OpenSearch
The following require Tika >= 2.1.0. They will not work with the 2.0.0 release. These examples were tested with OpenSearch 1.0.0 running in docker on an Ubuntu host.
Index embedded files in a parent-child relationship
This option requires specification of the parent-child relationship in the mappings file. The parent relation name is currently hardcoded as container, and embedded files use the relation name embedded. The OpenSearch emitter flattens relationships: if there are deeply recursively embedded files, all embedded files are children of the single container/parent file, and recursive relationships are not captured in the OpenSearch join relation. However, the embedded path is stored in the X-TIKA:embedded_resource_path metadata value, and the recursive relations can be reconstructed from that path.
- Place the tika-app jar and tika-emitter-opensearch-2.1.0.jar in the bin/ directory.
- Unzip this file (opensearch-parent-child-config.tgz) and place the config/ directory at the same level as the bin/ directory.
- Open config/tika-config-fs-to-opensearch.xml and update the <basePath> elements in BOTH the fetcher and the pipesiterator to point to the directory that you want to index.
- Curl these mappings (opensearch-parent-child-mappings.json) to OpenSearch:
curl -k -T opensearch-mappings.json -u admin:admin -H "Content-Type:application/json" https://localhost:9200/tika-test
- Run tika app:
java -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-fs-to-opensearch.xml
Treat each embedded file as a separate file
- Place the tika-app jar and tika-emitter-opensearch-2.1.0.jar in the bin/ directory.
- Unzip this file (opensearch-parent-child-config.tgz) and place the config/ directory at the same level as the bin/ directory.
- Open config/tika-config-fs-to-opensearch.xml and update the <basePath> elements in the fetcher and the pipesiterator to point to the directory that you want to index.
- Curl these mappings (opensearch-mappings.json) to OpenSearch:
curl -k -I -T opensearch-mappings.json https://localhost:9200/tika-test -u admin:admin -H "Content-Type: application/json"
- Run tika app:
java -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-fs-to-opensearch.xml
Legacy mode, concatenate content from embedded files
This emulates the legacy output from tika-app and the /tika endpoint in tika-server-standard. Note that this option hides exceptions from embedded files and metadata from embedded files. The key difference between this config and the "treat each embedded file as a separate file" is the parseMode element in the pipesIterator:
<pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
<params>
<parseMode>CONCATENATE</parseMode>
...
- Place the tika-app jar and tika-emitter-opensearch-2.1.0.jar in the bin/ directory.
- Unzip this file (opensearch-concatenate-config.tgz) and place the config/ directory at the same level as the bin/ directory.
- Open config/tika-config-fs-to-opensearch.xml and update the <basePath> elements in the fetcher and the pipesiterator to point to the directory that you want to index.
- Curl these mappings (opensearch-mappings.json) to OpenSearch:
curl -k -T opensearch-mappings.json -u admin:admin -H "Content-Type:application/json" https://localhost:9200/tika-test
- Run tika app:
java -cp "bin/*" org.apache.tika.cli.TikaCLI -a --config=config/tika-config-fs-to-opensearch.xml
tika-server
Fetchers in the classic tika-server endpoints
For the classic tika-server endpoints (/rmeta, /tika, /unpack, /meta), users specify fetcherName and fetchKey in the headers. This replaces enableFileUrl from tika-1.x. Note that enableUnsecureFeatures must still be set via the tika-config.xml:
<properties>
<fetchers>
<fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
<params>
<name>fsf</name>
<basePath>/my/base/path1</basePath>
</params>
</fetcher>
</fetchers>
<server>
<params>
<enableUnsecureFeatures>true</enableUnsecureFeatures>
</params>
</server>
</properties>
To parse /my/base/path1/path2/myfile.pdf:
curl -X PUT http://localhost:9998/tika --header "fetcherName: fsf" --header "fetchKey: path2/myfile.pdf"
If your file path has non-ASCII characters, you should specify the fetcherName and the fetchKey as query parameters in the request instead of in the headers:
curl -X PUT 'http://tika:9998/rmeta/text?fetcherName=fsf&fetchKey=中文.txt'
curl -X PUT 'http://tika:9998/rmeta/text?fetcherName=fsf&fetchKey=%E4%B8%AD%E6%96%87.txt'
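The percent-encoded form in the second command can be produced with any standard URL-encoding utility; for example, in Python:

```python
from urllib.parse import quote

# Percent-encode a non-ASCII fetchKey (UTF-8) for use as a query parameter
print(quote("中文.txt"))  # -> %E4%B8%AD%E6%96%87.txt
```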
The /pipes endpoint
This endpoint requires that at least one fetcher and one emitter be specified in the config file and that enableUnsecureFeatures be set to true. In the following example, we have source documents in /my/base/path1, and we want to write extracts to /my/base/extracts. Unlike with the classic endpoints, users send a json FetchEmitTuple to tika-server. For full documentation of this object see: FetchEmitTuple
<properties>
<fetchers>
<fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
<params>
<name>fsf</name>
<basePath>/my/base/path1</basePath>
</params>
</fetcher>
</fetchers>
<emitters>
<emitter class="org.apache.tika.pipes.emitter.fs.FileSystemEmitter">
<params>
<name>fse</name>
<basePath>/my/base/extracts</basePath>
</params>
</emitter>
</emitters>
<server>
<params>
<enableUnsecureFeatures>true</enableUnsecureFeatures>
</params>
</server>
<pipes>
<params>
<tikaConfig>/path/to/tika-config.xml</tikaConfig>
</params>
</pipes>
</properties>
To parse /my/base/path1/path2/myfile.pdf:
curl -X POST -H "Content-Type: application/json" -d '{"fetcher":"fsf","fetchKey":"path2/myfile.pdf","emitter":"fse","emitKey":"path2/myfile.pdf"}' http://localhost:9998/pipes
Note, by default, the FileSystemEmitter automatically adds ".json" to the end of the emitKey.
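To control the content format and limits mentioned in the Overview, the FetchEmitTuple can carry a handlerConfig object. A hedged sketch: the field names below (handlerConfig, type, writeLimit, maxEmbeddedResources) are assumptions and should be verified against the FetchEmitTuple documentation linked above:

```json
{
  "fetcher": "fsf",
  "fetchKey": "path2/myfile.pdf",
  "emitter": "fse",
  "emitKey": "path2/myfile.pdf",
  "handlerConfig": {
    "type": "text",
    "writeLimit": 100000,
    "maxEmbeddedResources": 10
  }
}
```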
The /async endpoint
The only difference in the /async handler is that you send a list of FetchEmitTuples:
curl -X POST -H "Content-Type: application/json" -d '[{"fetcher":"fsf","fetchKey":"path2/myfile.pdf","emitter":"fse","emitKey":"path2/myfile.pdf"}]' http://localhost:9998/async
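For larger batches, the list of FetchEmitTuples can be generated programmatically. A minimal sketch in Python, assuming the fetcher and emitter names ("fsf", "fse") from the config example above:

```python
import json
import os

def build_tuples(root_dir, fetcher="fsf", emitter="fse"):
    """Build a FetchEmitTuple list for every file under root_dir, using
    each file's path relative to root_dir as both fetchKey and emitKey."""
    tuples = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root_dir)
            tuples.append({
                "fetcher": fetcher,
                "fetchKey": rel,
                "emitter": emitter,
                "emitKey": rel,
            })
    return tuples

# The JSON payload for the /async endpoint is then:
#   json.dumps(build_tuples("/my/base/path1"))
# POST it to http://localhost:9998/async with Content-Type: application/json
```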
Modifying Docker to use the pipes modules
For examples of how to load the pipes modules with Docker see: tika-pipes and Docker.