The installation process is in flux between the 1.23 and 1.24 releases. The available options are described below:
Building from source
If you need to customise the server in some way, and/or need the very latest version to try out a fix, then to build from source:
- Check out the source from SVN as detailed on the Apache Tika contributions page, or retrieve the latest code from GitHub.
- Build the source using Maven.
- Run the Apache Tika JAXRS server runnable jar.
No Format |
git clone tika-trunk
cd ./tika-trunk/
mvn install
cd ./tika-server/target/
java -jar tika-server-x.x.jar
Remember to replace x.x with the version you have built.
Running the Tika Server as a Jar file
The Tika Server binary is a standalone runnable jar. Download the latest stable release binary from the Apache Tika downloads page, via your favourite local mirror. You want the tika-server-1.x.jar file, e.g. tika-server-1.23.jar.
You can start it by calling java with the -jar option, e.g. java -jar tika-server-1.23.jar
You will then see a message such as the following:
No Format |
$ java -jar tika-server-1.23-SNAPSHOT.jar
19-Jan-2015 14:23:36 org.apache.tika.server.TikaServerCli main
INFO: Starting Apache Tika 1.8-SNAPSHOT server
19-Jan-2015 14:23:36 org.apache.cxf.endpoint.ServerImpl initDestination
INFO: Setting the server's publish address to be http://localhost:9998/
19-Jan-2015 14:23:36 org.slf4j.impl.JCLLoggerAdapter info
INFO: jetty-8.y.z-SNAPSHOT
19-Jan-2015 14:23:36 org.slf4j.impl.JCLLoggerAdapter info
INFO: Started SelectChannelConnector@localhost:9998
19-Jan-2015 14:23:36 org.apache.tika.server.TikaServerCli main
INFO: Started
This output confirms that the server started correctly.
You can specify additional information to change the host name and port number:
No Format |
java -jar tika-server-x.x.jar --host=intranet.local --port=12345
Below is some basic documentation on how to interact with the services using cURL and HTTP.
Using prebuilt Docker image
There is an unofficial image for Tika that has been available for years. You can download and start it with
No Format |
docker pull logicalspark/docker-tikaserver
docker run --rm -p 9998:9998 logicalspark/docker-tikaserver
With the --rm option, the container is deleted as soon as it stops. The Dockerfile can be found on GitHub.
There is also an in-progress effort to publish an official Tika Docker image. That code can be found at and will eventually replace the version produced by LogicalSpark.
Running Tika Server as Unix Service
Tika 1.24 ships a new service installation script that lets you install Tika as a service on Linux. This script was heavily influenced by the Apache Solr project's script, so read up on that documentation if you want to customize the script.
Currently the script only supports the CentOS, Debian, Red Hat, SUSE and Ubuntu Linux distributions. Before running the script, you need to determine a few parameters about your setup. Specifically, you need to decide where to install Tika and which system user should be the owner of the Tika files and process.
To run the script, you'll need the 1.24 (or later) Tika distribution. It will have a -bin suffix, e.g. tika-server-1.24-SNAPSHOT-bin.tgz. Extract the installation script from the distribution via:
No Format |
tar xzf tika-server-1.24-bin.tgz tika-server-1.24-bin/bin/ --strip-components=2
This will extract the script from the archive into the current directory. If installing on Red Hat, please make sure lsof is installed before running the Tika installation script (sudo yum install lsof). The installation script must be run as root:
sudo bash ./ tika-server-1.24-bin.tgz
By default, the script extracts the distribution archive into /opt/tika, configures Tika to write files into /var/tika, and runs Tika as the tika user on the default port. Consequently, the following command produces the same result as the previous command:
sudo bash ./ tika-server-1.24-bin -i /opt -d /var/tika -u tika -s tika -p 9998
You can customize the service name, installation directories, port, and owner using options passed to the installation script. To see available options, simply do:
sudo bash ./ -help
Once the script completes, Tika will be installed as a service and running in the background on your server (on port 9998). To verify, you can do:
sudo service tika status
Your specific customizations to the Tika setup are stored in /etc/init.d/tika.
All services that take files use HTTP "PUT" requests. When "PUT" is used, the original file must be sent in request body without any additional encoding (do not use multipart/form-data or other containers).
Additionally, TikaResource, Metadata and RecursiveMetadata Services accept POST multipart/form-data requests, where the original file is sent as a single attachment.
Information services (e.g. defined mimetypes, defined parsers etc.) work with HTTP "GET" requests.
You may optionally specify content type in "Content-Type" header. If you do not specify mime type, Tika will use its detectors to guess it.
You may specify an additional identifier in the URL after the resource name, like "/tika/my-file-i-sent-to-tika-resource" for the "/tika" resource. Tika Server uses this name only for logging, so you may put a file name, UUID or any other identifier there (do not forget to URL-encode any special characters).
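As a rough illustration of the request rules above, the following Python sketch sends a file as a raw PUT body, optionally with a Content-Type header and a URL-encoded identifier. The function names build_put and put_file are made up for this example and are not part of Tika; the host and port are whatever your server listens on.

```python
import http.client
from urllib.parse import quote


def build_put(resource, content_type=None, identifier=None):
    """Return (path, headers) for a raw-body PUT to a Tika Server resource."""
    path = "/" + resource.strip("/")
    if identifier is not None:
        # The identifier is only used for logging; percent-encode everything
        # that is not an unreserved character, including "/" and "#".
        path += "/" + quote(identifier, safe="")
    headers = {}
    if content_type is not None:
        # Optional hint; without it the server falls back to its detectors.
        headers["Content-Type"] = content_type
    return path, headers


def put_file(host, port, resource, data, content_type=None, identifier=None):
    """PUT the raw bytes (no multipart wrapper) and return (status, body)."""
    path, headers = build_put(resource, content_type, identifier)
    conn = http.client.HTTPConnection(host, port, timeout=60)
    try:
        conn.request("PUT", path, body=data, headers=headers)
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()
```

http.client is used here because it sends only the headers it is given, so omitting Content-Type really does leave detection to the server.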
Resources may return the following HTTP codes:
- 200 OK - request completed successfully
- 204 No Content - request completed successfully, result is empty
- 422 Unprocessable Entity - unsupported mime-type, encrypted document, etc.
- 500 Error - error while processing document
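A client might translate these codes into results and errors along the following lines. This is only a sketch: TikaError and interpret_response are hypothetical names chosen for the example, and treating every other status as unexpected is an assumption, not documented server behaviour.

```python
class TikaError(Exception):
    """Raised when Tika Server reports that a request did not succeed."""


def interpret_response(status, body):
    """Map the documented status codes to a result or an exception."""
    if status == 200:
        return body            # request completed successfully
    if status == 204:
        return b""             # successful, but the result is empty
    if status == 422:
        raise TikaError("unprocessable: unsupported mime-type, encrypted document, etc.")
    if status == 500:
        raise TikaError("error while processing document")
    raise TikaError("unexpected HTTP status %d" % status)
```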
Metadata Resource
No Format |
HTTP PUTs a document to the /meta service and you get back "text/csv" of the metadata.
Some Example calls with cURL:
No Format |
$ curl -X PUT --data-ascii @zipcode.csv http://localhost:9998/meta --header "Content-Type: text/csv"
$ curl -T price.xls http://localhost:9998/meta
No Format |
Get metadata as JSON:
No Format |
$ curl -T test_recursive_embedded.docx http://localhost:9998/meta --header "Accept: application/json"
No Format |
$ curl -T test_recursive_embedded.docx http://localhost:9998/meta --header "Accept: application/rdf+xml"
Get specific metadata key's value as simple text string:
No Format |
$ curl -T test_recursive_embedded.docx http://localhost:9998/meta/Content-Type --header "Accept: text/plain"
No Format |
Get specific metadata key's value(s) as CSV:
No Format |
$ curl -T test_recursive_embedded.docx http://localhost:9998/meta/Content-Type --header "Accept: text/csv"
No Format |
$ curl -T test_recursive_embedded.docx http://localhost:9998/meta/Content-Type --header "Accept: application/json"
No Format |
$ curl -T test_recursive_embedded.docx http://localhost:9998/meta/Content-Type --header "Accept: application/rdf+xml"
Note: when requesting a specific metadata key's value(s) in XMP, make sure to request the XMP name, e.g. "dc:creator" vs. "Author".
Multipart Support
Metadata Resource also accepts the files as multipart/form-data attachments with POST. Posting files as multipart attachments may be beneficial in cases when the files are too big for them to be PUT directly in the request body. Note that Tika JAX-RS server makes the best effort at storing some of the multipart content to the disk while still supporting the streaming:
No Format |
curl -F upload=@price.xls http://localhost:9998/meta/form
Note that the address has an extra "/form" path segment.
Tika Resource
No Format |
HTTP PUTs a document to the /tika service and you get back the extracted text. HTTP GET prints a greeting stating the server is up.
Some Example calls with cURL:
Get HELLO message back
No Format |
$ curl -X GET http://localhost:9998/tika
This is Tika Server. Please PUT
Get the Text of a Document
No Format |
$ curl -X PUT --data-binary @GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
$ curl -T price.xls http://localhost:9998/tika --header "Accept: text/html"
$ curl -T price.xls http://localhost:9998/tika --header "Accept: text/plain"
Use the Boilerpipe handler (equivalent to tika-app's --text-main) with text output:
No Format |
$ curl -T price.xls http://localhost:9998/tika/main --header "Accept: text/plain"
Multipart Support
Tika Resource also accepts the files as multipart/form-data attachments with POST. Posting files as multipart attachments may be beneficial in cases when the files are too big for them to be PUT directly in the request body. Note that Tika JAX-RS server makes the best effort at storing some of the multipart content to the disk while still supporting the streaming:
No Format |
curl -F upload=@price.xls http://localhost:9998/tika/form
Note that the address has an extra "/form" path segment.
Detector Resource
No Format |
HTTP PUTs a document and uses the Default Detector from Tika to identify its MIME/media type. The caveat here is that providing a hint for the filename can increase the quality of detection.
Default return is a string of the Media type name.
Some Example calls with cURL:
PUT an RTF file and get back RTF
No Format |
$ curl -X PUT --data-binary @TODO.rtf http://localhost:9998/detect/stream
PUT a CSV file without filename hint and get back text/plain
No Format |
$ curl -X PUT --upload-file foo.csv http://localhost:9998/detect/stream
PUT a CSV file with filename hint and get back text/csv
No Format |
$ curl -X PUT -H "Content-Disposition: attachment; filename=foo.csv" --upload-file foo.csv http://localhost:9998/detect/stream
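The filename hint from the last call can be expressed as a small helper. This is a sketch only: detect_request is a made-up name, and file names containing spaces or other special characters are deliberately not handled.

```python
def detect_request(filename=None):
    """Return (path, headers) for a PUT to the detector resource."""
    headers = {}
    if filename is not None:
        # Same header as the curl example: the file name is a hint
        # that can improve the quality of detection.
        headers["Content-Disposition"] = "attachment; filename=" + filename
    return "/detect/stream", headers
```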
Language Resource
No Format |
HTTP PUTs or POSTs a UTF-8 text file to the LanguageIdentifier to identify its language.
NOTE: This endpoint does not parse files. It runs detection on a UTF-8 string.
Default return is a string of the 2 character identified language.
Some Example calls with cURL:
PUT a TXT file with English This is English! and get back en
No Format |
$ curl -X PUT --data-binary @foo.txt http://localhost:9998/language/stream
PUT a TXT file with French comme çi comme ça and get back fr
No Format |
curl -X PUT --data-binary @foo.txt http://localhost:9998/language/stream
No Format |
HTTP PUTs or POSTs a text string to the LanguageIdentifier to identify its language.
Default return is a string of the 2 character identified language.
Some Example calls with cURL:
PUT a string with English This is English! and get back en
No Format |
$ curl -X PUT --data "This is English!" http://localhost:9998/language/string
PUT a string with French comme çi comme ça and get back fr
No Format |
curl -X PUT --data "comme çi comme ça" http://localhost:9998/language/string
Translate Resource
No Format |
HTTP PUTs or POSTs a document to the identified *translator* and translates from *src* language to *dest*
Default return is the translated string if successful, else the original string back.
Note that:
- *translator* should be a fully qualified Tika class name (with package), e.g., org.apache.tika.language.translate.Lingo24Translator
- *src* should be the 2 character short code for the source language, e.g., 'en' for English
- *dest* should be the 2 character short code for the dest language, e.g., 'es' for Spanish.
Some Example calls with cURL:
PUT a TXT file named sentences with Spanish me falta práctica en Español and get back the English translation using Lingo24
No Format |
$ curl -X PUT --data-binary @sentences http://localhost:9998/translate/all/org.apache.tika.language.translate.Lingo24Translator/es/en
lack of practice in Spanish
PUT a TXT file named sentences with Spanish me falta práctica en Español and get back the English translation using Microsoft
No Format |
$ curl -X PUT --data-binary @sentences http://localhost:9998/translate/all/org.apache.tika.language.translate.MicrosoftTranslator/es/en
I need practice in Spanish
PUT a TXT file named sentences with Spanish me falta práctica en Español and get back the English translation using Google
No Format |
$ curl -X PUT --data-binary @sentences http://localhost:9998/translate/all/org.apache.tika.language.translate.GoogleTranslator/es/en
I need practice in Spanish
No Format |
HTTP PUTs or POSTs a document to the identified *translator* and auto-detects the *src* language using LanguageIdentifiers, and then translates *src* to *dest*
Default return is the translated string if successful, else the original string back.
Note that:
- *translator* should be a fully qualified Tika class name (with package), e.g., org.apache.tika.language.translate.Lingo24Translator
- *dest* should be the 2 character short code for the dest language, e.g., 'es' for Spanish.
PUT a TXT file named sentences2 with French comme çi comme ça and get back the English translation using Google auto-detecting the language
No Format |
$ curl -X PUT --data-binary @sentences2 http://localhost:9998/translate/all/org.apache.tika.language.translate.GoogleTranslator/en
so so
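The two URL shapes used in the translate examples (with and without an explicit source language) can be built as follows. The helper name translate_path is an assumption made for this sketch.

```python
def translate_path(translator, dest, src=None):
    """Build the request path for the translate resource.

    translator: fully qualified Tika translator class name
    dest:       2 character code of the target language
    src:        2 character code of the source language, or None to
                let the server auto-detect it
    """
    parts = ["translate", "all", translator]
    if src is not None:
        parts.append(src)
    parts.append(dest)
    return "/" + "/".join(parts)
```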
Recursive Metadata and Content
No Format |
Returns a JSONified list of Metadata objects for the container document and all embedded documents. The text that is extracted from each document is stored in the metadata object under "X-TIKA:content".
No Format |
$ curl -T test_recursive_embedded.docx http://localhost:9998/rmeta
No Format |
{"Application-Name":"Microsoft Office Word",
"X-TIKA:content":"embed_0 "
"Content-Type":"text/plain; charset=ISO-8859-1"
The default format for "X-TIKA:content" is XML. However, you can select "text only" with
No Format |
HTML with
No Format |
and no content (metadata only) with
No Format |
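On the client side, the JSON list returned by /rmeta can be walked with a few lines of Python. The sketch below assumes the response body is the JSON list described above; summarize_rmeta and the sample values are invented for illustration and are not real server output.

```python
import json


def summarize_rmeta(rmeta_json):
    """Return (content type, extracted content) for each metadata object."""
    documents = json.loads(rmeta_json)
    return [
        (doc.get("Content-Type"), doc.get("X-TIKA:content", ""))
        for doc in documents
    ]


# Made-up sample with one container document and one embedded document.
sample = json.dumps([
    {"Content-Type": "application/x-example", "X-TIKA:content": "outer"},
    {"Content-Type": "text/plain; charset=ISO-8859-1", "X-TIKA:content": "embed_0 "},
])
```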
Multipart Support
The Recursive Metadata resource also accepts the files as multipart/form-data attachments with POST. Posting files as multipart attachments may be beneficial in cases when the files are too big for them to be PUT directly in the request body. Note that Tika JAX-RS server makes the best effort at storing some of the multipart content to the disk while still supporting the streaming:
No Format |
curl -F upload=@test_recursive_embedded.docx http://localhost:9998/rmeta/form
Note that the address has an extra "/form" path segment.
Unpack Resource
No Format |
HTTP PUTs an embedded document type to the /unpack service and you get back a zip or tar of the extracted text for each resource filename in the original PUT embedded document type. You can also use /unpack/all to get back both the text and metadata.
Default return type is ZIP (without internal compression). Use "Accept" header for TAR return type.
Please note the mapping of this resource was changed in Apache Tika 1.6 from /unpacker/id to /unpack/id /all/id & /unpack/all/id (TIKA-1324).
Some Example calls with cURL:
PUT zip file and get back met file zip
No Format |
$ curl -X PUT --data-binary http://localhost:9998/unpack --header "Content-type: application/zip"
PUT doc file and get back met file tar
No Format |
$ curl -T Doc1_ole.doc -H "Accept: application/x-tar" http://localhost:9998/unpack > /var/tmp/x.tar
PUT doc file and get back the content and metadata
No Format |
$ curl -T Doc1_ole.doc http://localhost:9998/unpack/all > /var/tmp/
Text is stored in the TEXT file, metadata CSV in METADATA. Use the "Accept" header if you want TAR output.
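Once the default ZIP response has been downloaded, its entries can be read from memory with the Python standard library. This is a minimal sketch; read_unpacked is a hypothetical helper, and the entry names depend on the document that was sent.

```python
import io
import zipfile


def read_unpacked(zip_bytes):
    """Return a dict mapping each entry name in the archive to its bytes."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}
```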
Information Services
Available Endpoints
No Format |
Hitting the root of the server in your web browser will give a basic report of all the endpoints defined in the server, what URLs they have, etc.
Defined Mime Types
No Format |
Mime types, their aliases, their supertype, and the parser. Available as plain text, JSON or human-readable HTML.
Available Detectors
No Format |
The top level Detector to be used, and any child detectors within it. Available as plain text, JSON or human-readable HTML.
Available Parsers
No Format |
Lists all of the parsers currently available
No Format |
List all the available parsers, along with what mimetypes they support
Specifying a URL Instead of Putting Bytes
In Tika 1.10, we removed this capability because it posed a security vulnerability (CVE-2015-3271). Anyone with access to the service had the server's access rights; someone could request local files via file:/// or pages from an intranet that they might not otherwise have access to.
In Tika 1.14, we added the capability back, but the user has to acknowledge the security risk by including two commandline arguments:
No Format |
$ java -jar tika-server-x.x.jar -enableUnsecureFeatures -enableFileUrl
This allows the user to specify a fileUrl in the header:
No Format |
curl -i -H "fileUrl:" -H "Accept:text/plain" -X PUT http://localhost:9998/tika
No Format |
curl -i -H "fileUrl:file:///C:/data/my_test_doc.pdf" -H "Accept:text/plain" -X PUT http://localhost:9998/tika
By adding back this capability, we did not remove the security vulnerability. Rather, if a user is confident that only authorized clients are able to submit a request, the user can choose to operate tika-server with this insecure setting. BE CAREFUL!
Also, please be polite. This feature was added as a convenience. Please consider using a robust crawler (instead of our simple TikaInputStream.get(new URL(fileUrl))) that will allow for better configuration of redirects, timeouts, cookies, etc.; and a robust crawler will respect robots.txt!
Making Tika Server Robust to OOMs, Infinite Loops and Memory Leaks
As of Tika 1.19, users can make tika-server more robust by running it with the -spawnChild option. This starts tika-server in a child process, and if there's an OOM, a timeout or other catastrophic problem with the child process, the parent process will kill and/or restart the child process.
The following options are available only with -spawnChild:
- -maxFiles: restart the child process after it has processed maxFiles files. If there is a slowly building memory leak, this restart of the JVM should help. The default is 100,000 files. To turn off this feature: -maxFiles -1. The child and/or parent will log the cause of the restart as HIT_MAX when there is a restart because of this threshold.
- -taskTimeoutMillis and -taskPulseMillis: taskPulseMillis specifies how often to check whether a parse/detect task has exceeded taskTimeoutMillis.
- -pingPulseMillis: specifies how often the parent process pings the child process to check status.
- -pingTimeoutMillis: how long the parent process should wait to hear back from the child process before restarting it, and/or how long the child process should wait to receive a ping from the parent process before shutting itself down.
If the child process is in the process of shutting down, and it gets a new request, it will return 503 -- Service Unavailable. If the server times out on a file, the client will receive an IOException from the closed socket. Note that all other files that are being processed will end with an IOException from a closed socket when the child process shuts down; e.g. if you send three files to tika-server concurrently, and one of them causes a catastrophic problem requiring the child to shut down, you won't be able to tell which file caused the problems. In the future, we may implement a gentler shutdown than we currently have.
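Because a restart shows up on the client as either a 503 or a closed socket, a client may want to retry a bounded number of times. The following sketch shows one way to do that. The name send_with_retries, the attempt count and the backoff are assumptions of this example, and send stands for any callable that performs one request and returns (status, body). Note that a file which itself crashes the child will fail again on every retry, so the number of attempts should stay small.

```python
import http.client
import time


def send_with_retries(send, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call send() until it returns something other than 503 or attempts run out."""
    last_error = None
    for attempt in range(attempts):
        try:
            status, body = send()
        except (OSError, http.client.HTTPException) as exc:
            # Closed socket or truncated response while the child restarts.
            last_error = exc
        else:
            if status != 503:
                return status, body
            last_error = RuntimeError("503 Service Unavailable")
        if attempt < attempts - 1:
            # Exponential backoff: delay, 2 * delay, 4 * delay, ...
            sleep(delay_seconds * (2 ** attempt))
    raise last_error
```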
NOTE: to specify the JVM args for the child process, prepend the arguments with -J, as in -JXmx4g, after the -jar tika-server-x.x.jar call, as in:
No Format |
$ java -Dlog4j.configuration=file:log4j_server.xml -jar tika-server-x.x.jar -spawnChild -JXmx4g -JDlog4j.configuration=file:log4j_child.xml
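If the server is launched from a script, the -J convention can be applied mechanically. The helper below is a hypothetical sketch (spawn_child_command is not a Tika API): it takes ordinary JVM arguments such as -Xmx4g and rewrites them for the child process.

```python
def spawn_child_command(jar, child_jvm_args=(), parent_jvm_args=()):
    """Build the argument list for starting tika-server with -spawnChild."""
    command = ["java", *parent_jvm_args, "-jar", jar, "-spawnChild"]
    for arg in child_jvm_args:
        if not arg.startswith("-"):
            raise ValueError("JVM arguments must start with '-': " + arg)
        # "-Xmx4g" becomes "-JXmx4g" so that it is passed on to the child JVM.
        command.append("-J" + arg[1:])
    return command
```

The resulting list can be handed to subprocess.Popen or joined with spaces for display.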
You can customize logging via the usual log4j command-line argument, e.g. -Dlog4j.configuration=file:log4j_server.xml. If using -spawnChild, specify the configuration for the child process with the -J prepended, as in java -jar tika-server-X.Y.jar -spawnChild -JDlog4j.configuration=file:log4j_server.xml. Some important notes for logging in the child process in versions <= 1.19.1: 1) make sure that the debug option is off, and 2) do not log to stdout (this is used for interprocess communication between the parent and child!).
The default level of logging is debug, but you can also set logging to info via the command line: -log info