Running untrusted parsers on untrusted data is inherently risky (see, for example, Kathleen Fisher's LangSec2021 talk). The Apache Tika team does what it can.
In rare cases, Tika can go into infinite loops or allocate surprising amounts of memory, which leads to OutOfMemoryErrors (OOMs). If you are processing enough documents in the wild, you will run into these problems, and you must defend against them.
If you're processing untrusted files at scale, we strongly encourage you not to run Tika in the same JVM as, say, an indexer, a search system or any other critical code.
The Tika project offers some defenses against these denial of service (DoS) vulnerabilities. All of these options spawn a forked process to do the actual parsing.
- The ForkParser – this forks a child process and will protect against OOM and infinite loops.
- tika-batch – if you are processing files at desktop/vm scale (not cloud scale), you can run tika-batch via tika-app:
java -jar tika-app.jar -i <input_dir> -o <output_dir>
- tika-server – In Tika >= 2.x, the parsing is done in a forked process by default. Clients need to be able to handle tika-server going offline when the forked parsing process has to restart.
- tika-pipes – In Tika 2.x, use tika-pipes programmatically, in tika-app with the -a option, or in tika-server with the /async or /pipes endpoints.
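Because tika-server can briefly go offline while its forked parsing process restarts, a client should expect connection failures and retry. The following is a minimal sketch of one way to do that with the JDK's built-in HTTP client, not an official Tika client. The class name, the method names, the retry count and the pause length are all hypothetical choices made for illustration; the endpoint URI is supplied by the caller.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;
import java.time.Duration;
import java.util.concurrent.Callable;

// Hypothetical helper class; not part of Tika.
class TikaServerRetryClient {

    /**
     * Runs the call, retrying when it fails with an IOException.
     * The pause grows linearly: pauseMillis, 2 * pauseMillis, and so on.
     */
    static <T> T withRetries(Callable<T> call, int maxAttempts, long pauseMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (IOException e) {
                // Refused or reset connections are what a client sees during a restart.
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis * attempt);
                }
            }
        }
        throw last;
    }

    /** Sends one file to the given tika-server endpoint and returns the response body. */
    static String extractText(HttpClient client, URI tikaEndpoint, Path file) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(tikaEndpoint)
                .timeout(Duration.ofMinutes(2)) // assumed limit; note that a timeout is also retried
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofFile(file))
                .build();
        // Assumed policy: up to 5 attempts, starting with a 1 second pause.
        return withRetries(() -> {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 200) {
                // Not retried here: the caller decides what a non-200 status means.
                throw new IllegalStateException("Unexpected status " + response.statusCode());
            }
            return response.body();
        }, 5, 1000L);
    }
}
```

A caller would pass something like `URI.create("http://localhost:9998/tika")` as the endpoint, assuming tika-server is running locally on its default port. A file that reliably crashes or hangs the parser will fail on every attempt, so it is worth recording such files rather than retrying them indefinitely.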
The Tika project has taken the following steps to identify and fix catastrophic problems:
- We gathered a regression corpus of ~2 million files from Common Crawl, and we run Tika against it before each release to identify potential DoS vulnerabilities.
- We've done code reviews of some of our dependencies to identify common sets of vulnerabilities, such as read-a-length-then-allocate patterns.
- We have recently added a basic fuzzing module to identify some of these vulnerabilities.
- We continue to engage with security researchers who have carried out code reviews or applied more advanced fuzzing to identify vulnerabilities.
- We have forked, fixed and released as our own at least three parsers that were not able to make the fixes on their own or did not respond in a reasonable amount of time.
We document our fixed vulnerabilities here: https://tika.apache.org/security.html.
We offer the MockParser in tika-core tests that will allow you to test the robustness of your system against infinite loops, out-of-memory errors and other serious problems.
In short, we do what we can, but given what we've seen before and given the size of our dependencies' codebases, we can't assert that Tika is safe. If you are processing high volumes of untrustworthy data, please, please avoid running Tika in the same process as anything that matters, such as your indexer or natural language processing code.
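One way to keep Tika out of your critical process is to launch it as a separate JVM and put a watchdog timeout around it. The sketch below uses only the JDK and the tika-app batch command shown earlier on this page. The class name, the method names and the timeout are hypothetical, and the example assumes that `java` is on the PATH and that the jar path points at a real tika-app jar.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;
import java.util.OptionalInt;
import java.util.concurrent.TimeUnit;

// Hypothetical helper class; not part of Tika.
class IsolatedTikaBatch {

    /** Builds the command line: java -jar <tika-app jar> -i <input_dir> -o <output_dir>. */
    static List<String> buildCommand(Path tikaAppJar, Path inputDir, Path outputDir) {
        return List.of(
                "java",
                "-jar", tikaAppJar.toString(),
                "-i", inputDir.toString(),
                "-o", outputDir.toString());
    }

    /**
     * Runs the command in a child process.
     * Returns the exit code, or an empty value if the process had to be killed on timeout.
     */
    static OptionalInt runWithTimeout(List<String> command, long timeout, TimeUnit unit)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectErrorStream(true)
                // Discard output so a chatty child cannot block on a full pipe.
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .start();
        if (!process.waitFor(timeout, unit)) {
            // The child hung or ran too long: kill it, and the calling process carries on.
            process.destroyForcibly();
            process.waitFor();
            return OptionalInt.empty();
        }
        return OptionalInt.of(process.exitValue());
    }
}
```

With this shape, an OOM or an infinite loop in a parser takes down or stalls only the child JVM. The caller sees either a non-zero exit code or an empty result and can log the batch and move on, for example after `runWithTimeout(buildCommand(jar, in, out), 2, TimeUnit.HOURS)`.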
Finally, if you come across a file that causes catastrophic problems and you are able to share it, we will try to fix the source of the problem if we can.
Please see slide 12 for more details: http://events17.linuxfoundation.org/sites/events/files/slides/ApacheConMiami2017_tallison_v2.pdf