So, you've integrated Apache Tika into your framework, tried it on a couple of thousand files and all works well. Problem solved!
In very, very rare cases, Tika can do some really bad things. We try to fix these problems when we can, but if history is any indication (TIKA-1132 and TIKA-2040, to name a few), if you are processing millions or billions of files from the wild, you'll need to defend against:
1. Regular catchable exceptions
2. OutOfMemory errors, which can put the JVM in an unreliable state
3. Permanent hangs (Tika can chew up massive amounts of resources and go forever)
4. Security vulnerabilities (e.g. CVE-2016-6809 and CVE-2016-4434)
Please note that for item 3, permanent hangs, you cannot terminate the thread. Thread's stop(), suspend() and destroy() methods sound like they'll do the trick, but they won't. You need to kill the entire process. See TIKA-456.
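Since killing the process is the only reliable cure for a hang, one option is to run each parse (or batch) in a child process and let the parent enforce a deadline. The following is a minimal sketch using only the JDK (Java 9+). The class name ForkedParse, the method runWithTimeout and the choice to discard the child's output are illustrative assumptions, not part of Tika.

```java
import java.io.IOException;
import java.util.List;
import java.util.OptionalInt;
import java.util.concurrent.TimeUnit;

public class ForkedParse {

    /**
     * Runs the command in a child process and waits up to timeoutMillis.
     * Returns the child's exit code, or an empty OptionalInt if the child
     * did not finish in time and had to be killed.
     */
    public static OptionalInt runWithTimeout(List<String> command, long timeoutMillis)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        // Merge stderr into stdout and discard both, so the child can never
        // block on a full pipe. A real framework would likely log to a file.
        pb.redirectErrorStream(true);
        pb.redirectOutput(ProcessBuilder.Redirect.DISCARD);
        Process child = pb.start();

        if (child.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
            return OptionalInt.of(child.exitValue());
        }
        // Deadline passed: kill the whole process, not just a thread.
        child.destroyForcibly();
        child.waitFor();
        return OptionalInt.empty();
    }
}
```

A caller could treat an empty result or a non-zero exit code as a failed document and move on to the next one with a fresh child process.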
As of Tika 1.15, the tika-core-tests.jar includes a MockParser that lets you test your framework against items 1-3. Simply add that jar to your classpath, include a <mock> XML file in your set of test documents, and crash, crash away.
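If you want one mock file per failure mode, they can be generated rather than written by hand. This is a rough sketch: the element and attribute names (throw with a class attribute, oom, and hang with millis, heavy and interruptible) are assumptions based on the MockParser example file and should be checked against it. The class MockFileWriter and the output file names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class MockFileWriter {

    private static final String HEADER = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";

    /** Item 1: a file that throws the named Throwable with a message.
     *  Arguments are assumed to need no XML escaping. */
    public static String throwing(String throwableClass, String message) {
        return HEADER + "<mock>\n  <throw class=\"" + throwableClass + "\">"
                + message + "</throw>\n</mock>\n";
    }

    /** Item 2: a file that triggers an OutOfMemoryError (assumed element name). */
    public static String outOfMemory() {
        return HEADER + "<mock>\n  <oom/>\n</mock>\n";
    }

    /** Item 3: a file that hangs for the given time and ignores interrupts
     *  (assumed attribute names). */
    public static String hang(long millis) {
        return HEADER + "<mock>\n  <hang millis=\"" + millis
                + "\" heavy=\"false\" interruptible=\"false\"/>\n</mock>\n";
    }

    /** Writes one mock file per failure mode into dir. */
    public static void writeAll(Path dir) throws IOException {
        Files.createDirectories(dir);
        Files.writeString(dir.resolve("mock_exception.xml"),
                throwing("java.io.IOException", "not another IOException"));
        Files.writeString(dir.resolve("mock_oom.xml"), outOfMemory());
        Files.writeString(dir.resolve("mock_hang.xml"), hang(60_000));
    }
}
```

Dropping the generated files into a test batch gives a quick check that the framework survives each failure and keeps processing the remaining documents.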
If you'd like to test item 4, you can do that too! While you should be protected from an XXE (let us know if you're not!), you could create a deserialization attack: just create your own malicious Throwable class, add it to the classpath and send in a mock file whose <throw> element names that class.
Place the tika-app.jar and the tika-core-tests.jar in a "bin" directory, then run:
java -cp "bin/*" org.apache.tika.TikaCLI mock_example.xml
Place the tika-server.jar and the tika-core-tests.jar in a "bin" directory, then start the server:
java -cp "bin/*" org.apache.tika.server.TikaServerCli
Then curl away:
curl -T mock_example.xml http://localhost:9998/rmeta/text
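The same request can be sent from Java with a client-side timeout, so that a hung server does not also hang the caller. This is a minimal sketch using the JDK's java.net.http client (Java 11+). The class name RmetaClient and the method putFile are hypothetical, and the endpoint is assumed to be the one from the curl command above.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;
import java.time.Duration;

public class RmetaClient {

    /**
     * PUTs a file to the endpoint (the equivalent of `curl -T`) and returns
     * the response. Throws java.net.http.HttpTimeoutException, a subclass of
     * IOException, if no response arrives within the timeout.
     */
    public static HttpResponse<String> putFile(URI endpoint, Path file, Duration timeout)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_1_1)
                .connectTimeout(timeout)
                .build();
        HttpRequest request = HttpRequest.newBuilder(endpoint)
                .timeout(timeout) // give up waiting for the response after this long
                .PUT(HttpRequest.BodyPublishers.ofFile(file))
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

A call might look like RmetaClient.putFile(URI.create("http://localhost:9998/rmeta/text"), Path.of("mock_example.xml"), Duration.ofSeconds(30)). The timeout only frees the client; a server that is truly hung would still need its process killed and restarted.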
Place the tika-core-tests.jar on your classpath (NOT IN PRODUCTION!!!) and then add some mock XML files to your batch of documents.
See the mock example.xml file in tika-parsers/src/test/resources/test-documents/mock.
It shows examples of everything the MockParser can do.