How to Run tika-eval On the VM
While users can run tika-eval on their own machines with their own documents, the Apache Tika, Apache PDFBox and Apache POI communities have gathered ~1TB of documents from govdocs1 and from Common Crawl to serve as a regression testing corpus. Before a release, we'll run the last release against the candidate release to identify potential regressions.
Rackspace generously hosts this VM, and we are extremely grateful.
This page is intended for committers/PMC members with access to the VM who want to run the regression tests. The example focuses on testing a SNAPSHOT version of PDFBox, but the steps are nearly identical for the full Tika eval or for subprojects. See TikaEval for more information on the tika-eval module by itself. See this blog for a description of running this project on the Rackspace VM.
The driver, appBatchExecutor.sh, the various configuration files, and the file list for PDFs are all available here: tika_eval_vm_scripts.tgz.
If you haven't already done so in your .bashrc file, make sure to run umask g+rw before running anything.
An Example with Apache PDFBox
- Clean up from any previous runs
  - Remove tika-app.jar from /work/batch-apps/tika_working/bin
  - Remove or rename /work/batch-apps/tika_working/logs
  - Remove or rename /work/batch-apps/tika_working/nohup.out
- Run the current "A" version
  - Place the "A" version of tika-app.jar in /work/batch-apps/tika_working/bin
  - Modify appBatchExecutor.sh to
    - put the output in a new output directory: -o /data4/batch_runs/pdfboxA
    - confirm that the correct file list is specified: -fileList pdf_files_single_col.txt
  - Execute: nohup ./appBatchExecutor.sh &
  - Wait for the "A" version to complete before starting the "B" version
- Build and run the "B" version
  - Update PDFBox from SVN, then run mvn install
  - Update the PDFBox and FontBox versions in the Tika project's tika-parsers/pom.xml
  - Run mvn clean on the whole Tika project and make sure that your IDE has picked up the changes
  - Run the PDFParser tests in tika-parsers/src/test/java/o.a.t.p.pdf.* to make sure that the Tika unit tests work
  - Build the entire Tika project (even though you'll only use tika-app.jar): mvn install
  - On the VM, remove tika-app-A.jar from /work/batch-apps/tika_working/bin, rename nohup.out to nohup-A.out, and rename /work/batch-apps/tika_working/logs to /work/batch-apps/tika_working/logs-A
  - Drop the new tika-app-B.jar into (you guessed it!) /work/batch-apps/tika_working/bin
  - Modify appBatchExecutor.sh to
    - put the output in a new output directory: -o /data4/batch_runs/pdfboxB
    - confirm that the correct file list is specified: -fileList pdf_files_single_col.txt
  - Execute: nohup ./appBatchExecutor.sh &
  - Wait for the "B" version to complete before starting the comparisons and reports
- Make the comparisons and report
  - In /work/eval, remove the existing db file pdfboxAvsB.mv.db if you don't want to rename it
  - Run the comparison: java -jar tika-eval.jar Compare -extractsA /data4/batch_runs/pdfboxA -extractsB /data4/batch_runs/pdfboxB -db pdfboxAvsB
  - When that completes,
    - Remove any files left over from the last run in reports/: rm -r reports
    - Write the reports: java -Djava.io.tmpdir=tmp -jar tika-eval.jar Report -db pdfboxAvsB
    - Note the -Djava.io.tmpdir=tmp: the tmp directory needs to be set to something writeable by 'collab'
When this process completes, you'll have all of the reports written to /work/eval/reports.
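The housekeeping between runs and the final comparison can be wrapped in a pair of shell functions. The following is only a rough sketch, not a script that ships with tika_eval_vm_scripts.tgz: the function names archive_run and compare_and_report, the TIKA_WORKING, EVAL_DIR and JAVA variables, and the mkdir of the tmp directory are assumptions made for illustration. The paths and the tika-eval arguments are the ones used in the steps above.

```shell
#!/bin/sh
# Illustrative sketch only. Function and variable names are assumptions;
# the paths and tika-eval arguments are taken from the steps on this page.

TIKA_WORKING="${TIKA_WORKING:-/work/batch-apps/tika_working}"
EVAL_DIR="${EVAL_DIR:-/work/eval}"
JAVA="${JAVA:-java}"   # set JAVA=echo for a dry run that only prints the arguments

# Remove the old jar and set aside logs and nohup.out under a label (e.g. "A").
archive_run() {
    label="$1"
    rm -f "$TIKA_WORKING"/bin/tika-app*.jar
    if [ -d "$TIKA_WORKING/logs" ]; then
        mv "$TIKA_WORKING/logs" "$TIKA_WORKING/logs-$label"
    fi
    if [ -f "$TIKA_WORKING/nohup.out" ]; then
        mv "$TIKA_WORKING/nohup.out" "$TIKA_WORKING/nohup-$label.out"
    fi
}

# Compare two extract directories, then write the reports.
compare_and_report() {
    extracts_a="$1"
    extracts_b="$2"
    db="$3"
    (
        cd "$EVAL_DIR" || exit 1
        # Deletes the old H2 db; rename it by hand first if you want to keep it.
        rm -f "$db.mv.db"
        # Remove reports left over from the last run.
        rm -rf reports
        # Assumption: a local tmp directory writable by the current user.
        mkdir -p tmp
        "$JAVA" -jar tika-eval.jar Compare -extractsA "$extracts_a" -extractsB "$extracts_b" -db "$db" &&
        "$JAVA" -Djava.io.tmpdir=tmp -jar tika-eval.jar Report -db "$db"
    )
}
```

With these functions loaded, archive_run A would be called after the "A" run finishes, and compare_and_report /data4/batch_runs/pdfboxA /data4/batch_runs/pdfboxB pdfboxAvsB after the "B" run. Setting JAVA=echo first prints the two tika-eval command lines without executing them, which is a cheap way to check the arguments before committing to a long run.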
H2 to Postgresql and Reports
With the expansion of the regression corpus, I'm finding that H2 isn't able to write the reports – no matter the -Xmx, even after a few hours, it doesn't even get to the point of creating the reports directory.
I should set up postgres on our VM, but I haven't gotten around to that yet. For now, I'm copying the H2 db to Postgresql and then writing the reports from there. The code to copy H2->postgres is available here: tika-addons.
I had to modify the report SQL slightly to work with Postgresql, and I stripped out some of the reports/calculations that aren't critical to the full regression tests. The modified report SQL is available here: comparison-reports_pg.xml.