How to Run tika-eval On the VM

While users can run tika-eval on their own machines with their own documents, the Apache Tika, Apache PDFBox and Apache POI communities have gathered ~1TB of documents from govdocs1 and from Common Crawl to serve as a regression testing corpus. Before a release, we'll run the last release against the candidate release to identify potential regressions.

Rackspace generously hosts this VM, and we are extremely grateful.

This page is intended for committers/PMC members with access to the VM who want to run the regression tests. The example focuses on testing a SNAPSHOT version of PDFBox, but the steps are nearly identical for the full Tika eval or for subprojects. See TikaEval for more information on the tika-eval module by itself. See this blog for a description of running this project on the Rackspace VM.

The driver, the various configuration files and the file list for PDFs are all available here: tika_eval_vm_scripts.tgz.

If you haven't already set it in your .bashrc file, make sure to run umask g+rw before running anything, so that other members of the group can read and write the files you create.
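As a minimal sketch of the umask step, the following sets the mask for the current shell session and prints it back for confirmation. It assumes a POSIX-style shell; nothing here is specific to the VM.

```shell
# Allow group read/write on files created from here on (symbolic form, as in the text).
umask g+rw

# Print the effective mask symbolically; the group field should include "rw".
umask -S
```

Adding the same umask line to .bashrc makes the setting apply to every new login shell.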

An Example with Apache PDFBox

  1. Clean up from any previous runs
    1. Remove tika-app.jar from /work/batch-apps/tika_working/bin
    2. Remove or rename /work/batch-apps/tika_working/logs
    3. Remove or rename /work/batch-apps/tika_working/nohup.out
  2. Run the current "A" version
    1. Place the "A" version of tika-app.jar in /work/batch-apps/tika_working/bin
    2. Modify the driver script to
      1. put the output in a new output directory -o /data4/batch_runs/pdfboxA
      2. confirm that the correct file list is specified -fileList pdf_files_single_col.txt
    3. Execute: nohup ./ &
    4. Wait for the "A" version to complete before starting the "B" version
  3. Build and run the "B" version
    1. Update PDFBox from SVN, then run mvn install
    2. Update the PDFBox and Fontbox versions in the Tika project tika-parsers/pom.xml
    3. Run mvn clean on the whole Tika project and make sure that your IDE has picked up the changes
    4. Run the PDFParser tests in tika-parsers/src/test/java/org/apache/tika/parser/pdf/ to make sure that the Tika unit tests work.
    5. Build the entire Tika project (even though you'll only use tika-app.jar): mvn install
    6. On the VM, remove the "A" jar from /work/batch-apps/tika_working/bin, rename nohup.out to nohup-A.out, and rename /work/batch-apps/tika_working/logs to /work/batch-apps/tika_working/logs-A
    7. Drop the new tika-app-B.jar into (you guessed it!): /work/batch-apps/tika_working/bin
    8. Modify the driver script to
      1. put the output in a new output directory -o /data4/batch_runs/pdfboxB
      2. confirm that the correct file list is specified -fileList pdf_files_single_col.txt
    9. Execute: nohup ./ &
    10. Wait for the "B" version to complete before starting the comparisons and reports
  4. Make the comparisons and report
    1. In /work/eval, remove or rename the existing db file.
    2. java -jar tika-eval.jar Compare -extractsA /data4/batch_runs/pdfboxA -extractsB /data4/batch_runs/pdfboxB -db pdfboxAvsB
    3. When that completes,
      1. Remove any files left over from the last run in reports/: rm -r reports
      2. Write the reports: java -jar tika-eval.jar Report -db pdfboxAvsB
         Note: the tmp directory needs to be set to something writeable by 'collab'.

When this process completes, you'll have all of the reports written to /work/eval/reports.
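The housekeeping between runs (step 3.6) and the comparison and report stage (step 4) can be sketched as two small shell functions. This is only an illustration, not the script shipped in tika_eval_vm_scripts.tgz: the function names archive_run and compare_and_report and the variables TIKA_WORKING, EVAL_DIR and JAVA are hypothetical, and the defaults simply repeat the paths given on this page. The sketch does not touch an existing db file and does not set the tmp directory.

```shell
#!/bin/sh
# Hypothetical helpers; the defaults repeat the paths used on this page.
TIKA_WORKING="${TIKA_WORKING:-/work/batch-apps/tika_working}"
EVAL_DIR="${EVAL_DIR:-/work/eval}"
JAVA="${JAVA:-java}"

# Step 3.6: remove the old jar and set aside nohup.out and logs under a label.
# Usage: archive_run A
archive_run() {
    label="$1"
    # Refuse to run if an earlier archive with this label is still present.
    if [ -e "$TIKA_WORKING/nohup-$label.out" ] || [ -e "$TIKA_WORKING/logs-$label" ]; then
        echo "archive for label '$label' already exists" >&2
        return 1
    fi
    rm -f "$TIKA_WORKING"/bin/tika-app*.jar
    if [ -e "$TIKA_WORKING/nohup.out" ]; then
        mv "$TIKA_WORKING/nohup.out" "$TIKA_WORKING/nohup-$label.out"
    fi
    if [ -d "$TIKA_WORKING/logs" ]; then
        mv "$TIKA_WORKING/logs" "$TIKA_WORKING/logs-$label"
    fi
}

# Step 4: compare the two extract directories, clear old reports, write new ones.
# The body runs in a subshell so the caller's working directory is unchanged.
# Usage: compare_and_report /data4/batch_runs/pdfboxA /data4/batch_runs/pdfboxB pdfboxAvsB
compare_and_report() (
    cd "$EVAL_DIR" &&
    "$JAVA" -jar tika-eval.jar Compare -extractsA "$1" -extractsB "$2" -db "$3" &&
    rm -rf reports &&
    "$JAVA" -jar tika-eval.jar Report -db "$3"
)
```

Chaining the commands with && means the old reports are only removed once the comparison has finished successfully, which matches the order of the steps above.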

H2 to PostgreSQL and Reports

With the expansion of the regression corpus, I'm finding that H2 isn't able to write the reports – no matter the -Xmx, even after a few hours, it doesn't even get to the point of creating the reports directory.

I should set up PostgreSQL on our VM, but I haven't gotten around to that yet. For now, I'm copying the H2 db to PostgreSQL and then writing the reports from there. The code to copy H2->PostgreSQL is available here: tika-addons.

I had to modify the report SQL slightly to work with PostgreSQL, and I stripped out some of the reports/calculations that aren't critical to the full regression tests. The modified report SQL is available here: comparison-reports_pg.xml.
