Overview of the 'tika-eval-app' Module

This page offers a first draft of the documentation for the tika-eval-app module, which was first added in Tika 1.15-SNAPSHOT.

The module is intended to offer insight into the output of a single extraction tool or to enable comparisons between tools. It was designed to help with Tika development, but it could be used to evaluate other tools as well.

As part of Tika's periodic regression testing, we run this module against ~3 million files (for committers/PMC interested in running the regression testing on our Rackspace vmregression vm, see TikaEvalOnVM). However, as currently designed, it will not scale to hundreds of millions of files. Patches are welcome!

...

You'll find the .xlsx reports under the "reports" directory. Note: if you don't need the full tika-eval-app, you can get many of these statistics at parse time via the TikaEvalMetadataFilter (see: ModifyingContentWithHandlersAndMetadataFilters).

Comparing Output from Two Tools/Settings (Compare)

...

Make sure that your common words have gone through the same analysis chain as specified by the Common Words analyzer in 'lucene-analyzers.json'!
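
To illustrate why the analysis chain matters, here is a minimal sketch in Python. The `analyze` function is a hypothetical stand-in (accent folding plus lowercasing), not the actual chain defined in 'lucene-analyzers.json', and the sample text and word list are made up. The point is only that a common-words list that skipped the chain will fail to match the analyzed tokens.

```python
import unicodedata


def analyze(text):
    """Toy analysis chain (an assumption for illustration only):
    fold accents via NFKD, lowercase, then split on whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return folded.lower().split()


def count_common(tokens, common_words):
    """Count how many tokens appear in the common-words set."""
    return sum(1 for tok in tokens if tok in common_words)


# Hypothetical extracted text, run through the toy chain.
tokens = analyze("Le Café est OUVERT")

# A common-words list that never went through the chain ...
raw_common = {"Le", "Café", "est"}
# ... and the same list after going through it.
analyzed_common = {tok for word in raw_common for tok in analyze(word)}

mismatched = count_common(tokens, raw_common)    # raw list vs. analyzed tokens
matched = count_common(tokens, analyzed_common)  # analyzed list vs. analyzed tokens
print(mismatched, matched)
```

With the raw list, only the word that already happens to be in its analyzed form is counted; with the analyzed list, all three common words are counted. Undercounting of this kind would skew any statistic derived from common-word counts.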

Reading Extracts

alterExtract

...