This page offers a first draft of the documentation for the tika-eval-app module, which was initially added to Tika 1.15.
The module is intended to offer insight into the output of a single extraction tool or to enable comparisons between tools. This module is designed to help with Tika, but it could be used to evaluate other tools as well.
As part of Tika's periodic regression testing, we run this module against ~3 million files (committers/PMC interested in running the regression testing on our regression vm, see TikaEvalOnVM). However, it will not scale to 100s of millions of files as it is currently designed. Patches are welcome!
There are many tools for extracting text from various file formats, and even within a single tool there are usually countless parameters that can be tweaked. The goal of 'tika-eval' is to allow developers to quickly compare the output of two different tools, or of two versions or configurations of the same tool.
In addition to this "comparison mode", there is also plenty of information one can get from looking at a profile of a single run.
Some basic metrics for both the "comparison" and "profiling" modes might include the number of exceptions thrown during parsing, the amount of text extracted, and the number of common words in the extracted text.
The tika-eval module was initially developed for text only. For those interested in evaluating structure/style components (e.g. <title/> or <b/> elements), see TikaEvalAndStructuralComponents.
NOTE: tika-eval will not overwrite the contents of the database you specify in Profile or Compare mode. Add -drop to the commandline to drop tables if you are reusing the database.
The following assumes that you are using the default in-memory H2 database. To connect tika-eval to your own db via jdbc, see TikaEvalJdbc.
NOTE: assume the original input files are in a directory named input_docs and that the text extracts are written to the extracts directory, with each extract file having the same sub-directory path and same file name with '.json' or '.txt' appended to it.
java -jar tika-app-X.Y.jar -J -t -i input_docs -o extracts
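To make the expected layout concrete, here is a small sketch (the directory and file names below are made up for demonstration; tika-app itself creates the extract files):

```shell
# Illustration only: the extracts tree must mirror input_docs,
# with '.json' (or '.txt') appended to each file name.
mkdir -p demo/input_docs/reports demo/extracts/reports
touch demo/input_docs/reports/budget.pdf
# tika-app would write the corresponding extract here:
touch demo/extracts/reports/budget.pdf.json
find demo/extracts -name '*.json'
rm -rf demo
```

The `find` command above lists `demo/extracts/reports/budget.pdf.json`, mirroring `demo/input_docs/reports/budget.pdf`.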
java -jar tika-eval-X.Y.jar Profile -extracts extracts -db profiledb

java -jar tika-eval-X.Y.jar Report -db profiledb
You'll have a directory of .xlsx reports under the "reports" directory. Note: if you don't need the full tika-eval-app, you can get many of these statistics at parse time via the TikaEvalMetadataFilter (see: ModifyingContentWithHandlersAndMetadataFilters).
NOTE: assume the original input files are in a directory named input_docs and that the text extracts from tool A are written to the extractsA directory and the extracts from tool B are written to extractsB.
java -jar tika-eval-X.Y.jar Compare -extractsA extractsA -extractsB extractsB -db comparisondb
java -jar tika-eval-X.Y.jar Report -db comparisondb
You'll have a directory of .xlsx reports under the "reports" directory.
java -jar tika-eval-X.Y.jar StartDB

This calls java -cp ... org.h2.tools.Console -web. Point your browser to http://localhost:8082 and enter the jdbc connector code followed by the full path to your db file: jdbc:h2:/C:/users/someone/mystuff/tika-eval/comparisondb
If your reaction is: "You call this a database?!", please open tickets and contribute to improving the structure.
See TikaEvalDbDesign for more information on the underlying structure of the database.
In the absence of ground truth, it is often helpful to count the number of common words that were extracted (see TikaEvalMetrics for a discussion of this).
"Common words" are specified per language in the "resources/commonwords" directory. Each file is named for the language code, e.g. 'en', and each file is a UTF-8 text file with one word per line.
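For example, a commonwords file for English ('en') might begin like this (the entries here are illustrative; the actual lists ship in the resources/commonwords directory):

```
the
of
and
to
```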
The token processor runs language id against content and then selects the appropriate set of common words for its counts. If there is no common words file for a language, then it backs off to the default list, which is currently hardcoded to 'en'.
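The backoff described above can be sketched as follows. This is not tika-eval's actual code; the function name and directory layout are assumptions used only to illustrate the "use the language's file if present, else fall back to 'en'" behavior:

```shell
# Sketch of the common-words backoff (illustrative, not tika-eval source):
# pick commonwords/<lang> if a file for that language exists,
# otherwise fall back to the default 'en' list.
pick_commonwords() {
  lang="$1"
  dir="$2"
  if [ -f "$dir/$lang" ]; then
    echo "$dir/$lang"
  else
    echo "$dir/en"
  fi
}

mkdir -p cw
printf 'the\nof\nand\n' > cw/en   # only an 'en' list exists
pick_commonwords fr cw            # no 'fr' file, so this falls back to cw/en
rm -rf cw
```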
Make sure that your common words have gone through the same analysis chain as specified by the Common Words analyzer in 'lucene-analyzers.json'!
Let's say you want to compare the output of Tika to another tool that extracts text. You happen to have a directory of .json files for Tika and a directory of UTF-8 .txt files from the other tool.
java -jar tika-eval-X.Y.jar Compare -extractsA tika_1_14 -extractsB tika_1_15 -db comparisondb -alterExtract concatenate_content
java -jar tika-eval-X.Y.jar Compare -extractsA tika_1_14 -extractsB tika_1_15 -db comparisondb -alterExtract first_only
You may find that some extracts are too big to fit in memory, in which case use -maxExtractSize <maxBytes>, or you may want to focus only on extracts that are greater than a minimum length: -minExtractSize <minBytes>.
The module tika-eval comes with a list of reports. However, you might want to generate your own. Each report is specified by SQL and a few other configurations in an xml file. See comparison-reports.xml and profile-reports.xml for examples.
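A custom report file might look roughly like the sketch below. The element and attribute names here are assumptions based on the description above, not a documented schema; consult the comparison-reports.xml that ships with tika-eval for the authoritative format, and see TikaEvalDbDesign for the tables you can query:

```
<reports>
  <!-- hypothetical report entry; names and attributes are illustrative -->
  <report reportName="my_custom_report"
          reportFilename="my_custom_report.xlsx">
    <sql>
      select ...
    </sql>
  </report>
</reports>
```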
To specify your own reports on the commandline, use -rf (report file): java -jar tika-eval-X.Y.jar Report -db comparisondb -rf myreports.xml
If you'd like to write the reports to a root directory other than 'reports', specify that with -rd (report directory): java -jar tika-eval-X.Y.jar Report -db comparisondb -rd myreportdir
Again, see TikaEvalDbDesign for more information on the underlying structure of the database.