This page offers a first draft of the documentation for the tika-eval-app module, which was initially added to Tika 1.15.
The module is intended to offer insight into the output of a single extraction tool or to enable comparisons between tools. This module is designed to help with Tika, but it could be used to evaluate other tools as well.
As part of Tika's periodic regression testing, we run this module against ~3 million files (committers/PMC interested in running the regression testing on our regression vm, see TikaEvalOnVM). However, it will not scale to 100s of millions of files as it is currently designed. Patches are welcome!
There are many tools for extracting text from various file formats, and even within a single tool there are usually countless parameters that can be tweaked. The goal of 'tika-eval' is to allow developers to quickly compare the output of two different tools, or of two versions or configurations of the same tool.
In addition to this "comparison mode", there is also plenty of information one can get from looking at a profile of a single run.
Some basic metrics for both the "comparison" and "profiling" modes might include the number of exceptions thrown during parsing, the amount of text extracted, and the number of common words in the extracted text.
The tika-eval module was initially developed for text only. For those interested in evaluating structure/style components (e.g. <title/> or <b/> elements), see TikaEvalAndStructuralComponents.
NOTE: tika-eval will not overwrite the contents of the database you specify in Profile or Compare mode. Add -drop to the commandline to drop tables if you are reusing the database.
The following assumes that you are using the default in-memory H2 database. To connect tika-eval to your own db via jdbc, see TikaEvalJdbc.
NOTE: assume the original input files are in a directory named input_docs and that the text extracts are written to the extracts directory, with each extract file having the same sub-directory path and same file name with '.json' or '.txt' appended to it.
java -jar tika-app-X.Y.jar -J -t -i input_docs -o extracts
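To make the expected layout concrete, here is a small sketch (the directory and file names below are made up for demonstration; tika-app itself creates the extract files):

```shell
# Illustration only: the extracts tree must mirror input_docs,
# with '.json' (or '.txt') appended to each file name.
mkdir -p demo/input_docs/reports demo/extracts/reports
touch demo/input_docs/reports/budget.pdf
# tika-app would write the corresponding extract here:
touch demo/extracts/reports/budget.pdf.json
find demo/extracts -name '*.json'
rm -rf demo
```

The `find` command above lists `demo/extracts/reports/budget.pdf.json`, mirroring `demo/input_docs/reports/budget.pdf`.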
java -jar tika-eval-X.Y.jar Profile -extracts extracts -db profiledb

java -jar tika-eval-X.Y.jar Report -db profiledb
You'll have a directory of .xlsx reports under the "reports" directory. Note: if you don't need the full tika-eval-app, you can get many of these statistics at parse time via the TikaEvalMetadataFilter (see: ModifyingContentWithHandlersAndMetadataFilters).
NOTE: assume the original input files are in a directory named input_docs and that the text extracts from tool A are written to the extractsA directory and the extracts from tool B are written to extractsB.
java -jar tika-eval-X.Y.jar Compare -extractsA extractsA -extractsB extractsB -db comparisondb
java -jar tika-eval-X.Y.jar Report -db comparisondb
You'll have a directory of .xlsx reports under the "reports" directory.
java -jar tika-eval-X.Y.jar StartDB

This calls java -cp ... org.h2.tools.Console -web. Point your browser to http://localhost:8082 and enter the jdbc connector code followed by the full path to your db file: jdbc:h2:/C:/users/someone/mystuff/tika-eval/comparisondb
If your reaction is: "You call this a database?!", please open tickets and contribute to improving the structure.
See TikaEvalDbDesign for more information on the underlying structure of the database.
In the absence of ground truth, it is often helpful to count the number of common words that were extracted (see TikaEvalMetrics for a discussion of this).
"Common words" are specified per language in the "resources/commonwords" directory. Each file is named for the language code, e.g. 'en', and each file is a UTF-8 text file with one word per line.
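For example, a commonwords file for English ('en') might begin like this (the entries here are illustrative; the actual lists ship in the resources/commonwords directory):

```
the
of
and
to
```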
The token processor runs language id against content and then selects the appropriate set of common words for its counts. If there is no common words file for a language, then it backs off to the default list, which is currently hardcoded to 'en'.
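The backoff described above can be sketched as follows. This is not tika-eval's actual code; the function name and directory layout are assumptions used only to illustrate the "use the language's file if present, else fall back to 'en'" behavior:

```shell
# Sketch of the common-words backoff (illustrative, not tika-eval source):
# pick commonwords/<lang> if a file for that language exists,
# otherwise fall back to the default 'en' list.
pick_commonwords() {
  lang="$1"
  dir="$2"
  if [ -f "$dir/$lang" ]; then
    echo "$dir/$lang"
  else
    echo "$dir/en"
  fi
}

mkdir -p cw
printf 'the\nof\nand\n' > cw/en   # only an 'en' list exists
pick_commonwords fr cw            # no 'fr' file, so this falls back to cw/en
rm -rf cw
```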
Make sure that your common words have gone through the same analysis chain as specified by the Common Words analyzer in 'lucene-analyzers.json'!
Let's say you want to compare the output of Tika to another tool that extracts text. You happen to have a directory of .json files for Tika and a directory of UTF-8 .txt files from the other tool.
java -jar tika-eval-X.Y.jar Compare -extractsA tika_1_14 -extractsB tika_1_15 -db comparisondb -alterExtract concatenate_content
java -jar tika-eval-X.Y.jar Compare -extractsA tika_1_14 -extractsB tika_1_15 -db comparisondb -alterExtract first_only
You may find that some extracts are too big to fit in memory, in which case use -maxExtractSize <maxBytes>, or you may want to focus only on extracts that are greater than a minimum length: -minExtractSize <minBytes>.
The module tika-eval comes with a list of reports. However, you might want to generate your own. Each report is specified by SQL and a few other configurations in an xml file. See comparison-reports.xml and profile-reports.xml for examples.
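A custom report file might look roughly like the sketch below. The element and attribute names here are assumptions based on the description above, not a documented schema; consult the comparison-reports.xml that ships with tika-eval for the authoritative format, and see TikaEvalDbDesign for the tables you can query:

```
<reports>
  <!-- hypothetical report entry; names and attributes are illustrative -->
  <report reportName="my_custom_report"
          reportFilename="my_custom_report.xlsx">
    <sql>
      select ...
    </sql>
  </report>
</reports>
```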
To specify your own reports on the commandline, use -rf (report file): java -jar tika-eval-X.Y.jar Report -db comparisondb -rf myreports.xml
If you'd like to write the reports to a root directory other than 'reports', specify that with -rd (report directory): java -jar tika-eval-X.Y.jar Report -db comparisondb -rd myreportdir
Again, see TikaEvalDbDesign for more information on the underlying structure of the database.