...
Overriding Default Configuration
When using the OCR Parser, Tika uses the following default settings:
- Tesseract installation path = ""
- Language dictionary = "eng"
- Page Segmentation Mode = "1"
- Minimum file size = 0
- Maximum file size = 2147483647
- Timeout = 120 (seconds)
To change these settings, you can either modify the existing TesseractOCRConfig.properties file in tika-parser/src/main/resources/org/apache/tika/parser/ocr, or override it by creating your own TesseractOCRConfig.properties file and placing it in the package org/apache/tika/parser/ocr on your classpath.
Note that if you do this with one of the executable JARs (tika-app or tika-server), you must run it without the -jar option, because java ignores any user-supplied classpath when -jar is used. For example, use something like the following for tika-app or tika-server, respectively:
java -cp /path/to/your/classpath:/path/to/tika-app-X.X.jar org.apache.tika.cli.TikaCLI
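As a minimal sketch of the override approach described above, the following Java program writes a custom TesseractOCRConfig.properties into the expected package directory under a classpath root. The class name OcrConfigOverride and the helper writeOverride are hypothetical, and the keys language and timeout are assumed to match the keys in the bundled properties file; check the bundled file for the exact key names before relying on them.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper, not part of Tika.
class OcrConfigOverride {
    // Package location named in the text above, relative to a classpath root.
    static final String RESOURCE =
            "org/apache/tika/parser/ocr/TesseractOCRConfig.properties";

    // Writes the given settings to <classpathRoot>/org/apache/tika/parser/ocr/.
    static Path writeOverride(Path classpathRoot, Map<String, String> overrides)
            throws IOException {
        Properties props = new Properties();
        props.putAll(overrides);
        Path target = classpathRoot.resolve(RESOURCE);
        Files.createDirectories(target.getParent());
        try (Writer out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            props.store(out, "Custom Tesseract OCR settings");
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory stands in for /path/to/your/classpath.
        Path root = Files.createTempDirectory("tika-ocr-classpath");
        // Assumed key names; the values are examples only.
        Path written = writeOverride(root, Map.of("language", "deu", "timeout", "300"));
        System.out.println("Wrote " + written);
        System.out.println("Classpath root: " + root);
    }
}
```

The printed classpath root is the directory that would take the place of /path/to/your/classpath in the java -cp command, listed before the Tika JAR. A language such as deu only works if the corresponding Tesseract language data is installed.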
...
In Tika 2.x, users can modify the configuration via a tika-config.xml file. With the exception of the paths, the following documents the defaults:
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- this is not formally necessary, but prevents loading of unnecessary parser -->
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <!-- these are the defaults; you only need to specify the ones you want
             to modify -->
        <param name="applyRotation" type="bool">false</param>
        <param name="colorSpace" type="string">gray</param>
        <param name="density" type="int">300</param>
        <param name="depth" type="int">4</param>
        <param name="enableImagePreprocessing" type="bool">false</param>
        <param name="filter" type="string">triangle</param>
        <param name="imageMagickPath" type="string">/my/custom/imageMagicPath</param>
        <param name="language" type="string">eng</param>
        <param name="maxFileSizeToOcr" type="long">2147483647</param>
        <param name="minFileSizeToOcr" type="long">0</param>
        <param name="pageSegMode" type="string">1</param>
        <param name="pageSeparator" type="string"></param>
        <param name="preserveInterwordSpacing" type="bool">false</param>
        <param name="resize" type="int">200</param>
        <param name="skipOcr" type="bool">false</param>
        <param name="tessdataPath" type="string">/my/custom/data</param>
        <param name="tesseractPath" type="string">/my/custom/path</param>
        <param name="timeoutSeconds" type="int">120</param>
      </params>
    </parser>
  </parsers>
</properties>
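Because only the parameters you want to change need to be listed, it can help to check which overrides a given tika-config.xml actually declares. The sketch below does this with the JDK's built-in XML parser only; it does not use or validate against Tika itself. The class OcrParamLister and the method ocrParams are hypothetical names introduced for illustration, and the override values in main are examples.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika.
class OcrParamLister {
    static final String OCR_PARSER = "org.apache.tika.parser.ocr.TesseractOCRParser";

    // Returns the <param> name/value pairs declared for the Tesseract parser,
    // in document order. Fails with an exception if the XML is not well-formed.
    static Map<String, String> ocrParams(String configXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(configXml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> result = new LinkedHashMap<>();
        NodeList parsers = doc.getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            if (!OCR_PARSER.equals(parser.getAttribute("class"))) {
                continue; // skip DefaultParser and any other parser entries
            }
            NodeList params = parser.getElementsByTagName("param");
            for (int j = 0; j < params.getLength(); j++) {
                Element param = (Element) params.item(j);
                result.put(param.getAttribute("name"), param.getTextContent().trim());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // A reduced config that overrides only two of the documented defaults.
        String xml =
              "<properties><parsers>"
            + "<parser class=\"org.apache.tika.parser.DefaultParser\">"
            + "<parser-exclude class=\"" + OCR_PARSER + "\"/>"
            + "</parser>"
            + "<parser class=\"" + OCR_PARSER + "\"><params>"
            + "<param name=\"language\" type=\"string\">deu</param>"
            + "<param name=\"timeoutSeconds\" type=\"int\">300</param>"
            + "</params></parser>"
            + "</parsers></properties>";
        ocrParams(xml).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Any parameter that does not appear in the output is left at the default shown in the configuration above.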
OCR and PDFs
See also PDFParser notes for more details on options for performing OCR on PDFs.
...