...
- OCR is now triggered automatically for PDFs if tesseract is on the user's path. See https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disableTikaOCR#disable-ocr for how to disable OCR.
- Removed deprecated Metadata keys/properties and moved some commonly used keys from Metadata to TikaCoreProperties (such as TikaCoreProperties.RESOURCE_NAME_KEY) (TIKA-1974). See below for a list of changed keys.
- We upgraded from log4j to log4j2 in tika-app, tika-server and anywhere else we used to use log4j.
- The tika-parsers package has been split into several sub-packages, including tika-parsers-standard-package, tika-parser-scientific-package and tika-parser-sqlite3-package.
- tika-app only includes parsers in tika-parsers-standard-package; users have to add tika-parser-scientific-package and tika-parser-sqlite3-package if desired.
- tika-server is now tika-server-standard and only includes parsers in tika-parsers-standard-package.
- tika-server is now run in --spawnChild mode by default.
- Removed deprecated PDFPreflightParser (TIKA-3437).
- Parsers are now configured via tika-config.xml on instantiation. We have moved away from configuration via .properties files because of confusion among users. This affects the PDFParser, TesseractOCRParser and the StringsParser. See below for links to the specific parsers.
- Changed namespaces of translator implementations (e.g. org.apache.tika.language.translate.impl) to avoid split-package with tika-core.
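
As an illustration of the tika-config.xml based configuration mentioned in the list above, the following is a minimal sketch, not an authoritative template. It assumes the parser class names org.apache.tika.parser.ocr.TesseractOCRParser and org.apache.tika.parser.pdf.PDFParser, and it uses extractInlineImages as an example PDFParser parameter; check both against the parser documentation for your Tika version. Excluding the TesseractOCRParser from the DefaultParser is one way to turn off the automatic OCR described above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Load the default parsers, minus the ones excluded here. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- Assumed class name: excluding it disables OCR via tesseract. -->
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
      <!-- Exclude the PDFParser so it can be configured separately below. -->
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <!-- Settings formerly placed in a .properties file now go in params. -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- Example parameter only; the name and type are assumptions. -->
        <param name="extractInlineImages" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```

A file like this would typically be passed to tika-app or tika-server at startup, or loaded when constructing the parser programmatically.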
...