NOTE: THIS PAGE IS IN PROGRESS. PLEASE CHECK BACK FOR MORE DETAILS.
For now, see: https://downloads.apache.org/tika/2.0.0/CHANGES-2.0.0.txt
Metadata.RESOURCE_NAME_KEY has been renamed to TikaCoreProperties.RESOURCE_NAME_KEY.
TikaCoreProperties.KEYWORDS has been renamed to Office.KEYWORDS.
- Meta
X-Parsed-By has changed to X-TIKA:Parsed-By.
X-TIKA:EXCEPTION:runtime has changed to X-TIKA:EXCEPTION:container_exception.
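If you have metadata that was stored by a 1.x run (for example in an index or a JSON dump), the two key-string changes above can be applied with a small lookup table. The following is only a sketch: the class MetadataKeyMigration and its migrate method are hypothetical and not part of Tika, and it assumes single-valued string metadata for brevity, whereas Tika metadata can be multi-valued.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of Tika: renames 1.x metadata keys to their 2.x names. */
final class MetadataKeyMigration {

    // Old key -> new key, limited to the key-string changes listed above.
    private static final Map<String, String> RENAMED_KEYS = Map.of(
            "X-Parsed-By", "X-TIKA:Parsed-By",
            "X-TIKA:EXCEPTION:runtime", "X-TIKA:EXCEPTION:container_exception");

    private MetadataKeyMigration() {
    }

    /** Returns a copy of the input with renamed keys; all other keys are kept unchanged. */
    static Map<String, String> migrate(Map<String, String> oldMetadata) {
        Map<String, String> migrated = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : oldMetadata.entrySet()) {
            String newKey = RENAMED_KEYS.getOrDefault(entry.getKey(), entry.getKey());
            migrated.put(newKey, entry.getValue());
        }
        return migrated;
    }

    public static void main(String[] args) {
        // Sample values are made up for illustration.
        Map<String, String> stored = new LinkedHashMap<>();
        stored.put("X-Parsed-By", "org.apache.tika.parser.DefaultParser");
        stored.put("Content-Type", "application/pdf");
        System.out.println(migrate(stored));
    }
}
```

The constant renames (RESOURCE_NAME_KEY, KEYWORDS) are source-level changes and are deliberately not covered here; they need an import and identifier update in your code rather than a data migration.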
tika-parsers – specific parser changes
tika-parsers module
When using tika-parsers in your project, you need to change the dependencies from
Code Block |
---|
language | xml |
---|
title | pom.xml from 1.27 |
---|
|
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.27</version>
</dependency> |
to
Code Block |
---|
language | xml |
---|
title | pom.xml for 2.0.0+ |
---|
|
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers-standard-package</artifactId>
  <version>2.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers-scientific-module</artifactId>
  <version>2.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers-sqlite3-module</artifactId>
  <version>2.0.0</version>
</dependency> |
There is also a small transitive dependency conflict on jcl-over-slf4j between tika-parsers-standard-package:2.0.0 and tika-parsers-scientific-module:2.0.0. If you are using the Maven Enforcer Plugin, you will need to resolve it by adding this:
Code Block |
---|
|
<!-- Fix tika-parsers-standard-package:2.0.0 vs tika-parsers-scientific-module:2.0.0 transitive dependency -->
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>jcl-over-slf4j</artifactId>
  <version>1.7.31</version>
</dependency> |
If you are checking for CVEs (recommended), note that tika-parsers-scientific-module:2.0.0 comes with a transitive dependency on quartz 2.2.0, which should be fixed like this:
Code Block |
---|
|
<dependency>
  <groupId>edu.ucar</groupId>
  <artifactId>netcdf4</artifactId>
  <version>${netcdf-java.version}</version>
  <exclusions>
    ...
    <exclusion>
      <groupId>org.quartz-scheduler</groupId>
      <artifactId>quartz</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>org.quartz-scheduler</groupId>
  <artifactId>quartz</artifactId>
  <version>2.3.2</version>
</dependency> |
When using language detection, you now need to use:
Code Block |
---|
language | xml |
---|
title | pom.xml 2.0.0 |
---|
|
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-langdetect-optimaize</artifactId>
  <version>2.0.0</version>
</dependency> |
Also note that org.apache.tika.langdetect.OptimaizeLangDetector.getDefaultLanguageDetector has moved to org.apache.tika.langdetect.optimaize.OptimaizeLangDetector.getDefaultLanguageDetector.
For OCR, you can no longer use the TesseractOCRConfig.setTesseractPath(String) and TesseractOCRConfig.setTessdataPath(String) methods. They have moved to the TesseractOCRParser class.
tika-parsers-module optional dependencies
zstd
The zstd dependency includes native libs and is not packaged with the tika-parsers-module. If you'd like to parse zstd files, include:
Code Block |
---|
|
<dependency>
  <groupId>com.github.luben</groupId>
  <artifactId>zstd-jni</artifactId>
  <version>1.5.0-4</version>
</dependency> |
TIFF and JPEG2000
If you plan to write TIFFs with Tika (for example, when rendering PDF pages for OCR), and if the BSD-3 with nuclear disclaimer license is acceptable to you, include:
Code Block |
---|
language | xml |
---|
title | jai-imageio-core |
---|
|
<dependency>
  <groupId>com.github.jai-imageio</groupId>
  <artifactId>jai-imageio-core</artifactId>
  <version>1.4.0</version>
</dependency> |
If you plan to process JPEG2000 images (the most common use case is rendering PDF pages for OCR), and if the BSD-3 with nuclear disclaimer license is acceptable to you, include:
Code Block |
---|
|
<dependency>
  <groupId>com.github.jai-imageio</groupId>
  <artifactId>jai-imageio-jpeg2000</artifactId>
  <version>1.4.0</version>
</dependency> |
Note: In 2.x, Tika will not warn you if a PDF page that you are trying to render contains a JPEG2000 image. PDFBox will log a warning.
tika-app
tika-server
General
Configuration
tika-pipes
See the tika-pipes page.
tika-eval
tika-langid
In the 1.x branch, the default (hardwired) language identification component was the wrapper around Optimaize. If you used the following in 1.x:
Code Block |
---|
language | xml |
---|
title | pom.xml 1.27 |
---|
|
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-langdetect</artifactId>
  <version>1.27</version>
</dependency> |
In 2.x, change this to:
Code Block |
---|
language | xml |
---|
title | optimaize-lang-detect |
---|
|
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-langdetect-optimaize</artifactId>
  <version>2.0.x</version>
</dependency> |
The original language identification component, which was built by Tika devs and used to be in tika-core, is now in the tika-langdetect-tika module.