Upgrading to PDFBox 2.0

PDFBox is on its way to releasing 2.0, which represents a major shift from the 1.8.x branch. This page documents the differences that users/consumers of Tika should expect after the upgrade. Progress is tracked in TIKA-1285.

NonSequential Parser

With 2.x, the older parser is gone, and the NonSequential parser is the main/only parser available. In 1.8.x, users of Tika can configure the use of the NonSequential parser via the config file. This choice will disappear in 2.x.

Speed/Memory

This is still in a state of flux. With some changes over the last few days, speed appears to be equivalent between 1.8.x with the non-sequential parser and 2.x. That said, parsing with the non-sequential parser is slightly slower than with the older parser (TODO: benchmarks).

Memory use configuration is currently going through some upgrades. It looks like clients will be able to set a memory threshold, and PDFBox will decide when to buffer to disk to stay under it.
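To make the threshold idea concrete, here is a minimal sketch of a buffer that stays in memory until a caller-supplied limit would be exceeded, then spills to a temporary file. It is an illustration of the general technique only: SpillingBuffer, maxMemoryBytes and isOnDisk are hypothetical names invented for this page, not PDFBox or Tika API, and the real PDFBox configuration may look quite different once it settles.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration; NOT a PDFBox or Tika class.
class SpillingBuffer extends OutputStream {
    private final long maxMemoryBytes;  // caller-supplied threshold (assumed name)
    private ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private OutputStream disk;          // non-null once the buffer has spilled
    private Path tempFile;
    private long size;

    SpillingBuffer(long maxMemoryBytes) {
        this.maxMemoryBytes = maxMemoryBytes;
    }

    @Override
    public void write(int b) throws IOException {
        write(new byte[] {(byte) b}, 0, 1);
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        // Spill before the write that would push us over the threshold.
        if (disk == null && size + len > maxMemoryBytes) {
            spillToDisk();
        }
        if (disk != null) {
            disk.write(buf, off, len);
        } else {
            memory.write(buf, off, len);
        }
        size += len;
    }

    private void spillToDisk() throws IOException {
        tempFile = Files.createTempFile("spill", ".tmp");
        disk = Files.newOutputStream(tempFile);
        memory.writeTo(disk);  // copy what was buffered so far
        memory = null;         // release the in-memory copy
    }

    boolean isOnDisk() {
        return disk != null;
    }

    long size() {
        return size;
    }

    @Override
    public void close() throws IOException {
        if (disk != null) {
            disk.close();
            Files.deleteIfExists(tempFile);  // scratch data only
        }
    }
}
```

With a threshold of 8 bytes, for example, writing 8 bytes keeps everything in memory, and the ninth byte moves the whole buffer to a temp file. The client only chooses the limit; the buffer decides when to go to disk.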

Character Encodings

  • I've noticed a handful of cases where ligatures in 1.8 are "spelled out" in 2.0 – e.g. "identi[fi]cation" in 1.8 has become "identification" in 2.0 (at least in 003403.pdf from govdocs1).
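When comparing text extracted by 1.8 against text extracted by 2.0, ligature differences like the one above can show up as spurious diffs. One way to factor them out is to normalize both outputs with Unicode compatibility normalization (NFKC) before comparing. The sketch below uses only the JDK; LigatureNormalizer and expandLigatures are hypothetical names for illustration, and the "[fi]" in the example above is assumed to be the single code point U+FB01. Note that NFKC also folds other compatibility characters (full-width forms, superscripts and so on), so it is better suited to comparison than to producing final output.

```java
import java.text.Normalizer;

// Hypothetical helper for comparing extraction output; not part of Tika or PDFBox.
final class LigatureNormalizer {
    private LigatureNormalizer() {}

    // NFKC replaces compatibility characters such as U+FB01 (the "fi" ligature)
    // with their plain equivalents ("f" followed by "i").
    static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String from18 = "identi\uFB01cation";  // ligature, as seen with 1.8
        String from20 = "identification";      // spelled out, as seen with 2.0
        System.out.println(expandLigatures(from18).equals(from20)); // prints "true"
    }
}
```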

TIFF Extraction

TIFFs are no longer extracted by PDFBox unless consumers add supplementary libraries, which are not Apache-friendly, to the classpath. For now, with straight Tika+PDFBox, if "extractInlineImages" is set to true and a TIFF is encountered, a zero-byte InputStream is sent to the embedded (TIFF) parser, which in turn throws an exception. With the standard AutoDetectParser, this embedded document exception is caught and silently ignored. The RecursiveParserWrapper also catches these exceptions, but it exposes them, so users can see how many TIFFs they are missing and which files contain them.

To get a sense of the external libraries you'll need to add, take a look at this pom
