...

Memory use configuration is currently being reworked. It looks like clients will be able to set a memory threshold, and PDFBox will decide when to buffer to disk to stay under it.
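To make the "stay under a threshold, otherwise buffer to disk" idea concrete, here is a minimal, self-contained sketch of the general technique. It is not PDFBox's implementation or API: the class `SpillingBuffer` and its methods are hypothetical names invented for illustration, and it uses only the Java standard library.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only: keeps data in memory until a
// caller-supplied threshold would be exceeded, then moves it to a temp file.
class SpillingBuffer implements AutoCloseable {
    private final long maxMemoryBytes;
    private ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private Path spillFile;
    private OutputStream spillOut;
    private long size;

    SpillingBuffer(long maxMemoryBytes) {
        this.maxMemoryBytes = maxMemoryBytes;
    }

    void write(byte[] data) {
        try {
            // Switch to disk the first time this write would cross the threshold.
            if (spillOut == null && size + data.length > maxMemoryBytes) {
                spillFile = Files.createTempFile("spill", ".tmp");
                spillOut = Files.newOutputStream(spillFile);
                memory.writeTo(spillOut); // copy what was buffered so far
                memory = null;            // release the in-memory copy
            }
            if (spillOut != null) {
                spillOut.write(data);
            } else {
                memory.write(data, 0, data.length);
            }
            size += data.length;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    boolean onDisk() { return spillOut != null; }

    long size() { return size; }

    @Override
    public void close() {
        try {
            if (spillOut != null) {
                spillOut.close();
                Files.deleteIfExists(spillFile);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        try (SpillingBuffer buffer = new SpillingBuffer(8)) {
            buffer.write(new byte[5]);
            System.out.println(buffer.onDisk()); // prints "false"
            buffer.write(new byte[5]);
            System.out.println(buffer.onDisk()); // prints "true"
        }
    }
}
```

The point of the design is that the caller only states a memory budget; the decision about when to move to disk is made inside the buffer, which is the division of responsibility described above.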

Character Encodings

I've noticed a handful of cases where ligatures in 1.8 are "spelled out" in 2.0, e.g. "identi[fi]cation" in 1.8 has become "identification" in 2.0 (at least in 003403.pdf from govdocs1).
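When comparing text extracted by the two versions, one way to avoid flagging these ligature differences is to normalize both outputs before diffing. The sketch below assumes the 1.8 output contains the single "fi" ligature code point (U+FB01); the class name `LigatureNormalizer` is hypothetical. It uses the standard `java.text.Normalizer`, and makes no claim about how PDFBox 2.0 produces the spelled-out form internally.

```java
import java.text.Normalizer;

// Hypothetical helper for comparing 1.8 and 2.0 extraction output.
class LigatureNormalizer {
    // NFKC compatibility normalization replaces Latin ligature code points
    // such as U+FB01 with their component letters ("f" + "i").
    static String expand(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String from18 = "identi\uFB01cation"; // assumed 1.8-style output with the ligature
        String from20 = "identification";     // 2.0-style output, spelled out
        System.out.println(expand(from18).equals(from20)); // prints "true"
    }
}
```

Note that NFKC also folds other compatibility characters (full-width forms, for example), so it is best applied to both sides of the comparison rather than to the stored text.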

TIFF Extraction

PDFBox no longer extracts TIFFs unless consumers add supplementary, non-Apache-friendly libraries to the classpath. For now, with plain Tika+PDFBox, if "extractInlineImages" is set to true and a TIFF is encountered, a zero-byte InputStream is sent to the embedded (TIFF) parser, which in turn throws an exception. With the standard AutoDetectParser, this embedded-document exception is caught and silently ignored. The RecursiveParserWrapper also catches these exceptions but reports them, so users can see which files contain TIFFs and how many TIFFs they aren't getting.

...
