...

Memory-use configuration is currently being reworked. It looks like clients will be able to set a memory threshold, and PDFBox will decide when to buffer to disk in order to stay under it.

Character Encodings

  • I've noticed a handful of cases where ligatures in 1.8 are "spelled out" in 2.0: for example, "identi[fi]cation" in 1.8 has become "identification" in 2.0 (at least in 003403.pdf from govdocs1).
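
When comparing text extracted by 1.8 against text extracted by 2.0, this ligature difference can show up as a spurious mismatch. The sketch below is one possible way to fold ligatures before comparing. It assumes the 1.8 output carried the single-code-point ligature U+FB01 (LATIN SMALL LIGATURE FI); the class and method names are hypothetical and come from neither PDFBox nor Tika. It uses only the JDK's java.text.Normalizer.

```java
import java.text.Normalizer;

// Hypothetical helper for comparing 1.8 and 2.0 extraction output.
final class LigatureFolding {

    // NFKC normalization replaces compatibility characters, including
    // U+FB01 (the "fi" ligature), with their plain-letter equivalents.
    static String foldLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    // True if the two strings are equal once ligatures are folded.
    static boolean sameAfterFolding(String a, String b) {
        return foldLigatures(a).equals(foldLigatures(b));
    }

    public static void main(String[] args) {
        // Assumption: this is what the 1.8 output looked like.
        String from18 = "identi\uFB01cation";
        String from20 = "identification";

        System.out.println(from18.equals(from20));            // raw comparison
        System.out.println(sameAfterFolding(from18, from20)); // folded comparison
    }
}
```

Note that NFKC also folds other compatibility characters (full-width forms, superscripts and so on), so it may hide differences beyond ligatures; a narrower mapping may be preferable if those matter.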

TIFF Extraction

TIFFs are no longer extracted by PDFBox unless consumers add supplementary libraries, whose licenses are not Apache-friendly, to the classpath. For now, with plain Tika+PDFBox, if "extractInlineImages" is set to true and a TIFF is encountered, a zero-byte input stream is sent to the embedded (TIFF) parser, which in turn throws an exception. With the standard AutoDetectParser, this embedded-document exception is caught and silently ignored. The RecursiveParserWrapper, by contrast, catches these exceptions and lets users see how many TIFFs they aren't getting and which files contain them.

...