Registration: https://openpreservation.org/events/embedded-files-risk-challenges-and-options/?q=4664

This will be a traditional talk with slides. However, if you'd like to get your hands on some tools and some files, please follow the directions below. If you have files you'd like to share, please email me: tallison AT apache.org

Data

https://corpora.tika.apache.org/base/share/opf-embedded-files-data-20220920.tgz

Read the notes.txt file for sources of the data.

Tools

Prerequisites

  • Must have Java >= 8 installed. 
  • If you have tesseract installed, you may want to move it/turn it off (e.g. sudo mv /usr/local/bin/tesseract /usr/local/bin/tesseract2)
  • A JSON viewer: Notepad++, Sublime Text, or similar.
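
The prerequisites above can be checked with a short script. This is a minimal sketch, not part of Tika: the function names are made up for illustration, and it assumes that java and tesseract would be found through the PATH.

```python
import re
import shutil
import subprocess


def parse_java_major(version_output):
    """Return the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if not match:
        return None
    first = int(match.group(1))
    # Java 8 and older report "1.8.0_..."; newer releases report "11.0.2", "17", ...
    if first == 1 and match.group(2):
        return int(match.group(2))
    return first


def check_prerequisites():
    """Print whether java (>= 8) and tesseract are visible on the PATH."""
    java = shutil.which("java")
    if java is None:
        print("java not found on PATH")
    else:
        # `java -version` normally writes to stderr, so read both streams.
        result = subprocess.run([java, "-version"], capture_output=True, text=True)
        major = parse_java_major(result.stderr + result.stdout)
        print(f"java major version: {major} (need >= 8)")

    tesseract = shutil.which("tesseract")
    if tesseract:
        print(f"tesseract found at {tesseract}; you may want to move it aside")
    else:
        print("tesseract not found on PATH")


if __name__ == "__main__":
    check_prerequisites()
```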

...

  1. Use the simple UI: java -jar tika-app-2.4.1.jar. Then drag and drop opf-embedded-files-data-20220920/ooxml/test_recursive_embedded.docx into the window and select "Recursive JSON."       
  2. Run Tika against a single file and redirect the output to a JSON file: java -jar tika-app-2.4.1.jar -J -t opf-embedded-files-data-20220920/ooxml/test_recursive_embedded.docx > test_recursive.json
  3. Run Tika in batch mode: java -jar tika-app-2.4.1.jar -J -t -i opf-embedded-files-data-20220920 -o opf-embedded-files-data-20220920-extracts
  4. To extract the literal embedded files (first level only!): java -jar tika-app-2.4.1.jar --extract-dir=261779-attachments -z opf-embedded-files-data-20220920/ppt/govdocs1/261779.ppt
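
If a JSON viewer is not handy, the output of steps 2 and 3 can also be summarized with a few lines of Python. This is a rough sketch under some assumptions: that each -J output file is a JSON array with one metadata object per document (the container first, then each embedded file), that embedded files carry an X-TIKA:embedded_resource_path key, and that the extracted text sits under X-TIKA:content. The function names are illustrative only.

```python
import json
import sys
from pathlib import Path


def first_value(value):
    """Metadata values may be a string or a list of strings; return one string."""
    if isinstance(value, list):
        return value[0] if value else ""
    return value if value is not None else ""


def summarize(metadata_list):
    """Return (embedded path, content type, extracted text length) per document."""
    rows = []
    for entry in metadata_list:
        # Assumption: the container file has no embedded_resource_path key.
        path = first_value(entry.get("X-TIKA:embedded_resource_path", "")) or "(container)"
        content_type = first_value(entry.get("Content-Type", "unknown"))
        text = first_value(entry.get("X-TIKA:content", ""))
        rows.append((path, content_type, len(text)))
    return rows


def main(target):
    """Summarize one JSON file, or every .json file under a directory."""
    target = Path(target)
    files = sorted(target.rglob("*.json")) if target.is_dir() else [target]
    for json_file in files:
        with open(json_file, encoding="utf-8") as handle:
            metadata_list = json.load(handle)
        print(json_file)
        for path, content_type, length in summarize(metadata_list):
            print(f"  {path}\t{content_type}\t{length} chars")


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Saved under a name of your choosing (say summarize_tika.py), it could be pointed at test_recursive.json from step 2 or at the opf-embedded-files-data-20220920-extracts directory from step 3.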


Advanced

Because of license issues, Tika cannot bundle the jpeg2000 parser. The digitally_signed_3D_Portfolio.pdf file contains jpeg2000. If the jpeg2000 parser's license is acceptable to you, put its jar and the tika-app-X.Y.Z.jar in a bin directory and call tika-app's main class: java -cp "bin/*" org.apache.tika.cli.TikaCLI --extract-dir=portfolio-files -z opf-embedded-files-data-20220920/pdfs/OPF-format-corpus/digitally_signed_3D_Portfolio.pdf
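
The same classpath-based call can be scripted. The sketch below is only an illustration with made-up function names. It assumes a bin directory that already holds the tika-app jar and the jpeg2000 parser jar. The "bin/*" wildcard is passed to java unexpanded, because the JVM expands classpath wildcards itself.

```python
import subprocess
from pathlib import Path


def build_extract_command(bin_dir, extract_dir, input_file):
    """Build the argument list for TikaCLI with every jar in bin_dir on the classpath."""
    # No shell is involved, so "<bin_dir>/*" reaches the JVM as-is and the JVM
    # expands it to all jar files in that directory.
    classpath = f"{bin_dir}/*"
    return [
        "java", "-cp", classpath,
        "org.apache.tika.cli.TikaCLI",
        f"--extract-dir={extract_dir}",
        "-z", str(input_file),
    ]


def extract_embedded(bin_dir, extract_dir, input_file):
    """Run the extraction and return the files written to extract_dir."""
    if not any(Path(bin_dir).glob("*.jar")):
        raise FileNotFoundError(f"no jar files found in {bin_dir}")
    subprocess.run(build_extract_command(bin_dir, extract_dir, input_file), check=True)
    return sorted(p for p in Path(extract_dir).rglob("*") if p.is_file())
```

For example, extract_embedded("bin", "portfolio-files", "opf-embedded-files-data-20220920/pdfs/OPF-format-corpus/digitally_signed_3D_Portfolio.pdf") would run the command above and list whatever ended up in portfolio-files.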

...