
"Embedded files: Risk, Challenges and Options"

Registration: https://openpreservation.org/events/embedded-files-risk-challenges-and-options/?q=4664

Data

https://corpora.tika.apache.org/base/share/opf-embedded-files-data-20220920.tgz

Tools

Prerequisites

  • Java 8 or later must be installed.
  • If you have tesseract installed, you may want to move it or turn it off so that Tika does not find it and run OCR on every image (e.g. sudo mv /usr/local/bin/tesseract /usr/local/bin/tesseract2)
  • A JSON viewer, e.g. Notepad++ or Sublime Text

Download

Command lines

  1. Use the simple UI: java -jar tika-app-2.4.1.jar. Then drag and drop opf-embedded-files-data-20220920/ooxml/test_recursive_embedded.docx into the window and select "Recursive JSON".
  2. Run Tika against a single file and redirect the output to a JSON file (-J requests recursive JSON, -t plain-text content): java -jar tika-app-2.4.1.jar -J -t opf-embedded-files-data-20220920/ooxml/test_recursive_embedded.docx > test_recursive.json
  3. Run Tika in batch mode (-i is the input directory, -o is the output directory): java -jar tika-app-2.4.1.jar -J -t -i opf-embedded-files-data-20220920 -o opf-embedded-files-data-20220920-extracts
  4. To extract the literal embedded files (first level only!): java -jar tika-app-2.4.1.jar --extract-dir=261779-attachments -z opf-embedded-files-data-20220920/ppt/govdocs1/261779.ppt
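
If a JSON viewer is not enough, the recursive JSON from steps 2 and 3 can also be summarized with a short script. The following is a minimal Python sketch, not part of Tika. It assumes that each output file holds a JSON array with one metadata object per parsed document (the container first, then each embedded file), and that the keys are named "X-TIKA:embedded_resource_path", "Content-Type" and "X-TIKA:content", as Tika 2.x typically writes them. It also assumes that batch mode writes files ending in .json. The function names are hypothetical; check the key names against your own output.

```python
import json
from pathlib import Path

# Assumed key names (typical for Tika 2.x recursive JSON); verify them
# against your own output before relying on this script.
PATH_KEY = "X-TIKA:embedded_resource_path"
TYPE_KEY = "Content-Type"
CONTENT_KEY = "X-TIKA:content"


def summarize_records(records):
    """Return one (path, content_type, text_length) tuple per parsed document."""
    rows = []
    for i, rec in enumerate(records):
        # The first record is assumed to be the container file, which
        # normally carries no embedded resource path.
        default_path = "(container)" if i == 0 else "(unknown)"
        path = rec.get(PATH_KEY, default_path)
        ctype = rec.get(TYPE_KEY, "(unknown)")
        text = rec.get(CONTENT_KEY) or ""
        rows.append((path, ctype, len(text)))
    return rows


def summarize_file(json_path):
    """Load one recursive JSON file and summarize its records."""
    with open(json_path, encoding="utf-8") as f:
        return summarize_records(json.load(f))


def summarize_batch(extracts_dir):
    """Map each *.json file under extracts_dir to its count of embedded files."""
    root = Path(extracts_dir)
    counts = {}
    for p in sorted(root.rglob("*.json")):
        rows = summarize_file(p)
        # Every record after the first is counted as an embedded file.
        counts[p.relative_to(root).as_posix()] = max(len(rows) - 1, 0)
    return counts
```

For example, summarize_file("test_recursive.json") lists every document found inside the .docx from step 2, and summarize_batch("opf-embedded-files-data-20220920-extracts") gives a per-file count of embedded documents for the batch run from step 3.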


Advanced

For licensing reasons, Tika cannot bundle the jpeg2000 parser. If its license is acceptable to you, put its jar and the tika-app-X.Y.Z.jar in a bin directory and call tika-app's main class: java -cp "bin/*" org.apache.tika.cli.TikaCLI --extract-dir=portfolio-files -z opf-embedded-files-data-20220920/pdfs/OPF-format-corpus/digitally_signed_3D_Portfolio.pdf
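
After an extraction run with --extract-dir, it can help to check what actually landed on disk. The following is a minimal Python sketch, not part of Tika. It makes no assumption about how Tika names the extracted files; it only walks the directory and reports each file with its size. The function name list_extracted is hypothetical.

```python
from pathlib import Path


def list_extracted(extract_dir):
    """Return sorted (relative_path, size_in_bytes) pairs for every file found."""
    root = Path(extract_dir)
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )
```

Calling list_extracted("portfolio-files") after the command above, or list_extracted("261779-attachments") after the .ppt extraction, returns one entry per extracted file. An empty list suggests that nothing was extracted.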


