"Embedded files: Risk, Challenges and Options"

Registration: https://openpreservation.org/events/embedded-files-risk-challenges-and-options/?q=4664


Data

https://corpora.tika.apache.org/base/share/opf-embedded-files-data-20220920.tgz

Tools

Prerequisites

  • Java 8 or later must be installed.
  • If you have Tesseract installed, you may want to move it or turn it off so that Tika does not invoke it for OCR (e.g. sudo mv /usr/local/bin/tesseract /usr/local/bin/tesseract2).
  • A JSON viewer, e.g. Notepad++ or Sublime Text.
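
The first two prerequisites can be checked with a short script. The following is a minimal sketch, not part of the workshop materials: it assumes java and tesseract would be found via PATH, and the helper names parse_java_major and check_prerequisites are hypothetical.

```python
import re
import shutil
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output.

    Handles both the legacy scheme ("1.8.0_292" -> 8) and the
    modern scheme ("17.0.2" -> 17). Returns None if nothing matches.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def check_prerequisites():
    """Print a short report on Java and Tesseract availability."""
    java = shutil.which("java")
    if java is None:
        print("java: not found on PATH")
    else:
        # `java -version` usually writes to stderr; read both streams to be safe.
        result = subprocess.run([java, "-version"], capture_output=True, text=True)
        major = parse_java_major(result.stderr + result.stdout)
        if major is None:
            print("java: found, but the version could not be parsed")
        elif major >= 8:
            print(f"java: version {major} (ok)")
        else:
            print(f"java: version {major} (too old, need 8 or later)")

    tesseract = shutil.which("tesseract")
    if tesseract is not None:
        print(f"tesseract: found at {tesseract}; consider moving or disabling it")
    else:
        print("tesseract: not on PATH")


if __name__ == "__main__":
    check_prerequisites()
```

The script only reports what it finds; it does not move or rename anything.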

Download

Command lines

  1. Use the simple UI: run java -jar tika-app-2.4.1.jar, then drag and drop opf-embedded-files-data-20220920/ooxml/test_recursive_embedded.docx into the window and select "Recursive JSON".
  2. Run Tika against a single file and redirect the output to a JSON file: java -jar tika-app-2.4.1.jar -J -t opf-embedded-files-data-20220920/ooxml/test_recursive_embedded.docx > test_recursive.json
  3. Run Tika in batch mode, reading from an input directory (-i) and writing the extracts to an output directory (-o): java -jar tika-app-2.4.1.jar -J -t -i opf-embedded-files-data-20220920 -o opf-embedded-files-data-20220920-extracts
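
As an alternative to a JSON viewer, the recursive JSON from step 2 can be summarized with a short script. This is a rough sketch, not part of the workshop materials. It assumes the output is a JSON array with one metadata object per container or embedded file, and that the keys X-TIKA:embedded_resource_path, Content-Type and X-TIKA:content are present as in typical Tika 2.x output. Check these against your own file. The function name summarize is hypothetical.

```python
import json
import sys


def summarize(records):
    """Return one (embedded path, content type, text length) tuple per record.

    Key names are assumptions based on typical Tika 2.x recursive JSON output.
    """
    rows = []
    for rec in records:
        # The container file itself normally has no embedded resource path.
        path = rec.get("X-TIKA:embedded_resource_path", "(container)")
        ctype = rec.get("Content-Type", "unknown")
        # Multi-valued metadata is serialized as a JSON array.
        if isinstance(ctype, list):
            ctype = ", ".join(ctype)
        text = rec.get("X-TIKA:content") or ""
        rows.append((path, ctype, len(text)))
    return rows


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python summarize.py test_recursive.json
    with open(sys.argv[1], encoding="utf-8") as fh:
        records = json.load(fh)
    for path, ctype, length in summarize(records):
        print(f"{path}\t{ctype}\t{length}")
```

Each output row corresponds to one file that Tika parsed, so the number of rows shows how many embedded files were found inside the container.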