"Embedded files: Risk, Challenges and Options"
Registration: https://openpreservation.org/events/embedded-files-risk-challenges-and-options/?q=4664
Data
https://corpora.tika.apache.org/base/share/opf-embedded-files-data-20220920.tgz
Tools
Prerequisites
- You must have Java >= 8 installed.
- If you have tesseract installed, you may want to move it or turn it off, e.g.
sudo mv /usr/local/bin/tesseract /usr/local/bin/tesseract2
- A JSON viewer: Notepad++ or Sublime Text or ....?
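The prerequisites can be checked with a short script before the session. This is a minimal sketch, not part of the workshop materials: the function names parse_java_major and check_prerequisites are hypothetical, and the version parsing assumes the usual `java -version` banner formats ("1.8.0_292" for Java 8, "17.0.2" for later releases).

```python
import re
import shutil
import subprocess


def parse_java_major(version_output):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Legacy scheme: "1.8.0_292" means Java 8. Modern scheme: "17.0.2" means Java 17.
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def check_prerequisites():
    """Print a short report on Java and tesseract availability."""
    java = shutil.which("java")
    if java is None:
        print("java: not found on PATH")
    else:
        # `java -version` normally writes its banner to stderr, so read both streams.
        result = subprocess.run([java, "-version"], capture_output=True, text=True)
        major = parse_java_major(result.stderr + result.stdout)
        ok = major is not None and major >= 8
        print(f"java: major version {major} ({'ok' if ok else 'need >= 8'})")

    tesseract = shutil.which("tesseract")
    if tesseract is None:
        print("tesseract: not on PATH")
    else:
        print(f"tesseract: found at {tesseract}; consider moving it aside")


if __name__ == "__main__":
    check_prerequisites()
```

The script only reports what it finds. It does not move or disable tesseract for you.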
Download
- tika-app: https://dlcdn.apache.org/tika/2.4.1/tika-app-2.4.1.jar
- optional: https://repo1.maven.org/maven2/com/github/jai-imageio/jai-imageio-jpeg2000/1.4.0/jai-imageio-jpeg2000-1.4.0.jar
Commandlines
- Use the simple UI:
java -jar tika-app-2.4.1.jar
Then drag and drop
opf-embedded-files-data-20220920/ooxml/test_recursive_embedded.docx
into the window and select "Recursive JSON".
- Run Tika against a single file and redirect the output to a JSON file:
java -jar tika-app-2.4.1.jar -J -t opf-embedded-files-data-20220920/ooxml/test_recursive_embedded.docx > test_recursive.json
- Run Tika in batch mode:
java -jar tika-app-2.4.1.jar -J -t -i opf-embedded-files-data-20220920 -o opf-embedded-files-data-20220920-extracts
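If a JSON viewer is not enough, the output of the commands above can also be summarized with a short script. This is a minimal sketch under stated assumptions: it assumes the -J output is a JSON array with one metadata object per file (the container first, then each embedded file), that the keys are named X-TIKA:embedded_resource_path, Content-Type and X-TIKA:content, and that batch mode writes one .json file per input file under the output directory. Check these against your own output. The function names are hypothetical.

```python
import json
import sys
from collections import Counter
from pathlib import Path

# Metadata keys assumed for Tika 2.x output; verify them in your own JSON.
PATH_KEY = "X-TIKA:embedded_resource_path"
TYPE_KEY = "Content-Type"
CONTENT_KEY = "X-TIKA:content"


def first_value(value, default):
    """Return a single string even if a metadata field is a multi-valued list."""
    if isinstance(value, list):
        return value[0] if value else default
    return value if value is not None else default


def summarize(metadata_list):
    """Return (path, content type, text length) for each entry in a -J array."""
    rows = []
    for index, entry in enumerate(metadata_list):
        # The first entry is the container file; it may carry no embedded path.
        fallback = "<container>" if index == 0 else "<unknown>"
        path = first_value(entry.get(PATH_KEY), fallback)
        content = first_value(entry.get(CONTENT_KEY), "")
        content_type = first_value(entry.get(TYPE_KEY), "<no type>")
        rows.append((path, content_type, len(content)))
    return rows


def count_types(extract_dir):
    """Count content types across every .json file under a batch output directory."""
    counts = Counter()
    for json_file in sorted(Path(extract_dir).rglob("*.json")):
        with open(json_file, encoding="utf-8") as handle:
            for entry in json.load(handle):
                counts[first_value(entry.get(TYPE_KEY), "<no type>")] += 1
    return counts


# Only runs when a path is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    target = Path(sys.argv[1])
    if target.is_dir():
        for content_type, count in count_types(target).most_common():
            print(f"{count:6d}  {content_type}")
    else:
        with open(target, encoding="utf-8") as handle:
            for path, content_type, length in summarize(json.load(handle)):
                print(f"{path}\t{content_type}\t{length} chars")
```

Saved under a name of your choice, the script takes either test_recursive.json or the opf-embedded-files-data-20220920-extracts directory as its only argument. For a single file it prints one line per container or embedded file. For a directory it prints a count per content type.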