...

Key                             Notes
Content-Type                    The file's MIME type as identified by Tika. Example: application/pdf
X-TIKA:digest:MD5               If you've configured digests, they are returned with a key of the form X-TIKA:digest:ALGORITHM.
resourceName                    File name
Content-Length                  When available, the number of bytes in the stream
X-TIKA:content                  The text extracted from the file
X-TIKA:content_handler          The content handler used for handling the text (e.g. Text, XHTML)
X-TIKA:embedded_resource_path
X-TIKA:embedded_depth
X-TIKA:encrypted                If a parser throws an EncryptedDocumentException, the parser also sets this value to true in the metadata.
tika:file_ext                   File extension
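
As a rough illustration of how these keys might be consumed, the sketch below works on a plain Python dict shaped like a single metadata record (for example, one parsed from Tika's JSON output). Only the key names come from the table above; the helper names collect_digests and is_encrypted and all sample values are hypothetical.

```python
DIGEST_PREFIX = "X-TIKA:digest:"


def collect_digests(metadata):
    """Return {ALGORITHM: value} for every X-TIKA:digest:ALGORITHM key."""
    return {
        key[len(DIGEST_PREFIX):]: value
        for key, value in metadata.items()
        if key.startswith(DIGEST_PREFIX)
    }


def is_encrypted(metadata):
    """True when X-TIKA:encrypted is present and set to "true"."""
    return str(metadata.get("X-TIKA:encrypted", "false")).lower() == "true"


# Hypothetical record; the values are made up for illustration.
sample = {
    "Content-Type": "application/pdf",
    "resourceName": "report.pdf",
    "X-TIKA:digest:MD5": "d41d8cd98f00b204e9800998ecf8427e",
    "X-TIKA:encrypted": "true",
}

print(collect_digests(sample))
print(is_encrypted(sample))
```

Matching on the X-TIKA:digest: prefix rather than on one fixed key keeps the helper usable whichever digest algorithms happen to be configured.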

...

Format Specific Metadata

PDF Metadata

PDF metadata is typically stored via two mechanisms: the "native" PDF docinfo metadata object and XMP.  Where the same key, e.g. "created", appears in both the docinfo and the XMP, Tika reports the value from the XMP.  In this case, the created date in the XMP is reported as dcterms:created.

Some users want to extract the literal docinfo information (irrespective of the XMP); for that, Tika also reports the docinfo values under keys prefixed with pdf:docinfo: (e.g. pdf:docinfo:created).
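
To make the distinction concrete, here is a minimal sketch that compares the reported creation date with the literal docinfo one. It assumes the metadata has already been loaded into a Python dict; the function names and the sample dates are hypothetical, and only the keys dcterms:created and pdf:docinfo:created are taken from this page.

```python
def created_dates(metadata):
    """Return (reported, docinfo) creation dates; either may be None."""
    reported = metadata.get("dcterms:created")
    docinfo = metadata.get("pdf:docinfo:created")
    return reported, docinfo


def docinfo_disagrees(metadata):
    """True when both dates are present and their values differ."""
    reported, docinfo = created_dates(metadata)
    return reported is not None and docinfo is not None and reported != docinfo


# Hypothetical record in which the XMP and the docinfo carry different dates.
sample = {
    "dcterms:created": "2020-01-02T03:04:05Z",
    "pdf:docinfo:created": "2019-12-31T23:59:59Z",
}

print(created_dates(sample))
print(docinfo_disagrees(sample))
```

The comparison is a plain string comparison, so two timestamps that denote the same instant in different notations would still count as a disagreement.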

Note that XMP metadata may have custom keys, and some PDFs store custom metadata in the docinfo.

PDF is a "page-based" file format, and the number of pages is stored in xmpTPg:NPages.
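
A small sketch of reading the page count defensively from a metadata dict follows. It assumes the value may arrive as a string, a number, or a one-element list depending on how the metadata was serialized; that assumption and the helper name page_count are not from this page.

```python
def page_count(metadata):
    """Return xmpTPg:NPages as an int, or None if missing or unparsable."""
    raw = metadata.get("xmpTPg:NPages")
    if isinstance(raw, list):
        # Assumption: a multi-valued key may be serialized as a list.
        raw = raw[0] if raw else None
    try:
        return int(raw)
    except (TypeError, ValueError):
        return None


print(page_count({"xmpTPg:NPages": "12"}))
print(page_count({"Content-Type": "text/plain"}))
```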


customCompany:custom:SourceModified
Key                     Notes
access_permission:assemble_document
access_permission:can_modify
access_permission:can_print
access_permission:can_print_degraded
access_permission:extract_content
access_permission:extract_for_accessibility
access_permission:fill_in_form
access_permission:modify_annotations
pdf:actionTrigger
pdf:annotationSubtypes
pdf:annotationTypes
pdf:charsPerPage
pdf:docinfo:custom:*    Custom metadata stored in the docinfo dictionary, e.g. pdf:docinfo:custom:_dlc_policyId
pdf:docinfo:created
pdf:docinfo:creator
pdf:docinfo:creator_tool
pdf:docinfo:keywords
pdf:docinfo:modified
pdf:docinfo:producer
pdf:docinfo:title
pdf:docinfo:trapped
pdf:has3D
pdf:hasAcroFormFields
pdf:hasCollection
pdf:hasMarkedContent
pdf:hasXFA
pdf:hasXMP
pdf:PDFExtensionVersion
pdf:PDFVersion
pdf:producer
pdf:unmappedUnicodeCharsPerPage
pdfa:PDFVersion
pdfaid:conformance
pdfaid:part
pdfuaid:part
pdfvt:modified
pdfvt:version
pdfx:conformance
pdfx:version
pdfxid:version
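
The prefixed keys in this table lend themselves to simple filtering. The sketch below pulls out the custom docinfo entries and lists the access permissions that are switched off. It assumes permission values are the strings "true" or "false"; that assumption, the helper names, and the sample values are hypothetical, and only the key names come from the table.

```python
CUSTOM_PREFIX = "pdf:docinfo:custom:"
PERMISSION_PREFIX = "access_permission:"


def custom_docinfo(metadata):
    """Return custom docinfo entries with the pdf:docinfo:custom: prefix removed."""
    return {
        key[len(CUSTOM_PREFIX):]: value
        for key, value in metadata.items()
        if key.startswith(CUSTOM_PREFIX)
    }


def denied_permissions(metadata):
    """Return the sorted names of access_permission:* keys whose value is "false"."""
    return sorted(
        key[len(PERMISSION_PREFIX):]
        for key, value in metadata.items()
        if key.startswith(PERMISSION_PREFIX) and str(value).lower() == "false"
    )


# Hypothetical record; the values are made up for illustration.
sample = {
    "pdf:docinfo:custom:_dlc_policyId": "abc-123",
    "pdf:docinfo:title": "Quarterly report",
    "access_permission:can_print": "true",
    "access_permission:can_modify": "false",
    "access_permission:extract_content": "false",
}

print(custom_docinfo(sample))
print(denied_permissions(sample))
```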




Microsoft Office Files

Key                     Notes
embeddedRelationshipId





RTF Files

Key                         Notes
rtf_meta:emb_app_version
rtf_meta:emb_class
rtf_meta:thumbnail
rtf_pict:*                  Metadata around embedded images in RTF. A few examples include: rtf_pict:borderLeftColor, rtf_pict:borderRightColor, rtf_pict:borderTopColor, rtf_pict:dhgt, rtf_pict:dxHeightHR, rtf_pict:dxTextLeft, rtf_pict:dxTextRight, rtf_pict:dxWidthHR




TIFF Files

Key                   Notes
tiff:ImageWidth
tiff:ImageLength
tiff:BitsPerSample

...