...
Key | Notes |
---|---|
Content-Type | The file's MIME type as identified by Tika. Example: application/pdf |
X-TIKA:digest:MD5 | If you've configured digests, they are returned with a key of the form X-TIKA:digest:ALGORITHM. |
resourceName | File name |
Content-Length | When available, the number of bytes in a stream |
X-TIKA:content | This is the text that is extracted from the files |
X-TIKA:content_handler | This is the content handler that was used for handling the text (e.g. Text, XHTML, etc.) |
X-TIKA:embedded_resource_path | |
X-TIKA:embedded_depth | |
X-TIKA:encrypted | If a parser throws an EncryptedDocumentException, the parser also sets this value to true in the metadata. |
tika:file_ext | File extension |
...
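The general keys listed above can be read out of a metadata record with ordinary dictionary operations. The following is a minimal sketch, assuming the metadata has already been loaded into a flat Python dict of string keys to string values (how it is obtained is outside this sketch). The helper names `extract_digests` and `is_encrypted`, the sample record, and the assumption that the encrypted flag is serialized as the string "true" are illustrative and not part of Tika.

```python
DIGEST_PREFIX = "X-TIKA:digest:"


def extract_digests(metadata):
    """Map algorithm name -> digest value for every X-TIKA:digest:ALGORITHM key."""
    return {
        key[len(DIGEST_PREFIX):]: value
        for key, value in metadata.items()
        if key.startswith(DIGEST_PREFIX)
    }


def is_encrypted(metadata):
    """True if X-TIKA:encrypted is present and set to true.

    Assumption: the value is serialized as the string "true".
    """
    return str(metadata.get("X-TIKA:encrypted", "")).lower() == "true"


# Hypothetical record; all values are invented for illustration.
general_sample = {
    "Content-Type": "application/pdf",
    "resourceName": "report.pdf",
    "X-TIKA:digest:MD5": "d41d8cd98f00b204e9800998ecf8427e",
    "X-TIKA:digest:SHA256": (
        "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
    ),
}

digests = extract_digests(general_sample)
print(sorted(digests))
print(is_encrypted(general_sample))
```

Stripping the prefix keeps the algorithm name as the lookup key, so a caller can ask for `digests.get("MD5")` without knowing in advance which digests were configured.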
Format Specific Metadata
PDF Metadata
PDF metadata is typically stored via two mechanisms: the "native" PDF docinfo metadata object and XMP. When the same key, e.g. "created", appears in both the docinfo and the XMP, Tika reports the value from the XMP. In this case, the created date in the XMP would be reported as dcterms:created.
Some users want to extract the literal docinfo information (irrespective of the XMP), and for that Tika prefixes keys with pdf:docinfo.
Note that XMP metadata may have custom keys, and some PDFs store custom metadata in the docinfo.
PDF is a "page-based" file format, and the number of pages is stored in xmpTPg:NPages.
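To make the XMP versus docinfo distinction concrete, here is a small sketch that reads both creation dates and the page count from a metadata record. It assumes the record is a flat Python dict of strings; the function names and the sample values are hypothetical, and the sketch does not call Tika itself.

```python
def created_dates(metadata):
    """Return (reported, literal_docinfo) creation dates; either may be None.

    dcterms:created is the value Tika reports (taken from XMP when both exist);
    pdf:docinfo:created is the literal docinfo value.
    """
    return metadata.get("dcterms:created"), metadata.get("pdf:docinfo:created")


def page_count(metadata):
    """Return xmpTPg:NPages as an int, or None if the key is absent."""
    raw = metadata.get("xmpTPg:NPages")
    return int(raw) if raw is not None else None


# Hypothetical record in which the XMP and the docinfo disagree.
pdf_sample = {
    "dcterms:created": "2020-01-02T03:04:05Z",
    "pdf:docinfo:created": "2019-12-31T23:59:59Z",
    "xmpTPg:NPages": "12",
}

reported, literal = created_dates(pdf_sample)
if reported and literal and reported != literal:
    print("XMP and docinfo creation dates differ:", reported, "vs", literal)
print("pages:", page_count(pdf_sample))
```

Comparing the two values is one way to spot PDFs whose XMP and docinfo have drifted apart, which is the case where the pdf:docinfo prefix matters.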
Key | Notes |
---|---|
access_permission:assemble_document | |
access_permission:can_modify | |
access_permission:can_print | |
access_permission:can_print_degraded | |
access_permission:extract_content | |
access_permission:extract_for_accessibility | |
access_permission:fill_in_form | |
access_permission:modify_annotations | |
pdf:actionTrigger | |
pdf:annotationSubtypes | |
pdf:annotationTypes | |
pdf:charsPerPage | |
pdf:docinfo:custom:* | Custom metadata stored in the docinfo dictionary, e.g. pdf:docinfo:custom:_dlc_policyId |
pdf:docinfo:created | |
pdf:docinfo:creator | |
pdf:docinfo:creator_tool | |
pdf:docinfo:keywords | |
pdf:docinfo:modified | |
pdf:docinfo:producer | |
pdf:docinfo:custom:Company | |
pdf:docinfo:title | |
pdf:docinfo:custom:SourceModified | |
pdf:docinfo:trapped | |
pdf:has3D | |
pdf:hasAcroFormFields | |
pdf:hasCollection | |
pdf:hasMarkedContent | |
pdf:hasXFA | |
pdf:hasXMP | |
pdf:PDFExtensionVersion | |
pdf:PDFVersion | |
pdf:producer | |
pdf:unmappedUnicodeCharsPerPage | |
pdfa:PDFVersion | |
pdfaid:conformance | |
pdfaid:part | |
pdfuaid:part | |
pdfvt:modified | |
pdfvt:version | |
pdfx:conformance | |
pdfx:version | |
pdfxid:version | |
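Wildcard entries such as pdf:docinfo:custom:* describe a family of keys, not a single key. One way to select such a family from a metadata record is shell-style pattern matching, sketched below with Python's standard `fnmatch` module. The helper name `select_keys` and the sample record (including its values) are hypothetical.

```python
from fnmatch import fnmatchcase


def select_keys(metadata, pattern):
    """Return the entries whose key matches a shell-style pattern.

    fnmatchcase is used so that matching is case-sensitive on every platform.
    """
    return {k: v for k, v in metadata.items() if fnmatchcase(k, pattern)}


# Hypothetical record; values are invented for illustration.
table_sample = {
    "pdf:docinfo:custom:_dlc_policyId": "policy-123",
    "pdf:docinfo:producer": "Example Producer",
    "access_permission:can_print": "true",
    "access_permission:can_modify": "false",
    "pdf:hasXMP": "true",
}

custom = select_keys(table_sample, "pdf:docinfo:custom:*")
permissions = select_keys(table_sample, "access_permission:*")
print(sorted(custom))
print(sorted(permissions))
```

The same helper works for other wildcard families in the tables, for example the pattern "rtf_pict:*" for RTF image metadata.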
Microsoft Office Files
Key | Notes |
---|---|
embeddedRelationshipId | |
RTF Files
Key | Notes |
---|---|
rtf_meta:emb_app_version | |
rtf_meta:emb_class | |
rtf_meta:thumbnail | |
rtf_pict:* | metadata around embedded images in RTF. A few examples include: rtf_pict:borderLeftColor, rtf_pict:borderRightColor, rtf_pict:borderTopColor, rtf_pict:dhgt, rtf_pict:dxHeightHR, rtf_pict:dxTextLeft, rtf_pict:dxTextRight, rtf_pict:dxWidthHR |
TIFF Files
Key | Notes |
---|---|
tiff:ImageWidth | |
tiff:ImageLength | |
tiff:BitsPerSample | |
...