Extracting Embedded VBA and JS

By default, Tika ignores embedded VBA macros and JavaScript. To extract them, enable the relevant parser options either programmatically or via tika-config.xml:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
            <parser-exclude class="org.apache.tika.parser.html.HtmlParser"/>
            <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
            <parser-exclude class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
            <parser-exclude class="org.apache.tika.parser.microsoft.OfficeParser"/>
        </parser>

        <parser class="org.apache.tika.parser.html.HtmlParser">
            <params>
                <param name="extractScripts" type="bool">true</param>
            </params>
        </parser>
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="extractActions" type="bool">true</param>
            </params>
        </parser>
        <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
            <params>
                <param name="extractMacros" type="bool">true</param>
            </params>
        </parser>
        <parser class="org.apache.tika.parser.microsoft.OfficeParser">
            <params>
                <param name="extractMacros" type="bool">true</param>
            </params>
        </parser>
    </parsers>
</properties>

We encourage using the RecursiveParserWrapper (the -J option in tika-app or the /rmeta endpoint in tika-server). Its output makes the extracted data, and the boundaries between the parent file and the embedded files, easier to understand.
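
As a rough illustration of the /rmeta route, the sketch below uploads a file to a tika-server instance using only the JDK's built-in HTTP client (Java 11+). It assumes a server is already running with the configuration above and listening on http://localhost:9998; that address and the RmetaClient class and buildRequest helper are assumptions made for this example, not part of Tika.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class for illustration; not part of Tika.
class RmetaClient {

    // Builds a PUT request that uploads the raw file bytes to <baseUrl>/rmeta
    // and asks for a JSON response.
    static HttpRequest buildRequest(String baseUrl, byte[] fileBytes) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/rmeta"))
                .header("Accept", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(fileBytes))
                .build();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length < 1) {
            System.err.println("usage: RmetaClient <file> [baseUrl]");
            return;
        }
        // Assumed server address; override with a second argument if needed.
        String baseUrl = args.length > 1 ? args[1] : "http://localhost:9998";
        byte[] bytes = Files.readAllBytes(Path.of(args[0]));

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildRequest(baseUrl, bytes), HttpResponse.BodyHandlers.ofString());

        // Print the raw JSON returned by the server.
        System.out.println(response.body());
    }
}
```

The response should be a JSON array with one metadata object per file: the parent document and each embedded file, which is where extracted macros or scripts would be expected to appear when the options above are enabled.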
