Apache Solr Documentation

*** As of June 2017, the latest Solr Ref Guide is located at https://lucene.apache.org/solr/guide ***

Please note comments on these pages have now been disabled for all users.

Solr uses code from the Apache Tika project to provide a framework for incorporating many different file-format parsers such as Apache PDFBox and Apache POI into Solr itself. Working with this framework, Solr's ExtractingRequestHandler can use Tika to support uploading binary files, including files in popular formats such as Word and PDF, for data extraction and indexing.

When this framework was under development, it was called the Solr Content Extraction Library or CEL; from that abbreviation came this framework's name: Solr Cell.

If you want to supply your own ContentHandler for Solr to use, you can extend the ExtractingRequestHandler and override the createFactory() method. This factory is responsible for constructing the SolrContentHandler that interacts with Tika, and allows literals to override Tika-parsed values. Set the parameter literalsOverride, which defaults to true, to false to append Tika-parsed values to literal values.

For more information on Solr's Extracting Request Handler, see https://wiki.apache.org/solr/ExtractingRequestHandler.

Key Concepts

When using the Solr Cell framework, it is helpful to keep the following in mind:

  • Tika will automatically attempt to determine the input document type (Word, PDF, HTML) and extract the content appropriately. If you like, you can explicitly specify a MIME type for Tika with the stream.type parameter.
  • Tika works by producing an XHTML stream that it feeds to a SAX ContentHandler. SAX is a common interface implemented for many different XML parsers. For more information, see http://www.saxproject.org/quickstart.html.
  • Solr then responds to Tika's SAX events and creates the fields to index.
  • Tika produces metadata such as Title, Subject, and Author according to specifications such as Dublin Core. See http://tika.apache.org/1.7/formats.html for the file types supported.
  • Tika adds all the extracted text to the content field.
  • You can map Tika's metadata fields to Solr fields.
  • You can pass in literals for field values. Literals will override Tika-parsed values, including fields in the Tika metadata object, the Tika content field, and any "captured content" fields.
  • You can apply an XPath expression to the Tika XHTML to restrict the content that is produced.

While Apache Tika is quite powerful, it is not perfect and fails on some files. PDF files are particularly problematic, mostly due to the PDF format itself. In case of a failure processing any file, the ExtractingRequestHandler does not have a secondary mechanism to try to extract some text from the file; it will throw an exception and fail.

Trying out Tika with the Solr techproducts Example

You can try out the Tika framework using the techproducts example included in Solr.

Start the example:
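
A minimal sketch of the start command, assuming the current directory is the root of a Solr 6.x installation and that bin/solr is the control script shipped with it. The variable name and the fallback message are illustrative additions.

```shell
# Assumption: run from the root of the Solr installation.
EXAMPLE_NAME="techproducts"

# -e runs one of the bundled examples; techproducts starts Solr on port 8983
# and creates a core named techproducts.
bin/solr -e "$EXAMPLE_NAME" || echo "could not start Solr: run this from the Solr installation directory" >&2
```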

You can now use curl to send a sample PDF file via HTTP POST:
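
A sketch of the upload request, assuming Solr is listening on localhost:8983 with the techproducts core and that the sample file sits at example/exampledocs/solr-word.pdf under the installation root. The variable names and the fallback message are illustrative, not part of Solr.

```shell
# Assumptions: local Solr on port 8983, techproducts core, sample PDF from the distribution.
SOLR_CORE_URL="http://localhost:8983/solr/techproducts"
SAMPLE_PDF="example/exampledocs/solr-word.pdf"
EXTRACT_URL="$SOLR_CORE_URL/update/extract?literal.id=doc1&commit=true"

# -F posts the file as multipart/form-data; @ tells curl to read the file at that path.
curl "$EXTRACT_URL" -F "myfile=@$SAMPLE_PDF" || echo "upload failed: is Solr running and is the file path valid?" >&2
```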

The URL above calls the Extracting Request Handler, uploads the file solr-word.pdf and assigns it the unique ID doc1. Here's a closer look at the components of this command:

  • The literal.id=doc1 parameter provides the necessary unique ID for the document being indexed.
  • The commit=true parameter causes Solr to perform a commit after indexing the document, making it immediately searchable. For optimum performance when loading many documents, don't call the commit command until you are done.
  • The -F flag instructs curl to POST data using the Content-Type multipart/form-data and supports the uploading of binary files. The @ symbol instructs curl to upload the attached file.
  • The argument myfile=@example/exampledocs/solr-word.pdf must point to a valid path, which can be absolute or relative.

You can also use bin/post to send a PDF file into Solr (without the params, the literal.id parameter would be set to the absolute path to the file):
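
A sketch of the bin/post variant, assuming the same installation layout and sample file as in the curl description. The literal.id value "a" is only an illustrative ID.

```shell
# Assumption: run from the root of the Solr installation.
SAMPLE_PDF="example/exampledocs/solr-word.pdf"
POST_PARAMS="literal.id=a"

# -c names the target core; -params passes extra request parameters to the handler.
bin/post -c techproducts "$SAMPLE_PDF" -params "$POST_PARAMS" || echo "bin/post failed: run this from the Solr installation directory" >&2
```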

Now you should be able to execute a query and find that document. You can make a request like http://localhost:8983/solr/techproducts/select?q=pdf.

You may notice that although the content of the sample document has been indexed and stored, there are not a lot of metadata fields associated with this document. This is because unknown fields are ignored according to the default parameters configured for the /update/extract handler in solrconfig.xml, and this behavior can be easily changed or overridden. For example, to store and see all metadata and content, execute the following:
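
One way to issue such a request is sketched here with bin/post, assuming the techproducts core and the sample PDF from the distribution. The variable names are illustrative.

```shell
# Assumption: run from the root of the Solr installation.
SAMPLE_PDF="example/exampledocs/solr-word.pdf"
# uprefix=attr_ prefixes every field that is not defined in the schema.
POST_PARAMS="literal.id=doc1&uprefix=attr_"

bin/post -c techproducts "$SAMPLE_PDF" -params "$POST_PARAMS" || echo "bin/post failed: run this from the Solr installation directory" >&2
```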

In this command, the uprefix=attr_ parameter causes all generated fields that aren't defined in the schema to be prefixed with attr_, which matches a dynamic field that is stored and indexed.

This command allows you to query the document using an attribute, as in: http://localhost:8983/solr/techproducts/select?q=attr_meta:microsoft.

Input Parameters

The table below describes the parameters accepted by the Extracting Request Handler.

Parameter

Description

capture

Captures XHTML elements with the specified name for a supplementary addition to the Solr document. This parameter can be useful for copying chunks of the XHTML into a separate field. For instance, it could be used to grab paragraphs (<p>) and index them into a separate field. Note that content is still also captured into the overall "content" field.

captureAttr

Indexes attributes of the Tika XHTML elements into separate fields, named after the element. For example, if set to true when extracting from HTML, Tika can return the href attributes in <a> tags as fields named "a". See the examples below.

commitWithin

Commits the added document within the specified number of milliseconds.

date.formats

Defines the date format patterns to identify in the documents.

defaultField

If the uprefix parameter (see below) is not specified and a field cannot be determined, the default field will be used.

extractOnly

Default is false. If true, returns the extracted content from Tika without indexing the document. This literally includes the extracted XHTML as a string in the response. When viewing manually, it may be useful to use a response format other than XML to aid in viewing the embedded XHTML tags. For an example, see http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput.

extractFormat

Default is "xml", but the other option is "text". Controls the serialization format of the extract content. The xml format is actually XHTML, the same format that results from passing the -x command to the Tika command line application, while the text format is like that produced by Tika's -t command. This parameter is valid only if extractOnly is set to true.

fmap.<source_field>

Maps (moves) one field name to another. The source_field must be a field in incoming documents, and the value is the Solr field to map to. Example: fmap.content=text causes the data in the content field generated by Tika to be moved to the Solr's text field.

ignoreTikaException

If true, exceptions found during processing will be skipped. Any metadata available, however, will be indexed.

literal.<fieldname>

Populates a field with the name supplied with the specified value for each document. The data can be multivalued if the field is multivalued.

literalsOverride

If true (the default), literal field values will override other values with the same field name. If false, literal values defined with literal.<fieldname> will be appended to data already in the fields extracted from Tika. If setting literalsOverride to "false", the field must be multivalued.

lowernames

Values are "true" or "false". If true, all field names will be mapped to lowercase with underscores, if needed. For example, "Content-Type" would be mapped to "content_type".

multipartUploadLimitInKB

Defines the maximum size, in KB, of documents that may be uploaded. Useful when uploading very large documents.

passwordsFile

Defines the path and name of a file that maps file names to passwords.

resource.name

Specifies the optional name of the file. Tika can use it as a hint for detecting a file's MIME type.

resource.password

Defines a password to use for a password-protected PDF or OOXML file.

tika.config

Defines a file path and name to a customized Tika configuration file. This is only required if you have customized your Tika implementation.

uprefix

Prefixes all fields that are not defined in the schema with the given prefix. This is very useful when combined with dynamic field definitions. Example: uprefix=ignored_ would effectively ignore all unknown fields generated by Tika, given that the example schema contains <dynamicField name="ignored_*" type="ignored"/>.

xpath

When extracting, only return Tika XHTML content that satisfies the given XPath expression. See http://tika.apache.org/1.7/index.html for details on the format of Tika XHTML. See also http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput.

Order of Operations

Here is the order in which the Solr Cell framework, using the Extracting Request Handler and Tika, processes its input.

  1. Tika generates fields, or fields are passed in as literals specified by literal.<fieldname>=<value>. If literalsOverride=false, literals are appended as additional values to the Tika-generated field.
  2. If lowernames=true, field names are mapped to lowercase.
  3. The mapping rules specified by fmap.source=target parameters are applied.
  4. If uprefix is specified, any unknown field names are prefixed with that value. Otherwise, if defaultField is specified, any unknown fields are copied to the default field.

Configuring the Solr ExtractingRequestHandler

If you are not working with the supplied sample_techproducts_configs or data_driven_schema_configs config set, you must configure your own solrconfig.xml to know about the JARs containing the ExtractingRequestHandler and its dependencies:
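
A sketch of the <lib> directives, assuming the standard layout of a Solr 6.x distribution (contrib/extraction/lib and dist) and a core located three levels below the installation root. Adjust the paths if your layout differs.

```xml
<!-- Assumption: solr.install.dir is set, or the core sits three levels below the install root. -->
<lib dir="${solr.install.dir:../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../..}/dist/" regex="solr-cell-\d.*\.jar" />
```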

You can then configure the ExtractingRequestHandler in solrconfig.xml.
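
A sketch of such a handler definition. The path /my/path/to/tika.config and the file name parseContext.xml are placeholders, and the single date format is only an example.

```xml
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Map Tika's Last-Modified metadata attribute to the last_modified field. -->
    <str name="fmap.Last-Modified">last_modified</str>
    <!-- Prefix undeclared fields so that the ignored_* dynamic field drops them. -->
    <str name="uprefix">ignored_</str>
  </lst>
  <!-- Optional: path to a Tika configuration file (placeholder path). -->
  <str name="tika.config">/my/path/to/tika.config</str>
  <!-- Optional: one or more date formats to parse. -->
  <lst name="date.formats">
    <str>yyyy-MM-dd</str>
  </lst>
  <!-- Optional: external file with parser-specific properties (placeholder name). -->
  <str name="parseContext.config">parseContext.xml</str>
</requestHandler>
```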

In the defaults section, we are mapping Tika's Last-Modified metadata attribute to a field named last_modified. We are also telling it to ignore undeclared fields. Because these are defaults, they can all be overridden by request parameters.

The tika.config entry points to a file containing a Tika configuration. The date.formats entry allows you to specify various java.text.SimpleDateFormat patterns used to transform extracted input into a Date. Solr comes configured with the following date formats (see the DateUtil class in Solr):

yyyy-MM-dd'T'HH:mm:ss'Z'
yyyy-MM-dd'T'HH:mm:ss
yyyy-MM-dd
yyyy-MM-dd hh:mm:ss
yyyy-MM-dd HH:mm:ss
EEE MMM d hh:mm:ss z yyyy
EEE, dd MMM yyyy HH:mm:ss zzz
EEEE, dd-MMM-yy HH:mm:ss zzz
EEE MMM d HH:mm:ss yyyy

You may also need to adjust the multipartUploadLimitInKB attribute as follows if you are submitting very large documents.
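
A sketch of where this attribute lives in solrconfig.xml. The limit of 20480 KB is only an illustrative value, and the other attributes shown are assumptions about a typical configuration.

```xml
<requestDispatcher handleSelect="true">
  <!-- multipartUploadLimitInKB caps the size of multipart uploads; 20480 KB is an example value. -->
  <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="20480" />
</requestDispatcher>
```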

Parser specific properties

Parsers used by Tika may have specific properties to govern how data is extracted. For instance, when using the Tika library from a Java program, the PDFParserConfig class has a method setSortByPosition(boolean) that can extract vertically oriented text. To access that method via configuration with the ExtractingRequestHandler, one can add the parseContext.config property to the solrconfig.xml file (see above) and then set properties in Tika's PDFParserConfig as below. Consult the Tika Java API documentation for configuration parameters that can be set for any particular parsers that require this level of control.
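
A sketch of such a properties file, assuming it is the file referenced by parseContext.config. The extractInlineImages property is included only as a second example of a PDFParserConfig setting.

```xml
<entries>
  <entry class="org.apache.tika.parser.pdf.PDFParserConfig" impl="org.apache.tika.parser.pdf.PDFParserConfig">
    <!-- Each property corresponds to a setter on PDFParserConfig, e.g. setSortByPosition(boolean). -->
    <property name="extractInlineImages" value="true"/>
    <property name="sortByPosition" value="true"/>
  </entry>
</entries>
```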

Multi-Core Configuration

For a multi-core configuration, you can specify sharedLib='lib' in the <solr/> section of solr.xml and place the necessary jar files there.

For more information about Solr cores, see The Well-Configured Solr Instance.

Indexing Encrypted Documents with the ExtractingRequestHandler

The ExtractingRequestHandler will decrypt encrypted files and index their content if you supply a password in either resource.password on the request, or in a passwordsFile file.

In the case of passwordsFile, the file supplied must be formatted so there is one line per rule. Each rule contains a file name regular expression, followed by "=", then the password in clear-text. Because the passwords are in clear-text, the file should have strict access restrictions.
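
A sketch of creating such a file from the shell. The file name passwords.txt, the patterns, and the passwords are all hypothetical values chosen for illustration.

```shell
# Hypothetical file name; use whatever path you pass in the passwordsFile parameter.
PASSWORDS_FILE="passwords.txt"

# One rule per line: <file name regular expression> = <clear-text password>.
cat > "$PASSWORDS_FILE" <<'EOF'
myFileName = myPassword
.*\.docx$ = myWordPassword
.*\.pdf$ = myPdfPassword
EOF

# The passwords are stored in clear text, so limit access to the file owner.
chmod 600 "$PASSWORDS_FILE"
```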

Examples

Metadata

As mentioned before, Tika produces metadata about the document. Metadata describes different aspects of a document, such as the author's name, the number of pages, the file size, and so on. The metadata produced depends on the type of document submitted. For instance, PDFs have different metadata than Word documents do.

In addition to Tika's metadata, Solr adds the following metadata (defined in ExtractingMetadataConstants):

Solr Metadata

Description

stream_name

The name of the Content Stream as uploaded to Solr. Depending on how the file is uploaded, this may or may not be set.

stream_source_info

Any source info about the stream. (See the section on Content Streams later in this section.)

stream_size

The size of the stream in bytes.

stream_content_type

The content type of the stream, if available.

We recommend that you try using the extractOnly option to discover which values Solr is setting for these metadata elements.

Examples of Uploads Using the Extracting Request Handler

Capture and Mapping

The command below captures <div> tags separately, and then maps all the instances of that field to a dynamic field named foo_t.
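
A sketch of such a command, assuming the techproducts core, the sample.html file from the distribution, and a catch-all field named text in the schema. The document ID doc2 is illustrative.

```shell
# Assumption: run from the root of the Solr installation.
SAMPLE_HTML="example/exampledocs/sample.html"
# capture=div collects <div> content separately; fmap.div=foo_t maps it to the dynamic field foo_t.
CAPTURE_PARAMS="literal.id=doc2&captureAttr=true&defaultField=text&fmap.div=foo_t&capture=div"

bin/post -c techproducts "$SAMPLE_HTML" -params "$CAPTURE_PARAMS" || echo "bin/post failed: run this from the Solr installation directory" >&2
```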

Using Literals to Define Your Own Metadata

To add in your own metadata, pass in the literal parameter along with the file:
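
A sketch that adds one literal field, assuming the techproducts schema provides a *_s dynamic string field. The field name blah_s, its value, and the ID doc4 are illustrative.

```shell
# Assumption: run from the root of the Solr installation.
SAMPLE_HTML="example/exampledocs/sample.html"
# literal.blah_s=Bah stores the value "Bah" in the field blah_s alongside the extracted content.
LITERAL_PARAMS="literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&literal.blah_s=Bah"

bin/post -c techproducts "$SAMPLE_HTML" -params "$LITERAL_PARAMS" || echo "bin/post failed: run this from the Solr installation directory" >&2
```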

XPath

The example below passes in an XPath expression to restrict the XHTML returned by Tika:
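
A sketch of such a request, assuming the techproducts core and the sample.html file from the distribution. The XPath shown selects the nodes below every <div> in the XHTML body, and the ID doc5 is illustrative.

```shell
# Assumption: run from the root of the Solr installation.
SAMPLE_HTML="example/exampledocs/sample.html"
# The xhtml: prefix refers to the namespace of Tika's XHTML output.
XPATH_EXPR="/xhtml:html/xhtml:body/xhtml:div//node()"
XPATH_PARAMS="literal.id=doc5&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&xpath=$XPATH_EXPR"

bin/post -c techproducts "$SAMPLE_HTML" -params "$XPATH_PARAMS" || echo "bin/post failed: run this from the Solr installation directory" >&2
```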

Extracting Data without Indexing It

Solr allows you to extract data without indexing. You might want to do this if you're using Solr solely as an extraction server or if you're interested in testing Solr extraction.

The example below sets the extractOnly=true parameter to extract data without indexing it.
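
A sketch using curl, assuming Solr on localhost:8983 with the techproducts core and the sample.html file from the distribution. The variable names are illustrative.

```shell
# Assumptions: local Solr on port 8983, techproducts core, run from the install root.
SAMPLE_HTML="example/exampledocs/sample.html"
EXTRACT_ONLY_URL="http://localhost:8983/solr/techproducts/update/extract?extractOnly=true"

# The file is sent as the raw request body, so the content type is given explicitly.
curl "$EXTRACT_ONLY_URL" --data-binary "@$SAMPLE_HTML" -H 'Content-type:text/html' || echo "request failed: is Solr running and is the file path valid?" >&2
```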

The output includes XML generated by Tika, which Solr escapes again when embedding it in its own XML response. Requesting a different response format makes the output more readable; with bin/post, `-out yes` instructs the tool to echo Solr's output to the console:
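
A sketch with bin/post, assuming the techproducts core. The choice of wt=json is an assumption made here; any non-XML response writer would serve the same purpose.

```shell
# Assumption: run from the root of the Solr installation.
SAMPLE_HTML="example/exampledocs/sample.html"
# wt=json avoids double XML escaping; indent=true pretty-prints the response.
READABLE_PARAMS="extractOnly=true&wt=json&indent=true"

bin/post -c techproducts -params "$READABLE_PARAMS" -out yes "$SAMPLE_HTML" || echo "bin/post failed: run this from the Solr installation directory" >&2
```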

Sending Documents to Solr with a POST

The example below streams the file as the body of the POST, which does not, then, provide information to Solr about the name of the file.
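
A sketch of such a request, assuming Solr on localhost:8983 with the techproducts core and a catch-all field named text. The ID doc6 is illustrative.

```shell
# Assumptions: local Solr on port 8983, techproducts core, run from the install root.
SAMPLE_HTML="example/exampledocs/sample.html"
STREAM_URL="http://localhost:8983/solr/techproducts/update/extract?literal.id=doc6&defaultField=text&commit=true"

# --data-binary sends the file unmodified as the POST body; no file name reaches Solr.
curl "$STREAM_URL" --data-binary "@$SAMPLE_HTML" -H 'Content-type:text/html' || echo "request failed: is Solr running and is the file path valid?" >&2
```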

Sending Documents to Solr with Solr Cell and SolrJ

SolrJ is a Java client that you can use to add documents to the index, update the index, or query the index. You'll find more information on SolrJ in Client APIs.

Here's an example of using Solr Cell and SolrJ to add documents to a Solr index.

First, let's use SolrJ to create a new SolrClient, then we'll construct a request containing a ContentStream (essentially a wrapper around a file) and send it to Solr:

This operation streams the file my-file.pdf into the Solr index for my_collection.

The sample code above calls the extract command, but you can easily substitute other commands that are supported by Solr Cell. The key class to use is the ContentStreamUpdateRequest, which makes sure the ContentStreams are set properly. SolrJ takes care of the rest.

Note that the ContentStreamUpdateRequest is not specific to Solr Cell. You can send CSV to the CSV Update handler and to any other Request Handler that works with Content Streams for updates.

11 Comments

  1. Propose that this chapter changes name to "Uploading rich documents with the ExtractingRequestHandler". The fact that "Solr Cell" was a temporary code name under development should be a footnote comment, not the main title.

  2. ignoreTikaException parameter is not mentioned. It is analogous to onError="skip" of TikaEntityProcessor.

  3. Can extracted text using Apache Tika be joined with structured data?  For a project I'm currently working on, the data extracted using Apache Tika needs to be combined with data from the database and indexed in Solr together.

    1. Did you get an answer for that? I have the same problem. I have found some ideas using SolrJ, but is there another way to do it, by combining two types of document into a merged one?

  4. I did not.  The plan is to use Apache PDFBox to extract text from PDF files and Apache POI to extract text from MS Office files.  The metadata coming from database tables will need to be combined with the extracted text terms prior to indexing in SOLR.  It is my understanding you can't use Apache Tika if you want to join on structured data.  But I am not an expert.

  5. One approach I've seen is to use partial indexing:
    http://stackoverflow.com/questions/13051658/solr-how-to-add-meta-data-to-indexed-binary-files-that-were-indexed-through-so

    We are just at the stage of evaluating SOLR, and this is one of our questions.

    So we don't have any practical experience with this but would love to hear about any issues or even better, successes.

  6. Success, mostly.

    Used CURL to upload a file:

    curl "http://elasticpoc01:8983/solr/gettingstarted/update/extract?literal.id=doc_id_1&commit=true" -F "myFile=@Apptest.docx"

    Note that the id was assigned directly via literal.id=doc_id_1.

    Created an XML file to contain the metadata:

    <add overwrite="false">
      <doc>
        <field name="author">John Halton</field>
        <field name="publisher">Freedom Press</field>
        <field name="id">doc_id_1</field>
      </doc>
    </add>

    Used CURL to submit the update:

    curl http://elasticpoc01:8983/solr/gettingstarted/update -H "Content-Type: text/xml" --data-binary @Apptest.xml

    Without the overwrite="false", the entire record for the file was replaced (overwrite="true" by default).

    With the overwrite="false", the tags were updated and the extracted data from the file was kept.

    Weird behaviour:

    With a straight file upload and no secondary tag update, the search record had a publisher:MyCo that could be found using a query publisher:MyCo.

    After the update where we adjusted the publisher tag to "Freedom Press":

    1) The displayed fields via the SOLR browse showed only the original publisher field derived from the file itself, i.e. publisher:MyCo

     2) A search for publisher:MyCo found the document.

    3) A search for publisher:"Freedom Press" found the document.

    4) A search for publisher:MyCo AND publisher:"Freedom Press" found nothing. 


    Edit:

    Setting the attribute "update" on the field tag solved the find issue:

    <field name="publisher" update="add">Freedom Press</field>

     

  7. Hello, I tried to add a tika-config to my solr-config.xml. I get no errors, but Solr doesn't execute the tika-config code. Is there any other modification I need to make in other files? I am pretty sure my tika-config.xml file is correct.

     

  8. I tried to configure tika.config just like this config in the data import request handler, and added the path of the Tika config.xml to the update extract handler as mentioned in this topic, but it's not working. What am I missing here?

  9. I tried to build the SolrJ example.  It doesn't work.

    ContentStreamUpdateRequest.addFile needs a second parameter, mime-type. And isn't this the task of Tika?

    I also didn't find ExtractingParams.

    I use Solr 6.3.