Uploading Data with Index Handlers

Index Handlers are Request Handlers designed to add, delete, and update documents in the index. In addition to having plugins for importing rich documents using Tika or from structured data sources using the Data Import Handler, Solr natively supports indexing structured documents in XML, CSV, and JSON.

The recommended way to configure and use request handlers is with path-based names that map to paths in the request URL. However, request handlers can also be specified with the qt (query type) parameter if the requestDispatcher is appropriately configured. It is possible to access the same handler using more than one name, which can be useful if you wish to specify different sets of default options.

A single unified update request handler supports XML, CSV, JSON, and javabin update requests, delegating to the appropriate ContentStreamLoader based on the Content-Type of the ContentStream.

Topics covered in this section:

  • UpdateRequestHandler Configuration
  • XML Formatted Index Updates
  • JSON Formatted Index Updates
  • CSV Formatted Index Updates
  • Nested Child Documents

UpdateRequestHandler Configuration

The update request handler is already configured in the default configuration file, so no extra setup is needed to use it.
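A typical declaration in solrconfig.xml looks like this (a minimal sketch; the handler is registered under the conventional /update path):

```xml
<requestHandler name="/update" class="solr.UpdateRequestHandler" />
```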

XML Formatted Index Updates

Index update commands can be sent as XML messages to the update handler using Content-Type: application/xml or Content-Type: text/xml.

Adding Documents

The XML schema recognized by the update handler for adding documents is very straightforward:

  • The <add> element introduces one or more documents to be added.
  • The <doc> element introduces the fields making up a document.
  • The <field> element presents the content for a specific field.

For example, an add command might look like the following sketch (the field names are illustrative and assume a matching schema):
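```xml
<add>
  <doc>
    <field name="authors">Patrick Eagar</field>
    <field name="subject">Sports</field>
    <field name="isbn">0002166313</field>
    <field name="yearpub">1982</field>
    <field name="publisher">Collins</field>
  </doc>
</add>
```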

Each element accepts certain optional attributes, described in the table below.

| Command | Optional Parameter | Parameter Description |
| --- | --- | --- |
| <add> | commitWithin=number | Add the document within the specified number of milliseconds. |
| <add> | overwrite=boolean | Default is true. Indicates if the unique key constraints should be checked to overwrite previous versions of the same document (see below). |
| <doc> | boost=float | Default is 1.0. Sets a boost value for the document. To learn more about boosting, see Searching. |
| <field> | boost=float | Default is 1.0. Sets a boost value for the field. |

If the document schema defines a unique key, then by default an /update operation to add a document will overwrite (i.e., replace) any document in the index with the same unique key. If no unique key has been defined, indexing performance is somewhat faster, as no check needs to be made for existing documents to replace.

If you have a unique key field, but you feel confident that you can safely bypass the uniqueness check (e.g., you build your indexes in batch, and your indexing code guarantees it never adds the same document more than once), you can specify the overwrite="false" option when adding your documents.

XML Update Commands

Commit and Optimize Operations

The <commit> operation writes all documents loaded since the last commit to one or more segment files on the disk. Before a commit has been issued, newly indexed content is not visible to searches. The commit operation opens a new searcher, and triggers any event listeners that have been configured.

Commits may be issued explicitly with a <commit/> message, and can also be triggered from <autocommit> parameters in solrconfig.xml.

The <optimize> operation requests Solr to merge internal data structures in order to improve search performance. For a large index, optimization will take some time to complete, but by merging many small segment files into a larger one, search performance will improve. If you are using Solr's replication mechanism to distribute searches across many systems, be aware that after an optimize, a complete index will need to be transferred. In contrast, post-commit transfers are usually much smaller.

The <commit> and <optimize> elements accept these optional attributes:

| Optional Attribute | Description |
| --- | --- |
| waitSearcher | Default is true. Blocks until a new searcher is opened and registered as the main query searcher, making the changes visible. |
| expungeDeletes | (commit only) Default is false. Merges segments that have more than 10% deleted docs, expunging them in the process. |
| maxSegments | (optimize only) Default is 1. Merges the segments down to no more than this number of segments. |

Here are examples of <commit> and <optimize> using optional attributes:
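```xml
<commit waitSearcher="false"/>
<commit waitSearcher="false" expungeDeletes="true"/>
<optimize waitSearcher="false"/>
```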

Delete Operations

Documents can be deleted from the index in two ways. "Delete by ID" deletes the document with the specified ID, and can be used only if a unique key field has been defined in the schema. "Delete by Query" deletes all documents matching a specified query, although commitWithin is ignored for a Delete by Query. A single delete message can contain multiple delete operations.
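For example, a single message combining both forms might look like this sketch (the ID values and queries are illustrative):

```xml
<delete>
  <id>0002166313</id>
  <id>0031745983</id>
  <query>subject:sport</query>
  <query>publisher:penguin</query>
</delete>
```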

Note: When using the Join query parser in a Delete By Query, you should use the score parameter with a value of "none" to avoid a ClassCastException. See the section on the Join Query Parser for more details on the score parameter.

Rollback Operations

The rollback command rolls back all adds and deletes made to the index since the last commit. It neither calls any event listeners nor creates a new searcher. Its syntax is simple: <rollback/>.

Using curl to Perform Updates

You can use the curl utility to perform any of the above commands, using its --data-binary option to append the XML message to the curl command, generating an HTTP POST request. For example (my_collection below is a placeholder for your collection name):
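```bash
curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" --data-binary '<commit waitSearcher="false"/>'
```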

For posting XML messages contained in a file, you can use the alternative form:
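```bash
# myfile.xml is a placeholder for a file containing your update message
curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" --data-binary @myfile.xml
```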

Short requests can also be sent using an HTTP GET command, URL-encoding the request, as in the following. Note the escaping of "<" and ">":
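```bash
curl "http://localhost:8983/solr/my_collection/update?stream.body=%3Ccommit/%3E&wt=xml"
```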

Responses from Solr take the form shown here:
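For example (the QTime value will vary):

```xml
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">127</int>
  </lst>
</response>
```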

The status field will be non-zero in case of failure.

Using XSLT to Transform XML Index Updates

The UpdateRequestHandler allows you to index any arbitrary XML by using the tr parameter to apply an XSL transformation. You must have an XSLT stylesheet in the conf/xslt directory of your config set that can transform the incoming data to the expected <add><doc/></add> format, and use the tr parameter to specify the name of that stylesheet.

Here is an example XSLT stylesheet (a simplified sketch that handles single-valued fields):
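```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output media-type="text/xml" method="xml" indent="yes"/>

  <!-- Wrap all result documents in a single <add> element -->
  <xsl:template match="/">
    <add>
      <xsl:apply-templates select="response/result/doc"/>
    </add>
  </xsl:template>

  <!-- Turn each result <doc> into an update <doc>, skipping the score -->
  <xsl:template match="doc">
    <doc>
      <xsl:apply-templates select="*[@name!='score']"/>
    </doc>
  </xsl:template>

  <!-- Turn each named value into a <field> -->
  <xsl:template match="doc/*">
    <field name="{@name}">
      <xsl:value-of select="."/>
    </field>
  </xsl:template>
</xsl:stylesheet>
```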

This stylesheet transforms Solr's XML search result format into Solr's Update XML syntax. One example usage would be to copy a Solr 1.3 index (which does not have the CSV response writer) into a format which can be indexed into another Solr instance (provided that all fields are stored):
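For example, assuming the stylesheet above is saved as conf/xslt/updateXml.xsl:

```
http://localhost:8983/solr/my_collection/select?q=*:*&wt=xslt&tr=updateXml.xsl&rows=1000
```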

You can also use the stylesheet in XsltUpdateRequestHandler to transform an index when updating:
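```bash
# myexporteddata.xml is a placeholder for a file of arbitrary XML matching the stylesheet
curl "http://localhost:8983/solr/my_collection/update?commit=true&tr=updateXml.xsl" -H "Content-Type: text/xml" --data-binary @myexporteddata.xml
```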

JSON Formatted Index Updates

Solr can accept JSON that conforms to a defined structure, or can accept arbitrary JSON-formatted documents. If sending arbitrarily formatted JSON, there are some additional parameters that need to be sent with the update request, described below in the section Transforming and Indexing Custom JSON.

Solr-Style JSON 

JSON formatted update requests may be sent to Solr's /update handler using Content-Type: application/json or Content-Type: text/json.

JSON formatted updates can take three basic forms, described in depth below:

  • adding a single JSON document
  • adding multiple JSON documents
  • sending JSON update commands

Adding a Single JSON Document

The simplest way to add Documents via JSON is to send each document individually as a JSON Object, using the /update/json/docs path:
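```bash
# The document shown is illustrative; my_collection is a placeholder
curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/my_collection/update/json/docs' --data-binary '
{
  "id": "1",
  "title": "Doc 1"
}'
```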

Adding Multiple JSON Documents

Adding multiple documents at one time via JSON can be done via a JSON Array of JSON Objects, where each object represents a document:
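```bash
curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/my_collection/update' --data-binary '
[
  {
    "id": "1",
    "title": "Doc 1"
  },
  {
    "id": "2",
    "title": "Doc 2"
  }
]'
```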

A sample JSON file is provided at example/exampledocs/books.json and contains an array of objects that you can add to the Solr techproducts example:
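```bash
curl 'http://localhost:8983/solr/techproducts/update?commit=true' --data-binary @example/exampledocs/books.json -H 'Content-type:application/json'
```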

Sending JSON Update Commands

In general, the JSON update syntax supports all of the update commands that the XML update handler supports, through a straightforward mapping. Multiple commands, adding and deleting documents, may be contained in one message:
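```json
{
  "add": {
    "doc": {
      "id": "DOC1",
      "my_field": 2.3,
      "my_multivalued_field": [ "aaa", "bbb" ]   /* use an array for a multi-valued field */
    }
  },
  "add": {
    "commitWithin": 5000,          /* commit this document within 5 seconds */
    "overwrite": false,            /* don't check for existing documents with the same uniqueKey */
    "doc": {
      "f1": "v1",                  /* repeated keys are another way to specify a multi-valued field */
      "f1": "v2"
    }
  },

  "commit": {},
  "optimize": { "waitSearcher": false },

  "delete": { "id": "ID" },        /* delete by ID */
  "delete": { "query": "QUERY" }   /* delete by query */
}
```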

Comments are not allowed in JSON, but duplicate names are.

The comments in the above example are for illustrative purposes only, and cannot be included in actual commands sent to Solr.

As with other update handlers, parameters such as commit, commitWithin, optimize, and overwrite may be specified in the URL instead of in the body of the message.

The JSON update format allows for a simple delete-by-id. The value of a delete can be an array which contains a list of zero or more specific document id's (not a range) to be deleted. For example, a single document:
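```json
{ "delete": "myid" }
```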

Or a list of document IDs:
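```json
{ "delete": ["id1", "id2"] }
```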

The value of a "delete" can be an array which contains a list of zero or more id's to be deleted. It is not a range (start and end).

You can also specify _version_ with each "delete":
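For example (the version value shown is illustrative):

```json
{ "delete": { "id": "ID", "_version_": 123456789 } }
```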

You can specify the version of deletes in the body of the update request as well.

JSON Update Convenience Paths

In addition to the /update handler, there are a few additional JSON-specific request handler paths available by default in Solr that implicitly override the behavior of some request parameters:

| Path | Default Parameters |
| --- | --- |
| /update/json | stream.contentType=application/json |
| /update/json/docs | stream.contentType=application/json, json.command=false |

The /update/json path may be useful for clients sending in JSON formatted update commands from applications where setting the Content-Type proves difficult, while the /update/json/docs path can be particularly convenient for clients that always want to send in documents – either individually or as a list – without needing to worry about the full JSON command syntax.

Custom JSON Documents

Solr can support custom JSON. This is covered in the section Transforming and Indexing Custom JSON.

 

CSV Formatted Index Updates

CSV formatted update requests may be sent to Solr's /update handler using Content-Type: application/csv or Content-Type: text/csv.

A sample CSV file is provided at example/exampledocs/books.csv that you can use to add some documents to the Solr techproducts example:
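```bash
curl 'http://localhost:8983/solr/techproducts/update?commit=true' --data-binary @example/exampledocs/books.csv -H 'Content-type:application/csv'
```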

CSV Update Parameters

The CSV handler allows the specification of many parameters in the URL. Global parameters take the form parameter=value; most can also be set per field in the form f.<fieldname>.<parameter>=value (for example, f.isbn.trim=true).

The table below describes the parameters for the update handler.

| Parameter | Usage | Global (g) or Per Field (f) | Example |
| --- | --- | --- | --- |
| separator | Character used as field separator; default is "," | g, (f: see split) | separator=%09 |
| trim | If true, remove leading and trailing whitespace from values. Default is false. | g, f | f.isbn.trim=true, trim=false |
| header | Set to true if first line of input contains field names. These will be used if the fieldnames parameter is absent. Default is true. | g | |
| fieldnames | Comma-separated list of field names to use when adding documents. | g | fieldnames=isbn,price,title |
| literal.<field_name> | A literal value for a specified field name. | g | literal.color=red |
| skip | Comma-separated list of field names to skip. | g | skip=uninteresting,shoesize |
| skipLines | Number of lines to discard in the input stream before the CSV data starts, including the header, if present. Default is 0. | g | skipLines=5 |
| encapsulator | The character optionally used to surround values to preserve characters such as the CSV separator or whitespace. This standard CSV format handles the encapsulator itself appearing in an encapsulated value by doubling the encapsulator. | g, (f: see split) | encapsulator=" |
| escape | The character used for escaping CSV separators or other reserved characters. If an escape is specified, the encapsulator is not used unless also explicitly specified, since most formats use either encapsulation or escaping, not both. | g | escape=\ |
| keepEmpty | Keep and index zero-length (empty) fields. Default is false. | g, f | f.price.keepEmpty=true |
| map | Map one value to another. Format is value:replacement (which can be empty). | g, f | map=left:right, f.subject.map=history:bunk |
| split | If true, split a field into multiple values by a separate parser. Default is false. | f | |
| overwrite | If true (the default), check for and overwrite duplicate documents, based on the uniqueKey field declared in the Solr schema. If you know the documents you are indexing do not contain any duplicates, you may see a considerable speed-up by setting this to false. | g | |
| commit | Issues a commit after the data has been ingested. | g | |
| commitWithin | Add the document within the specified number of milliseconds. | g | commitWithin=10000 |
| rowid | Map the rowid (line number) to a field specified by the value of the parameter, for instance if your CSV doesn't have a unique key and you want to use the row id as such. | g | rowid=id |
| rowidOffset | Add the given offset (as an integer) to the rowid before adding it to the document. Default is 0. | g | rowidOffset=10 |

Indexing Tab-Delimited Files

The same feature used to index CSV documents can also be easily used to index tab-delimited files (TSV files) and even handle backslash escaping rather than CSV encapsulation.

For example, one can dump a MySQL table to a tab delimited file with:
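```sql
-- mytable is a placeholder; MySQL's OUTFILE format defaults to
-- tab-delimited fields with backslash escaping
SELECT * INTO OUTFILE '/tmp/result.txt' FROM mytable;
```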

This file could then be imported into Solr by setting the separator to tab (%09) and the escape to backslash (%5c).
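```bash
curl 'http://localhost:8983/solr/my_collection/update/csv?commit=true&separator=%09&escape=%5c' --data-binary @/tmp/result.txt
```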

CSV Update Convenience Paths

In addition to the /update handler, there is an additional CSV-specific request handler path available by default in Solr that implicitly overrides the behavior of some request parameters:

| Path | Default Parameters |
| --- | --- |
| /update/csv | stream.contentType=application/csv |

The /update/csv path may be useful for clients sending in CSV formatted update commands from applications where setting the Content-Type proves difficult.

Nested Child Documents

Solr indexes nested documents in blocks as a way to model documents containing other documents, such as a blog post parent document with comments as child documents, or products as parent documents with sizes, colors, or other variations as child documents. At query time, the Block Join Query Parsers can search these relationships. In terms of performance, indexing the relationships between documents may be more efficient than attempting to do joins only at query time, since the relationships are already stored in the index and do not need to be computed.

Nested documents may be indexed via either the XML or JSON data syntax (or using SolrJ), but regardless of syntax, you must include a field that identifies the parent document as a parent; it can be any field that suits this purpose, and it will be used as input for the block join query parsers.

To support nested documents, the schema must include an indexed/non-stored field _root_. The value of that field is populated automatically and is the same for all documents in the block, regardless of the inheritance depth.

XML Examples

For example, here are two documents and their child documents:
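```xml
<!-- The id, title, content_type, and comments fields are illustrative -->
<add>
  <doc>
    <field name="id">1</field>
    <field name="title">Solr adds block join support</field>
    <field name="content_type">parentDocument</field>
    <doc>
      <field name="id">2</field>
      <field name="comments">SolrCloud supports it too!</field>
    </doc>
  </doc>
  <doc>
    <field name="id">3</field>
    <field name="title">New Lucene and Solr release is out</field>
    <field name="content_type">parentDocument</field>
    <doc>
      <field name="id">4</field>
      <field name="comments">Lots of new features</field>
    </doc>
  </doc>
</add>
```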

In this example, we have indexed the parent documents with the field content_type, which has the value "parentDocument". We could have also used a boolean field, such as isParent, with a value of "true", or any other similar approach.

JSON Examples

This example is equivalent to the XML example above; note the special _childDocuments_ key that is needed to indicate the nested documents in JSON:
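```json
[
  {
    "id": "1",
    "title": "Solr adds block join support",
    "content_type": "parentDocument",
    "_childDocuments_": [
      {
        "id": "2",
        "comments": "SolrCloud supports it too!"
      }
    ]
  },
  {
    "id": "3",
    "title": "New Lucene and Solr release is out",
    "content_type": "parentDocument",
    "_childDocuments_": [
      {
        "id": "4",
        "comments": "Lots of new features"
      }
    ]
  }
]
```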

Note: One limitation of indexing nested documents is that the whole block of parent-children documents must be updated together whenever any changes are required. In other words, even if a single child document or the parent document is changed, the whole block of parent-child documents must be indexed together.

32 Comments

  1. Examples for adding JSON formatted documents - https://gist.github.com/vthacker/420b93d0fc9b8a30c2e9

    This should go under "JSON Formated Index Updates", along with the fix for the typo (smile)

    I feel we should not document the 3rd example anymore because it causes a lot of confusion. Although multiple keys with the same name are allowed according to the spec, many/most impls don't play nice with it

    1. what typo do you see?

      I updated the JSON section to start with the "easy" and hopefully most common example of sending multiple documents as an array, and then tweaked the introductory wording for the more involved "one to one mapping with xml" example.

      I agree that we should be promoting the simplified view, but I don't think we should un-document the more complex examples – if people see complex JSON syntax like these in Solr clients in the wild, they should be able to come here and understand what they mean.

       

       

        1. I fixed those - there were 2 of them.

  2. the CSV param table needs to be updated to clarify what the default values are for any param that has a default

  3. Some suggested changes to the "Nested Documents" section above:

    • Replace the first sentence with the following: Solr supports the indexing of "nested documents" by using a "Block Join" to model documents containing other documents. For example, a blog post would be treated as a "parent document" and comments on that blog post could then be treated as "child documents". Another example would be defining products as parent documents and the associated product details (sizes, colors, or other variations) as child documents.

    • Fix typo in this sentence: "At query time, the Block Join Query Parsers can be used search against these relationships." i.e. "used search" -> "used to search".

    • Fix typo in this sentence: "This example is equivalent to the XML example above, note the special _childDocuments_ key need to indicate the nested documents in JSON." i.e. "need to" -> "that is needed to"
  4. If we want to fully migrate away from the old wiki page at https://wiki.apache.org/solr/UpdateCSV , it feels like we should migrate the examples?  More complex params (like split) definitely need those examples.

  5. One word should be removed:

    In general, the JSON update syntax supports accepts all of 

    Sending JSON Update Commands

    In general, the JSON update syntax supports accepts all of the update commands that the XML update handler supports, through a straightforward mapping. Multiple commands, adding and deleting documents, may be contained in one message:

  6. It looks like the leading If is extraneous:

    Setting JSON Defaults

    If It is possible to send any json to the /update/json/docs endpoint and the default configuration of the component is as follows:

  7. Noble Paul, it seems like SOLR-6633 and `<str name="mapUniqueKeyOnly">true</str>` made the JSON examples, as well as http://lucidworks.com/blog/2014/08/12/indexing-custom-json-data/, obsolete.

    The first example under Transforming and Indexing Custom JSON fails with

     "msg":"Raw data can be stored only if split=/","code":400,
     and it's ok since it's explicit error.

    If I remove `split=/exams` or set it to `split=/`, it completely ignores explicitly requested mappings (`f=..&`) and only stores raw content in `_src_` and `text`. To enable `f=..` mappings I need to pass mapUniqueKeyOnly=false

    Until srcField is removed from solrconfig.xml, passing something besides split=/ is not possible.

    1. This is a problem in the sample_tech_products configset. If you wish to map fields, then the user should comment out those two attributes. Any alternate suggestions?

      1. Perhaps it's worth breaking it explicitly, i.e., if we specify any f=... and mapUniqueKeyOnly=true, it should return:

        ,"error":{"msg":"Fields can be mapped f=... only if mapUniqueKeyOnly=true","code":400}

        WDYT?

  8. At the end of the 'delete' section under XML, there is text starting from "When using the Join query parser". It does not seem to make any sense in that context. Incorrect copy/paste?

    1. No, it's a specific comment about using a join query parser when doing a delete by query - I converted it to a NOTE box and tried to clarify the wording.

  9. Cassandra Targett Can we split this into multiple pages (one per type)? This has become unmanageable

  10. There is an example of Nested Docs with JSON (name - Joe Smith, phone number, and having 2 orgs: one Microsoft in Seattle and the other Apple in Cupertino).

    I have 2 queries on it. a) Can the same thing be done in XML rather than in JSON? (The XML example with "comments" is different because it does not repeat the different child nested field names.)

    b) Another JSON example with subject and marks is shown, where it is split on / and hence 2 docs are created from the same doc. From an end-usage perspective, when does it make sense to split an incoming doc (where child nodes are repeated in multiple documents) versus storing it as nested child docs? What are the pros and cons of each?

    1. a) This is a JSON example where everyone expects JSON. Why use XML?
      b) These are different use cases. If you need a flat structure, go ahead without nesting.

  11. Thanks Noble for the prompt response. Requesting further guidance:

    a) I ask about XML in advance because I am dealing with a PROD business case where I have XML data with nested structures, and I don't know XSLT/XQuery to transform data, so I would have to learn it. That's why, before I make the effort, I wanted to check whether it is doable in XML as it is reported to be in JSON; if it is not doable in XML, then I would think of making the effort in something else.

    b) As you mention they are different use cases, I request you to elaborate more. In my case I have nested data in XML, and technically it can be stored flat as well as with nesting. So what is the difference between them from a functional point of view? When does it make sense to store it flat vs. nested?

    1. a) XML does not have nested doc support

      Here is a small page on nested objects http://yonik.com/solr-nested-objects/

      search for "solr block join " 

      I'm sure you will get more answers if the question is posted to the mailing list. The purpose of the comments section is to enhance/correct the documentation itself

       

      1. to be crystal clear: uploading data to solr, in the xml format, absolutely supports nested documents.

        On this page, in the "Nested Child Documents" section, the very first example is in fact in solr's xml format.

        The "joe smith" example that Aniruddh Sharma asked about is specifically in a section named "Transforming and Indexing Custom JSON" and as noble mentioned in his first reply, it would not make sense to include an XML example of nested documents in that section - that section is specifically about the transformation feature which is specific to JSON.

        1. Sorry, I assumed the question was about the free form XML support like what we introduced in JSON.

          Actually, even JSON had nested doc support before this change. But the user had to send the JSON in a format that Solr knows.

          In fact, all formats (XML, JSON, and binary) had nested doc support from the beginning.

  12. It would be helpful if the Indexing section contained a brief overview of the UpdateHandler workflow which included a link to the UpdateHandlers and UpdateRequestProcessor documentation in 'Configuring solrconfig.xml'. I was trying to locate some URP documentation and started (logically, I think) in this section but was not directed to it.

  13. I'm trying to index a tab-delimited file via the Nifi's PutSolrContentStream processor. I added the separator value and set it to %09, and the escape property and set it to %5c as suggested above, but I get an error saying that a separator value of %09 is not valid. What am I missing?

    1. I had never heard of NiFi until I saw this comment. You might need to enlist their help.

      NiFi is probably using SolrJ, the Java Solr client.  SolrJ handles the URL escaping of parameters, which means that you would not want to use values like %09 – the URL escaping would turn that into %2509, so the value received by Solr would be the string %09 instead of a tab character.  You would want to use an actual tab character in the data you give to Nifi, and let it URL escape that for you.  In a Java literal string, I think you can use "\t" to represent a tab.

      1. Thanks, Shawn. I actually already tried "\t". It didn't like that either.

  14. Noble Paul: In the Transforming custom JSON section, it says "One or more valid JSON documents can be sent to the /update/json/docs path with the configuration params"

    Does this mean we can send a JSON array with several documents to the handler? Or does it mean one document that can be split into several by our transformation process? If the first, could we do even a tiny example showing the syntax/possibility.

    1. It actually supports both. You can send an array of docs as follows:

      [
        {"id": 1},
        {"id": 2}
      ]

      and the following as well (one doc after another)

      {"id",1},
      {"id",2},

       

      I shall add an example

      1. It also supports the .jsonl format which is basically a JSON document per line. It'd be nice to have an example for that too.