Apache Solr Documentation

This Unreleased Guide Will Cover Apache Solr 5.4

Once you have indexed the content you need in your Solr index, you will want to start thinking about your strategy for dealing with changes to those documents. Solr supports two approaches to updating documents that have only partially changed.

The first is atomic updates. This approach allows changing one or more fields of a document without having to re-index the entire document.

The second approach is known as optimistic concurrency or optimistic locking. It is a feature of many NoSQL databases, and allows conditionally updating a document based on its version. This approach includes semantics and rules for how to deal with version matches or mismatches.

Atomic Updates and Optimistic Concurrency may be used as independent strategies for managing changes to documents, or they may be combined: you can use optimistic concurrency to conditionally apply an atomic update.

Atomic Updates

Solr supports several modifiers that atomically update values of a document. This allows updating only specific fields, which can help speed indexing processes in an environment where speed of index additions is critical to the application.

To use atomic updates, add a modifier to the field that needs to be updated. The content can be replaced, added to, or, if the field is numeric, incremented.

Modifier and usage:

  • set: Set or replace the field value(s) with the specified value(s), or remove the values if 'null' or empty list is specified as the new value. May be specified as a single value, or as a list for multivalued fields.
  • add: Adds the specified values to a multivalued field. May be specified as a single value, or as a list.
  • remove: Removes (all occurrences of) the specified values from a multivalued field. May be specified as a single value, or as a list.
  • removeregex: Removes all occurrences of the specified regex from a multiValued field. May be specified as a single value, or as a list.
  • inc: Increments a numeric value by a specific amount. Must be specified as a single numeric value.

The core functionality of atomically updating a document requires that all fields in your schema be configured as stored="true", except for fields which are <copyField/> destinations, which must be configured as stored="false". Atomic updates are applied to the document represented by the existing stored field values. If <copyField/> destinations are configured as stored, then Solr will attempt to index both the current value of the field as well as an additional copy from any source fields.

For example, if the following document exists in our collection:

And we apply the following update command:

The resulting document in our collection will be:
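The modifier semantics described above can be modeled locally. The following is a minimal Python sketch, not Solr's implementation: the function name apply_atomic_update is hypothetical, the sample document and its starting values are assumptions chosen to be consistent with the result quoted in the comments below (with price 99), and treating removeregex as a whole-value match is also an assumption.

```python
import re


def _as_list(value):
    """Normalize None, a single value, or a list to a list."""
    if value is None:
        return []
    return list(value) if isinstance(value, list) else [value]


def apply_atomic_update(doc, update):
    """Return a copy of `doc` with the modifiers in `update` applied.

    Illustrative model of the modifier descriptions only; not Solr code.
    """
    result = {k: (list(v) if isinstance(v, list) else v) for k, v in doc.items()}
    for field, spec in update.items():
        if not isinstance(spec, dict):
            result[field] = spec  # plain value, e.g. the unique key
            continue
        for modifier, value in spec.items():
            existing = _as_list(result.get(field))
            if modifier == "set":
                if value is None or value == []:
                    result.pop(field, None)  # null or empty list removes the values
                else:
                    result[field] = value
            elif modifier == "add":
                result[field] = existing + _as_list(value)
            elif modifier == "remove":
                # Follows the modifier descriptions: drop every occurrence.
                unwanted = _as_list(value)
                result[field] = [v for v in existing if v not in unwanted]
            elif modifier == "removeregex":
                # Assumption: a value is removed when a pattern matches all of it.
                patterns = [re.compile(p) for p in _as_list(value)]
                result[field] = [
                    v for v in existing
                    if not any(p.fullmatch(str(v)) for p in patterns)
                ]
            elif modifier == "inc":
                result[field] = result.get(field, 0) + value
            else:
                raise ValueError(f"unknown modifier: {modifier}")
    return result


# Assumed starting document (hypothetical values).
existing_doc = {
    "id": "mydoc",
    "price": 10,
    "popularity": 42,
    "categories": ["kids"],
    "tags": ["free_to_try", "buy_now", "clearance", "on_sale"],
}

# Update command in the JSON shape used for atomic updates.
update_command = {
    "id": "mydoc",
    "price": {"set": 99},
    "popularity": {"inc": 20},
    "categories": {"add": ["toys", "games"]},
    "tags": {"remove": ["free_to_try", "on_sale"]},
}

updated_doc = apply_atomic_update(existing_doc, update_command)
print(updated_doc)
```

The sketch returns a new dictionary and leaves the input untouched, which makes it easy to compare the document before and after the update.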

Optimistic Concurrency

Optimistic Concurrency is a feature of Solr that can be used by client applications which update/replace documents to ensure that the document they are replacing/updating has not been concurrently modified by another client application. This feature works by requiring a _version_ field on all documents in the index, and comparing that to a _version_ specified as part of the update command. By default, Solr's schema.xml includes a _version_ field, and this field is automatically added to each new document.

In general, using optimistic concurrency involves the following workflow:

  1. A client reads a document. In Solr, one might retrieve the document with the /get handler to be sure to have the latest version.
  2. A client changes the document locally.
  3. The client resubmits the changed document to Solr, for example, with the /update handler.
  4. If there is a version conflict (HTTP error code 409), the client starts the process over.
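The four steps above amount to a read-modify-write loop that retries on conflict. The following is a minimal Python sketch under stated assumptions: update_with_retry, VersionConflict, and InMemoryStore are hypothetical names, and the in-memory store stands in for Solr so the loop can run without a server. In a real client, fetch would call the /get handler and submit would call the /update handler, raising on an HTTP 409 response.

```python
class VersionConflict(Exception):
    """Raised by `submit` when the store reports a version conflict (HTTP 409)."""


def update_with_retry(fetch, modify, submit, max_attempts=5):
    """Read, modify, and resubmit a document, starting over on a conflict."""
    for attempt in range(1, max_attempts + 1):
        doc = fetch()                # step 1: read the latest document
        changed = modify(dict(doc))  # step 2: change it locally, keeping _version_
        try:
            return submit(changed)   # step 3: resubmit with the version that was read
        except VersionConflict:
            if attempt == max_attempts:
                raise                # give up after the last attempt
            # step 4: otherwise start the process over


class InMemoryStore:
    """Hypothetical stand-in for a Solr collection holding one document."""

    def __init__(self, doc):
        self.doc = dict(doc, _version_=100)

    def get(self):
        return dict(self.doc)

    def put(self, doc):
        if doc.get("_version_") != self.doc["_version_"]:
            raise VersionConflict("version mismatch")
        self.doc = dict(doc, _version_=self.doc["_version_"] + 1)
        return self.doc["_version_"]


store = InMemoryStore({"id": "mydoc", "popularity": 42})
reads = {"count": 0}


def fetch():
    doc = store.get()
    reads["count"] += 1
    if reads["count"] == 1:
        # Simulate another client updating the document right after our first read.
        other = store.get()
        other["popularity"] = 50
        store.put(other)
    return doc


def add_ten(doc):
    doc["popularity"] += 10
    return doc


new_version = update_with_retry(fetch, add_ten, store.put)
print(new_version, store.doc["popularity"], reads["count"])
```

Passing fetch, modify, and submit as callables keeps the retry logic independent of how the documents are transported.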

When the client resubmits a changed document to Solr, the _version_ can be included with the update to invoke optimistic concurrency control. Specific semantics are used to define when the document should be updated or when to report a conflict.

  • If the content in the _version_ field is greater than '1' (e.g., '12345'), then the _version_ in the document must match the _version_ in the index.
  • If the content in the _version_ field is equal to '1', then the document must simply exist. In this case, no version matching occurs, but if the document does not exist, the updates will be rejected.
  • If the content in the _version_ field is less than '0' (e.g., '-1'), then the document must not exist. In this case, no version matching occurs, but if the document exists, the updates will be rejected.
  • If the content in the _version_ field is equal to '0', then it doesn't matter if the versions match or if the document exists or not. If it exists, it will be overwritten; if it does not exist, it will be added.

If the document being updated does not include the _version_ field, and atomic updates are not being used, the document will be treated by normal Solr rules, which is usually to discard the previous version.
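The four version rules can be restated as a small decision function. This is an illustrative Python sketch of the rules as written above, not Solr code; the function name version_check_passes is hypothetical, and representing a missing document by an existing version of None is an assumption of the sketch.

```python
def version_check_passes(requested_version, existing_version):
    """Decide whether an update carrying `requested_version` would be accepted.

    `existing_version` is the indexed document's _version_, or None if the
    document does not exist (an assumption of this sketch).
    """
    exists = existing_version is not None
    if requested_version > 1:
        # Versions must match exactly.
        return exists and requested_version == existing_version
    if requested_version == 1:
        # The document must simply exist; no version matching.
        return exists
    if requested_version < 0:
        # The document must not exist.
        return not exists
    # requested_version == 0: no constraint, overwrite or add.
    return True


print(version_check_passes(12345, 12345))  # matching versions
print(version_check_passes(12345, 67890))  # mismatch, would be a conflict
print(version_check_passes(-1, None))      # must-not-exist on a new document
```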

When using Optimistic Concurrency, clients can include an optional versions=true request parameter to indicate that the new versions of the documents being added should be included in the response. This allows clients to immediately know the _version_ of every document added without needing to make a redundant /get request.

For example...
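A request of this kind could be assembled as in the following minimal Python sketch. The host, port, and collection name (mycollection) are hypothetical, build_update_request is an illustrative helper rather than part of any Solr client library, and the request is only constructed here, not sent, so no response format is shown.

```python
import json
from urllib.parse import urlencode


def build_update_request(base_url, docs, versions=True):
    """Build the URL, body, and headers for a JSON update request.

    When `versions` is true, the versions=true parameter is added so the
    response would include the new _version_ of each added document.
    """
    params = {"versions": "true"} if versions else {}
    url = base_url + "/update"
    if params:
        url += "?" + urlencode(params)
    body = json.dumps(docs).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return url, body, headers


# Hypothetical collection URL; the _version_ value asks for an exact match.
url, body, headers = build_update_request(
    "http://localhost:8983/solr/mycollection",
    [{"id": "mydoc", "popularity": 62, "_version_": 12345}],
)
print(url)
```

The returned values can be passed to any HTTP client, for instance urllib.request.Request(url, data=body, headers=headers).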

 

For more information, please also see Yonik Seeley's presentation on NoSQL features in Solr 4 from Apache Lucene EuroCon 2012.

Power Tip

The _version_ field is by default stored in the inverted index (indexed="true"). However, for some systems with a very large number of documents, the increase in FieldCache memory requirements may be too costly. A solution can be to declare the _version_ field as DocValues:

Sample field definition

Document Centric Versioning Constraints

Optimistic Concurrency is extremely powerful, and works very efficiently because it uses internally assigned, globally unique values for the _version_ field. However, in some situations users may want to configure their own document-specific version field, where the version values are assigned on a per-document basis by an external system, and have Solr reject updates that attempt to replace a document with an "older" version. In situations like this the DocBasedVersionConstraintsProcessorFactory can be useful.

The basic usage of DocBasedVersionConstraintsProcessorFactory is to configure it in solrconfig.xml as part of the UpdateRequestProcessorChain and specify the name of your custom versionField in your schema that should be checked when validating updates:

Once configured, this update processor will reject (HTTP error code 409) any attempt to update an existing document where the value of the my_version_l field in the "new" document is not greater than the value of that field in the existing document.

versionField vs _version_

The _version_ field used by Solr for its normal optimistic concurrency also has important semantics in how updates are distributed to replicas in SolrCloud, and MUST be assigned internally by Solr. Users cannot re-purpose that field and specify it as the versionField for use in the DocBasedVersionConstraintsProcessorFactory configuration.

 

DocBasedVersionConstraintsProcessorFactory supports two additional configuration params which are optional:

  • ignoreOldUpdates - A boolean option which defaults to false. If set to true, then instead of rejecting updates where the versionField is too low, the update will be silently ignored (and return a status 200 to the client).
  • deleteVersionParam - A String parameter that can be specified to indicate that this processor should also inspect Delete By Id commands. The value of this configuration option should be the name of a request parameter that the processor will now consider mandatory for all attempts to Delete By Id, and must be used by clients to specify a value for the versionField which is greater than the existing value of the document to be deleted. When using this request param, any Delete By Id command with a high enough document version number to succeed will be internally converted into an Add Document command that replaces the existing document with a new one which is empty except for the Unique Key and versionField. This keeps a record of the deleted version, so future Add Document commands will fail if their "new" version is not high enough.
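The behavior described for the processor and its two options can be modeled in a few lines. The following Python sketch is an illustrative model only, not the processor's code: DocVersionStore and VersionConflict are hypothetical names, my_version_l is the example field name used above, and the unique key is assumed to be a field named id.

```python
class VersionConflict(Exception):
    """Stands in for the HTTP 409 rejection described above."""


class DocVersionStore:
    """In-memory model of document-based version constraints."""

    def __init__(self, version_field, ignore_old_updates=False):
        self.version_field = version_field
        self.ignore_old_updates = ignore_old_updates
        self.docs = {}

    def add(self, doc):
        """Store `doc` if its version is greater than the existing one.

        Returns True if stored, False if silently ignored.
        """
        existing = self.docs.get(doc["id"])
        if existing is None or doc[self.version_field] > existing[self.version_field]:
            self.docs[doc["id"]] = dict(doc)
            return True
        if self.ignore_old_updates:
            return False  # ignored; the client would still see status 200
        raise VersionConflict("version is not greater than the existing value")

    def delete_by_id(self, doc_id, delete_version):
        """Model a versioned delete as an add of a placeholder document.

        The placeholder holds only the unique key and the version field, so
        later adds with a lower version are still refused.
        """
        return self.add({"id": doc_id, self.version_field: delete_version})


strict = DocVersionStore("my_version_l")
strict.add({"id": "doc1", "my_version_l": 5, "title": "first"})

try:
    strict.add({"id": "doc1", "my_version_l": 3, "title": "stale"})
    stale_rejected = False
except VersionConflict:
    stale_rejected = True

strict.delete_by_id("doc1", 7)  # leaves a placeholder at version 7

try:
    strict.add({"id": "doc1", "my_version_l": 6, "title": "too old"})
    late_add_rejected = False
except VersionConflict:
    late_add_rejected = True

lenient = DocVersionStore("my_version_l", ignore_old_updates=True)
lenient.add({"id": "doc1", "my_version_l": 5})
ignored = not lenient.add({"id": "doc1", "my_version_l": 3})

print(stale_rejected, late_add_rejected, ignored)
```

Modeling the delete as an add of a placeholder document mirrors the conversion described for deleteVersionParam, which is what lets the version check keep working after a document is deleted.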

Please consult the processor javadocs and test configs for additional information and example usages.

20 Comments

  1. After SOLR-5670 I would add a few lines of documentation. Maybe something a-la this:

    ... By default, Solr's schema.xml includes a _version_ field, and this field is automatically added to each new document. By default, the field is declared to be indexed in schema.xml

    <field name="_version_" type="long" indexed="true" stored="true"/>

    If you want you can make it docValued instead of indexed - e.g.

    <field name="_version_" type="ondisk_docval_long" indexed="false" stored="true" required="true" docValues="true"/>
    <fieldType name="ondisk_docval_long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0" docValuesFormat="Disk"/>

    Using docValue instead of indexed might be a better solution if you have a rapidly changing index or if you have so much data that you cannot afford the memory used by the reversed term-index on _version_ in FieldCache.
    1. Incorporated this as a Power Tip box with a slightly different wording. Please review.

      We might want to edit the example schema.xml too and add an XML comment in there too?

  2. {"id":"mydoc", "f1"{"set":10}, "f2"{"add":20}}

    should be:

    {"id":"mydoc", "f1":{"set":10}, "f2":{"add":20}}

    1. Thanks Furkan - I fixed the example.

  3. I don't see a good description of updateRequestProcessorChain in the new reference guide. There's a bit of info in "SolrCloud with Legacy Configuration Files". A single page needs to cover: defining your custom chain, referencing your chain from an update processor, how to invoke it (I think there are 3 ways: 1: reference it in default update, 2: update.chain=, and 3: I think maybe if it's the only chain defined, not sure on that), and then explain better about solr.LogUpdateProcessorFactory, solr.DistributedUpdateProcessorFactory, solr.RunUpdateProcessorFactory and the ORDER that they appear in. Mentions of an item running "only once" if placed before Distributed seem a bit odd in a SolrCloud environment; I can only think of 1 use case (log transaction to external system only once, not once per replica). Also, what happens if I don't include RunUpdateProcessorFactory at all, and does it matter where I place the Log entry? The old wiki has a bit more info but is also lacking, and also covers 1x/3x/4x. A coordinated page covering basic end-to-end on this topic would be good, even if it resorted to linking to other pages for fine details.

    1. There is an open JIRA issue to add update processor chains to the Ref Guide:

       

      As with most things, someone just needs to sit down and do it. If you are knowledgeable about the subject, you could add some content to that JIRA and a committer can add it to the Guide. While editing the Guide is limited to the committers of Lucene & Solr, anyone can submit content to be added, either as a comment on a page or in JIRA issues.

  4. From the CHANGES.txt file 

    Upgrading from Solr 4.8
    ----------------------

    * Support for DiskDocValuesFormat (ie: fieldTypes configured with docValuesFormat="Disk")
    has been removed due to poor performance. If you have an existing fieldTypes using
    DiskDocValuesFormat please modify your schema.xml to remove the 'docValuesFormat'
    attribute, and optimize your index to rewrite it into the default codec, prior to
    upgrading to 4.9. See LUCENE-5761 for more details.

     

    The power tip needs to be modified accordingly - 

    This is what it should now look like - 

    A solution can be to declare the _version_ field as DocValues, e.g.:

    <field name="_version_" indexed="false" stored="true" required="true" docValues="true"/>
    1. This isn't an either-or. I was looking at the RealTimeGet and related code and found that in one circumstance it needs the largest value for the version field. The fast approach is for the field to be indexed, since it can ask the Terms for this; with docValues only, it requires a full scan. So I suspect the optimal configuration is both indexed and docValues. Why not just ship this way by default?

  5. There seems to be a typo in the atomic update example result above, where the set value should be 99, not 999:

    {"id":"mydoc",
     "price":999,
     "popularity":62,
     "categories":["kids","toys","games"],
     "tags":["buy_now","clearance"]
    }
    Thanks
  6. About the "remove" modifier: when I try to remove a value from a multiValued field of type solr.TrieDateField, it doesn't work.

    My schema is the following:

    <fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>

    <dynamicField name="*_dts" type="date"    indexed="true"  stored="true" multiValued="true"/>

    My JSON is the following:

    {"id":"1","test_dts":{"remove":["2015-11-11T03:11:11Z"]}}

    By the way, with a string type it works.

    Does the date type not support this modifier?

    ================================

    I looked at the doRemove method in DistributedUpdateProcessor.java; the input param "Object fieldVal" is:

    pattern A: with JSON over a URL, fieldVal will be a list<String>

    pattern B: with the SolrJ API, fieldVal will be a String

    but the original value is a Date.

    I think this is the reason why the Collection can't match this value and remove it.

  7.  Is it correct to set the value of "versionField" as "_version_" in the class called "DocBasedVersionConstraintsProcessorFactory"? 


     

    1. great question! .. no, you can not use the existing _version_ field as your "versionField" in that processor – the _version_ must be assigned by solr for distributed document updates to work properly.

      I've added a special note about this – thank you for asking.

  8. nit:

    It is a feature of many NoSQL databases, and allows conditional updating a document based on it's version

    it's -> its

  9. Cassandra Targett, regarding "remove Removes (all occurrences of) the specified values from a multivalued field." - actually it looks like the "remove" command only removes the first occurrence from the list of values, not all matches. At least with Solr 5.2.1.