Apache Solr Documentation

This Unreleased Guide Will Cover Apache Solr 6.3

DocValues are a way of recording field values internally that is more efficient for some purposes, such as sorting and faceting, than traditional indexing.

Why DocValues?

The standard way that Solr builds the index is with an inverted index. This style builds a list of the terms found in all the documents in the index, and next to each term is a list of the documents that the term appears in (as well as how many times the term appears in each document). This makes search very fast: since users search by terms, having a ready term-to-document mapping speeds up the query process.

For other features that we now commonly associate with search, such as sorting, faceting, and highlighting, this approach is not very efficient. The faceting engine, for example, must look up each term that appears in each document that will make up the result set and pull the document IDs in order to build the facet list. In Solr, this is maintained in memory, and can be slow to load (depending on the number of documents, terms, etc.).

In Lucene 4.0, a new approach was introduced. DocValue fields are now column-oriented fields with a document-to-value mapping built at index time. This approach promises to relieve some of the memory requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.

Enabling DocValues

To use docValues, you only need to enable them on the fields that will use them. As with all schema design, you define a field type and then define fields of that type with docValues enabled. All of this is done in schema.xml.

Enabling a field for docValues only requires adding docValues="true" to the field (or field type) definition, as in this example from the schema.xml of Solr's sample_techproducts_configs config set:
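
As a hedged illustration of the kind of definition meant here (a sketch, not a verbatim copy of the shipped file), the fragment below assumes the manu_exact field and a string field type backed by solr.StrField. The indexed and stored settings shown are assumptions chosen for illustration.

```xml
<!-- Sketch only: the indexed/stored settings below are illustrative assumptions -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

<!-- docValues="true" enables the column-oriented document-to-value structure for this field -->
<field name="manu_exact" type="string" indexed="false" stored="false" docValues="true"/>
```

Setting docValues="true" on the fieldType instead would apply it to every field of that type unless a field overrides it.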

If you have already indexed data into your Solr index, you will need to completely re-index your content after changing your field definitions in schema.xml in order to successfully use docValues.

DocValues are only available for specific field types. The types chosen determine the underlying Lucene docValue type that will be used. The available Solr field types are:

  • StrField and UUIDField.
    • If the field is single-valued (i.e., multiValued="false"), Lucene will use the SORTED type.
    • If the field is multi-valued, Lucene will use the SORTED_SET type.
  • Any Trie* numeric fields, date fields and EnumField.
    • If the field is single-valued (i.e., multiValued="false"), Lucene will use the NUMERIC type.
    • If the field is multi-valued, Lucene will use the SORTED_SET type.

These Lucene types are related to how the values are sorted and stored.
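
The mapping above can be sketched as a schema.xml fragment. The field names (category, tags, price_cents, ratings) and the tlong type name are hypothetical and exist only for this illustration; the comments restate the Lucene types listed above.

```xml
<!-- Hypothetical field and type names, for illustration only -->
<fieldType name="string" class="solr.StrField"/>
<fieldType name="tlong" class="solr.TrieLongField" precisionStep="8" positionIncrementGap="0"/>

<!-- Single-valued StrField: Lucene uses SORTED -->
<field name="category" type="string" indexed="true" stored="true" docValues="true" multiValued="false"/>

<!-- Multi-valued StrField: Lucene uses SORTED_SET -->
<field name="tags" type="string" indexed="true" stored="true" docValues="true" multiValued="true"/>

<!-- Single-valued Trie numeric field: Lucene uses NUMERIC -->
<field name="price_cents" type="tlong" indexed="true" stored="true" docValues="true" multiValued="false"/>

<!-- Multi-valued Trie numeric field: Lucene uses SORTED_SET -->
<field name="ratings" type="tlong" indexed="true" stored="true" docValues="true" multiValued="true"/>
```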

There are two implications of multi-valued DocValues being stored as SORTED_SET types that should be kept in mind when combined with /export (and, by extension, Streaming Expression-based functionality):

  1. Values are returned in sorted order rather than the original input order.
  2. If multiple, identical entries are in the field in a single document, only one will be returned for that document.

There is an additional configuration option available, which is to modify the docValuesFormat used by the field type. The default implementation employs a mixture of loading some things into memory and keeping some on disk. In some cases, however, you may choose to specify an alternative DocValuesFormat implementation. For example, you could choose to keep everything in memory by specifying docValuesFormat="Memory" on a field type:
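
A minimal sketch of such a field type follows. The type name string_in_mem_dv and the field name manu_mem are assumptions made for this illustration; only the docValuesFormat="Memory" attribute comes from the text above.

```xml
<!-- Hypothetical type name; docValuesFormat selects an alternative DocValuesFormat implementation -->
<fieldType name="string_in_mem_dv" class="solr.StrField" docValues="true" docValuesFormat="Memory"/>

<!-- Hypothetical field that inherits docValues and the Memory format from its type -->
<field name="manu_mem" type="string_in_mem_dv" indexed="false" stored="false"/>
```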

Please note that the docValuesFormat option may change in future releases.

Lucene index back-compatibility is only supported for the default codec. If you choose to customize the docValuesFormat in your schema.xml, upgrading to a future version of Solr may require you to either switch back to the default codec and optimize your index to rewrite it into the default codec before upgrading, or re-build your entire index from scratch after upgrading.

Using DocValues

Sorting, Faceting & Functions

If docValues="true" for a field, then DocValues will automatically be used any time the field is used for sorting, faceting or Function Queries.

Retrieving DocValues During Search

Field values retrieved during search queries are typically returned from stored values. However, non-stored docValues fields are also returned along with stored fields when all fields, or a matching glob pattern, are requested (e.g., "fl=*"), depending on the effective value of the useDocValuesAsStored parameter for each field. For schema versions >= 1.6, the implicit default is useDocValuesAsStored="true". See Field Type Definitions and Properties & Defining Fields for more details.

When useDocValuesAsStored="false", non-stored DocValues fields can still be explicitly requested by name in the fl param, but will not match glob patterns ("*").
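
The two behaviors can be sketched side by side. The field names brand_dv and sku_dv are hypothetical, and the schema wrapper is shown only to make the version explicit.

```xml
<schema name="example" version="1.6">
  <fieldType name="string" class="solr.StrField"/>

  <!-- Non-stored docValues field: with schema version 1.6 the implicit
       useDocValuesAsStored="true" applies, so fl=* returns it like a stored field -->
  <field name="brand_dv" type="string" indexed="true" stored="false" docValues="true"/>

  <!-- Opted out: not matched by fl=*, returned only when named explicitly, e.g. fl=sku_dv -->
  <field name="sku_dv" type="string" indexed="true" stored="false" docValues="true"
         useDocValuesAsStored="false"/>
</schema>
```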

Note that returning DocValues along with "regular" stored fields at query time has performance implications that stored fields alone may not have: because DocValues are column-oriented, retrieving them may incur additional cost for each returned document. Also note that when non-stored fields are returned from DocValues, the values of a multi-valued field are returned in sorted order (not insertion order). If you require a multi-valued field to be returned in the original insertion order, make the field stored (such a change requires re-indexing).
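
As a sketch of that last point, a multi-valued field that must come back in insertion order could be declared as stored while still carrying docValues for sorting and faceting. The field name tags_ordered is hypothetical.

```xml
<fieldType name="string" class="solr.StrField"/>

<!-- stored="true": values are returned from stored data, in insertion order;
     docValues="true" remains available for faceting and function queries -->
<field name="tags_ordered" type="string" indexed="true" stored="true" docValues="true" multiValued="true"/>
```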

In cases where the query returns only docValues fields, performance may improve, since returning stored fields requires disk reads and decompression, whereas docValues fields in the fl list can often be served largely from memory (the default docValuesFormat still keeps some data on disk).

When retrieving fields from their docValues form, two important differences between regular stored fields and docValues fields must be understood:

  1. Order is not preserved. When retrieving stored fields, the return order is the insertion order. For docValues, it is the sorted order.
  2. Multiple identical entries are collapsed into a single value. For example, if the values 4, 5, 2, 4, 1 are inserted, the values returned will be 1, 2, 4, 5.

 

7 Comments

  1. I don't think I set up my permissions right to edit the page, but I don't know how best to word all this anyway for the recent changes:

    For the manu_exact example the default="" can be removed (this is already done in example/schema.xml)

    The paragraph "It's important that the fields be populated" can be removed.

    Also the paragraph "The default implementation loads everything into memory" can be changed (I don't know how best to word this: the default impl loads some things into memory, other things it keeps on disk). If you want extremes: everything on disk, use docValuesFormat="Disk"; everything in memory, use docValuesFormat="Memory"

    1. I updated the page with these edits. If you could take a look at the paragraph about docValuesFormat and let me know if I got it right, that would be great.

      I think Steve or Hoss can update your permissions, per instructions on this page: Internal - CWIKI ACLs.

  2. This looks great! Thank you

  3. There is a circular reference.  The How to Use Doc Values section has a reference to http://wiki.apache.org/solr/DocValues, but that wiki page claims it is outdated and the reader should be reading this confluence page. Please either move the relevant contents from the wiki to here, or update the Wiki page.

    More guidance about when one should use DocValues and when should not would be appreciated.

     

  4. You mean "than traditional indexing", correct?

  5. "returning docValues fields in the fl list only requires memory access." Is this correct? DV do have a part that stays on disk, as this same page says.

    Also, the two implications of returning from DV are explained in two sections. I believe it makes more sense in the "Using DV" section; I suggest removing it from the "Enabling DV" section.

  6. Thanks Erick.

    The warning can be rephrased as,

    "expecting metadata for field 'id' (expected=SORTED) with docvalues as true. Use UninvertingReader or index with docvalues. Recommended to index with docValues for better performance. "