Apache Solr Documentation

This Unreleased Guide Will Cover Apache Solr 5.1

DocValues are a way of recording field values internally that is more efficient for some purposes, such as sorting and faceting, than traditional indexing.

Why DocValues?

Solr's standard index structure is an inverted index. It holds a list of the terms found in all the documents in the index, and next to each term a list of the documents in which that term appears (along with how many times the term appears in each document). This makes search very fast: since users search by terms, having a ready term-to-document mapping speeds up query processing.

For other features that we now commonly associate with search, such as sorting, faceting, and highlighting, this approach is not very efficient. The faceting engine, for example, must look up each term that appears in each document in the result set and pull the document IDs in order to build the facet list. In Solr, this information is held in memory (in the fieldCache), and it can be slow to load, depending on the number of documents, terms, and so on.

In Lucene 4.0, a new approach was introduced. DocValue fields are now column-oriented fields with a document-to-value mapping built at index time. This approach promises to relieve some of the memory requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.

How to Use DocValues

To use docValues, you only need to enable them for the fields you will use them with. As with all schema design, you define a field type and then define fields of that type with docValues enabled. All of these actions are done in schema.xml.

Enabling a field for docValues only requires adding docValues="true" to the field (or field type) definition, as in this example from the schema.xml of Solr's sample_techproducts_configs config set:
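
A minimal sketch of such a definition, modeled on the manu_exact field referred to in the comments on this page. The attribute values shown are assumptions, so check them against the schema.xml shipped with your release:

```xml
<!-- Sketch only: attribute values are assumed, not copied from a specific release. -->
<!-- A string field used for faceting and sorting, with docValues enabled. -->
<field name="manu_exact" type="string" indexed="false" stored="false" docValues="true" />
```

Setting docValues="true" on the fieldType element instead would enable docValues for every field of that type.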

If you have already indexed data into your Solr index, you will need to completely re-index your content after changing your field definitions in schema.xml in order to successfully use docValues.

DocValues are only available for specific field types. The types chosen determine the underlying Lucene docValue type that will be used. The available Solr field types are:

  • String fields of type StrField.
    • If the field is single-valued (i.e., multiValued is false), Lucene will use the SORTED type.
    • If the field is multi-valued, Lucene will use the SORTED_SET type.
  • Any Trie* fields.
    • If the field is single-valued (i.e., multiValued is false), Lucene will use the NUMERIC type.
    • If the field is multi-valued, Lucene will use the SORTED_SET type.
  • UUID fields.

These Lucene types are related to how the values are sorted and stored.
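
The mapping in the list above can be sketched as a set of schema.xml definitions. The field names (category, tags, popularity, ratings) and the type names are hypothetical, chosen only for illustration; the comments note which Lucene docValue type each definition would be expected to produce according to the list:

```xml
<!-- Hypothetical field types; the names "string" and "tint" are assumptions. -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
<fieldType name="tint" class="solr.TrieIntField" precisionStep="8" positionIncrementGap="0" />

<!-- Single-valued StrField: SORTED -->
<field name="category" type="string" indexed="true" stored="true" docValues="true" />

<!-- Multi-valued StrField: SORTED_SET -->
<field name="tags" type="string" indexed="true" stored="true" multiValued="true" docValues="true" />

<!-- Single-valued Trie* field: NUMERIC -->
<field name="popularity" type="tint" indexed="true" stored="true" docValues="true" />

<!-- Multi-valued Trie* field: SORTED_SET -->
<field name="ratings" type="tint" indexed="true" stored="true" multiValued="true" docValues="true" />
```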

There is an additional configuration option available, which is to modify the docValuesFormat used by the field type. The default implementation employs a mixture of loading some things into memory and keeping some on disk. In some cases, however, you may choose to specify an alternative DocValuesFormat implementation. For example, you could choose to keep everything in memory by specifying docValuesFormat="Memory" on a field type:
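
A sketch of what such a field type might look like. The type name string_in_mem_dv is hypothetical. The solrconfig.xml line is included on the assumption that per-field-type formats are only honored when the schema-aware codec factory is configured, so verify this against your version:

```xml
<!-- schema.xml: hypothetical field type whose docValues are kept entirely in memory -->
<fieldType name="string_in_mem_dv" class="solr.StrField" docValues="true" docValuesFormat="Memory" />

<!-- solrconfig.xml: assumed prerequisite so that docValuesFormat in the schema takes effect -->
<codecFactory class="solr.SchemaCodecFactory" />
```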

Please note that the docValuesFormat option may change in future releases.

Lucene index back-compatibility is only supported for the default codec. If you choose to customize the docValuesFormat in your schema.xml, upgrading to a future version of Solr may require you to either switch back to the default codec and optimize your index to rewrite it into the default codec before upgrading, or re-build your entire index from scratch after upgrading.

  1. I don't think I set up my permissions right to edit the page, but I don't know how best to word all this anyway for the recent changes:

    For the manu_exact example the default="" can be removed (this is already done in example/schema.xml)

    The paragraph "It's important that the fields be populated" can be removed.

    Also the paragraph "The default implementation loads everything into memory" can be changed (I don't know how best to word this: the default impl loads some things into memory, other things it keeps on disk). If you want extremes: for everything on disk, use docValuesFormat="Disk"; for everything in memory, use docValuesFormat="Memory".

    1. I updated the page with these edits. If you could take a look at the paragraph about docValuesFormat and let me know if I got it right, that would be great.

      I think Steve or Hoss can update your permissions, per instructions on this page: Internal - CWIKI ACLs.

  2. This looks great! Thank you

  3. There is a circular reference. The How to Use DocValues section has a reference to http://wiki.apache.org/solr/DocValues, but that wiki page claims it is outdated and that the reader should be reading this Confluence page. Please either move the relevant contents from the wiki to here, or update the wiki page.

    More guidance about when one should use DocValues and when one should not would be appreciated.

  4. Are DocValues storing the original text (like the stored form) or the tokenized text (like the indexed form)?

    Update: Since DocValues do not apply to text fields (only Strings), the question is irrelevant. There is no tokenization or processing; the values are just stored in a particular structure.