Running Solr on HDFS

Solr has support for writing and reading its index and transaction log files to the HDFS distributed filesystem. This does not use Hadoop MapReduce to process Solr data; rather, it only uses the HDFS filesystem for index and transaction log file storage.

To use HDFS rather than a local filesystem, you must be using Hadoop 2.x and you will need to instruct Solr to use the HdfsDirectoryFactory. There are also several additional parameters to define. These can be set in one of three ways:

  • Pass JVM arguments to the bin/solr script. These would need to be passed every time you start Solr with bin/solr.
  • Modify solr.in.sh (or solr.in.cmd on Windows) to pass the JVM arguments automatically when using bin/solr without having to set them manually.
  • Define the properties in solrconfig.xml. These configuration changes would need to be repeated for every collection, so this is a good option if you only want some of your collections stored in HDFS.

Starting Solr on HDFS

Standalone Solr Instances

For standalone Solr instances, there are a few parameters you should be sure to modify before starting Solr. These can be set in solrconfig.xml (more on that below), or passed to the bin/solr script at startup.

  • You need to use an HdfsDirectoryFactory and a data dir of the form hdfs://host:port/path
  • You need to specify an UpdateLog location of the form hdfs://host:port/path
  • You should specify a lock factory type of 'hdfs' or none.

If you do not modify solrconfig.xml, you can instead start Solr on HDFS with the following command:
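
For example (a representative invocation; hdfs://host:port/path is a placeholder for your NameNode host, port, and path, and the property names assume a solrconfig.xml that reads solr.data.dir and solr.updatelog):

  bin/solr start -Dsolr.directoryFactory=HdfsDirectoryFactory \
       -Dsolr.lock.type=hdfs \
       -Dsolr.data.dir=hdfs://host:port/path \
       -Dsolr.updatelog=hdfs://host:port/path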

This example will start Solr in standalone mode, using the defined JVM properties (explained in more detail below). 

SolrCloud Instances

In SolrCloud mode, it's best to leave the data and update log directories as the defaults Solr comes with and simply specify the solr.hdfs.home. All dynamically created collections will create the appropriate directories automatically under the solr.hdfs.home root directory.

  • Set solr.hdfs.home in the form hdfs://host:port/path
  • You should specify a lock factory type of 'hdfs' or none.
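
For example (a representative invocation; hdfs://host:port/path is a placeholder for your NameNode address and the Solr root path):

  bin/solr start -c -Dsolr.directoryFactory=HdfsDirectoryFactory \
       -Dsolr.lock.type=hdfs \
       -Dsolr.hdfs.home=hdfs://host:port/path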

This command starts Solr in SolrCloud mode, using the defined JVM properties.

Modifying solr.in.sh (*nix) or solr.in.cmd (Windows)

The examples above assume you will pass JVM arguments as part of the start command every time you use bin/solr to start Solr. However, bin/solr looks for an include file named solr.in.sh (solr.in.cmd on Windows) to set environment variables. By default, this file is found in the bin directory, and you can modify it to permanently add the HdfsDirectoryFactory settings and ensure they are used every time Solr is started.

For example, to set JVM arguments to always use HDFS when running in SolrCloud mode (as shown above), you would add a section such as this:
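
A minimal sketch, assuming the SOLR_OPTS mechanism of the stock solr.in.sh and using hdfs://host:port/path as a placeholder:

  # Settings for HDFS
  SOLR_OPTS="$SOLR_OPTS -Dsolr.directoryFactory=HdfsDirectoryFactory"
  SOLR_OPTS="$SOLR_OPTS -Dsolr.lock.type=hdfs"
  SOLR_OPTS="$SOLR_OPTS -Dsolr.hdfs.home=hdfs://host:port/path"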

The Block Cache

For performance, the HdfsDirectoryFactory uses a Directory implementation that caches HDFS blocks. This caching mechanism replaces the standard filesystem cache that Solr otherwise relies on heavily. By default, this cache is allocated off heap. The cache will often need to be quite large, and you may need to raise the off-heap memory limit for the specific JVM you are running Solr in. For the Oracle/OpenJDK JVMs, the following is an example command-line parameter that you can use to raise the limit when starting Solr:
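
The value below is only illustrative; size the direct memory limit to cover the block cache (roughly the slab count times 128 MB) plus some headroom:

  -XX:MaxDirectMemorySize=20g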

HdfsDirectoryFactory Parameters

The HdfsDirectoryFactory has a number of settings that are defined as part of the directoryFactory configuration.

Solr HDFS Settings

  • solr.hdfs.home (example value: hdfs://host:port/path/solr; default: N/A): A root location in HDFS for Solr to write collection data to. Rather than specifying an HDFS location for the data directory or update log directory, use this to specify one root location and have everything automatically created within this HDFS location.

Block Cache Settings

  • solr.hdfs.blockcache.enabled (default: true): Enable the block cache.
  • solr.hdfs.blockcache.read.enabled (default: true): Enable the read cache.
  • solr.hdfs.blockcache.write.enabled (default: true): Enable the write cache.
  • solr.hdfs.blockcache.direct.memory.allocation (default: true): Enable direct memory allocation. If this is false, heap is used.
  • solr.hdfs.blockcache.slab.count (default: 1): Number of memory slabs to allocate. Each slab is 128 MB in size.
  • solr.hdfs.blockcache.global (default: false): Enable/disable using one global cache for all SolrCores. The settings used will be from the first HdfsDirectoryFactory created.

NRTCachingDirectory Settings

  • solr.hdfs.nrtcachingdirectory.enable (default: true): Enable the use of NRTCachingDirectory.
  • solr.hdfs.nrtcachingdirectory.maxmergesizemb (default: 16): NRTCachingDirectory maximum segment size for merges, in MB.
  • solr.hdfs.nrtcachingdirectory.maxcachedmb (default: 192): NRTCachingDirectory maximum cache size, in MB.

HDFS Client Configuration Settings

  • solr.hdfs.confdir (default: N/A): Pass the location of the HDFS client configuration files; this is needed, for example, when running against an HDFS HA (high availability) setup.

Kerberos Authentication Settings

Hadoop can be configured to use the Kerberos protocol to verify user identity when trying to access core services like HDFS. If your HDFS directories are protected using Kerberos, then you need to configure Solr's HdfsDirectoryFactory to authenticate using Kerberos in order to read and write to HDFS. To enable Kerberos authentication from Solr, you need to set the following parameters:

  • solr.hdfs.security.kerberos.enabled (default: false): Set to true to enable Kerberos authentication.
  • solr.hdfs.security.kerberos.keytabfile (default: N/A): A keytab file contains pairs of Kerberos principals and encrypted keys which allows for password-less authentication when Solr attempts to authenticate with secure Hadoop. This file will need to be present on all Solr servers at the same path provided in this parameter.
  • solr.hdfs.security.kerberos.principal (default: N/A): The Kerberos principal that Solr should use to authenticate to secure Hadoop; the format of a typical Kerberos V5 principal is primary/instance@realm.

Example

Here is a sample solrconfig.xml configuration for storing Solr indexes on HDFS:
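
A representative sketch using the parameters described above; hdfs://host:port/solr and /etc/hadoop/conf are placeholders for your own NameNode address and Hadoop client configuration directory:

  <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
    <str name="solr.hdfs.home">hdfs://host:port/solr</str>
    <bool name="solr.hdfs.blockcache.enabled">true</bool>
    <int name="solr.hdfs.blockcache.slab.count">1</int>
    <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
    <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
    <bool name="solr.hdfs.blockcache.write.enabled">true</bool>
    <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
    <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
    <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
    <!-- Uncomment and point at your HDFS client configuration directory
         when running against an HDFS HA setup:
    <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
    -->
  </directoryFactory>

The lock factory type is set separately in the <indexConfig> section of solrconfig.xml, e.g. <lockType>hdfs</lockType>.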

If using Kerberos, you will need to add the three Kerberos related properties to the <directoryFactory> element in solrconfig.xml, such as:
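
A sketch; the keytab path and principal shown are placeholders for your own values:

  <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
    ...
    <bool name="solr.hdfs.security.kerberos.enabled">true</bool>
    <str name="solr.hdfs.security.kerberos.keytabfile">/etc/krb5.keytab</str>
    <str name="solr.hdfs.security.kerberos.principal">solr/admin@EXAMPLE.COM</str>
  </directoryFactory>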

Limitations

You must use an 'append-only' Lucene index codec because HDFS is an append-only filesystem. The current default codec used by Solr is append-only and is supported with HDFS.

Automatically Add Replicas in SolrCloud

One benefit to running Solr in HDFS is the ability to automatically add new replicas when the Overseer notices that a shard has gone down. Because the "gone" index shards are stored in HDFS, a new core will be created and pointed at the existing indexes in HDFS.

Collections created using autoAddReplicas=true on a shared file system have automatic addition of replicas enabled. The following settings can be used to override the defaults in the <solrcloud> section of solr.xml.

  • autoReplicaFailoverWorkLoopDelay (default: 10000): The time (in ms) between clusterstate inspections by the Overseer to detect and possibly act on creation of a replacement replica.
  • autoReplicaFailoverWaitAfterExpiration (default: 30000): The minimum time (in ms) to wait before initiating replacement of a replica after first noticing it not being live. This is important to prevent false positives while stopping or starting the cluster.
  • autoReplicaFailoverBadNodeExpiration (default: 60000): The delay (in ms) after which a replica marked as down would be unmarked.
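
For example, a sketch of overriding one of these defaults in the <solrcloud> section of solr.xml (the value shown is illustrative):

  <solrcloud>
    ...
    <int name="autoReplicaFailoverWaitAfterExpiration">60000</int>
  </solrcloud>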

Temporarily disable autoAddReplicas for the entire cluster

When doing offline maintenance on the cluster and for various other use cases where an admin would like to temporarily disable auto addition of replicas, the following APIs will disable and re-enable autoAddReplicas for all collections in the cluster:

Disable auto addition of replicas cluster wide by setting the cluster property autoAddReplicas to false:
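
For example (assuming a Solr node reachable at localhost:8983):

  http://localhost:8983/solr/admin/collections?action=CLUSTERPROP&name=autoAddReplicas&val=false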

Re-enable auto addition of replicas (for those collections created with autoAddReplicas=true) by unsetting the autoAddReplicas cluster property (when no val param is provided, the cluster property is unset):
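
For example (again assuming localhost:8983):

  http://localhost:8983/solr/admin/collections?action=CLUSTERPROP&name=autoAddReplicas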


20 Comments

  1. This looks mostly fine. A few suggestions:

    • first sentence, change "it's index" to "its index".
    • After the 1st sentence, add a clarifying sentence like, "This does not use Hadoop to process Solr data, rather it only uses the HDFS filesystem for index and transaction log file storage."
    • The CSS is applying a background color to the 'noformat' blocks which is a bit distracting - maybe change those to code blocks (with {code:borderStyle=solid|borderColor=#666666}). With the code block the first start command could also be one line for easier copy/paste; the code block will automatically wrap that across multiple lines depending on the width of the user's browser.
    • In the section "Block Cache", the last sentence: "For the Oracle/OpenJDK vm's,...", change "vm's" to "JVMs" for clarity.
    • Ideally, there would be an example from solrconfig.xml showing the params in use. That could be added later for the next version, however.
    1. Thanks Cassandra! I've attempted to incorporate the suggested changes.

  2. "This does not use Hadoop" - I'd say "This does not use Hadoop map-reduce".

    "You muse use" -> "You must use".

    Also, I think the documentation should mention that Hadoop 2.0.x is required - initially I tried this with Hadoop 1.x and it failed with a cryptic unhelpful message (which is really the fault of Hadoop interop, but still we can save users some trouble).

    1. thanks ab, i incorporated your comments

  3. Can you post an example of how to create an index on HDFS?
    Other than the configuration above, what do you need to specify to a Java program to tell IndexWriter that the file is in HDFS?

    1. Daniel: With the configuration specified above, Solr will keep your index and transaction logs in HDFS – there is nothing you need to do in any java client application.

      can you please clarify if this is unclear to you in the documentation above (and if so what might make it more clear) or if you are asking because you have a general question about using HDFS directly from a lucene java application that is not solr (in which case these comments aren't really the appropriate forum: please ask on the java-user mailing list)

      1. Essentially my question is if I need to replace the code below once I configured Solr to run with HDFS:

        // Creating a Lucene IndexWriter
        IndexWriterConfig conf = new IndexWriterConfig(LuceneUtils.LUCENE_VERSION,
            new WhitespaceAnalyzer(LuceneUtils.LUCENE_VERSION));
        IndexWriter iw = new IndexWriter(FSDirectory.open(new File("index")), conf);

        If the code above stays like this, I would just say in the documentation that there is no need to change application code once Solr is configured for HDFS.

        1. If you're writing java code directly using IndexWriterConfig & IndexWriter then please see my previous comment...

          ... if you are asking because you have a general question about using HDFS directly from a lucene java application that is not solr (in which case these comments aren't really the appropriate forum: please ask on the java-user mailing list)

  4. In the HDFS Client Configuration Settings section it would be good to mention the solr.hdfs.home property as well. The Example section could also use a (perhaps commented out) section showing the options required to run in HA mode.

    1. Thanks Greg, I've added some info on solr.hdfs.home as a start.

  5. Can someone with the knowledge comment on this page why someone would put Solr indexes into HDFS? Perhaps it's for fault-tolerance but then that's what Solr replication buys you without using HDFS. Are there performance advantages? I suspect performance disadvantages.

    1. In my case we moved our indexes to HDFS due to the amount of space we've got there and the lack of physical disk space on our solr nodes. Rather than having 6 nodes with limited amounts of disk space, we can now run a solr instance on each of our map-reduce/hdfs nodes and not have to worry about local storage as well, which is good for scalability. Moving indexes around between one instance and another is also simplified due to the shared file system, and map-reduce jobs that operate on the indexes don't need to pull them from somewhere else; the indexes are "local" to them.

    2. There are a variety of reasons you might want to put Solr indexes into HDFS.

      As Greg mentions above, one of those reasons might be the ease of dealing with disk space if you are already using HDFS or intend to.

      It also does allow you to offer different trade offs in terms of fault-tolerance. This HDFS integration is just the beginning - once you can work with a shared filesystem, it becomes easy to reassign indexes to new or existing nodes without standard recovery - in this case you could count on HDFS for fault-tolerance, which is much more hardened than the standard SolrCloud replication fault tolerance at this point.

      There are other synergies as well. If you are making indexes with MapReduce, it can really make things nice and simple to just write the indexes to HDFS. Then serve them from HDFS.

      It's really just another storage option to consider, especially if you are already using HDFS, and we hope that it is just the start.

      In terms of performance, as in most cases with Hadoop, data will favor being local where you can use things like HDFS local reads and no network trip. In terms of writing, even with a network trip, if your pipe is large enough, HDFS is not really the bottleneck. For reads, the HDFS block cache Directory impl does a pretty good job of taking over for the local filesystem cache. In addition, many HDFS nodes are outfitted with multiple drives, which comes with its own benefits that local file system options cannot easily match without setting up some sort of RAID system.

      We have not focused on performance yet, so I'm sure there are many improvements to come, but initial one-off comparison benchmarks are not bad at all.

      1. My main concern when I was initially looking at this is that standard Solr uses memory-mapped files for performance reasons, and takes a lot of burden off of Java heap allocation. Does the HDFS block cache Directory perform the same type of function as the memory mapped files? In other words, if I am currently using 200GB of virtual memory (as reported by top) in each of my solr instances, should I set -XX:MaxDirectMemorySize=200g (I am currently allocating a heap of 32GB so that I have 16 GB for a searcher with enough room to start a new searcher when necessary). If so, I think a note describing how to transition from current memory-mapped files to HDFS would be useful. For example, "To determine the amount of direct memory needed after the transition, check the current amount of virtual memory your solr processes are using"

        Other than that, this seems fantastic

  6. Hi guys, I am a newbie to Solr and am trying to integrate Solr (4.6) with HDFS (Hadoop 2.2).

    I am encountering an error; there is a difference in protobuf versions (2.4 in Hadoop and 2.5 in Solr). Kindly provide any insight, and also let me know where I can ask questions.

    1. mohammed: if you have questions about using solr, please ask them on the mailing list.

      these comments are for discussing specific issues with the documentation itself

  7. Newer Hadoop distributions are moving towards high-availability namenodes. This should offer much better protection from single node failures, but it introduces a complication that isn't discussed here.  If the namenode changes, then the configuration parameter (solr.hdfs.home) needs to change.  Is someone already working on making this automatic (e.g. by specifying all of the potential namenodes and choosing the one that is not in standby mode), or is that feature still on the todo list?

  8. Lkc

    Hi guys, I'm a newbie and I don't understand what solr.hdfs.confdir exactly means. What is an HDFS client? Can anyone help me?

    1. Lkc

      And I configured as above, but Solr can't start up; it just gets stuck. Can anybody tell me why? I use Solr 4.10.3 and Hadoop 2.6.0.

  9. I could not run SolrCloud on HDFS; I tried many variations of the command:

    java -Dsolr.directoryFactory=HdfsDirectoryFactory
         -Dsolr.lock.type=hdfs
         -Dsolr.hdfs.home=hdfs://host:port/path
    shown here, but to no avail. Can anyone post a correct command that works directly? Preferably one that works on a schemaless configuration.