Getting Started with SolrCloud

SolrCloud is designed to provide a highly available, fault tolerant environment for distributing your indexed content and query requests across multiple servers. It's a system in which data is organized into multiple pieces, or shards, that can be hosted on multiple machines, with replicas providing redundancy for both scalability and fault tolerance, and a ZooKeeper server that helps manage the overall structure so that both indexing and search requests can be routed properly.

This section explains SolrCloud and its inner workings in detail, but before you dive in, it's best to have an idea of what it is you're trying to accomplish. This page provides a simple tutorial to start Solr in SolrCloud mode, so you can begin to get a sense for how shards interact with each other during indexing and when serving queries. To that end, we'll use simple examples of configuring SolrCloud on a single machine. This is obviously not a real production environment, which would include several servers or virtual machines; in a real production environment, you'll also use real machine names instead of the "localhost" we've used here.

In this section you will learn how to start a SolrCloud cluster using startup scripts and a specific configset.

This tutorial assumes that you're already familiar with the basics of using Solr. If you need a refresher, please see the Getting Started section to get a grounding in Solr concepts. If you load documents as part of that exercise, you should start over with a fresh Solr installation for these SolrCloud tutorials.

SolrCloud Example

Interactive Startup

The bin/solr script makes it easy to get started with SolrCloud as it walks you through the process of launching Solr nodes in cloud mode and adding a collection. To get started, simply do:
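    bin/solr -e cloud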

This starts an interactive session to walk you through the steps of setting up a simple SolrCloud cluster with embedded ZooKeeper. The script starts by asking you how many Solr nodes you want to run in your local cluster, with the default being 2.

The script supports starting up to 4 nodes, but we recommend using the default of 2 when starting out. These nodes will each exist on a single machine, but will use different ports to mimic operation on different servers.
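If you're following along, the opening prompt looks something like this (the prompt text here and in the examples below is reconstructed from the 6.x scripts and may vary slightly between Solr versions):

    Welcome to the SolrCloud example!
    This interactive session will help you launch a SolrCloud cluster on your local workstation.
    To begin, how many Solr nodes would you like to run in your local cluster? (specify 1-4 nodes) [2]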

Next, the script will prompt you for the port to bind each of the Solr nodes to, such as:
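    Please enter the port for node1 [8983]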

Choose any available port for each node; the default is 8983 for the first node and 7574 for the second. The script will start each node in order and show you the command it uses to start the server, such as:
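    Starting up Solr on port 8983 using command:
    bin/solr start -cloud -p 8983 -s "example/cloud/node1/solr"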

The first node will also start an embedded ZooKeeper server bound to port 9983. The Solr home for the first node is in example/cloud/node1/solr as indicated by the -s option.

After starting up all nodes in the cluster, the script prompts you for the name of the collection to create:
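    Please provide a name for your new collection: [gettingstarted]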

The suggested default is "gettingstarted" but you might want to choose a name more appropriate for your specific search application.

Next, the script prompts you for the number of shards to distribute the collection across. Sharding is covered in more detail later on, so if you're unsure, we suggest using the default of 2 so that you can see how a collection is distributed across multiple nodes in a SolrCloud cluster.

Next, the script will prompt you for the number of replicas to create for each shard. Replication is covered in more detail later in the guide, so if you're unsure, then use the default of 2 so that you can see how replication is handled in SolrCloud.

Lastly, the script will prompt you for the name of a configuration directory for your collection. You can choose basic_configs, data_driven_schema_configs, or sample_techproducts_configs. The configuration directories are pulled from server/solr/configsets/ so you can review them beforehand if you wish. The data_driven_schema_configs configuration (the default) is useful when you're still designing a schema for your documents and need some flexibility as you experiment with Solr.
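The configset prompt looks roughly like this:

    Please choose a configuration for the gettingstarted collection, available options are:
    basic_configs, data_driven_schema_configs or sample_techproducts_configs [data_driven_schema_configs]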

At this point, you should have a new collection created in your local SolrCloud cluster. To verify this, you can run the status command:
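    bin/solr status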

If you encounter any errors during this process, check the Solr log files in example/cloud/node1/logs and example/cloud/node2/logs.

You can see how your collection is deployed across the cluster by visiting the cloud panel in the Solr Admin UI: http://localhost:8983/solr/#/~cloud. Solr also provides a way to perform basic diagnostics for a collection using the healthcheck command:
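    bin/solr healthcheck -c gettingstarted -z localhost:9983

(This example assumes the default collection name and the embedded ZooKeeper on port 9983; adjust the -c and -z parameters to match your setup.)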

The healthcheck command gathers basic information about each replica in a collection, such as the number of docs, current status (active, down, etc.), and address (where the replica lives in the cluster).

Documents can now be added to SolrCloud using the Post Tool.
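For example, to index the sample documents included in the Solr distribution (this assumes the default "gettingstarted" collection name used above):

    bin/post -c gettingstarted example/exampledocs/*.xml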

To stop Solr in SolrCloud mode, you would use the bin/solr script and issue the stop command, as in:
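    bin/solr stop -all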

Starting with -noprompt

You can also get SolrCloud started with all the defaults instead of the interactive session using the following command:
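    bin/solr -e cloud -noprompt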

Restarting Nodes

You can restart your SolrCloud nodes using the bin/solr script. For instance, to restart node1 running on port 8983 (with an embedded ZooKeeper server), you would do:
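    bin/solr restart -c -p 8983 -s example/cloud/node1/solr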

To restart node2 running on port 7574, you can do:
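    bin/solr restart -c -p 7574 -z localhost:9983 -s example/cloud/node2/solr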

Notice that you need to specify the ZooKeeper address (-z localhost:9983) when starting node2 so that it can join the cluster with node1.

Adding a Node to a Cluster

Adding a node to an existing cluster is a bit advanced and requires a little more understanding of Solr. Once you start up a SolrCloud cluster using the startup scripts, you can add a new node to it by:
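    mkdir <solr.home for new solr node>
    cp <existing solr.xml path> <new solr.home>
    bin/solr start -cloud -s solr.home/solr -p <port num> -z <zk hosts string>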

Notice that the above requires you to create a Solr home directory. You either need to copy solr.xml into that Solr home directory, or keep it centrally in ZooKeeper (/solr.xml).

Example (with directory structure) that adds a node to an example started with "bin/solr -e cloud":
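    mkdir -p example/cloud/node3/solr
    cp server/solr/solr.xml example/cloud/node3/solr
    bin/solr start -cloud -s example/cloud/node3/solr -p 8987 -z localhost:9983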

The previous command will start another Solr node on port 8987 with Solr home set to example/cloud/node3/solr. The new node will write its log files to example/cloud/node3/logs.

Once you're comfortable with how the SolrCloud example works, we recommend using the process described in Taking Solr to Production for setting up SolrCloud nodes in production.


26 Comments

  1. Before Confluence 3.5.x, long lines in code blocks didn't wrap properly. That's fixed in Confluence 3.5.x and higher, which means the backslash ('\') is no longer needed in the examples under "Using Multiple ZooKeepers in an Ensemble". The backslashes aren't hurting anything, but could someday be removed.

  2. It would be helpful to note that collection1 has replicationFactor = 2 and maxShardsPerNode = 1.

  3. I successfully set up a SolrCloud demo on my virtual machine; great stuff for an introduction.

    A tiny problem when I connected SolrCloud to an independent external ZooKeeper cluster:

    The command line used for starting shards and replicas:

    cd node1
    java -DzkRun -DnumShards=2 -Dbootstrap_confdir=./solr/collection1/conf \
    -Dcollection.configName=myconf -DzkHost=localhost:9983,localhost:8574,localhost:9900 \
    -jar start.jar

    -DzkRun is not necessary any more; in fact, it can trigger an internal ZooKeeper startup error instead.

    As mentioned above: "-DzkRun Starts up a ZooKeeper server embedded within Solr. This server will manage the cluster configuration. Note that we're doing this example all on one machine; when you start working with a production system, you'll likely use multiple ZooKeepers in an ensemble (or at least a stand-alone ZooKeeper instance). In that case, you'll replace this parameter with zkHost=<ZooKeeper Host:Port>, which is the hostname:port of the stand-alone ZooKeeper."

    The same applies to the rest of the commands.

     

    1. The -DzkRun is supplied here because the example starts multiple ZooKeeper nodes embedded within Solr. So each Solr instance that starts up has -DzkRun, and the ZooKeeper host/port numbers for all the Solr instances are added as -DzkHost. But you are right that if the ZooKeeper servers were set up externally, then -DzkRun would not be required.

  4. Hi 

    I copied the example folder as node1 and node2 and started the shards, and got an error saying "could not find collection configname".

    I changed the name of the collection in core.properties on both nodes and it worked after that.

    1. Mukundarama: the steps as documented above should work fine without any errors; no manual editing of core.properties should be needed. (I just tested this to verify the documentation is correct.)

      If you are having problems, please contact the solr-user@lucene mailing list with more specifics.

  5. How can I see which servers are running ZooKeeper from the graph in the Solr Admin UI?

    1. The graph doesn't show that, but the "Dashboard" page shows the command line arguments to Solr, which should include the -DzkHost parameter as well.

      1. Thanks! However, I think that graph notation is quite important information for SolrCloud. I hope Solr 5 can add that to the graph, which may become an everyday admin UI for those who administer SolrCloud.

  6. If one node goes down, for example the one on 8983, then users can send requests to 7574. But how do we make this failure-agnostic, so that users can always hit a particular hostname:port and get results? Do we have to place an Apache load balancer on top of this architecture, or is there another way?

    1. The Java client (SolrJ) includes an object (CloudSolrServer) that talks to ZooKeeper and always knows the cluster state, so it can talk to an active server at all times. To my knowledge, this is the only cloud-aware client there is.

       

      If CloudSolrServer is not an option for you, then you'll need a load balancer sitting in front of your Solr servers, so if one of them goes down, the client can still talk to the load balancer host/port and get through to a working Solr instance.

       

  7. When I followed "Using Multiple ZooKeepers in an Ensemble", Solr threw the following exception:

    SEVERE: null:java.lang.IllegalArgumentException: port out of range:-1
            at java.net.InetSocketAddress.<init>(InetSocketAddress.java:83)
            at java.net.InetSocketAddress.<init>(InetSocketAddress.java:63)
            at org.apache.solr.cloud.SolrZkServerProps.setClientPort(SolrZkServer.java:315)
            at org.apache.solr.cloud.SolrZkServerProps.getMySeverId(SolrZkServer.java:278)
            at org.apache.solr.cloud.SolrZkServerProps.parseProperties(SolrZkServer.java:453)
            at org.apache.solr.cloud.SolrZkServer.parseConfig(SolrZkServer.java:90)
            at org.apache.solr.core.CoreContainer.initZooKeeper(CoreContainer.java:208)

     

    You should not use the parameter "-DzkRun", which means "start the internal embedded ZooKeeper"!

     

  8. I tried to run the simplest setup, 2 nodes, and failed many times.
    The reason was:

    bin/solr fragment

    As you can see, a relative path is used. I started the second node many times, and after playing with all the parameters I noticed there are bootstrap_confdir variables passed to the JVM.
    So:
    1) a relative path to the execution point is used
    2) the check for whether collection1/core.properties exists should be avoided; it can be used only with the example, not for production execution. Very confusing.

    Now it is simple: if I run from a location other than solr.solr.home/.., I will not find "/.solr/collection1/conf".
    I think the implicit bootstrap_confdir setting should be removed from this script and better described on this page.

    1. It seems like you're talking about an older version of Solr and not Solr 5.0 (for which this is the draft ref guide). There have been a lot of changes in Solr 5.0. To begin with, there is no longer a default collection1 or auto/implicit bootstrapping required. Depending upon how you run the script, it will either bootstrap the appropriate configset (there are a few of them now) or just start Solr without any configset bootstrapped.

  9. I got everything working up to stopping the cloud. After stopping, everything gets messed up upon restart. On the second node the cores get duplicated, and if you query a core you get:

    {
      "responseHeader": {
        "status": 503,
        "QTime": 2,
        "params": {
          "q": "*:*",
          "indent": "true",
          "wt": "json",
          "_": "1426103541439"
        }
      },
      "error": {
        "msg": "no servers hosting shard: ",
        "code": 503
      }
    }

    If you try to stop any node and attempt a restart, you can't see the Solr admin UI...

     

    1. Hi Adnan, if you're running multiple Solr instances on a single machine from the same Solr server directory, make sure you start those instances with separate home directories (using the "-s" parameter). Otherwise you'll get that error.

  10. If you want to start Solr in cloud mode and have your first node anywhere other than example/cloud/node1/solr, you also need to copy zoo.cfg to the folder that you pass to the -s parameter. You can copy zoo.cfg from server/solr/zoo.cfg.

    Otherwise you will get a java.lang.IllegalArgumentException in solr.log complaining that zoo.cfg is missing.

      

  11. I have 3 zookeeper vms and two SOLR 4.9 vms with approximately 200K documents that I am able to Search from a web UI

    I want to do the following:

    1. Set up 2 NEW SOLR 5.4 vms in cloud mode using the existing Zookeeper VMs rather than having to build new Zookeeper.
    2. Set up a collection/core in each of them that is named the same as the collection/core on my SOLR 4.9 vm's
    3. copy (and modify if necessary) schema.xml and solrconfig.xml from the SOLR 4.9 vm's and put them on the SOLR 5.4 vm's
    4. Run my SpringBoot microservice that takes my "dataTopic" from Kafka (each message = xml for SOLR) and throw it all at the new SOLR 5.4 vm's for indexing.
    5. Sit back and enjoy the fact that I didn't have to upgrade, I simply built new VM's, added my data, and turned off the old ones.

    The thing is, I cannot find good instructions on how to set up SOLR 5.x in cloud mode in a realistic way.  (Loopback doesn't help me, I need something for Prod.)

    I am not sure where the instanceDir and dataDir should be, and I cannot get a core built via the admin console without having files in place whose proper locations I don't know for sure.

    Has anyone written clear instructions on how to set up SOLR cloud, from scratch, on SOLR 5.x with actual, separate VMs for SOLR and Zookeeper instead of multiple things automagically placed on one box by the script mentioned above?

    The configuration "cost" of SOLR is significant.  I like SOLR a lot, but understand why many are choosing another tool.  I wish for some much clearer documentation.  Can anyone point me at such?  If it does not exist, I will be happy to write the page if those "in the know" will provide me with some guidance.

     

    1. You can have multiple clouds running within a single zookeeper ensemble by using a chroot on the zkHost parameter.  See the "Taking Solr to Production" page from this guide for more info about that.

      You need to use the zkcli script (the one included with Solr, not the one included with Zookeeper) to "upconfig" your config/schema to zookeeper, using the chroot for your new cloud.  The "upconfig" action is covered on the page named "Using ZooKeeper to Manage Configuration Files".

      You are right that this could be made a lot more clear.  If you have suggestions for how to improve this guide, please feel free to discuss them.  You can do this here, but the mailing list might be a better place.  Opening an issue in JIRA for specific documentation issues is also an idea.

      1. Thank you Shawn. I came up with the same key idea myself after pounding on this for about a week. Once I had a new chroot in ZooKeeper, things began to fall into place. The chicken-and-egg question of needing to have the config in place FIRST, when the command for creating a new collection clearly references the config anyway, was a bit of a stumper for a while, but a lucky "mistake" left the configs in there from a previous attempt and that did the trick. When I tried to reproduce from scratch, I realized that the configs had to be there first.

        I have a complete doc on this now.  I've reproduced the same results 5 times by stripping my VM's down to nothing but the OS and doing the entire process again.  I guess I'd propose an additional wiki page entitled something like "Building SolrCloud on Separate Virtual Machines."  If that meets with the approval of the powers-that-be, I can provide the doc.

        Also, forgive ignorance, but the mailing list is....  where?

        =======================

        In case we don't get to a full wiki page, and to assist anyone else struggling as I was, the nutshell is below.  Less detail than I have in my doc, but as Shawn says, this probably isn't the place for the full doc.

        Assumption:  Zookeeper is up and running correctly on separate boxes or VM's reachable by your SOLR boxes or VM's

        1. Install SOLR 5.4 "for production" as described here:
          1. https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production

        2. Upload the configs from your previous version of SOLR (or your tested Dev version) to Zookeeper like this:
          1. sudo /opt/solr/server/scripts/cloud-scripts/zkcli.sh -cmd upconfig -confdir /home/john/conf/ -confname fooBar -z 192.168.56.5/solr5_4

          2. Verify the new node and config directory exists on Zookeeper if you want, using the Zookeeper zkCli.sh tool

        3. Start SOLR 5.4 on two physical boxes or VM's:
          1. sudo /opt/solr/bin/solr restart -c -z 192.168.56.5,192.168.56.6,192.168.56.7/solr5_4
        4. Create the new collection on the new chroot
          1. /opt/solr/bin/solr create -c fooBar -d /home/john/conf -shards 1 -replicationFactor 2
          2. You should see output that clearly states the new collection was created
          3. This happens on the new chroot of solr5_4 because you started your SOLR instances on that chroot in step #3.
        5. Check the collection's status using this command:
          1. /opt/solr/bin/solr healthcheck -z 192.168.56.5,192.168.56.6,192.168.56.7/solr5_4 -c fooBar

          2. You should see a JSON "object" for each node, indicating that the nodes were created on each of the two SOLR nodes started in #3.

        6. Verify the directories are where they should be on each of the two Solr machines

          1. ls /opt/solr/server/solr on each machine should return the following:

            1. machine a:  fooBar_shard1_replica1  

            2. machine b: fooBar_shard1_replica2

        7. Did you make a mistake in Zookeeper?  Here's how to clear it out and try again:

          1. Using the Zookeeper zkCli.sh tool, issue this command:

            1. rmr /solr5_4

            2. Now repeat step #2

            3. NOT tested, but probably will work if you have other things in /solr5_4:

              1. rmr /solr5_4/configs

         

        1. Support resources for Solr, including the mailing lists, can be found here:

          http://lucene.apache.org/solr/resources.html#community

          For your doc, I would not mention virtual machines in the title of the document, because the instructions will apply equally to install on multiple physical servers.  If it says "virtual machines" then someone who is not using VMs may think that the documentation doesn't apply to them.

        2. This was incredibly helpful. Did your doc ever get added to the wiki? I've looked through it a bit and haven't seen anything providing instructions for a similar procedure (setting up a collection without using the included example configsets directly, but instead creating your own config + schema using them as a reference).

          The problem I was having that your post resolved was that for step 4, instead of creating the collection with the solr "create" command, I was trying to do it through the Admin UI, and kept getting errors in the Solr log saying that it couldn't find solrconfig.xml, even though it was both in ZooKeeper under /<solr_root>/configs/<config_name> and in the file system under /opt/solr/server/configsets/<config_name>/conf (I put it there because that's where the example configsets had it).

          It looks like the solr create command works because it specifies the config directory with my config files, whereas the Admin UI couldn't find them even though the config name was selectable through a drop-down menu in the UI; I assume that's because Solr found the path in ZooKeeper under /<solr_root>/configs/<config_name>.

          That last part led me to discover that step (2) is not actually needed; it just creates an unused path in ZooKeeper at /<solr_root>/configs/<config_name>, while the actual path for the configs, which Solr itself creates when the collection is created, is at /<solr_root>/configs/<collection_name>. Instead, I just need to make sure the Solr root in ZooKeeper ("chroot"?) exists before Solr starts. So now my procedure from start to finish looks like this:

          My use case is using one identical set of config files for multiple collections, but now that I'm thinking about it, I suppose it's common to have a different set of config files for each collection, and then you'd probably make <collection name> the same as <config name>, and then uploading the configs to ZooKeeper before starting Solr would be useful because then perhaps creating the collection with the Admin UI would then work.

          To reset everything I use:

          1. Thanks Brent - no, I haven't explored whether just anyone can make a wiki page or whether it requires permission from committers or something similar - and, of course, having been hired to architect a Solr-based solution to replace the GSA, I've been waaay busy. I had intended to set up a blog or something, but (like all lazy developers) having solved the problem, I let that part slide... I still have notes, but they would need editing and cleanup so they don't drive others crazy...

            I would agree that documentation for setting up a "real" SolrCloud environment is sorely lacking.  Personally, I cringe every time I see instructions for an open source project that include doing things on "loopback".  In every case so far that I've tried, moving from that to actual, separate servers or VM's (as you would want to for Prod) proves to be exasperatingly difficult due to all kinds of gotchas hidden by the "magic" scripts provided to set up loopback and/or special examples... (I'm not knocking the examples, they're very helpful, but they mask the gotchas involved in a "start from scratch" scenario..)

            1. Editing this wiki (the reference guide) can only be done by committers. Anyone can edit the old MoinMoin wiki; they just need to create an account and ask for edit permissions on the mailing list or IRC channel.

  12. I am following the tutorial and in the end, the prompt says

    POSTing request to Config API: http://localhost:8983/solr/gettingstarted/config
    {"set-property":{"updateHandler.autoSoftCommit.maxTime":"3000"}}
    Successfully set-property updateHandler.autoSoftCommit.maxTime to 3000

    It is interesting to know you can set (and potentially get) properties through an API. However:

    1. I don't see any documentation in either the Collections API or the Configsets API pages. Is it documented anywhere?
    2. When I tried to POST to that URL myself, it didn't seem to work using either curl or Postman. Is that API callable?

    BTW, a small grammar typo: "The script will start each node in order and shows" -> "The script will start each node in order and show".

    1. The Config API is documented here. You can definitely make modifications using curl or Postman.