
When your collection is too large for one node, you can break it up and store it in sections by creating multiple shards.

A Shard is a logical partition of the collection, containing a subset of documents from the collection, such that every document in a collection is contained in exactly one Shard. Which shard contains each document in a collection depends on the overall "Sharding" strategy for that collection. For example, you might have a collection where the "country" field of each document determines which shard it is part of, so documents from the same country are co-located. A different collection might simply use a "hash" on the uniqueKey of each document to determine its Shard.

Before SolrCloud, Solr supported Distributed Search, which allowed one query to be executed across multiple shards, so the query was executed against the entire Solr index and no documents would be missed from the search results. So splitting an index across shards is not exclusively a SolrCloud concept. There were, however, several problems with the distributed approach that SolrCloud was designed to address:

  1. Splitting an index into shards was somewhat manual.
  2. There was no support for distributed indexing, which meant that you needed to explicitly send documents to a specific shard; Solr couldn't figure out on its own what shards to send documents to.
  3. There was no load balancing or failover, so if you received a high number of queries, you needed to figure out where to send them yourself, and if one shard died it was simply gone.

SolrCloud addresses all of these problems. It supports distributing both indexing and queries automatically, and ZooKeeper provides failover and load balancing. Additionally, every shard can have multiple replicas for added robustness.

In SolrCloud there are no masters or slaves. Instead, every shard consists of at least one physical replica, exactly one of which is a leader. Leaders are automatically elected, initially on a first-come-first-served basis, and then based on the ZooKeeper leader election process described at http://zookeeper.apache.org/doc/trunk/recipes.html#sc_leaderElection.

If a leader goes down, one of the other replicas is automatically elected as the new leader.

When a document is sent to a Solr node for indexing, the system first determines which Shard that document belongs to, and then which node is currently hosting the leader for that shard.  The document is then forwarded to the current leader for indexing, and the leader forwards the update to all of the other replicas.

Document Routing

Solr offers the ability to specify the router implementation used by a collection by specifying the router.name parameter when creating your collection. If you use the (default) "compositeId" router, you can send documents with a prefix in the document ID which will be used to calculate the hash Solr uses to determine the shard a document is sent to for indexing. The prefix can be anything you'd like it to be (it doesn't have to be the shard name, for example), but it must be consistent so Solr behaves consistently. For example, if you wanted to co-locate documents for a customer, you could use the customer name or ID as the prefix. If your customer is "IBM", for example, with a document with the ID "12345", you would insert the prefix into the document id field: "IBM!12345". The exclamation mark ('!') is critical here, as it separates the routing prefix from the rest of the document ID.
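For example, with the XML update format, the prefix is simply embedded in the value of the id field; no other change to the update request is needed (a minimal sketch, with illustrative field values):

  <add>
    <doc>
      <field name="id">IBM!12345</field>
      <field name="name">Some document about IBM</field>
    </doc>
  </add>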

Then at query time, you include the prefix(es) in your query with the _route_ parameter (i.e., q=solr&_route_=IBM!) to direct queries to specific shards. In some situations, this may improve query performance because it avoids the network latency of querying all the shards.
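For example, a complete query URL using this parameter might look like the following (a sketch; the host, port, and collection name are illustrative):

  http://localhost:8983/solr/collection1/select?q=solr&_route_=IBM!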

The _route_ parameter replaces shard.keys, which has been deprecated and will be removed in a future Solr release.

The compositeId router supports prefixes containing up to 2 levels of routing. For example, a prefix routing first by region, then by customer: "USA!IBM!12345".

Another use case could be if the customer "IBM" has a lot of documents and you want to spread them across multiple shards. The syntax for such a use case would be: "shard_key/num!document_id", where /num is the number of bits from the shard key to use in the composite hash.

So "IBM/3!12345" will take 3 bits from the shard key and 29 bits from the unique doc id, spreading the tenant over 1/8th of the shards in the collection. Likewise if the num value was 2 it would spread the documents across 1/4th the number of shards. At query time, you include the prefix(es) along with the number of bits into your query with the _route_ parameter (i.e., q=solr&_route_=IBM/3!) to direct queries to specific shards.

If you do not want to influence how documents are stored, you don't need to specify a prefix in your document ID.

If you defined the "implicit" router when you created the collection, you can additionally define a router.field parameter that uses a field from each document to identify the shard where the document belongs. If the specified field is missing from a document, however, the document will be rejected. You can also use the _route_ parameter to name a specific shard.
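For example, such a collection might be created as follows (a sketch; the collection name, shard names, and field name are illustrative):

  http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&router.name=implicit&shards=shard1,shard2,shard3&router.field=shard_label

Each document would then carry a shard_label field whose value names the shard it should be indexed to, e.g. shard_label=shard2.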

Shard Splitting

When you create a collection in SolrCloud, you decide on the initial number of shards to be used. But it can be difficult to know in advance the number of shards that you need, particularly when organizational requirements can change at a moment's notice, and the cost of finding out later that you chose wrong can be high, involving creating new cores and re-indexing all of your data.

The ability to split shards is in the Collections API. It currently allows splitting a shard into two pieces. The existing shard is left as-is, so the split action effectively makes two copies of the data as new shards. You can delete the old shard at a later time when you're ready.

More details on how to use shard splitting are in the section on the Collection API's SPLITSHARD command.
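For example, splitting shard1 of a collection might look like this (a sketch; the collection name is illustrative):

  http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1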

Ignoring Commits from Client Applications in SolrCloud

In most cases, when running in SolrCloud mode, indexing client applications should not send explicit commit requests. Rather, you should configure auto commits with openSearcher=false and auto soft-commits to make recent updates visible in search requests. This ensures that auto commits occur on a regular schedule in the cluster. To enforce a policy where client applications should not send explicit commits, you should update all client applications that index data into SolrCloud. However, that is not always feasible, so Solr provides the IgnoreCommitOptimizeUpdateProcessorFactory, which allows you to ignore explicit commit and/or optimize requests from client applications without having to refactor your client application code. To activate this request processor, you'll need to add a chain like the following to your solrconfig.xml:
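  <updateRequestProcessorChain name="ignore-commit-from-client" default="true">
    <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
      <!-- respond to ignored commits with HTTP 200 so clients see success -->
      <int name="statusCode">200</int>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.DistributedUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>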

As shown in the example above, the processor will return 200 to the client but will ignore the commit / optimize request. Notice that you need to wire in the implicit processors needed by SolrCloud as well, since this custom chain is taking the place of the default chain.

In the following example, the processor will raise an exception with a 403 code and a customized error message:
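  <updateRequestProcessorChain name="ignore-commit-from-client" default="true">
    <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
      <!-- reject ignored commits with HTTP 403 and a custom message -->
      <int name="statusCode">403</int>
      <str name="responseMessage">Thou shall not issue a commit!</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.DistributedUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>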

Lastly, you can also configure it to just ignore optimize requests and let commits pass through, by doing something like:
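  <updateRequestProcessorChain name="ignore-optimize-only-from-client-403">
    <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
      <!-- only optimize requests are ignored; explicit commits are processed -->
      <bool name="ignoreOptimizeOnly">true</bool>
      <str name="responseMessage">Thou shall not issue an optimize, but commits are OK!</str>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>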


20 Comments

  1. The new stuff from SOLR-4221 is fine in the Collections API page, but this page also needs to be updated with more conceptual information to describe the new capabilities - particularly the differences between the two router types and when to use them.

  2. From reading SOLR-5017, it seems this page needs to be updated to deprecate shard.keys in favor of _route_. Is that the case?

  3. Hi folks (especially Cassandra),

    I don't fully understand nodes and shards management from reading the doc:

    First it is specified that one cannot add shards, then that one can add replica-only shards, and then the last "Shard Splitting" paragraph states that something changed starting with Solr 4.3.

    But it doesn't state that splitting shards can result in a new non-replica shard, on a newly added node, thus increasing the amount of storage available to the index / collection. It states that the "split action effectively makes two copies of the data as new shards" instead, which sounds a lot like replica-style shards.

    So does it?

    Could there be some sort of tutorial describing how to add storage capacity to an index / collection, by adding a node / shard-core that one can send new documents to for indexing? (Of course, load balancing would be triggered, so it looks like documents would be added to shards out of a set of nodes.)

    Thank you for your involvement!

    Charles

  4. On this page it is written: 
    If the machine is a leader, SolrCloud determines which shard the document should go to, forwards the document the leader for that shard, indexes the document for this shard, and forwards the index notation to itself and any replicas.
    Could someone please explain what "forwards the index notation" means?
    1. I think it means a signal (i.e., a notification) saying "there's an update for the index, please come take it and update your index accordingly".

  5. Suppose I want to establish a collection with ten million news articles, and I design the document ID as, for example, "PY2001!12345", where PY2001 is the publication year of the news article, i.e., I want one shard per publication year. When I want to search articles from year 2001, I can specify _route_=PY2001! in the query URL. However, in the real world we normally search articles across multiple publication years, so my question is: "Can I query with multiple _route_ parameters, say _route_=PY2001!&_route_=PY2002!, if I want to search articles from 2001 to 2002?" BTW, it's not necessary to use the exclamation mark, is it? I could also use an underscore, e.g. PY2001_, as long as I use it consistently in the route parameter, right?


    1. You can specify multiple values for _route_ by separating them with the comma character, e.g., _route_=PY2001!,PY2002!,PY2003! and so on. No, separating by underscore is not supported. The exclamation mark is the only supported separator for composite keys.

      1. Just like shards parameter, I see. Thanks!

  6. I have 4 questions and hope for answers and suggestions: 

    Q1. Some articles say the implicit router is custom sharding, since we decide which shard docs will be sent to, whereas other articles say the compositeId router is custom sharding, because we specify a shard key to instruct Solr which shard the docs will be sent to. Is there a clear answer? 

    Q2. When using the compositeId router, we prefix the shard key to the doc id. Say our shard key is IBM!, and the doc id is 12345. If we update the doc with XML, do we have to specify the doc id as "<DOCID>IBM!12345</DOCID>..." in the XML?

    Q3. Why do we have to specify numShards if we want to use the compositeId router?

    Q4. It seems that the shard key idea in my previous post in this article has a practical problem. If I specify the shard key in the form 'PYyyyy', where yyyy is the year of the published date, I don't know what value of numShards to set in advance, since there are always news articles of a new year to be added into Solr. What shard key design is more practical for the news articles case?

    1. Q1: A better name for the implicit router is "manual" ... it is inherently custom sharding.  If you use compositeId with shard keys, then that becomes custom sharding also.

      Q2: I'm going to show my ignorance here and say that I really have no idea how shard keys work. I know they somehow control document routing, but I couldn't tell you what they do.  If I were to research it, I would likely be able to figure it out, but it hasn't been a priority for me.

      Q3: The compositeId's default operation, when you aren't telling it what to do with a shard key, is to automatically split the documents evenly between the shards.  It does this by calculating a hash on the uniqueKey field and comparing it to the hash ranges on each shard in the clusterstate.  In order to calculate the hash ranges on each shard when the collection is initially created, Solr must know how many shards there are.

      Q4: If you are creating a collection for time-series data where new shards will be added on a regular basis, I think you should be using the implicit router.  You just create a new shard when your time period rolls over, and send all new documents explicitly to the new shard.  You can query the entire collection by leaving off the shards parameter, or include the shards parameter to limit the search to only a subset of your data.  A company named loggly uses SolrCloud in this manner.

      1. Thank you very much, Shawn. After a bit research on "custom" things, I come to a conclusion: Implicit router is doing custom sharding whereas compositeId is doing custom hashing. They are kinda different things. You can refer to my answer to my own question on stackoverflow: http://stackoverflow.com/questions/32343813/custom-sharding-or-auto-sharding-on-solrcloud

      2. After more study, I believe the answer to Q2 is yes. compositeId is the default router, and Solr just looks at the id field for routing. If there's no ! or / in it, Solr will use entirely its own algorithm to do the hashing. This is what I understand so far.

        And for Q3, I think numShards could be used as a kind of denominator in some formula in the hashing function. So it's a necessity.

  7. I think this is not an adequate description: "A shard is a way of splitting a core over a number of "servers"". In previous articles, it is clarified that a shard is a logical concept and a core is a physical concept. This might confuse readers again. I'd suggest describing it this way: "A shard is a way of splitting the data that represents a collection...", blah blah blah.

    1. yeah ... that intro section was full of very out of date terminology which i've tried to clean up.

  8. It would be helpful if Q2 from above were addressed in the documentation:

    Q2. When using the compositeId router, we prefix the shard key to the doc id. Say our shard key is IBM!, and the doc id is 12345. If we update the doc with XML, do we have to specify the doc id as "<DOCID>IBM!12345</DOCID>..." in the XML?

    If an update specifies only the non-routed id, will SolrCloud select the correct shard for updating?

    If an update specifies a different route, will SolrCloud delete the previous document with the same id but with the different routing? (Will it effectively change which shard the document is stored on?)

  9. There is a typo on this page. Thou shall not issue a commit => You shall not issue a commit

    1. "Thou" isn't a typo, actually, it's an alternate (but archaic for most English speakers) form of "you".

  10. I have an index of around 300GB, and I was earlier using a 5-shard, 10-node SolrCloud setup on Solr 4.2. We moved to Solr 6.3 with shard splitting in mind.

    I tried shard splitting with Solr 6.3, following these steps:

    Step 1: I issued "collections?action=SPLITSHARD&collection=<collectionName>&shard=shard1"

    Step 2: I noticed that 2 child shards got created, shard1_0 and shard1_1.

    Step 3: After step 2 completed, I still see shard1 in state "active", and shard1_0 and shard1_1 in state "construction".

    I checked the state in state.json for nearly 48 hours, but the data copying froze after reaching a certain point (for example: with 60GB of data in the parent node, after splitting, each child node got 24GB of data, then the copying into the children stopped). The state.json file was not changing any further.

    Moreover, when I manually changed state.json (parent node from active to inactive, and child nodes from construction to active) I suffered a huge loss of data. I read that this issue was fixed in 6.3, but I am still facing it. Any help regarding this will be greatly appreciated.

  11. The IgnoreCommitOptimizeUpdateProcessorFactory is just as useful in non-cloud mode, if the client application abuses commit/optimize. I just installed it at a customer with master/slave. So why is it only discussed under SolrCloud?

    1. There's no reason I'm aware of. The URPs are not covered in much depth in the Guide today (note that Update Request Processors mostly just provides brief descriptions and links to javadocs). I think in general any example of any URP in the Guide is not meant to be exclusive of any other appropriate uses of that URP.