How SolrCloud Works

The following sections provide general information about how various SolrCloud features work. To understand these features, it's important to first understand a few key concepts that relate to SolrCloud.

If you are already familiar with SolrCloud concepts and basic functionality, you can skip to the section covering SolrCloud Configuration and Parameters.

Key SolrCloud Concepts

A SolrCloud cluster consists of some "logical" concepts layered on top of some "physical" concepts.

Logical

  • A Cluster can host multiple Collections of Solr Documents.
  • A Collection can be partitioned into multiple Shards, each of which contains a subset of the Documents in the Collection.
  • The number of Shards that a Collection has determines:
    • The theoretical limit to the number of Documents that the Collection can reasonably contain.
    • The amount of parallelization that is possible for an individual search request.

Physical

  • A Cluster is made up of one or more Solr Nodes, which are running instances of the Solr server process.
  • Each Node can host multiple Cores.
  • Each Core in a Cluster is a physical Replica for a logical Shard.
  • Every Replica uses the same configuration specified for the Collection that it is a part of.
  • The number of Replicas that each Shard has determines:
    • The level of redundancy built into the Collection and how fault tolerant the Cluster can be in the event that some Nodes become unavailable.
    • The theoretical limit on the number of concurrent search requests that can be processed under heavy load.
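
To make the two lists above concrete, here is a minimal SolrJ sketch of creating a Collection whose Shard and Replica counts determine its physical footprint. The ZooKeeper address, collection name, and configset name below are placeholders, and the configset is assumed to already exist in ZooKeeper.

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class CreateCollectionSketch {
      public static void main(String[] args) throws Exception {
        // Connect to the Cluster through its ZooKeeper ensemble.
        try (CloudSolrClient client = new CloudSolrClient.Builder()
            .withZkHost("localhost:9983").build()) {
          // 2 Shards x 2 Replicas per Shard = 4 Cores, distributed across
          // the Nodes of the Cluster.
          CollectionAdminRequest.createCollection("mycollection", "myConfig", 2, 2)
              .process(client);
        }
      }
    }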

 


10 Comments

  1. Some discussion on IRC about how confusing it is that this page tries to explain shards in terms of "slices" - AFAICT this is the only page that does so; we should rip it out and give a better example, in simple terms, of (collections, shards, replicas) and how those relate to cores...

    16:46 < wavis:#solr> I think the third paragraph here really needs some cleanup 
                         https://cwiki.apache.org/confluence/display/solr/How+SolrCloud+Works
    16:47 < wavis:#solr> i'm trying to learn about SolrCloud and it seems to indicate that there's a "slice" entity somehow, but it's 
                         very unclear
    16:51 <@hoss:#solr> wavis: from a user perspective there is no concrete concept called a "slice" which is why that page doesn't 
                        bold it ... it's just describing that you can slice up a collection into "shards" (which is a specific 
                        concept in solr that users should be aware of)
    16:52 <@hoss:#solr> you could imagine replacing the words "slice" with "divide up" or "chunk" on that page and read it the same
                        way
    16:52 < wavis:#solr> it says "Collections can be divided into slices" -- how does one do this? what is doing this called?
    16:53 <@hoss:#solr> it's not an action you take ... it's a logical construct .. the number of slices/shards depends on the 
                        properties you define when you create the collection and/or if you take actions to "split" shards later
    16:53 < wavis:#solr> it says "Each slice can exist in multiple copies" -- how is a slice "in" a copy? if a copy is called a 
                         shard, does a shard contain multiple slices?
    16:54 <@hoss:#solr> ie: you do not "slice" ... but you can think of collections as containing multiple "slices" internally
    16:55 <@hoss:#solr> "can be" isn't trying to suggest that it's something *you* can do .. it's describing how you might think 
                        about it behaving internally
    16:57 <@hoss:#solr> wavis: your confusion is totally understandable .. there's been much discussion among developers about the 
                        value of the "slice" concept, would you mind if i archive & make public this exchange as a data point about 
                        how easy it is for a user to understand? (i'll post it as a comment on that page untill we have a chance to 
                        fix the wording)
    16:57 < wavis:#solr> feel free

    1. Agree. Get rid of the term "slice" from all public APIs and documentation. Also see my comment in the related JIRA issue.

      The text on this page should be aligned with the text in Shards and Indexing Data in SolrCloud; there is overlap here.

  2. The discussion about slices came up on IRC again.  I have fixed some blatantly incorrect information and removed "slice" from the page entirely.  Based on the discussion above, this may be an incomplete fix, but at least it is correct now.

  3. The text description above is well-written and quite useful. I've created a simple model to express the verbal description visually, which some wiki users may find useful. It can be found here:

    Solr Cloud Concepts

    Please feel free to leave suggestions for improving the model.

  4. I've been using DSE for the Cassandra+Solr combo, and am looking at moving to straight Solr, but am a little confused.

    With DSE, I've been splitting my data into Solr cores, essentially thinking of each core as a single table, holding documents that are related. Basically, I'm trying to reduce the number of documents I search through by only running a search in a single table/core.

    With the addition of "collection" and "shard" terms, I'm not sure if I now should be using a collection in the way I was using a core before, or would a collection be more similar to a keyspace? In my DSE use, I have a single Cassandra keyspace with six column families (each mapping to a Solr core). When I search, I just search in a single core. With Solr, should I now have six collections instead? Or one collection, with six... cores? shards? 

    Or am I overthinking it, and I should keep it simple and use cores the same way I was with DSE? Then I'd just have a single collection, but I'm a little unsure how many shards I would use. 

    This is in a cluster of somewhere between 12 and 24 nodes, with a replication factor of 3.

    Thanks for any help!

    1. Where you would have used cores in non-cloud mode, when switching to cloud you should typically use collections.  A collection is a logical index.  You cannot create standalone cores in cloud mode – they must belong to a collection.

      Each collection has one or more shards, and each shard may have one or more replicas.  Each replica is one core.  These details allow an index to scale massively, to accommodate a huge index size with shards, to handle a huge query load with replicas, or both.  It sounds like your replication will be more for redundancy than query volume.
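
      To make that concrete, here is a minimal SolrJ sketch of using a collection the way you previously used a core. The ZooKeeper address, collection name, and field name are made up, and the collection is assumed to already exist:

          import org.apache.solr.client.solrj.SolrQuery;
          import org.apache.solr.client.solrj.impl.CloudSolrClient;
          import org.apache.solr.common.SolrInputDocument;

          public class CollectionUsageSketch {
            public static void main(String[] args) throws Exception {
              try (CloudSolrClient client = new CloudSolrClient.Builder()
                  .withZkHost("localhost:9983").build()) {
                // Address the logical collection, never an individual core.
                client.setDefaultCollection("events");

                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "1");
                doc.addField("title_s", "example");  // assumes default dynamic-field rules
                client.add(doc);
                client.commit();

                // The query fans out to one replica of every shard automatically.
                System.out.println(
                    client.query(new SolrQuery("*:*")).getResults().getNumFound());
              }
            }
          }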

  5. If I have two nodes, one collection, with one shard and no replication (replicationFactor=1), then am I correct in assuming that each node has a single core that contains half the documents, and if a document is on node A then it's not on node B (i.e., the two nodes don't contain any of the same documents, or at least documents with the same unique ID)?

    1. If you have one shard and replicationFactor=1, which is what you described, then no. Your entire index would be contained on one node and the other node would not have any part of the index.

      If you have two shards, and use the compositeId router, then yes, it would be as you described.

      With the implicit router, assuming that no effort is made by the user to influence the routing in the request, data would not be distributed – it would exist on the shard that received the indexing request.
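
      A small sketch of the difference, assuming one collection created with the default compositeId router and another created with router.name=implicit and named shards (all names below are made up):

          import org.apache.solr.client.solrj.impl.CloudSolrClient;
          import org.apache.solr.client.solrj.request.UpdateRequest;
          import org.apache.solr.common.SolrInputDocument;

          public class RoutingSketch {
            public static void main(String[] args) throws Exception {
              try (CloudSolrClient client = new CloudSolrClient.Builder()
                  .withZkHost("localhost:9983").build()) {
                // compositeId router: a "prefix!" in the uniqueKey drives hashing,
                // so both documents below land on the same shard.
                SolrInputDocument d1 = new SolrInputDocument();
                d1.addField("id", "customerA!order-1");
                SolrInputDocument d2 = new SolrInputDocument();
                d2.addField("id", "customerA!order-2");
                new UpdateRequest().add(d1).add(d2).process(client, "composite_coll");

                // implicit router: the update targets an explicitly named shard.
                SolrInputDocument d3 = new SolrInputDocument();
                d3.addField("id", "order-3");
                UpdateRequest req = new UpdateRequest();
                req.add(d3);
                req.setParam("_route_", "shardA");  // shard named at creation time
                req.process(client, "implicit_coll");
              }
            }
          }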

      1. If I have shards=1, replicationFactor=2, then both nodes contain the complete index and complete set of non-indexed data? 

        Is there a way to make it so I have a cluster of, say, 12 nodes, where each node has the complete index but the non-indexed data is only replicated twice, so one node can safely go down?

        Am I correct that a way to do this is to use shards=12, replicationFactor=2, and the compositeId router? Then what is the best way to make a query that gets the top results from the "complete" index (on all 12 nodes)? Do I have to explicitly query each node, or is there some sort of internal mechanism that handles gathering, scoring, and sorting the results from all the nodes?

        I don't mind having documents just indexed on the node they're added to (assuming RF=2 ensures they get copied to one other node), as I'll have an application on each node generating documents and adding them locally at the same rate, so the number of docs/node would be fairly even.

        With implicit router, would a query to node A only search the index on node A?

        Thanks for your help!

        1. This belongs on the mailing list, not here.  I will reply one more time, but please use the IRC channel or the mailing list if you have further questions.  After these questions and answers have been here for a while, they will be deleted.

          If you have one shard and two replicas, then a cloud of two nodes would have the complete index on both nodes.

          For your next example, I have no idea what you mean by "non-indexed data."  If you want 12 nodes to all have the complete index, you need numShards=1 and replicationFactor=12.

          The ADDREPLICA feature of the Collections API can be used to add replicas after initial collection creation is done.

          When you query a collection, Solr will query all shards unless you include a "distrib=false" parameter.  That parameter forces the query to be handled only by the shard replica that received the request.
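
          Both points, as a minimal SolrJ sketch (the ZooKeeper address, collection name, and shard name are placeholders):

              import org.apache.solr.client.solrj.SolrQuery;
              import org.apache.solr.client.solrj.impl.CloudSolrClient;
              import org.apache.solr.client.solrj.request.CollectionAdminRequest;

              public class DistribSketch {
                public static void main(String[] args) throws Exception {
                  try (CloudSolrClient client = new CloudSolrClient.Builder()
                      .withZkHost("localhost:9983").build()) {
                    client.setDefaultCollection("mycollection");

                    // Default: the query fans out to one replica of every shard.
                    SolrQuery all = new SolrQuery("*:*");

                    // distrib=false: only the core receiving the request is searched.
                    SolrQuery local = new SolrQuery("*:*");
                    local.set("distrib", false);

                    System.out.println(client.query(all).getResults().getNumFound());
                    System.out.println(client.query(local).getResults().getNumFound());

                    // ADDREPLICA: add another copy of shard1 after creation.
                    CollectionAdminRequest.addReplicaToShard("mycollection", "shard1")
                        .process(client);
                  }
                }
              }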