Read and Write Side Fault Tolerance
SolrCloud supports elasticity, high availability, and fault tolerance in reads and writes. What this means, basically, is that when you have a large cluster, you can always make requests to the cluster: reads will return results whenever possible, even if some nodes are down, and writes will be acknowledged only if they are durable; i.e., you won't lose data.

Read Side Fault Tolerance

In a SolrCloud cluster, each individual node load balances read requests across all the replicas in a collection. You still need a load balancer on the 'outside' that talks to the cluster, or you need a smart client that understands how to read and interact with Solr's metadata in ZooKeeper and needs only the address of the ZooKeeper ensemble to begin discovering which nodes it should send requests to. (Solr provides a smart Java SolrJ client called CloudSolrClient.)
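For example, a minimal SolrJ sketch of the smart-client approach might look like the following; the ZooKeeper hosts and collection name are placeholders, and the Builder API shown assumes a 6.x SolrJ version:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CloudQueryExample {
  public static void main(String[] args) throws Exception {
    // Only the ZooKeeper ensemble address is needed: the client discovers
    // live Solr nodes from the cluster state and load balances across them.
    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("zk1:2181,zk2:2181,zk3:2181")    // placeholder ensemble
        .build()) {
      client.setDefaultCollection("gettingstarted"); // placeholder collection
      QueryResponse rsp = client.query(new SolrQuery("*:*"));
      System.out.println("numFound: " + rsp.getResults().getNumFound());
    }
  }
}
```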

Even if some nodes in the cluster are offline or unreachable, a Solr node will be able to correctly respond to a search request as long as it can communicate with at least one replica of every shard, or one replica of every relevant shard if the user limited the search via the 'shards' or '_route_' parameters. The more replicas there are of every shard, the more likely the Solr cluster will be able to serve search requests in the event of node failures.

zkConnected

A Solr node will return the results of a search request as long as it can communicate with at least one replica of every shard that it knows about, even if it cannot communicate with ZooKeeper at the time it receives the request. This is normally the preferred behavior from a fault tolerance standpoint, but it may result in stale or incorrect results if there have been major changes to the collection structure that the node has not been informed of via ZooKeeper (i.e., shards may have been added or removed, or split into sub-shards).

A zkConnected header is included in every search response, indicating whether the node that processed the request was connected with ZooKeeper at the time:

Solr Response with zkConnected
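A representative response might look like the following; the query and the numeric values are illustrative:

```json
{
  "responseHeader": {
    "status": 0,
    "zkConnected": true,
    "QTime": 20,
    "params": {
      "q": "*:*"
    }
  },
  "response": {
    "numFound": 107,
    "start": 0,
    "docs": [ "..." ]
  }
}
```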

shards.tolerant

In the event that one or more shards queried are completely unavailable, Solr's default behavior is to fail the request. However, there are many use cases where partial results are acceptable, so Solr provides a boolean shards.tolerant parameter (default 'false'). If shards.tolerant=true, then partial results may be returned. If the returned response does not contain results from all the appropriate shards, then the response header contains a special flag called 'partialResults'. The client can specify 'shards.info' along with the 'shards.tolerant' parameter to retrieve more fine-grained details.
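For example, a request along these lines (host, port, and collection name are placeholders) asks Solr to return whatever results it can gather from the shards that are still reachable:

```
http://localhost:8983/solr/gettingstarted/select?q=*:*&shards.tolerant=true&shards.info=true
```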

Example response with partialResults flag set to 'true':

Solr Response with partialResults
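A representative response might look like the following; the numeric values are illustrative:

```json
{
  "responseHeader": {
    "status": 0,
    "zkConnected": true,
    "partialResults": true,
    "QTime": 20,
    "params": {
      "q": "*:*",
      "shards.tolerant": "true"
    }
  },
  "response": {
    "numFound": 77,
    "start": 0,
    "docs": [ "..." ]
  }
}
```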

Write Side Fault Tolerance

SolrCloud is designed to replicate documents to ensure redundancy for your data, and enable you to send update requests to any node in the cluster. That node will determine if it hosts the leader for the appropriate shard, and if not it will forward the request to the leader, which will then forward it to all existing replicas, using versioning to make sure every replica has the most up-to-date version. If the leader goes down, another replica can take its place. This architecture enables you to be certain that your data can be recovered in the event of a disaster, even if you are using Near Real Time Searching.

Recovery

A transaction log is created for each node so that every change to content or organization is noted. The log is used to determine which content in the node should be included in a replica. When a new replica is created, it refers to the leader and the transaction log to know which content to include. If this process fails, it is retried.

Since the transaction log consists of a record of updates, it allows for more robust indexing: if indexing is interrupted, uncommitted updates can be replayed.

If a leader goes down, it may have sent requests to some replicas and not others. So when a new potential leader is identified, it runs a sync process against the other replicas. If this is successful, everything should be consistent, the leader registers as active, and normal actions proceed. If a replica is too far out of sync, the system asks for a full replication/replay-based recovery.

If an update fails because cores are reloading schemas and some have finished but others have not, the leader tells the nodes that the update failed and starts the recovery procedure. 

Achieved Replication Factor

When using a replication factor greater than one, an update request may succeed on the shard leader but fail on one or more of the replicas. For instance, consider a collection with one shard and a replication factor of three. In this case, you have a shard leader and two additional replicas. If an update request succeeds on the leader but fails on both replicas, for whatever reason, the update request is still considered successful from the perspective of the client. The replicas that missed the update will sync with the leader when they recover.

Behind the scenes, this means that Solr has accepted updates that are only on one of the nodes (the current leader). Solr supports the optional min_rf parameter on update requests, which causes the server to return the achieved replication factor for the update request in the response. For the example scenario described above, if the client application included min_rf >= 1, then Solr would return rf=1 in the Solr response header because the request only succeeded on the leader. The update request will still be accepted, as the min_rf parameter only tells Solr that the client application wishes to know what the achieved replication factor was for the update request. In other words, min_rf does not mean Solr will enforce a minimum replication factor, as Solr does not support rolling back updates that succeed on a subset of replicas.

On the client side, if the achieved replication factor is less than the acceptable level, then the client application can take additional measures to handle the degraded state. For instance, a client application may want to keep a log of which update requests were sent while the state of the collection was degraded, and then resend the updates once the problem has been resolved. In short, min_rf is an optional mechanism for a client application to be warned that an update request was accepted while the collection was in a degraded state, as in the sketch below.
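A minimal SolrJ sketch of this pattern might look like the following; the ZooKeeper hosts, collection name, and desired replication factor of 2 are assumptions for illustration:

```java
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;

public class MinRfExample {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("zk1:2181,zk2:2181,zk3:2181")   // placeholder ensemble
        .build()) {

      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-1");

      UpdateRequest req = new UpdateRequest();
      req.add(doc);
      // Ask Solr to report the achieved replication factor for this update.
      req.setParam("min_rf", "2");

      UpdateResponse rsp = req.process(client, "mycollection"); // placeholder collection

      // Solr reports the achieved replication factor as 'rf' in the response header.
      Object rf = rsp.getResponseHeader().get("rf");
      if (rf != null && Integer.parseInt(rf.toString()) < 2) {
        // Degraded state: for example, log the update so it can be resent
        // once the collection has recovered.
        System.err.println("Update accepted with rf=" + rf + "; queue for resend");
      }
    }
  }
}
```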

 

9 Comments

  1. This is probably a good place to add details on the shards.tolerant parameter.

  2. I don't quite understand this: "You still need a load balancer on the 'outside' that talks to the cluster". If it's already a cluster, what's the role of an external load balancer? To do what? To balance what? In our current environment, we use HAProxy as a load balancer since there are multiple Solr slave servers. But if a SolrCloud cluster knows how to load balance things, there's no reason to use an external load balancer IMO.

    1. In SolrCloud, a client can make a request to any node and SolrCloud will take care of routing/forwarding it to the right node in the cluster. However, load balancing is still required so that the client doesn't send all requests to a single node. If you are using Java, then the CloudSolrClient class in SolrJ performs intelligent routing and load balancing functions and no further setup is required. But if you are not using a Java client, then you must perform such load balancing yourself via HAProxy, etc.

      1. In our environment, if we send requests to slave 1, the work will also be done by slave 1, so we need HAProxy to distribute requests among multiple slaves. However, in your reply you said "a client can make a request to any node and SolrCloud will take care of routing/forwarding it to the right node in the cluster"; doesn't that mean that even if one node receives many requests, it will pass some of them along to other "right nodes" to do the work? I thought this was already load-balancing behavior. Do I misunderstand something here?

        BTW, I thought a cluster means "you don't worry about what's inside it, it acts as one whole thing", so why should I specifically send requests to some node inside it? Can't I just send requests to the "cluster"? This raises another question: "Can I just send a query request without specifying shards and router parameters?".

        Note: After reviewing this guide, I think I may understand what you're talking about. Do you mean SolrCloud doesn't have a cluster-level endpoint to accept requests?

        1. doesn't that mean that even if one node receives many requests, it will pass the requests along to other "right nodes" to do tasks? I thought this was already load-balancing behavior. Do I misunderstand something here?

          Well, yes and no. The node which receives the request from the client will attempt to load balance query requests among all shards/replicas. But say you have a 100-node cluster: you can't expect a single node to handle all the client traffic intended for the entire cluster. Even if the node forwards requests and load balances among other nodes (including itself), its thread capacity is limited. Also, if all your clients are sending requests to one node only, then that node is a single point of failure for the entire cluster (from a client's perspective). So a better way is for your clients themselves to load balance between all available servers (which is done automatically by CloudSolrClient if you are using Java/SolrJ).

          BTW, I thought cluster means "you don't worry about what's inside it, it acts as a whole one thing", so why should I specifically send requests to some node inside it? Can't I just send requests to the "cluster"?

          That is the dream. But we're still talking HTTP and sockets so a client has to send a request to a specific node. How that node is selected affects the operational characteristics. The client is as much part of the cluster as the servers themselves.

          Can I just send a query request without specifying shards and router parameters?

          Yes, absolutely. That is indeed how most query requests are sent. Most of the time you don't even know which shard has the data you want to query, so Solr will end up querying all of them. The "shards" and "_route_" parameters are simply there for limiting the subset of nodes to be queried if you, the client, know beforehand the shard on which you want to search (which is very common in multi-tenant search solutions).

          1. Very subtle reply. I appreciate it! Thanks!

  3. When using CloudSolrClient to add documents and then update them (in effect, rebuilding the index), I have found that the results are often inconsistent, which can be seen from the _version_ values not being consistent. My initial suspicion is that the load-balanced hash mapping is not correct; in other words, updating the index through CloudSolrClient may leave multiple (one per shard?) versions of the data. Please enlighten me. Thanks!

    1. The _version_ field will be auto-set by Solr every time you index, and will not be the same in different indexing runs.  You can't actually set this field in your indexing requests.  If you include it, it is used to control optimistic concurrency and may result in the new document NOT being indexed.

    2. In the future, please use the mailing list or the IRC channel for support requests.  These comments are for help with the documentation itself, not for support.

      http://lucene.apache.org/solr/resources.html