Status

Current state: Under Discussion

Discussion thread: https://lists.apache.org/thread.html/r186364d4d22a6301887b54023cb3db48a5324f197590a3b3e95535fd%40%3Cdev.solr.apache.org%3E

JIRA: SOLR-15636

Released: <Solr Version>

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast). Confluence supports inline comments that can also be used.

Motivation

Many organizations are frustrated with Solr Cloud deployments due to the perceived cost of managing a separate, dedicated Apache ZooKeeper ensemble. We can reduce this complexity by running our own embedded ZooKeeper ensemble, based on ZOOKEEPER-3874, which was released with ZooKeeper 3.7.

This ensemble should be launched automatically from Solr processes, and dynamically configure quorum information.

There is some overlap between the motivations of this SIP and SIP-5 Coordination Module + Apache Curator, but the two approaches should be complementary.

Public Interfaces

We will need to create APIs for retrieving quorum status from a Solr node. This may include determining whether the node is serving as part of a quorum, which quorum it is connected to, and information about other quorum members (ports, addresses) that observers need in order to join. We may also need APIs for instructing nodes to join or depart a particular quorum.

The full extent of the necessary APIs is not yet determined.


We will need to expose additional ports from Solr nodes for ZK functionality. This will likely include the ZK secureClientPort, and possibly the serverPort, electionPort, Admin port and others.

Proposed Changes

The work falls into several phases.

Migrate Unit Tests to use ZooKeeperServerEmbedded (ZKSE)

Currently, our unit tests use a fragile construction for an embedded ZooKeeper. To build confidence in running an embedded ZooKeeper in production settings, we should ensure that our test framework uses the same APIs.

Migrate ZKRun implementation to use ZKSE

When we launch a Solr service in "cloud" mode without specifying a ZooKeeper host to connect to, it launches its own ZooKeeper service on a separate port.

This is the simplest usage of an embedded ZooKeeper server that we currently have. It does not use quorums, and its lifecycle is tied to that of the parent Solr node.

Create an auto-clustering implementation for several ZKRun nodes

This approach may not be feasible for service discovery, but would be the ultimate goal of our efforts.

For example, we would start three Solr nodes each with ZKSE, and instruct all of the ZK servers to form a cluster. There may be ordering issues to resolve here, as well as concerns about service discovery for other Solr nodes.
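As a rough illustration of the three-node example above, each embedded server would need a configuration that lists every ensemble member. The sketch below builds such a configuration with plain `java.util.Properties`. The class name, the host names (`solr1` to `solr3`), and the choice of ports 2888 and 3888 are assumptions made for illustration; the keys themselves (`tickTime`, `initLimit`, `syncLimit`, `clientPort`, `server.N`) are standard ZooKeeper settings. How these properties would actually be handed to ZKSE is not settled by this SIP.

```java
import java.util.List;
import java.util.Properties;

/** Sketch only: builds ZooKeeper server properties for a small embedded ensemble. */
final class EmbeddedQuorumConfig {

    /**
     * Returns properties that list every ensemble member as server.N=host:2888:3888.
     * The host names are hypothetical; 2888 (quorum) and 3888 (election) are the
     * ports conventionally used in ZooKeeper examples, not a Solr decision.
     */
    static Properties forEnsemble(List<String> hosts, int clientPort) {
        Properties props = new Properties();
        props.setProperty("tickTime", "2000");   // base time unit in milliseconds
        props.setProperty("initLimit", "10");    // ticks a follower may take to connect and sync
        props.setProperty("syncLimit", "5");     // ticks a follower may lag behind the leader
        props.setProperty("clientPort", Integer.toString(clientPort));
        for (int i = 0; i < hosts.size(); i++) {
            int serverId = i + 1;                // ZooKeeper server ids start at 1 here
            props.setProperty("server." + serverId, hosts.get(i) + ":2888:3888");
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = forEnsemble(List.of("solr1", "solr2", "solr3"), 2181);
        // Print the entries in a stable order for inspection.
        props.stringPropertyNames().stream().sorted()
            .forEach(key -> System.out.println(key + "=" + props.getProperty(key)));
    }
}
```

Every node receives the same member list but must also know its own server id, which is one source of the ordering and discovery concerns mentioned above.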

Compatibility, Deprecation, and Migration Plan

Existing users will be able to continue to run Solr Cloud with an external ZooKeeper quorum.

Major Risks

ZooKeeper services launched this way may be subject to Solr availability: if the server is exhausted by too many queries or by bad queries, that may adversely impact the health of the whole cluster rather than causing isolated failures on given replicas. This should be mitigated by offering multiple ZK services in a quorum that can tolerate individual node failure, but it may be enough motivation to use a larger default quorum size of 5 or 7 members instead of the minimal 3-node setup.
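The trade-off between quorum sizes can be made concrete with the majority rule that ZooKeeper's default quorum uses: an ensemble stays available as long as more than half of its voting members are up. The following is a minimal sketch of that arithmetic; the class and method names are hypothetical and are not part of Solr or ZooKeeper.

```java
/** Sketch only: majority-quorum arithmetic for an ensemble of voting members. */
final class QuorumTolerance {

    /** Smallest number of members that forms a strict majority. */
    static int majority(int ensembleSize) {
        if (ensembleSize < 1) {
            throw new IllegalArgumentException("ensembleSize must be positive");
        }
        return ensembleSize / 2 + 1;
    }

    /** Number of members that can fail while a majority still remains. */
    static int toleratedFailures(int ensembleSize) {
        return ensembleSize - majority(ensembleSize);
    }

    public static void main(String[] args) {
        for (int size : new int[] {3, 5, 7}) {
            System.out.println(size + " members: majority " + majority(size)
                + ", tolerates " + toleratedFailures(size) + " failure(s)");
        }
    }
}
```

Under this rule a 3-member ensemble tolerates one failed node, a 5-member ensemble two, and a 7-member ensemble three, which is the reason a larger default may be attractive when ZK shares a process with a busy Solr node.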

Security considerations

When running our own ZK services, the security of ZK becomes our responsibility instead of being something that we can delegate. The ZK servers that we start should be secure by default, using available authentication methods and practices.

Test Plan

 [ TBD ]

Rejected Alternatives

  • Continue to launch the embedded ZK process the same way that we do now. This is an unattractive proposal because we would be tied to ZK internals, which are subject to change and not part of ZooKeeper's public APIs.
  • SOLR-7099: bin/solr -cloud mode should launch a local ZK in its own process using zkcli's runzk option (instead of embedded in the first Solr process)
  • SOLR-7074: Simple script to start external Zookeeper
  • SOLR-6734: Standalone solr as *two* applications -- Solr and a controlling agent

2 Comments

  1. This is promising! Question: Would this mode also be valuable for Kubernetes deployments, i.e. could we get rid of the ZookeeperOperator and instead let the SolrOperator keep track of which Solr pods also act as ZK nodes?

    Would we allow a Solr node to start in a ZK-only mode, i.e. not eligible for collections/cores/overseer? This would also support those huge clusters where you want dedicated ZKs.