Cluster Membership: ZOOKEEPER-107

Motivation

ZooKeeper clusters are currently static: the set of servers participating in a cluster is statically defined in a configuration file. In many instances, however, it is desirable to be able to add servers to an ensemble and remove servers from it. The difficulty in implementing such a feature is making sure that a change in the configuration does not cause inconsistencies in a ZooKeeper cluster. A related issue is enabling a client to learn the current ensemble of servers dynamically.

Requirements

  1. The ensemble of servers must agree upon the current configuration, and to reach agreement, Zab sounds like the obvious choice;
  2. We need new client calls to add and remove servers. It is unclear whether we want one call for each modification or one call to propose a whole new configuration;
  3. It must work with both majority and flexible quorums;
  4. We need a mechanism, perhaps based on URIs, to enable a client to learn the current configuration.

Some pre-design random thoughts

When moving from one configuration to another, we need to make sure that a quorum of the old configuration and a quorum of the new configuration commit to the new configuration. A quorum of the old configuration needs to agree to avoid a split-brain problem, for example, when adding more servers. A quorum of the new configuration needs to agree for progress. We also need to make sure that a quorum of the old configuration confirms first, otherwise a partition could cause a split-brain problem.
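
The commit condition above can be sketched as a check that the same set of acknowledgements contains a quorum of both configurations. This is only an illustrative sketch: it assumes simple majority quorums (flexible quorums would need a different is_quorum function), the names is_quorum and can_commit_reconfig are hypothetical, and it does not model the requirement that the old quorum confirms first.

```python
from typing import Iterable, Set


def is_quorum(acks: Set[str], members: Set[str]) -> bool:
    """Assumption: a quorum is a strict majority of `members`."""
    return len(acks & members) > len(members) // 2


def can_commit_reconfig(acks: Iterable[str], old: Set[str], new: Set[str]) -> bool:
    """The new configuration may commit only when the acks contain
    a quorum of the old configuration AND a quorum of the new one."""
    acked = set(acks)
    return is_quorum(acked, old) and is_quorum(acked, new)


old_config = {"A", "B", "C"}
new_config = {"A", "B", "C", "D", "E"}

# A and B are a majority of the old config but not of the new one.
print(can_commit_reconfig({"A", "B"}, old_config, new_config))       # False
# Adding D gives 3 of 5 in the new config as well.
print(can_commit_reconfig({"A", "B", "D"}, old_config, new_config))  # True
```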

If the current leader is part of both the old and the new configurations, then it can remain the leader once the new configuration is installed. Otherwise, an epoch change becomes necessary.

It is critical to make sure that every operation committed once a configuration is installed is acknowledged by a quorum from the new configuration. Otherwise a leader crash can cause committed operations to be lost. It might be simpler to stall the pipeline of request processors when a reconfiguration goes through PrepRequestProcessor. By stalling I mean holding operations until the reconfiguration operation is committed.
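
One way to picture the stalling idea is a stage that forwards operations normally, but holds everything that arrives after a reconfiguration until that reconfiguration is reported as committed. The sketch below is a toy model only: the class StallingPipeline and its methods are hypothetical and do not correspond to the actual PrepRequestProcessor code.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


class StallingPipeline:
    """Toy model: hold operations that arrive behind an uncommitted reconfig."""

    def __init__(self) -> None:
        self.forwarded: List[str] = []               # ops passed to the next stage
        self.held: Deque[Tuple[str, bool]] = deque()  # ops waiting behind a reconfig
        self.pending_reconfig: Optional[str] = None

    def submit(self, op: str, is_reconfig: bool = False) -> None:
        if self.pending_reconfig is not None:
            # A reconfig is in flight: stall instead of forwarding.
            self.held.append((op, is_reconfig))
            return
        self.forwarded.append(op)
        if is_reconfig:
            self.pending_reconfig = op

    def on_reconfig_committed(self) -> None:
        # Release held ops in order; stop again if one of them is a reconfig.
        self.pending_reconfig = None
        while self.held and self.pending_reconfig is None:
            op, is_reconfig = self.held.popleft()
            self.submit(op, is_reconfig)


pipeline = StallingPipeline()
pipeline.submit("create /a")
pipeline.submit("reconfig M'", is_reconfig=True)
pipeline.submit("set /a")       # held until the reconfig commits
pipeline.on_reconfig_committed()
print(pipeline.forwarded)       # ['create /a', "reconfig M'", 'set /a']
```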

Proposed design

This section contains a proposed algorithm and API for reconfiguring ZooKeeper cluster membership.
Any comments are welcome (Alex Shraer, shralex@yahoo-inc.com).

Let M be the current configuration of ZooKeeper. We define a configuration to be the set of participating servers, members(M), and a quorum system over these members. To reconfigure the system to a new configuration M’, an administrative client submits a reconfig(M’) operation through any member s in M. This causes s to:

  1. Send a message to all members(M’) instructing them to connect to leader(M).
  2. Wait for <connected> from a quorum of M’ (optional, see line 1 below)
  3. Submit a reconfig(M’) operation to leader(M)
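
The three steps performed by s can be sketched as follows. This is a minimal illustration under stated assumptions: the function handle_reconfig, the send and connected_replies callbacks, and the message strings are all hypothetical stand-ins for a real transport, and the optional wait in step 2 is modelled as a majority check over replies already collected.

```python
from typing import Callable, Iterable, Set


def handle_reconfig(
    new_members: Set[str],
    leader: str,
    send: Callable[[str, str], None],
    connected_replies: Callable[[], Iterable[str]],
    wait_for_quorum: bool = True,
) -> bool:
    """Returns True if reconfig(M') was submitted to leader(M)."""
    # Step 1: instruct all members of M' to connect to leader(M).
    for member in sorted(new_members):
        send(member, f"connect-to {leader}")

    # Step 2 (optional): require <connected> from a quorum of M'.
    # Assumption: a quorum is a strict majority of M'.
    if wait_for_quorum:
        replies = set(connected_replies()) & new_members
        if len(replies) <= len(new_members) // 2:
            return False

    # Step 3: submit reconfig(M') to leader(M).
    send(leader, "reconfig " + ",".join(sorted(new_members)))
    return True


sent = []
ok = handle_reconfig(
    {"B", "C", "D"}, "A",
    send=lambda to, msg: sent.append((to, msg)),
    connected_replies=lambda: ["B", "C"],
)
print(ok, sent[-1])  # True ('A', 'reconfig B,C,D')
```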

The algorithm for leader(M) is as follows:

All servers do the following:

Notes:

Recovery from leader failure

During state discovery in M, if some server responds with next(M)!=null, let M’ be the returned non-null configuration. The algorithm (or Zab) will make sure that at most a single non-null value is returned.
The leader executes the reconfiguration algorithm with the following changes:

Some design choices

Reconfiguration API

A choice that has to be made is what kind of operations to support: incremental changes like “add server A” or “remove server B” (e.g., as in the survey on reconfiguration of R/W objects [3] and DynaStore [5]), or full membership specification, as in “reconfigure so that the membership becomes A, B, C” (e.g., the survey on reconfiguration with virtual synchrony [1] and RAMBO [4]).

A notable disadvantage of non-incremental requests arises when multiple reconfigurations are proposed concurrently. Suppose that the initial configuration is A, B, C and one process proposes to add a server D whereas another proposes to remove B. If each process has to specify the full new configuration, then the first process would propose A, B, C, D whereas the second would propose A, C. One of these would succeed first; suppose it is A, C. Then the other proposal must be aborted, otherwise the resulting configuration would be A, B, C, D, where B appears even though it was already removed.

With the incremental approach this would not happen. Here, a configuration can be viewed as a set of changes. The initial configuration is (+A, +B, +C); in the scenario above, the next configuration is (+A, +B, +C, -B) and the last one is (+A, +B, +C, -B, +D). This allows all reconfiguration requests to complete without aborting (a wait-free algorithm), as each change can be applied to the current configuration, whether this configuration remained the same or changed.
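
To make the incremental view concrete, the following sketch derives the membership from a list of changes written in the (+X, -Y) notation used above. The function apply_changes and the string encoding of changes are assumptions made for illustration only.

```python
from typing import Iterable, Set


def apply_changes(changes: Iterable[str]) -> Set[str]:
    """Replay '+X' / '-X' changes in order and return the resulting members."""
    members: Set[str] = set()
    for change in changes:
        sign, server = change[0], change[1:]
        if sign == "+":
            members.add(server)
        elif sign == "-":
            members.discard(server)
        else:
            raise ValueError(f"unknown change: {change!r}")
    return members


# Removing B and adding D both take effect, in either order.
print(sorted(apply_changes(["+A", "+B", "+C", "-B", "+D"])))  # ['A', 'C', 'D']
print(sorted(apply_changes(["+A", "+B", "+C", "+D", "-B"])))  # ['A', 'C', 'D']
```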

The non-incremental approach might be preferable for two main reasons:

To support the non-incremental approach, we propose that an administrative client intending to execute a reconfig first fetches the current configuration from the system and then submits a reconfig request that fully specifies the requested configuration and its quorum system, and includes the version of the current configuration (version(M)). If the server that gets this request already has some other proposal to reconfigure (next(version(M))!=null), the reconfiguration is aborted and the administrative client should decide whether to retry.
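
The fetch-then-submit pattern resembles a compare-and-set on the configuration version. The sketch below is a single-process toy model: the class ConfigStore and its methods are hypothetical, the quorum system is omitted, and rejecting a request whose version is stale is an assumption that goes slightly beyond the pending-proposal check described above.

```python
from typing import FrozenSet, Iterable, Optional, Tuple


class ConfigStore:
    """Toy model of version-checked, fully specified reconfiguration."""

    def __init__(self, members: Iterable[str]) -> None:
        self.version = 0
        self.members: FrozenSet[str] = frozenset(members)
        self.next_config: Optional[FrozenSet[str]] = None  # models next(M)

    def get_config(self) -> Tuple[int, FrozenSet[str]]:
        return self.version, self.members

    def propose(self, base_version: int, members: Iterable[str]) -> bool:
        # Abort if another proposal is pending or (assumption) the version is stale.
        if self.next_config is not None or base_version != self.version:
            return False
        self.next_config = frozenset(members)
        return True

    def commit(self) -> None:
        if self.next_config is not None:
            self.members = self.next_config
            self.version += 1
            self.next_config = None


store = ConfigStore({"A", "B", "C"})
v, _ = store.get_config()                        # both clients read version 0
print(store.propose(v, {"A", "C"}))              # True: remove B
print(store.propose(v, {"A", "B", "C", "D"}))    # False: a proposal is pending
store.commit()
v, current = store.get_config()                  # the client re-fetches and retries
print(store.propose(v, current | {"D"}))         # True: add D to A, C
```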

As a first stage, we propose to use the non-incremental API for reconfigurations. In the future, we intend to use this non-incremental interface only for changing the quorum system, and to add an incremental API for wait-free reconfigurations.

Old reconfiguration requests

Suppose that a reconfig request was issued and the leader started sending phase-1 messages to the current configuration M, but failed after sending to only one other server, A. Then, when recovery happened, the new leader did not see a message from A. Should we allow the reconfiguration request to surface at a later time? If not, a possible solution might be to have a command “next(M) = null” as the first one issued by any elected leader. If Zab is used for sending the message in Phase 1, explicitly making sure that no incomplete reconfiguration can surface later might be unnecessary.

Online vs. offline reconfiguration

The idea of an “offline” strategy for reconfiguration (survey on reconfiguration with virtual synchrony [1], survey on reconfiguring state-machine replication [2]) is to stop operations in the old configuration, transfer the state to the new configuration, and then enable operations in the new configuration. In contrast, an online reconfiguration approach (RAMBO [4], DynaStore [5]) never stops the service while reconfiguring.
One of the complexities arising in the online approach is that a normal operation can execute concurrently with a reconfiguration, yet the state must still be transferred correctly to the next configuration. The easy case is when the operation occurs in the old configuration and the reconfiguration transfers the state. It is possible, however, that the reconfiguration misses the operation when it transfers the state and completes. In this case, existing online reconfiguration solutions (RAMBO [4], DynaStore [5]) continue the operation and execute it in the new configuration.
Unfortunately, this may violate the global primary order in ZooKeeper: operations issued in the new configuration (potentially by a different primary) may have already completed, in which case global primary order does not allow operations issued by an old primary to be applied.
We therefore choose the offline reconfiguration strategy, but try to minimize the period of unavailability by pre-transferring the state to the new configuration before the reconfig begins.

Bibliography

Surveys:
1. Ken Birman, Dahlia Malkhi, and Robbert Van Renesse. Virtually Synchronous Methodology for Dynamic Service Replication. Technical report MSR-TR-2010-151, November 2010.
2. Leslie Lamport, Dahlia Malkhi, and Lidong Zhou. Reconfiguring a State Machine. SIGACT News 41(1): 63-73 (2010).
3. Marcos K. Aguilera, Idit Keidar, Dahlia Malkhi, Jean-Philippe Martin, and Alexander Shraer. Reconfiguring Replicated Atomic Storage: A Tutorial. Bulletin of the European Association for Theoretical Computer Science 102, pages 84-108, Distributed Computing Column, October 2010.

Rambo (R/W objects, fully specifies configurations, online, uses external consensus on configurations order):
4. Nancy Lynch and Alex Shvartsman. RAMBO: A reconfigurable atomic memory service for dynamic networks. In 16th International Symposium on Distributed Computing (DISC), 2002.

DynaStore (R/W objects, incremental reconfiguration, online, doesn’t require consensus):
5. Marcos Aguilera, Idit Keidar, Dahlia Malkhi, and Alex Shraer. Dynamic Atomic Storage Without Consensus. Journal of the ACM, 2011.