Cluster Membership: ZOOKEEPER-107

Motivation

ZooKeeper clusters are currently static: the set of servers participating in a cluster are statically defined in a configuration file. In many instances, however, it is desirable to have the ability of adding and removing servers from an ensemble. The difficulty of implementing such a feature is making sure that a change in the configuration does not cause inconsistencies in a ZooKeeper cluster. A related issue is the one of enabling a client to learn the current ensemble of servers dynamically.

Requirements

The ensemble of servers must agree upon the current configuration, and to reach agreement, Zab sounds like the obvious choice;
We need new client calls to add and remove servers. It is unclear whether we want one call for each modification or one call to propose a whole new configuration;
It must work with both majority and flexible quorums;
We need a mechanism, perhaps based on URIs, to enable a client to learn the current configuration.

Some pre-design random thoughts

When moving from one configuration to another, we need to make sure that a quorum of the old configuration and a quorum of the new configuration commit to the new configuration. A quorum of the old configuration needs to agree to avoid a split-brain problem, for example, when adding more servers. A quorum of the new configuration needs to agree for progress. We also need to make sure that a quorum of the old configuration confirms first, otherwise a partition could cause a split-brain problem.

If the current leader is part of the old and new configurations, then it can keep being the leader once the new leader is installed. Otherwise an epoch change becomes necessary.

It is critical to make sure that every operation committed once a configuration is installed is acknowledged by a quorum from the new configuration. Otherwise a leader crash can cause committed operations to be lost. It might be simpler to stall the pipeline of request processors when a reconfiguration goes through PrepRequestProcessor. By stalling I mean holding operations until the reconfiguration operation is committed.

Proposed design

This section contains a proposed algorithm and API for reconfiguring zookeeper cluster membership.
Any comments are welcome (Alex Shraer, shralex@yahoo-inc.com).

Let M be the current configuration of Zookeeper. We define a configuration to be the set of participating servers, members(M), and a quorum system over these members. In order to reconfigure the system to a new configuration M’ an administrative client submits a reconfig(M’) operation through any member s in M. This causes s to:

Send a message to all members(M’) instructing them to connect to leader(M).
Wait for <connected> from a quorum of M’ (optional, see line 1 below)
Submit a reconfig(M’) operation to leader(M)

The algorithm for leader(M) is as follows:

operation reconfig(M’)
1. If no quorum of M’ is connected or the connected quorum is too far behind M return failure
2. If leader(M) is in M’:
  1. Designate leader(M) to be the leader of M’
  2. Suspend processing further operations (until M’ is activated)
3. Otherwise:
  1. Designate the most up-to-date server in connected quorum of M’ to be leader(M’)
  2. Stop processing new operations, return failure for any further ops received
4. Schedule the reconfiguration last in the queue of operations
  Phase 1:
5. send <reconfig-start, version(M), M’> to members(M)
6. wait for <reconfig-start-ack> from quorum of members(M)
  Phase 2:
7. send <activate-config, M’, leader(M’) > to members (M’)
8. wait for <activate-config-ack> from quorum of members (M’)
  Phase 3:
9. send <retire M> to members(M)

All servers do the following:

upon receipt of <reconfig-start, version(M), M’>
- store next(version(M)) = M’
- send <reconfig-start-ack> to leader(M)

upon receipt of <activate-config, M’, leader(M’)> in configuration M’
- start serving operations in M’
  - the first operation should clean up any incomplete reconfig request from M’
- send <activate-config-ack< to leader(M)

upon receipt of <retire, version(M)>
- garbage-collect M

Notes:

State transfer starts when members(M’) connect to leader(M), before reconfig(M’) is invoked. However, some suffix of operations might not have been transferred to M’ before the reconfiguration began. Since no new operations can be committed in M’ before this state-transfer completes, we’d like to minimize this tail. A possible way to do that is to require that at most w operations scheduled before the reconfiguration are unknown to the connected quorum of M’ as a prerequisite (line 1).
When line 8 completes we know that operations scheduled before the reconfiguration are committed in M’.
Even if the current leader remains the leader of M’ we cannot allow operations to be executed in M’ before phase 1 ends, otherwise, if the leader fails, we have a spilt brain (some operations execute in M’ and then when a new leader recovers new ops will be executed in M).
Instead of lines (2b) and (3b) the leader of M can redirect further operations to leader(M’) (whether leader(M') is equal to leader(M) or not). Leader(M') will buffer them until M’ is activated.
If phase 1 is done using a normal ZAB proposal, explicitly making sure that there are no incomplete reconfiguration requests that remain in the system after a recovery might not be necessary.

Recovery from leader failure

During state discovery in M, if some server responds next(M)!=null, let M’ be such returned non-null configuration. The algorithm (or ZAB) will make sure that at-most a single non-null value is returned.
The leader executes the reconfiguration alg. with the following changes:

Instead of line 1, try to connect to M’ and transfer state, so that a quorum of M’ is connected and up-to-date. If unsuccessful, skip the reconfiguration.
If a quorum of M’ indicate that they’ve already activated M’ then skip to phase 3
Otherwise, if next(M) = M’ was returned by a quorum of M, skip to phase 2

Some design choices

Reconfiguration API

A choice that has to be made is what kind of operations to support – incremental changes like “add server A” or “remove server B” (e.g., as in survey on reconfiguration of R/W objects , #DynaStore) or full membership specification, as in “reconfigure so that the membership becomes is A, B, C” (e.g., survey on reconfiguration with virtual synchrony , #Rambo).

One notable disadvantage of non-incremental requests is when multiple reconfigurations are proposed concurrently. Suppose that the initial configuration is A, B, C and one process proposes to add a server D whereas another proposes to remove B. If each process has to specify the full new configuration then the first process would propose A, B, C, D whereas the second would propose A, C. One of these would succeed first, suppose A, C. Then the second proposal should be aborted otherwise the resulting configuration would be A, B, C, D, where B appears even though it was already removed.

With the incremental approach this would not happen. Here, a configuration can be viewed as a set of changes. The initial configuration is (+A, +B, +C) then in the scenario above the next configuration is (+A, +B, +C, -B) and last one is (+A, +B, +C, -B, +D). This allows all reconfiguration requests to complete without aborting (a wait-free algorithm), as each change can be applied to the current configuration, whether this configuration remained the same or changed.

The non-incremental approach might be preferable for two main reasons:

It is easy to completely change the quorum system of the new configuration (with the incremental approach It is perhaps possible to support that as a separate command, or automatically deduce the quorum system from a given set of members).
One can argue that, in the scenario above, adding D was intended in the context of the original configuration A, B, C, and it should be left to the client whether the reconfiguration makes sense given that the system configuration has changed.

To support the non-incremental approach, we propose that the administrative client intending to execute a reconfig, first fetches the current config from the system, and then submits a reconfig request where he fully specifies the requested configuration, its quorum system, and includes the version of the current configuration (version(M)). If the server that gets this request already has some other proposal to reconfigure (next(version(M))!=null) the reconfiguration is aborted and the administrative client should decide whether to retry.

At first stage we propose to use the non-incrememtal API for reconfigurations. In the future we intend to use this non-incremental interface only for changing the quorum system and to add an incremental API for wait-free reconfigurations.

Old reconfiguration requests

Suppose that a reconfig request was issued and the leader started sending phase-1 messages to the current configuration M, but failed after sending to only one other server A. Then, when the recovery happened, the new leader did not see a message from A. Should we allow the reconfiguration request to surface at a later time ? If not, a possible solution might be to have a command “next(M) = null” as the first one issued by any elected leader. If ZAB is used for sending the message in Phase 1, explicitly making sure that there are no incomplete reconfigurations that can surface later might be unnecessary.

Online vs. offline reconfiguration

The idea of an “off-line” strategy for reconfiguration (survey on reconfiguration with with virtual synchrony , survey on reconfiguring state-machine replication ) is to stop operations in the old configuration, transfer the state to the new configuration and then enable operations – in the new configuration. In contrast, an online reconfiguration approach (#RAMBO, #DynaStore) never stops the service while reconfiguring.
One of the complexities arising in the online approach is that a normal operation can be executing concurrently with a reconfiguration, however the state still must be transferred correctly to the next configuration. The easy case is when the operation occurs in the old configuration and the reconfiguration transfers the state. It is possible, however, that the reconfiguration misses the operation when it transfers the state and completes. In this case, existing online reconfiguration solutions (#RAMBO, #DynaStore) continue the operation and execute it in the new configuration.
Unfortunately this may violate the global primary order in Zookeeper - operations issued in the new configuration (potentially by a different primary) may have already completed, in which case global primary order does not allow operations issued by an old primary to be applied.
We therefore choose the offline reconfiguration strategy, however we try to minimize the period of unavailability by pre-transferring the state to the new configuration before the reconfig begins.

Bibliography

Surveys:
1. Ken Birman, Dahlia Malkhi, and Robbert Van Renesse, Virtually Synchronous Methodology for Dynamic Service Replication, no. MSR-TR-2010-151, November 2010 paper
2. Leslie Lamport, Dahlia Malkhi and Lidong Zhou, Reconfiguring a State Machine. In SIGACT News 41(1), SIGACT News 41(1): 63-73 (2010) paper
3. Marcos K. Aguilera, Idit Keidar, Dahlia Malkhi, Jean-Philippe Martin, Alexander Shraer:
Reconfiguring Replicated Atomic Storage: A Tutorial. In the Bulletin of the European Association for Theoretical Computer Science 102, pages 84-108, Distributed Computing Column, October 2010. paper

Rambo (R/W objects, fully specifies configurations, online, uses external consensus on configurations order):
4. Nancy Lynch and Alex Shvartsman. RAMBO: A reconﬁgurable atomic memory service for dynamic networks. In 5th International Symposium on Distributed Computing (DISC), 2002. paper

DynaStore (R/W objects, incremental reconfiguration, online, doesn’t require consensus):
5. Marcos Aguilera, Idit Keidar, Dahlia Malkhi, and Alex Shraer, Dynamic Atomic Storage Without Consensus, in Journal of the ACM, ACM, 2011 paper