Motivation

Before getting into issues in the Mesos Master's current design, let's look at some of the motivating design principles.

Simplicity is an important and fundamental design principle, particularly when designing a fault-tolerant, large scale, maintainable system. Accordingly, when a simpler solution was pragmatically "good enough", we have rejected more complex solutions.

In particular, it was an intentional design decision based on simplicity to have a Master without persistent state. However, this came at the cost of allowing inconsistent task state in very rare situations involving a confluence of failures across Masters and Slaves. Frameworks such as Aurora had ways of working around these issues, which is why persistence has not been added to the Master so far; the stateless Master was "good enough" to warrant simplicity's substantial benefits.

Here's how the Master works in the current design. During the Master's normal operation, when a Slave process fails, the Master detects the failure via socket closure or missed health checks. It then notifies the framework that the Slave was lost and that some of the framework's tasks may have been lost, as shown here:

[Diagram: Slave failure]

This works well because:

  • The Master is long-running; it typically only restarts due to manual restarts or machine / network failures.
  • A new Master is elected in < 10 seconds. (We use a 10 second ZooKeeper session timeout).
  • Slaves recover from failures; the above scenario requires a Slave's permanent failure, which is rare in practice.

However, Master failovers cause consistency issues due to the lack of persistence. Consider the same scenario, but with a Master failover:

[Diagram: Slave failure during a Master failover]

Updates for the lost tasks were never sent to the framework. At first glance it would seem possible for the scheduler to fully reconcile this state against the Master by asking "What is the status of my tasks?". Since the tasks are unknown to the Master, it could reply with TASK_LOST messages. However, this is trickier than it seems: what if the Master has just recovered from a failover, and hasn't yet learned of all the tasks? There's also the following consistency issue in the presence of network partitions:

[Diagram: Slave partitioned from the Master]

In this case, there was a network partition between the Master and the Slave, resulting in the Master notifying the Framework that the tasks and Slave were lost. During the partition, the Master fails over to a new Master, which knows nothing of the partitioned Slave. Consequently, when the partition is restored, the new Master allows the Slave to re-register, despite the framework having been told that the Slave was lost!

These inconsistencies can be solved through the Registrar: the addition of a minimal amount of persistent state to the Master.

Design

We propose adding a minimal amount of persistent replicated state in the Master so as to guarantee eventual consistency for Slave information. This state...

  • Initially will be the set of Slaves managed by the Master. 
  • Will be authoritative; if a Slave is not present in this state, it cannot re-register with the Master.
  • Is called the Registry.
    • The initial Registry only stores Slave information. 
    • We plan to add additional state into the Registry later on.

We also add the Registrar, which is...

  • A component in the Master responsible for managing the Registry using the State abstraction.
    • The State implementation can vary between ZooKeeper, the Replicated Log, and LevelDB (see below for more details).
  • An asynchronous libprocess Process (think: Actor) inside the Master process as shown here:

[Diagram: Master process]

The Master first consults the Registrar when dealing with registration, re-registration, and Slave removal. When the Registrar replies to the Master, the modified state has been persisted to the Registry and the Master can proceed with the remainder of the operation. This applies the same principle as write-ahead logging to ensure atomicity and durability.
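
To make this flow concrete, here is a minimal sketch, assuming libprocess-style actors and futures; the class and method names (RegistrarProcess::admit, RegistrarProcess::remove, MasterProcess::_registerSlave) are illustrative, not the actual Mesos source:

// Illustrative sketch only; assumes libprocess-style actors and futures.
// Header paths and names approximate the Mesos codebase but are not the
// actual source.

#include <mesos/mesos.hpp>        // SlaveInfo.

#include <process/defer.hpp>
#include <process/dispatch.hpp>
#include <process/future.hpp>
#include <process/process.hpp>

#include <stout/lambda.hpp>       // lambda::_1.

using process::Future;

// Hypothetical Registrar actor: the returned futures are satisfied only
// after the corresponding mutation is durable in the Registry.
class RegistrarProcess : public process::Process<RegistrarProcess>
{
public:
  Future<bool> admit(const mesos::SlaveInfo& info);   // Add to the Registry.
  Future<bool> remove(const mesos::SlaveInfo& info);  // Remove from the Registry.
};

class MasterProcess : public process::Process<MasterProcess>
{
public:
  explicit MasterProcess(const process::PID<RegistrarProcess>& _registrar)
    : registrar(_registrar) {}

  void registerSlave(const mesos::SlaveInfo& info)
  {
    // Write-ahead: persist the admission first, act on it afterwards.
    process::dispatch(registrar, &RegistrarProcess::admit, info)
      .onAny(process::defer(
          self(), &MasterProcess::_registerSlave, info, lambda::_1));
  }

  void removeSlave(const mesos::SlaveInfo& info);  // See the removal sketch below.

private:
  void _registerSlave(const mesos::SlaveInfo& info, const Future<bool>& admitted)
  {
    if (!admitted.isReady() || !admitted.get()) {
      return;  // The Registry write failed; the Slave is not admitted.
    }

    // Only now do we update in-memory state and send SlaveRegisteredMessage.
  }

  void _removeSlave(const mesos::SlaveInfo& info, const Future<bool>& removed);

  const process::PID<RegistrarProcess> registrar;
};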

Consider the Registrar's normal operation with respect to a Slave's lifecycle:

[Diagram: Registrar adding a Slave]

Note that we only register the Slave once it has been persisted in the Registry; after this point, the Slave is allowed to re-register:

[Diagram: Registrar handling Slave re-registration]

Again, note that the Slave is only re-registered once the Master confirms the Slave is present in the Registry.

Now, consider Slave removal, when a Slave is no longer reachable:

[Diagram: Registrar handling Slave removal]

Due to the Registry's write-ahead nature, the framework is only informed once we've removed the Slave from the Registry. This signal is not delivered reliably. That is, if the Master fails after persisting the removal, but before notifying frameworks, the signal is dropped. Likewise, if a framework fails before receiving the signal, it is dropped. We deal with this through a reconciliation mechanism against the Master (see Consistency below).
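
Continuing the illustrative sketch from above (these are the removeSlave / _removeSlave members declared there), the removal path follows the same write-ahead ordering: frameworks are notified only after the removal is durable, and a failure between the two steps simply drops the best-effort signal.

// Continues the illustrative MasterProcess / RegistrarProcess sketch above.

void MasterProcess::removeSlave(const mesos::SlaveInfo& info)
{
  // Write-ahead: persist the removal before telling anyone about it.
  process::dispatch(registrar, &RegistrarProcess::remove, info)
    .onAny(process::defer(
        self(), &MasterProcess::_removeSlave, info, lambda::_1));
}

void MasterProcess::_removeSlave(const mesos::SlaveInfo& info, const Future<bool>& removed)
{
  if (!removed.isReady() || !removed.get()) {
    return;  // Could not persist the removal; the Slave remains admitted.
  }

  // The removal is now durable: this Slave will never be re-admitted.
  // Only at this point do we send LostSlaveMessage / TASK_LOST updates.
  // If the Master fails right here, the best-effort signal is dropped and
  // frameworks recover the information via reconciliation (see below).
}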

To observe how we prevent inconsistencies, consider the partition scenario from the Motivation section above:

[Diagram: Registrar handling a partitioned Slave]

We can see that once removed from the Registry, the Slave is never allowed to re-register. This is key to correctness: once decided, we must not reverse our decision to remove the Slave.  

We propose the following registry format, which may be amended based on feedback or new inspiration:

// The Registry is a single state variable, to enforce atomic versioned writes.
message Registry {
  // Leading master. This ensures the Registry version changes when an election occurs.
  required Master master;

  // All admitted slaves.
  required Slaves slaves;

  // Other information in the future.
}

// This wrapper message ensures we have a single Message to represent the Master,
// in case the Master is stored separately in the future.
message Master {
  required MasterInfo master;
}

// This wrapper message ensures we have a single Message to represent Slaves,
// in case Slaves are stored separately in the future.
message Slaves {
  repeated Slave slaves;
}

// This wrapper message ensures we can store additional information (missing
// from SlaveInfo) if needed.
message Slave {
  required SlaveInfo info;
}

Consistency

Use of the Registrar does not guarantee that frameworks receive the lost Slave signal: we do not persist receipt of the message, so frameworks may not receive lost Slave messages in some circumstances (for example, the Master dying immediately after removing a Slave from the Registry). As a result, frameworks must do the following to ensure consistent task state (a sketch in code follows the list):

  1. Before returning from statusUpdate(), the update must be persisted durably by the framework scheduler.
  2. Periodically, the scheduler must reconcile the status of its non-terminal tasks with the Master (via reconcileTasks). (The scheduler can receive a TASK_LOST after a different terminal update, but never before; such TASK_LOST updates should be ignored.)
  3. Optionally, when receiving slaveLost(), the scheduler should transition its tasks to TASK_LOST for the corresponding Slave. This is only an optimization.
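
As a rough illustration of these three rules, a scheduler built against the C++ SchedulerDriver API might look like the sketch below; MyStore is a hypothetical durable store owned by the framework, and the remaining Scheduler callbacks are omitted.

#include <vector>

#include <mesos/scheduler.hpp>

// Hypothetical durable store owned by the framework; not part of Mesos.
class MyStore
{
public:
  void persist(const mesos::TaskStatus& status);
  void markLost(const mesos::SlaveID& slaveId);
  std::vector<mesos::TaskStatus> nonTerminalStatuses();
};

class MyScheduler : public mesos::Scheduler
{
public:
  // Rule 1: persist the update durably before returning.
  virtual void statusUpdate(
      mesos::SchedulerDriver* driver,
      const mesos::TaskStatus& status)
  {
    store.persist(status);
  }

  // Rule 3 (an optimization): eagerly mark the lost Slave's tasks TASK_LOST.
  virtual void slaveLost(
      mesos::SchedulerDriver* driver,
      const mesos::SlaveID& slaveId)
  {
    store.markLost(slaveId);
  }

  // Rule 2: invoked periodically (e.g. from a timer) to reconcile the
  // scheduler's non-terminal tasks against the Master.
  void reconcile(mesos::SchedulerDriver* driver)
  {
    driver->reconcileTasks(store.nonTerminalStatuses());
  }

  // ... the remaining Scheduler callbacks are omitted for brevity ...

private:
  MyStore store;
};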

Later, these may be done automatically for schedulers via a persistent scheduler driver. Also, reconciliation is strictly only required after a Master failover rather than periodically; however, the initial implementation will require periodic reconciliation. More thorough reconciliation documentation will be added separately from this document.

Performance

We will test the Registrar for efficiency and for its ability to scale to tens of thousands of Slaves. With respect to scalability, this design has two key aspects:

  1. The Master and Registrar interact asynchronously. The Master does not block waiting for the Registrar, and vice versa.
  2. The Registrar uses a batching technique that ensures at most one write operation is in flight at any time. While a write is in progress, all incoming requests are batched together into a single mutation of the state and applied as a single subsequent write.

1. alone is insufficient to ensure a performant Registrar, as the number of queued operations could grow unbounded. However, with both 1. and 2., the Registrar is largely immune to performance issues in the underlying state implementation. With a fast state implementation, we'll perform more writes of smaller mutations to state, and with a slow state implementation we'll perform fewer writes with larger mutations to the state.
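
Here is a minimal sketch of the batching idea (hypothetical types and names, not the actual Registrar code); persist() stands in for a single versioned write through the State abstraction, and Registry is assumed to be generated from the message definitions proposed above.

#include <functional>
#include <queue>

#include <process/defer.hpp>
#include <process/future.hpp>
#include <process/process.hpp>

#include <stout/lambda.hpp>

// Assumed to be generated from the Registry message proposed above.
#include "registry.pb.h"

// A mutation is a function applied to the in-memory copy of the Registry.
typedef std::function<void(Registry*)> Mutation;

class BatchingRegistrar : public process::Process<BatchingRegistrar>
{
public:
  BatchingRegistrar() : writing(false) {}

  void apply(const Mutation& mutation)
  {
    pending.push(mutation);
    maybeWrite();
  }

private:
  void maybeWrite()
  {
    if (writing || pending.empty()) {
      return;  // A write is already in flight, or there is nothing to do.
    }

    // Fold every queued mutation into a single new version of the Registry.
    while (!pending.empty()) {
      pending.front()(&registry);
      pending.pop();
    }

    writing = true;
    persist(registry)  // One versioned write through the State abstraction.
      .onAny(process::defer(self(), &BatchingRegistrar::written, lambda::_1));
  }

  void written(const process::Future<bool>& result)
  {
    writing = false;
    maybeWrite();  // Mutations that arrived during the write go out next.
  }

  // Hypothetical: a single write of the serialized Registry.
  process::Future<bool> persist(const Registry& updated);

  Registry registry;
  std::queue<Mutation> pending;
  bool writing;
};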

State Implementation

ZooKeeper

Mesos currently supports ZooKeeper as a storage layer for State. However, there are two issues with the use of ZooKeeper in the Registrar:

  1. ZooKeeper's default znode size limit is just under 1MB (see jute.maxbuffer). ZooKeeper was also designed to store data on the order of kilobytes in size, so raising this limit is considered unsafe. The Registrar's initial design treats the Slaves as a single large versioned blob, which can easily exceed 1MB for thousands of Slaves (see the rough estimate below).
  2. Because we treat the Slaves as a single versioned blob, it is undesirable to split our data across znodes. This would add additional complexity, and we would lose our atomic writes unless we also implemented transactional writes in our ZooKeeper storage layer.
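
As a rough back-of-the-envelope estimate: assuming each serialized Slave entry (hostname, resources, and a few attributes) is on the order of 100-200 bytes, 10,000 Slaves would serialize to roughly 1-2MB, so even a few thousand Slaves with richer attributes can approach the default znode limit.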

Therefore, although support will be provided, ZooKeeper should only be used for small Mesos clusters (< 100 Slaves) whose Slaves do not carry large amounts of metadata (e.g. SlaveInfo attributes).

Replicated Log

Mesos will soon support the Replicated Log as a state storage backend.

  • The Replicated Log has an atomic append operation for blobs of unbounded size. 
  • When using the Replicated Log, Mesos must run in high availability mode (multiple masters running as hot spares), so that it has a quorum of online log replicas. 
  • The Replicated Log has a safety constraint that at most one log co-ordinator has write access to the log. The safety implications of this are discussed below.

LevelDB

New Mesos users frequently run Mesos in non-high-availability mode, with a single Master and potentially no ZooKeeper cluster available. These users can use LevelDB as the state storage backend, which means that moving a Master to another machine requires moving the Master's work directory (containing the LevelDB state) to the new machine.

Safety and State Initialization

The first safety concern is guaranteeing that only the leading Master can make a successful update to the Registry. The following properties help provide this guarantee:

  1. The State abstraction uses a versioning technique to deny concurrent writes to the same variable. We will store the Registry in a single State Variable, which enforces consistent versioning (a sketch of such a versioned write follows this list). This means that if a rogue Master attempts a write to the Registry after the leading Master has performed a successful write, the rogue Master's write fails. However, there is a race: the rogue Master could write first and cause the leading Master's write to fail.
  2. To deal with this race, when a leading Master recovers state from the Registrar, the Registry is updated with the leader's MasterInfo. This ensures the Registry version updates as a result of an election, thus preventing a rogue Master from performing any writes.
  3. In addition to these two safety measures, if the replicated log is used, the rogue Master cannot perform a write as the rogue log co-ordinator loses its leadership when the new leading Master performs a read of the state. This helps prevent any writes in the rogue Master.
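
As a sketch of such a versioned write, assuming the State abstraction's fetch() / store() interface (where store() returns None when the variable was modified since it was fetched); header and namespace paths here are illustrative.

#include <string>

#include <process/future.hpp>

#include <stout/option.hpp>

#include "registry.pb.h"   // Assumed: generated from the Registry message above.
#include "state/state.hpp" // Illustrative path to the State abstraction.

using mesos::internal::state::State;
using mesos::internal::state::Variable;

// Attempts one optimistic, versioned write of the full Registry. The result
// is false if another writer (e.g. a rogue Master) raced us and won.
process::Future<bool> updateRegistry(State* state, const Registry& updated)
{
  return state->fetch("registry")
    .then([=](const Variable& variable) {
      // The fetched Variable carries the version; mutate() preserves it.
      Variable next = variable.mutate(updated.SerializeAsString());

      return state->store(next)
        .then([](const Option<Variable>& stored) {
          return stored.isSome();  // None means the versioned write was rejected.
        });
    });
}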

Related to upgrades and initialization, there are a few important tradeoffs between safe operation and ease of use. Let's begin with some questions that should be asked:

What occurs when state is empty? With a simple implementation, it appears as an empty cluster. Most of the time this is because the cluster is indeed empty, or is being started for the first time. However, consider an accidental misconfiguration of the Master that points a production cluster at empty state, or, less likely, the loss of all log replicas or of the ZooKeeper data. In these cases, the Master should not proceed happily assuming there are no Slaves in the cluster (every Slave would be refused re-admission to the Registry).

How does one upgrade? Done naively with a simple implementation, this also appears as an empty cluster, resulting in a restart of all of the Slaves (they would be refused re-admission into the Registry).

From these questions you can derive the following important requirements:

  1. Safe upgrade functionality must be provided. This is essential for upgrading a running cluster.
  2. For production users, we must provide the ability to require an initialization of the Registry, allowing differentiation between a first run and misconfiguration / data loss.

It is interesting to consider satisfying 2. through 1.: if the state is empty, enter a temporary upgrade period in which all Slaves are permitted to re-register. This keeps the cluster operational; however, it exposes the inconsistencies described in the Motivation section. This solution's implicit nature is also undesirable (operators may want to be alerted and to intervene to recover data when this occurs).

Proposed Initialization and Upgrade Scheme

To determine whether the Registry state is initialized, we will persist an 'initialized' variable in the Registry. Two new flags provide upgrade semantics and state initialization (a sketch of the resulting recovery behavior follows the list):

  1. --registrar_upgrade=bool (default: false) - When true, the Registrar auto-initializes state and permits all re-registrations of Slaves. This lets the Registry be bootstrapped with a running cluster's Slaves.
  2. --registrar_strict=bool (default: false) - Only interpreted when not upgrading (--registrar_upgrade=false). When true (strict), the Registrar fails to recover when the state is empty. When false (non-strict), the Registrar auto-initializes state when the state is empty.
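
The following sketch shows how these flags might gate recovery; the Flags struct and the checkRecovery() helper are hypothetical, but the decisions mirror the semantics described above.

#include <string>

#include <stout/none.hpp>
#include <stout/option.hpp>

struct Flags
{
  bool registrar_upgrade;  // --registrar_upgrade (default: false)
  bool registrar_strict;   // --registrar_strict  (default: false)
};

// Returns an error if recovery must fail; None means recovery proceeds.
Option<std::string> checkRecovery(const Flags& flags, bool registryInitialized)
{
  if (flags.registrar_upgrade) {
    // Upgrade mode: auto-initialize if needed and admit all re-registrations.
    return None();
  }

  if (!registryInitialized && flags.registrar_strict) {
    // Strict mode refuses to run against empty state: either an
    // uninitialized first run, a misconfiguration, or data loss.
    return std::string("Registry is not initialized; refusing to recover");
  }

  // Non-strict: auto-initialize empty state and proceed.
  return None();
}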

As we add additional information to the Registry in the future, it will be important to consider the upgrade semantics for cases where we only want to bootstrap a portion of the Registry. This could be done by updating the --registrar_upgrade flag to take the name(s) of the registry entries to bootstrap.

Let's look at how the various use cases map to these flags below.

Upgrade

When upgrading from pre-Registrar Mesos to Registrar Mesos:

  1. Stop all Masters, and upgrade the binaries.
  2. Start all Masters with --registrar_upgrade.
  3. After all Slaves have re-registered with the Master, stop all Masters.
  4. Start all Masters without --registrar_upgrade, optionally setting --registrar_strict if desired.

Starting a Cluster for the First Time

When starting a cluster for the first time:

  1. Start all Masters without --registrar_strict.
  2. Optionally, stop all Masters and restart them with --registrar_strict, if desired.

26 Comments

  1. I have a handful of questions, but I'll ask them in separate comments to allow for independent discussion threads.

    so as to guarantee eventual consistency for Slave information

    Can you elaborate on how the state is eventually-consistent, and what (if any) are the implications of that?  The rest of the document seems to describe the necessity for fully-consistent state.

    1. If a master is partitioned for some period of time it will not be able to learn about state updates, but it will eventually learn about those updates. Of course, if a master is partitioned and it can't learn about state updates it also can't make state updates.

  2. How does one reconcile the situation where a slave wants to re-register and the master rejects due to lack of state about the slave?  Will/does the slave have a special (documented) exit code when it fails to re-register?

    1. If a slave is not known, the master will reject the slave. I like the idea of using a different exit code. Note that this is precisely why the "upgrade" semantics were considered.

      1. Right, but what happens after that?  Is human intervention required to successfully register a rejected slave?

        1. If an operator is using a process monitor then the slave will get restarted and it will register (not reregister) with the master and be able to join the cluster. If no process monitor is being used an operator will need to go manually restart the slave, similar to the case where a slave had crashed.

  3. Regarding the protobuf message layout: have you thought through the implications of writing full state on every mutation?  This seems like a non-issue for a cluster that has quiesced, but potentially problematic when slaves are churning, or large batches of slaves are added.

    1. Yes. You can see the current implementation of the Registrar here where we already batch mutations. That means that we'll write as fast as the underlying storage allows us and in the mean time batch together all mutations for the next write. This should be optimal w.r.t. writing the full state on every mutation.

      Of course, we still might not want to write the full state on every mutation. We plan on doing performance evaluations to determine the limits of this approach. Depending on those limits we'll begin implementing some of our storage optimizations. The primary optimization will be to compute "diffs" of each serialized protobuf and only write those, applying and snapshotting those diffs as required.

  4. Why the bifurcation of Registry implementations backed by ZooKeeper and LevelDB?  If you continue down this path, will there be a bridge to move state from one to another for cluster admins that change their decision?

    1. This is mostly an artifact of the State abstraction already having multiple storage backends. I like the idea of a bridge tool (it might be a great GSOC project).

      1. My suggestion is actually to pick a winner and make that one really good, rather than having two with their own setup and caveats.  Is there a downside to doing this (other than a potential deprecation cycle)?

        1. Yes, I like the idea of picking a winner (especially without a migration tool).

          1. Previously I intended LevelDB to be used for testing, local runs, and non-HA setups. I didn't think at the time of just using the Replicated Log with a quorum of 1 (the C++ API for the Log does not require ZooKeeper). With this in mind, we could only support the replicated log, removing ZooKeeper and LevelDB Registry support.

  5. The case made for not using ZooKeeper as the sole implementation is weak; can you add more to justify the decision to use the replicated log?  Additionally, splitting the data is not prohibitively challenging AFAICT, and seems to benefit both the replicated log and ZooKeeper approaches.

    1. Splitting the data is not the biggest issue (and I agree, it could be helpful for the replicated log as well if we don't use our "diff" approach mentioned above); the biggest issue is having guarantees on atomic updates (across all znodes in the case of ZooKeeper, even the ones you're not currently updating). This can be accomplished using transactional writes as Ben pointed out, but it adds more complexity (it's not well known in practice, it has some implementation quirks of its own, and we'd have to build it) and can introduce a performance hit. We want to target the replicated log because a write-ahead log more naturally represents the abstraction that we're looking for. Moreover, we don't need to build it and it's been running in production at Twitter for >2 years. Finally, there are many people in the community that have asked if Mesos can be used without the ZooKeeper dependency, which this will enable. And to your point above, we'd rather have people start using the replicated log storage rather than having to migrate later if they want to move away from ZooKeeper. It's likely we shouldn't even allow people to start with ZooKeeper as the underlying storage backend.

  6. We will test the Registrar for efficiency and being able to scale to over 10,000 Slaves.

    Can you include some back of the envelope calculations of the Registry message size based on known parameters?  Also, can you set more firm acceptance criteria than "over 10,000 slaves"?

    1. Anonymous

      Please update "over 10,000 slaves" to a much higher number.

      1. Sure! We will definitely benchmark with more than 10,000 as well.

    2. My initial performance testing with simulated cluster data indicated that 10,000 slaves with some attributes and additional resources will consume < 2MB. This will need to be revised as the message format has been updated, and I'd like to benchmark using more slave attributes to further inflate the data.

      For acceptance criteria, the design will support any number of slaves, but we'll want to quantify the various events during the performance testing:

      1. Full cluster bootstrap (This will likely be the slowest operation).
      2. Slave Registration(s)
      3. Slave Re-registration(s)
      4. Slave Removal(s)
      5. Slave churn

      I don't have any hard criteria here yet for each of these, but we will definitely be measuring all of these types of things.

  7. Should --registrar_upgrade instead be something like --registrar_initialize?  The term 'upgrade' is missing from the description, so the presence in the arg is somewhat confusing.  Is this a 'run-once' arg?  i.e. if a master is running with 'true', will it abort if it discovers existing state?  (This would prevent accidentally running indefinitely with this arg.)
    1. If it's intended to be a "run-once" flag have you considered adding a tool or subcommand (e.g. mesos-init or mesos-master init) to initialize this state?

      1. A run-once argument is operationally more complex because what happens when a failure occurs during this one run? The Masters will subsequently need to flap, which produces a window in which operators must intervene in a timely manner to manually wipe the state and try again.

        Making it a subcommand actually requires a Master to start, re-register all slaves, persist them, and terminate when complete, all while not communicating with the Framework.

        --registrar_upgrade is designed to allow the state to bootstrap, if a failure occurs and the Masters restart with --registrar_upgrade, the bootstrapping process continues. This flag can remain for an arbitrary amount of time, which is equivalent to operating in a pre-Registrar world: all slaves can re-register. Perhaps a better name is --registrar_bootstrap?

  8. Will there be a way for a cluster administrator to "dump" the state prior to an upgrade and do an offline restore in the case of a rollback? I'd think something along the lines of

    mesos-dump --zk=zk://example.com/mesos/master > master-state.proto
    # Stop mesos-master, catastrophic upgrade failure
    mesos-restore --location-of-state < master-state.proto
    # Start mesos-master
    1. I assume you are referring to the case when we are upgrading in the future, where the Registry is destroyed in the process of an upgrade?

      We may possibly need something for this in the future as the Registry grows. For now, the --registrar_upgrade flag is sufficient and less dangerous than what you suggest:

      1. Restoring state from old data could cause the removal of many slaves.
      2. --registrar_upgrade is just as manual, and will allow all slaves to re-register to restore the current state.
  9. Have you considered solutions other than reconciliation for maintaining consistency? E.g., adding message IDs to messages so frameworks can detect missed/duplicate messages, then providing a mechanism for frameworks to request missed messages?

    1. The problem is how to "request" missed messages. If the messages originate on the slaves (i.e., the slaves create the message IDs and provide the duplicate semantics), and the slave is kaput, a framework won't be able to get that missed message. We do in fact keep status updates on the slaves and send duplicates until they're acknowledged by the framework. But that only works until the machine fails. We're trying to avoid storing state on the master so that we can make it scale, but even if we did, there's still the potential that some messages are lost (i.e., if the masters are down and then the slave fails). Given all these failures at the end of the day we need some higher-level abstraction for "reconciling state".