Motivation

Before getting into issues in the Mesos Master's current design, let's look at some of the motivating design principles.

Simplicity is an important and fundamental design principle, particularly when designing a fault-tolerant, large scale, maintainable system. Accordingly, when a simpler solution was pragmatically "good enough", we have rejected more complex solutions.

In particular, it was an intentional design decision based on simplicity to have a Master without persistent state. However, this came at the cost of allowing inconsistent task state in very rare situations involving a confluence of failures across Masters and Slaves. Frameworks such as Aurora had ways of working around these issues, which is why persistence has not been added to the Master so far; the stateless Master was "good enough" to warrant simplicity's substantial benefits.

Here's how the Master works in the current design. During the Master's normal operation, when a Slave process fails, the Master detects the failure via socket closure or missed health checks. It then notifies the framework that the Slave was lost and that some of the framework's tasks may have been lost, as shown here:

[Diagram: Slave failure]

This works well because:

  • The Master is long-running; it typically only restarts due to manual restarts or machine / network failures.
  • A new Master is elected in < 10 seconds. (We use a 10 second ZooKeeper session timeout).
  • Slaves recover from failures; the above scenario requires a Slave's permanent failure, which is rare in practice.

However, Master failovers cause consistency issues due to the lack of persistence. Consider the same scenario, but with a Master failover:

[Diagram: Slave failure during a Master failover]

Updates for the lost tasks were never sent to the framework. At first glance it would seem possible for the scheduler to fully reconcile this state against the Master by asking "What is the status of my tasks?". Since the tasks are unknown to the Master, it could reply with TASK_LOST messages. However, this is trickier than it seems: what if the Master has just recovered from a failover, and hasn't yet learned of all the tasks? There's also the following consistency issue in the presence of network partitions:

[Diagram: Slave partitioned from the Master]

In this case, there was a network partition between the Master and the Slave, resulting in the Master notifying the Framework that the tasks and Slave were lost. During the partition, the Master fails over to a new Master, which knows nothing of the partitioned Slave. Consequently, when the partition is restored, the new Master allows the Slave to re-register, despite the framework having been told that the Slave was lost!

These inconsistencies can be solved through the Registrar: the addition of a minimal amount of persistent state to the Master.

Design

We propose adding a minimal amount of persistent replicated state in the Master so as to guarantee eventual consistency for Slave information. This state...

  • Initially will be the set of Slaves managed by the Master. 
  • Will be authoritative; if a Slave is not present in this state, it cannot re-register with the Master.
  • Is called the Registry.
    • The initial Registry only stores Slave information. 
    • We plan to add additional state into the Registry later on.

We also add the Registrar, which is...

  • A component in the Master responsible for managing the Registry using the State abstraction.
    • The State implementation can vary between ZooKeeper, the Replicated Log, and LevelDB (see below for more details).
  • An asynchronous libprocess Process (think: Actor) inside the Master process as shown here:

[Diagram: Master process]

The Master first consults the Registrar when dealing with registration, re-registration, and Slave removal. When the Registrar replies to the Master, the modified state has been persisted to the Registry and the Master can proceed with the remainder of the operation. This applies the same principle as write-ahead logging to ensure atomicity and durability.
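
To make this flow concrete, here is a minimal sketch, assuming libprocess-style actors and futures; the class and method names (RegistrarProcess::admit, RegistrarProcess::remove, MasterProcess::_registerSlave) are illustrative, not the actual Mesos source:

// Illustrative sketch only; assumes libprocess-style actors and futures.
// Header paths and names approximate the Mesos codebase but are not the
// actual source.

#include <mesos/mesos.hpp>        // SlaveInfo.

#include <process/defer.hpp>
#include <process/dispatch.hpp>
#include <process/future.hpp>
#include <process/process.hpp>

#include <stout/lambda.hpp>       // lambda::_1.

using process::Future;

// Hypothetical Registrar actor: the returned futures are satisfied only
// after the corresponding mutation is durable in the Registry.
class RegistrarProcess : public process::Process<RegistrarProcess>
{
public:
  Future<bool> admit(const mesos::SlaveInfo& info);   // Add to the Registry.
  Future<bool> remove(const mesos::SlaveInfo& info);  // Remove from the Registry.
};

class MasterProcess : public process::Process<MasterProcess>
{
public:
  explicit MasterProcess(const process::PID<RegistrarProcess>& _registrar)
    : registrar(_registrar) {}

  void registerSlave(const mesos::SlaveInfo& info)
  {
    // Write-ahead: persist the admission first, act on it afterwards.
    process::dispatch(registrar, &RegistrarProcess::admit, info)
      .onAny(process::defer(
          self(), &MasterProcess::_registerSlave, info, lambda::_1));
  }

  void removeSlave(const mesos::SlaveInfo& info);  // See the removal sketch below.

private:
  void _registerSlave(const mesos::SlaveInfo& info, const Future<bool>& admitted)
  {
    if (!admitted.isReady() || !admitted.get()) {
      return;  // The Registry write failed; the Slave is not admitted.
    }

    // Only now do we update in-memory state and send SlaveRegisteredMessage.
  }

  void _removeSlave(const mesos::SlaveInfo& info, const Future<bool>& removed);

  const process::PID<RegistrarProcess> registrar;
};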

Consider the Registrar's normal operation with respect to a Slave's lifecycle:

[Diagram: Registrar adding a Slave]

Note that we only register the Slave once it has been persisted in the Registry; after this point, the Slave is allowed to re-register:

[Diagram: Registrar handling Slave re-registration]

Again, note that the Slave is only re-registered once the Master confirms the Slave is present in the Registry.

Now, consider Slave removal, when a Slave is no longer reachable:

[Diagram: Registrar handling Slave removal]

Due to the Registry's write-ahead nature, the framework is only informed once we've removed the Slave from the Registry. This signal is not delivered reliably. That is, if the Master fails after persisting the removal, but before notifying frameworks, the signal is dropped. Likewise, if a framework fails before receiving the signal, it is dropped. We deal with this through a reconciliation mechanism against the Master (see Consistency below).
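
Continuing the illustrative sketch from above (these are the removeSlave / _removeSlave members declared there), the removal path follows the same write-ahead ordering: frameworks are notified only after the removal is durable, and a failure between the two steps simply drops the best-effort signal.

// Continues the illustrative MasterProcess / RegistrarProcess sketch above.

void MasterProcess::removeSlave(const mesos::SlaveInfo& info)
{
  // Write-ahead: persist the removal before telling anyone about it.
  process::dispatch(registrar, &RegistrarProcess::remove, info)
    .onAny(process::defer(
        self(), &MasterProcess::_removeSlave, info, lambda::_1));
}

void MasterProcess::_removeSlave(const mesos::SlaveInfo& info, const Future<bool>& removed)
{
  if (!removed.isReady() || !removed.get()) {
    return;  // Could not persist the removal; the Slave remains admitted.
  }

  // The removal is now durable: this Slave will never be re-admitted.
  // Only at this point do we send LostSlaveMessage / TASK_LOST updates.
  // If the Master fails right here, the best-effort signal is dropped and
  // frameworks recover the information via reconciliation (see below).
}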

To observe how we prevent inconsistencies, consider the partition scenario from the Motivation section above:

[Diagram: Registrar handling a partitioned Slave]

We can see that once removed from the Registry, the Slave is never allowed to re-register. This is key to correctness: once decided, we must not reverse our decision to remove the Slave.  

We propose the following registry format, which may be amended based on feedback or new inspiration:

// The Registry is a single state variable, to enforce atomic versioned writes.
message Registry {
  // Leading master. This ensures the Registry version changes when an election occurs.
  required Master master;

  // All admitted slaves.
  required Slaves slaves;

  // Other information in the future.
}

// This wrapper message ensures we have a single Message to represent the Master,
// in case the Master is stored separately in the future.
message Master {
  required MasterInfo master;
}

// This wrapper message ensures we have a single Message to represent Slaves,
// in case Slaves are stored separately in the future.
message Slaves {
  repeated Slave slaves;
}

// This wrapper message ensures we can store additional information (missing
// from SlaveInfo) if needed.
message Slave {
  required SlaveInfo info;
}

Consistency

Use of the Registrar does not guarantee that frameworks receive the lost Slave signal: we do not persist receipt of the message, so frameworks may not receive lost Slave messages in some circumstances (for example, the Master dying immediately after removing a Slave from the Registry). As a result, frameworks must do the following to ensure consistent task state (a sketch in code follows the list):

  1. Before returning from statusUpdate(), the update must be persisted durably by the framework scheduler.
  2. Periodically, the scheduler must reconcile the status of its non-terminal tasks with the Master (via reconcileTasks). (The scheduler can receive a TASK_LOST after a different terminal update, but never before; such TASK_LOST updates should be ignored.)
  3. Optionally, when receiving slaveLost(), the scheduler should transition its tasks to TASK_LOST for the corresponding Slave. This is only an optimization.
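
As a rough illustration of these three rules, a scheduler built against the C++ SchedulerDriver API might look like the sketch below; MyStore is a hypothetical durable store owned by the framework, and the remaining Scheduler callbacks are omitted.

#include <vector>

#include <mesos/scheduler.hpp>

// Hypothetical durable store owned by the framework; not part of Mesos.
class MyStore
{
public:
  void persist(const mesos::TaskStatus& status);
  void markLost(const mesos::SlaveID& slaveId);
  std::vector<mesos::TaskStatus> nonTerminalStatuses();
};

class MyScheduler : public mesos::Scheduler
{
public:
  // Rule 1: persist the update durably before returning.
  virtual void statusUpdate(
      mesos::SchedulerDriver* driver,
      const mesos::TaskStatus& status)
  {
    store.persist(status);
  }

  // Rule 3 (an optimization): eagerly mark the lost Slave's tasks TASK_LOST.
  virtual void slaveLost(
      mesos::SchedulerDriver* driver,
      const mesos::SlaveID& slaveId)
  {
    store.markLost(slaveId);
  }

  // Rule 2: invoked periodically (e.g. from a timer) to reconcile the
  // scheduler's non-terminal tasks against the Master.
  void reconcile(mesos::SchedulerDriver* driver)
  {
    driver->reconcileTasks(store.nonTerminalStatuses());
  }

  // ... the remaining Scheduler callbacks are omitted for brevity ...

private:
  MyStore store;
};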

Later, these may be done automatically for schedulers via a persistent scheduler driver. Also, reconciliation is strictly only required after a Master failover rather than periodically; however, the initial implementation will require periodic reconciliation. More thorough reconciliation documentation will be added separately from this document.

Performance

We will test the Registrar for efficiency and for its ability to scale to tens of thousands of Slaves. With respect to scalability, this design has two key aspects:

  1. The Master and Registrar interact asynchronously. The Master does not block waiting for the Registrar, and vice versa.
  2. The Registrar uses a batching technique that ensures at most one write operation is in flight at any time. While a write is in progress, all incoming requests are batched together into a single mutation of the state and applied as a single subsequent write.

1. alone is insufficient to ensure a performant Registrar, as the number of queued operations could grow unbounded. However, with both 1. and 2., the Registrar is largely immune to performance issues in the underlying state implementation. With a fast state implementation, we'll perform more writes of smaller mutations to state, and with a slow state implementation we'll perform fewer writes with larger mutations to the state.
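
Here is a minimal sketch of the batching idea (hypothetical types and names, not the actual Registrar code); persist() stands in for a single versioned write through the State abstraction, and Registry is assumed to be generated from the message definitions proposed above.

#include <functional>
#include <queue>

#include <process/defer.hpp>
#include <process/future.hpp>
#include <process/process.hpp>

#include <stout/lambda.hpp>

// Assumed to be generated from the Registry message proposed above.
#include "registry.pb.h"

// A mutation is a function applied to the in-memory copy of the Registry.
typedef std::function<void(Registry*)> Mutation;

class BatchingRegistrar : public process::Process<BatchingRegistrar>
{
public:
  BatchingRegistrar() : writing(false) {}

  void apply(const Mutation& mutation)
  {
    pending.push(mutation);
    maybeWrite();
  }

private:
  void maybeWrite()
  {
    if (writing || pending.empty()) {
      return;  // A write is already in flight, or there is nothing to do.
    }

    // Fold every queued mutation into a single new version of the Registry.
    while (!pending.empty()) {
      pending.front()(&registry);
      pending.pop();
    }

    writing = true;
    persist(registry)  // One versioned write through the State abstraction.
      .onAny(process::defer(self(), &BatchingRegistrar::written, lambda::_1));
  }

  void written(const process::Future<bool>& result)
  {
    writing = false;
    maybeWrite();  // Mutations that arrived during the write go out next.
  }

  // Hypothetical: a single write of the serialized Registry.
  process::Future<bool> persist(const Registry& updated);

  Registry registry;
  std::queue<Mutation> pending;
  bool writing;
};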

State Implementation

ZooKeeper

Mesos currently supports ZooKeeper as a storage layer for State. However, there are two issues with the use of ZooKeeper in the Registrar:

  1. ZooKeeper's default znode size limit is just under 1MB (see jute.maxbuffer). ZooKeeper was also designed to store data on the order of kilobytes in size, so raising this limit is considered unsafe. The Registrar's initial design treats the Slaves as a single large versioned blob, which can easily exceed 1MB for thousands of Slaves (see the rough estimate below).
  2. Because we treat the Slaves as a single versioned blob, it is undesirable to split our data across znodes. This would add additional complexity, and we would lose our atomic writes unless we also implemented transactional writes in our ZooKeeper storage layer.
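
As a rough back-of-the-envelope estimate: assuming each serialized Slave entry (hostname, resources, and a few attributes) is on the order of 100-200 bytes, 10,000 Slaves would serialize to roughly 1-2MB, so even a few thousand Slaves with richer attributes can approach the default znode limit.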

Therefore, although support will be provided, ZooKeeper should only be used for small Mesos clusters (< 100 Slaves) whose Slaves do not carry large amounts of metadata (e.g. SlaveInfo attributes).

Replicated Log

Mesos will soon support the Replicated Log as a state storage backend.

  • The Replicated Log has an atomic append operation for blobs of unbounded size. 
  • When using the Replicated Log, Mesos must run in high availability mode (multiple masters running as hot spares), so that it has a quorum of online log replicas. 
  • The Replicated Log has a safety constraint that at most one log co-ordinator has write access to the log. The safety implications of this are discussed below.

LevelDB

New Mesos users frequently run Mesos in non-high-availability mode, with a single Master and potentially no ZooKeeper cluster available. These users can use LevelDB as the state storage backend, which means that moving a Master to another machine requires moving the Master's work directory (containing the LevelDB state) to the new machine.

Safety and State Initialization

The first safety concern is guaranteeing that only the leading Master can make a successful update to the Registry. The following properties help provide this guarantee:

  1. The State abstraction uses a versioning technique to deny concurrent writes to the same variable. We will store the Registry in a single State Variable, which enforces consistent versioning (a sketch of such a versioned write follows this list). This means that if a rogue Master attempts a write to the Registry after the leading Master has performed a successful write, the rogue Master's write fails. However, there is a race: the rogue Master could write first and cause the leading Master's write to fail.
  2. To deal with this race, when a leading Master recovers state from the Registrar, the Registry is updated with the leader's MasterInfo. This ensures the Registry version updates as a result of an election, thus preventing a rogue Master from performing any writes.
  3. In addition to these two safety measures, if the replicated log is used, the rogue Master cannot perform a write as the rogue log co-ordinator loses its leadership when the new leading Master performs a read of the state. This helps prevent any writes in the rogue Master.
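
As a sketch of such a versioned write, assuming the State abstraction's fetch() / store() interface (where store() returns None when the variable was modified since it was fetched); header and namespace paths here are illustrative.

#include <string>

#include <process/future.hpp>

#include <stout/option.hpp>

#include "registry.pb.h"   // Assumed: generated from the Registry message above.
#include "state/state.hpp" // Illustrative path to the State abstraction.

using mesos::internal::state::State;
using mesos::internal::state::Variable;

// Attempts one optimistic, versioned write of the full Registry. The result
// is false if another writer (e.g. a rogue Master) raced us and won.
process::Future<bool> updateRegistry(State* state, const Registry& updated)
{
  return state->fetch("registry")
    .then([=](const Variable& variable) {
      // The fetched Variable carries the version; mutate() preserves it.
      Variable next = variable.mutate(updated.SerializeAsString());

      return state->store(next)
        .then([](const Option<Variable>& stored) {
          return stored.isSome();  // None means the versioned write was rejected.
        });
    });
}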

Related to upgrades and initialization, there are a few important tradeoffs between safe operation and ease of use. Let's begin with some questions that should be asked:

What occurs when state is empty? With a simple implementation, it appears as an empty cluster. Most of the time this is because the cluster is indeed empty, or is being started for the first time. However, consider an accidental misconfiguration of the Master that points a production cluster at empty state, or, less likely, the loss of all log replicas or of the ZooKeeper data. In these cases, the Master should not proceed happily assuming there are no Slaves in the cluster (every Slave would be refused re-admission to the Registry).

How does one upgrade? Done naively with a simple implementation, this also appears as an empty cluster, resulting in a restart of all of the Slaves (they would be refused re-admission into the Registry).

From these questions you can derive the following important requirements:

  1. Safe upgrade functionality must be provided. This is essential for upgrading a running cluster.
  2. For production users, we must provide the ability to require an initialization of the Registry, allowing differentiation between a first run and misconfiguration / data loss.

It is interesting to consider satisfying 2. through 1.: if the state is empty, enter a temporary upgrade period in which all Slaves are permitted to re-register. This keeps the cluster operational; however, it exposes the inconsistencies described in the Motivation section. This solution's implicit nature is also undesirable (operators may want to be alerted and to intervene to recover data when this occurs).

Proposed Initialization and Upgrade Scheme

To determine whether the Registry state is initialized, we will persist an 'initialized' variable in the Registry. Two new flags provide upgrade semantics and state initialization (a sketch of the resulting recovery behavior follows the list):

  1. --registrar_upgrade=bool (default: false) - When true, the Registrar auto-initializes state and permits all re-registrations of Slaves. This lets the Registry be bootstrapped with a running cluster's Slaves.
  2. --registrar_strict=bool (default: false) - Only interpreted when not upgrading (--registrar_upgrade=false). When true (strict), the Registrar fails to recover when the state is empty. When false (non-strict), the Registrar auto-initializes state when the state is empty.
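
The following sketch shows how these flags might gate recovery; the Flags struct and the checkRecovery() helper are hypothetical, but the decisions mirror the semantics described above.

#include <string>

#include <stout/none.hpp>
#include <stout/option.hpp>

struct Flags
{
  bool registrar_upgrade;  // --registrar_upgrade (default: false)
  bool registrar_strict;   // --registrar_strict  (default: false)
};

// Returns an error if recovery must fail; None means recovery proceeds.
Option<std::string> checkRecovery(const Flags& flags, bool registryInitialized)
{
  if (flags.registrar_upgrade) {
    // Upgrade mode: auto-initialize if needed and admit all re-registrations.
    return None();
  }

  if (!registryInitialized && flags.registrar_strict) {
    // Strict mode refuses to run against empty state: either an
    // uninitialized first run, a misconfiguration, or data loss.
    return std::string("Registry is not initialized; refusing to recover");
  }

  // Non-strict: auto-initialize empty state and proceed.
  return None();
}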

As we add additional information to the Registry in the future, it will be important to consider the upgrade semantics for cases where we only want to bootstrap a portion of the Registry. This could be done by updating the --registrar_upgrade flag to take the name(s) of the registry entries to bootstrap.

Let's look at how the various use cases map to these flags below.

Upgrade

When upgrading from pre-Registrar Mesos to Registrar Mesos:

  1. Stop all Masters, and upgrade the binaries.
  2. Start all Masters with --registrar_upgrade.
  3. After all Slaves have re-registered with the Master, stop all Masters.
  4. Start all Masters without --registrar_upgrade, optionally setting --registrar_strict if desired.

Starting a Cluster for the First Time

When starting a cluster for the first time:

  1. Start all Masters without --registrar_strict.
  2. Optionally, stop all Masters and restart them with --registrar_strict, if desired.

26 Comments

  1. I have a handful of questions, but I'll ask them in separate comments to allow for independent discussion threads.

    so as to guarantee eventual consistency for Slave information

    Can you elaborate on how the state is eventually-consistent, and what (if any) are the implications of that?  The rest of the document seems to describe the necessity for fully-consistent state.

    1. If a master is partitioned for some period of time it will not be able to learn about state updates, but it will eventually learn about those updates. Of course, if a master is partitioned and it can't learn about state updates it also can't make state updates.

  2. How does one reconcile the situation where a slave wants to re-register and the master rejects due to lack of state about the slave?  Will/does the slave have a special (documented) exit code when it fails to re-register?

    1. If a slave is not known, the master will reject the slave. I like the idea of using a different exit code. Note that this is precisely why the "upgrade" semantics were considered.

      1. Right, but what happens after that?  Is human intervention required to successfully register a rejected slave?

        1. If an operator is using a process monitor then the slave will get restarted and it will register (not reregister) with the master and be able to join the cluster. If no process monitor is being used an operator will need to go manually restart the slave, similar to the case where a slave had crashed.

  3. Regarding the protobuf message layout: have you thought through the implications of writing full state on every mutation?  This seems like a non-issue for a cluster that has quiesced, but potentially problematic when slaves are churning, or large batches of slaves are added.

    1. Yes. You can see the current implementation of the Registrar here where we already batch mutations. That means that we'll write as fast as the underlying storage allows us and in the mean time batch together all mutations for the next write. This should be optimal w.r.t. writing the full state on every mutation.

      Of course, we still might not want to write the full state on every mutation. We plan on doing performance evaluations to determine the limits of this approach. Depending on those limits we'll begin implementing some of our storage optimizations. The primary optimization will be to compute "diffs" of each serialized protobuf and only write those, applying and snapshotting those diffs as required.

  4. Why the bifurcation of Registry implementations backed by ZooKeeper and LevelDB?  If you continue down this path, will there be a bridge to move state from one to another for cluster admins that change their decision?

    1. This is mostly an artifact of the State abstraction already having multiple storage backends. I like the idea of a bridge tool (it might be a great GSOC project).

      1. My suggestion is actually to pick a winner and make that one really good, rather than having two with their own setup and caveats.  Is there a downside to doing this (other than a potential deprecation cycle)?

        1. Yes, I like the idea of picking a winner (especially without a migration tool).

          1. Previously I intended LevelDB to be used for testing, local runs, and non-HA setups. I didn't think at the time of just using the Replicated Log with a quorum of 1 (the C++ API for the Log does not require ZooKeeper). With this in mind, we could only support the replicated log, removing ZooKeeper and LevelDB Registry support.

  5. The case made for not using ZooKeeper as the sole implementation is weak; can you add more to justify the decision to use the replicated log?  Additionally, splitting the data is not prohibitively challenging AFAICT, and seems to benefit both the replicated log and ZooKeeper approaches.

    1. Splitting the data is not the biggest issue (and I agree, it could be helpful for the replicated log as well if we don't use our "diff" approach mentioned above); the biggest issue is having guarantees on atomic updates (across all znodes in the case of ZooKeeper, even the ones you're not currently updating). This can be accomplished using transactional writes as Ben pointed out, but it adds more complexity (it's not well known in practice, it has some implementation quirks of its own, and we'd have to build it) and can introduce a performance hit. We want to target the replicated log because a write-ahead log more naturally represents the abstraction that we're looking for. Moreover, we don't need to build it and it's been running in production at Twitter for >2 years. Finally, there are many people in the community that have asked if Mesos can be used without the ZooKeeper dependency, which this will enable. And to your point above, we'd rather have people start using the replicated log storage rather than having to migrate later if they want to move away from ZooKeeper. It's likely we shouldn't even allow people to start with ZooKeeper as the underlying storage backend.

  6. We will test the Registrar for efficiency and being able to scale to over 10,000 Slaves.

    Can you include some back of the envelope calculations of the Registry message size based on known parameters?  Also, can you set more firm acceptance criteria than "over 10,000 slaves"?

    1. Anonymous

      Please update "over 10,000 slaves" to a much higher number.

      1. Sure! We will definitely benchmark with more than 10,000 as well.

    2. My initial performance testing with simulated cluster data indicated that 10,000 slaves with some attributes and additional resources will consume < 2MB. This will need to be revised as the message format has been updated, and I'd like to benchmark using more slave attributes to further inflate the data.

      For acceptance criteria, the design will support any number of slaves, but we'll want to quantify the various events during the performance testing:

      1. Full cluster bootstrap (This will likely be the slowest operation).
      2. Slave Registration(s)
      3. Slave Re-registration(s)
      4. Slave Removal(s)
      5. Slave churn

      I don't have any hard criteria here yet for each of these, but we will definitely be measuring all of these types of things.

  7. Should --registrar_upgrade instead be something like --registrar_initialize?  The term 'upgrade' is missing from the description, so the presence in the arg is somewhat confusing.  Is this a 'run-once' arg?  i.e. if a master is running with 'true', will it abort if it discovers existing state?  (This would prevent accidentally running indefinitely with this arg.)
    1. If it's intended to be a "run-once" flag have you considered adding a tool or subcommand (e.g. mesos-init or mesos-master init) to initialize this state?

      1. A run-once argument is operationally more complex because what happens when a failure occurs during this one run? The Masters will subsequently need to flap, which produces a window in which operators must intervene in a timely manner to manually wipe the state and try again.

        Making it a subcommand actually requires a Master to start, re-register all slaves, persist them, and terminate when complete, all while not communicating with the Framework.

        --registrar_upgrade is designed to allow the state to bootstrap, if a failure occurs and the Masters restart with --registrar_upgrade, the bootstrapping process continues. This flag can remain for an arbitrary amount of time, which is equivalent to operating in a pre-Registrar world: all slaves can re-register. Perhaps a better name is --registrar_bootstrap?

  8. Will there be a way for a cluster administrator to "dump" the state prior to an upgrade and do an offline restore in the case of a rollback? I'd think something along the lines of

    mesos-dump --zk=zk://example.com/mesos/master > master-state.proto
    # Stop mesos-master, catastrophic upgrade failure
    mesos-restore --location-of-state < master-state.proto
    # Start mesos-master
    1. I assume you are referring to the case when we are upgrading in the future, where the Registry is destroyed in the process of an upgrade?

      We may possibly need something for this in the future as the Registry grows. For now, the --registrar_upgrade flag is sufficient and less dangerous than what you suggest:

      1. Restoring state from old data could cause the removal of many slaves.
      2. --registrar_upgrade is just as manual, and will allow all slaves to re-register to restore the current state.
  9. Have you considered solutions other than reconciliation for maintaining consistency? E.g., adding message IDs to messages so frameworks can detect missed/duplicate messages, then providing a mechanism for frameworks to request missed messages?

    1. The problem is how to "request" missed messages. If the messages originate on the slaves (i.e., the slaves create the message IDs and provide the duplicate semantics), and the slave is kaput, a framework won't be able to get that missed message. We do in fact keep status updates on the slaves and send duplicates until they're acknowledged by the framework. But that only works until the machine fails. We're trying to avoid storing state on the master so that we can make it scale, but even if we did, there's still the potential that some messages are lost (i.e., if the masters are down and then the slave fails). Given all these failures at the end of the day we need some higher-level abstraction for "reconciling state".