Problem:


There are several cases where a framework would need to update its FrameworkInfo.

For example, to set fields that are newly added to FrameworkInfo (e.g., ‘principal’).

Or, to update its configuration, e.g., to change the ‘user’ (MESOS-703) or ‘failover_timeout’ (MESOS-1218). In fact, most frameworks today already update their FrameworkInfo on failover because they (or the driver on their behalf) set ‘hostname’ to the current instance’s host.


Currently, there is no easy way for a framework to update its FrameworkInfo, i.e., while keeping its FrameworkID. The only way an updated FrameworkInfo is correctly reflected across the Mesos cluster is when 1) the leading master is restarted and 2) the slaves are drained and restarted.


This is what currently happens when a framework re-registers with an updated FrameworkInfo. The leading master and all the slaves ignore the updated FrameworkInfo until the leading master restarts. After a master failover, the new leader accepts the updated FrameworkInfo but doesn’t inform the slaves that are currently running the framework’s tasks about it. So any new task/executor launched on such a slave will ignore the updated FrameworkInfo (e.g., framework checkpointing). A slave only gets the updated FrameworkInfo when one of the framework’s tasks is launched on it and the slave doesn’t already know about the framework, i.e., there are no active tasks/executors belonging to the framework on that slave. Clearly this is unintuitive and error prone.


Goal:

Allow frameworks to update their FrameworkInfo without having to restart masters, slaves, or tasks/executors. In other words, the updated FrameworkInfo should be properly reconciled across the cluster in a seamless fashion.


Non Goal:

Persisting and updating the FrameworkInfo in distributed storage (the replicated log) is out of scope. This is tracked in MESOS-1719.


FrameworkInfo:

The following table lists the different fields of FrameworkInfo and the different components in the Mesos stack that are impacted by them.


Legend:

  • * : Optional field

  • Yes : Component impacted by the FrameworkInfo field

  • No : Component not impacted by the FrameworkInfo field

  • N/A : Not applicable (because update is not allowed)


 

 

 

Field               master/master   master/allocator   master/http   slave/slave   slave/http
user                Yes             No                 Yes           Yes           Yes
name                No              No                 Yes           No            Yes
id*                 N/A             N/A                N/A           N/A           N/A
failover_timeout*   Yes             No                 Yes           No            Yes
checkpoint*         Yes             No                 Yes           Yes           Yes
role*               Yes             Yes                Yes           No            Yes
hostname*           No              No                 Yes           No            Yes
principal*          Yes             No                 Yes           No            Yes
capabilities*       Yes             Yes                Yes           No            Yes

 

 

Updating the components:


Framework:

This is straightforward for frameworks. The framework creates a new MesosSchedulerDriver with the updated FrameworkInfo (likely after a framework process restart). Under the covers, the driver re-registers with the master using the updated FrameworkInfo.
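For illustration, here is a minimal sketch (not part of the original design) of a scheduler re-registering with an updated FrameworkInfo; ‘MyScheduler’, the framework ID value, and the master address are placeholders:

#include <mesos/scheduler.hpp>

using namespace mesos;

// MyScheduler is a user-provided Scheduler implementation (assumed to exist).
int main()
{
  FrameworkInfo framework;
  framework.mutable_id()->set_value("<previously-assigned-framework-id>");
  framework.set_name("my-framework");
  framework.set_user("new-user");                     // updated field
  framework.set_principal("new-principal");           // updated field
  framework.set_hostname("scheduler-2.example.com");  // updated field

  MyScheduler scheduler;

  // Because FrameworkInfo.id is set, the driver re-registers with the
  // master instead of registering as a brand new framework.
  MesosSchedulerDriver driver(
      &scheduler, framework, "zk://zk.example.com:2181/mesos");

  return driver.run() == DRIVER_STOPPED ? 0 : 1;
}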

Master:

The master gets the updated FrameworkInfo via the re-registration message (ReregisterFrameworkMessage). If the FrameworkInfo has changed, the master updates its own internal data structures.

Allocator:

The master updates the allocator by calling Allocator::frameworkUpdated(), a new call that needs to be added to the Allocator API.
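A rough sketch of what this new call could look like (the exact signature is an assumption; only the name comes from this document):

class Allocator
{
public:
  // ... existing callbacks elided ...

  // Invoked by the master when a framework re-registers with a changed
  // FrameworkInfo, so the allocator can refresh its per-framework state
  // (e.g., 'role' and 'capabilities').
  virtual void frameworkUpdated(
      const FrameworkID& frameworkId,
      const FrameworkInfo& frameworkInfo) = 0;
};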

Slave:

The master also updates the slaves that have active tasks/executors of the framework via the UpdateFrameworkMessage. The UpdateFrameworkMessage already exists and is sent on framework re-registration, but a new ‘framework_info’ field needs to be added to it:


message UpdateFrameworkMessage {
  required FrameworkID framework_id = 1;
  required string pid = 2;
  optional FrameworkInfo framework_info = 3;  // this is new
}

The slave updates its internal data structures on receipt of the UpdateFrameworkMessage.
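A sketch of how the slave's existing handler might be extended (member and helper names below are illustrative, not actual Mesos code):

void Slave::updateFramework(const UpdateFrameworkMessage& message)
{
  Framework* framework = getFramework(message.framework_id());
  if (framework == NULL) {
    // No active tasks/executors for this framework on this slave.
    return;
  }

  // Existing behavior: refresh the scheduler driver's pid.
  framework->pid = message.pid();

  // New behavior: refresh the FrameworkInfo if it was included.
  if (message.has_framework_info()) {
    framework->info.CopyFrom(message.framework_info());

    if (framework->info.checkpoint()) {
      // Re-checkpoint the updated FrameworkInfo (see Phase #1 below).
      checkpointFramework(framework);  // illustrative helper
    }
  }
}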

Update semantics:

In the spirit of iterative development, the update strategy has been broken down into several phases. All the phases could (and probably should) go into the same release.


Phase #1 (HTTP only fields):

Allow updates of the fields that only impact the HTTP endpoints of master and/or slave, i.e., ‘name’ and ‘hostname’. Note that FrameworkInfo will still be updated on both masters and slaves.

This should be pretty straightforward. On framework re-registration, the master updates its internal data structures, the allocator, and the slaves. The slave updates its ‘frameworks’ map and checkpoints it (if framework checkpointing is enabled), but doesn’t change the ‘Framework.executors’ map.

Phase #2 (Master only fields):

Also allow updates to fields that only impact the master (but not the allocator), i.e., ‘failover_timeout’ and ‘principal’.

FrameworkInfo.failover_timeout:

This should be straightforward since the field is only used in Master::exited(). After the framework re-registers, future disconnections of the framework will honor the new failover_timeout.
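For illustration, a sketch of how the (possibly updated) timeout would be picked up at disconnection time; the handler name ‘frameworkFailoverTimeout’ and the member names are assumptions:

void Master::exited(Framework* framework)
{
  // Read the timeout from the framework's *current* FrameworkInfo, so a
  // value updated on re-registration is honored at the next disconnection.
  Try<Duration> timeout =
    Duration::create(framework->info.failover_timeout());

  if (timeout.isSome()) {
    delay(timeout.get(),
          self(),
          &Master::frameworkFailoverTimeout,  // assumed handler
          framework->id);
  }
}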

FrameworkInfo.principal:

This is a bit involved because ‘principal’ is used for authentication and rate limiting.

The authentication part is straightforward because a framework with an updated ‘principal’ should authenticate with the new ‘principal’ before being allowed to re-register. The ‘authenticated’ map already gets updated when the framework disconnects and reconnects, so this is fine.

For rate limiting, Master::failoverFramework() needs to be changed to update the principal in the ‘frameworks.principals’ map and also remove the metrics for the old principal if no other frameworks use that principal (similar to what we do in Master::removeFramework()).
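A sketch of that bookkeeping change, expressed as a hypothetical helper that Master::failoverFramework() could call ('frameworks.principals' and the metrics layout follow the description above; exact types are assumptions):

void Master::updatePrincipal(
    Framework* framework,
    const FrameworkInfo& newInfo)
{
  const std::string oldPrincipal = framework->info.principal();
  const std::string newPrincipal = newInfo.principal();

  if (oldPrincipal == newPrincipal) {
    return;
  }

  // Point the rate-limiting map at the new principal.
  frameworks.principals[framework->pid] = newPrincipal;

  // Remove the per-principal metrics if no other framework still uses the
  // old principal (mirroring Master::removeFramework()).
  bool oldPrincipalStillUsed = false;
  foreachvalue (const Option<std::string>& principal, frameworks.principals) {
    if (principal.isSome() && principal.get() == oldPrincipal) {
      oldPrincipalStillUsed = true;
      break;
    }
  }

  if (!oldPrincipalStillUsed) {
    metrics.frameworks.erase(oldPrincipal);  // assumed metrics layout
  }
}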

The Master::visit() and Master::_visit() should work with the current semantics.

Phase #3 (Master + Slave fields, but not the Allocator):

Also allow updates to fields that impact the slave, i.e., ‘user’ and ‘checkpoint’.

The update semantics for tasks/executors are as follows:

 

 

                Old Task                     New Task
Old Executor    Honors old FrameworkInfo     Honors old FrameworkInfo*
New Executor    N/A                          Honors new FrameworkInfo

 

In a nutshell, the new FrameworkInfo is only applicable to new executors.

* This is for simplicity and ease of implementation. We can revisit this in the future.


FrameworkInfo.user:

This should be straightforward. The master doesn’t care about this field. The slave uses ‘FrameworkInfo.user’ when launching a new executor only if ‘CommandInfo.user’ is not specified on the task/executor itself. So all new executors (and tasks without executors) will be launched with the updated ‘FrameworkInfo.user’ when ‘CommandInfo.user’ is not set.
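A minimal sketch of the resolution order described above ('resolveUser' is a hypothetical helper, not an existing Mesos function):

#include <string>

#include <mesos/mesos.hpp>

// Returns the user a new executor (or command task) should run as.
std::string resolveUser(
    const mesos::CommandInfo& command,
    const mesos::FrameworkInfo& frameworkInfo)
{
  // CommandInfo.user, if set on the task/executor, always wins.
  if (command.has_user()) {
    return command.user();
  }

  // Otherwise fall back to the (possibly updated) FrameworkInfo.user.
  return frameworkInfo.user();
}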

FrameworkInfo.checkpoint:

The master uses this field to determine the course of action when a slave gets disconnected. If ‘checkpoint’ is false, the master immediately removes the framework’s tasks/executors when a slave gets “disconnected” (i.e., restarted), irrespective of whether the slave is checkpointing or not. If ‘checkpoint’ is true, the master allows the disconnected (checkpointing) slave to re-register within a timeout.
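A sketch of the master-side decision (the helpers below are illustrative, not actual Mesos code):

void Master::slaveDisconnected(Slave* slave)
{
  foreach (Framework* framework, getFrameworksOnSlave(slave)) {  // assumed helper
    if (!framework->info.checkpoint()) {
      // Non-checkpointing framework: remove its tasks/executors on this
      // slave right away.
      removeFrameworkTasks(framework, slave);  // assumed helper
    }
    // Checkpointing frameworks: keep the tasks/executors and wait for the
    // slave to re-register within the re-registration timeout.
  }
}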


For simplicity, we should keep the same semantics when the framework updates this field. But there are some subtle implications for frameworks that are worth noting:

  • If checkpoint changes from true to false, the master will immediately remove tasks/executors from a disconnected slave, even if the slave has old running tasks that are checkpointing.

  • If checkpoint changes from false to true, the master doesn’t immediately remove tasks/executors from a disconnected slave but allows it to re-register within a timeout. If the slave had non-checkpointing tasks/executors, the master will learn that they are lost once the slave re-registers or the slave re-registration timeout (75s) fires, and will then inform the framework about the LOST tasks.

The slave uses this field to determine whether to checkpoint information about a task/executor. All new tasks and executors will honor the updated ‘checkpoint’ value, while old tasks and new tasks on old executors will honor the old value.

Phase #4 (Master + Slave + Allocator fields):

Also allow updates to fields that impact the allocator, i.e., ‘role’ and 'capabilities'.

These are the most tricky and involved fields to update!

There should be no outstanding offers with the old FrameworkInfo, because the FrameworkInfo can only be updated after a framework failover: when the old framework disconnects, its offers are rescinded, and a failed-over framework doesn’t get new offers until it re-registers with its (possibly updated) FrameworkInfo.

Now, the tricky part concerns the resources that are currently in use by tasks/executors. To better understand the complexity, let’s look at an example of updating the role.

Consider a single-node cluster that has the following reservations on the slave:

 

 

Role    # CPUs
*       10
foo     5
bar     2

 

Now, let’s assume a framework registered with role ‘foo’ is using 15 cpus (10 + 5) to run tasks. If the framework wants to update its role to ‘bar’, we cannot simply tell the allocator that role ‘bar’ has now been allocated 15 cpus, because this role can only ever be allocated 12 cpus (10 + 2).


Currently, there is no mechanism for the allocator to reject such updates because all its API methods return ‘void’ and hence are expected to succeed. The master cannot detect this discrepancy by itself because it doesn’t keep track of allocations per role; it delegates that to the allocator.
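One way to express the feasibility check the allocator would need, using the Resources API ('roleChangeFeasible' is a hypothetical helper):

#include <mesos/resources.hpp>

using mesos::Resources;

// A role change is only safe if the resources currently allocated to the
// framework fit within everything the new role could ever be allocated
// (its own reservation plus the unreserved '*' resources).
bool roleChangeFeasible(
    const Resources& allocated,     // e.g., 15 cpus used by running tasks
    const Resources& newRoleTotal)  // e.g., 12 cpus = 10 ('*') + 2 ('bar')
{
  // Flatten both sides so that only quantities (not reservation roles)
  // are compared.
  return newRoleTotal.flatten().contains(allocated.flatten());
}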

FrameworkInfo.role:

There are a few options here.  

Option #1:

Don’t allow updates to FrameworkInfo.role. Send a FrameworkError message if a framework tries to update its ‘role’. The only tricky part is that this cannot be reliably enforced immediately after a master failover, without framework persistence, because master cannot reliably know that a framework is re-registering with an updated FrameworkInfo.role. Nonetheless, this should probably be the strategy for the first cut. 

Option #2:

Same as above but implement framework persistence to reliably enforce it.

Option #3:

Make Allocator::frameworkUpdated() return a Future (instead of void) that fails when there are not enough resources in the new ‘role’ to satisfy the resources required by the running tasks/executors.
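The signature would change along these lines (a sketch; the use of process::Future is an assumption):

// Returns a failed Future if the resources currently allocated to the
// framework cannot be accommodated under the new 'role'.
virtual process::Future<Nothing> frameworkUpdated(
    const FrameworkID& frameworkId,
    const FrameworkInfo& frameworkInfo) = 0;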

Option #4:

Allow framework to register with multiple roles. This needs a change to the FrameworkInfo protobuf, but this is something we want to do anyway. Once we add this support, we can always support frameworks adding roles but not removing roles.

Option #5:

Same as Option #4 but also allow removing roles.   

 

FrameworkInfo.capabilities:

 

This should be pretty straightforward if the framework adds new capabilities (e.g., the ability to receive revocable resources). On re-registration, Allocator::frameworkUpdated() will be called with the new FrameworkInfo, which will cause the (hierarchical) allocator to update its internal 'frameworks' map. All subsequent allocations to the framework will then include any revocable resources.

It is a bit trickier if the framework removes an existing capability (e.g., revocable resources). As with adding new capabilities, once the allocator's 'frameworks' map is updated, the framework will no longer receive offers for any new revocable resources. There should be no outstanding offers with revocable resources because a framework is only expected to change its FrameworkInfo after a failover. The easiest thing to do would be to leave the revocable resources already allocated to the framework untouched in the allocator, much like we do with checkpointing. Once the tasks/executors terminate, those resources will be unallocated from the framework and given to some other framework.
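A sketch of the corresponding hierarchical allocator change (the member layout is an assumption):

void HierarchicalAllocatorProcess::frameworkUpdated(
    const FrameworkID& frameworkId,
    const FrameworkInfo& frameworkInfo)
{
  CHECK(frameworks.contains(frameworkId));

  // Refresh the cached FrameworkInfo; subsequent allocation decisions
  // (e.g., whether to offer revocable resources) consult the new
  // capabilities. Resources already allocated to the framework, including
  // revocable ones, are left untouched.
  frameworks[frameworkId].info.CopyFrom(frameworkInfo);
}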

Deployment:

The deployment of this feature should be straightforward. Master and slaves could be upgraded in any order. Frameworks should be able to update their FrameworkInfo once both masters and slaves are upgraded.

