Overview
To support long-running applications, REEF provides Driver high availability (HA), so that a crash of the Driver process does not fail the entire application. This feature is currently supported only on the REEF YARN runtime. REEF HA on YARN relies primarily on YARN’s Resource Manager High Availability (RMHA) feature. When RMHA is enabled, applications can preserve containers across application retries, so the failure of a YARN Application Master (AM) only causes the Resource Manager (RM) to resubmit the AM. The outstanding containers from the previous submission are preserved, but it is up to the application to re-associate those containers with the new AM submission. REEF’s HA feature handles this re-association, so users need only focus on their core application logic. Note that HA can be supported only if the runtime preserves worker processes on Driver failure and allows a new Driver instance to be associated with the worker processes requested by the previous Driver.
Handlers, Client Configurations, and Control Flow
Aside from the regular event handlers, REEF HA provides a few special event handlers, exposed to the client via DriverRestartConfiguration, to handle the event of a Driver resubmission:
1. DriverRestartHandler: Invoked in place of the DriverStartedHandler. Allows the user to know that there are potentially Evaluators to recover from the previous application submission. The IDs of the expected Evaluators can be retrieved from a getter on the DriverRestarted object (getExpectedEvaluatorIds() in Java). If an Evaluator has failed in the process of Driver resubmission, the DriverRestartFailedEvaluatorHandler will be invoked.
2. DriverRestartRunningTaskHandler: Once an Evaluator from a previous application submission has been recovered, the Driver inspects whether the Evaluator has a running Task. If it does, the DriverRestartRunningTaskHandler is invoked.
3. DriverRestartActiveContextHandler: If the recovered Evaluator mentioned in item 2 does not have a running Task, the DriverRestartActiveContextHandler is invoked to inform the user that a new Task can be submitted to the Evaluator.
4. DriverRestartCompletedHandler: Invoked to notify the user that the restart has completed, either after all expected Evaluators have reported back or after a client-configurable timeout has passed. A boolean getter (isTimedOut() in Java) indicates whether the handler was invoked due to a timeout. For every Evaluator from a previous submission that has not reported back to the Driver by the time the timeout passes, the DriverRestartFailedEvaluatorHandler will be invoked.
5. DriverRestartFailedEvaluatorHandler: Invoked to notify the user that an Evaluator has failed in the restart process. The Evaluator either failed on a timeout for DriverRestartCompleted or failed before the resubmitted Driver registered with the Resource Manager. The user can distinguish the two cases with the set of expected Evaluator IDs provided in the DriverRestartHandler.
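The distinction described for DriverRestartFailedEvaluatorHandler can be sketched as a simple set lookup. This is a minimal illustration, not REEF code: the class RestartFailureClassifier and its Cause enum are hypothetical names, and the sketch assumes that an Evaluator which failed before the resubmitted Driver registered with the Resource Manager does not appear in the expected set, while one that timed out does.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper, not part of the REEF API. */
final class RestartFailureClassifier {

  /** The two failure cases named in the handler description. */
  enum Cause { EXPECTED_BUT_TIMED_OUT, FAILED_BEFORE_DRIVER_REGISTERED }

  // Snapshot of the IDs the user read from getExpectedEvaluatorIds().
  private final Set<String> expectedEvaluatorIds;

  RestartFailureClassifier(final Set<String> expectedEvaluatorIds) {
    this.expectedEvaluatorIds = new HashSet<>(expectedEvaluatorIds);
  }

  /**
   * Assumption: a failed Evaluator that was in the expected set never
   * reported back in time; one outside the set failed earlier.
   */
  Cause classify(final String failedEvaluatorId) {
    return expectedEvaluatorIds.contains(failedEvaluatorId)
        ? Cause.EXPECTED_BUT_TIMED_OUT
        : Cause.FAILED_BEFORE_DRIVER_REGISTERED;
  }
}
```

A user handler would typically capture the expected IDs when the restart handler fires and consult the classifier each time a failure is reported.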
Runtime Configuration
A supported RuntimeRestartConfiguration must be merged with DriverConfiguration, in addition to DriverRestartConfiguration, because the elements used in restart are runtime dependent. As of the time of writing, YarnRuntimeRestartConfiguration is the only one under development.
Driver HA Design
The Driver keeps track of the Evaluator recovery process via a finite state machine. The recovery process always begins in the NOT_RESTARTED state. The Driver starts the recovery process by determining whether it is a resubmitted instance through information provided by the runtime. In YARN, this is observed by parsing the container ID of the AM container in which the Driver is running. If the Driver is a resubmitted instance, the process enters the BEGAN state. Previous Evaluators and their statuses are then queried from the RM. The user is notified if any Evaluator has failed during the restart process, and the process enters the IN_PROGRESS state as the Driver waits for Evaluators to report back.
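The Driver-side state machine can be sketched as follows. The state names come from the text; the class DriverRestartTracker, its methods, and the strictly linear transition rule are assumptions made for illustration. The container ID check is a simplified string parse that assumes the usual YARN layout, in which the application attempt number is the second-to-last underscore-separated field (for example container_1410901177871_0001_02_000001); it is not the parsing code REEF itself uses.

```java
/** Restart states named in the text; transitions are assumed to be linear. */
enum DriverRestartState {
  NOT_RESTARTED, BEGAN, IN_PROGRESS, COMPLETED;

  boolean canTransitionTo(final DriverRestartState next) {
    switch (this) {
      case NOT_RESTARTED: return next == BEGAN;
      case BEGAN:         return next == IN_PROGRESS;
      case IN_PROGRESS:   return next == COMPLETED;
      default:            return false; // COMPLETED is final
    }
  }
}

/** Hypothetical tracker, not part of the REEF API. */
final class DriverRestartTracker {
  private DriverRestartState state = DriverRestartState.NOT_RESTARTED;

  synchronized DriverRestartState getState() {
    return state;
  }

  /** Moves to the next state, rejecting transitions the sketch does not allow. */
  synchronized void advanceTo(final DriverRestartState next) {
    if (!state.canTransitionTo(next)) {
      throw new IllegalStateException(state + " -> " + next + " is not allowed");
    }
    state = next;
  }

  /**
   * Simplified check: an AM container belonging to application attempt 2 or
   * later indicates a resubmitted Driver.
   */
  static boolean isResubmitted(final String amContainerId) {
    final String[] parts = amContainerId.split("_");
    if (parts.length < 5) {
      throw new IllegalArgumentException("Unrecognized container ID: " + amContainerId);
    }
    final int attempt = Integer.parseInt(parts[parts.length - 2]);
    return attempt > 1;
  }
}
```

A Driver following this sketch would call isResubmitted once at startup and, if it returns true, advance through BEGAN and IN_PROGRESS as the steps above complete.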
In anticipation of Evaluators reporting back, the Driver also keeps a state for each of its expected Evaluators. An expected Evaluator starts out in the EXPECTED state, as viewed from the Driver. When the Driver receives a recovery heartbeat from the Evaluator, the Evaluator moves to the REPORTED state. The Evaluator is subsequently added to the set of managed Evaluators in the Driver and moves to the REREGISTERED state. The handlers for Evaluator context and task, as described in the section "Handlers, Client Configurations, and Control Flow," are then invoked, and the Evaluator finally ends up in the PROCESSED state. The Driver transitions to the final COMPLETED state only if it expects no more Evaluators to report back or if a configurable recovery timeout has passed. Once the recovery timeout has been reached, all Evaluators that are still in the EXPECTED state are marked as EXPIRED; they will be ignored and shut down if they report back to the Driver afterwards. The reports of such expired Evaluators are entirely transparent to the user.
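The per-Evaluator bookkeeping can be sketched in the same way. The state names are taken from the text; the class ExpectedEvaluatorTable and its method names are hypothetical, and the sketch leaves out handler invocation and the actual shutdown of expired Evaluators.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Per-Evaluator restart states named in the text. */
enum EvaluatorRestartState { EXPECTED, REPORTED, REREGISTERED, PROCESSED, EXPIRED }

/** Hypothetical bookkeeping table, not part of the REEF API. */
final class ExpectedEvaluatorTable {
  private final Map<String, EvaluatorRestartState> states = new HashMap<>();

  ExpectedEvaluatorTable(final Iterable<String> expectedIds) {
    for (final String id : expectedIds) {
      states.put(id, EvaluatorRestartState.EXPECTED);
    }
  }

  /** Returns false when the heartbeat must be ignored (expired or unknown Evaluator). */
  synchronized boolean onRecoveryHeartbeat(final String id) {
    if (states.get(id) != EvaluatorRestartState.EXPECTED) {
      return false;
    }
    states.put(id, EvaluatorRestartState.REPORTED);
    return true;
  }

  synchronized void onReregistered(final String id) {
    move(id, EvaluatorRestartState.REPORTED, EvaluatorRestartState.REREGISTERED);
  }

  synchronized void onProcessed(final String id) {
    move(id, EvaluatorRestartState.REREGISTERED, EvaluatorRestartState.PROCESSED);
  }

  /** Called when the recovery timeout fires; returns the IDs that were expired. */
  synchronized List<String> expireOutstanding() {
    final List<String> expired = new ArrayList<>();
    for (final Map.Entry<String, EvaluatorRestartState> e : states.entrySet()) {
      if (e.getValue() == EvaluatorRestartState.EXPECTED) {
        e.setValue(EvaluatorRestartState.EXPIRED);
        expired.add(e.getKey());
      }
    }
    return expired;
  }

  /** True once no Evaluator is still expected to report back. */
  synchronized boolean noneOutstanding() {
    return !states.containsValue(EvaluatorRestartState.EXPECTED);
  }

  synchronized EvaluatorRestartState stateOf(final String id) {
    return states.get(id);
  }

  private void move(final String id, final EvaluatorRestartState from,
                    final EvaluatorRestartState to) {
    if (states.get(id) != from) {
      throw new IllegalStateException(id + " is not in state " + from);
    }
    states.put(id, to);
  }
}
```

In this sketch the Driver could move to COMPLETED when noneOutstanding() returns true, either because every Evaluator reported back or because expireOutstanding() ran after the timeout.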
To inform the client of which Evaluators failed on restart, the Driver needs to keep external state on its Evaluators. The Driver performs this bookkeeping through the EvaluatorPreserver interface, which records the ID of each Evaluator allocated by the RM to the Driver and removes the ID when the Evaluator is released. The current implementation uses the cluster's distributed file system (DFS) to track Evaluator IDs.
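One way to picture this bookkeeping is an append-only log of allocations and releases that a restarted Driver replays. The sketch below is an assumption-laden illustration: EvaluatorIdStore and FileEvaluatorIdStore are hypothetical names, the method names are not taken from the real EvaluatorPreserver interface, and a local file stands in for the DFS.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical stand-in for the role played by EvaluatorPreserver. */
interface EvaluatorIdStore {
  void recordAllocated(String evaluatorId) throws IOException;
  void recordRemoved(String evaluatorId) throws IOException;
  Set<String> recover() throws IOException;
}

/** Append-only log on a local file; a real deployment would write to the DFS. */
final class FileEvaluatorIdStore implements EvaluatorIdStore {
  private final Path log;

  FileEvaluatorIdStore(final Path log) {
    this.log = log;
  }

  @Override
  public synchronized void recordAllocated(final String evaluatorId) throws IOException {
    append("+" + evaluatorId);
  }

  @Override
  public synchronized void recordRemoved(final String evaluatorId) throws IOException {
    append("-" + evaluatorId);
  }

  /** Replays the log to rebuild the set of Evaluators that were never released. */
  @Override
  public synchronized Set<String> recover() throws IOException {
    final Set<String> live = new LinkedHashSet<>();
    if (!Files.exists(log)) {
      return live;
    }
    for (final String line : Files.readAllLines(log, StandardCharsets.UTF_8)) {
      if (line.isEmpty()) {
        continue;
      }
      final String id = line.substring(1);
      if (line.charAt(0) == '+') {
        live.add(id);
      } else {
        live.remove(id);
      }
    }
    return live;
  }

  private void append(final String entry) throws IOException {
    Files.write(log, (entry + "\n").getBytes(StandardCharsets.UTF_8),
        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
  }
}
```

An append-only log is used here because it keeps each write small and lets the new Driver rebuild the set with a single pass; whether the actual implementation stores its data this way is not stated in the text.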
Evaluator HA Design
Evaluators periodically send heartbeats back to the Driver. In the event of a Driver failure, such heartbeats will fail. After passing a threshold of heartbeat failures, the Evaluator enters a recovery mode in which it assumes that the Driver has failed. In recovery mode, the Evaluator tries to perform an HTTP call, routed by the YARN RM, to a well-known endpoint set up by its associated Driver. This endpoint provides the remote identification of the running Driver instance. In the YARN runtime, the endpoint can be derived from the RM host and port and the application ID of the YARN application. Once the Driver has been successfully restarted, the endpoint becomes available, and the Evaluators can recover the remote identification of the new Driver instance and reestablish their heartbeats. As an alternative to the Driver setting up a special endpoint for HA, REEF is also looking into using the YARN Service Registry to recover the remote identification of the Driver.
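The Evaluator-side logic can be sketched as a failure counter plus a derived URL. Everything named here is hypothetical: DriverLivenessMonitor is not a REEF class, counting consecutive failures and treating the threshold as inclusive are assumptions, and the URL follows the common YARN web proxy layout (/proxy/<application ID>/) with a made-up path suffix standing in for the Driver's actual endpoint.

```java
/** Hypothetical Evaluator-side helper, not part of the REEF API. */
final class DriverLivenessMonitor {
  private final int failureThreshold;
  private int consecutiveFailures;

  DriverLivenessMonitor(final int failureThreshold) {
    if (failureThreshold < 1) {
      throw new IllegalArgumentException("failureThreshold must be at least 1");
    }
    this.failureThreshold = failureThreshold;
  }

  /** A successful heartbeat clears the failure count and ends recovery mode. */
  synchronized void onHeartbeatSuccess() {
    consecutiveFailures = 0;
  }

  synchronized void onHeartbeatFailure() {
    consecutiveFailures++;
  }

  /** Assumption: recovery mode starts once the threshold is reached. */
  synchronized boolean inRecoveryMode() {
    return consecutiveFailures >= failureThreshold;
  }

  /**
   * Builds the URL to poll while in recovery mode. The "driver-id" suffix is
   * a placeholder, not the path the Driver really serves.
   */
  static String recoveryEndpoint(final String rmHostPort, final String applicationId) {
    return "http://" + rmHostPort + "/proxy/" + applicationId + "/driver-id";
  }
}
```

While inRecoveryMode() is true, the Evaluator would poll the derived URL until the restarted Driver answers, then resume heartbeats against the returned remote identifier and call onHeartbeatSuccess().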