Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Overview

To support long-running applications, REEF provides Driver high availability (HA) such that the crash of the Driver process does not fail the entire application. This function is currently supported only on the REEF YARN runtime. The REEF HA feature on YARN primarily relies on YARN’s Resource Manager High Availability (RMHA) feature. When RMHA is enabled, applications can preserve containers across application retries such that the failure of a YARN AM only results in a resubmission of an AM by the RM. The outstanding containers from the previous submission are preserved, but it is up to the application to re-associate its previous containers with the new AM submission. REEF’s HA feature prevents users from having to worry about such details, so users will only need to focus on their core application logic. Note that HA can only be supported if the runtime supports preservation of worker processes on Driver failure and allows a new instance of the Driver to be associated with worker processes requested by the previous Driver. 

Handlers, Client Configurations, and Control flow

Aside from the regular event handlers, REEF HA provides a few special event handlers exposed to the client via DriverRestartConfiguration to handle the event of a Driver resubmission:

  1. DriverRestartHandler: Invoked in place of the DriverStartedHandler. Allows the user to know that there are potentially Evaluators to recover from the previous application submission. The IDs of the expected Evaluators can be retrieved from a getter in the DriverRestarted object (getExpectedEvaluatorIds()) in Java. If an Evaluator has failed in the process of Driver resubmission, the DriverRestartEvaluatorFailedHandler will be invoked.
  2. DriverRestartRunningTaskHandler: Once an Evaluator from a previous application submission has been recovered, the Driver will inspect whether the Evaluator has a running Task. If it does, DriverRestartTaskRunningHandler is invoked.
  3. DriverRestartActiveContextHandler: If the recovered Evaluator as mentioned in 2 does not have a running Task, DriverRestartContextHandler will be invoked to inform the user that a new Task can be submitted to the Evaluator.

  4. DriverRestartCompletedHandler: Invoked to notify the user that the restart has completed, when either after all expected Evaluators have reported back or a client configurable timeout has passed. A boolean getter (isTimedOut() in Java) indicates whether the handler was invoked due to a timeout. For all Evaluators from previous submissions that have not yet reported back to the Driver after the timeout has passed, the DriverRestartFailedEvaluatorHandler will be invoked.

  5. DriverRestartFailedEvaluatorHandler: Invoked to notify the user that an Evaluator has failed in the restart process. The Evaluator is either an Evaluator that has failed on a timeout for DriverRestartCompleted or an Evaluator that has failed before the resubmitted Driver has registered with the Resource Manager. The user can make the distinction with the set of expected Evaluator IDs provided in DriverRestartHandler.

Runtime Configuration

A supported RuntimeRestartConfiguration must be merged with DriverConfiguration in addition to DriverRestartConfiguration. This is because elements used in restart are runtime dependent. As of the time of writing, only YarnRuntimeRestartConfiguration is under development.

Driver HA Design

The Driver keeps track of the Evaluator recovery process via a finite state machine. The recovery process always begins in the NOT_RESTARTED state. The Driver starts the recovery process by determining whether it is a resubmitted instance through information provided by the runtime. In YARN, this is observed by parsing the container ID of the AM container in which the Driver is running. If the Driver is a resubmitted instance, the process enters the BEGAN state. Previous Evaluators and their statuses are then queried from the RM. The user is notified if any Evaluator has failed during the restart process, and the process enters the IN_PROGRESS state as the Driver waits for evaluators to report back.

In anticipation of Evaluators reporting back, the Driver also keeps a state for all its expected Evaluators. An expected Evaluator starts out in the EXPECTED state, as viewed from the Driver. When the Driver receives a recovery heartbeat from said Evaluator, the Evaluator moves to the REPORTED state. The Evaluator is subsequently added to the set of managed Evaluators in the Driver, and the Evaluator moves to the REREGISTERED state. The handlers for Evaluator context and task, as described in 3.3.1, are the section "Handlers, ClientConfigurations, and Control Flow," are then invoked, and the Evaluator finally ends up in the PROCESSED state. The Driver only transitions its state to the final COMPLETED state if it expects no more Evaluators to report back or if a configurable recovery timeout, as mentioned in the section "Handlers, ClientConfigurations, and Control Flow," has timeout has passed. Once the recovery timeout has been reached, all Evaluators that are still in the EXPECTED state are marked as EXPIRED, and will be ignored and shut down if it reports back to the Driver afterwards. The reports of such expired Evaluators are entirely transparent to the user.

In order for the Driver to inform the client of which Evaluator failed on restart, it would need to keep external state on the Evaluators. The Driver performs this bookkeeping by the EvaluatorPreserver interface, which records the Evaluator ID of the Evaluator allocated by the RM to the Driver. The EvaluatorPreserver would also remove the Evaluator ID when an evaluator is released. The current implementation utilizes the cluster distributed file system (DFS) to perform Evaluator ID tracking. 

Evaluator HA Design

Evaluators periodically send heartbeats back to the Driver. In the event of a Driver failure, such heartbeats will fail. After passing a threshold of heartbeat failures, the Evaluator will enter a recovery mode where it assumes that the Driver has failed. Under the recovery mode, the Evaluator will try to perform an HTTP call routed by the YARN RM to a well-known endpoint set up by its associated Driver. This endpoint would provide the remote identification of the running Driver instance. In the YARN runtime, this endpoint can be derived from the RM host port and the application ID of the YARN application. Once the Driver has successfully been restarted, the endpoint will become available, and the Evaluators will be able to recover the remote identification of the new Driver instance and reestablish its heartbeats. As an alternative to the Driver setting up a special endpoint for HA, REEF is also looking into utilizing the YARN Service Registry to recover the remote identification of the Driver.