Host HA

Bug Reference

CLOUDSTACK-9782

Branch

Yet to start, share the PR.

Introduction

CloudStack lacks a way to reliably fence a host. The Host HA feature provides a general-purpose HA framework, together with hypervisor-specific implementations, that can use additional mechanisms such as OOBM (IPMI-based power management) to reliably investigate, recover, and fence a host. The feature handles scenarios involving server crashes, reliable fencing of hosts, and HA of VMs.

Purpose

This is the functional specification of the Host HA feature. The feature adds the ability to identify when a host outage occurs and to fence the host using the IPMI interface.

References

This extends on a previous FS described here: KVM HA with IPMI Fencing

Document History

Glossary

Feature Specification

High Level Design Discussion

The CloudStack control plane requires an HA/fencing mechanism that fulfils the following goals:

Operational Simplicity: The service should be as simple to install, configure, and manage as any
other management server service. CloudStack's operational simplicity is a significant advantage that
should not be compromised for more complex use cases. I see it as our job to design the system to
make operation of these complex use cases as simple and straightforward as possible.

Leverage Existing Abstractions: The service should leverage the abstractions exposed by the control
plane. A significant amount of effort has been invested to integrate devices into CloudStack. By
composing our existing logical abstractions (e.g. volumes, power management, etc.), the service
transparently gains benefits of this work.

Integrated Resource Management: The service should feed back into other resource management
activities (e.g. allocation, scheduling, etc.) to understand the health of resources and support
advanced, contextual recovery modes.

To achieve these goals, CloudStack’s HA mechanism will be divided into the following components to
maximize reuse and compartmentalize responsibilities:

HA Resource Management Service: Manages the check/recovery lifecycle for resources in HA-enabled
partitions, tracking and persisting a per-resource finite-state machine (FSM). This service will
be implemented as part of the CloudStack core engine.

HA Provider: Implements eligibility determination, check, and recovery operations for a resource
type (e.g. a host running the KVM hypervisor). A provider also defines the partition and resource
types on which it will operate (e.g. a cluster of KVM hosts). Providers will be implemented
as part of their respective resource plugins.

This division of responsibilities will allow CloudStack’s HA mechanism to be consistently extended to any
supported resource type (e.g. KVM VMs, VRs, etc.) that fits the HA Resource Management Service’s
check/recovery cycle.

HA Resource Management Service

The HA Resource Management Service manages the check/recovery cycle including periodic execution,
concurrency management, back pressure, persistence, and clustering operations. Administrators associate a
provider with a partition type (e.g. KVM HA Host provider to clusters) and may override the provider on a
per-partition (i.e. zone, cluster, or pod) basis. The service operates on all resources of the type supported by
the provider contained in a partition. Administrators can also enable or disable HA operations globally or on
a per-partition basis. The following assumptions and constraints will apply to the initial implementation of
the service:

Only one (1) HA provider per resource type may be specified for a partition
Nested HA providers by resource type will not be supported (e.g. a pod specifying an HA resource provider for hosts and a cluster within that pod also specifying an HA resource provider for hosts)

The service is designed to be opt-in, whereby only resources with a defined provider and HA enabled will be
managed. While it is intended that all HA operations will be refactored to use the HA Resource Management
Service, it can co-exist with other HA implementations in the system without impacting their operations.

For each resource in an HA partition, the HA Resource Management Service will maintain and persist an FSM composed of the following states:

DISABLED: The resource is part of a partition where HA operations have been disabled, or HA operations have been disabled for the resource itself.
AVAILABLE: The resource passed its initial health and eligibility checks for HA management. The resource stays in the AVAILABLE state as long as it passed its most recent health check, its containing partition has an HA state of ACTIVE, and all eligibility conditions are met. When transitioning to this state, the number of retry attempts is reset.
INELIGIBLE: The resource's enclosing partition has an HA state of ACTIVE but the resource's current state does not support HA check and/or recovery operations. For example, a host that is the only host in its cluster is ineligible under the KVM provider, because that provider requires a neighbouring host to carry out its investigations. Any resource in maintenance mode is automatically transitioned to INELIGIBLE.
SUSPECT: The resource is pending an activity check because it failed its most recent health check. If the maximum number of recovery attempts has been exceeded, the HA state is transitioned to FENCED. Otherwise, the resource will be scheduled for an activity check. When a resource fails multiple activity checks/recovery attempts, the duration between re-attempts will increase up to the maximum interval specified by the provider (e.g. first check after 10 seconds, second check after 20 seconds, third check after 40 seconds, up to a maximum interval of 250 seconds).
DEGRADED: The resource cannot be managed by the control plane but passed its most recent activity check, indicating that the resource is still servicing end-user requests.
CHECKING: An activity check is currently being performed on the resource. The HA provider defines the number of activity checks that must be performed and the number of failed activity checks required to trigger recovery. If the number of failed activity checks is greater than or equal to the number of acceptable failures, the HA state of the resource is transitioned to RECOVERING, causing a recovery attempt. If the total number of activity checks has been performed and the number of failures is less than the number of acceptable failures, the HA state of the resource is transitioned to DEGRADED. If the number of activity checks performed is less than the total number required and the number of failures is less than the number of acceptable failures, the HA state of the resource is transitioned to SUSPECT, triggering another activity check.
RECOVERING: Recovery operations are in progress to bring the resource back to a healthy state. If the recovery operation succeeds, the HA state of the resource will be transitioned to INITIALIZING. If the recovery operation fails, the HA state of the resource is transitioned to FENCED. Because the recovery operation is not idempotent, this state is further split into ‘Recovering’ and ‘Recovered’.
FENCED: The resource is not operating normally and automated attempts to recover it failed. Manual operator intervention is required to recover the resource. Because the fence operation is not idempotent, this state is further split into ‘Fencing’ and ‘Fenced’. Since these sub-states mainly serve the internal process that manages fencing and recovery, we will continue to refer to RECOVERED and FENCED in our state diagram.
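
The growing interval between re-attempts described for SUSPECT can be sketched as a capped doubling schedule. This is only an illustration: it assumes the delay doubles on every attempt (which is consistent with the 10, 20, 40 second example), and the function name and default values are hypothetical rather than part of the specification.

```python
def retry_delay(attempt: int, base_seconds: int = 10, max_seconds: int = 250) -> int:
    """Return the delay, in seconds, before the given 1-based re-attempt.

    Assumption: the delay doubles on each attempt and is capped at
    max_seconds (the provider-specified maximum interval).
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return min(base_seconds * 2 ** (attempt - 1), max_seconds)


# Delays for the first six re-attempts of a resource that keeps failing.
schedule = [retry_delay(n) for n in range(1, 7)]
print(schedule)
```

Capping the delay keeps a persistently failing resource from being rechecked too aggressively while still bounding how long it can go unexamined.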

The following diagram depicts the valid transitions between these states:

The HA management service maintains the FSM and determines when it should be evaluated. The HA provider determines whether or not a state transition should occur through the following operations:

isEligible: Indicates whether or not the passed resource is eligible for HA operations with the provider
isHealthy: Performs an idempotent check of the passed resource to verify its connectivity and the proper operation of its API endpoint(s).
hasActivity: Performs an idempotent check on the passed resource that observes the side effects of a resource’s normal operation to determine whether or not it is functioning but unable to communicate through its API endpoint(s).
recover: Takes actions to change the state of the passed resource to bring it back to a healthy state. For a host, this uses the IPMI power cycle (reset) command to bring back a crashed host. If the host was deliberately powered off, this operation is designed to fail, i.e. the cycle command will not be able to bring back the host. A crash can be emulated on a KVM/Linux host with 'echo c > /proc/sysrq-trigger'.
fence: Takes actions necessary to isolate an unrecoverable resource from other shared resources and avoid the spread of a failure. To fence a host, an IPMI power off is issued, which is expected to shut down the host if it is on.
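
As a rough sketch of the provider contract listed above, the five operations could be expressed as an abstract interface. The class names, the dict-based resource, and the toy in-memory provider are all hypothetical and exist only to show how the operations relate; they are not the CloudStack API.

```python
from abc import ABC, abstractmethod


class HAProvider(ABC):
    """Hypothetical provider contract mirroring the operations above."""

    @abstractmethod
    def is_eligible(self, resource: dict) -> bool:
        """Whether the resource is eligible for HA operations."""

    @abstractmethod
    def is_healthy(self, resource: dict) -> bool:
        """Idempotent connectivity/API check."""

    @abstractmethod
    def has_activity(self, resource: dict) -> bool:
        """Idempotent check for side effects of normal operation."""

    @abstractmethod
    def recover(self, resource: dict) -> bool:
        """Try to bring the resource back to a healthy state."""

    @abstractmethod
    def fence(self, resource: dict) -> bool:
        """Isolate an unrecoverable resource."""


class InMemoryProvider(HAProvider):
    """Toy provider operating on plain dicts, for illustration only."""

    def is_eligible(self, resource: dict) -> bool:
        return not resource.get("maintenance", False)

    def is_healthy(self, resource: dict) -> bool:
        return resource.get("state") == "UP"

    def has_activity(self, resource: dict) -> bool:
        return bool(resource.get("disk_activity", False))

    def recover(self, resource: dict) -> bool:
        # Stand-in for a power cycle: it fails when the host is powered off.
        if resource.get("power") == "OFF":
            return False
        resource["state"] = "UP"
        return True

    def fence(self, resource: dict) -> bool:
        # Stand-in for an out-of-band power off.
        resource["power"] = "OFF"
        resource["state"] = "DOWN"
        return True
```

Keeping the checks idempotent and confining side effects to recover and fence is what lets the management service retry checks freely while treating recovery and fencing with more care.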

When HA is enabled for a partition, the HA state of all contained resources will be transitioned from DISABLED to AVAILABLE. Based on the state models, the following failure scenarios and their responses will be handled by the HA resource management service:

Activity check operation fails on the resource: The activity check protocol provides a way to express that an error occurred while performing the activity check, together with a reason for the failure (e.g. unable to access the NFS mount). If the maximum number of activity check attempts has not been exceeded, the activity check will be retried.
Slow activity check operation: After a configurable timeout, the HA resource management service abandons the check. The response to this condition would be the same as a failure to recover the resource.
Traffic flood due to a large number of resource recoveries: The HA resource management service must limit the number of concurrent recovery operations permitted to avoid overwhelming the management server with resource status updates as recovery operations complete.
Processor/memory starvation due to large number of activity check operations: The HA resource management service must limit the number of concurrent activity check operations permitted per management server to prevent checks from starving other management server activities of scarce processor and/or memory resources.
A SUSPECT, CHECKING, or RECOVERING resource passes a health check before the state action completes: The HA resource management service refreshes the HA state of the resource before applying a transition. If the state does not match the expected current state, the result of the state action is ignored.
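
The last scenario amounts to a compare-before-transition guard. A minimal sketch follows, assuming HA states are kept in a simple dict keyed by resource id; the function name and the storage model are hypothetical.

```python
from typing import Dict


def complete_action(states: Dict[str, str], resource_id: str,
                    expected: str, target: str) -> bool:
    """Apply a transition only if the resource is still in the expected state.

    Returns False (and changes nothing) when the state moved on while the
    action was running, so the stale result is ignored.
    """
    if states.get(resource_id) != expected:
        return False
    states[resource_id] = target
    return True


states = {"host-1": "CHECKING"}
# A health check passes while the activity check is still running.
states["host-1"] = "AVAILABLE"
# The late activity-check result no longer applies and is dropped.
applied = complete_action(states, "host-1", "CHECKING", "RECOVERING")
print(applied, states["host-1"])
```

In a clustered deployment the comparison and the update would need to happen atomically, for example inside a database transaction; the dict here sidesteps that concern.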

To address some of these scenarios, HA providers also specify the following parameters to bound HA service operations:

Maximum Recovery Attempts: The maximum number of attempts to recover a resource
Health Check Timeout: The timeout, in seconds, for a health check operation to complete
Activity Check Timeout: The timeout, in seconds, for an activity check operation to complete
Minimum Recovery Time: The time, in seconds, required for a recovery operation to succeed
Minimum Fence Time: The time, in seconds, required for a fence operation to succeed
Degraded Recheck Interval: The interval, in seconds, after which an activity check should be re-executed for a degraded resource. When this interval expires, the state of the resource will be transitioned to SUSPECT, triggering re-execution of the activity check.
Activity Check Interval: The interval, in seconds, between activity checks when determining whether or not to recover a resource.
Ratio of Failed Activity Checks: The ratio of failed activity checks to total checks required to attempt to recover a resource.
Recovery Timeout: The timeout, in seconds, for recovery of a resource
Fence Timeout: The timeout, in seconds, for a fence operation to complete
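
To show how the failed-check ratio and the number of checks bound the CHECKING state, here is a small decision function. It follows the prose description of CHECKING, in which reaching the acceptable number of failures triggers recovery; the function name is hypothetical, and the defaults (10 attempts, ratio 0.7) are borrowed from the KVM provider settings listed later in this document.

```python
def checking_outcome(checks_done: int, failures: int,
                     max_attempts: int = 10, failure_ratio: float = 0.7) -> str:
    """Return the next HA state for a resource in CHECKING.

    Assumption: recovery is triggered once failures / max_attempts
    reaches failure_ratio.
    """
    if not 0 <= failures <= checks_done <= max_attempts:
        raise ValueError("expected 0 <= failures <= checks_done <= max_attempts")
    if failures / max_attempts >= failure_ratio:
        return "RECOVERING"   # too many failed activity checks
    if checks_done >= max_attempts:
        return "DEGRADED"     # all checks done, failures below threshold
    return "SUSPECT"          # more checks needed


print(checking_outcome(checks_done=5, failures=2))
print(checking_outcome(checks_done=10, failures=3))
print(checking_outcome(checks_done=8, failures=8))
```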

The HA Resource Management Service will expose the following admin APIs to configure host HA operations for resources within a partition:

listHostHAConfigurations: List the HA configuration information for all or a specific partition
configureHAForHost: Configures HA for a partition including the association of an HA provider to a resource type. By default, HA operations are disabled until they are explicitly enabled.
enableHAForHost: Enables HA operations for a partition
disableHAForHost: Disables HA operations for a partition

In order to support the definition of roles that control HA for different resource types, the HA resource management service defines per resource type APIs for configuration, enable and disable operations.

The following table depicts how these provider operations are employed by the HA resource management service to trigger state transitions in an HA state machine:

Any state will transition to ‘DISABLED’ if the HA resource is disabled. Any state will transition to ‘INELIGIBLE’ if the HA resource is found to be ineligible on a check. For the sake of simplicity, these state transitions are not listed in the following table.

Start State	Operations	Target State	Events/Alerts
AVAILABLE	isHealthy() == true && **isEligible() == true	AVAILABLE	N/A
AVAILABLE	isHealthy() == false && isEligible() == true	SUSPECT	(E) Resource <id> failed its health check.
INELIGIBLE	isEligible() == true	AVAILABLE	(E) Resource <id> has become eligible for HA operations.
CHECKING	hasActivity() == false && isEligible() == true && Check Failures > Ratio of Failed Activity Checks	RECOVERING	(A) Resource <id> failed its activity check. Attempting to recover.
CHECKING	isEligible() == true && Check Failures < Ratio of Failed Activity Checks	SUSPECT	N/A
CHECKING	hasActivity() == true && isEligible() == true	DEGRADED	(E) Resource <id> passed its activity check but is considered degraded.
DEGRADED	isHealthy() == true && isEligible() == true	AVAILABLE	(E) Degraded resource <id> passed a health check and is now available.
DEGRADED	isHealthy() == false && isEligible() == true	DEGRADED	N/A
DEGRADED	Recheck interval expired && isEligible() == true	SUSPECT	(E) Re-executing activity check for degraded resource <id>.
FENCED	isHealthy() == false && isDisabled() == false	FENCED	N/A
SUSPECT	isHealthy() == true	AVAILABLE	(E) Suspect resource <id> passed a health check. Reinitializing HA operations.
SUSPECT	Maximum recovery attempts exceeded && isEligible() == true	FENCED	(A) Fenced resource <id> due to maximum number of recovery attempts exceeded.
RECOVERED	Recovery operation succeeded.	AVAILABLE	(E) Recovery of resource <id> completed.
RECOVERING	Recovery operation failed && isEligible() == true && isMaintenanceMode() == false && isDisabled() == false	FENCED	(A) Fenced resource <id> due to failure of recovery operation.

**The eligibility check includes all the following:

The resource’s containing partition(s) are enabled
The resource’s containing partition(s) are not in maintenance mode
The resource is not in maintenance mode
The resource is in a state in which HA operations can be carried out (e.g. a host has a neighbouring host for Host HA).
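
The health-check rows of the table, combined with the eligibility rule, can be sketched as a single transition function. This is an illustrative subset only: it covers the AVAILABLE, INELIGIBLE, DEGRADED, and SUSPECT rows, leaves every other case unchanged, and uses hypothetical names throughout.

```python
from enum import Enum


class HAState(Enum):
    AVAILABLE = "AVAILABLE"
    INELIGIBLE = "INELIGIBLE"
    SUSPECT = "SUSPECT"
    DEGRADED = "DEGRADED"


def after_health_check(state: HAState, healthy: bool, eligible: bool) -> HAState:
    """Return the next state after a health and eligibility check."""
    if not eligible:
        return HAState.INELIGIBLE          # any state -> INELIGIBLE
    if state is HAState.INELIGIBLE:
        return HAState.AVAILABLE           # became eligible again
    if healthy and state in (HAState.AVAILABLE, HAState.DEGRADED, HAState.SUSPECT):
        return HAState.AVAILABLE
    if state is HAState.AVAILABLE:
        return HAState.SUSPECT             # failed its health check
    return state                           # e.g. DEGRADED stays DEGRADED


print(after_health_check(HAState.AVAILABLE, healthy=False, eligible=True))
```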

Before invoking the HA provider’s fence operation, the HA resource management service will place the resource in maintenance mode. The intention is to require an administrator to manually verify that a resource is ready to return to service by taking it out of maintenance mode.

To detect out-of-band correction of resource issues, health checks will be performed on all resources in the AVAILABLE, SUSPECT, DEGRADED, and FENCED states.

In order to prevent churn due to unrecoverable resources, the HA service tracks the number of recovery attempts per resource. Either a recovery operation will fail due to an error or timeout or the number of transitions to the SUSPECT state will exceed the maximum number of recovery attempts. In both scenarios, the resource will be transitioned to the FENCED state and the fence operation will be invoked on the resource’s associated HA provider. When a resource transitions to AVAILABLE, the number of recovery attempts is reset to zero.
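
The attempt tracking described above can be sketched with a small counter class. The class and method names are hypothetical; the sketch assumes one counter per resource, incremented on every transition to SUSPECT and cleared on AVAILABLE.

```python
from typing import Dict


class RecoveryTracker:
    """Counts transitions to SUSPECT per resource (illustrative only)."""

    def __init__(self, max_attempts: int) -> None:
        self.max_attempts = max_attempts
        self.attempts: Dict[str, int] = {}

    def on_suspect(self, resource_id: str) -> bool:
        """Record a SUSPECT transition; True means the resource should be fenced."""
        count = self.attempts.get(resource_id, 0) + 1
        self.attempts[resource_id] = count
        return count > self.max_attempts

    def on_available(self, resource_id: str) -> None:
        """Reset the counter when the resource becomes AVAILABLE."""
        self.attempts.pop(resource_id, None)


tracker = RecoveryTracker(max_attempts=1)
first = tracker.on_suspect("host-1")    # within the limit
second = tracker.on_suspect("host-1")   # limit exceeded, fence
print(first, second)
```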

Internally, the HA resource management service will maintain the following queues and associated thread pools to schedule operations to transition between HA states:

INELIGIBLE: A bounded, ephemeral queue of resources that require eligibility checks that have an HA state of INELIGIBLE.
HEALTH CHECK: A bounded, ephemeral queue of resources pending health checks that have an HA state of INITIALIZING, AVAILABLE, SUSPECT, DEGRADED, or FENCED.
ACTIVITY CHECK: A bounded, ephemeral queue of resources pending activity checks that have an HA state of CHECKING.
RECOVERY: A persistent queue of resources pending recovery that have an HA state of RECOVERING.
FENCE: A persistent queue of resources pending fencing that have an HA state of SUSPECT or RECOVERING.

In order to provide back pressure and ensure that HA operations do not overwhelm other management server functions, the following global settings will be introduced:

Global Setting	Description	Default Value
ha.max.pending.health.check.operations	The number of pending health check operations per management server. This setting determines the size of the HEALTH CHECK queue.	5000
ha.max.concurrent.health.check.operations	The number of concurrent health check operations per management server. This setting determines the size of the thread pool consuming the HEALTH CHECK queue.	50
ha.max.concurrent.activity.check.operations	The number of concurrent activity check operations per management server. This setting determines the size of the thread pool consuming the ACTIVITY CHECK queue.	25
ha.max.pending.activity.check.operations	The number of pending activity check operations per management server. This setting determines the size of the ACTIVITY CHECK queue.	2500
ha.max.concurrent.fence.operations	The number of concurrent fence operations per management server. This setting determines the size of the thread pool consuming the FENCE queue.	25
ha.max.pending.fence.operations	The number of pending fence operations per management server. This setting determines the size of the FENCE queue.	2500
ha.max.concurrent.recovery.operations	The number of concurrent recovery operations per management server.	25
ha.max.pending.recovery.operations	The number of pending recovery operations per management server. This setting determines the size of the RECOVERY queue.	2500
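
The pairing of a "pending" limit with a "concurrent" limit can be illustrated with a bounded queue drained by a fixed-size thread pool. This is a simplified sketch using Python's standard library, not the management server's implementation; the class name and the synchronous drain step are assumptions made for brevity.

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


class BoundedStage:
    """A bounded pending queue consumed by a fixed number of workers."""

    def __init__(self, max_pending: int, max_concurrent: int) -> None:
        self.pending: "queue.Queue[str]" = queue.Queue(maxsize=max_pending)
        self.max_concurrent = max_concurrent

    def offer(self, resource_id: str) -> bool:
        """Enqueue without blocking; False signals back pressure."""
        try:
            self.pending.put_nowait(resource_id)
            return True
        except queue.Full:
            return False

    def drain(self, operation: Callable[[str], str]) -> List[str]:
        """Run the operation on everything queued, max_concurrent at a time."""
        items = []
        while True:
            try:
                items.append(self.pending.get_nowait())
            except queue.Empty:
                break
        with ThreadPoolExecutor(max_workers=self.max_concurrent) as pool:
            return list(pool.map(operation, items))  # map keeps input order


stage = BoundedStage(max_pending=2, max_concurrent=2)
accepted = [stage.offer(h) for h in ("h1", "h2", "h3")]
print(accepted)
```

Rejecting work when the queue is full, rather than blocking, keeps the scheduler responsive; the rejected resource is simply picked up on a later scheduling pass.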

Unlike eligibility, health check, and activity check operations, recovery and fence operations have side effects (for example, rebooting or powering off a host). Therefore, each of these operations is divided into two internal states. A resource is ‘Fencing’ when a fence operation starts and ‘Fenced’ when it completes. Similarly, a resource is ‘Recovering’ when a recovery operation starts and ‘Recovered’ when it completes.

On management server startup, all resources in the INELIGIBLE, AVAILABLE, SUSPECT, and CHECKING states will be re-checked for eligibility in order to re-evaluate their current HA state. The persistent tracking provided by the HA Management framework will allow a management server to resume recovery and fence operations. When a management server fails, causing ownership of hosts to be handed off to another management server in the cluster, the same eligibility checks will be performed for the handed-off hosts.

KVM Host HA Provider

The KVM plugin will be extended to include an HA provider that checks and recovers KVM hosts using a Shoot the Other Node in the Head (STONITH) fencing model. This HA provider will operate across a KVM cluster, restarting hosts when there has been no disk activity since a host’s state transitioned to DOWN. Therefore, a health check for the KVM HA provider consists of checking the host state. When the state of a host is DOWN, this provider will report a failed health check.

A host must meet the following criteria to be deemed eligible for HA operations by the KVM HA host provider:

The host must be a member of a cluster using the KVM hypervisor
The host must have a power management status of ON or OFF
The version of the KVM agent deployed on the host must support performing activity checks
At least one volume attached to the VM(s) must support the activity check capability
There should be at least one other host in the cluster.

To address scenarios where a change in a host's power state affects its HA eligibility, the plugin would respond to the following host power state transitions:

OFF->ON: The host's HA state is transitioned to AVAILABLE when the other eligibility criteria are also met.
ON->OFF, UNKNOWN->OFF, OFF->UNKNOWN: The host's HA state is transitioned to INELIGIBLE because the host is not running.
ON->UNKNOWN: All states except DEGRADED and FENCED are transitioned to INELIGIBLE because the control plane cannot communicate with the system management interface. The DEGRADED and FENCED states are maintained because an inability to access the system management interface does not affect these states.
UNKNOWN->ON: All states except DEGRADED and FENCED are transitioned to INITIALIZING to reassess the HA eligibility and recalculate the host's HA state. The DEGRADED and FENCED states are maintained because a repair to a partition in the system management network does not indicate that the host's operation has been repaired.

In order to determine disk activity, the HA provider will query the storage subsystem to determine whether or not the files underlying any of the volumes attached to any of the VMs on the host have been recently accessed. In the event a host performing an activity check encounters an error during the check operation (e.g. unable to read the NFS mount), the HA state of the checking host will be transitioned to SUSPECT. In order to avoid time sync issues between the management server, host, and NFS server, the activity check will use a relative check that watches a file for a timestamp change within the specified number of attempts, as configured by kvm.ha.max_activity_check_interval and the number of activity checks. The first activity check is absolute and uses the suspect time and the current time on the host to check whether there was any VM disk activity. Subsequent activity checks look for any disk activity between the previous activity check time and the current activity check time.
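
The relative check can be illustrated by comparing file timestamps between two consecutive checks instead of comparing them with a wall clock. The sketch below assumes that the modification time is the timestamp being watched and that volumes are plain files at known paths; both are assumptions for illustration, and the function names are hypothetical.

```python
import os
from typing import Dict, Iterable


def snapshot_mtimes(paths: Iterable[str]) -> Dict[str, int]:
    """Record the modification time (in nanoseconds) of each volume file."""
    return {path: os.stat(path).st_mtime_ns for path in paths}


def has_activity(previous: Dict[str, int], current: Dict[str, int]) -> bool:
    """True if any file's timestamp changed between the two snapshots.

    Only timestamps read from the same storage are compared, so clock
    differences between servers do not matter.
    """
    return any(current[path] != previous.get(path) for path in current)
```

A caller would take one snapshot per activity check and compare it with the snapshot from the previous check, treating any change as evidence that the VMs on the suspect host are still writing to disk.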

To support this functionality, the StorageProcessor will be modified to support checking the activity of a list of volumes and expose a capability for this function. By extending the storage subsystem to perform this check, support for activity checking different types of storage can be implemented without impacting the KVM Host HA provider. For the initial implementation, support for activity checks will only be added for NFS. The KVM StorageProcessor will delegate to the KVM agent of an adjacent host whose state is UP to perform the check locally. To avoid live-locking of activity check threads in the HA resource management service, the KVM HA provider will execute activity checks using a request-reply model and a reply timeout. This approach will allow the management server to abandon activity check operations that do not return from the KVM agent.

If a host fails its activity check, the KVM HA provider will attempt to recover it by restarting it, issuing the power management operation specified in kvm.ha.recovery.restart_operation to the host. The HA resource management service will wait until the time specified in kvm.ha.recovery.wait_time has elapsed before checking that the host successfully recovered. It will not wait past the kvm.ha.recovery.timeout for the recovery operation to successfully complete.

When the HA state of a host transitions to the FENCED state, the KVM HA provider will power off the host using the power management operation specified in the kvm.ha.fence.stop_operation. This action is taken to ensure that any misbehaviour by the host (e.g. unnecessary NFS activity) is stopped until an operator has the opportunity to manually fix the host.

The KVM HA provider will provide the following global settings that can be overridden on a per-cluster basis to control the behaviour of the activity check, recovery, and fencing operations:

Global Setting	Description	Default Value
kvm.ha.activity.check.failure.ratio	The activity check failure threshold ratio. This is used with the activity check maximum attempts for deciding to recover or degrade a resource. For most environments, please keep this value above 0.5.	0.7
kvm.ha.activity.check.max.attempts	The maximum number of activity check attempts to perform before deciding to recover or degrade a resource.	10
kvm.ha.activity.check.interval	The maximum interval of time, in seconds, between activity checks	60
kvm.ha.activity.check.timeout	The maximum length of time, in seconds, expected for an activity check to complete	60
kvm.ha.health.check.timeout	The maximum length of time, in seconds, expected for a health check to complete.	10
kvm.ha.degraded.max.period	The maximum length of time, in seconds, a resource can be in degraded state where only health checks are performed.	300
kvm.ha.recover.wait.period	The maximum length of time, in seconds, to wait for a resource to recover.	600
kvm.ha.recovery.timeout	The maximum time, in seconds, expected for a host to recover	60
kvm.ha.recover.failure.threshold	The maximum recovery attempts to be made for a resource, after which the resource is fenced. The recovery counter resets when a health check passes for a resource.	1
kvm.ha.recovery.restart_operation	The out-of-band power management operation to execute in order to restart host during recovery	CYCLE
kvm.ha.fence.stop_operation	The out-of-band power management operation to execute in order to stop a fenced host	OFF
kvm.ha.fence.timeout	The maximum time, in seconds, expected for fencing operations to complete	60

Cluster override values will be stored in the cluster_details table.
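
The override behaviour can be pictured as a two-level lookup: a per-cluster value wins when present, otherwise the global default applies. In this sketch the cluster details are modelled as a plain dict standing in for rows of the cluster_details table, and the function name is hypothetical; only a few of the settings above are shown.

```python
from typing import Dict

# Global defaults copied from the settings table above (subset).
GLOBAL_DEFAULTS: Dict[str, str] = {
    "kvm.ha.activity.check.max.attempts": "10",
    "kvm.ha.activity.check.interval": "60",
    "kvm.ha.fence.timeout": "60",
}


def effective_setting(name: str, cluster_details: Dict[str, str]) -> str:
    """Return the cluster override if one exists, else the global default."""
    if name in cluster_details:
        return cluster_details[name]
    return GLOBAL_DEFAULTS[name]


# Hypothetical cluster that overrides only the fence timeout.
cluster = {"kvm.ha.fence.timeout": "120"}
print(effective_setting("kvm.ha.fence.timeout", cluster))
print(effective_setting("kvm.ha.activity.check.interval", cluster))
```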

HOST-HA and VM-HA coordination

For KVM Host HA to work effectively, it has to work in tandem with the existing VM HA framework. The current CloudStack implementation focuses on VM-HA because VMs are first-class entities, while a host is considered a resource. CloudStack manages host states, and a rough mapping of CloudStack host states to KVM Host HA states is given below:

VM-HA host States	KVM Host HA host states
Up	Available
Up (Investigating)	Suspect/Checking
Alert	Degraded
Disconnected	Recovering/Recovered/Fencing
Down	Fenced
--	Ineligible/Disabled

Host HA improves on investigation by providing a new way of investigating VMs using VM disk activity. It also adds to the fencing capabilities by integrating with the OOBM feature.

In order for VM HA to work correctly and in sync with Host HA, it is important that both see the same host state, as per the above table. To ensure this, the present VM-HA model is modified to derive the host state from the Host HA state, instead of reaching out to the host directly, when Host HA is enabled for the host and the host is eligible. The mapping of states also ensures that VM-HA is not triggered until the host has been fenced by the Host HA process.
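
The state mapping and the "wait until fenced" rule can be sketched as a lookup plus a gate. The dictionary restates the table above; the function names are hypothetical, and returning None for Ineligible and Disabled (which have no mapping) is an assumption about how a caller might fall back to its existing investigation.

```python
from typing import Optional

# Host HA state -> host state as seen by VM-HA (from the table above).
HOST_HA_TO_VM_HA = {
    "Available": "Up",
    "Suspect": "Up (Investigating)",
    "Checking": "Up (Investigating)",
    "Degraded": "Alert",
    "Recovering": "Disconnected",
    "Recovered": "Disconnected",
    "Fencing": "Disconnected",
    "Fenced": "Down",
}


def host_state_for_vm_ha(ha_state: str) -> Optional[str]:
    """Map a Host HA state to a VM-HA host state; None if unmapped."""
    return HOST_HA_TO_VM_HA.get(ha_state)


def may_start_vm_ha(ha_state: str) -> bool:
    """VM-HA may restart VMs elsewhere only once the host is fenced."""
    return host_state_for_vm_ha(ha_state) == "Down"


print(may_start_vm_ha("Recovering"), may_start_vm_ha("Fenced"))
```

Gating on the fenced state is what prevents the same VM from running on two hosts at once while a host is still being investigated or recovered.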

Supported Hypervisor(s). Which hypervisor should the new feature work with?

For the initial release, only KVM with NFS storage will be supported. However, the storage check component will be implemented in a modular fashion, allowing for checks using other storage platforms (e.g. Ceph) in the future. HA provider plugins can be implemented for other hypervisors.

Should users be able to access the new functionality via the UI?

While the HA service is designed to support any resource and partition type, it will only be exposed for clusters and hosts in the UI for this release. The Host Details and Host Metrics views will be modified to include the HA state. The Cluster Details view will be modified to display whether or not HA is enabled. Finally, an HA tab will be added to cluster configuration to specify per-cluster HA configuration information.
