Purpose

CloudStack (CS) relies on custom high-availability (HA) logic for user VMs running on XenServer (XS). This approach was probably chosen because the native HA capabilities in XS were not mature enough in the early days. With the custom HA logic, CS has to correctly determine the state of a VM from the hypervisor (HV) before it can take any action, so any problem in determining that state can affect the HA mechanism. Since the HV knows the state of the VM best, relying on its native HA capabilities is the better approach.

Suggested changes

  1. Leverage native HA capabilities for user VM HA on XS 6.2 and above. For backward compatibility, the existing custom HA option will remain available, and a global setting will select which of the two is used. Once an option is chosen and VMs are deployed, it should not be changed, as doing so can have undesired consequences. In other words, there is no mixed-mode support in which a cluster has some VMs using native HA and others using CS custom HA. For older versions of XS (prior to 6.2) there is no change. HA for system VMs will still be based on application logic.
  2. The hack for rebooting the host (see the xenheartbeat.sh script) on primary storage (PS) failure will be removed, as CS no longer needs to take any action on user VMs as part of HA.
    1. This will fix the problem described in [2].

Jira ticket: https://issues.apache.org/jira/browse/CLOUDSTACK-5203

Prereqs

XS clusters need to be configured for HA outside of CS; without this, HA for user VMs will not work. This was not required earlier because HA was managed by CS. Refer to the XS 6.2 admin guide [1] for setting up HA-enabled clusters.

Impacted scenarios

  1. User VMs created with an HA-enabled service offering will now be made highly available using native XS HA (provided that option is chosen through the global setting).
    1. Note: deployment of an HA-enabled VM will still succeed even if HA is not configured in the XS cluster.
  2. VM sync will no longer perform any operations on user VMs based on a state mismatch; only the database state will be updated, based on the state reported by the HV.
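
The VM sync behaviour in item 2 can be sketched as follows. This is a minimal illustration in Python, not CloudStack code: the function name `sync_vm_states` and the dictionary-based state tables are assumptions made for the example. The point it shows is that a mismatch changes only the recorded state and triggers no start or stop operation.

```python
from typing import Dict


def sync_vm_states(db_states: Dict[str, str], hv_states: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the database states updated from the hypervisor report.

    Hypothetical helper: with native HA, a state mismatch triggers no
    operation on the VM itself; only the recorded state is changed.
    """
    updated = dict(db_states)
    for vm_name, reported_state in hv_states.items():
        if updated.get(vm_name) != reported_state:
            # Trust the hypervisor: record what it reports, do nothing else.
            updated[vm_name] = reported_state
    return updated


if __name__ == "__main__":
    db = {"vm-1": "Running", "vm-2": "Running"}
    hv = {"vm-1": "Running", "vm-2": "Stopped"}
    print(sync_vm_states(db, hv))
```

VMs that are present in the database but missing from the hypervisor report are left unchanged in this sketch; how that case is handled is not covered here.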

Non-impacted scenarios

  1. Host maintenance would still be based on existing logic.

Implementation

The following changes will be made:

  1. VM creation logic will be modified for XS 6.2 and above. For HA-enabled VMs, the ha-restart-priority property on the VM will be set to restart. For non-HA VMs it will not be set.
  2. A global configuration setting will be added to enable native XS HA capabilities. If it is not set, the existing custom HA logic will be used, even for XS 6.2 and above.
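
The decision described in the two items above can be sketched as a small function. This is an illustration only, written in Python rather than taken from the CloudStack code base; the function name and its parameters are hypothetical, and only the property name ha-restart-priority and the value restart come from this page.

```python
from typing import Optional, Tuple


def ha_restart_priority(
    xs_version: Tuple[int, int],
    native_ha_enabled: bool,
    offering_ha_enabled: bool,
) -> Optional[str]:
    """Return the value to set for ha-restart-priority, or None to leave it unset.

    Hypothetical helper. Native HA applies only when the global setting is on
    and the host runs XS 6.2 or later; otherwise the custom CS HA logic stays
    in charge and the property is not touched.
    """
    uses_native_ha = native_ha_enabled and xs_version >= (6, 2)
    if uses_native_ha and offering_ha_enabled:
        return "restart"
    return None


if __name__ == "__main__":
    print(ha_restart_priority((6, 2), True, True))   # HA VM, native HA on
    print(ha_restart_priority((6, 1), True, True))   # older XS, custom HA
```

Comparing the version as a tuple keeps "6.2 and above" a single comparison and avoids string comparison pitfalls such as "6.10" sorting before "6.2".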

System VM HA

For system VMs there is already application logic to do HA. Some of it will be enhanced further.

  1. SSVM/CPVM: These are monitored on a regular basis. If one is found to be stopped, the monitoring application restarts it; if it is not found at all, it is recreated. If the agent running on the VM is not responding, only the agent state is updated to 'Disconnected' in the database and nothing else is done. A new VM needs to be started in this case as well; a bug has been filed for this [3].
  2. VR: For virtual router HA, the recommended way is to use redundant VR (RVR) functionality.
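
The SSVM/CPVM monitoring rules in item 1 can be summarised as a decision function. This is a sketch under stated assumptions, not the actual monitoring code: the function name, the action strings, and the `start_new_on_agent_loss` flag (which models the behaviour requested in [3]) are all hypothetical.

```python
from typing import Optional


def system_vm_action(
    vm_state: Optional[str],
    agent_connected: bool,
    start_new_on_agent_loss: bool = False,
) -> str:
    """Pick the monitoring action for an SSVM or CPVM.

    vm_state is None when the VM is not found at all. The flag
    start_new_on_agent_loss is an assumption that models the change
    requested in [3]; the default reflects the behaviour described above.
    """
    if vm_state is None:
        return "recreate"
    if vm_state == "Stopped":
        return "restart"
    if not agent_connected:
        if start_new_on_agent_loss:
            return "start_new_vm"
        return "mark_agent_disconnected"
    return "none"


if __name__ == "__main__":
    print(system_vm_action("Running", agent_connected=False))
```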

Upgrade

Upgrading an existing XS 6.2 cluster requires the following changes:

  1. Native HA needs to be configured on the cluster.
  2. HA-enabled user VMs must be manually modified to set ha-restart-priority to restart.
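
For step 2, the manual change could be scripted. The sketch below only builds the command lines and does not run them. It assumes the property is set through the XS `xe vm-param-set` command with the VM's UUID; the helper name and the example UUID are hypothetical, and the exact syntax should be checked against the admin guide [1] before use.

```python
from typing import Iterable, List


def restart_priority_commands(vm_uuids: Iterable[str]) -> List[List[str]]:
    """Build one xe command (as an argument list) per HA-enabled user VM.

    Hypothetical helper: the caller supplies the UUIDs of the VMs whose
    service offering has HA enabled. Nothing is executed here.
    """
    return [
        ["xe", "vm-param-set", f"uuid={vm_uuid}", "ha-restart-priority=restart"]
        for vm_uuid in vm_uuids
    ]


if __name__ == "__main__":
    # Placeholder UUID for illustration only.
    for command in restart_priority_commands(["00000000-0000-0000-0000-000000000001"]):
        print(" ".join(command))
```

Building argument lists rather than shell strings means each command can be reviewed first and later passed to `subprocess.run` on an XS host without quoting issues.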

[1] http://support.citrix.com/servlet/KbServlet/download/34969-102-704897/reference.pdf (refer section 3.8)

[2] https://issues.apache.org/jira/browse/CLOUDSTACK-3367

[3] https://issues.apache.org/jira/browse/CLOUDSTACK-5247
