NOTE: Although this design doc refers to IPMI, the fencing tool could be any utility or shell script that can perform the required workflow actions and/or run external commands or APIs.
Detailed description
This is an enhancement to how CloudStack deals with KVM agent uncertainties. Currently there is no clear, automated failover process, because CloudStack is unable to fence a hypervisor. The scenario below proposes how CloudStack would handle agent, link, and server crash issues.
This feature would address crashing hypervisors, a case that is not yet covered.
1) CloudStack notices that KVM host A no longer responds
2) CloudStack asks KVM neighbour host B to check on KVM host A
* Whether or not KVM host B reports on the state of KVM host A, no action is taken, because we cannot be certain what happened: did the agent die, so that we get no NFS heartbeat response, or did the server crash? This feature request addresses the server crash case only.
3) Logic to figure out what really happened to KVM host A:
i. If the “rw” time stamp of any file is newer than the “disconnect” time stamp, take no action and continue to monitor until a failure occurs or the host state changes to “Running”.
ii. If looping through the entire list of files shows that every rw time stamp is older than the KVM host disconnect time stamp, allow a configurable number (x) of additional checks before taking the next action. If a later check returns an updated rw time stamp, take no action and continue to monitor; if no rw update comes through, proceed with this logic.
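The time stamp comparison in step 3 can be sketched as follows. This is a minimal illustration, not CloudStack code: the names StaleCounter, evaluate_check and max_extra_checks (the configurable x) are hypothetical, and time stamps are assumed to be plain epoch seconds.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class StaleCounter:
    """Tracks consecutive checks that saw no write newer than the disconnect."""
    stale_checks: int = 0


def host_is_writing(rw_timestamps: Iterable[float], disconnect_ts: float) -> bool:
    # Step 3.i: any file written after the disconnect means the host is alive.
    return any(ts > disconnect_ts for ts in rw_timestamps)


def evaluate_check(state: StaleCounter,
                   rw_timestamps: Iterable[float],
                   disconnect_ts: float,
                   max_extra_checks: int) -> str:
    """Return "monitor" to keep watching, or "proceed" to move to the next action."""
    if host_is_writing(rw_timestamps, disconnect_ts):
        state.stale_checks = 0          # fresh write seen, start over
        return "monitor"
    # Step 3.ii: every rw time stamp is older than the disconnect time stamp.
    state.stale_checks += 1
    if state.stale_checks > max_extra_checks:
        return "proceed"                # x additional checks were all stale
    return "monitor"
```

With max_extra_checks set to 2, the first stale check and the two additional ones return "monitor", and only the following stale check returns "proceed". Any fresh write in between resets the counter.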
i. IPMI Username
ii. IPMI Password
iii. IPMI Hostname/IP
iv. IPMI Exec {user configurable string}
v. IPMI Action Stop {user configurable string}
vi. IPMI Action Start {user configurable string}
vii. IPMI Action Reboot {user configurable string}
viii. IPMI Action Blink {user configurable string}
ix. IPMI Action Test {user configurable string} -> simple operation to test IPMI interface
x. IPMI Execution Syntax “$IPMI_EXEC --username $IPMI_Username --password $IPMI_password --command $IPMI_ACTION --host $IPMI_HOST”
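One way the execution syntax above could be expanded into a command is sketched below. The placeholder names are copied from the syntax string; the settings dictionary layout, the build_argv helper, and every example value are assumptions made for illustration only.

```python
import shlex
from string import Template

# Execution syntax as given in the settings list above.
SYNTAX = ("$IPMI_EXEC --username $IPMI_Username --password $IPMI_password "
          "--command $IPMI_ACTION --host $IPMI_HOST")


def build_argv(syntax: str, settings: dict, action: str) -> list:
    """Expand the syntax string into an argument list for one action.

    `settings` is a hypothetical dict with keys "exec", "username",
    "password", "host", and "actions" (action name -> configured string).
    """
    values = {
        "IPMI_EXEC": settings["exec"],
        "IPMI_Username": settings["username"],
        "IPMI_password": settings["password"],
        "IPMI_HOST": settings["host"],
        "IPMI_ACTION": settings["actions"][action],
    }
    # Split first, substitute second, so a value containing spaces
    # stays a single argument.
    return [Template(token).substitute(values) for token in shlex.split(syntax)]
```

The resulting list can be handed to subprocess.run without a shell, which avoids quoting problems with passwords that contain special characters.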
Host-level overrides for the IPMI settings are needed because a single cluster may contain a mix of hardware that does not conform to the cluster-level settings. In addition, usernames and passwords may differ between hosts, so it must be possible to override them at the host level.
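The override behaviour could look roughly like this. It is a sketch under assumptions: the setting keys and the effective_settings helper are hypothetical, and a host value of None is taken to mean "not set, inherit from the cluster".

```python
def effective_settings(cluster: dict, host: dict) -> dict:
    """Return cluster-level settings with host-level values layered on top."""
    merged = dict(cluster)  # start from the cluster defaults
    # Only host values that are actually set replace the cluster value.
    merged.update({key: value for key, value in host.items() if value is not None})
    return merged


# Hypothetical example values.
cluster_ipmi = {"exec": "/usr/local/bin/fence-tool", "username": "admin",
                "password": "cluster-secret"}
host_ipmi = {"username": None, "password": "host-secret", "host": "10.0.0.5"}
resolved = effective_settings(cluster_ipmi, host_ipmi)
```

Here the host inherits the executable and username from the cluster, while its own password and address take precedence.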
Detailed use-cases
Add the ability to identify when a host outage occurs and to fence the host using the IPMI interface.
Recommended or proposed technical design & architecture
TBD
Supported Hypervisor(s). Which hypervisor should the new feature work with?
KVM with NFS primarily. However, the storage check component should be modular, so that Ceph or another integration can be written in the future.
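A modular storage check could be expressed as a small interface with one implementation per storage type. The sketch below is illustrative only: the class names, the write_timestamps method, and the use of file modification time as the “rw” time stamp are assumptions, not part of the proposal.

```python
import os
from abc import ABC, abstractmethod
from typing import Dict, List


class StorageActivityCheck(ABC):
    """Pluggable check that reports recent write activity for a host."""

    @abstractmethod
    def write_timestamps(self, host_id: str) -> List[float]:
        """Return last-write time stamps (epoch seconds) for the host's files."""


class FileMtimeCheck(StorageActivityCheck):
    """File-based variant, e.g. for volumes reachable through an NFS mount."""

    def __init__(self, paths_by_host: Dict[str, List[str]]):
        self.paths_by_host = paths_by_host

    def write_timestamps(self, host_id: str) -> List[float]:
        stamps = []
        for path in self.paths_by_host.get(host_id, []):
            try:
                stamps.append(os.stat(path).st_mtime)
            except OSError:
                continue  # unreadable or missing file: no evidence either way
        return stamps
```

A Ceph or other integration would then supply its own subclass without changes to the decision logic that consumes the time stamps.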
Integration into CloudStack UI required?
Yes
Should users be able to access the new functionality via the UI?
Yes
Test Automation. Does the testing need to be automated and repeatable?
Yes