High availability of virtual machines (VMs) is a critical requirement for enterprises running their key workloads. OpenStack-based cloud environments are geared toward horizontally scaled, resilient, cloud-native applications. These applications are built on the assumption that application instances may go down, and therefore include additional logic to deal with such failures. But for important legacy applications that require resilience, OpenStack offers insufficient platform support for high availability. Based on our interactions with customers, it has become clear that this drawback has hindered the adoption of OpenStack-based clouds in the enterprise.
In line with ZeroStack’s vision of a self-managed cloud that removes the operational burden of running clouds, the ZeroStack Cloud Platform already supports high availability of the cloud control plane, as discussed in an earlier blog post. The latest release of the ZeroStack Cloud Platform now extends this support to virtual machines as well.
High availability can be configured on a per-VM basis via the UI. The option is supported only for VMs whose storage volumes are all replicated or reside on external storage; such storage remains accessible after the VM is failed over to another host when its current host fails.
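As a rough illustration of that eligibility rule, here is a short sketch; the volume representation and the `"storage"` field are hypothetical stand-ins, not ZeroStack's actual data model:

```python
def ha_eligible(volumes):
    """A VM qualifies for HA only if every attached volume would still be
    accessible from another host, i.e. it is replicated or lives on external
    storage. `volumes` is a list of dicts with a made-up "storage" key."""
    return all(v["storage"] in ("replicated", "external") for v in volumes)

# A VM with even one local, unreplicated volume cannot be made highly available:
print(ha_eligible([{"storage": "replicated"}, {"storage": "local"}]))     # False
print(ha_eligible([{"storage": "external"}, {"storage": "replicated"}]))  # True
```

The check is all-or-nothing by design: a single local volume would be lost with the failed host, making failover unsafe for the whole VM.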
The following screenshot shows a set of VMs in a cluster, along with their high-availability settings.
The event timeline of a VM also shows the actions taken to restart the VM after any errors.
The VM High Availability feature builds upon the control plane high-availability support in the ZeroStack Cloud Platform, outlined in an earlier blog post, as well as OpenStack support for node failure detection and VM evacuation.
A service called Starlife is responsible for monitoring VM health, restarting any crashed VMs, and evacuating VMs from crashed or partitioned nodes. For VM remediation to work reliably, the Starlife service must meet two requirements:
- An instance of Starlife should always be running in the cluster.
- No more than one instance of Starlife should be running in the cluster, to avoid data inconsistencies in a VM caused by conflicting actions from multiple Starlife instances, such as starting the same VM on two different nodes.
Both requirements are guaranteed by the Service Monitor, also outlined in the earlier blog post, which ensures that the Starlife service is running correctly. Since the Starlife service can be migrated to another node on a node failure, it is associated with a virtual IP that migrates with it.
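The two guarantees — at least one Starlife instance, and at most one — can be illustrated with a minimal sketch. The real Service Monitor's coordination mechanism is not described in this post; the in-process lock below is only a stand-in for whatever cluster-wide coordination it uses:

```python
import threading

class ServiceMonitorSketch:
    """Illustrative only: at-most-one active instance via a lock,
    at-least-one by re-electing on a surviving node after a failure."""

    def __init__(self):
        self._lock = threading.Lock()  # stand-in for cluster-wide coordination
        self.active_node = None

    def try_activate(self, node):
        # At-most-one: only the lock holder becomes the active instance.
        if self._lock.acquire(blocking=False):
            self.active_node = node
            return True
        return False

    def on_node_failure(self, failed_node, surviving_nodes):
        # At-least-one: if the active instance was on the failed node,
        # re-elect on a survivor; the virtual IP migrates along with it.
        if self.active_node == failed_node:
            self._lock.release()
            self.active_node = None
            for node in surviving_nodes:
                if self.try_activate(node):
                    break

mon = ServiceMonitorSketch()
mon.try_activate("node-1")   # True: node-1 runs the single instance
mon.try_activate("node-2")   # False: a second instance is refused
mon.on_node_failure("node-1", ["node-2", "node-3"])
print(mon.active_node)       # node-2
```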
The Starlife service issues OpenStack Nova API calls to determine the health of the VMs and nodes in the cluster, and carries out additional health checks as well. On a network partition, the HA VMs running on the partitioned node must be shut down before Starlife evacuates them to a new node; otherwise, two instances of a VM could attach to the same storage volumes and cause data inconsistencies. Agents running on each node are responsible for shutting down HA VMs on detecting a network partition. To avoid prematurely shutting down HA VMs or needlessly evacuating VMs from a partitioned node, Starlife and the agents run multiple node partition checks and take remediation measures only when all the checks indicate a partitioned node. These checks are likewise a combination of OpenStack Nova API calls and custom health checks. On detecting a failure, a VM is restarted or evacuated from the failed node, again via standard OpenStack Nova API calls. Instead of working around OpenStack, we have built a complementary layer on top of it to handle host and VM failure detection and remediation.
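The "remediate only when all checks agree" rule can be sketched in a few lines. The individual checks below are hypothetical stand-ins for the Nova API calls and custom health checks the post mentions:

```python
def node_is_partitioned(checks):
    """Treat a node as partitioned only when *every* independent check
    agrees. A single dissenting check vetoes remediation, which avoids
    premature HA-VM shutdowns and needless evacuations."""
    return all(check() for check in checks)

# Hypothetical checks (stand-ins for Nova API calls and custom probes):
nova_reports_host_down = lambda: True
agent_heartbeat_missing = lambda: True
gateway_unreachable = lambda: False  # this probe can still reach the node

# One dissenting check blocks the shutdown/evacuation path:
print(node_is_partitioned(
    [nova_reports_host_down, agent_heartbeat_missing, gateway_unreachable]))
# False: the node is not treated as partitioned, so no remediation is taken
```

Requiring unanimity trades detection latency for safety: a transient glitch in one signal cannot by itself trigger VM shutdowns on a healthy node.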
High availability of virtual machines is a critical requirement for enterprises, and we are pleased to bring this feature to our customers, who can now run their critical workloads on the ZeroStack Cloud Platform.