VMware vSphere HA offers a robust set of capabilities for ensuring continuous uptime, even when one of the host servers fails. However, High Availability, Fault Tolerance, and vMotion all depend on the correct configuration of host options, hardware, storage, and virtual networking.
What Can Go Wrong?
As a best practice, a port group must be configured on every host server in the cluster that is expected to run VMs that depend on that port group. The names of virtual switches and subnetwork port groups are case and space sensitive, so they must match exactly on every host. Configuring hundreds of settings with zero errors is extremely challenging, and human error is essentially unavoidable, especially for IT teams with stretched resources and assignment overload.
When a host fails, vSphere HA will restart the VMs that were running on it on other hosts in the cluster. If a port group associated with a VM has not been configured correctly (or at all) on a target host, the VM will not be able to communicate over the network once the failover is complete. In other words, even though vSphere will consider the failover to be successful, the VM will not function as intended. IT teams may not find out until it’s too late.
Here are several examples of how this can occur:
- A simple typo in a port group name.
- Certain port groups are intended to be configured on some hosts in the cluster, but not all (common when affinity and anti-affinity rules are used). An error can occur when a port group is configured on host ‘B’ instead of host ‘A’ (perhaps as the result of a miscommunication between team members, incorrect documentation, etc.).
- The automation scripts contain a bug.
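The failure modes above all come down to a mismatch between the port groups a host is supposed to have and the ones it actually has. The following is a minimal sketch of that comparison, not a definitive implementation. It assumes the port group names have already been exported from each host into plain Python sets. The host names, port group names, and the `audit_port_groups` and `normalize` helpers are hypothetical, and no vSphere API is called.

```python
from typing import Dict, List, Set


def normalize(name: str) -> str:
    """Drop all whitespace and lowercase, so near-miss names compare equal."""
    return "".join(name.split()).lower()


def audit_port_groups(
    required: Set[str],
    host_port_groups: Dict[str, Set[str]],
) -> Dict[str, List[str]]:
    """Return findings per host; hosts with no problems are omitted.

    required: port group names each listed host is expected to have.
    host_port_groups: port group names actually configured on each host.
    """
    findings: Dict[str, List[str]] = {}
    for host, configured in sorted(host_port_groups.items()):
        # Index configured names by their normalized form to spot typos
        # that differ only in case or spacing.
        by_normalized = {normalize(n): n for n in sorted(configured)}
        problems: List[str] = []
        for name in sorted(required):
            if name in configured:
                continue  # exact match, nothing to report
            near = by_normalized.get(normalize(name))
            if near is not None:
                problems.append(
                    f"'{name}' is configured as '{near}' (case or spacing differs)"
                )
            else:
                problems.append(f"'{name}' is missing")
        if problems:
            findings[host] = problems
    return findings


# Hypothetical inventory: esx-b has a case typo, esx-c lacks a port group.
required = {"Prod-VLAN10", "Backup-VLAN20"}
inventory = {
    "esx-a": {"Prod-VLAN10", "Backup-VLAN20"},
    "esx-b": {"prod-vlan10", "Backup-VLAN20"},
    "esx-c": {"Prod-VLAN10"},
}

report = audit_port_groups(required, inventory)
for host, problems in report.items():
    for problem in problems:
        print(f"{host}: {problem}")
```

This sketch applies one required set to every host. Where port groups are intentionally limited to a subset of hosts, the same comparison could be run with a separate required set per host.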
Ultimately, it is the end users who are affected when applications running on the virtual cluster cannot communicate. The risk level and potential damage depend on how critical the applications are to the organization and its users. Examples include:
- End users who cannot reach websites or other online resources.
- Traders who cannot access real-time information or execute buy and sell orders.
- Customers who cannot access online accounts, online information, or other services.
AvailabilityGuard™ offers automated detection and analysis of performance and availability risks across your entire private cloud infrastructure. For example, AvailabilityGuard detects missing or incorrect port group configurations and alerts the IT team, enabling them to fix the issue before it impacts end-users.