Maintaining an impeccable virtual environment is a tough mission. Whether you are using vCenter, external tools, or both, managing cross-vendor configurations on top of the dynamic nature of the virtual environment makes the task almost impossible, especially in large, diversified datacenters. The continuous stream of changes in the virtualization layer and in adjacent layers such as storage puts VMware professionals in a constant chase after configuration and best-practice compliance. Falling short in even one of those many configurations can introduce single points of failure that put the entire system's availability, fault tolerance, replication, and data at serious risk. Let's take a look at three common storage configuration issues in a VMware environment that can jeopardize your service and data availability.
The first scenario is probably something we have all experienced at some point. As our VMware environment grows, we fail to consistently map all storage objects to all the nodes in our cluster. This could be a datastore, LUN, or RDM that was added to the cluster but was unintentionally left unmapped on some nodes, or mapped with inconsistent settings across nodes. It is a pretty straightforward mistake, so how come it still happens? It could be the result of miscommunication between the VMware team and the storage team, or a typo in the zoning or masking values in the SAN fabric that goes unnoticed until an HA event occurs and vMotion cannot correctly balance the load.
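The mapping-consistency check described above can be sketched in a few lines. This is a minimal illustration with hypothetical sample data; in a real environment the per-host inventory would come from vCenter (for example via PowerCLI or pyVmomi), and the host and device names below are made up:

```python
# Sketch: flag storage objects (datastores, LUNs, RDMs) that are not
# visible on every host in the cluster. Inventory data is hypothetical.

def find_unmapped(host_storage):
    """Return {object: missing_hosts} for objects not seen by all hosts."""
    all_objects = set().union(*host_storage.values())
    report = {}
    for obj in sorted(all_objects):
        missing = sorted(h for h, seen in host_storage.items() if obj not in seen)
        if missing:
            report[obj] = missing
    return report

# Hypothetical cluster inventory: host name -> identifiers it can see.
inventory = {
    "esx01": {"datastore-prod", "naa.6001405abc", "rdm-sql01"},
    "esx02": {"datastore-prod", "naa.6001405abc"},   # RDM not mapped here
    "esx03": {"datastore-prod", "rdm-sql01"},        # LUN not mapped here
}

for obj, hosts in find_unmapped(inventory).items():
    print(f"{obj} is not mapped on: {', '.join(hosts)}")
```

Running a report like this on a schedule catches the gap before an HA failover lands a VM on the one host that cannot see its storage.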
Another common issue can happen when a VMware ESX host is connected to multiple LUNs using the ESX native multipathing (MPIO). VMware ESX storage best practice says that all paths to the same LUN must be configured with the same LUN ID, so configuring one or more paths with a different LUN ID puts our data at risk.
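Detecting this condition is a matter of grouping paths by the device they lead to and flagging any device whose paths disagree on the LUN ID. The sketch below assumes a hypothetical path table (device NAA identifier, runtime path name, LUN ID); the values are illustrative, not taken from a real host:

```python
from collections import defaultdict

def find_mismatched_lun_ids(paths):
    """Group paths by device; flag devices whose paths disagree on LUN ID."""
    ids_per_device = defaultdict(set)
    for device, _path, lun_id in paths:
        ids_per_device[device].add(lun_id)
    return {dev: ids for dev, ids in ids_per_device.items() if len(ids) > 1}

# Hypothetical path table: (device NAA ID, runtime path name, LUN ID).
paths = [
    ("naa.6001405abc", "vmhba1:C0:T0:L5", 5),
    ("naa.6001405abc", "vmhba2:C0:T1:L5", 5),
    ("naa.6001405def", "vmhba1:C0:T0:L7", 7),
    ("naa.6001405def", "vmhba2:C0:T1:L9", 9),   # misconfigured path
]

for dev, ids in find_mismatched_lun_ids(paths).items():
    print(f"{dev}: inconsistent LUN IDs {sorted(ids)}")
```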
Usually, the complexity of maintaining path redundancy across multiple layers is the main reason for this type of coordination issue, and it can result in various single points of failure. Eventually, the inconsistency can impact replication and failover to the point where the data on all dependent VMs becomes corrupted.
Here is a problem that might seem harmless to some but can actually cause serious damage: a host that is not part of the vSphere environment, say an HP-UX host, gains access to vSphere-managed LUN disks. This is obviously not something anyone planned, but it can still happen as a result of a masking or zoning mistake, whether configured manually, scripted, or introduced during an HBA upgrade. The impact of such an error is significant, as access to a vSphere LUN from the HP-UX host can cause hundreds of VMs to crash or lose data.
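One way to catch such an exposure is to compare the initiators in each masking view against the known WWPNs of the vSphere hosts' HBAs, and flag any foreign initiator. The WWPNs and LUN names below are hypothetical sample data, and the exact source of the masking views would depend on your storage array's management API:

```python
def find_foreign_initiators(masking_views, vsphere_wwpns):
    """Return {lun: foreign_wwpns} for LUNs exposed to non-vSphere initiators."""
    report = {}
    for lun, initiators in masking_views.items():
        foreign = sorted(set(initiators) - vsphere_wwpns)
        if foreign:
            report[lun] = foreign
    return report

# Hypothetical data: the vSphere hosts' HBA WWPNs, and the array's
# masking views (LUN -> initiators allowed to see it).
vsphere_wwpns = {"10:00:00:00:c9:aa:aa:01", "10:00:00:00:c9:aa:aa:02"}
masking_views = {
    "lun-vmfs-01": ["10:00:00:00:c9:aa:aa:01", "10:00:00:00:c9:aa:aa:02"],
    "lun-vmfs-02": ["10:00:00:00:c9:aa:aa:01",
                    "10:00:00:00:c9:bb:bb:09"],  # unknown (e.g. HP-UX) host
}

for lun, foreign in find_foreign_initiators(masking_views, vsphere_wwpns).items():
    print(f"{lun} is exposed to non-vSphere initiator(s): {foreign}")
```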
The issues above are only a few examples; many more configurations can go wrong without us even knowing until it is too late. So what can we do to eliminate these issues and prevent critical failures from happening? What we suggest is harnessing the power of big data IT analytics to proactively detect such misconfigurations before they become issues that hurt your infrastructure and business.
We believe that prevention is always a better and less expensive solution than a cure. Validating your infrastructure's resilience on a daily basis is therefore another huge step towards operational excellence.