VMware ESX: data loss / downtime risks and how to avoid them
Continuing my previous virtualization posts, I’d like to take this time to describe additional examples of what-could-go-wrong-in-my-private-cloud. Here they are:
- Replication issues. For instance, a VM may be stored on an unreplicated LUN (or on a partially replicated set of LUNs). Or it may be replicated, but the last synchronization was done months ago, for example because replication was turned off for maintenance and never brought back online.
- SAN I/O multipath issues. This category includes dead I/O paths, paths configured with incorrect I/O policies, an insufficient number of paths, and an unequal number of paths between nodes. All of these, and more, can result in suboptimal VM/ESX operation and performance, and in reduced availability (in other words, more downtime). By the way, see vSphere 4.0's release notes about the use of the Round-Robin algorithm for I/O load balancing…
- Configuration drift between clustered ESX servers. Over time, differences may arise between the nodes in hardware, software, patch levels, network configuration and so on. These differences result in different levels of stability, availability and performance, depending on the node on which the VM is currently running.
- Image consistency (aka point-in-time copies). Specific solutions and/or procedures must be applied to guarantee that a snapshot taken of a VM is consistent and usable. These may include different I/O freeze techniques, such as Oracle hot backup, VM suspension, storage consistency groups and so on.
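To make the first three items more concrete, here is a minimal sketch of how such checks could look if you already had an inventory of your LUNs and hosts. Everything in it is an assumption made for illustration: the `Lun` and `Host` records, the `"active"`/`"dead"` path states, the thresholds, and the function names are hypothetical and do not come from any VMware or storage vendor API. In practice the data would have to be collected from your own management tools.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class Lun:
    """Hypothetical inventory record for a LUN backing a VM."""
    name: str
    replicated: bool
    last_sync: Optional[datetime] = None  # None means never synchronized


@dataclass
class Host:
    """Hypothetical inventory record for an ESX node in a cluster."""
    name: str
    patch_level: str
    paths: Dict[str, List[str]]  # LUN name -> path states ("active" or "dead")


def replication_findings(luns: List[Lun], now: datetime,
                         max_age: timedelta) -> List[str]:
    """Flag LUNs that are not replicated or whose replica is too old."""
    findings = []
    for lun in luns:
        if not lun.replicated:
            findings.append(f"{lun.name}: not replicated")
        elif lun.last_sync is None or now - lun.last_sync > max_age:
            findings.append(f"{lun.name}: replica is stale")
    return findings


def path_findings(hosts: List[Host], min_paths: int = 2) -> List[str]:
    """Flag dead paths and LUNs with fewer live paths than min_paths."""
    findings = []
    for host in hosts:
        for lun, states in host.paths.items():
            live = sum(1 for state in states if state == "active")
            dead = len(states) - live
            if dead:
                findings.append(f"{host.name}/{lun}: {dead} dead path(s)")
            if live < min_paths:
                findings.append(f"{host.name}/{lun}: only {live} live path(s)")
    return findings


def drift_findings(hosts: List[Host]) -> List[str]:
    """Flag a cluster whose nodes do not all share one patch level."""
    levels = {host.patch_level for host in hosts}
    if len(levels) > 1:
        return [f"patch levels differ across cluster: {sorted(levels)}"]
    return []
```

Each function returns a list of human-readable findings, so an empty list means that particular check passed. The 2-path minimum and the choice of patch level as the only drift signal are arbitrary simplifications; a real comparison would cover many more attributes.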
The award-winning AvailabilityGuard by Continuity Software is a solution that can help you identify and report these configuration errors as soon as they occur, thereby dramatically decreasing the frequency of downtime events, reducing the amount of work around DR testing and significantly improving recoverability.