HA Gap: What can happen when passive cluster nodes have no or only partial access to cluster storage volumes
By Yaniv Valik,
Sr. DR Specialist, DR Assurance Team
One of the most common gaps in clustered environments relates to passive nodes getting out of sync with the currently active node.
A typical scenario occurs when new storage volumes are added to production servers. Occasionally, usually when IT teams are overloaded, the new volumes are mapped only to the currently active node. Because this configuration error has no effect until the cluster fails over, it goes unnoticed. Then, when a failover does happen, the new storage volumes are not available on the new active node (formerly the passive node), and the data on them cannot be mounted. The administrator has to identify the missing devices and map them to the new active node. This usually involves downtime, which is exactly what you expect a cluster to eliminate.
Scheduled, controlled switchovers are often used to overcome this issue. However, even when switchovers are performed regularly, an undetected error like the one described here still results in downtime when it surfaces. Automated configuration monitoring technology (like RecoverGuard) can reduce the number of these critical errors to zero. For instance, RecoverGuard will open a ticket and alert you when the passive nodes of a cluster do not have access to the same SAN storage volumes as the currently active node, so you can fix the gap before a failover event occurs.
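The underlying check is simple to state: every volume visible to the active node should also be visible to each passive node. The Python sketch below illustrates that comparison only. It is not how RecoverGuard works internally, and the function name, node names, and volume identifiers are all hypothetical. It also assumes you have already collected, by whatever means your platform offers, the list of volume identifiers each node can see.

```python
from typing import Dict, Iterable, Set


def find_unmapped_volumes(
    active_volumes: Iterable[str],
    passive_nodes: Dict[str, Iterable[str]],
) -> Dict[str, Set[str]]:
    """Return, per passive node, the volumes the active node sees but that node does not.

    Nodes with no gap are left out of the result.
    """
    required = set(active_volumes)
    gaps: Dict[str, Set[str]] = {}
    for node, volumes in passive_nodes.items():
        missing = required - set(volumes)
        if missing:
            gaps[node] = missing
    return gaps


# Hypothetical inventory: volume identifiers visible to each node.
active = ["lun-001", "lun-002", "lun-003"]
passive = {
    "node-b": ["lun-001", "lun-002"],             # lun-003 was never mapped here
    "node-c": ["lun-001", "lun-002", "lun-003"],  # fully in sync
}

for node, missing in sorted(find_unmapped_volumes(active, passive).items()):
    print(f"{node} is missing: {', '.join(sorted(missing))}")
```

Run on a schedule, a comparison like this would flag the gap on node-b while the cluster is still healthy, rather than at failover time.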