Gap Analysis #1: Replication Inconsistencies

IT Resilience & Downtime Prevention Blog

by Mark Stensen on February 21, 2009

by Yaniv Valik
SR DR Specialist, DR Assurance Team

Doron’s recent post about the different types of risk that arise when configuration gaps are created got me thinking that you might be interested in more details about individual gaps. In this and subsequent posts, I’ll give you a quick description of a gap we discover in many companies, explain why it occurs, and describe what it means to the business.

I’ll start with Replication Inconsistencies (Different RDF Groups).

What’s the risk?  Data loss and increased time to recover.

How does it happen? This is a common gap found in large EMC SRDF/S and SRDF/A environments where multiple RDF groups are needed. It occurs most often when storage volumes from different RDF groups are provisioned to the same host and used by the same database. The provisioning tools neither warn about nor prevent this configuration. Each RDF group is associated with different replication adapters and potentially with different network infrastructure, so the groups can fail independently of one another. In a rolling disaster, this can result in corrupted replicas at the disaster recovery site.
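As a rough illustration of how this gap could be checked for, the Python sketch below groups each database's volumes by RDF group and flags any database that spans more than one group. The inventory format, the function name `find_split_databases`, and the database names, device IDs and group numbers are all hypothetical; in a real environment the mapping would have to be assembled from the host's volume layout and the array's replication configuration.

```python
from collections import defaultdict


def find_split_databases(volumes):
    """Return {database: sorted RDF groups} for databases spanning more than one group.

    `volumes` is an iterable of (database, device_id, rdf_group) tuples.
    This record layout is an assumption made for the example.
    """
    groups_by_db = defaultdict(set)
    for database, _device_id, rdf_group in volumes:
        groups_by_db[database].add(rdf_group)
    # Only databases whose volumes sit in two or more RDF groups are at risk.
    return {
        db: sorted(groups)
        for db, groups in groups_by_db.items()
        if len(groups) > 1
    }


# Hypothetical inventory: orders_db mixes RDF groups 10 and 12, hr_db does not.
inventory = [
    ("orders_db", "0A1F", 10),
    ("orders_db", "0A20", 10),
    ("orders_db", "0B33", 12),
    ("hr_db", "0C01", 11),
    ("hr_db", "0C02", 11),
]

for db, groups in find_split_databases(inventory).items():
    print(f"{db}: volumes are replicated through RDF groups {groups}")
```

Only `orders_db` is reported here, because its volumes are split across two RDF groups, while `hr_db` uses a single group and is left alone.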

What’s the impact? A rolling disaster scenario is characterized by the gradual failure of hardware and network components, as opposed to an abrupt and immediate cessation. Most real-life disasters are rolling (for example, fire, flood, virus attacks and computer crime). In a rolling disaster, network components will not fail at exactly the same time, so one RDF group can keep replicating after the other has stopped, leaving the two groups out of sync. This will irreversibly corrupt the database at the disaster recovery site. Data will need to be restored from a recent backup, increasing both the RTO and the RPO.
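To make the mechanism concrete, here is a deliberately simplified toy model in Python. It is not a model of how SRDF itself behaves. It assumes a database that writes a log record to a volume in one RDF group and then the matching data block to a volume in another, and it assumes that a write reaches the disaster recovery site only while its group's link is still up. The group numbers, timings and function names are invented for the example.

```python
def replicate(writes, link_down_at):
    """Return the writes that reach the DR site.

    `writes` is a list of (time, rdf_group, volume, txn_id) tuples.
    A write is replicated only if it happens before its group's link fails.
    """
    return [w for w in writes if w[0] < link_down_at[w[1]]]


def orphaned_transactions(replica):
    """Transactions whose data block arrived without its log record."""
    logged = {txn for _, _, volume, txn in replica if volume == "log"}
    data = {txn for _, _, volume, txn in replica if volume == "data"}
    return sorted(data - logged)


# Each transaction writes its log record first (RDF group 10),
# then its data block (RDF group 12).
writes = []
for txn in range(1, 7):
    writes.append((2 * txn, 10, "log", txn))
    writes.append((2 * txn + 1, 12, "data", txn))

abrupt = {10: 6, 12: 6}    # both links fail at the same moment
rolling = {10: 6, 12: 12}  # group 10 fails first, group 12 keeps replicating

print("abrupt failure, orphaned transactions: ",
      orphaned_transactions(replicate(writes, abrupt)))
print("rolling failure, orphaned transactions:",
      orphaned_transactions(replicate(writes, rolling)))
```

In the abrupt case the replica stops at a single point in time and holds no data block without its log record. In the rolling case the replica holds data blocks for transactions 3, 4 and 5 whose log records never arrived, which is the kind of inconsistency a database cannot recover from on its own.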

Why does a DR test miss this? When a company conducts an orderly shutdown of applications, databases and hosts, it leaves data in a consistent state. Gradual/rolling disasters that bring systems or network elements down one by one are extremely difficult to emulate in a DR test.

Note: Many companies actually experience this problem but incorrectly assume it is the result of some network abnormality. Unless the issue is properly diagnosed and corrected, however, it will recur.

If this is of interest to you, you can check out some other typical gaps on our website.

Mark Stensen
Resilience Specialist at Continuity Software