by Yaniv Valik
Sr. DR Specialist, DR Assurance Group
This time I’d like to share with you a recent incident from one of our new customers. Before I dive into the details, let me briefly describe the environment. The data centers rely on HDS USP storage with a mix of Sun Solaris, HP-UX, Linux and Windows servers. Ten percent of the environment is virtualized (VMware ESX). On the database side, Oracle and SQL Server are in use.
This is a 24/7 environment in which downtime is a disaster. The company considers high availability and disaster recovery as mission critical. The IT staff is highly aware of change management in general and specifically how changes in production may impact high availability and readiness for disaster.
As for replication, the following process is implemented:
– Local ShadowImage copies are created twice a day and backed up to tape by Symantec NetBackup.
– Data is replicated with TrueCopy synchronously to a remote site. TC replicas are mapped to the DR servers.
– Three point-in-time ShadowImage copies are taken at the remote site.
I believe every IT pro would classify this as an advanced and modern DR solution. And it is. Nevertheless, even in environments with heightened awareness of high availability and disaster recovery, gaps and configuration drifts are unavoidable. In this case, the local point-in-time ShadowImages taken for backup were not consistent! Data could not be restored from backup.
How did it happen?
As I mentioned before, this is a 24/7 environment. For this reason, database cold backup is out of the question. Instead, they use hot backup, in which the database remains online and accessible. To ensure image data consistency, a very delicate process must be implemented. Any deviation from this process may result in image consistency issues… which would render the backup unusable.
In essence, the process is:
- Synchronize the replicas.
- Verify that full synchronization is achieved.
- Enter hot backup mode.
- Split the replicas of the data files.
- Verify that the split completed successfully.
- End hot backup.
- Switch logs.
- Split the replicas of the log files and verify that the split completed successfully.
Additional steps may include creating copies of control files, enabling storage consistency solutions such as EMC / HDS Consistency Groups, etc.
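To make the sequence concrete, here is a minimal orchestration sketch of the split process above. It assumes HDS CCI (RAID Manager) commands (pairresync, pairevtwait, pairsplit) and sqlplus are available on the host; the run() wrapper and the device-group names ORA_DATA/ORA_LOGS are hypothetical placeholders, not the customer’s actual configuration.

```python
import subprocess

def run(cmd):
    """Hypothetical wrapper: run a command and fail fast on any error."""
    subprocess.run(cmd, shell=True, check=True)

def hot_backup_split(data_group="ORA_DATA", log_group="ORA_LOGS"):
    # 1-2. Resynchronize the replicas and wait for full PAIR state.
    run(f"pairresync -g {data_group}")
    run(f"pairevtwait -g {data_group} -s pair -t 600")
    # 3. Put the database into hot backup mode.
    run("echo 'ALTER DATABASE BEGIN BACKUP;' | sqlplus -s / as sysdba")
    try:
        # 4-5. Split the data-file replicas and wait for the split to finish.
        run(f"pairsplit -g {data_group}")
        run(f"pairevtwait -g {data_group} -s psus -t 600")
    finally:
        # 6. End hot backup even if the split failed.
        run("echo 'ALTER DATABASE END BACKUP;' | sqlplus -s / as sysdba")
    # 7. Switch and archive the current redo log.
    run("echo 'ALTER SYSTEM ARCHIVE LOG CURRENT;' | sqlplus -s / as sysdba")
    # 8. Split the log-file replicas and verify.
    run(f"pairsplit -g {log_group}")
    run(f"pairevtwait -g {log_group} -s psus -t 600")
```

The point of a single script like this is that the ordering is enforced by code rather than by two schedulers agreeing on the time of day — which, as we will see, is exactly where this customer was exposed.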
Note that this process involves different silos, platforms and IT teams.
So how did it happen?
The timing of events was not fully synced.
The database entered hot backup “only” 2 minutes after the replica split had already been initiated. In this specific case, the issue was caused by different teams using different schedulers (Control-M, crontab). Of course, there could be various other reasons (clocks not in sync, daylight saving time configuration differences, misunderstandings between IT teams, a change performed by one team that another team was unaware of…).
Also, the customer uses Oracle ASM. This increases the risk of data inconsistency in a hot backup scenario: even with no client altering data, Oracle may perform automatic rebalancing while the point-in-time copies are being generated. (Rebalancing affects replica data consistency just as database writes do.)
The first scan by RecoverGuard exposed this vulnerability. The various IT teams were completely unaware of this situation and were amazed when it was discovered. The error was immediately rectified, along with other errors and improvement opportunities detected for Hitachi Dynamic Link Manager (HDLM, used for multi-pathing), Microsoft SQL and Veritas Cluster.
The Business Continuity/DR manager further explained that their quarterly DR tests only include verifying the TrueCopy replicas, and even then they perform a graceful shutdown of production, which simulates neither their routine backup/replication procedures nor a true disaster scenario.
Today’s data centers are simply too complex to manage by hand. Even with the best teams and processes, configuration drifts and gaps are unavoidable in constantly changing data centers.