by Yaniv Valik
Sr. DR Specialist, DR Assurance Team
We recently encountered an interesting infrastructure problem at a customer site. Since it may be relevant to others as well, I thought it was worth sharing.
The customer uses EMC SRDF/A for disaster recovery. The environment includes thousands of production servers, thousands of databases and applications, and hundreds of replicated servers. On the 10th of each month, the company generates a large number of monthly customer analysis reports, which drives database I/O on tier-1 storage to very high levels. As a result, the corresponding RDF group could not keep up with the write rate and went offline (suspended). It usually took 24 hours to resolve the issue: the databases had to return to normal I/O rates, and manual intervention was needed to bring the RDF group back online. With an RPO policy of 30 minutes, this was unacceptable.
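To make the arithmetic behind this concrete, here is a rough back-of-the-envelope sketch. It is not a model of how SRDF/A works internally. It only assumes that unreplicated data piles up whenever the write rate exceeds what the replication link can carry, and it estimates how long the link would need to drain that backlog if no new writes arrived. The function name and the throughput figures are hypothetical; only the 30-minute RPO comes from the case above.

```python
def replication_lag_minutes(write_mb_s, link_mb_s, burst_minutes):
    """Estimate how far replication trails production after a write burst.

    Simplified model (an assumption, not vendor behavior): a backlog builds
    only while writes outpace the link, and the lag is the time the link
    needs to drain that backlog once the burst ends.
    """
    if link_mb_s <= 0:
        raise ValueError("link_mb_s must be positive")
    # Backlog in MB accumulated over the whole burst.
    backlog_mb = max(0.0, write_mb_s - link_mb_s) * burst_minutes * 60
    # Drain time at full link speed, converted from seconds to minutes.
    return backlog_mb / link_mb_s / 60


RPO_MINUTES = 30  # the customer's RPO policy

# Hypothetical figures: 400 MB/s of writes against a 250 MB/s link for 2 hours.
lag = replication_lag_minutes(write_mb_s=400, link_mb_s=250, burst_minutes=120)
print(f"estimated lag: {lag:.0f} min, RPO met: {lag <= RPO_MINUTES}")
```

With these made-up numbers the estimate comes to 72 minutes, well beyond the RPO. Any write traffic that does not need to reach the DR site makes that gap worse for no benefit.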
We reviewed the tickets that RecoverGuard had generated during its regular scans of the environment and found that all the temporary databases in the MS-SQL and Oracle environments were being replicated, along with several swap devices. We suggested removing replication for these temporary databases and swap devices. This was not easy, since some of the temporary databases shared storage with other databases and had to be relocated first, but it paid off. The change reduced the load on the RDF group enough that it no longer fails.
Conclusion: Avoid replicating temporary databases, swap devices and page files. You can still place them on high-end storage for best performance, but do not replicate the corresponding devices.
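If you keep a device inventory, a check along these lines can flag candidates. This is a minimal sketch, not RecoverGuard's logic: the `Device` record, the role labels and the device names are all hypothetical, and a real inventory would have to be exported from your own storage and database tooling. It separates replicated devices that hold only transient data from those that share storage with other data, which would need relocation before replication can be removed.

```python
from dataclasses import dataclass

# Hypothetical role labels for data that is rebuilt after a restart and
# therefore does not need to exist at the DR site.
TRANSIENT_ROLES = {"tempdb", "oracle_temp", "swap", "pagefile"}


@dataclass(frozen=True)
class Device:
    name: str
    replicated: bool
    roles: frozenset  # labels describing what is stored on the device


def classify(devices):
    """Return (drop, relocate): names of replicated devices holding transient data.

    drop     -> only transient data; replication can simply be removed
    relocate -> transient data shares the device with other data; move it first
    """
    drop, relocate = [], []
    for dev in devices:
        if not dev.replicated or not (dev.roles & TRANSIENT_ROLES):
            continue  # not replicated, or nothing transient on it
        if dev.roles <= TRANSIENT_ROLES:
            drop.append(dev.name)
        else:
            relocate.append(dev.name)
    return drop, relocate


# Made-up inventory for illustration only.
inventory = [
    Device("dev_a", True, frozenset({"tempdb"})),
    Device("dev_b", True, frozenset({"oracle_temp", "oracle_data"})),
    Device("dev_c", True, frozenset({"oracle_data"})),
    Device("dev_d", False, frozenset({"swap"})),
]

drop, relocate = classify(inventory)
print("remove replication:", drop)
print("relocate temp data first:", relocate)
```

In this example `dev_a` can have its replication removed directly, while `dev_b` mixes a temporary tablespace with production data and needs the relocation step first.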