Critical Recovery Risks Most DR Tests Miss

Note: This is a transcript of a webinar that was originally presented in January 2009.

 

 

Speaker: Doron Pinhas, CTO, Continuity Software

I’ll begin by introducing Continuity Software and then I’ll be talking about the issue at hand, which is DR testing and the vulnerabilities or risks that will remain uncovered during standard DR testing. Our mission is to help organizations keep their disaster recovery and high availability at peak performance, at all times. We have been working with hundreds of enterprises trying to learn how they handle DR and the kinds of problems they experience. We have gathered quite a lot of information about what’s going on in the market and we will be sharing this with you today.

In today’s webcast we’ll be covering several issues.

We’ll begin with a short overview of current DR practices. We’ll let you know what we have seen in the field and have a short discussion about the problems we see with current testing approaches, and what can be done about it. We’ll give an overview of the areas we think are not covered well in traditional DR testing. We’ll see some examples of problems that can occur, which I hope will be illuminating for you. Finally we’ll have a short discussion about how to better cope with the shortcomings of common DR practices and DR testing practices. 

Investment in DR is a given. 

 
 
Most of you are here because either you are already invested in DR and have active DR sites, or you’re just on the verge of doing so and are considering your options. The motivation for doing so is pretty well established now. There is the cost of downtime, regulations and many other factors that all contribute to your need to build DR and high availability solutions. The bottom line is that the risk of not having DR outweighs any other considerations, including the significant cost of building such solutions.
 
 
However, keeping DR systems working and in good order is a true challenge. Most of you will probably identify with this story: You build your DR systems for the first time and test them and after some tweaking and lots of creative work, the systems are working. The real challenge is keeping them working for years while the production systems keep on changing.
 
Our main problem here is that DR can get out of sync because production systems undergo changes all the time. The changes fall into many categories. They can be small, such as increasing a certain kernel parameter value. They can be larger, such as retiring an entire cluster every several years or replacing a storage array. And, every decade or so, you may find yourself struggling with a project which is extremely large, such as migrating your data center.
 
Some changes are well planned. Contingencies are taken care of just in case something does not go smoothly and then you pick your maintenance hours. The reality is that not all changes will go smoothly. That’s why we plan for testing. We have the production systems running to show us that everything works. If worse comes to worst, the next day users may complain, but the net of it is that everything we do to production is bound to work eventually.
 
When we have a DR site or standby systems in a local cluster, we need to apply many of those changes as well. If we add some more devices to a clustered environment, we need to make sure that all the standby nodes can see that extra storage, etcetera. And the same goes for DR servers. That’s the reason why most of us have very strict change control procedures. One of the important features of those procedures is to make sure that we apply the same changes to the DR site and to the standby systems. Assuming nothing is forgotten – which is regretfully not the case because some changes always go unnoticed and don’t find their way to DR -- but assuming everything is done, we don’t have a chance to make sure that the changes have been applied correctly. Standby systems are usually down. They may be repurposed and running other applications just to make use of the investment, but they are not running production applications.
 
So, that’s the scenario. We apply changes to an environment that we can test and some of these changes are very complex. Then we apply those same changes to the standby systems, but we have no real way to make sure it actually works there. Within that process, gaps or configuration drift are bound to occur. We have been working with many customers and we can testify that whenever we do an analysis or an audit of those environments, we find quite an alarming number of discrepancies between production and DR. That’s the real challenge: How can we know that the systems are ready when we are aware that they may be out of sync?
 
How do we make sure DR works?
 
 
Most of us are doing some form of DR testing. The common practice is to conduct annual testing. In all of the cases, practically speaking, this is done manually and planned well in advance. The investment in planning and carrying the test to fulfillment is quite expensive. The problem is that when we do testing, the failure rate is quite alarming. This is not only our opinion, based on our experience working with customers. A failure rate of 70% is quite common according to most analysts.
 
There is another approach called auditing, but it’s very rarely carried out rigorously. The truth is, you probably do it yourself whenever you plan a DR test. There are two types of auditing. One is a “white box” approach, meaning that you have rigorous change management processes. Then, prior to the DR test or prior to a significant event such as a major storm, you review your systems, go over all the change records done in production, and make sure one final time that they have been carried out correctly in DR and in the standby servers. That’s a useful approach, assuming your change controls are very strict. If something wasn’t recorded correctly or not recorded at all then you will still be less exposed. Another approach is to try to collect data and create a document describing the state of the environment just prior to the test. This involves quite a bit of manual effort. You need to understand which applications are out there, which databases are installed, how they are mapped to storage, how server files are configured and how applications are configured. It’s quite a bit of work but it’s extremely useful in helping you understand whether there are any differences between production and DR. Auditing is not a structured process. It’s very rare to find an organization that does it on a regular basis, although this is a practice we do recommend.
 
What does a typical test look like? 
 
 

DR testing is very risky and very expensive. It’s risky because there is a chance that data will get lost and there is the chance that there will be unplanned downtime. The planned downtime itself is a challenge. It sometimes gets extremely difficult to find the necessary maintenance window in which to do a full test of a DR site. Since the risk is so high and the risk of downtime is high, we all take great care to make sure everything is ready before the test.

In a sense, you can say that this is some kind of cheating because the most realistic test would be just to pull the plug on your power. But no one wants to take that risk, so we expend effort making sure our systems are prepared, and if there is anything that needs to be fixed then that’s a sign that the systems were not prepared in the days prior to the test. When we feel comfortable enough, we are ready to cut off communication lines or pull the plug and start the testing.

The cost of these activities can be quite staggering. One of our customers shared with us the cost estimates they made. They did quite a thorough workup, collecting all the expenses they incurred before a major test. It was $1.8 million just for one test. Most of you probably are not keeping track of the costs, but if you do you’ll find that it is not an easy endeavor. With all that risk and all that preparation, something in the DR testing process is bound to be compromised. From our experience, all DR tests are some sort of a compromise. We are not finger pointing because no one can actually do better. Real life scenarios are extremely hard and extremely risky to construct because ultimately one of the best tests would be to torch your data center and see if you can recover. Now that would be very realistic but obviously you wouldn’t do it.

Looking at what happens in a real disaster recovery scenario, you know that systems fail chaotically. They won’t ever get shut down in a graceful manner. Making a test truly realistic is very difficult, and perhaps irresponsible, so we don’t do it. Another reason that DR tests are not true to life is that in the majority of cases systems do not really carry a full load for a significantly long period of time. It may seem that they work, but in some cases if you were to try to exercise your systems for a week you would find many other issues arise that will not be uncovered in a DR test.

On top of all that, it’s extremely difficult to make sure you test everything when you test all of your systems at once. Sometimes it is not even feasible because it would take too much time and you can’t afford to do that. Many organizations will resort to the option of fractional or progressive testing, meaning that they will test a subsection of the data center each month. At the end of each year they will have completed a cycle in which each and every system has been tested. That’s also unrealistic because in real life, systems won’t fail this way. If that is the case, what is compromised?

 
Failure simulation is unrealistic. Later in this presentation we will see some examples of what can go wrong and how that can leave many areas exposed to risk. You’ll see hidden production dependencies, which means there are some dependencies you are not aware of that will not impact the success of your DR test but in real life will actually cause your recovery to fail. Another example is the situation in which your DR site is under-provisioned, meaning that either your hardware is not powerful enough or, more regretfully, may be powerful enough but is not configured properly. In that case, you’ll end up with systems that can’t function well. There is also a very interesting area for those of you who are keeping remote point-in-time copies, in which there are significant risks to the value of these copies.
 
At the bottom of your slide is a simple, visual attempt to demonstrate your readiness. The green color denotes the period in which your systems are in good shape. Just prior to the test, when you do your auditing and fix the known problems, you’ll eventually reach a pretty good state. After the test you will fix everything else that was unraveled. Then your systems are in top shape, but were they in such a great shape before the test? Since the failure rate is so high, as we discussed earlier, you should assume that they were not. How soon will they get out of sync again? That depends on your thoroughness. We will discuss some options to increase the ratio of those green sections to the red sections. We have learned that there is an alarmingly large portion of the time in which the systems are not ready. There are ways to measure that and perhaps we’ll have time to touch on that before we end.
 
Let’s review some examples of risk:
 
 
Example 1 falls under the category of unrealistic simulation. In this slide, to the left, we have a production site with a certain production database which is stored on a storage area network or SAN device. Those of you who are skilled in the art are probably well aware of the fact that storage devices within storage frames can be guaranteed to be consistent. There are ways to do that. But in many cases you’ll find that different devices belong to separate groups, and consistency across those groups can’t be guaranteed, so they can’t be guaranteed to be recoverable together. Examples are RDF groups in Symmetrix arrays, or the equivalents from any other leading storage vendor such as Hitachi and IBM. Within an array, consistency can be guaranteed but between two arrays it’s not guaranteed by default. It’s not that there is no way to make certain it will be, but you need to apply distinct mechanisms to do that.
In this example we have a situation in which a database is stored on multiple devices that belong to different consistency groups. No steps were taken to make sure that all of these devices would be brought into the same consistency group. This means that if the links carrying Group A, which is replicated all the time, go down, then that replica will stop immediately. So what happens to Group B? If it fails at the same time and we have a copy which guarantees the write order fidelity, and that copy is accessible, correct and consistent, then that’s fine. This is what always happens during a test because during a test we quiesce the databases and bring them down in an orderly fashion. Now we have no I/O, and we can disconnect the line. It doesn’t matter in which order because the replicas are now consistent to a point in time. When we bring the DR systems up, everything will work very well. In real life it almost never happens this way. If you are looking at a rolling disaster scenario your communication lines will probably fail at different times. You are bound to reach a situation in which communication lines fail one after the other. It means that there is a very high likelihood that some of the devices will get frozen while the rest will continue carrying some I/O, at least for some time. Now your remote copies are totally corrupt, unusable, and unrecoverable. You’ll have to go to tape or go to any point-in-time copy you have. This is a situation which will always work during DR testing but it will almost always fail in real life.
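
One way to look for this exposure ahead of time is to audit how each database’s devices map to consistency groups. What follows is only a minimal Python sketch, not a vendor tool: the device names, the group labels and the `find_split_databases` helper are all hypothetical, and in practice the mapping would have to be collected from your storage arrays.

```python
from collections import defaultdict


def find_split_databases(device_map):
    """Return databases whose devices span more than one consistency group.

    device_map: iterable of (database, device, consistency_group) tuples.
    All names here are illustrative; real data would come from the arrays.
    """
    groups_by_db = defaultdict(set)
    for database, _device, group in device_map:
        groups_by_db[database].add(group)
    # A database is at risk when its devices sit in two or more groups,
    # because write order is not guaranteed across groups by default.
    return {db: sorted(groups)
            for db, groups in groups_by_db.items()
            if len(groups) > 1}


if __name__ == "__main__":
    inventory = [
        ("orders_db", "dev_001", "group_A"),
        ("orders_db", "dev_002", "group_B"),
        ("hr_db", "dev_010", "group_A"),
    ]
    for db, groups in find_split_databases(inventory).items():
        print(f"{db}: devices span groups {groups}")
```

Any database this reports is a candidate for the rolling-disaster failure described above, even if every DR test so far has passed.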
 
Here’s another very common example:
 
 
In this example we have a production server that uses certain network resources. There are many kinds of network resources. It can be a DNS server or a network file system, and so forth. In many cases we copy the configuration of the production server to the DR site and copy the link as well. We end up with a standby server which is configured to query the same server that is located on the production site. If you don’t take particular care to isolate each and every communication path between your production and disaster site, then your DR server, when brought online, will still be able to access the network resource on the production site and function correctly. Your test will pass. In a real DR event you won’t have that resource. It will go down along with the entire site. Now your standby server is trying to reach a resource which is not accessible, so it can’t start. And then perhaps an entire chain of adjacent applications won’t be able to start. We have to ask, was that resource really protected? If it was, then we are just talking about some downtime. It will take some time, under the stress of dealing with all the other issues found, to isolate the problem. Once we find where the replicated network resource is located, we can remap it and finally we can start. So, it can take some time but you can fix it. But if it turns out that we don’t have a valid copy, or that the copy is not up to date, then we are in real trouble when we have to recover. This problem won’t be detected in real DR testing.
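
A hidden dependency of this kind can sometimes be spotted by checking whether the addresses a standby server is configured to use fall inside the production site’s networks. The Python sketch below illustrates the idea under stated assumptions: the resource names, the addresses, the 10.1.0.0/16 production range and the `production_dependencies` helper are all hypothetical, and collecting the configured endpoints from a real server is left out.

```python
import ipaddress


def production_dependencies(standby_endpoints, production_networks):
    """Return standby dependencies whose address lies in a production network.

    standby_endpoints: {resource_name: ip_address_string} as configured on
    the standby server (hypothetical input format).
    production_networks: list of CIDR strings describing the production site.
    """
    networks = [ipaddress.ip_network(n) for n in production_networks]
    flagged = {}
    for name, address in standby_endpoints.items():
        ip = ipaddress.ip_address(address)
        # The standby would lose this resource if the production site is down.
        if any(ip in net for net in networks):
            flagged[name] = address
    return flagged


if __name__ == "__main__":
    endpoints = {
        "dns": "10.1.0.53",       # copied from the production configuration
        "nfs_home": "10.1.4.20",  # file server at the production site
        "ntp": "10.2.0.10",       # resource at the DR site
    }
    for name, address in production_dependencies(endpoints, ["10.1.0.0/16"]).items():
        print(f"{name} -> {address} is on the production site")
```

This kind of check does not replace isolating the DR site during a test, but it can point at resources worth replicating or remapping.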
 
Another example:
 
 
Here we can see configurations of two servers. This is a cluster, meaning it is a production server with a standby that is supposed to be able to restart the applications when the node fails. What we can see here is a collection of discrepancies. Reviewing quickly, we’ll see some hardware incompatibility. The standby has only one HBA, meaning it has less I/O bandwidth as it relates to disk I/O, so it’s less powerful. It’s missing some service packs, it has differences in its kernel parameters setup, and there are other issues as well. Someone took the trouble to extend the threshold on the production server. In this example it’s a kernel parameter that specifies how many files the server can open at a given time. You usually increase it when you find that the server fails at some point. By extending the parameter, you can be certain everything will function properly. What happens if you don’t apply that change to the standby? It may be that this change was done a year ago, and now your systems are running perfectly. You know that if it is not set up this way, your systems will fail.
 
Here’s where the problem arises. When you do a DR test, you usually won’t put the systems under full load. You’ll just start them, run some tests, and then you’ll shut down and resume operations in production. You may not necessarily have the chance to notice that this server, while it is working, can’t withstand significant load, that it doesn’t have enough space or open files allowed, so it will fail. This is something that won’t be detected in a DR test unless it is very rigorous. Unless you take the time to apply the full load to the DR server, you will have a situation where the standby may run well for an hour but then it will crash. With the stress and confusion of solving other problems in a DR situation, it can be quite difficult to retrace your steps. After all, in this example, the change was done a year ago.
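
Discrepancies like these are easy to find once both configurations have been captured in a comparable form. The following is a minimal Python sketch of such a comparison. The setting names and values are made up for illustration, and `config_drift` is a hypothetical helper, not part of any product.

```python
def config_drift(production, standby):
    """Return {setting: (production_value, standby_value)} for every mismatch.

    Both arguments are plain dictionaries of captured settings. A setting
    missing on one side is reported with None for that side.
    """
    drift = {}
    for key in sorted(set(production) | set(standby)):
        prod_value = production.get(key)
        standby_value = standby.get(key)
        if prod_value != standby_value:
            drift[key] = (prod_value, standby_value)
    return drift


if __name__ == "__main__":
    # Illustrative values only.
    production = {"hba_count": 2, "max_open_files": 65536, "service_pack": "SP2"}
    standby = {"hba_count": 1, "max_open_files": 8192, "service_pack": "SP1"}
    for setting, (prod, stby) in config_drift(production, standby).items():
        print(f"{setting}: production={prod} standby={stby}")
```

A report like this would surface the open-files limit long before a real failover puts the standby under full load.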
 
Let’s look at example 4:
 
 
Here we can see a very common scenario. Around 40 percent of the customers we work with have started to use this configuration in the last two or three years. In this example, the production data is a file system but it can be any type of data-containing element in your data center. You replicate the data and you also take copies of that copy every once in a while. The primary copy is always synchronized. It doesn’t really matter if it’s synchronous or asynchronous. This is a live copy so if you have a failure and your production site fails abruptly you will be able to restart without data loss. In many real life cases, especially in rolling disaster scenarios, some of the data in production may also get corrupted. Some of the disaster scenarios involve massive data corruption. It can be caused by a virus attack, deliberate human sabotage, a major software bug or multiple other reasons. If your data at the source is corrupt, it means that your most current copies are also corrupt. When you decide to fail over, you have corrupt data at both ends.
 
The only way to protect against that is to keep additional point-in-time copies, even though it requires some extra expenditure. You’ll purchase additional storage, then create some scheme -- for example keep a copy which is an hour old and another copy from yesterday plus two other copies from last week -- so that in the event of a logical corruption you will still be able to recover very quickly. Remember that a recovery from point-in-time copies is much faster than recovery from tape, so you are still protected. You will lose some data but you have no choice.
The trouble is that nobody really tests the validity of those copies. Typically, what happens is you shut down your production in an orderly manner, or you disconnect communication lines in an orderly manner, so no one can plant viruses or deliberately delete production data just prior to the test. The last copy is bound to be okay. You test only that copy, and everything is just fine. In fact, it can be totally impractical to test all those point-in-time copies because of the time involved, especially if you are using snapshots. You may have dozens of copies and you can’t repeat your DR test for each and every copy separately. Now you don’t really know if they are safe.
 
There are many reasons why these copies can go corrupt. We have found many incidents in which those copies have been corrupt for years. They keep on going through the motions to get synchronized every day and every hour, but something in the setup or in the script generating them is not perfect and so they are not OK. You won’t have the time to repeat your test several times for each copy, so if you are unlucky enough to have a real DR scenario in which you actually need to use those copies, you’ll find that you are not protected, and by then it’s too late. It’s a real challenge. I’m not suggesting that you should start testing every copy. There are probably other ways to audit those and make sure they are OK.
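
One inexpensive audit, short of testing every copy, is to confirm that each point-in-time copy actually covers every device the production data lives on. The Python sketch below shows that single check and nothing more. It assumes the device lists can be exported from your storage tooling, and the copy names, device names and the `incomplete_copies` helper are hypothetical. It would not detect every way a copy can go bad.

```python
def incomplete_copies(source_devices, copies):
    """Return {copy_name: [missing devices]} for copies that omit source devices.

    source_devices: iterable of device names holding the production data.
    copies: {copy_name: iterable of source devices captured by that copy}.
    """
    source = set(source_devices)
    report = {}
    for copy_name, copied_devices in copies.items():
        missing = source - set(copied_devices)
        if missing:
            # The copy script ran, but it does not cover all production data.
            report[copy_name] = sorted(missing)
    return report


if __name__ == "__main__":
    production = ["dev_001", "dev_002", "dev_003"]
    point_in_time = {
        "hourly": ["dev_001", "dev_002", "dev_003"],
        "daily": ["dev_001", "dev_002"],  # dev_003 was added later
    }
    for name, missing in incomplete_copies(production, point_in_time).items():
        print(f"copy '{name}' is missing {missing}")
```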
 
Moving on to example 5:  
 
 
In this example we can see some production data that is replicated and accessed correctly by your DR host, so everything is fine so far. But because of some configuration errors or other mistakes, some of the devices which are intended for recovery are also visible to unauthorized hosts. Usually when you do DR testing you don’t just restart and repurpose all of your servers on the DR site. It is perfectly reasonable to restart only the standby host. It will get access to the devices, work perfectly, and you will conclude that everything is fine. Perhaps the unauthorized host will remain shut down or it may be doing something else, so everything looks fine. In a real DR event you are bound to restart each and every server on the DR site and then you may discover that the unauthorized host actually comes into play. If that happens there is a great chance of corrupting your only remaining copy. This can be extremely frustrating because you test each time and it looks fine. When there is a DR event, after you assumed you’re safe, there goes your unauthorized server, corrupting your only valid copy. Each and every time we do an audit at a customer site we find this issue.
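
This exposure can be audited by comparing which hosts can see each replica device against which hosts are supposed to. Here is a minimal Python sketch, assuming the visibility information can be gathered from your storage and SAN configuration. The host names, device names and the `unauthorized_access` helper are hypothetical.

```python
def unauthorized_access(visibility, authorized):
    """Return {device: [hosts]} for hosts that see a device without authorization.

    visibility: {device: iterable of hosts that can currently see the device}.
    authorized: {device: iterable of hosts allowed to use the device}.
    A device absent from `authorized` is treated as having no allowed hosts.
    """
    findings = {}
    for device, hosts in visibility.items():
        extra = set(hosts) - set(authorized.get(device, ()))
        if extra:
            findings[device] = sorted(extra)
    return findings


if __name__ == "__main__":
    seen_by = {
        "replica_001": ["dr_host_1", "test_host_7"],
        "replica_002": ["dr_host_1"],
    }
    allowed = {
        "replica_001": ["dr_host_1"],
        "replica_002": ["dr_host_1"],
    }
    for device, hosts in unauthorized_access(seen_by, allowed).items():
        print(f"{device} is visible to unauthorized hosts: {hosts}")
```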
There are several other examples that can translate into downtime, data loss or performance hits.  
 
 
I have shown you these examples to impress upon you the fact that DR tests are not fully realistic and can’t detect all of the risks to recoverability.
 
Let me share some tips with you for controlling the unknown. The main idea is to start becoming aware of your “blind spots” because we don’t pay enough attention to these. We have our plans, we stick to them and we believe that they are perfect. Many organizations will put tremendous levels of effort and creativity into building those plans. That’s the best that we can do, but we are still left with some blind spots.
 
 
 
 

Given these examples I’ve shown you, it is not too difficult to isolate some of those blind spots. Do we look into our point-in-time copies? Do we look into our server configuration? Do we put loads on the servers? Once we recognize our blind spots, we can either be reactive, proactive or both. What does it mean to be reactive? At a minimum, I would urge you to create contingency plans so in the event you have a failure during a real DR situation – for example, your point-in-time copies are corrupt – then you will know what to do. A contingency plan outlines your strategy. For instance, will you recover from tape? Same goes for clusters. What should you do? And the plan should be well documented. You won’t have the time to think about how to solve those issues in the heat of recovery. Understand your blind spots and plan ahead. That’s the reactive mode. 
How can you be more proactive? There is the opportunity to audit these particular areas in which you are more vulnerable, in which the test has not performed well. Once you see your blind spots there is also an opportunity to improve your plan. You may consider having every other test cut your communication lines without bringing down the production application first. That can be an interesting idea. Please exercise that with care because if you are using synchronous replication then it may actually pause your production applications. It can be done if well prepared. Consider running more load on the servers, and consider verifying that you actually replicate your network resources. That’s being proactive. If you are doing fractional testing, try to think again. Can you afford to have a full test? If not once a year, then at least every other year. That’s extremely important because some of the dependencies will never be tested unless you do that, so that is also an important practice. Take extra care to make sure your DR site is fully isolated. A lack of isolation is one of the main reasons why your test might look successful when recovery would actually fail.
Let’s get back to auditing. Auditing means that you constantly collect and capture configuration information that relates to your servers, storage, databases and applications, all of which needs to be documented somehow. Once you have that, you have the first opportunity to compare production to DR. Knowing your blind spots, you can look into cluster nodes and every once in a while just compare kernel parameters or software versions. That’s something you can’t do unless it is documented. The reason you need to document them in advance and not do it on the fly is that when you have a real DR event, you won’t be able to take a sneak peek at production and see what is different between production and DR. You also need to make sure that your change controls are strict enough. You need to document each and every change. Once you have all that information you can start doing auditing more frequently.
 

The next stage in the evolution is to try to introduce some automation into auditing. Can you collect the data automatically? There are some technologies out there to do that, including our own, and that’s definitely something worthwhile. Finally when we look at automation, there is a plethora of provisioning tools in multiple areas. Automation is a great idea. The main problem is that DR data centers have multiple technologies installed. While you have automation tools from your favorite virtualization vendor, your server vendor or your data center management tool vendor, they are often not well integrated. When you end up considering your solution, pay extra attention to the question of whether your automation solution will integrate well with the rest of your environment. 

How can we help? We at Continuity Software have a platform that solves some of the problems we have discussed. 

 
 
We have an agent-less technology, a software product, which really does some interesting things. It will automatically discover your high availability and DR configuration. It will actually collect – automatically - the data from your server, storage and databases to build an online documentation of your environment. That documentation is updated all the time, so you have live repositories of your environment. With that knowledge, we can deploy quite an interesting technology which is signature-based. Today we are able to test for around 3,000 different distinct DR and HA risk scenarios. The scenarios we have discussed today are covered by those risk signatures. The technology reviews that signature database, which we update each day, to determine whether you have any of those vulnerabilities. If any problems exist, you get flagged. The benefit is that you have full daily analysis of your entire environment. If any risks exist, you’ll be flagged immediately so you can address the problem and be less exposed. By automating data collection, you will be able to do auditing much more effectively.    
With that, I’ll end my presentation and we’ll get to your questions.
 
Question # 1: This question is about using VMware Site Recovery Manager and how it fits into the plan we’ve just outlined.
 
Answer: That is a great tool and if it fits your environment, by all means use it. However, I don’t really think that, at this stage, it will really fully address the problem we’re discussing today. VMware environments tend to be second tier and definitely most of your production is running on physical nodes. Even if you deploy VMware to a great extent, you’ll find that Site Recovery Manager can automate only the VMware portion of the data center. While it is something you should definitely consider, it won’t solve the entire problem. Data center failovers would still need to be monitored and provisioned manually. A word of caution: even if you are using VMware Site Recovery Manager, I would still urge you to audit your environment because while the tool is great it doesn’t necessarily test each and every best practice that relates to setting up high availability environments.
 
Question # 2: There is a question about techniques used to reconcile data loss associated with synchronized mirrors.
 
Answer: There are two issues here: Is it a real DR, and did you lose your data? If your data is lost, there is little you can do in real time to reconcile your data. If you have a large database spanning 50 storage devices and some of these are not consistent, meaning they are not at the same point in time, then there is little to do in real time. You can try to prevent this situation. The way to do that is to correctly map your production devices to your replicas and make sure that you can verify that they are in the same consistency group. And that is something we can show you how to do. My best strategy for this is prevention. Every now and then, audit your database, make sure that the storage is consistent and that the replica is defined in consistency mode. That can be a challenge if you are unprepared but with the right techniques it can be done.
 
Question # 3: Someone has asked me to clarify the “white box” vs. the “black box” approach.
 
Answer: I must admit that this is our own invention; I’m not sure these phrases are that common. What we mean by “white box” testing is that you actually review your change records. It means you have a system in which you document production changes as part of your change management process. You are rigorous enough to update details of the change -- if you apply a software fix you specify which version, the source of the software, etc. Then in between tests you can backtrack all of your changes, review them and manually check that these were actually applied to DR. If you are very strict in your change management processes, the “white box” approach is extremely effective. If you have the suspicion that change management is not tracked that strictly and some of the changes are not documented well, then you go to a “black box” approach. That means you don’t assume anything. You just create some sort of a dump of your configuration at both ends and compare the two. It can be tedious but it is the most rigorous and comprehensive approach.
 
Question # 4: We have quite a lot of questions about how often IT operations should exercise DR plans.
 
Answer: The more often, the better. If you recall the diagram showing what percentage of the time you are covered versus what percentage of the time you are uncovered, then theoretically if you test every day you are best off. But that is not practical. The average is once a year in most organizations. There is some tendency to go to twice a year. My suggestion is that if you are testing at those frequencies you should be conducting at least a monthly audit in between tests. Then if you have some configuration drift in between tests, you’ll be able to catch some of it and your overall preparedness will be much higher.
 
Question # 5: What are the advantages of storage-based replication as compared to database and host-based mechanisms?
Answer: We see a trend in which many customers who have been using storage-based replication are now seriously considering moving portions of their data center to database replication. When you replicate using your storage platform, especially if you’re using platforms from large vendors, then you enjoy the benefits of having a unified mechanism for replication. When you start going to database replication you’ll find that you need to create practices for Oracle and SQL Server at the minimum. You’ll probably end up with multiple solutions from multiple vendors for host-based replication, as there is no single solution available that covers all platforms. One of the main advantages of storage-based replication is simplicity and better control, so it’s quite effective. When you want to keep point-in-time copies then consider using snapshots as opposed to actual copies. When you start using point-in-time copies it may become somewhat complex to use storage-based replication and keep track of all the replicas. Today vendors are trying to create tools to help you do that. In this particular area, databases excel because the underlying principle in replicating databases is that you actually keep a full copy of your archives. That’s the gist of most database replication mechanisms. It means that you can roll back to any point in time, so you have in a sense a continuous set of point-in-time copies. That’s assuming you configure your database replication correctly. It is more difficult to manage. You have fewer management tools to control replication across dozens or hundreds of servers. You should consider the trade-offs, and actually some of our clients combine both approaches.