Putting Disaster Recovery to the Test

In this IT Link podcast hosted by Mike Vizard, Continuity Software CEO Gil Hecht explains why an inability to test disaster recovery plans usually leads to cataclysmic failure.

Mike: Today we are going to be talking about disaster recovery with Gil Hecht, who is the founder and CEO of a company called Continuity Software.

Gil, we hear a lot about disaster recovery but it seems to me it’s a lot of disaster and not much in the way of recovery. It almost feels like the entire process is kind of fundamentally flawed. Is there a different way of thinking about disaster recovery today, especially when we live in an age where sustainability and business continuity are becoming bigger boardroom issues?

Gil: You are absolutely right. Companies are moving more and more to a 24x7 environment. It is expected from the market and from customers that businesses are always going to be online. Historically, systems were not meant to be always online. There were all kinds of issues and then the world started evolving, creating two industries: high availability and disaster recovery. If you look at how people do disaster recovery and high availability, they have a production environment, a high availability environment and a disaster recovery environment. These are in three different locations, with three different sets of servers. Obviously, systems in production always change. Changes are made left and right, IT adds things and removes things all the time. That’s actually why we need IT people, in order to make those changes. Every time they make a change, they have to go to the high availability environment and make the same change there, and then they have to go to the disaster recovery environment and make the same change there. The problem is they can’t turn on those systems to make sure they work. When they are tested, usually once a year, and nothing works, they have an issue. Or even worse, when they have a real disaster, which could be caused by anything from a virus to a failure, even a local one and not necessarily a failure on a large scale, and it doesn’t work, they have an issue. You start to see cases where companies have major availability issues and major data loss issues. You see it all over the news.

Mike: Are these two skill sets coming together and if so, how is that going to manifest itself?

Gil: Most companies are trying to create one strategy to cover both and you can see it going in all kinds of directions. Sometimes the high availability and disaster recovery systems are actually one and the same, where you use the secondary site for both high availability and disaster recovery. In other cases, you see that high availability is done locally in the same data center whereas disaster recovery comes from a different site altogether with different characteristics. There are many reasons to do either of those, perhaps because of costs, or the distances that are required by law, or how many transactions you are willing to lose or not lose, etc.

Mike: Traditionally, we’ve framed this whole approach from an IT perspective. It started with storage and worked its way up to the servers. More and more, you hear people talking about more of a top-down modeling environment. They are figuring out what the business processes are, making those redundant, and then making the systems underneath them mutually supportive of each other so if something happens, one will pick up for the other. Are we getting to a new level of sophistication around disaster recovery in terms of planning and execution? 

Gil: We have seen companies trying to do all kinds of things, such as trying to break the business services apart and trying to provide services from multiple service providers, etc. It is becoming clear that companies are investing more and more in this space. While it is quite a few years after 9/11, I guess it is still driving some of this activity. We have seen companies starting to deploy testing of all kinds for disaster recovery and high availability. We also see them trying to do more manual testing at the same time, so I think that is another level of development for this industry.

Mike: Is there a more intelligent way to approach the testing side of this equation? That seems to be where things fall down. Everybody has a plan, and every time I look at somebody’s top ten agenda, disaster recovery is on it. Then I hear about the disaster that resulted from the original disaster and it turns out that some part of the system wasn’t capable of supporting the environment so the recovery time ends up being days instead of hours.  

Gil: Absolutely. We hear this all the time as we are operating in the field. We started Continuity Software to address the exact problem that you are mentioning: the problem of not being able to test. Continuity Software has a product, RecoverGuard, which can look at the whole environment and find problems that will impact recoverability, whether it is for high availability or disaster recovery. You know about the issues immediately as they happen, as opposed to once a year when you do a test or when you have a disaster, which is typically too late because you’ve already lost data and customers. That is one approach. There are other potential approaches as well. We have seen people trying to solve it with virtualization and other methods but regardless of the methods you use, you are going to have to eventually use a testing system like RecoverGuard, or one of its competitors.

Mike: How does this system work? Is it something I deploy on premises or is it something that you deliver like a managed service?

Gil: The answer is yes to both questions. Typically, customers use it as a monitoring system that they install in their environment. To start the process, we say to the customer, “Choose any environment you want, let us in for 48 hours.” We literally come with a CD and put it on one server. The server scans the environment without any agents and it immediately shows the customer a comprehensive assessment of all the issues that were found in the environment. How is this done? The product has a very large database with over 2,000 different signatures of known problems that can happen in disaster recovery and HA environments. When we scan the environment, the system is literally checking for occurrences of those issues and can show the results to the customer immediately. That’s how the system really finds the problems. Fully 100 percent of the environments we have scanned have had critical issues uncovered. In most cases the customer is either pretty shocked or was expecting to see issues, but not such material problems. They inevitably end up buying RecoverGuard.
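[Editor's illustration] The signature-based scan described above can be pictured as a list of known problem patterns, each checked against a snapshot of the environment. The following is only a minimal sketch of that idea, not RecoverGuard's actual implementation: the `Signature` class, the `scan` function, the two example signatures and the snapshot fields are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Signature:
    """A known problem pattern (hypothetical structure, for illustration)."""
    name: str
    matches: Callable[[dict], bool]  # returns True if the problem is present


def scan(environment: dict, signatures: list) -> list:
    """Return the names of all signatures that match the environment snapshot."""
    return [s.name for s in signatures if s.matches(environment)]


# Two made-up signatures; a real catalogue would hold thousands.
SIGNATURES = [
    Signature(
        "unreplicated-volume",
        lambda env: bool(set(env["db_volumes"]) - set(env["replicated_volumes"])),
    ),
    Signature(
        "dr-host-os-mismatch",
        lambda env: env["prod_os"] != env["dr_os"],
    ),
]

# A made-up snapshot: three database volumes, only two of them replicated.
snapshot = {
    "db_volumes": ["vol1", "vol2", "vol3"],
    "replicated_volumes": ["vol1", "vol2"],
    "prod_os": "linux-5.4",
    "dr_os": "linux-5.4",
}

findings = scan(snapshot, SIGNATURES)
print(findings)
```

Keeping each signature as a small independent check means new known problems can be added to the catalogue without changing the scanning loop.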

Some customers tell us that they don’t have the high-end professionals in disaster recovery and high availability that can really use the scan results to solve the discovered problems. In those cases we can offer our managed service. It’s the exact same system except that the scan results are sent automatically to our secure NOC. Then, our experts in business continuity and disaster recovery/high availability review the results immediately and if there is a critical issue in the environment the customer will be alerted. The customer can be confident knowing that the environment is always being monitored, that the disaster recovery and high availability systems will work as expected, and that experts are testing the systems 24x7.      

Mike: Business requirements are getting a lot more dynamic. Companies are responding to business issues that change frequently and the end result is that the system needs to be more flexible. Yet the disaster recovery plan is a little more static, so how do you close that gap?

Gil: Yes, we have seen more testers moving this way. There are service providers that offer this testing based on our tools. Hosting providers in large companies that allow you to host disaster recovery are starting to offer those testing services -- powered by our product -- to their customers. We also see risk officers in large enterprises driving this as well.

Mike: The end result is that I need to get to some kind of a continuous testing model around disaster recovery because doing it once a year just isn’t going to cut it anymore.

Gil: That’s exactly right.

Mike: How complex is it to set this up in terms of cost? What does it look like from a managed service point of view to do some kind of continuous testing?

Gil: It depends on the environment size. Prices start at about $50,000 USD annually. From a complexity standpoint, setup time is typically less than two or three hours. The system needs to scan the environment for just 24 to 48 hours and then it starts to give meaningful results and identify risks.

Mike: What are the most common mistakes that you see customers making when it comes to setting up their disaster recovery strategy?

Gil: I’ll give a very simple example that we see in 100 percent of the environments. Typically disaster recovery is done with replication technology. The replication of data can happen either in the storage layer or in the host layer or in the database. Let’s say you have a database instance using two volumes on a storage area network and those two volumes are replicated to a remote site. But one day the DBA runs out of space. He asks the storage administrator if he can receive a little more space and he is given another volume. The DBA adds the volume to the database but neither the storage administrator nor the DBA knows that they need to add it to the DR site because they forgot that this is a protected database. Suddenly you have three volumes which contain the database information in production but only two volumes that are replicated on the DR site. If one day you have a failure, you will find that you only have two-thirds of the data but since the data is striped you have 100 percent data loss, which is a terrible situation for any organization. That’s just one very simple example of a problem that can happen. I can go on and on with many other examples. As I mentioned earlier, we’ve identified over 2,000 risks so far. We’ve listed many of these on our website. People are welcome to take a look by going to www.continuitysoftware.com/commongaps
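[Editor's illustration] The arithmetic behind "two-thirds of the data but 100 percent data loss" can be made concrete with a small sketch. It assumes simple round-robin striping, where consecutive chunks go to consecutive volumes; real databases and volume managers lay data out differently. The function names, the volume names and the chunk count are hypothetical.

```python
def stripe_layout(num_chunks: int, volumes: list) -> list:
    """Round-robin striping: chunk i lives on volumes[i % len(volumes)]."""
    return [volumes[i % len(volumes)] for i in range(num_chunks)]


def recovery_report(production: list, replicated: list, num_chunks: int = 12) -> dict:
    """Compare what production uses with what actually exists at the DR site."""
    replicated_set = set(replicated)
    layout = stripe_layout(num_chunks, production)
    surviving = [volume in replicated_set for volume in layout]

    # A stripe is one chunk from every production volume; it is only usable
    # if every one of its chunks made it to the DR site.
    width = len(production)
    stripes = [surviving[i:i + width] for i in range(0, num_chunks, width)]

    return {
        "missing_volumes": sorted(set(production) - replicated_set),
        "chunks_surviving": sum(surviving),
        "chunks_total": num_chunks,
        "intact_stripes": sum(all(stripe) for stripe in stripes),
    }


# The scenario from the interview: a third volume was added in production
# but never added to replication.
report = recovery_report(
    production=["vol1", "vol2", "vol3"],
    replicated=["vol1", "vol2"],
)
print(report)
```

In this sketch two-thirds of the chunks survive at the DR site, yet no stripe is complete, which is why the database cannot be reassembled. The `missing_volumes` comparison is the kind of check that could be run continuously rather than once a year.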

Mike: Doesn’t this make it a bit hard to plan, because with the number of problems and scenarios that I can run into, it becomes Murphy’s Law? It’s always the small things that I didn’t plan for that get me versus the big thing that I have planned for.

Gil: That’s right. Typically the design is flawless and people devote a lot of time to the high-level design. The problems happen in the small things that people do every day. It’s adding more storage, changing the configurations, changing parameters, replacing a server, retiring hardware. It’s the small things that happen every day that really take things out of sync and create a sort of configuration drift that must be resolved. The high-level design that people do once a year, or once when they build the DR, really is flawless in most cases.

Mike: Everybody has a great plan until they meet the enemy and then the plan falls apart, so you are going to have to plan for every contingency. I want to thank our guest for being on the show today and sharing his knowledge.
