Why your HA/DR Systems Will Fail and How to Make Sure They Won't

 

Note: This is a transcript of a webinar that was originally presented on May 14, 2009.

 

Speaker: Christine Taylor, Analyst, Taneja Group

 What is corporate disaster recovery? 

Corporate disaster recovery is a big, serious business; people spend a lot of money on it. In my IT life I was in charge of disaster recovery planning for a division of Avery Dennison. Fortunately nothing happened on my watch. We were ready, or we hoped we were. We had a plan on paper, the secondary systems were in place, and we had good corporate support. But when it came right down to it, although we tested as much as we could, there was really no way to run periodic and adequate tests of the DR environment. It was too disruptive and too expensive, so a lot of it was guesswork that it would work if something were to happen. As they say, “We prepare for the worst and hope for the best,” but we just couldn’t be sure. The corporate headquarters and the corporate data center, five miles away from us, had exactly the same problem. Now, this experience was a few years ago, but even today it is still common.

 

Disaster recovery is important. Companies invest in backup, they invest in replication, in snapshots, failover, and more, all hoping that if a disaster does occur the system will be back up within an acceptable amount of time. Remember: it’s not just the data but also the systems that the data feeds into. What is an acceptable amount of time? That depends on a lot of things, including the priority of the given application and its data, but a common Recovery Time Objective, or RTO, is just four hours for the most critical applications. Not everything has to come up within that amount of time, of course. Other apps and data can follow in priority order, but for the most critical and urgent applications you are looking at four hours before incurring major damage to the business.

 

The best and most common ways to protect these systems are two-fold. First of all (and of course this is Backup 101) there is backing up and, increasingly, replicating the data so that you can recover data from a current, specific recovery point. The second part is the secondary servers that can take over from the primary servers in case of disaster. This might be a local cluster, a secondary hot site, or anything in between, but what you are talking about is a system that is supposed to reflect all the applications, all the data, and all the configuration on the primary server so that it can transparently fail over if you lose the primary. But here is what happens over time, and remember I came out of IT: IT makes changes to the primary environment. Everyone does. You’ve got patches, upgrades, new versions, new applications, bigger and better hardware. The more important and obvious changes are usually made on the secondary hosts as well, because people remember to do them, and with clusters that is very common anyway. But the smaller changes, especially where the secondary hosts are geographically distant, are often forgotten. You are also usually talking about two IT teams that are a couple of hundred miles apart. We’ve come to call what happens “configuration drift”.

 

 

 

Clearly the process of managing change to these systems is key to being able to track those changes and to make sure that the right changes are made at the right time to both systems. However, it’s harder than it sounds. There are multiple teams, and if you are at dispersed sites that’s an added challenge. You really need to test any sort of change management, or even to know what is not being changed. You have to run regular DR tests to spot configuration gaps. That’s hard and expensive, and the more complex your environment, the worse it gets.

 Let’s take a closer look at three challenges I have already mentioned: DR testing, configuration drift and being able to fix or mitigate those gaps.

 

Challenge #1:  DR testing

 

It sounds good on the surface to say “just test more often” if you know that you’ve got configuration gaps and you know they are threatening your availability. But of course it’s expensive and disruptive. Full DR testing requires downing systems or, if it’s a production system which you can’t bring down, you need to clone it and test the clone. For 24/7 systems you either clone or you don’t test at all. Even when you can do it, you run the risk of not being able to bring the production system back up again in a reasonable amount of time. Not to mention you’ve lost transactions, you’ve lost work, and you have the IT overhead of testing in the first place. So what happens is that companies, even when they are spending a huge amount of money to protect their data, especially in high availability environments, are likely to run localized tests when they are deploying their software and equipment. Or, if they have mirrored or are replicating to a secondary system, then they’ll test that. But they frequently will not test again afterwards because it’s so expensive and time consuming. But then you can’t trust it. Of course, the more critical the data, the more expensive it is to test, and the less likely you are to be able to restore the very thing that you have got to restore faster than anything else.

 

 Challenge #2: configuration drift 

 

 

As we’ve discussed, IT commonly makes ongoing changes to primary systems. Some are big upgrades, some are minor changes. It’s really the minor ones that create problems because they are more frequent and more numerous. It gets really subtle sometimes, but these things add up when they are not reflected in the secondary systems. Over time, the drift between systems grows to the point that the entire DR process is threatened. Without consistent DR testing, configuration drift can result in a 75 percent chance of failure within a year’s time. And of course, because most of the money is spent on the most complex systems, these are the most important applications, the ones you’ve got to get back.

 Challenge #3: Closing the Gaps

 

DRM to the Rescue

 

Now, this is all very depressing, and IT deals with it every day! So let’s feel better about this. I’m going to define Disaster Recovery Management for you and then talk about how we really believe DRM, and particularly DRM that can handle multi-vendor environments (in fact that is exactly what you should be looking at), can really help to turn this state of affairs around. DRM uses technology to automate DR testing and change management in DR environments. The more complex the environment, the more useful DRM becomes. DRM is not the same thing as Data Protection Management. They are commonly confused because they sound alike (DRM, DPM) and they both operate as part of a disaster protection strategy. But DPM is really the technology class in your backup plan. You need that and it is necessary, but DRM makes certain that replicated or clustered configurations are in fact mirroring each other. It is different. DRM automates testing and risk mitigation in replicated environments. There are certainly vendors that offer replication testing for their own software, but that’s not going to help you in a multi-vendor environment, which many of us still have today.

 

Typical DRM Workflow

 

Let’s take a look at what the typical DRM workflow would be. This slide tracks it from the beginning, when you first install it in your replication environment. The DRM product builds a baseline topology map. It starts by scanning the disaster recovery environment on both sides, whether it’s local servers or multiple sites. It scans the environment to collect configuration data and to perform dependency analysis (now this is a very big deal) from storage, servers, and databases. It must use its own dynamic knowledgebase of dependencies and configuration issues because it is dealing with multi-vendor environments. The product then builds the detailed topology map and uses it as a baseline for any further changes and mitigation. In cases of configuration drift or gaps, the DRM product should alert administrators with actionable directions, a very clear focus on exactly what is wrong, and suggestions for how to fix it. This can take care of nine out of ten problems right out of the gate because IT knows what the problem is and knows how to fix it. This is not something you want to do once and then not do again for another year, like manual testing. You want to set the DRM to run at set intervals and on demand, for example if you are adding new systems or changing configurations yourself.
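
The compare-and-alert step of this workflow can be illustrated with a minimal Python sketch. It assumes the scan has already produced a flat dictionary of settings for each side of a primary/secondary pair. The names `Gap`, `find_gaps`, and `format_alert` are hypothetical and do not come from any DRM product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gap:
    """One setting that differs between a primary host and its secondary."""
    host_pair: str
    setting: str
    primary: object
    secondary: object


def find_gaps(pair_name, primary_cfg, secondary_cfg):
    """Return one Gap per setting that differs or is missing on either side."""
    gaps = []
    for key in sorted(set(primary_cfg) | set(secondary_cfg)):
        p = primary_cfg.get(key)      # None when the setting is absent
        s = secondary_cfg.get(key)
        if p != s:
            gaps.append(Gap(pair_name, key, p, s))
    return gaps


def format_alert(gap):
    """Turn a gap into a short, actionable message for an administrator."""
    return (f"[{gap.host_pair}] '{gap.setting}' differs: "
            f"primary={gap.primary!r}, secondary={gap.secondary!r}; "
            f"align the secondary with the primary or record an exception.")


# Hypothetical scan results for one host pair.
primary = {"os_patch": "2009-04", "db_version": "10.2.0.4", "max_open_files": 65536}
secondary = {"os_patch": "2009-01", "db_version": "10.2.0.4"}
for gap in find_gaps("db01/db01-dr", primary, secondary):
    print(format_alert(gap))
```

Running the same comparison on a schedule, and again after any planned change, is what turns a one-off audit into the ongoing baseline monitoring described above.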

 

Typical DRM Scenario

 

Now, let’s put this in the real world and take a look at a typical DRM scenario. This is a true story, although I can’t name the company. A lot of you will recognize yourselves in this. This is a large financial institution, and they’ve got to protect their data and their customer data. They also have to be in compliance. That is very difficult to do without regular testing. They’d managed it anyway, tested as much as they could, and had very large replication schemes in place. They really felt that they had it under control, but they decided to deploy a trial run of a particular DRM package in their multi-vendor infrastructure. They had the DRM product going for 48 hours. That’s how long it took to build the configuration and dependency maps, the topology map and baseline, and then discover and report on the configuration gaps. The company was shocked (and these are very sophisticated IT people with a very large data protection budget) at what was discovered. There were dozens of configuration gaps, a good two dozen of which would have kept them from failing over in the event of a disaster. By using the analysis, they were also able to mitigate the gaps quickly, even the serious ones, because the DRM product told them exactly where the problems were and gave them suggestions for best practices and fixes between the vendors. They adopted the DRM application and it now runs automatically at scheduled intervals. That particular DRM package also tracks and audits activities so the institution can prove they are in compliance with governance requirements and with the regulations that cover data protection at financial institutions.

 

 

 

Let me put this into context. We did a number of interviews with IT and DR managers at large corporations and asked for their reactions to this type of technology. All of them understood that it would be a very useful thing, but they were concerned because they perceived further complexity in their environment from introducing the DRM package. What we told them is that, from our experience, that is not true. The DRM, although it is additional data protection software, actually lessens complexity a good deal and certainly lessens risk because it can automate the process of keeping a completely consistent replication tree. It automates what they might have been skipping or spending a great deal of money on in disaster recovery. It greatly simplifies interactions and configurations between primary and secondary systems, and it finally replaces manual processes with nondisruptive gap mitigation. So we believe in DRM. It’s a relatively new technology class, but it is a growing one. When you are looking into DRM, do not be limited to a single vendor’s product for a single replication product. Look into a DRM package that offers automatic testing, baseline monitoring, SLA monitoring, and compliance monitoring, and does it in complex multi-vendor environments. You should give DRM a serious look.

And now I’ll turn this back over to Doron Pinhas from Continuity Software.

 

Speaker: Doron Pinhas, Vice President, Continuity Software

First, just a few words about Continuity Software. We have been here since 2004. We focus purely on helping large organizations manage their HA and DR environments with unique, award-winning technology. I’m going to be giving you some examples of the typical issues our customers encounter.

 

 

 

We all build our HA/DR environments to make sure no data will be lost, and so that whenever a downtime event happens we’ll be able to recover. To do that we build everything from redundant data centers to local clustering and remote data replication. These all require a lot of effort just to set up, and once they are done, problems will still occur.

In this slide, I’ll give you some perspective of what types of technology layers form your DR/HA solution and why the problems can be so intense.

 

 

When we build high availability and DR solutions we need to address multiple layers of technology. There is the storage layer, SAN and NAS, and the storage access networking, which can be fiber-based, IP-based, iSCSI, and so forth. We have the host layer, which contains all the things that run our databases and applications, and multiple other technology areas that need to be taken care of. Beginning at the OS, we have various layers like kernel parameters and network settings, and then databases, which have their own layers, and so on. And finally we have to deal with virtualization and cluster technologies and make sure we set up replication on the different layers and, as Christine has suggested, the complexity increases when you have a multi-vendor environment. Even if you have a single storage vendor you most likely also have some database replication, so it can be quite complex. On top of that, let’s look at the individuals who manage those different layers. You have multiple silos within the organization. The storage admins take care of the storage area and the network, perhaps even addressing some HBA issues with the hosts, and so on. DBAs are at a different level. You have network admins, system admins, and others. Poor coordination and miscommunication between all those entities might result in a setup which is not fully consistent.

 

  What can go wrong?  

 

Now I’ll show you some examples of what can go wrong, to give you an idea of how easy it is for things to slip by. The first problem is a classic case in which you believe you are replicating important application data but some of it is not actually replicated. Either it was forgotten or not understood, or it was actually configured but something went wrong. Since you can’t test it, you now have just a partial copy.
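
As a rough illustration of how a partial copy can be detected, the Python sketch below compares the volumes each application depends on against the volumes that actually have a replica. The function name `unreplicated_volumes` and the input layout are assumptions made for this example. In practice both inventories would have to be gathered from the hosts and the storage arrays.

```python
def unreplicated_volumes(app_volumes, replicated_volumes):
    """Map each application to the volumes it uses that have no replica.

    app_volumes:        dict of application name -> iterable of volume names
    replicated_volumes: iterable of volume names that have a replica
    """
    replicated = set(replicated_volumes)
    missing = {}
    for app, volumes in app_volumes.items():
        gap = sorted(set(volumes) - replicated)
        if gap:
            missing[app] = gap
    return missing
```

An empty result means every volume each application depends on has a replica configured; it says nothing about whether that replica is current or consistent.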

 

 

 

This slide shows several examples of discrepancies between cluster nodes or between primary and standby nodes. These can be insufficient I/O capacity (something that is hard to test), incompatible software versions, incompatible network configurations, insufficient memory and storage resources, and so on. Any of these differences may result in your DR systems and cluster systems not being able to work when needed.
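
These discrepancies fall into two kinds: resources where the standby must have at least as much as the primary, and settings that must match exactly. A minimal Python sketch of that distinction follows. The key names (`memory_gb`, `io_mb_per_s`, and so on) and the function `standby_discrepancies` are hypothetical choices for illustration.

```python
# Hypothetical setting names: capacities must not be smaller on the standby,
# versions must be identical on both nodes.
CAPACITY_KEYS = ("memory_gb", "storage_gb", "io_mb_per_s")
EXACT_KEYS = ("db_version", "os_version")


def standby_discrepancies(primary, standby):
    """Return human-readable problems that could keep the standby from taking over."""
    problems = []
    for key in CAPACITY_KEYS:
        have, need = standby.get(key, 0), primary.get(key, 0)
        if have < need:
            problems.append(f"{key}: standby has {have}, primary has {need}")
    for key in EXACT_KEYS:
        if standby.get(key) != primary.get(key):
            problems.append(
                f"{key}: standby {standby.get(key)!r} != primary {primary.get(key)!r}")
    return problems
```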

 

 

 

 This slide shows what can happen when an environment has grown and new devices were added and replicated. In this case, what went wrong is that the new devices were not part of the same storage consistency group.  And now you have a replication that seems to work and your storage management system will report that everything is replicated while in fact the copies are not consistent because they are not part of the same consistency group. This can happen in any vendor environment. 
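
The consistency group problem can be expressed as a simple rule: all replicated devices belonging to one application should map to exactly one group. The Python sketch below checks that rule. The function name and the two input mappings are assumptions for illustration, not the interface of any storage vendor’s tooling.

```python
def consistency_group_violations(app_devices, device_group):
    """Report applications whose devices span several consistency groups, or none.

    app_devices:  dict of application name -> iterable of device names
    device_group: dict of device name -> consistency group name
    """
    violations = {}
    for app, devices in app_devices.items():
        groups = {device_group.get(dev) for dev in devices}
        # More than one group, or a device with no group at all, breaks consistency.
        if len(groups) > 1 or None in groups:
            violations[app] = {dev: device_group.get(dev) for dev in sorted(devices)}
    return violations
```

Each reported entry lists every device of the affected application next to its group, so the newly added device that landed in the wrong group (or in no group) stands out.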

 

 

 

This happens especially in complex SAN environments that have hundreds of ports or more. You are bound to have several hosts that see storage they shouldn’t. That actually happens very easily. There are dozens of scenarios that will lead to a server somewhere out there being able to see either your production data or your replicas, and it may corrupt them at any minute. This can go unnoticed until it’s much too late, and usually this is something that won’t be revealed in DR testing.
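
One hedged way to picture this check in Python: compare what each host can actually see against what it is supposed to see. The function `unauthorized_visibility` and both inventories are hypothetical. A real check would have to collect visibility from the hosts and the intended mapping from zoning and masking records.

```python
def unauthorized_visibility(visible, authorized):
    """Return host -> sorted list of LUNs the host can see but is not meant to.

    visible:    dict of host -> iterable of LUNs the host currently sees
    authorized: dict of host -> iterable of LUNs the host is meant to see
    """
    findings = {}
    for host, luns in visible.items():
        extra = sorted(set(luns) - set(authorized.get(host, ())))
        if extra:
            findings[host] = extra
    return findings
```

A host that does not appear in the authorized mapping at all is treated as having no permitted storage, so everything it sees is reported.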

Here is another issue that can happen as a result of miscommunication or wrong setup.

 

 

A wrong network dependency can arise from using a certain network-attached storage resource at the primary site and copying this very same configuration to your standby systems. The correct configuration, of course, is to have a copy, or replica, of the service at your DR site as well, and use it instead.

 

 

 

Many DR tests will fail to show you the error. But when a real disaster, such as a power outage, strikes your primary data center, your standbys will try in vain to access those very same failed resources. Now you’ll have to figure out whether a correct network resource really exists (in which case it’s only downtime) or does not (which might also result in data loss).
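
A cross-site dependency of this kind can be found by looking at which network resources the standby hosts are configured to use and where those resources live. The Python sketch below assumes a simple mapping from each resource to its site. The function name, the site labels, and the choice to also flag resources of unknown location are all assumptions made for this illustration.

```python
def cross_site_dependencies(standby_mounts, resource_site, primary_site="primary"):
    """List (host, resource, site) for standby hosts that depend on primary-site resources.

    standby_mounts: dict of standby host -> list of network resources it uses
    resource_site:  dict of resource -> name of the site that hosts it
    Resources whose site is not recorded are reported as "unknown" so they get reviewed.
    """
    findings = []
    for host, resources in sorted(standby_mounts.items()):
        for res in resources:
            site = resource_site.get(res, "unknown")
            if site == primary_site or site == "unknown":
                findings.append((host, res, site))
    return findings
```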

 

 

 

 

 

 

 

The last example: point-in-time copies. Many of you will retain, in addition to the synchronous replica, point-in-time copies just in case you have a rolling disaster or a logical corruption, meaning that your data goes bad on your production site and on your DR site at just the same time because you are synchronously replicating. To remedy that, people will sometimes keep additional snapshots, clones, BCVs, or true copies, depending on the vendor. But these are never tested, and what we’ve found, in many environments, is that these copies are not complete or not consistent. This is something you should take care of.
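
A rough Python sketch of checking a set of point-in-time copies for completeness follows. It also uses the spread between copy timestamps as a crude proxy for consistency. That proxy is an assumption of this example only: real consistency depends on how the storage system takes the copies. The function name and the five-second tolerance are likewise hypothetical.

```python
from datetime import datetime, timedelta


def check_snapshot_set(required_devices, snapshots, max_skew=timedelta(seconds=5)):
    """Check that every required device has a copy and that the copies were taken together.

    required_devices: iterable of device names the application needs
    snapshots:        dict of device name -> datetime its copy was taken
    """
    required = list(required_devices)
    missing = sorted(set(required) - set(snapshots))
    times = [snapshots[d] for d in required if d in snapshots]
    skew = max(times) - min(times) if times else timedelta(0)
    return {
        "missing": missing,                                   # devices with no copy at all
        "skew": skew,                                         # spread between oldest and newest copy
        "consistent": not missing and skew <= max_skew,       # proxy only, see the note above
    }
```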

How do we address all of those issues?

 

 

We have a product called RecoverGuard. It is a totally agent-less technology, which means we can set it up in about 24 hours, and the next day you’ll have a full report showing how your data center is built, how it is configured, the dependencies, and the risks. We show you the risks by utilizing a unique technology. I believe we are not only the pioneer but probably the only company with this particular approach. We have a signature knowledgebase that is actually a huge repository of human knowledge which we packed into the software. Today it contains more than 2800 unique scenarios that our customers have encountered, scenarios like those we just discussed in the previous slides. The system knows how to detect any of those scenarios. After creating the topology mapping and dependency mapping (and I believe that we fully meet the guidelines Christine has suggested, perhaps even more) we will make sure that your environment is free of those risks. In the event that one of those 2800+ scenarios exists in your environment, you’ll see a very clear indication via an alert and receive full support to solve the issue.

 

 

We cover problems within data protection and availability management. I won’t go into all the details here, but RecoverGuard goes very deep into all related data protection technologies – replication, SLA management, best practices, clustering as it relates to availability, and so forth.

 

 

RecoverGuard provides a dashboard that will show you your business applications, which you can see on the right side of the slide. It shows you whether any of these applications are at risk, meaning: Is there any chance that your production data or copies or DR copies or clustered data could become corrupt or incomplete? Is there any chance that your standby cluster nodes or your remote DR servers are not configured perfectly in order to assume the role of the production servers? It will also show you if you have optimization opportunities. That’s a unique category that we offer which other vendors do not. You can see whether the practices you are deploying are optimal, and in many cases we’ll find that there are issues that a customer needs to fix. Sometimes their solution would actually work but it could be performed in a much more efficient way, so RecoverGuard suggests that as well. Many of our customers can testify to saving huge amounts of bandwidth and storage just with these particular features.

 

 

 

Finally, we have a very comprehensive model that will manage SLA compliance, meaning: do you have your RPOs in place, and can you meet your RPOs? Whenever a risk is detected, you can zoom in from the dashboard and get a very detailed report for that individual problem. It shows you not only how the environment looks and how the topology relates, but also many details at the various levels, like storage and operating system parameters, how to fix the issue, the resolution, the history, etc.
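
The RPO question comes down to simple arithmetic: the age of the newest usable copy must not exceed the recovery point objective. A minimal Python sketch of that check is shown here. The function name `rpo_violations` and the input mappings are hypothetical and are not RecoverGuard’s model.

```python
from datetime import datetime, timedelta


def rpo_violations(last_replica_time, rpo, now):
    """Return application -> current lag for every application that misses its RPO.

    last_replica_time: dict of application -> datetime of its newest usable copy
    rpo:               dict of application -> timedelta (maximum tolerable data loss)
    now:               datetime to measure against
    An application with an RPO but no recorded copy is reported with a lag of None.
    """
    violations = {}
    for app, target in rpo.items():
        last = last_replica_time.get(app)
        lag = None if last is None else now - last
        if lag is None or lag > target:
            violations[app] = lag
    return violations
```

For example, an application with a one-hour RPO whose newest copy is three hours old would be reported with a lag of three hours.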

 

 

On top of that, there is also a very rich set of reporting that will allow you to optimize your storage, verify that you are compliant with your data protection SLAs, allow you to compare  the configuration of your primary and secondary sites right down to the single server level, and many other features.

With that, I’ll end my presentation and we’ll get to your questions.

 

Question: One of the participants is using backup technologies. Essentially they are backing up everything including bare metal recovery. The plan is to bring up the secondary server, restore everything and start working. Why do they need a separate DRM solution?

Christine: That’s a good question. That’s something a lot of IT people wonder because they’ve already put a lot of time and effort and money into this. If you have a relatively simple environment, you have a single-vendor backup, you’ve got maybe a cluster using rather simple commodity servers, you probably won’t need DRM for that. But what I find is that a lot of IT people don’t look that far beneath the covers to see what they need to restore, because if you’ve got any complexity at all you need to ask, first of all, what kind of applications are going to have to come up first? And do you really have the secondary servers in place, or are you going to have to buy them? If you are, and it comes from a bare metal restore, then how far is it going to restore? Is it going to restore just the initial version of the applications, for example, in which case will your restored data read into it or not? If you have replication or snapshots you’re dealing with point-in-time recovery. Are you quite sure that you are going from the right point in time? Even mid-sized businesses now run hundreds of applications, and a pretty good average is that 15 to 20 percent of them are critical, meaning that if they don’t restore within generally 4 to 24 hours, then you are going to have a very serious business impact. Even in a simple environment you have got to be doing consistent DR testing or changes will present problems. The more complex the environment, the more you need to do that and the greater your need to automate the whole process, using something like disaster recovery management software.

 

 

Question: This is from a customer that is using NetApp extensively. They are deploying SnapMirror and FlexClone and they are feeling pretty confident that everything is replicated. The question is: “Should I use a DRM solution in such an environment?”

Christine: A lot of it really depends on the relative simplicity of the environment and which tools you are using. If you are only using, say, FlexClone for NetApp, or if you are an EMC shop only using local clusters and you are using the EMC testing for its own replication, you are probably ok. It’s frankly not ideal, because you still do have drift, but you are probably ok, especially if you do some periodic DR testing. But you have to remember, you are not only talking about being able to restore the data, you are also talking about needing to rebuild or replace the filers. Here’s the thing: if you don’t have a secondary setup all ready and you’re going to have to bring the filers up from bare metal, then you might have a problem when you are talking about very quick recovery time objectives. You have to not only bring the servers up, you have to reinstall the applications, you have to reinstall the proper versions, and you have to restore the data within a certain time period, to a certain point in time. Even though this is a Continuity webcast I’m always very careful to keep my comments vendor-neutral. But in all honesty, one of the reasons that we like Continuity is that it simplifies these questions for you, especially when it can feed up to management frameworks. There are so many questions around properly bringing something up in the event of a disaster, especially when you are talking about very critical data. That’s the last time in the world that you want to find out that you were only getting rather shallow and very broad DR test results back. You need something that will be doing it for you all along and, even more importantly, has been mitigating these problems for you all along. That’s what a consistently running, scheduled DRM can do for you.

 

Question: There is a question about the support matrix of RecoverGuard, especially as it relates to clustering and IBM storage.

Doron: We are vendor-neutral and we support multiple environments. Among storage vendors we support the entire product range for EMC, Hitachi, and NetApp. IBM support will be ready in about three months. We support all the open operating systems, including HP-UX, Solaris, Linux, and all commercially available flavors of Windows. We support clustering as well, including IBM and Veritas clusters. We also fully cover all the major databases, including Oracle, Sybase, SQL Server, and UDB.

 

 

Question: In your best judgment, who is the appropriate owner of the DR process? And what do you see as you work with customers?

 

Christine: That really depends on how centralized IT is. It also depends on the size of the company. The more centralized you are, the fewer owners you are going to have. You may not actually have a DR manager, but you are going to have someone who is responsible for that function. It might be the CTO, it might be the data center manager. At larger corporations, whether or not they are centralized, you are probably going to have a disaster recovery manager or an IT staff member or manager whose primary work is DR. It depends on complexity, it depends on the size of the organization, it depends on how centralized they are. In general, you must have someone who is responsible for documenting and running DR testing for critical systems. Usually that’s at the corporate data center. The regional data centers can also have critical information. This particular manager either needs to be responsible for those regional centers, or work as a team with them to make certain that everyone, in all the centers, will be able to properly restore in the way we’ve been talking about in the event of a disaster.

 

 Last question:  I’m using VMware for DR purposes. Wouldn’t that help me manage and control configuration drift?

 

 

Doron: The answer is yes and no. Let me explain that cryptic answer. We at Continuity Software are great fans of virtualization in general, and VMware in particular. There is a great suite of tools out there to help you better manage your environment, and it will especially help you alleviate some of the potential gaps that can happen in server configuration, because that is virtualized and easily replicated. But virtualization is yet another layer in your data center. You still have to manage your storage and the ESX infrastructure for networking, not to mention that each individual VM has its own personalized configuration and performance characteristics. You need to make sure all of these match. While virtualization can help, and we recommend doing that, there are other layers to take care of. Another important distinction is that virtualization technologies in general have great promise, but companies have not yet developed the required skill set to feel confident they can resolve the potential issues and configuration risks that can happen in those environments. We’ve seen many customers complaining, at least in the first two or three years of deploying virtualization, that every once in a while they experience issues that they cannot really explain. Should it happen in a physical environment, they have all the knowledge and best practices to just isolate the problems. As the environment becomes highly virtualized there seems to be a gap in knowledge. This is another area RecoverGuard will address, with hundreds of best practices which will allow you to make sure you are doing the right things. Even if you are using VMware, and that’s a good trend, you will still need to see all the data center layers, from the database to performance to networking to storage and replication. In order to get that comprehensive view, you will need a DRM tool, and RecoverGuard is a great solution for that.

 

 
