Solving HA/DR Configuration Drift with RecoverGuard 4.0
Note: This is a transcript of a webinar that was originally presented in March 2009.
My name is Doron Pinhas. I’m the CTO of Continuity Software, and I will be leading you through this presentation and demonstration. Continuity Software was established in 2004, and since then we have built quite a distinguished list of customers, some of which you can see on the slide below.


Continuity Software focuses on helping IT organizations address two main risks that are really nightmares for IT staff. The first nightmare is loss of critical data. That’s one of the primary drivers of lost revenue, and many customers who suffer significant data loss will actually close within two years. That is most likely one of the greatest nightmares. The other is significant unplanned downtime. These are the two risks IT organizations struggle to prevent.

To do that, people set up redundant data centers with data replication between the two, standby systems ready to take over the role of the primary systems in case of any failure, data protection through point-in-time copies for immediate use, backup for archiving, and finally manual procedures for failover and clustering solutions. There are various products out there for that. We see many of our customers deploying local clustering solutions as well as remote or geo-clustering.
This involves significant capital expense, time, attention and skill to set up those systems in the first place, make them work as expected with the right level of performance, and iron out all the issues that may prevent the systems from working. But setting these up is just the first step. What we have learned together with many of our customers is that there is a major investment later on just to keep that system running.

Here is a diagram showing the anatomy of what happens once you have set up your systems and they are running. And of course new systems will be introduced all the time, so this challenge is constantly changing.

RecoverGuard is a software technology. One of its strengths is that it is totally agent-less, meaning that you set up this centralized management application and you can actually scan thousands of hosts and petabytes of storage without the need to deploy a single agent. By the way, a single server can scan and support very large data centers, but you can obviously scale up. Agent-less technology also means we are able to set up RecoverGuard in about 24 hours. Your first report on the state of the data center will be available immediately after those 24 hours.
We’ve chosen a very unique approach. We have a knowledgebase of potential vulnerabilities. This is quite a large repository of knowledge from hundreds of companies that we’ve been laboriously accumulating. We call these risks “risk signatures.” We talk to our customers and understand what configuration problems they have, which processes they have implemented imperfectly or not at all, and how these can impact recoverability, data protection and availability. Whenever we find a new risk, we model it into our knowledgebase. RecoverGuard will compare that knowledgebase to your environment each and every day, so whenever you make a change it can be almost immediately tested to make sure it doesn’t match any of these signatures that represent the experience of hundreds of large enterprises. This is a valuable, community-driven knowledgebase.
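To make the idea of a risk signature concrete, here is a minimal Python sketch. It is not RecoverGuard's implementation: the `Signature` class, the dictionary layout of the scanned environment, and the sample rule are all hypothetical, chosen only to illustrate comparing a library of known-bad patterns against a scanned configuration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Signature:
    """Hypothetical risk signature: a name plus a predicate over one host's config."""
    name: str
    matches: Callable[[Dict], bool]


def scan(environment: List[Dict], signatures: List[Signature]) -> List[str]:
    """Return one ticket string for every (host, signature) match."""
    tickets = []
    for host in environment:
        for sig in signatures:
            if sig.matches(host):
                tickets.append(f"{host['name']}: {sig.name}")
    return tickets


# Illustrative signature only: a production host with no standby defined.
signatures = [
    Signature(
        "production host has no standby",
        lambda h: h["role"] == "production" and not h.get("standby"),
    ),
]

# Made-up scan results for two hosts.
environment = [
    {"name": "mail01", "role": "production", "standby": "mail01-dr"},
    {"name": "db07", "role": "production", "standby": None},
]

tickets = scan(environment, signatures)
print(tickets)
```

In this toy model, adding a newly discovered risk means appending one more `Signature` to the list, and the daily scan picks it up automatically.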

In order to provide good coverage and give you a good understanding of whether you have issues in your environment and what the problems are, we have compiled two main areas of signature reporting and tools. To the left of the slide you can see the data protection areas that RecoverGuard has been focused on up through Version 3. You would be able to find any of these data protection risks in your data center, production, local clustering or DR site. What we have introduced with Version 4 is an expansion of this solution to provide complete coverage of all the risks in your data centers, including availability management. Not only can we assure that the data is correctly protected and provide a good view of how it is protected, but also whether your standby servers are ready for the task. We will look for data protection and consistency issues, and make sure that all of the critical assets are indeed replicated according to an SLA which matches the value of the data, in an efficient way with no waste. We have optimization analysis of your solution. We look into various technology areas, ranging from databases to the best practices of storage vendors and operating system vendors, load balancing, SAN, etc. As we look to availability management, we can help you make sure that you have standby systems which are configured correctly in terms of host configuration, meaning you have all the right software versions, operating systems, packages, parameters and network settings in your standby host, and whether these are clustered. You also have a wide array of best practice testing to make sure that your clusters are built correctly, that you can actually access the correct data, that you have enough redundancy and performance in place to take the load of your production systems should they fail, and finally some very interesting root cause analysis which I will demo now.
Product Demo
The next slide is the main screen of RecoverGuard. This is what you see once you have logged in.

In this screen you can see, at a glance, the health level of your entire IT environment, including all of your data centers. To the left you can see the scan status of your environment: what servers, databases and arrays were discovered recently, which of those were scanned properly, and any areas that RecoverGuard currently does not cover. On this screen you can also see some statistics about issues that were found. On the right side of this screen you can incorporate your business services, applications, or business divisions as well. This is done pretty easily. You can just drag and drop your critical resources into logical entities, which you can set any way you like, including hierarchies, to represent your business processes and main applications. This allows you to see the health metrics of those particular business services and whether there are any data protection risks to your data, either on production or on your backup copies, replicas and DR. It shows whether there are any risks to availability, meaning that you have standby systems and maybe the data is ok, but your standbys cannot really assume their role. You can also see whether the environment is optimized enough. If there are any optimization opportunities you can see those as well. And finally, it shows whether you can meet your corporate replication goals in terms of RPO. This is obviously a very compact representation of all this information. Let’s see what happens if we drill down into one of these areas. I will click on the “Data Risk” for “Corporate Email” under “Business Entities.”

RecoverGuard will immediately route me to the ticketing module (above). To the left you can see a navigation tree; the corporate email service is pre-selected and there are four data-related risks identified for the corporate email. If we pick the first one, a detailed technical ticket will appear.

The ticket describes the problem in a very compact way. Let’s click on “Open Topology.”

What we see here is that a file system is stored in an inconsistent way as it relates to replication. We have three devices comprising File System S and, as you can see, one is not replicated and the other two have complex replication structures. In this case it’s an EMC environment and you have SRDF replication. For the other two devices there are several point-in-time copies which you can click on to get more details if you’d like. In the ticket you can see that data is only partially replicated. If you take a closer look you can see that the replicated devices were probably provisioned at one point in time, because their numbers are consecutive, and the other one was probably added later. In a minute we’ll see what changed, and how. So this is a pretty clear way to understand the risks. Consider the fact that looking manually for all of those risks in your data center may take quite a significant amount of time. But once you are aware of the problem, all you need to do in order to fix it is expand this view and hand it to your storage admin, and it’s very clear how to solve the problem. The ticket will contain details about the impact of each problem, what needs to be done to resolve it and a look at the history of the problem in case it’s a repeating issue. You can also add your own notes or define it as a non-issue by suppressing the ticket, or suppressing the entire signature in case this is how you do things in your organization.
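The partial-replication check described above can be sketched in a few lines of Python. This is an illustration only, not how RecoverGuard works internally; the device and replica identifiers are invented, and the mapping of each device to its list of replicas is an assumed data layout.

```python
from typing import Dict, List


def unreplicated_devices(devices: Dict[str, List[str]]) -> List[str]:
    """Return the devices that have no replica at all (device -> replica ids)."""
    return sorted(dev for dev, replicas in devices.items() if not replicas)


def is_partially_replicated(devices: Dict[str, List[str]]) -> bool:
    """True when some, but not all, devices of a file system are replicated."""
    missing = unreplicated_devices(devices)
    return 0 < len(missing) < len(devices)


# Hypothetical layout of a three-device file system: two devices with
# consecutive numbers are replicated, a later addition is not.
file_system_s = {
    "dev_0101": ["srdf_r2_0101", "clone_0101"],
    "dev_0102": ["srdf_r2_0102"],
    "dev_0240": [],
}

print(unreplicated_devices(file_system_s))
print(is_partially_replicated(file_system_s))
```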
Let’s see another example of a data protection risk. Let’s look at all the risks that have been identified for the entire data center.

Let’s click the second one, which is from a NetApp environment. This will also demonstrate the capability of RecoverGuard to map and model your databases.

In this case we have a SQL database. It is stored across two volumes on a NetApp SAN, which is a NAS device. We can see that those two devices are indeed replicated, but the status of those devices is inconsistent. When clicking on the first link we can easily see that this particular link is in a re-try state and the data is slightly more than one day old, whereas the other portion of the storage is in sync and actually 12 seconds behind. Obviously we have a case here of a corrupt copy. If you were to have a DR event right now, you would have no valid copy of that particular database in your data center. That’s a significant risk to your data.
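A check of this kind boils down to comparing how far behind each replica link is. The sketch below is an assumption-laden illustration, not RecoverGuard code: the volume names, the one-minute tolerance and the function name are all hypothetical, and only the two lag values (just over a day, and 12 seconds) come from the example above.

```python
from datetime import timedelta
from typing import Dict


def inconsistent_replica(lags: Dict[str, timedelta], tolerance: timedelta) -> bool:
    """True when the volumes of one database are replicated to points in
    time that differ from each other by more than the given tolerance."""
    return max(lags.values()) - min(lags.values()) > tolerance


# Replication lag per volume of the same database (volume names are made up).
sql_db_lags = {
    "vol_data": timedelta(days=1, minutes=5),  # link stuck in a re-try state
    "vol_log": timedelta(seconds=12),          # link in sync
}

flagged = inconsistent_replica(sql_db_lags, tolerance=timedelta(minutes=1))
print(flagged)
```

The point of comparing the lags to each other, rather than each one to a fixed threshold, is that a database copy is only usable when all of its volumes reflect roughly the same moment.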

Let’s take a look at an availability issue in a cluster. We can easily understand from the ticket that we have a production system with a volume group that is clustered. The volume group is stored across two Symmetrix devices, but the standby system is mapped and zoned at the SAN level to only one of these devices, meaning that if a failover were to happen, the failover will fail. It will only work properly once you map and zone this device to your other cluster nodes. This is pretty obvious once it is pointed out, but it is difficult to find out manually that this risk exists. And remember that we are doing it for nearly 3,000 risks once a day. That’s the equivalent of running a full DR test each night.
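In its simplest form, the cluster check above is a set difference between the devices the volume group needs and the devices the standby node can see. The following Python sketch assumes that both sets have already been collected; the function name and device identifiers are hypothetical.

```python
from typing import Set


def missing_on_standby(volume_group_devices: Set[str], standby_visible: Set[str]) -> Set[str]:
    """Devices the clustered volume group needs but the standby node cannot see."""
    return volume_group_devices - standby_visible


# Hypothetical identifiers: the volume group spans two devices,
# but only one of them is mapped and zoned to the standby node.
vg_devices = {"dev_0A1", "dev_0A2"}
standby_devices = {"dev_0A1"}

missing = missing_on_standby(vg_devices, standby_devices)
failover_would_fail = bool(missing)
print(missing, failover_would_fail)
```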

Let’s take a look at another issue. Here we can see a UNIX system. We have an Oracle database stored in a volume group spanned across several devices. As we can see on the SAN level, these are not provisioned with SAN I/O paths in a consistent way. One of the devices has one path and the others have more than one. Actually, one device has four paths and one has five. There are even some dead paths. RecoverGuard can identify this and issue an alert. The remedy here is to set up the right number of paths and revive the dead ones to make sure that the systems are in perfect shape and performing well.
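As a rough illustration of the path check, the Python sketch below takes a list of path states per device and reports two things: whether the devices have uneven path counts, and which devices have dead paths. The data layout, state names and device names are assumptions made for this example.

```python
from typing import Dict, List, Tuple


def check_paths(paths: Dict[str, List[str]]) -> Tuple[bool, Dict[str, int]]:
    """paths maps a device to its path states ("active" or "dead").

    Returns (uneven, dead): uneven is True when devices in the volume group
    do not all have the same number of paths; dead maps each device that has
    dead paths to how many it has.
    """
    counts = {dev: len(states) for dev, states in paths.items()}
    uneven = len(set(counts.values())) > 1
    dead = {dev: states.count("dead") for dev, states in paths.items() if "dead" in states}
    return uneven, dead


# Hypothetical volume group: one, four and five paths, one of them dead.
volume_group_paths = {
    "dev_a": ["active"],
    "dev_b": ["active"] * 4,
    "dev_c": ["active"] * 4 + ["dead"],
}

uneven, dead = check_paths(volume_group_paths)
print(uneven, dead)
```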

In this list we find more issues. For instance, the DR host doesn’t have the right performance characteristics. Another shows a production file system stored on multiple RAID levels, which can lead to poor performance. We often find that these issues develop over time. You set up your system with great care for the right tier of storage, but as time passes and new storage is provisioned, sometimes this model will break and you’ll end up with large databases stored on devices which are not always on the right tier.

Here’s another interesting example. In this case, we have a volume group assigned to multiple fiber optic SAN devices over NetApp. The volumes are not all set for the same image property. With NetApp, when you assign a volume to a particular host, you can actually tune it for performance by choosing the right image type. Some of the devices here are tuned for Linux but many are not. The overall performance of this volume will not be optimal. Plus, there are some stability issues. This is not a recommended best practice.

We have here a SQL server stored on two NetApp volumes and just one of these is replicated. That means that in a DR event, half of your data will not be there. We can drill down more to see the actual file systems the data is stored on and which particular files are stored where. In order to see how this has happened, we can go to the “Reporting” tab and look into the Data Center Change Log:

Here we can see everything that happened in your data center since yesterday. Since this is a new ticket, this is the best way to look into the sequence of events that led to this problem. By the way, you can see some interesting information about critical issues that relate to availability and data protection here: clusters, failovers, etc. By scrolling down we can quickly find that particular file server. We can see that a new network drive was added and then mapped to a NetApp volume. Two new data files were added to the SQL databases. One of these was on the new volume, which is not replicated. This all leads to the problem RecoverGuard identified.

This can happen as you retire some servers, and they are not totally out yet, and the applications have moved but devices were not reclaimed. Or you provisioned storage in advance but no one ended up using that storage, so it can sit there provisioned for a very long time.

As this chart clearly shows, there are 21 terabytes of storage which is not provisioned plus around 40 terabytes more of replicas. You can understand how old the devices are, and view the report by this factor. So you can, for instance, look for storage that has been provisioned for more than three months but still not used, which is the case here. You can drill down by array or by host and see your top areas of waste. You also see all the relevant details, summary by site, by host, by various devices, etc. This is something which is worthwhile running about once a quarter to identify what can be reclaimed.
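The age filter described above (provisioned for more than three months but still unused) can be illustrated with a short Python sketch. The record fields, device names and dates are made up for the example, and the 90-day default is simply an approximation of "three months".

```python
from datetime import date
from typing import Dict, List


def stale_unused(devices: List[Dict], today: date, min_age_days: int = 90) -> List[str]:
    """Return ids of devices provisioned more than min_age_days ago and never used."""
    return [
        d["id"]
        for d in devices
        if not d["in_use"] and (today - d["provisioned"]).days > min_age_days
    ]


# Hypothetical inventory records.
inventory = [
    {"id": "lun_17", "provisioned": date(2008, 10, 1), "in_use": False},
    {"id": "lun_18", "provisioned": date(2009, 2, 1), "in_use": False},
    {"id": "lun_19", "provisioned": date(2008, 6, 1), "in_use": True},
]

reclaim_candidates = stale_unused(inventory, today=date(2009, 3, 1))
print(reclaim_candidates)
```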

Here we can see replicas which are provisioned but left untouched for a pretty long time. I have selected to run the report for replicas which are more than 6 months old. Your tier 1 storage devices are definitely not the best place to archive data. As we can see here, we have a significant number of very old replicas which could be reclaimed. Many of these are not just snapshots but actual full copies. In this case these are clones which are consuming capacity. We have many terabytes that can probably be freed up.

Another report is a side-by-side comparison of your production and standby systems. These can be either local cluster nodes, which will be compared in a very intelligent way, or primary and DR hosts. Some of the differences can be trivial and some are not. We show you only the meaningful issues. You can run this report with greater granularity, but in this particular view it will show you the significant issues. Here we can see a 64-bit system in production and a 32-bit system in DR, which is not a great practice. We can also compare installed packages and versions. Here we have Java installed on production but someone forgot to install it on the DR host. On another host we have a difference in the storage management application, with a more advanced version on production. None of these are good practices. You can run this report occasionally to see if there are any issues that should be addressed.
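A side-by-side comparison like this can be pictured as a diff over two host descriptions. The Python sketch below is illustrative only: the dictionary layout, package names and version numbers are invented, and a real comparison would need rules for deciding which differences are meaningful.

```python
from typing import Dict, List, Optional, Tuple

Diff = Tuple[str, Optional[str], Optional[str]]


def compare_hosts(prod: Dict, standby: Dict) -> List[Diff]:
    """Return (item, production value, standby value) for every difference
    in architecture or installed package versions. None means "not installed"."""
    diffs: List[Diff] = []
    if prod["arch"] != standby["arch"]:
        diffs.append(("arch", prod["arch"], standby["arch"]))
    for pkg in sorted(set(prod["packages"]) | set(standby["packages"])):
        p = prod["packages"].get(pkg)
        s = standby["packages"].get(pkg)
        if p != s:
            diffs.append((pkg, p, s))
    return diffs


# Hypothetical host descriptions; version numbers are made up.
production = {"arch": "64-bit", "packages": {"java": "1.6", "storage-mgmt": "5.1"}}
dr_host = {"arch": "32-bit", "packages": {"storage-mgmt": "5.0"}}

differences = compare_hosts(production, dr_host)
for item in differences:
    print(item)
```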

Let’s take a look at the SLA dashboard, which is a very interesting tool that will allow you to catch at a glance - but also drill down for more detail - the retention and replication structure in the data center. It will show you and flag SLA violations (slide above). In this case a new file system has been introduced, and it’s on a server that should be replicated, but it’s not replicated as planned. We can easily vet it. If that’s by design, we can just suppress the ticket, and if it’s not, we can act upon it.

It’s done semi-automatically. RecoverGuard will look into production servers that have data which is replicated using a certain technology and understand which servers on the DR site, or on any site for that matter, access the replica. These are immediately candidates for being production and DR servers. If it’s a manual configuration, you’ll just need to approve it. You can click the “suggest” button and you will see some additional pairings that the system considers as cluster nodes. This is for manual pairing, but as it relates to clustering, RecoverGuard will actually parse the cluster configuration metadata to understand which are the nodes.
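The pairing logic described above, where a host that accesses the source of a replication link is matched with a host that accesses its replica, can be sketched as follows. This is a simplified illustration under assumed data structures, not RecoverGuard's algorithm; the host and device names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def suggest_pairs(
    links: List[Tuple[str, str]], host_devices: Dict[str, Set[str]]
) -> List[Tuple[str, str]]:
    """Suggest (production host, DR host) pairs.

    links holds (source device, replica device) tuples; host_devices maps each
    host to the devices it accesses. A host using a source device is paired
    with every other host that uses the matching replica.
    """
    pairs = set()
    for source, replica in links:
        prod_hosts = [h for h, devs in host_devices.items() if source in devs]
        dr_hosts = [h for h, devs in host_devices.items() if replica in devs]
        for p in prod_hosts:
            for d in dr_hosts:
                if p != d:
                    pairs.add((p, d))
    return sorted(pairs)


# Hypothetical scan results.
replication_links = [("lun1", "lun1_r")]
access_map = {
    "ora-prod": {"lun1"},
    "ora-dr": {"lun1_r"},
    "web01": {"lun9"},
}

candidates = suggest_pairs(replication_links, access_map)
print(candidates)
```

The suggestions would still need manual approval, as described above, since access to a replica alone does not prove that a host is meant to be a DR standby.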

