Solving HA/DR Configuration Drift with RecoverGuard 4.0

Note: This is a transcript of a webinar that was originally presented in March 2009.

 

 

My name is Doron Pinhas. I’m the CTO of Continuity Software and I will be leading you through this presentation and demonstration. Continuity Software was established in 2004 and since then we have built quite a distinguished list of customers, some of which you can see on the slide below.

 

 

Continuity Software focuses on helping IT organizations address two main risks that are real nightmares for IT staff. The first is loss of critical data. That’s one of the primary drivers of lost revenue, and many companies that suffer significant data loss will actually close within two years. That is most likely the greatest nightmare. The other is significant unplanned downtime. These are the two risks IT organizations struggle to prevent.

 

 

To do that, people set up redundant data centers with data replication between the two, with standby systems ready to take the role of the primary systems in case of any failure, data protection through point-in-time copies for immediate use, backup for archiving, and finally manual procedures for failover and clustering solutions. There are various products out there for that. We see many of our customers deploying local clustering solutions as well as remote or geo-clustering.

This involves significant capital expense, time, attention and skill to set up those systems in the first place, make them work as expected with the right level of performance, and iron out all the issues that may prevent the systems from working. But setting these up is just the first step. What we have learned together with many of our customers is that there is a major investment later on just to keep that system running.

 

Here is a diagram showing the anatomy of what happens once you have set up your systems and they are running. And of course new systems will be introduced all the time, so this challenge is constantly changing.

The basics are that you have production systems that are active, running on one or more sites, and you have standbys. Some of these are local and some are remote. Production systems will change all the time. This is done in most organizations on a daily basis because problems will occur. Some of these changes are planned, some are not. A drive will fail, you run out of storage space somewhere so you add another disk, you need to apply patches, modify user characteristics, change your network, etc. There are multiple changes; some are minor and some are pretty massive. For example, consider replacing a large storage array that is serving hundreds of servers, or even migrating an entire data center to a new location.

The problem begins when you need to apply those changes to your standby systems as well. In production systems you have ample opportunity to make sure that every change that is made will actually work. You have your test procedures and the systems are up. And you have your users, who will complain if something goes wrong. This is not the case with standby or DR systems, which are usually shut off, or where the servers may be up but the applications are usually down. So, as you apply the changes you have no practical way of making sure everything works perfectly. There are occasions to test that, of course. Most of our customers do annual testing and some test more frequently, even monthly. But none of this is frequent enough to actually make sure that your standby systems are ready each and every time.

The reality is that you apply changes all the time to production but you can’t really test your standbys. What we help customers measure is when these implementation gaps or inconsistencies occur. There is an alarmingly high failure rate in DR testing, which implies that in between tests, your systems, or at least some portion of them, will not be ready if something goes wrong. What we try to do is help organizations make sure that their standby and DR systems are perfectly aligned with production at all times. We do that using our monitoring and analytical solution called RecoverGuard.
 

RecoverGuard is a software technology. One of the strengths of the technology is that it is totally agent-less, meaning that you set up this centralized management application and you can actually scan thousands of hosts and petabytes of storage without the need to deploy a single agent. By the way, a single server can scan and support very large data centers, but you can obviously scale up. Agent-less technology also means we are able to set up RecoverGuard in about 24 hours. Your first report on the state of the data center will be available immediately after those 24 hours.

RecoverGuard does two main things: 
One is to collect data from your storage arrays, servers and databases, to a great level of detail. It looks at all the devices and technologies so you can understand how storage is presented, which host mounts which device, and how the data is put to use through logical volume management, multi-pathing software, file systems, databases, and so forth. RecoverGuard creates a topology understanding of the environment each day, so you’ll have full documentation of your data centers, your clusters, your production servers, your storage, replication, the matching standby systems and how they are configured, in very great detail. You will see this both visually, through graphical views, and through contextual reports. The second, which is the main feature of RecoverGuard and probably the most interesting, is that it can look at that constantly updated topology view and identify any vulnerability within your environment. This is an extremely powerful piece of technology.

We’ve chosen a very unique approach. We have a knowledgebase of potential vulnerabilities. This is quite a large repository of knowledge from hundreds of companies that we’ve been laboriously accumulating. We call these risks “risk signatures.” We talk to our customers and understand what configuration problems they have had, which processes they have implemented imperfectly or not at all, and how these can impact recoverability, data protection and availability. Whenever we find a new risk, we model it into our knowledgebase. RecoverGuard will compare that knowledgebase to your environment each and every day, so whenever you make a change it can be almost immediately tested to make sure it doesn’t match any of these signatures that represent the experience of hundreds of large enterprises. This is a valuable, community-driven knowledgebase.
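
To make the idea of a risk signature concrete, here is a minimal Python sketch. It is not RecoverGuard's actual knowledgebase format: the `Signature` class, the snapshot layout, and the sample rule (tier 1 hosts with a single SAN path) are all hypothetical, chosen only to illustrate running a library of rules against a daily snapshot of the environment.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical snapshot layout: host name -> configuration facts for that host.
Snapshot = Dict[str, dict]


@dataclass
class Signature:
    """A named rule that returns the hosts in a snapshot matching a known risk."""
    name: str
    matches: Callable[[Snapshot], List[str]]


def single_path_hosts(snapshot: Snapshot) -> List[str]:
    # Illustrative rule: tier 1 hosts that have fewer than two SAN paths.
    return sorted(
        host for host, cfg in snapshot.items()
        if cfg.get("tier") == 1 and cfg.get("san_paths", 0) < 2
    )


def scan(snapshot: Snapshot, signatures: List[Signature]) -> Dict[str, List[str]]:
    """Run every signature against the snapshot and keep only those that fire."""
    findings: Dict[str, List[str]] = {}
    for sig in signatures:
        hits = sig.matches(snapshot)
        if hits:
            findings[sig.name] = hits
    return findings


if __name__ == "__main__":
    snapshot = {
        "mail01": {"tier": 1, "san_paths": 1},
        "mail02": {"tier": 1, "san_paths": 4},
        "test01": {"tier": 2, "san_paths": 1},
    }
    print(scan(snapshot, [Signature("tier1-single-san-path", single_path_hosts)]))
```

Keeping each rule as an independent function means a newly discovered risk can be added to the list without touching the scanning loop, which mirrors the "model it into the knowledgebase" step described above.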

 


In order to provide good coverage and give you a good understanding of whether you have issues in your environment and what the problems are, we have compiled two main areas of signature reporting and tools. To the left of the slide you can see the data protection areas that RecoverGuard has focused on up through Version 3. You can find any of these data protection risks in your data center, whether in production, local clustering or the DR site. What we have introduced with Version 4 is an expansion of this solution to provide complete coverage of all the risks in your data centers, including availability management. Not only can we assure that the data is correctly protected and provide a good view of how it is protected, but also whether your standby servers are ready for the task. We look for data protection and consistency issues, and make sure that all of the critical assets are indeed replicated according to an SLA which matches the value of the data, in an efficient way with no waste. We have optimization analysis of your solution, looking into various technology areas, spanning databases, best practices of storage vendors and operating system vendors, load balancing, SAN, etc. As we look at availability management, we can help you make sure that your standby systems are configured correctly at the host level, meaning you have all the right software versions, operating systems, packages, parameters and network settings on your standby hosts, and whether these are clustered. You also have a wide array of best practice tests to make sure that your clusters are built correctly, that you can actually access the correct data, that you have enough redundancy and performance in place to take the load of your production systems should they fail, and finally some very interesting root cause analysis which I will demo now.

 

Product Demo

The next slide is the main screen of RecoverGuard. This is what you see once you have logged in.


In this screen you can see, at a glance, the health level of your entire IT environment, including all of your data centers. To the left you can see the scan status of your environment: what servers, databases and arrays were discovered recently, which of those were scanned properly, and any areas that RecoverGuard currently does not cover. On this screen you can also see some statistics about issues that were found. On the right side of this screen you can incorporate your business services, applications, or business divisions as well. This is done pretty easily. You can just drag and drop your critical resources into logical entities, which you can set any way you like, including hierarchies, to represent your business processes and main applications. This allows you to see the health metrics of those particular business services and whether there are any data protection risks to your data, either on production or on your backup copies, replicas and DR. It shows whether there are any risks to availability, meaning that you have standby systems and maybe the data is OK, but your standbys cannot really assume their role. You can also see whether the environment is optimized enough. If there are any optimization opportunities you can see those as well. And finally, it shows whether you can meet your corporate replication goals in terms of RPO. This is obviously a very compact representation of all this information. Let’s see what happens if we drill down into one of these areas. I will click on the “Data Risk” for “Corporate Email” under “Business Entities.”

 

RecoverGuard will immediately route me to the ticketing module (above). To the left you can see a navigation tree; the corporate email service is pre-selected and there are four data-related risks identified for the corporate email. If we pick the first one, a detailed technical ticket will appear.

 


The ticket describes the problem in a very compact way. Let’s click on “Open Topology.”


What we see here is that a file system is stored in an inconsistent way as it relates to replication. We have three devices comprising File System S and, as you can see, one is not replicated and the other two have complex replication structures. In this case it’s an EMC environment and you have SRDF replication. For the other two devices there are several point-in-time copies which you can click on to get more details if you’d like. In the ticket you can see that data is only partially replicated. If you take a closer look you can see that the replicated devices were probably provisioned at one point in time, because their numbers are consecutive, and that the third was probably added later. In a minute we’ll see what changed, and how. So this is a pretty clear way to understand the risks. Consider the fact that looking manually for all of those risks in your data center may take quite a significant amount of time. But once you are aware of the problem, all you need to do in order to fix it is expand this view and hand it to your storage admin, and it’s very clear how to solve the problem. The ticket will contain details about the impact of each problem, what needs to be done to resolve it, and a look at the history of the problem in case it’s a repeating issue. You can also add your own notes or define it as a non-issue by suppressing the ticket, or suppressing the entire signature in case this is how you do things in your organization.
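
The partial-replication check in this ticket can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how RecoverGuard implements it: the function name, the input shapes, and the device identifiers are hypothetical. A file system is flagged when some, but not all, of its devices are replicated.

```python
from typing import Dict, List, Set


def partially_replicated(fs_devices: Dict[str, List[str]],
                         replicated: Set[str]) -> Dict[str, List[str]]:
    """Return, per file system, the devices that are not replicated.

    Only *partially* replicated file systems are reported: at least one
    device is replicated and at least one is not. File systems with no
    replication at all are assumed to be unprotected by design.
    """
    findings: Dict[str, List[str]] = {}
    for fs, devices in fs_devices.items():
        missing = [dev for dev in devices if dev not in replicated]
        if missing and len(missing) < len(devices):
            findings[fs] = missing
    return findings


if __name__ == "__main__":
    # Hypothetical device IDs: two consecutive devices plus one added later.
    layout = {"/mail": ["0A41", "0A42", "0B17"], "/scratch": ["0C01"]}
    print(partially_replicated(layout, replicated={"0A41", "0A42"}))
```

Treating fully unreplicated file systems as intentional is a design choice made for this sketch; a stricter rule could compare each file system against a replication policy instead.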

Let’s see another example of a data protection risk. Let’s look at all the risks that have been identified for the entire data center.

 


Let’s click the second one, which is from a NetApp environment. This will also demonstrate the capability of RecoverGuard to map and model your databases.

 

 

In this case we have a SQL database. It is stored across two volumes on a NetApp SAN, which is a NAS device. We can see that those two devices are indeed replicated, but the status of those devices is inconsistent. When clicking on the first link we can easily see that this particular link is in a re-try state and the data is slightly more than one day old, whereas the other portion of the storage is in sync and actually 12 seconds behind. Obviously we have a case here of a corrupt copy. If you were to have a DR event right now, you would have no valid copy of that particular database in your data center. That’s a significant risk to your data.
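
A hedged sketch of this consistency check: given the replication lag of each volume that holds part of one database, flag the database when the lags differ by more than some tolerance. The function name, the volume names, and the five-minute tolerance are assumptions made for illustration only.

```python
from datetime import timedelta
from typing import Dict, Optional


def replica_lag_spread(lags: Dict[str, timedelta],
                       tolerance: timedelta = timedelta(minutes=5)
                       ) -> Optional[timedelta]:
    """Return the gap between the stalest and freshest replica if it exceeds
    the tolerance, otherwise None (the replicas are close enough in time)."""
    if not lags:
        return None
    spread = max(lags.values()) - min(lags.values())
    return spread if spread > tolerance else None


if __name__ == "__main__":
    # One volume is stuck a little over a day behind, the other is 12 seconds behind.
    lags = {
        "vol_data": timedelta(days=1, minutes=10),
        "vol_log": timedelta(seconds=12),
    }
    print(replica_lag_spread(lags))
```

The point of comparing lags against each other, rather than each lag against a fixed limit, is that a database copy is only usable when all of its pieces reflect roughly the same moment in time.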

 


Let’s take a look at an availability issue in a cluster. What we can easily understand from the ticket is that we have a production system with a volume group that is clustered. The volume group is stored across two Symmetrix devices, but the standby system is mapped and zoned, at the SAN level, to only one of these devices, meaning that if a failover were to happen, the failover would fail. It will only work properly once you map and zone this device to your other cluster nodes. This is pretty obvious once it is pointed out, but it is difficult to find out manually that this risk exists. And remember that we are doing it for nearly 3,000 risks once a day. That’s the equivalent of running a full DR test each night.
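
The cluster check described here reduces to a set comparison: every node that may take over the volume group must see every device in it. The following Python sketch assumes the device and node names and the input shapes; it illustrates the logic rather than the product's implementation.

```python
from typing import Dict, List, Set


def nodes_missing_devices(vg_devices: Set[str],
                          node_visible: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """For each cluster node, list volume-group devices it cannot see.

    vg_devices: devices that make up the clustered volume group.
    node_visible: node name -> devices mapped and zoned to that node.
    """
    findings: Dict[str, List[str]] = {}
    for node, visible in node_visible.items():
        missing = sorted(vg_devices - visible)
        if missing:
            findings[node] = missing
    return findings


if __name__ == "__main__":
    vg = {"dev1", "dev2"}
    visibility = {"prod": {"dev1", "dev2"}, "standby": {"dev1"}}
    # The standby node would fail to import the volume group after failover.
    print(nodes_missing_devices(vg, visibility))
```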

 

Let’s take a look at another issue. Here we can see a UNIX system. We have an Oracle database stored in a volume group spanned across several devices. As we can see on the SAN level, these are not provisioned with SAN I/O paths in a consistent way. One of the devices has one path and the others have more than one. Actually, one device has four paths and one has five. There are even some dead paths. RecoverGuard can identify this and issue an alert. The remedy here is to set up the right number of paths and revive the dead ones to make sure that the systems are in perfect shape and performing as they should.
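
As an illustration of the path check, the sketch below takes the path states for each device in a volume group and reports two things: whether the devices have differing numbers of configured paths, and which devices have dead paths. The state strings `"active"` and `"dead"`, the `PathReport` type, and the device names are assumptions for this example.

```python
from typing import Dict, List, NamedTuple


class PathReport(NamedTuple):
    uneven: bool      # True when devices do not all have the same path count
    dead: List[str]   # devices that have at least one dead path


def check_paths(device_paths: Dict[str, List[str]]) -> PathReport:
    """device_paths maps a device name to the states of its SAN I/O paths."""
    counts = {dev: len(states) for dev, states in device_paths.items()}
    uneven = len(set(counts.values())) > 1
    dead = sorted(dev for dev, states in device_paths.items() if "dead" in states)
    return PathReport(uneven, dead)


if __name__ == "__main__":
    paths = {
        "d1": ["active"],
        "d2": ["active"] * 4,
        "d3": ["active"] * 4 + ["dead"],
    }
    print(check_paths(paths))
```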

 


In this list we find more issues. For instance, the DR host doesn’t have the right performance characteristics. Another shows a production file system stored on multiple RAID levels, which can lead to poor performance. We often find that these issues develop over time. You set up your system with great care on the right tier of storage, but as time passes and new storage is provisioned, sometimes this model will break and you’ll end up with large databases stored on devices which are not always on the right tier.

 


Here’s another interesting example. In this case, we have a volume group assigned to multiple fiber optic SAN devices over NetApp. The volumes are not all set to the same image property. With NetApp, when you assign a volume to a particular host, you can actually tune it for performance by choosing the right image type. Some of the devices here are tuned for Linux but many are not. The overall performance of this volume will not be optimal. Plus, there are some stability issues. This is not a recommended practice.

Those were a few sample tickets that I wanted to show you. Now I’d like to cover some more areas in the product. I’d like to show you some of the root cause analysis capabilities that are available in RecoverGuard Version 4.0. I’ll do that by choosing another ticket and drilling down into the problem.
 


We have here a SQL server stored on two NetApp volumes, and just one of these is replicated. That means that in a DR event, half of your data will not be there. We can drill down further to see the actual file systems the data is stored on and which particular files are stored where. In order to see how this has happened, we can go to the “Reporting” tab and look into the Data Center Change Log:

 


Here we can see everything that happened in your data center since yesterday. Since this is a new ticket, this is the best way to look into the sequence of events that led to this problem. By the way, you can see some interesting information about critical issues that relate to availability and data protection here: clusters, failovers, etc. By scrolling down we can quickly find that particular file server. We can see that a new network drive was added and then mapped to a NetApp volume. Two new data files were added to the SQL databases. One of these was on the new volume, which is not replicated. This all leads to the problem RecoverGuard identified.
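
A change log of this kind can be thought of as the difference between yesterday's and today's snapshots. The Python sketch below assumes each snapshot is a flat dictionary of configuration items; the key names and values are hypothetical and only echo the example in the text (a new drive mapped to a new volume, and new data files).

```python
from typing import Any, Dict, List, Tuple

# (kind, key, old value, new value); kind is "added", "removed" or "changed".
Change = Tuple[str, str, Any, Any]


def diff_snapshots(old: Dict[str, Any], new: Dict[str, Any]) -> List[Change]:
    """List what was added, removed or changed between two snapshots."""
    changes: List[Change] = []
    for key in sorted(old.keys() | new.keys()):
        if key not in old:
            changes.append(("added", key, None, new[key]))
        elif key not in new:
            changes.append(("removed", key, old[key], None))
        elif old[key] != new[key]:
            changes.append(("changed", key, old[key], new[key]))
    return changes


if __name__ == "__main__":
    yesterday = {"drive:E": "vol1", "db.datafiles": 4}
    today = {"drive:E": "vol1", "drive:F": "vol2", "db.datafiles": 6}
    for change in diff_snapshots(yesterday, today):
        print(change)
```

Reading the changes in order next to a new ticket is what makes root cause analysis possible: the ticket says what is wrong now, and the diff says what was done since the last clean scan.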

In the Report module, there are additional reports that relate to performance, optimization, and best practices. One of these is a report that can identify servers which are configured with a single SAN path. That may be by design for a tier 2 server, but it is definitely not a good practice for tier 1.
Another interesting report is the Storage Provisioning Optimization Report, which will show you storage that was provisioned but is not actually used.
 


This can happen as you retire some servers: they are not totally out yet, and the applications have moved, but the devices were not reclaimed. Or you provisioned storage in advance but no one ended up using it, so it can sit there provisioned for a very long time.

 


As this chart clearly shows, there are 21 terabytes of storage which is provisioned but not used, plus around 40 terabytes more of replicas. You can see how old the devices are, and view the report by this factor. So you can, for instance, look for storage that has been provisioned for more than three months but is still not used, which is the case here. You can drill down by array or by host and see your top areas of waste. You also see all the relevant details, with summaries by site, by host, by device, etc. This is something which is worthwhile running about once a quarter to identify what can be reclaimed.
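
The kind of filter behind such a report can be sketched as follows. The `Device` record, its field names, and the 90-day default are assumptions made for this illustration; the sketch sums unused capacity per array for devices that were provisioned longer ago than the threshold.

```python
from datetime import date, timedelta
from typing import Dict, Iterable, NamedTuple


class Device(NamedTuple):
    name: str
    array: str
    size_gb: int
    provisioned_on: date
    in_use: bool


def reclaim_candidates(devices: Iterable[Device], today: date,
                       min_age: timedelta = timedelta(days=90)) -> Dict[str, int]:
    """Total unused capacity (GB) per array, counting only devices that
    have been provisioned for longer than min_age."""
    totals: Dict[str, int] = {}
    for dev in devices:
        if not dev.in_use and (today - dev.provisioned_on) > min_age:
            totals[dev.array] = totals.get(dev.array, 0) + dev.size_gb
    return totals


if __name__ == "__main__":
    inventory = [
        Device("lun1", "array-a", 500, date(2008, 10, 1), False),
        Device("lun2", "array-a", 200, date(2009, 2, 1), False),  # too recent
        Device("lun3", "array-b", 300, date(2008, 6, 1), True),   # in use
    ]
    print(reclaim_candidates(inventory, today=date(2009, 3, 1)))
```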

Here is a report which will also reveal potential waste:
 


Here we can see replicas which are provisioned but left untouched for a pretty long time. I have chosen to run the report for replicas which are more than 6 months old. Your tier 1 storage devices are definitely not the best place to archive data. As we can see here, we have a significant number of very old replicas which could be reclaimed. Many of these are not just snapshots but actual full copies. In this case these are clones which are consuming capacity. We have many terabytes that can probably be freed up.

 


Another report is a side-by-side comparison of your production and standby systems. These can be either local cluster nodes, which will be compared in a very intelligent way, or primary and DR hosts. Some of the differences can be trivial and some are not. We show you only the meaningful issues. You can run this report with greater granularity, but in this particular view it will show you the significant issues. Here we can see a 64-bit system in production and a 32-bit system in DR, which is not a great practice. We can also compare installed packages and versions. Here we have Java installed on production but someone forgot to install it on the DR host. On another host we have a difference in the storage management application, with a more advanced version on production. None of these are good practices. You can run this report occasionally to see if there are any issues that should be addressed.
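
A side-by-side comparison that shows only meaningful differences can be sketched as a dictionary diff with an ignore list for attributes that are expected to differ between hosts. The attribute names, the ignore list, and the sample values (including the Java version) are hypothetical and serve only to illustrate the idea.

```python
from typing import Any, Dict, FrozenSet, Tuple

# Attributes that legitimately differ between a production host and its standby.
DEFAULT_IGNORE: FrozenSet[str] = frozenset({"hostname", "ip_address"})


def compare_hosts(prod: Dict[str, Any], dr: Dict[str, Any],
                  ignore: FrozenSet[str] = DEFAULT_IGNORE
                  ) -> Dict[str, Tuple[Any, Any]]:
    """Return attribute -> (production value, DR value) for every attribute
    that differs; a value of None means the attribute is missing on that host."""
    diffs: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted((prod.keys() | dr.keys()) - ignore):
        p, d = prod.get(key), dr.get(key)
        if p != d:
            diffs[key] = (p, d)
    return diffs


if __name__ == "__main__":
    production = {"hostname": "app-prod", "arch": "64-bit", "pkg:java": "1.6"}
    standby = {"hostname": "app-dr", "arch": "32-bit"}
    print(compare_hosts(production, standby))
```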

 


Let’s take a look at the SLA dashboard, which is a very interesting tool that will allow you to see at a glance, but also drill down for more detail on, the retention and replication structure in the data center. It will show you and flag SLA violations (slide above). In this case a new file system has been introduced, and it’s on a server that should be replicated, but it’s not replicated as planned. We can easily vet it. If that’s by design, we can just suppress the ticket, and if it’s not, we can act upon it.

Another interesting aspect of this report is the ability to drill down into individual components and see how their data retention and copy policies are put into practice. I will drill down into the data warehouse service. You can drill down to the volume group level or database level and see all the data protection points you have here.
That’s the way to drill down into individual components and see the exact data protection layout and how many restore points you have at any given point in time.
I want to make sure we leave time to answer your questions, so I’ll end my presentation now and take your questions.
Questions and Answers:
Question #1: One of the attendees is asking about DR servers that are down. In his environment they are using DR servers which are typically down, and the question is: How can RecoverGuard actually detect availability or data protection risks for those down servers? That’s actually a common question and we provide an excellent solution to this particular problem. Even if you don’t have servers at all we can still track the production servers, map their storage, and understand how it is replicated, whether these copies are consistent, and whether they are presented at the SAN level at all, so that when you bring up your servers, they will have the correct data to access. As it relates to data protection, we can give you a pretty good picture even if you don’t scan any servers or remote sites. You will understand what’s being replicated, how, and whether it is really ready and mapped for use. That’s pretty valuable. In order to compare the servers’ configurations themselves you will need to scan them at least once in a while. Most of our customers will scan the DR servers whenever a DR test is conducted, even if it’s once a year. We will grab the entire configuration of the server on this occasion and compare it to production. Then, once you shut down your servers and they don’t change anymore, we keep on tracking the production servers. The configuration of the production servers will inevitably change whereas that of the dormant servers will not. Any changes that may affect recoverability will be immediately flagged. That’s a very nice way to know that you have done something to production and now you need to bring up your DR server and do the same, otherwise it won’t work. There is good value even if your standby servers are down, and especially if this is so, because you can’t even test them manually anymore.
Question #2: There are several questions about coverage. We do support the full range of storage arrays by Hitachi, including all the relevant replication technologies, EMC Symmetrix and CLARiiON, and the entire NetApp product range. We will be adding support for other vendors such as HP XP and IBM DS this year.
Question #3: Another question is about support for Oracle ASM. Yes, we do support all sorts of volume managers, whether these are OS-based or database-based such as ASM. That’s an interesting question because there are around 60 or 70 different best practices that we will test for database storage provisioning in that particular environment. It’s very valuable.
Question #4: There are some questions about deployment of the product and what it takes. I would like to remind you that the technology is totally agent-less. You just set up the software, you scan the storage to discover the arrays - this is done in 5 minutes - and then you select those you would like to include in the analysis, hit “next” and it will collect the data. The next day you’ll have the results. It has a very low footprint on your network, and just a few kilobytes of data are transferred per server just once a day. It’s all read-only, so it’s not risky, and it’s pretty fast.
Question #5: Another question: Can the technology be utilized for DR exercises? Yes, of course. But this is not a full substitute for a DR exercise. It is a complementary technology to audit your environment very thoroughly in between your tests. This can be done on a daily basis. Some tasks must be automated, because it is not physically possible to manually review your servers each day. By introducing RecoverGuard you can actually do that, and since we test for around 3,000 different issues, the software will do a much more thorough job than what you can expect to do manually. And it’s done every day. Running a scan is the equivalent of running a test. For those customers who scanned their systems for the first time with RecoverGuard before a test, we’ve been able to demonstrate that we capture all the issues uncovered by the test. I would still argue that you should test your DR once in a while just to make sure that the power is on, etc., but RecoverGuard is definitely a perfect solution to run in between tests. And it makes you much more ready for your DR testing. Just consider the value of having all the documentation of the environment up to date and available to you as you begin the testing exercise. Also, having advance awareness of all the potential issues that exist is something that will shorten the time that it takes you to prepare for the test.
Question #6: There are several questions about how we collect data. I kept that at a high level intentionally in today’s presentation, but we do have a very detailed presentation for that. I’ll just keep this answer at a high level, but for those of you who are interested, just send us an email about that and I’ll be happy to follow up. Collection is agent-less, but it is done through standard protocols and is very well documented. There is nothing to hide. As we work with customers, the first thing we do is introduce the security and network teams to our scanning methodologies. Again, it is very well documented. All of our collection methods are totally exposed, so once you deploy the product you can immediately see what it’s going to collect and how. It’s all read-only. We will collect data from storage arrays through the native array API. We’ll also be able to connect to those live databases that are up and understand where they store their data files, log files, archives, and so forth. This is all done through read-only commands, all are valid, and all go through standard protocols which are highly firewall-friendly, so even if your network is segmented you should expect to be able to set up the environment for RecoverGuard to scan your systems pretty easily. The only thing you need to do is type in some credentials (read-only, non-privileged) to allow the system to collect the data.
Question #7: There are several questions about clustering relations and how to establish primary and DR systems. There is an area in which RecoverGuard can actually make an intelligent suggestion.

It’s done semi-automatically. RecoverGuard will look into production servers that have data which is replicated using a certain technology and understand which servers on the DR site, or on any site for that matter, access the replica. These are immediately candidates for being production and DR servers. If it’s a manual configuration, you’ll just need to approve it; you can click the “suggest” button and you will see some additional pairings that the system considers to be cluster nodes. This is for manual pairing, but as it relates to clustering, RecoverGuard will actually parse the cluster configuration metadata to understand which are the nodes.

Question #8: There is a question about whether we have a CMDB system and whether it interacts with external CMDB systems. The answer is yes, in a sense: RecoverGuard does have its own CMDB repository, but it can also interact with external CMDB systems. What we did find is that in order to do all the detailed analysis we provide, you really need to get very deep into the storage configuration and layout to find those risks. Most CMDB systems are not up to that task, but we can definitely connect to external systems and we actually do that to collect some of the data.
Question #9: Here’s a question about the granularity of scanning databases. We do support Oracle, Sybase, IBM DB2 UDB and Microsoft SQL Server. With all of these, RecoverGuard looks very deep below the server into the databases and instances and the different portions of the database, meaning data files, log files and archives. We understand and treat each of those portions uniquely, so if you keep more copies of your data files than of your archives, which makes perfect sense, then we’ll definitely not flag that but rather test each of these entities uniquely, see that they’re configured for the right performance, and so forth.
Question #10: There are several questions coming in about server virtualization and VMware support. There are some new features coming out later this year. Currently we do support ESX grids. If you have an ESX grid which is replicated, we make sure that the data is provisioned correctly, meaning that all of your nodes can access the same data with the same performance, that the data is consistently replicated using the various technologies that you use, that you have enough point-in-time copies, and that the servers are provisioned correctly with enough redundancy and enough resources to take the roles of each other. All of this is done just like any other cluster. What we don’t yet do is actually find vulnerabilities within the virtual machines themselves. This is something that we plan to do in the next two releases. Bottom line: we scan the infrastructure and find vulnerabilities today, and we will also scan the virtual machines in the near-term future.
Question #11: There is a very interesting question about ROI. I would argue that we spend millions and millions of dollars on a DR strategy, and thousands of man hours before each test, so this must work right. This is a tool that can reduce the labor associated with making sure the systems are running. Labor reduction is significant. Much of the data that needs to be gathered constantly, on a weekly basis, for auditing, to find what’s provisioned and what’s not, is gathered manually, a task which is usually beyond the scope of storage management systems. They don’t understand clearly what’s being replicated for DR and how clusters work. If you want to map your resources it would take hundreds of hours each time you would like to refresh the documentation. With RecoverGuard, it is totally automatic. You really need that mapping before you have a DR test, so what we saw our customers doing before they had RecoverGuard was spending hundreds or thousands of hours refreshing their understanding or their documentation just to make sure that they could run the test efficiently. That is demonstrated by many of our customers who say DR testing is much less stressful because they have far fewer issues to find in each test.
Thank you all for joining us today for our presentation.   