RecoverGuard Demo

Note: This is a transcript of the RecoverGuard Demo Video.

 

 

My name is Doron Pinhas and I am the CTO of Continuity Software. 

I’m here today to present to you RecoverGuard™, a High Availability and Disaster Recovery configuration management and analytics tool that will put you back in control of both data protection and application availability.

Background
Enterprises spend huge amounts of capital and labor to build a resilient IT infrastructure that protects against data loss and allows fast recovery of applications, business services and their data.
In all other areas of IT, when configuration complexity reaches a certain critical mass, the right management tools are deployed. This makes a lot of business sense, because resorting to an Excel-spreadsheet management style will inevitably result in lack of control, inefficient resource utilization and hidden configuration errors. However, when it comes to the HA / DR infrastructure, people still rely on point solutions, custom scripts and manual, risky and labor-intensive planning, auditing and testing. The sad reality is that in many cases, when your systems – or your investment – are put to the test, you might find that you still suffer from downtime and data loss because of configuration drifts you were not aware of between various elements in your infrastructure, such as storage and storage networking, servers, operating system configuration, replication, clustering and so on.
Today I’m going to show you how the right tool can protect your investment in HA / DR and allow you to reach unprecedented levels of control and confidence in your ability to recover at any time.
RecoverGuard is totally agentless software that can be deployed in a couple of hours and, after completing its first scan, show you where your risks lie and how to fix them to reach an optimal protection level. In addition, RecoverGuard will help optimize your HA / DR configuration in terms of server and I/O performance, storage and replication network utilization, and many other areas.
Let’s log in to RecoverGuard through its state-of-the-art web-based interface.
 
The Dashboard
Once you log in, you’ll get to the Dashboard – the first product tab, which lets you review, at a glance, the status of your IT environment and the results of the last RecoverGuard scan. As you can see, additional product tabs exist, which we shall cover later in this demo.
The dashboard is divided into several areas. Starting from the top left, it shows you:
•  The status of the last scan, and what major changes were detected in your IT, such as the addition or removal of servers and storage arrays
•  Trending information of risks in the datacenter, by category
•  Top issues; and
•  To the right – the readiness of your business applications and services
 
Let’s take a moment to review the last item. 
RecoverGuard keeps track of dependencies between your high-level business applications or business processes, and the collection of IT resources that support them. This way, it can report back on identified configuration risks and show you how they affect your business.
There are four main metrics, each of which can be green to indicate all is well, yellow to indicate a warning, or red to indicate a call for urgent action. A non-green metric means the following:
  • Data risk – indicates that your data – either in production or in one of your backup or recovery environments – is at risk
  • Availability risk – indicates that RecoverGuard has detected a configuration issue in one of your standby systems – local or remote – that might result in extended downtime as a result of failure of your standby systems
  • Optimization – indicates that RecoverGuard can suggest ways to increase performance and efficiency, or better comply with various vendors’ best practices
  • SLA – allows you to track your IT environment’s compliance with required data protection and availability goals, such as RPO, retention, I/O capacity, and so on
Any of the non-green signs, like most parts of the dashboard, is clickable – and will take you straight to the appropriate section of the product to get more detail.
Let’s see what happens when we click on the data risk sign for the ERP business entity.
 
 
The Ticket Module
As you may notice, this brings us into the tickets tab, which we shall explore for the next several minutes. In this tab you will find a variety of views, search options and tools to allow you to clearly focus and drill down into any area of interest.
 
As you can see, now in focus are data-related risk categories that apply to the ERP business service. As a side note, RecoverGuard supports a wide range of potential risk categories, which you can view by clicking on the category link. This capability is used not only for interactive search, but also for customizing scheduled reports and alerts. For now, let’s continue to explore the current view.
At the top of the screen you can see a summary list of data-related risks for the business entity. When you click on one, its details will show up at the bottom. There is a lot of information you can derive from a RecoverGuard ticket:
 
•  What’s the issue – the ticket description contains all the information you will need to understand and fix the problem
•  What’s the business impact [if not fixed]
•  How to solve the issue
•  What’s the history
•  Note area to preserve knowledge and comments
 
Finally, there is a variety of drill-down capabilities and ticket life-cycle management tools. Let’s view one such capability as we review our first ticket; I’m going to click on “show topology”, which will bring up a graphical representation of the ticket. This view interacts with the ticket to help understand the problem.
Basically, what this ticket shows is an incomplete replication. Notice how a new storage device was added to the storage volume group without configuring an appropriate replica like the other volumes have. From this point on, the volume group can no longer be recovered in the event of a disaster!
Notice how you can get additional information as you click on any item in the topology window.
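To make the logic of this ticket concrete, here is a minimal Python sketch of such a check. It is an illustration only and not RecoverGuard’s actual implementation; the Volume class, the function name and the device names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Volume:
    name: str
    replicas: list[str] = field(default_factory=list)  # replica device names


def unreplicated_volumes(volume_group: list[Volume]) -> list[str]:
    """Return members without a replica in a group where other members have one."""
    if not any(v.replicas for v in volume_group):
        return []  # nothing in the group is replicated: a different finding
    return [v.name for v in volume_group if not v.replicas]


# Hypothetical group: dev3 was added later and never paired with a replica.
erp_vg = [Volume("dev1", ["dev1_r"]), Volume("dev2", ["dev2_r"]), Volume("dev3")]
print(unreplicated_volumes(erp_vg))
```

The sketch only flags a volume when its peers are replicated, which mirrors the situation in the ticket: the group as a whole is meant to be protected, and one member breaks the set.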
 
Let’s move to a few other data protection ticket samples, as I introduce more product capabilities. I’m going to use the Filter tree on the left to switch context to the entire datacenter, not just ERP, by clicking on the root. You can also narrow down by host, database or site. Here is another ticket. I’m going to open the topology and use the “incremental view” button above to get a better angle.
This is an interesting case, in a Hitachi storage environment. A Windows host is innocently using SAN storage volumes, unaware that these might get corrupted at any moment, since those storage volumes are replicas of other, unused ones. While this might seem strange at first sight, this is not a rare phenomenon. A possible explanation is that the two volume pairs were once used by another system, later reclaimed, and then allocated to host “blaked”. The system administrator does not suspect that the two new devices are in danger. Sometime in the future, the two source volumes could be allocated to another host. At a certain point later, the storage admin will scratch his head saying: “why is the replica split?” From here, the road to disaster is steep; once the pairs are re-synchronized, host “blaked” will immediately lose its data. The solution, of course, is to dissolve the connection between the pairs and avoid the risk altogether.
One last data protection example. Again, I’m going to open the topology as well. What we see here is an incorrect replication configuration of an Oracle database.
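The Hitachi case boils down to a host writing to volumes that are the target side of a replication pair. The following Python sketch shows one way such a condition could be detected from an inventory. It is not how RecoverGuard does it; the function and volume names are hypothetical, and only the host name “blaked” comes from the demo.

```python
def replica_targets_in_use(pairs, volumes_by_host):
    """Map each host to the replication-target volumes it has mounted.

    pairs: iterable of (source_volume, target_volume) replication pairs.
    volumes_by_host: dict of host name -> volumes the host uses.
    """
    targets = {target for _source, target in pairs}
    findings = {}
    for host, volumes in volumes_by_host.items():
        at_risk = sorted(set(volumes) & targets)
        if at_risk:
            findings[host] = at_risk
    return findings


# Hypothetical inventory: "blaked" was given two volumes that are still
# configured as replica targets of unused source volumes.
pairs = [("src1", "tgt1"), ("src2", "tgt2")]
volumes_by_host = {"blaked": ["tgt1", "tgt2"], "otherhost": ["vol9"]}
print(replica_targets_in_use(pairs, volumes_by_host))
```

Any host that appears in the result would lose its data on the next re-synchronization of the listed pairs.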

Let’s zoom in to see the Oracle configuration. I’m using my mouse wheel to do that, but you can also use the magnifying glass buttons at the top. You can clearly see that the database is stored on three filesystems that are in turn stored on a mixture of SAN and DAS volumes. Let’s click on the filesystems and use the expand button at the top to get more details. Now it becomes apparent that the database data files are stored on Symmetrix devices, which have a complete replication structure to the DR site, including synchronous replication plus five point-in-time snapshots. The log files, however, are stored on a local disk, or DAS, and of course are not replicated. In practice, this means that our remote copy would not be recoverable. Our DBAs would never be able to tell that without working closely with the storage team, and vice versa.
 

Let’s move to some availability risks. I’m going to click on the “category” link and choose only the availability category from the menu. Now I’m going to search again. Let’s pick an example. Opening the topology, the issue becomes clear. This is a NetApp SAN environment, and we see a two-node Veritas Cluster. Unix host “system1” is the active node, using four NetApp LUNs to store an operating system Volume Group. As you can see, the standby node does not have access to one of the volumes, which means that it will not be able to restart correctly if the primary node fails. This is quite a common issue, BTW.
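The check behind this ticket is essentially a set comparison: every node in the cluster must see every volume the service needs. Here is a minimal Python sketch under that assumption. The function name, the LUN names and the standby host name “system2” are hypothetical; this is not RecoverGuard’s own logic.

```python
def nodes_missing_volumes(required, visible_by_node):
    """Return {node: [volumes it cannot see]} for nodes lacking required volumes."""
    findings = {}
    for node, visible in visible_by_node.items():
        missing = sorted(set(required) - set(visible))
        if missing:
            findings[node] = missing
    return findings


# Hypothetical two-node cluster: the standby cannot see the fourth LUN.
required = {"lun1", "lun2", "lun3", "lun4"}
visible_by_node = {
    "system1": {"lun1", "lun2", "lun3", "lun4"},  # active node
    "system2": {"lun1", "lun2", "lun3"},          # standby node
}
print(nodes_missing_volumes(required, visible_by_node))
```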
 
One more example. This is another common issue in all cluster environments. You add a filesystem and create a cluster mount resource that lets the cluster know it should use the new filesystem as part of starting your application. However, there is one more step that needs to be done: you need to manually create a corresponding directory on each node, otherwise the filesystem cannot be mounted there. If you miss one node, or even misspell the directory name on one of them, that standby can no longer run your application. Of course, the issue will remain hidden until there’s actually a failure, and then your cluster will not perform. And then, guess who’s going to get a call in the middle of the night, while a critical system is down?
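As an illustration, the mount-point problem can be expressed as a simple comparison over directory listings collected from each node. The Python sketch below assumes such listings are already available; the function, node names and paths are hypothetical and do not reflect RecoverGuard internals.

```python
def nodes_missing_mount_point(mount_point, dirs_by_node):
    """Return the nodes on which the mount point directory does not exist."""
    return sorted(
        node for node, dirs in dirs_by_node.items() if mount_point not in dirs
    )


# Hypothetical listings: node2 has a misspelled directory, node3 has none.
dirs_by_node = {
    "node1": {"/oradata", "/oralogs"},
    "node2": {"/oradta", "/oralogs"},
    "node3": {"/oralogs"},
}
print(nodes_missing_mount_point("/oradata", dirs_by_node))
```

An exact string comparison is enough to catch both the missing directory and the misspelled one, since to the cluster they are the same failure.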
Finally, let’s take a look at some optimization suggestions. To speed things up, I’m going to switch to the dashboard again, and click on the optimization sign of the ERP Business Entity.
 
Here’s an interesting example.  Let’s open the topology. I’m going to clean up a bit by selecting three of the four multiplexed log groups and clicking the “hide selected entities” button. 
This is a violation of an Oracle best practice that will result in significant performance degradation. Oracle can take care of multiplexing, which really means mirroring, of critical files for us. While any modern storage can also take care of mirroring, it’s actually a very good idea to let Oracle do that. Beyond redundancy, Oracle can optimize the I/O to practically double performance. This for sure makes the extra work worthwhile. In this case the issue involves the database log files, which are usually a major bottleneck of busy databases. Without getting into too much detail, one of the reasons log files are so busy is that they are accessed by two I/O-intensive processes: the logger, which writes the transaction logs, and the Archiver, which copies full log files while the logger moves on to use the next file. By allocating several spindles and using a clever file-layout scheme, Oracle can guarantee that the writes and reads will always be performed on separate storage volumes, dramatically improving performance. As you can see, something went wrong here. The files are stored on two separate filesystems, so the DBA really did seem to try to do her job. However, a closer look reveals that the two filesystems are actually stored on the same volume! Instead of writing once (and reading from a separate spindle), the same data is written twice to the same volume AND read – three times as slow as it should be.
You can check, BTW, whether the SAN storage device is replicated, which could actually make things worse ([click on device 288, and then hit the “Display replications” button above]). Well, it is replicated, but at least not synchronously (the split links suggest that the replication is performed periodically).
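The multiplexed-log problem above can be sketched as follows: resolve each log member’s filesystem to its underlying volume, then flag any log group whose members end up on the same volume. This Python example is a sketch under assumed data structures, not RecoverGuard’s implementation; the paths, filesystem names and volume names are hypothetical.

```python
from collections import defaultdict


def groups_sharing_a_volume(members, fs_to_volume):
    """Return ids of log groups with two or more members on one volume.

    members: iterable of (group_id, file_path, filesystem).
    fs_to_volume: dict of filesystem -> underlying storage volume.
    """
    volumes_by_group = defaultdict(list)
    for group_id, _path, filesystem in members:
        volumes_by_group[group_id].append(fs_to_volume[filesystem])
    return sorted(
        group_id
        for group_id, volumes in volumes_by_group.items()
        if len(set(volumes)) < len(volumes)
    )


# Hypothetical layout: /u01 and /u02 look separate but share one volume.
fs_to_volume = {"/u01": "vol_a", "/u02": "vol_a", "/u03": "vol_b"}
members = [
    (1, "/u01/redo01a.log", "/u01"),
    (1, "/u02/redo01b.log", "/u02"),
    (2, "/u01/redo02a.log", "/u01"),
    (2, "/u03/redo02b.log", "/u03"),
]
print(groups_sharing_a_volume(members, fs_to_volume))
```

The point of the example is that a filesystem-level view (two different mount points) looks correct, and the problem only appears once the mapping to storage volumes is taken into account.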
 
Let’s just breeze through some more sample issues. We can see examples of unnecessary replication of volatile information, such as tempdb, which needlessly suffocates your WAN and uses up off-site storage; swap file placement on slow disks; SAN I/O multi-pathing misconfiguration; and many others.
Let’s take a quick look at some of the other modules at our disposal.
 
 
The Reporting Module
The reporting module offers a rich set of views into various areas of IT - including in-depth storage optimization, comparison of production and standby host configuration, datacenter change logging, SAN I/O pathing configuration and many others.
All reports can be scheduled to be received by e-mail, and exported into Microsoft Office or PDF formats.
 
 
The SLA module
An additional module, which deserves a session of its own, lets you track your effective versus planned protection level. Let’s take a quick look.
You begin with a top view showing statistics about your recovery SLA. In this case – your RPO and retention. Let’s drill down into one of the Business Entities to get more details.  You can repeatedly zoom-in to get fine-grained detail.
Few tools but RecoverGuard can show you this level of detail. Notice that you not only see all of your replica sets, but also an indication of their validity, just by hovering over each copy. You can clearly see that this specific database is not well replicated. If we drill down into another database on the same server, say, db5, we can see that it is well protected. In addition, you can set policies that reflect your required RPO, retention, performance and so on, and RecoverGuard will generate alerts in any case of violation. To quickly review the root cause of an inconsistency or the nature of an SLA violation, you can click the signs under the “Replication SLA” and “Retention” columns near each item.
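An RPO policy check of this kind can be illustrated in a few lines of Python: a database violates its RPO when its most recent valid copy is older than the allowed window, or when it has no valid copy at all. This is a sketch under those assumptions, not the product’s actual policy engine; the function name, the timestamps and the database “db2” are hypothetical, and only “db5” appears in the demo.

```python
from datetime import datetime, timedelta


def rpo_violations(latest_valid_copy, required_rpo, now):
    """Return databases whose newest valid copy is older than their RPO."""
    violations = []
    for name, rpo in required_rpo.items():
        copied_at = latest_valid_copy.get(name)
        # No valid copy at all counts as a violation.
        if copied_at is None or now - copied_at > rpo:
            violations.append(name)
    return sorted(violations)


# Hypothetical data: db5 was copied 30 minutes ago, db2 four hours ago.
now = datetime(2024, 1, 1, 12, 0)
required_rpo = {"db5": timedelta(hours=1), "db2": timedelta(hours=1)}
latest_valid_copy = {
    "db5": datetime(2024, 1, 1, 11, 30),
    "db2": datetime(2024, 1, 1, 8, 0),
}
print(rpo_violations(latest_valid_copy, required_rpo, now))
```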
 
 
The Configuration Module
This is beyond the scope of this short demo, but let me just tell you that RecoverGuard is totally agentless and requires very little configuration. After 20 minutes of basic set-up it can immediately scan your datacenters and perform gap analysis, generate reports and create graphical maps of your environment.
Let’s just jump for a minute to the topology tab ([Click the tab]) – and use the incremental view to tidy up the diagram. Of course you can interactively drill down into any component.
Getting back to the configuration tab – it’s all done through a single wizard: your hosts are discovered, then your storage arrays, and so on.
This is where you also define your policies, manage users and roles, schedule reports and so on.
 
Conclusion
RecoverGuard is an easy-to-deploy and easy-to-use analytics tool that gets you back in control of your HA / DR environment. Bridging the gap between various IT silos, with an extremely wide knowledge base of more than 3,000 risk signatures, it allows you to keep your systems in top shape. It’s the equivalent of running a full-load DR test every day – without disrupting production.
RecoverGuard will show fast ROI in terms of resource optimization, elimination of downtime and data loss, and significant reduction in the effort of manual auditing and preparation for full-blown DR tests.
I hope you have enjoyed this presentation. Please contact us at our site, www.continuitysoftware.com for any additional information.
Thanks for watching this RecoverGuard product demo.
 

 
