About virtualization and disaster recovery

by Yaniv Valik on December 29, 2011

Virtualization and solutions such as VMware SRM certainly offer great advantages for DR testing. The assurance that the production and DR servers are 100% identical is one of the most appealing features of virtualization. Other benefits, such as the ability to run production and DR in parallel, the ability to run a DR exercise whenever you want, and the simplicity of virtualized servers, also advance the field of Disaster Recovery Testing. However, even with virtualization, successful recovery is far from a slam dunk. In the next paragraphs I’ll try to outline a few of the challenges around recovery in a virtualized environment.

Configuration errors and best practice violations may still render your replicated virtual machines or data corrupt or inconsistent, and thus unrecoverable. Very much like in the physical (pre-virtualization) world, if you do not follow the rules and diligently make sure that the implementation and day-to-day changes meet the guidelines of the different vendors, your recovery will be at risk. For example, a point-in-time copy (created with EMC TimeFinder, HP StorageWorks XP Business Copy or the like) taken while the source virtual machine was not shut down or suspended is at high risk of being inconsistent. Some of you may recall similar concepts for creating consistent images of databases such as Oracle (cold/hot backup), UDB (I/O suspension) and other DBMSs. Of course, this is just one example from a long list of prerequisites, guidelines and recommendations – and each vendor has its own list. On top of that, there are specific cross-vendor guidelines, e.g. NetApp and VMware, Hitachi and Hyper-V, etc. – but we’ll get to that later.
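
The shut-down-or-suspend rule can be enforced in automation by wrapping the copy in a guard that always resumes the source afterwards. What follows is only a minimal Python sketch: `suspend`, `resume` and `take_copy` are hypothetical callables standing in for whatever your hypervisor and storage tooling actually provide. They are not real vendor APIs.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def quiesced(suspend: Callable[[], None], resume: Callable[[], None]) -> Iterator[None]:
    """Suspend the source before the copy and always resume it afterwards."""
    suspend()
    try:
        yield
    finally:
        # Resume even if the copy fails, so production is not left suspended.
        resume()


def consistent_copy(suspend, resume, take_copy):
    """Run take_copy() only while the source is suspended.

    All three arguments are hypothetical callables supplied by the caller.
    """
    with quiesced(suspend, resume):
        return take_copy()
```

The point of the `finally` clause is that a failed copy should never leave the production virtual machine suspended.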

A DR exercise is still a complex operation. Yes, with tools such as VMware SRM, in theory a DR test is just a few clicks away. In reality, there remain many challenges that prevent frequent DR testing, such as:

Complete Prod/DR separation is difficult and mistakes gravely affect production

  • Network collisions, conflicts, etc.
  • Dependencies on physical elements (a file server or other sensitive non-virtualized application that at the same time interacts with virtualized components)

DR testing is more than just failing over – it’s a complex operation

  • Different teams must verify storage, servers, databases and applications are functioning properly
  • Real-life scenarios must be simulated, including peak load
  • Workstations must be manned with end users
  • It takes time
    • Problems must be resolved
    • Processes must be coordinated
  • It’s not enough to bring everything online; reasonable performance must also be tested and assured (see example ahead)

Manpower – as implied by the previous bullet – a DR exercise requires dedicated cross-domain human resources

  • BCP personnel, Project managers, IT managers,…
  • Storage administrators
  • Unix and Windows system administrators
  • Network administrators
  • Security personnel
  • Oracle DBAs, MS-SQL DBAs, …
  • Application owners – WebSphere, BEA, Exchange, Lotus,…
  • End users

Dependencies and overlap areas between different domains and areas of responsibility create vulnerabilities which jeopardize the ability to recover successfully. Virtual machines depend on correct storage and replication configuration. For example, a reduced RAID level configuration puts your VM at risk. Furthermore, if a VMFS is only partially replicated to the remote site, or if consistency groups do not include all required resources, data will be lost upon disaster. There are plenty of other examples of what could go wrong in the VMware-Storage overlap (is your remote ESX configured with the same multipath level as the production ESX? Same load balance algorithm? Queue depth? I can go on and on).

Other dependencies exist between databases and VMware – depending on your required level of recovery assurance, you may need to put the database in backup mode (Oracle lingo) while creating VMware or storage snapshots (unless you’re willing to settle for recovery-not-guaranteed crash-consistent copies…).

Another overlap area is between virtualized and non-virtualized environments. In the real world, not all assets are virtualized, and those that are not still interact with virtualized components. Hence, all the complexities of the physical disaster recovery drill still exist (some may say that having to deal with two types of environments creates an even greater challenge). Examples of such virtual-to-non-virtual relationships are:

  • Virtualized client accessing non-virtualized NFS/CIFS file server
  • Database on VM interacts (via DB links for example) with database on a physical server
  • Virtualized business line relies on data from another non-virtualized business line – or vice versa
  • Virtualized clients accessing non-virtualized applications (Exchange, Lotus, etc.)
  • Physical domain / DNS servers serving virtualized environments
  • And so on…
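
Some of the overlap checks described before this list, such as comparing production and DR host settings or confirming that a consistency group covers every required volume, boil down to simple dictionary and set comparisons. Below is a minimal Python sketch under stated assumptions: the setting names, volume names and sample values are all hypothetical, and collecting the real values from your hosts and arrays is left to your own tooling.

```python
from typing import Dict, List, Set, Tuple


def config_drift(prod: Dict[str, str], dr: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return settings whose values differ between a production and a DR host."""
    missing = "<missing>"
    drift = {}
    for key in sorted(set(prod) | set(dr)):
        prod_value = prod.get(key, missing)
        dr_value = dr.get(key, missing)
        if prod_value != dr_value:
            drift[key] = (prod_value, dr_value)
    return drift


def uncovered_volumes(required: Set[str], consistency_group: Set[str]) -> List[str]:
    """Return required volumes that are not part of the consistency group."""
    return sorted(required - consistency_group)


# Hypothetical sample values, for illustration only.
prod_host = {"multipath_paths": "4", "load_balance": "round-robin", "queue_depth": "64"}
dr_host = {"multipath_paths": "2", "load_balance": "round-robin", "queue_depth": "32"}

print(config_drift(prod_host, dr_host))
print(uncovered_volumes({"vmfs_data", "vmfs_logs"}, {"vmfs_data"}))
```

Any non-empty result from either function is a gap worth investigating before the next DR exercise rather than during it.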

To sum up – yes, virtualization is a game changer for Disaster Recovery Management (DRM). Nevertheless, many of the traditional BCP/DR challenges still exist, alongside several new challenges that have emerged as a result of using virtualization. Running a DR exercise is simple in theory but not in practice. To ensure successful recovery, an enterprise organization must invest significant time, money and human resources. Automation is the key. HA/DR monitoring solutions such as Continuity Software’s AvailabilityGuard/DR and Symantec’s Disaster Recovery Advisor (DRA) can give BCP and IT teams visibility into dependencies across virtualized and physical environments, along with automatic detection of availability and recovery vulnerabilities.

Yaniv Valik
VP Product Management & Customer Success at Continuity Software
