Using LVM mirroring for disaster recovery?
One of the disaster recovery solutions in use by many organizations is LVM mirroring and snapshots. Some of you may raise an eyebrow reading this, thinking “LVM mirroring? Over tens of kilometers?” The answer is “Yes”. While LVM mirroring was traditionally used locally to keep data highly available and to create copies of logical volumes, today more and more organizations choose it as the way to keep a synchronized copy of local data at the remote site. In other words, instead of using storage-based replication technologies, data is copied at the host level. Of course, to ensure high resiliency, reliability, availability and performance, data is still stored on SAN arrays such as EMC DMX, HP XP, IBM DS and so on. I am guessing this approach is becoming more popular because LVM mirroring is typically free or part of basic, already-purchased software packages, while remote replication software usually requires a separate, sometimes costly, license.
Unfortunately, LVM mirroring for disaster recovery is no less complex than storage replication. Many of the risks associated with storage replication are still relevant in the LVM mirroring scenario. Moreover, LVM mirroring introduces a few new risks that do not exist in storage replication.
Here are a few examples of configuration errors that often creep in over time and lead to data loss and extended recovery time when disaster strikes:
#1: Incorrect mirror configuration
The file system is striped and the source data is stored on several SAN volumes on the local SAN array. The DR mirror is also stored on several SAN volumes; however, one of these volumes is from a local SAN array. Dear oh dear… In case of disaster, this would result in complete data loss, since a striped copy with one of its volumes missing is not usable. Data would have to be restored from backup, resulting in RPO and RTO violations. This is a very common risk signature that RecoverGuard detects in datacenters implementing LVM mirroring and snapshots.
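To make the check concrete, here is a minimal sketch of how one might look for this risk. It assumes you already have an inventory that maps each logical volume to the SAN volumes backing its DR mirror, and each SAN volume to a site. The names `mirror_legs`, `volume_site` and `find_misplaced_mirror_legs` are hypothetical and made up for illustration; they are not part of any LVM tool or of RecoverGuard.

```python
from typing import Dict, List, Tuple


def find_misplaced_mirror_legs(
    mirror_legs: Dict[str, List[str]],
    volume_site: Dict[str, str],
    dr_site: str,
) -> List[Tuple[str, str, str]]:
    """Return (logical_volume, san_volume, site) for each mirror leg not on the DR site."""
    problems = []
    for lv, san_volumes in sorted(mirror_legs.items()):
        for san_volume in san_volumes:
            # A SAN volume missing from the inventory is reported as "unknown".
            site = volume_site.get(san_volume, "unknown")
            if site != dr_site:
                problems.append((lv, san_volume, site))
    return problems


# Hypothetical inventory: one leg of the DR mirror sits on the local array.
mirror_legs = {"lv_data": ["dr_vol_01", "dr_vol_02", "local_vol_07"]}
volume_site = {
    "dr_vol_01": "remote",
    "dr_vol_02": "remote",
    "local_vol_07": "local",
}

for lv, san_volume, site in find_misplaced_mirror_legs(mirror_legs, volume_site, "remote"):
    print(f"{lv}: mirror leg {san_volume} is on site '{site}', not on the DR site")
```

The hard part in practice is building and refreshing the inventory itself, which this sketch simply takes as given.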
#2: Missing mirrors
When using synchronous storage replication, replication pairs between the local and the remote array are often pre-configured. Hence, when a new storage volume is allocated and used on the production site, it is already being replicated and protected. With LVM mirroring, this is not the case. The administrator must keep track of every new logical volume and create the mirror each time. Some changes slip through, and the result is un-mirrored, unrecoverable production data. In addition, sometimes the administrator knowingly doesn’t create a mirror for a logical volume because it is currently unused. However, at a certain point the logical volume is put to use, and the administrator is either not aware of it or was not notified of the change (which reminds me of the post I’ve made regarding IT team coordination… check it out). The outcome is yet again complete data loss upon disaster.
#3: Not built for a large-scale datacenter
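A periodic check for this kind of drift can be sketched as follows. This is only an illustration under assumptions: the `LogicalVolume` record and its fields (`copies`, `in_use`) are hypothetical, and how you populate them from your volume manager is up to you.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LogicalVolume:
    name: str
    copies: int    # number of data copies; 1 means there is no mirror
    in_use: bool   # whether production data currently lives on the volume


def find_unmirrored(volumes: List[LogicalVolume], include_unused: bool = False) -> List[str]:
    """Return the names of logical volumes that have no mirror copy."""
    return [
        lv.name
        for lv in volumes
        if lv.copies < 2 and (lv.in_use or include_unused)
    ]


# Hypothetical inventory for illustration.
inventory = [
    LogicalVolume("lv_db", copies=2, in_use=True),
    LogicalVolume("lv_logs", copies=1, in_use=True),     # slipped through
    LogicalVolume("lv_spare", copies=1, in_use=False),   # knowingly left unmirrored
]

print(find_unmirrored(inventory))                       # ['lv_logs']
print(find_unmirrored(inventory, include_unused=True))  # ['lv_logs', 'lv_spare']
```

Reporting the unused volumes as well is a deliberate option here: a volume that is unused today is exactly the one that tends to be put to use later without anyone creating the mirror.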
One of the greatest obstacles to using LVM mirroring and snapshots is that they were never designed to be used on a large scale. As a result, there are no management tools that allow you to enforce policies, act on groups and so on. A few weak spots in this category are:
No federated consistency. With storage replication, one may create a disk group that will include many servers and will ensure I/O consistency (but not necessarily application level consistency). With LVM mirroring, this is not an option. Consistency is only guaranteed within a server.
Difficult to manage PiT copies. Establishing point-in-time copies (snapshots) requires heavy scripting, which in turn needs significant maintenance and care. Moreover, a large number of snapshots may have a grave impact on server performance.
#4: No true async mode
Most LVM software does not offer an asynchronous mirroring solution. Those that do usually rely on opportunistic/dirty mirroring, not committing to any SLA. Moreover, some of these solutions require significantly more storage (cascading async mirroring on top of short-distance sync mirroring, e.g. “bunker”). In today’s world, a good and reliable asynchronous data copy/replication tool is important. Due to performance considerations, synchronous replication is sometimes not an option. Moreover, synchronous mirroring/replication is by nature only applicable over a short distance. LVM mirroring may be even more sensitive to longer distances, since the source server itself needs access to the remote SAN array.
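To get a feel for why distance matters for synchronous writes, here is a back-of-the-envelope sketch. It assumes light travels through optical fiber at roughly 200,000 km per second and that every synchronous write has to wait for at least one round trip to the remote array. It is a lower bound only: switches, protocol overhead and the array itself add more, and the function name is made up for illustration.

```python
# Assumption: signal speed in optical fiber is roughly 200,000 km/s,
# which is 200 km per millisecond.
FIBER_KM_PER_MS = 200.0


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on the delay one round trip adds to a synchronous write."""
    return 2 * distance_km / FIBER_KM_PER_MS


for km in (10, 50, 100, 500):
    print(f"{km:>4} km: at least {min_round_trip_ms(km):.2f} ms added per write")
```

Under these assumptions, 100 km of fiber already adds at least 1 ms to each write before any equipment latency is counted.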
Other honorable mentions include site tagging management, complexities in root volume mirroring (boot from SAN) and the ability to mirror incompatible storage tiers (while some people consider this an advantage, it may lead to performance loss).
LVM mirroring is not a bad choice for a small datacenter or a specific business service with no strict performance requirements. However, take particular care to monitor and maintain the mirroring and snapshot configuration. Since every logical volume is managed separately, configuration drift is likely to occur, which will lead to loss of data and extended recovery time in the event of a disaster. Change management is important, even in medium-sized environments and certainly in larger datacenters. Monitoring and risk analysis tools such as RecoverGuard can help you detect configuration errors, such as those mentioned above, as they occur. If you find this interesting, visit our website for additional information.