Availability, Recoverability and Data Protection are critical to any enterprise. The cost of going without them, in lost capital and reputation, is unacceptable. Thus, significant time, resources and money are allocated to ensure that all business lines are highly available and can be recovered under various circumstances (from accidental file deletion to an earthquake). However, the datacenter is constantly changing, and despite the huge effort and world-class IT experts, new risks emerge on a regular basis. Traditional mitigation approaches fall short and regularly result in downtime, potential data loss and excessive operational effort, since:
• Discrete data is gathered by discrete systems, but none correlates all required layers: storage, OS, Database, replication, clustering, etc.
• Home-grown data collection and correlation is economically irrational
Using an HA/DR analytics solution, organizations can dramatically increase availability and recoverability levels on the one hand and save significant time and money on the other. An HA/DR analytics solution such as RecoverGuard by Continuity Software or DRA by Symantec analyzes thousands of potential risks by correlating the configuration of applications, databases, file systems, servers, storage, replication, clustering and “what’s between”. It is updated regularly, and many like to think of it as an “anti-virus for HA/DR”.
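To make the idea of cross-layer correlation concrete, here is a minimal sketch of one such check: walking from database files through file systems to storage devices and flagging files that sit on non-replicated storage. It is an illustration only, not how RecoverGuard or DRA work internally; the function name, the input shapes and the sample inventory are all assumptions made for this example.

```python
def unreplicated_database_files(db_files, fs_to_device, replicated_devices):
    """Map each database to the files that sit on non-replicated storage.

    db_files:           {database: [(file_path, file_system), ...]}
    fs_to_device:       {file_system: storage_device}
    replicated_devices: set of storage devices that have a replica
    All three inputs are hypothetical shapes chosen for this sketch.
    """
    risks = {}
    for db, files in db_files.items():
        for path, fs in files:
            device = fs_to_device.get(fs)
            # An unknown device is treated as a risk too: it cannot be verified.
            if device is None or device not in replicated_devices:
                risks.setdefault(db, []).append(path)
    return risks


# Made-up inventory: three layers that are usually tracked by separate tools.
db_files = {
    "sales": [("/u01/sales/data01.dbf", "/u01"), ("/u02/sales/redo01.log", "/u02")],
    "hr": [("/u01/hr/data01.dbf", "/u01")],
}
fs_to_device = {"/u01": "lun01", "/u02": "lun02"}
replicated_devices = {"lun01"}

risks = unreplicated_database_files(db_files, fs_to_device, replicated_devices)
print(risks)
```

In this made-up inventory, only part of the "sales" database is replicated, which is exactly the kind of gap that no single-layer tool would report on its own.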
So how can an HA/DR analytics solution help cut down HA/DR costs? It’s simple really. Most enterprises are overspending in the following three direct-cost areas:
• Cost of avoidable downtime
• DR testing operations expense
• HA/DR related sub-optimal resource utilization
Let’s explore each of these cost areas.
Cost of avoidable downtime. Unsuccessful cluster failovers, single points of failure in the storage network or multipathing, RAID level issues, risky layout of database files, suboptimal configuration of databases vs. file systems vs. storage leading to unacceptable performance… all of these and much more can be avoided by deploying an HA/DR analytics solution. If an hour of downtime costs 100K, a very serious cost reduction opportunity lies here.
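As a back-of-the-envelope illustration of that opportunity, the sketch below multiplies the hourly cost by the hours of downtime avoided. The 100K per hour is the figure used above; the four hours per year is a made-up input, and the function name is hypothetical.

```python
def avoidable_downtime_cost(cost_per_hour: float, avoided_hours: float) -> float:
    """Return the cost saved by preventing `avoided_hours` of downtime."""
    if cost_per_hour < 0 or avoided_hours < 0:
        raise ValueError("inputs must be non-negative")
    return cost_per_hour * avoided_hours


# 100K per hour comes from the text; 4 avoided hours per year is an assumption.
yearly_saving = avoidable_downtime_cost(100_000, 4)
print(yearly_saving)  # 400000
```

Plug in your own hourly cost and outage history to see what the number looks like for your environment.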
DR testing operations expense. The organization can become aware of its DR readiness before actually performing the test and failing over, and execute a DR drill only after resolving known recoverability issues. By doing so, significant time and resources are spared. Furthermore, by identifying new threats on the spot, as they emerge, resolution time and the manpower involved are kept to a minimum. Last night’s changes are still fresh and identifying the root cause is easy, unlike when an error surfaces in a yearly DR drill and no one remembers the specific change (one of many…) that was made months earlier and caused it. Moreover, with the reporting features and in-depth visibility into dependencies between production and DR systems, replication, cluster configuration (and so on), a DR test requires fewer resources from the various IT teams and much less manual labor from the BCP personnel.
HA/DR related sub-optimal resource utilization. While it is not the main purpose of HA/DR analytics, the data gathered by such tools allows them to identify savings opportunities. On the storage side, examples include allocated but unused devices, old replicas, and file systems or raw devices that are allocated to a database but hardly used. On the storage network side, replication bandwidth can be optimized by detecting excessive replication, swap replication, temp database replication and so on. Naturally, hidden savings opportunities exist in other layers as well.
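Two of these checks are easy to sketch once the inventory data is in one place. The example below flags devices that are allocated but hardly used, and replication of swap or temp database devices. The `Device` record, the 5% threshold and the sample inventory are assumptions made for illustration; a real tool would gather this data from the storage arrays and hosts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    # Hypothetical inventory record; field names are assumptions for this sketch.
    name: str
    allocated_gb: int
    used_gb: int
    replicated: bool
    purpose: str  # e.g. "data", "swap", "tempdb"


def find_saving_opportunities(devices, low_use_ratio=0.05):
    """Return (device name, finding) pairs for simple savings checks."""
    findings = []
    for d in devices:
        # Allocated capacity that is (almost) unused.
        if d.allocated_gb > 0 and d.used_gb / d.allocated_gb <= low_use_ratio:
            findings.append((d.name, "allocated but hardly used"))
        # Swap and temp database content rarely needs to reach the DR site.
        if d.replicated and d.purpose in {"swap", "tempdb"}:
            findings.append((d.name, f"{d.purpose} is replicated"))
    return findings


inventory = [
    Device("lun01", allocated_gb=500, used_gb=400, replicated=True, purpose="data"),
    Device("lun02", allocated_gb=200, used_gb=0, replicated=False, purpose="data"),
    Device("lun03", allocated_gb=64, used_gb=32, replicated=True, purpose="swap"),
]

findings = find_saving_opportunities(inventory)
for name, finding in findings:
    print(f"{name}: {finding}")
```

The threshold is deliberately a parameter: what counts as "hardly used" depends on the environment and on how much headroom the storage team wants to keep.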
HA/DR readiness verification is too complex a task to be performed manually without the right tools for the job (check out my “BCP is not different than other IT departments” post). Considering the different teams involved, the different layers, products and vendors, and the endless details embedded within each unique component, doing it manually is practically mission impossible. You know when a DR test starts but you don’t really know when it is going to end… and you don’t know whether data is recoverable at any given time or when the next downtime event will hit. Automation is the key to success and to significant cost reduction. HA/DR analytics solutions can dramatically increase control over HA/DR readiness and at the same time reduce costs considerably.