Recent outage events catch BCP with their pants down
There has been a flurry of recent reports regarding outages leading to downtime and data loss. Here is just a small sample:
- A USA Today article titled “United’s flight mess latest caused by computer glitches,” which also discusses similar outages at Alaska Airlines, Southwest Airlines, Spirit Airlines, and the Dutch city of Maastricht.
- An Exchange Online outage at none other than Microsoft.
- ‘Signal storm’ caused Telenor outages: 3 million users left with no service for 18 hours.
- TDS Telecom restores service in Hancock County: Firm’s office in Arcadia, Ohio was struck by lightning (can happen anywhere, really).
Clearly, all of the organizations above invest heavily in HA/DR solutions, so why were they caught unprepared when an outage occurred?
We’ve addressed this question in previous posts such as “Why DR testing doesn’t work,” “Think that little config change is minor? Think again,” and “BCP is not different than other IT departments.” The short answer is that today’s datacenters are too complex and dynamic to rely on periodic tests or audits.
Conclusion: real-time verification of HA/DR readiness is required.
How can this be accomplished? Considering the amount of labor put into a single audit/test, automation is a must. Disaster Recovery Management solutions such as RecoverGuard and DR Advisor can help you achieve the following benefits:
- Continuous non-intrusive end-to-end infrastructure scan for 24×7 visibility into DR risks
- Detection of HA/DR risks and verification that passive systems are ready for failover
- Ability to measure the de facto RPO and other HA/DR SLA metrics
- Comprehensive HA/DR readiness reporting
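To make the de facto RPO item above a little more concrete, here is a minimal Python sketch of one way such a metric could be computed. It is an illustration only and does not describe how RecoverGuard or DR Advisor work. It assumes you can periodically record, for a replicated system, the time of the check and the timestamp of the newest data that has reached the standby site; the names ReplicationSample, de_facto_rpo, and meets_target are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ReplicationSample:
    """One periodic observation of a replicated system (hypothetical structure)."""
    observed_at: datetime         # when the check ran
    last_replicated_at: datetime  # timestamp of the newest data present at the standby


def de_facto_rpo(samples):
    """Return the worst observed replication gap across all samples.

    If the primary had failed at the moment of a sample, roughly this much
    recent data would have been missing from the standby.
    """
    if not samples:
        raise ValueError("need at least one sample")
    return max(s.observed_at - s.last_replicated_at for s in samples)


def meets_target(samples, target_rpo):
    """True if the worst observed gap stays within the target RPO."""
    return de_facto_rpo(samples) <= target_rpo
```

Collected continuously rather than during a once-a-year test, samples like these would show whether the RPO you are actually getting matches the one written into the SLA.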