Stories We Tell, and the Real Reasons for IT Outages
IT outages seem more common than ever, as a quick online search will readily reveal. They also escape any clear classification: they span industry segments (we’ve seen dramatic outages in financial services, social media, airlines, telco, retail, etc.), they affect both new and well-established design patterns (from pure public-cloud infrastructure, to “traditional” IT, and any hybrid permutation in between), and they hit enterprises of all sizes.
The business implications of service disruptions also seem to be growing steadily. IT outages have always inflicted “direct” losses (e.g., lost revenue, lost productivity, recovery expenses), and these have become more pronounced as digital transformation tightly entwines critical business processes with IT. With readily available avenues to vent their frustration, end-users are more vocal than ever, and this does not escape the attention of the media, boards of directors, and regulators. We’ve seen at least one CEO step down in 2018 as a result of prolonged IT challenges, and have repeatedly heard of increasing regulatory fines and much more rigorous enforcement.
Public communication patterns have adapted as well. Only 3-5 years ago, it was not uncommon to encounter blissful avoidance of the subject or, at best, a laconic announcement by a company spokesman. The burden has since gradually shifted to senior IT execs, then to the CIO and, of late, to the CEO.
Stories people tell
When we look at the reasons given, we still see an interesting pattern of over-simplification (or naiveté?). Far too often it’s a “hardware issue,” the notorious “technical glitch,” or something similar.
Of course, such explanations are hardly satisfactory! As we’ll explain shortly, at best they expose just the tip of the iceberg, completely obscuring the real, deep-rooted cause (and at worst, they’re simply wrong…).
Broadly speaking, the triggers for outages are either physical (e.g., hardware failures, power outages, natural disasters) or “soft” (e.g., human error, software bugs, flawed processes).
Pinning the blame on physical triggers is almost pointless. For decades now, every mission-critical component in enterprise IT has been configured with full redundancy: from redundant power supplies and network connections at the server level, through redundant core networking and storage infrastructure, all the way to clustering, load-balancing, elastic computing, etc. Even if all of those layers of defense fail (for example, when an entire datacenter is lost, or a cloud provider goes down across an entire US coast), almost all enterprises also implement Disaster Recovery solutions that enable critical applications to restart quickly. So a hardware issue of any type cannot, in and of itself, account for a prolonged outage.
“Soft” triggers, while indeed more challenging to weed out, do not make convincing explanations either. Rigorous Quality Assurance testing is expected before any software update is performed. And because some issues will still escape testing, elaborate roll-back plans are made that allow reverting to a previous, tried-and-true environment.
So what are the real causes, and why use technology as a scapegoat?
Outages are, in fact, very similar to violent natural or biological phenomena, such as erupting volcanoes or unexpected heart attacks. They are almost always the culmination of multiple faults that were building up over time, lying in wait for the right set of circumstances to materialize. As an example, let’s look at an airline blaming a blown power switch for days’ worth of service disruptions. Let’s break it down: when the switch failed, multiple servers (and other gear) went down. That much is natural. However, these must have been configured for High Availability (clustered, load-balanced, etc.), right? So there had to be something wrong there in the first place. And what about Disaster Recovery? It should have allowed for a slightly longer, but still fast, recovery, right? So there must have been at least one fault there too… And what about testing? Enterprises do that frequently (some are obligated to by law). Why didn’t it reveal the underlying faults?
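The breakdown above can be sketched as a toy model. This is only an illustration of the reasoning, not a description of any real tool or of the airline’s actual environment: the Layer class, the walk_defenses function, and the specific latent faults listed are all hypothetical. The point it makes is that a physical trigger turns into a prolonged outage only when every layer of defense behind it already carries a hidden fault.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Layer:
    """One layer of defense (hypothetical model, for illustration only)."""
    name: str
    latent_fault: Optional[str] = None  # None means the layer is healthy


def walk_defenses(layers: List[Layer]) -> Tuple[Optional[str], List[Layer]]:
    """Walk the layers in the order they are supposed to kick in.

    Returns the name of the first healthy layer that absorbs the failure
    (or None if no layer holds), plus the layers that failed before it.
    """
    failed = []
    for layer in layers:
        if layer.latent_fault is None:
            return layer.name, failed
        failed.append(layer)
    return None, failed


# Assumed, made-up faults for the blown-power-switch scenario.
defenses = [
    Layer("High Availability", "cluster nodes share one power feed"),
    Layer("Disaster Recovery", "replica missing recent configuration"),
]

held_by, failed = walk_defenses(defenses)
if held_by is None:
    print("Prolonged outage: no layer absorbed the failure")
else:
    print("Failure absorbed by:", held_by)
for layer in failed:
    print("  latent fault in", layer.name + ":", layer.latent_fault)
```

In this model the switch itself never appears as a cause: remove either of the assumed latent faults and the same trigger is absorbed by a healthy layer.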
At Continuity, we have analyzed thousands of other downtime scenarios which, in one way or another, all boil down to the same fact: the signs were present for quite some time, but no one knew how to read them.
When under duress, it’s very human to lay the blame on “fate,” “nature,” or force majeure. In fact, much like most erupting volcanoes and heart attacks, outages can be detected in advance, given the right technology, processes and approach.
For years, Continuity Software has devoted its attention to building the tools, processes, and know-how to help enterprises take control of and assure IT resilience. In future posts we’ll discuss some of these in detail.