The outages demonstrate the serious risks of having no visibility into IT environment health
You’ve probably heard the phrase “too big to fail.” It’s used with respect to huge corporations whose failure would impact a nation’s economy. These words came to mind in the very different context of the March 13-14 Facebook outage, which was felt around the world and affected millions of Facebook, Instagram and WhatsApp users. Then in April, an outage of more than two hours again prevented access to this same “family of apps.” We ask, “Can this actually happen at Facebook?” The question implies recognition of Facebook’s unique status as a mega enterprise with a service availability commitment to 2.3 billion users, a commitment that, until recently, had been dependably fulfilled. How is it possible that Facebook was not resilient to this kind of mega-incident? The extent and duration of the outage surprised not only the average person; IT professionals were astounded as well.
Is Facebook assuring the resilience of its environment to the degree it should or could?
There’s no need to re-chronicle the approximately 14-hour March outage. It has been resolved and was reported to have been “a result of a server configuration change.” But again, we are stunned. We assume, we believe, we know, that at Facebook every mission-critical component in the IT environment is configured with full redundancy: from redundant power supplies and network connections at the server level, through redundant core networking and storage infrastructure, all the way to the use of clustering, load balancing, elastic computing, and so on. And yet, there was a very long outage.
Software triggers don’t make convincing explanations either. Before software updates or configuration changes are performed, rigorous QA testing is expected, and roll-back plans are prepared that allow reverting to a previous, tried-and-true environment.
How could Facebook have known of the potential faults in its IT environments?
We don’t have the inside story on what made this server configuration change so ill-fated. Nevertheless, we can say with a fairly high degree of certainty that if Facebook had had a proactive resilience assurance solution in place, it would have helped identify the point(s) of failure in its IT configurations that led to the outage before it occurred. That is, the outage could have been prevented.
At Continuity Software, we have analyzed thousands of other downtime scenarios that, in one way or another, all boil down to the same fact: the signs were present for quite some time, but no one knew how to read them. We have devoted our attention to building the tools, processes, and know-how to help enterprises take control of and assure IT resilience.
Recommended resilience assurance practice
Root-cause analysis following an outage generally identifies what went wrong, and it is a necessary step. To assure resilience more reliably, IT teams should also continually analyze what could potentially trigger downtime or service disruption, so that problems can be repaired before they cause damage. This is precisely what our AvailabilityGuard NXG™ solution does.
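To make the idea of proactive analysis concrete, here is a minimal Python sketch of one such check: scanning a configuration inventory for components that lack a redundant peer. It is purely illustrative and does not depict how AvailabilityGuard NXG or Facebook's tooling works; the `Component` record, the role names, and the sample inventory are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """Hypothetical inventory record; not taken from any real product."""
    name: str
    service: str
    role: str  # e.g. "power_supply", "network_link", "storage_path"


def find_single_points_of_failure(components, min_instances=2):
    """Return sorted (service, role) pairs with fewer than min_instances components."""
    counts = {}
    for c in components:
        key = (c.service, c.role)
        counts[key] = counts.get(key, 0) + 1
    return sorted(key for key, n in counts.items() if n < min_instances)


# Made-up sample data: the "web" service has only one network link.
inventory = [
    Component("psu-a", "web", "power_supply"),
    Component("psu-b", "web", "power_supply"),
    Component("nic-a", "web", "network_link"),
    Component("path-a", "db", "storage_path"),
    Component("path-b", "db", "storage_path"),
]

print(find_single_points_of_failure(inventory))
```

A check like this only flags roles that appear in the inventory at least once; a real assessment would also need to know which roles each service is expected to have, and would run continually as configurations change.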
We are certain that Facebook is taking a good look at how it assures resilience and prevents downtime. We offer a proven and holistic approach used by institutions that also have 24×7 uptime commitments to millions of users worldwide.