[vc_row][vc_column][vc_column_text]In a recent blog post, we published a noteworthy finding: over half of the companies we surveyed had an outage in the past 3 months. Such outages continue to occur with alarming frequency, even in many of the world’s largest and best-run shops. How is this possible? Can we not do more to avert such outages? And if we can, why aren’t we doing so?
Life in a Seismic Zone
As I pondered these questions, I was reminded of life in California in the late ’90s. At the time, I was working for a leading software company about an hour’s drive south of Los Angeles and not far from the coast. We were headquartered on the top floors of a sleek, modern office building. One week we were visited by John, a colleague from our London office. On the second morning of his visit he approached me, looking rather unsettled.
“Gil,” he asked tentatively in a quiet voice, “is the floor shaking?”
Of course, I should have quickly pulled him aside, explained to him about microtremors, and reassured him that our building (like other modern high-rises in the area) had been specially engineered to absorb such minor seismic shocks. So, while we locals had learned long ago to tune out these occasional vibrations, the floor probably was shaking. But it was all quite normal. Nothing to worry about.
But I wasn’t quick enough. A nearby colleague had overheard everything.
“Hey Pete,” the kibitzer called out across our open space, “John here wants to know if the floor is shaking.”
“Shaking?” retorted Pete, pounding the floor ostentatiously with his foot. “Nope, seems quite solid to me. I guess someone had one too many in the bar last night, eh?”
Too Many Near Misses
Almost every day I speak to companies about their IT infrastructure concerns. I hear many stories not just of outages, but also of near misses. One manager told me about a recent planned cluster failover. During a scheduled maintenance window, his team switched the cluster over to the passive node so that they could apply an upgrade to the active side. The switchover attempt failed. After scrambling for a couple of hours, the team found and fixed a misconfiguration on the passive node, and the planned maintenance went ahead.
“We ended up overrunning the scheduled window,” confessed the manager. “But we still counted ourselves lucky. Had this been an unplanned failover, we would have had a major outage on our hands.”
The post-mortem revealed that the misconfiguration on the passive node had lain dormant for many months. And this wasn’t the first such near miss, either.
Another infrastructure executive shared with me the travails of her push towards converged infrastructure.
“The initial implementation and onboarding went relatively well,” she explained. “But now that the various teams are making routine production changes, we’re having trouble keeping the environment as resilient as it was on Day One. It’s proving a longer and more painful learning curve than we expected. I realize that a major outage is just a matter of time.”
So the signs – those microtremors – are everywhere. Are we listening?
Poor John. We sure did have a good laugh at his expense. But was the joke really on him – or on us?