Isn’t Black Friday great! If you shop online, you get the same great discounts as in brick-and-mortar stores and you avoid the stampeding crowds. Well, that’s the idea, but as we just saw on Friday, the websites of major retailers such as Walmart, Lowe’s, Lululemon, J. Crew and many others couldn’t handle the sales traffic. Too bad! They lost hundreds of millions of dollars, not to mention their good reputation.
In addition to hoping, praying and advertising for the impressive increase in online sales that did occur this past week, retailers should probably also have made sure their sites were truly resilient.
The big question is: why didn’t redundant servers and systems kick in to meet the anticipated onslaught of customers descending on their sites?
We know that online retailers typically invest tens of millions of dollars in various redundancy and High Availability (HA) technologies. They have continuous availability and resilience architecture and systems in place to target 100% uptime and to avoid incidents exactly like these.
Typically, architecture which aims for resilience is built as follows:
Power – should arrive from multiple sources and have no single point of failure.
Networking – should arrive from multiple sources and have no single point of failure.
Compute – each system/function should be able to relocate to another server in case of server failure.
Storage – should have complete redundancy in the form of RAID or similar fault-tolerant storage. This should include the ability to roll back in time and synchronous replication (data copy) to the other datacenters.
Systems/Software – Software that’s designed to work in Continuous Availability mode.
This complicated resilience architecture is built so that if an outage or freeze like those on Friday occurs, you:
We don’t know how each of these retailers’ architectures and environments is built, or what went wrong. But we do know that if they had been correctly planned and built, and maintained with automated resilience verification tools, Friday’s scenarios wouldn’t have occurred.
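To make the idea of automated verification a little more concrete, here is a minimal sketch of one such check, assuming you keep an inventory of which independent sources back each layer of the stack. The `Layer` class, the `single_points_of_failure` function and every source name in the inventory are hypothetical and purely illustrative. Real verification tools look at far more than a simple count.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Layer:
    """One layer of the stack (hypothetical model, for illustration only)."""
    name: str
    sources: Tuple[str, ...]  # independent feeds, links, hosts, arrays or instances


def single_points_of_failure(layers: List[Layer]) -> List[str]:
    """Return the names of layers backed by fewer than two distinct sources."""
    return [layer.name for layer in layers if len(set(layer.sources)) < 2]


# Hypothetical inventory: all names below are made up for this example.
inventory = [
    Layer("power", ("utility-feed-a", "utility-feed-b", "generator")),
    Layer("networking", ("isp-1", "isp-2")),
    Layer("compute", ("host-1", "host-2", "host-3")),
    Layer("storage", ("array-1",)),  # only one array
    Layer("software", ("app-instance-1", "app-instance-1")),  # same instance listed twice
]

print(single_points_of_failure(inventory))
```

In this made-up inventory, the check flags storage and software. Storage has only one array, and software lists the same instance twice, which is why the function counts distinct sources rather than entries. Run on a schedule, even a simple check like this could catch redundancy that has quietly eroded as the environment changes.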
We don’t want to be accused of being Monday-morning quarterbacks, especially since this is what we’ve been saying all along. If retailers and other enterprises want to ensure uptime, the resilience of the IT environment must be taken on as a comprehensive and holistic project. And resilience is a very achievable goal.
Resilience is also a very important goal because hybrid IT environments are becoming more complex, with more hardware and software technologies and more interconnections between them. As a result, it’s more difficult to avoid single points of failure across your IT stack and to maintain the highest levels of IT resilience.
Good luck on Cyber Monday!