The Lesson of the Black Friday Outages: IT Resilience is Paramount!
Isn’t Black Friday great? If you shop online, you get the same great discounts as in brick-and-mortar stores and you avoid the stampeding crowds. Well, that’s the idea, but as we just saw on Friday, the websites of major retailers including Walmart, Lowe’s, Lululemon, and J. Crew, among many others, couldn’t handle the sales traffic. Too bad! They lost hundreds of millions of dollars, not to mention their good reputation.
In addition to hoping, praying and advertising for the impressive increase in online sales that actually did occur this past week, retailers probably should have also been making super sure their sites were truly resilient.
The big question is: why didn’t redundant servers and systems kick in to handle the anticipated onslaught of customers descending on their sites?
We know that online retailers typically invest tens of millions of dollars in various Redundancy and High Availability (HA) technologies; they have continuous availability and resilience architecture and systems in place to target 100% uptime and to avoid incidents exactly like these.
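To make the uptime goal concrete, here is a minimal sketch of the standard arithmetic for redundant components. It assumes that failures are independent and that any one healthy copy is enough to serve traffic, which real environments only approximate. The function names and the 99% figure are ours, chosen for illustration, not measurements from any retailer.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` copies when any one healthy copy suffices.

    Assumes failures are independent, which is an idealization.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The whole group is down only when every copy is down at once.
    return 1.0 - (1.0 - single) ** replicas


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per (non-leap) year."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Illustrative figure only: one component that is up 99% of the time.
    for n in (1, 2, 3):
        a = combined_availability(0.99, n)
        print(f"{n} replica(s): availability {a:.6f}, "
              f"about {downtime_hours_per_year(a):.3f} hours down per year")
```

Under these assumptions each extra replica shrinks expected downtime sharply, but only if no hidden shared dependency ties the copies together.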
Typically, architecture which aims for resilience is built as follows:
- Geographical redundancy – At least 2 (and typically 3) datacenters with failover capability and a synchronous copy of the data at all times.
- In each datacenter, complete redundancy of all layers:
  - Power – should come from multiple sources, with no single point of failure.
  - Networking – should also come from multiple sources, with no single point of failure.
  - Compute – each system/function should be able to relocate to another server in case of server failure.
  - Storage – complete redundancy in the form of RAID or similar fault-tolerant storage. This should include the ability to roll back in time and synchronous replication (data copy) to the other datacenters.
  - Systems/Software – software that’s designed to work in Continuous Availability mode.
- Automated Resilience Assurance Software – verification software (or, in some legacy cases, manual processes) that makes sure there are no single points of failure at any of the above levels and that all systems, storage, servers, etc. are configured correctly to keep working if one or more components fail for whatever reason.
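The verification idea in the list above can be sketched in a few lines. This is a toy illustration, not how any particular resilience assurance product works: the `Component` record, the layer names, and the `min_redundancy` threshold are all hypothetical, and a real tool would discover the inventory automatically and check configuration as well as counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One item in a hypothetical infrastructure inventory."""
    datacenter: str
    layer: str          # e.g. "power", "network", "compute", "storage"
    name: str
    healthy: bool = True


def find_single_points_of_failure(components, min_redundancy=2):
    """Return (datacenter, layer) pairs with fewer healthy components
    than `min_redundancy`.

    Note: a layer missing from the inventory entirely is not detected here.
    """
    healthy_counts = {}
    for c in components:
        key = (c.datacenter, c.layer)
        healthy_counts[key] = healthy_counts.get(key, 0) + (1 if c.healthy else 0)
    return sorted(key for key, n in healthy_counts.items() if n < min_redundancy)


if __name__ == "__main__":
    inventory = [
        Component("dc1", "power", "feed-a"),
        Component("dc1", "power", "feed-b"),
        Component("dc1", "network", "isp-a"),           # only one uplink
        Component("dc2", "power", "feed-a"),
        Component("dc2", "power", "feed-b", healthy=False),
    ]
    for datacenter, layer in find_single_points_of_failure(inventory):
        print(f"Single point of failure: {layer} in {datacenter}")
```

Running a check like this continuously, rather than once at design time, is what catches the redundancy that quietly erodes as configurations drift.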
This complicated resilience architecture is built so that if an outage or freeze like those on Friday occurs, you can:
- Immediately roll back your application, if the cause is an application feature that wasn’t tested thoroughly enough before it moved to production.
- Or, take advantage of your redundant environment and move everything there until you figure out the root cause of the problem.
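The two responses above amount to a simple decision rule, sketched below. Everything here is an assumption made for illustration: the 30-minute "recent deploy" window, the `standby_healthy` flag, and the third `ESCALATE` outcome (for when neither option applies) are ours, and a real runbook would weigh far more signals.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    ROLL_BACK = "roll back the latest release"
    FAIL_OVER = "move traffic to the redundant environment"
    ESCALATE = "neither option is available; escalate to on-call engineers"


def choose_response(
    minutes_since_last_deploy: Optional[float],
    standby_healthy: bool,
    deploy_window_minutes: float = 30.0,   # hypothetical threshold
) -> Action:
    """Pick a first response to an outage.

    `minutes_since_last_deploy` is None when no recent release is known.
    """
    # An outage shortly after a release points at the release itself.
    if (minutes_since_last_deploy is not None
            and minutes_since_last_deploy <= deploy_window_minutes):
        return Action.ROLL_BACK
    # Otherwise, lean on the redundant environment while investigating.
    if standby_healthy:
        return Action.FAIL_OVER
    return Action.ESCALATE


if __name__ == "__main__":
    print(choose_response(10, standby_healthy=True).value)
    print(choose_response(None, standby_healthy=True).value)
    print(choose_response(600, standby_healthy=False).value)
```

The point of the sketch is that both branches depend on preparation: rollback requires a known-good previous release, and failover requires a standby that has been verified to actually work.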
We don’t know how each of these retailers’ architecture and environment is built or what went wrong, but we do know that if an environment is correctly planned and built, and maintained with automated resilience verification tools, scenarios like Friday’s wouldn’t occur.
We don’t want to be accused of being Monday-morning quarterbacks, especially since this is what we’ve been saying all along. If you, retailers, or any other enterprise want to ensure uptime, resilience of the IT environment must be taken on as a comprehensive and holistic project. And resilience is a very achievable goal.
Resilience is also a very important goal because hybrid IT environments are becoming more complex, with more hardware and software technologies and more interconnections between them. As a result, it’s more difficult to avoid single points of failure across your IT stack and maintain the highest levels of IT resilience.
Good luck on Cyber Monday!