You know how at the end of every year lists are created intending to reflect, usually, 10 of the year’s best examples of a particular phenomenon or event? Well, we’ve also compiled a list but ours has a couple of twists. One, while most lists are described as the “best of,” ours can be described as the “worst of.” Two, our list doesn’t have 10 items, but 19, in line with the number of years into the 21st century. So, without too much further ado, we present to you our list of 19 of the worst IT outages of 2019. But first…
What makes an outage bad enough to warrant a place on our list? We settled on three criteria: 1. length of the outage; 2. number of people/systems affected; and 3. cost of the outage. We’ll admit, though, that a couple made it onto the list because they were unusual, groundbreakers of a sort.
Outages seem to escape any clear classification; they span industry segments (we’ve seen dramatic outages in financial services, social media, airlines, telco, retail, etc.), they cover both new and well-established infrastructures (from pure public-cloud infrastructure, to “traditional” IT, and any hybrid permutation in between), and they impact enterprises of all sizes. In other words, no one, not even cloud providers themselves, is immune to outages!
To help you place the outages in context, we’ll review the causes that brought whole environments down. Generally, the outages fall into three categories: technical software or hardware issues; unusually high traffic or load; and cyberattacks and/or ransomware.
IT outages due to technical software or hardware issues
As our table below shows, software or hardware issues are the key reasons organizations experience outages/unavailability and performance disruptions. Not every enterprise tells the public why its site went down; many issue an uninformative explanation, attributing the outage to a software or computer “glitch.” Usually, “glitch” can be interpreted as a misconfiguration or a failure of redundant systems to fail over (see our post on this). As a rule, when a system becomes unavailable, a secondary, redundant site should take over operations. Many outages occur when the transition to the secondary site fails to take place.
No. | Name | Date | Description |
1 | CenturyLink (communication, network and related services) | (very end of) December 2018 | An outage affected 22 million subscribers in 39 states. Some 17 million customers across 29 US states were unable to reach emergency 911 services, and at least 886 calls to 911 were not delivered. These subscribers and others in the UK and Singapore lost connectivity for two days. Additionally, customers could not make ATM withdrawals, access sensitive patient healthcare records, and more. The outage was attributed to equipment failure exacerbated by a network configuration error; redundant systems did not take over. (In 2015, CenturyLink was fined $16m for a six-hour 911 outage.) |
2 | Facebook | March 2019 | Facebook’s first, but not only, outage of 2019 lasted 14 hours and was reportedly the result of a “server configuration change.” Things happen, but why didn’t redundant systems take over? See our post on the FB outage. |
3 | AeroData (weight and balance calculations for flight planning) | April 2019 | A “mere” 40-minute outage delayed close to 3,000 flights. Affected airlines included Southwest, SkyWest, United, Delta, United Continental, JetBlue and Alaska Airlines. The outage was referred to as a “technical issue.” Although recovery was fairly quick, damage was significant. Was the outage due to a misconfiguration? In any case, redundant systems did not kick in. |
4 | Microsoft Azure | May 2019 | A nearly three-hour global outage affecting core Microsoft cloud services, including compute, storage, an application development platform, Active Directory and SQL database services. Cloud-based applications, including Microsoft 365, Dynamics and Azure DevOps, were also impacted. Microsoft stated that the outage was caused by “a nameserver delegation change affecting DNS resolution, harming the downstream services” and occurred “during the migration of a legacy DNS system to Azure DNS.” |
5 | Salesforce | May 2019 | A 15-hour global outage triggered by a permissions failure that allowed users of Salesforce’s Pardot marketing automation software to see and edit the entirety of their company’s data on the system. Salesforce cut access twice to the larger Salesforce Marketing Cloud to stop exposure of sensitive information and to handle what was discovered to be a database script error. Sales agents and marketers around the world lost access to customer information. Restoration of permissions was not simple: customers’ admins had to set up permissions again; some could do so automatically, while others had to restore them manually. Here’s our take on the outage. |
6 | Google Cloud Platform | June 2019 | A four-hour outage affecting many millions of users, including tech brands that use Google Cloud as well as Google’s own services such as YouTube, Gmail, Google Search, G Suite, Google Drive, and Google Docs. The problem occurred during maintenance operations and was caused by a software bug combined with two misconfigurations, which Google characterized as “normally-benign” misconfigurations. In our experience, misconfigurations are never benign. If they don’t immediately cause an outage, they eventually will. We referred to this in a post earlier this year. |
7 | Verizon | June 2019 | A roughly three-hour worldwide outage of major websites like Google, Amazon, and Reddit did not originate with Verizon, but it was the company that allowed the fault to propagate. A misconfiguration in the routing optimization software of a small internet service provider led to incorrect routes that were eventually taken up by Verizon, which did not have software in place to block and filter them. These faulty routes caused massive volumes of traffic to be directed through small networks not equipped to deal with it, leading to packet loss, unavailability, and disruption of services at major websites. |
8 | Slack | June 2019 | Seven hours of connectivity problems at the extremely popular internal communications platform used by tens of millions of office workers worldwide. Some of the company’s servers became unavailable due to “glitches” that ultimately left users unable to connect and degraded job-processing performance for many hours, during which there was a 10-25% job error or failure rate. |
9 | Facebook | July 2019 | As Facebook explained, for nearly a day “many people and businesses experienced trouble uploading or sending images, videos and other files on our apps.” Due to the nature of this disruption, those mainly affected were Instagram’s more than one billion users. FB further disclosed that during “one of our routine maintenance operations, we triggered an issue that is making it difficult for some people to upload or send photos and videos.” Sounds rather similar to the reason given for the March 2019 FB outage. |
10 | Comcast | August 2019 | Xfinity users across the United States experienced an outage in their high-speed internet services during the night (prime viewing hours). The outage was due to “routine maintenance” but problems continued longer than originally predicted. |
11 | British Airways | August, September & November 2019 | Tens of thousands of passengers were stranded in August in cities around the world due to the cancellation of about 130 flights and the delay of 200. In September, 120 flights were delayed and 300 cancelled, and in November, 114 flights were delayed, some by as much as 22 hours. Malfunctions in various interdependent new and legacy systems were responsible for the chaos experienced by the airline and its passengers. We shed light on the circumstances leading up to the kinds of outages seen at BA and other airlines. |
12 | Facebook & Instagram | Thanksgiving: November 2019 | On Thanksgiving, one of the most social and sharing of days, users of Facebook’s family of apps across the United States and central Europe experienced problems and were unable to post to FB or view stories on Instagram, the app which seemed to be the most severely affected. At one point during the day, there were more than 12,000 reports of problems. The incident was described as a major outage, and the company traced the problem to “an issue” in one of its central software systems. |
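Several entries above end with the same question: why didn’t redundant systems take over? The following is a minimal Python sketch of the failover decision, not a description of how any company listed here actually implements it. The names `Site`, `choose_active_site` and `misconfigured_probe` are hypothetical and exist only for illustration. It shows how a misconfigured secondary turns a primary failure into a full outage.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Site:
    """A hypothetical site (primary or secondary) with a health probe."""
    name: str
    is_healthy: Callable[[], bool]


def choose_active_site(sites: Sequence[Site]) -> Optional[Site]:
    """Return the first healthy site in priority order, or None if all fail."""
    for site in sites:
        try:
            if site.is_healthy():
                return site
        except Exception:
            # A probe that raises is treated the same as an unhealthy site.
            continue
    return None


def misconfigured_probe() -> bool:
    # Illustrative only: simulates a secondary site that cannot take over,
    # for example because its configuration drifted from the primary's.
    raise RuntimeError("secondary site misconfigured")


primary = Site("primary", lambda: False)            # the primary is down
secondary = Site("secondary", lambda: True)         # a working standby
broken_secondary = Site("secondary", misconfigured_probe)

active = choose_active_site([primary, secondary])
print(active.name if active else "outage")          # prints "secondary"

active = choose_active_site([primary, broken_secondary])
print(active.name if active else "outage")          # prints "outage"
```

The second call is the case that matters: the redundancy exists on paper, but nothing verified ahead of time that the secondary could actually take over.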
IT outages due to unusually high demand / traffic / load
When sites experience unusually high traffic and do not have enough bandwidth, and/or extra load fails to be diverted to the additional servers enterprises undoubtedly have available, an outage occurs. These kinds of outages occur at all types of sites and in different industries. Outages can grab headlines when they follow a great buildup that creates expectations among users, for example, on Black Friday. But man/woman does not live by shopping alone, and not being able to access critical and potentially life-saving information also angers people.
Both online stores and online customers look forward to special sale days. The stores rake in many millions of dollars in sales while shoppers enjoy deep discounts on desired items. Online sites are deluged by customers requesting access, but some apparently don’t have the bandwidth in place to accommodate the amount of traffic they receive, causing their sites to crash. In 2018 we reported on the major retailers that saw their sites go down and missed out on hundreds of millions of dollars in sales. These included Walmart, Lowe’s, Lululemon, J. Crew, and many others. It’s been estimated that for every hour the site is down, a retailer will lose 4% of the day’s sales.
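To put the 4%-per-hour estimate in concrete terms, here is a minimal sketch of the arithmetic. It assumes the loss scales linearly with downtime and is capped at the full day’s sales; both are simplifying assumptions of ours, not part of the cited estimate. The function name and the $10 million figure are hypothetical.

```python
def estimated_lost_sales(daily_sales: float, hours_down: float,
                         loss_rate_per_hour: float = 0.04) -> float:
    """Estimate sales lost to downtime, capped at the full day's sales.

    Assumes a linear loss of `loss_rate_per_hour` of the day's sales for
    each hour the site is down (the 4% rule of thumb quoted above).
    """
    if daily_sales < 0 or hours_down < 0:
        raise ValueError("daily_sales and hours_down must be non-negative")
    lost_fraction = min(1.0, hours_down * loss_rate_per_hour)
    return daily_sales * lost_fraction


# Hypothetical retailer expecting $10 million in sales on a sale day,
# with the site down for three hours: about 12% of the day's sales.
loss = estimated_lost_sales(10_000_000, 3)
print(f"Estimated loss: ${loss:,.0f}")
```

Under these assumptions, a three-hour outage costs this hypothetical retailer roughly $1.2 million.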
This year, H&M experienced outages on Thanksgiving eve and Thanksgiving Day while Nordstrom Rack shoppers reported technical issues and Home Depot had frustratingly slow load times. We can expect more reports. In the meantime, the most upsetting outage was at Costco.
13 | Disney+ | Launch day: November 2019 & December 2019 | The highly anticipated start of Disney+ left millions of viewers worldwide with a blank screen. The streaming company bought by Disney to handle the technical aspects of streaming to tens of millions of people globally apparently could not handle the traffic. And less than a month later, streaming was unavailable again, long enough to garner 2,500 complaints from viewers worldwide, though this time the problem was resolved fairly quickly. |
14 | Costco | Thanksgiving: November 2019 | The most reported outage thus far of Thanksgiving. Costco, an umbrella/aggregator for more than 1,000 retailers, experienced an outage of more than 16 hours, affecting approximately 2.65 million consumers. The estimated loss in sales was about $11 million. Why didn’t Costco’s secondary site take over? |
15 | Dexcom (glucose-monitoring software for diabetics) | December 2019 | A four-day outage affecting diabetics using Dexcom’s app to keep track of glucose levels. Diabetic children, in particular, need intensive monitoring. The outage was caused by a server overload “due to an unexpected system issue that generated a massive backlog, which our system was unable to sufficiently handle.” Did failover to redundant systems fail? |
IT outages due to cyberattack and/or ransomware
A recent survey found that more senior managers at enterprises were worried about a cyberattack (53%) than an IT outage (36%) or network failure (24%). Twenty percent of managers in the UK revealed they had already experienced 3-4 cyberattacks in the preceding 12 months and suffered the usual fallout from such attacks.
Ransomware has caused significant problems for major corporations and cut off access to core data. The financial industry recently simulated a fictional scenario highlighting what would happen if a ransomware attack targeted the biggest financial institutions, taking critical parts of the global financial system offline.
Indeed, enterprises and organizations of different types, from shipping corporations and telecoms to municipalities, healthcare providers and hospitals, have been suffering from cyberattacks, frequently in the form of ransomware attacks. Malicious actors attack targets in order to sow disruption and to extort large payments from victims in exchange for restoring access to their key data. To guard against attacks, organizations must build cyber resilient environments and have cyber recovery solutions in place. Solutions for cyber recovery differ from those for disaster recovery, which are appropriate for catastrophes resulting from events such as earthquakes, floods, or human error.
The proliferation of attacks has already led security experts and government regulators to declare that it is no longer sufficient to implement comprehensive security solutions alone. Enterprises must now ensure that their IT environment is resilient, which means that in the event of a cyber incident or attack, the enterprise must be able to continue delivering services. Organizations must ensure their critical data is not compromised and is recoverable, and to do so, they must build cyber resilient environments protected by cyber-resilience and recovery solutions.
16 | Norsk Hydro | March 2019 | Apparently the largest ransomware attack of 2019, it impacted 22,000 computers in 40 countries, across 170 sites and cost the global aluminum producer, Norsk Hydro, a reported $75 million to recover from. LockerGoga was the suspected ransomware used and it’s believed the attack also involved “Active Directory – used for authenticating and authorizing all users and systems on a Windows domain type network.” Norsk Hydro did not pay the ransom to receive decryption software. It took months to restore its files and verify they were malware-free, during which time production was offline. See some of our recommendations for building a cyber resilient environment. |
17 | City of Baltimore | May 2019 | RobbinHood ransomware encrypted the city’s hard drives, locking access to data. Many systems were attacked, including payments, email, real estate transfers, and dispatcher services; hospitals, factories producing vaccines, airports and ATMs were also affected. The city did not pay the ransom of about $76,000 (13 bitcoins) but is investing a minimum of $10 million to rebuild a cyber-secure system. The city lost roughly $8M due to downtime until it was somewhat operational again, a process that took several months. As was pointed out in writing about the Baltimore attack, even a single system vulnerability can grant access to an attacker. |
18 | ConnectWise | May 2019 | ConnectWise offers a platform of software built for technology solution providers. This cyberattack, their third of 2019, occurred in Europe through a breach of an offsite computer used for cloud performance testing. Other than being forced to go offline, they suffered minimal fallout from the attack. However, in April, attackers used its software to seed 100 of Indian IT company Wipro’s servers and distribute their attack. In February, an integration between ConnectWise and rival MSP platform Kaseya was exploited by cyber criminals. ConnectWise’s CEO stated that MSPs (managed services providers) were increasingly becoming cyberattack targets. |
19 | Capital One | March to July 2019 | It took four months for Capital One, the bank holding company specializing in credit cards and auto loans, to discover their data storage had been breached. Personal and financial data of 100 million customers were stolen. Along with credit card numbers, birth dates, addresses, names, phone numbers, transaction history, 140,000 Social Security numbers and 80,000 bank account numbers were also stolen. All this sensitive information was stored on Amazon S3 which the hacker accessed via a misconfiguration in the web application firewall. |
Summary
The IT outages reviewed here impacted hundreds of millions of people worldwide. Many other IT outages occurred for a variety of reasons. These were simply some standouts. It’s clear that to avoid such incidents, organizations should implement solutions which proactively and continuously keep IT and InfoSec up to date on the status of IT resilience and cyber resilience.
Our solutions help organizations around the world to prevent the vast majority of outages; 6 of the top 10 banks in the U.S. use our resilience solutions.
Contact us to learn more.