It’s been an unfortunate week for US air travel. On Sunday, Delta experienced the kind of multiple-system computer outage that airlines dread. As a result, flights were grounded for over five hours. The outage led to the cancellation of 280 flights and caused the delay of 223 flights over the course of two days.
The outage had knock-on effects: flight cancellations failed to show up on the airline’s website and app, passengers were delayed after landing, agents resorted to checking passengers in manually, and arriving flights were held on the tarmac for hours.
Delta is not alone. On January 23, a computer glitch in United Airlines’ systems caused the airline to ground all domestic flights for an hour. It was quickly followed by a blizzard of angry tweets from delayed passengers.
Delta had an even larger outage in August last year. On that occasion it led to the cancellation of more than 200 flights over the course of several days. Three weeks earlier, a computer outage at Southwest Airlines caused the cancellation of over 1,000 flights.
Airline IT systems are no more complex than those of the banking or insurance sectors. In fact, they share common complexity issues that originate in their multi-layered architecture, with a combination of new and legacy technologies.
Speaking to the Wall Street Journal, Gill Hecht, Continuity Software’s CEO, explained it this way: “You have a very wild combination of very old systems sitting on old mainframes and some pieces of business services that reside on the mainframe, in private clouds, on web services…some in remote locations.”
As a result, IT teams face real challenges in making sure all systems work flawlessly at all times. And even though January was not a good month for US airline IT, on average the airline industry does not experience more downtime than other sectors. With the ever-increasing size and complexity of their environments and the constant changes in IT, airlines face the same struggles as financial and telecom companies. The outcome is that downtime and outages are pretty much inevitable.
Service disruptions in the airline industry, however, do tend to get much more attention because they can leave thousands of passengers stranded in airports. When thousands of customers are affected by flight cancellations and delays, they are quick to make their displeasure known on social media. Adding the loss of customer satisfaction and confidence to the already heavy cost of flight cancellations can make a single outage cost an airline well over $100 million.
As mentioned above, airline IT environments are highly complex, and with the constant changes they require, human errors and careless blunders are almost impossible to avoid. But with the right testing and processes in place, the frequency of outages and the damage they cause can be significantly reduced. The trouble is that testing is expensive, time consuming and intrusive. As a result, the common practice is a periodic audit or a DR test to make sure that critical systems are protected against meaningful downtime. The problem is that in between tests, risk builds up as changes accumulate, and IT cannot know whether fail-safe mechanisms still work as planned.
Thus, the more frequently we test, the less we are exposed to risk. The key, then, lies in automation: being able to test successfully without the need for an intrusive manual process. Leading banks have already adopted this approach to frequently validate their critical systems and underlying infrastructure. Providing IT operations teams with system-wide visibility and proactive risk detection allows them to truly mitigate risk on a regular basis and, as a result, minimize critical system outages.
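To put rough numbers on the link between test frequency and exposure, here is a minimal back-of-the-envelope sketch in Python. It rests on a simplifying assumption that is ours, not a measured result: a hidden fault (say, a failover that was silently broken by a configuration change) appears at a uniformly random moment between two tests and stays undetected until the next test runs. Under that assumption the average undetected time is half the test interval. The function name and the example intervals are hypothetical, chosen only for illustration.

```python
def mean_undetected_days(test_interval_days: float) -> float:
    """Average number of days a hidden fault goes unnoticed.

    Assumes the fault appears at a uniformly random moment between
    two consecutive tests and is caught by the next test, so the
    expected wait is half the interval.
    """
    if test_interval_days <= 0:
        raise ValueError("test interval must be positive")
    return test_interval_days / 2


# Illustrative intervals only; real schedules vary by organization.
schedules = [
    ("annual DR test", 365),
    ("quarterly audit", 90),
    ("daily automated check", 1),
]

for label, interval in schedules:
    days = mean_undetected_days(interval)
    print(f"{label}: about {days:.1f} days of undetected exposure on average")
```

The model is deliberately crude, since it ignores faults that a test misses and faults that surface as an outage before the next test, but it shows why moving from periodic manual tests to automated daily checks shrinks the window in which a broken fail-safe can go unnoticed.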