Just a month ago, and again yesterday, British Airways (BA) suffered an IT outage. Thousands of passengers were left stranded in various international cities when about 120 flights were cancelled and 300 were delayed, affecting more than 15,000 passengers. During the outage, frustrated passengers couldn’t even access information about the status of their flights. Two systems were affected: online check-in and flight departures.
Both outages were resolved. The reasons for them? The first was blamed on the ubiquitous “IT glitch,” and yesterday’s was attributed to a “computer problem,” which we assume is another way of saying “IT glitch.” What that means precisely is anyone’s guess. The path to a glitch, however, is no secret, and in this post we’ll describe the conditions that lead to it. Our experience at Continuity Software in analyzing thousands of downtime occurrences, reinforced by a June 2019 report on airline IT outages by the U.S. Government Accountability Office (GAO), gives us plenty of material to work with and provides real insight into what occurs behind the scenes leading up to outages like BA’s.
The GAO conducted a study of airline outages in the U.S. from 2015 to 2017. In addition to the obvious repercussions of outages for passengers and the airline, the report realistically describes the prevailing state of complexity of modern IT environments in airlines worldwide.
The GAO report tells of complex, increasingly hybrid IT environments made up of many moving parts originating in diverse old and new sources, with some key systems managed by third parties. In other words, these are environments ripe for the misconfigurations that cause outages. The GAO describes a “data-intensive environment that demands around-the-clock availability and real-time information.” Myriad systems interface with many other systems responsible for booking, check-in, etc. (see diagram below). Along with newer mobile applications, they are all critical to airline operations and involved in providing a complete travel experience for passengers.
Like other long-established industries, airlines typically rely on some legacy systems that are hard to maintain; IT contends with integrating them into existing systems while simultaneously transitioning to newer systems. Similarly, the late 2000s saw numerous airline consolidations and mergers, with each airline bringing along its own IT environment. Getting these IT infrastructures to work as an integrated system presents additional challenges.
Airlines are also increasingly moving merchandising and retailing online, as well as relying on regional partners or third-party IT providers to manage certain key systems. Airlines depend on these external systems to run operations, and the systems must have 24×7 access to real-time information.
It stands to reason that the British Airways outage developed under similar circumstances: a complex, misconfiguration-prone environment.
The GAO reports that some airlines are attempting to “reduce vulnerability” by adding datacenters, moving to the cloud, and using multiple telecom providers. Some also conduct routine system testing and outage drills. These measures are certainly called for; however, assuring resilience is an ongoing effort, best carried out continuously rather than only at intervals.
As mentioned, today’s IT environments are highly complex. The situation isn’t unique to British Airways or to airlines and airports. Our AvailabilityGuard NXG™ solution is geared to assuring the resilience of precisely these types of interdependent, hybrid environments. Based on a proven methodology, it is used by enterprises in all industries and helps ensure service availability and prevent outages before they impact the business.