The outage and its fallout
In February 2018, BB&T Bank (since renamed Truist Financial), one of the largest banks in the US, experienced a 15-hour outage during which many of its services became unavailable. Among other things, customers could not access online banking or the mobile banking app, make debit card purchases, or use ATMs. Though the cause of the outage was found and repaired within 15 hours, not all systems were fully restored for several days.
The bank’s customers were angry and frustrated. To compensate them, the bank extended its hours over several days and waived service and other fees charged to customers during the outage and immediately afterward. The cost to the bank was roughly $15M in lost deposit service charges and about $5M in higher operating expenses.
A dispute is now brewing between BB&T and Hitachi Vantara over where responsibility for the outage lies. Hitachi sold the bank storage disk array equipment for its datacenter. More on all that below.
BB&T: The outage cause, explained as “…equipment malfunction”
In the nearly two years since the outage, the bank has not specified its cause. At the time, without naming Hitachi or its equipment, the bank said the outage was due to a “simple but serious equipment malfunction,” adding that it should not raise “alarm with regard to our infrastructure in terms of IT and its resiliency and its redundancy.”
However, “resiliency” and “redundancy” do appear to be at the heart of why the outage occurred, and one can speculate about two likely scenarios.
The first is insufficient redundancy in the production IT environment – a single point of failure in the way bank applications accessed Hitachi storage. Such hidden points of failure can exist at different layers of the infrastructure. An example is a single Fibre Channel host bus adapter (HBA) through which all I/O traffic to the banking applications flows. Similarly, a storage networking port or a disk array director port can be a hidden single point of failure if redundancy is not correctly configured at every layer of the environment.
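To make the idea concrete, here is a minimal sketch of how one might detect a hidden single point of failure from an inventory of host-to-storage I/O paths. All component names and data are illustrative, not taken from the BB&T case: a component that every path of a host traverses is a SPOF, even if the paths look redundant elsewhere.

```python
def find_spofs(paths):
    """Given a list of I/O paths for one host (each a dict mapping an
    infrastructure layer to the component used), return the components
    that appear in every path - the single points of failure."""
    if not paths:
        return {}
    spofs = {}
    for layer in paths[0]:
        components = {p[layer] for p in paths}
        if len(components) == 1:          # same component on all paths
            spofs[layer] = components.pop()
    return spofs

# Example: two paths that are redundant at the switch and array layers
# but share one HBA - the classic hidden single point of failure.
host_paths = [
    {"hba": "hba-0", "switch": "fc-sw-1", "array_port": "ctrl-A"},
    {"hba": "hba-0", "switch": "fc-sw-2", "array_port": "ctrl-B"},
]

print(find_spofs(host_paths))  # {'hba': 'hba-0'}
```

A quick glance at a multipath listing can make such a configuration look redundant; only checking each layer across all paths, as above, exposes the shared component.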
The second scenario is that a production failure triggered failover to the disaster recovery systems, typically located at a remote site, but the failover did not complete successfully, leading to the outage.
BB&T sues Hitachi
Finally, after close to two years of vagueness, in November 2019 BB&T pointed the finger at Hitachi Vantara as the vendor of the storage disk array equipment it says “seriously malfunctioned.” BB&T sued Hitachi for $75,000 in damages.
BB&T claims that Hitachi did not install the storage equipment properly and was slipshod in maintaining it. The bank asserts that Hitachi was “grossly negligent in installing the fiber optics cables… and performed insufficient performance testing” that would have detected problems “before a critical outage occurred.” This suggests BB&T concluded that the missing redundancy was in the storage network connectivity.
Hitachi counters that BB&T ignored “the system maker’s advice regarding how a high availability system should be architected and managed” and that the bank maintained “the system on the cheap.” In other words, Hitachi argues that once the new storage equipment was installed, responsibility for configuring it according to high-availability best practices lay with the customer – that is, BB&T was responsible for maintaining best-practice configurations to ensure redundancy.
Since the specific details leading to the outage have not been made public, we can rely only on facts that made it to the press. Recently, BB&T announced that it invested “$300 million on a new data center with duplicate redundant data halls,” which the bank’s CEO said “addresses the cause of the February outage.” This indicates that lack of redundancy was indeed a key problem at the BB&T datacenter – an issue reflected in both outage scenarios above.
At the same time, BB&T nearly hit the nail on the head when it said that system performance testing was needed and would have caught misconfigurations before they led to the outage. But testing alone is not enough.
The truth is that it doesn’t much matter whether an organization installs the newest equipment and systems: if they aren’t configured correctly to begin with, and if best-practice configurations aren’t examined and enforced continuously – daily – disruptions and even outages can occur. Redundancy testing in particular, which typically takes place annually or at best quarterly, will miss the misconfigurations that are a natural byproduct of modern, complex, and dynamic IT environments. Misconfigurations happen all the time – daily, even hourly.
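The principle of continuous checking can be sketched in a few lines. This is an illustrative example only (the setting names and baseline are hypothetical, not a real product API): a live configuration snapshot is compared against a best-practice baseline on every run, so drift is caught the day it happens rather than at the next scheduled redundancy test.

```python
def config_drift(baseline, snapshot):
    """Return every setting whose live value deviates from the
    best-practice baseline, with expected and actual values."""
    return {
        key: {"expected": want, "actual": snapshot.get(key)}
        for key, want in baseline.items()
        if snapshot.get(key) != want
    }

# Hypothetical best-practice baseline for a high-availability setup.
baseline = {"multipath_enabled": True,
            "min_paths_per_lun": 2,
            "replication_mode": "synchronous"}

# A snapshot taken today: someone's change quietly dropped a path.
snapshot = {"multipath_enabled": True,
            "min_paths_per_lun": 1,
            "replication_mode": "synchronous"}

print(config_drift(baseline, snapshot))
# {'min_paths_per_lun': {'expected': 2, 'actual': 1}}
```

Run daily, a check like this flags the deviation immediately; an annual redundancy test could leave it undiscovered for months.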
A central question in this case is which company was ultimately responsible for executing the critical tests BB&T says were missing: the equipment vendor or the customer. Our perspective is that regardless of who is responsible, continuous and proactive maintenance that assures IT resilience and prevents outages must be conducted.
Our AvailabilityGuard NXG™ solution proactively detects misconfigurations, errors and single points of failure which can eventually lead to the kind of outage BB&T experienced. The solution is used by six of the ten largest banks in the United States to assure the resilience of their complex IT environments and reduce performance disruptions and outages.