The Story of the IRS’s 11-hour Outage on Tax Day 2018
It’s already 2019! Here’s a fun thought! Before you know it, tax season will be here.
Under the U.S.’s self-reporting system, filing a tax return is something most taxpayers have to do. Typically, there’s tension around tax preparation, filing, and getting everything in by the deadline. Let’s hope that this year, deadline day won’t be a repeat of April 2018’s fiasco.
Tax day panic
Last year, in April 2018, on the final day tax returns could be filed without a late penalty, there was an eleventh-hour outage (which, coincidentally, lasted 11 hours) during which the IRS’s (Internal Revenue Service) online filing platform, the “Modernized e-File” system (MeF), was unavailable.
The outage struck fear in the hearts of both filers – would the IRS penalize them for missing the deadline? – and the IRS – was the outage due to a cyberattack originating in Russia, and how soon could it be fixed?
Filers were relieved when MeF was again operational and they were given an additional day to get their tax returns in.
One cause of the tax day outage and how it came about
Ultimately, a report of the investigation conducted by the Treasury Inspector General for Tax Administration (TIGTA) determined that the IRS was not a victim of hacking (whew, relief!). It was a victim of a “firmware bug” that caused a storage array to fail.
TIGTA found that the IRS’s Tier 1 storage environment, which failed, had no automatic failover or built-in redundancy, making it a single point of failure. The failure affected three of the seven storage arrays at the IRS center. Owing to the lack of redundancy, recovery was not immediate and took 11 hours. What’s more, before the outage, the external vendor responsible did not relay crucial information to the IRS about a system update that could have prevented the firmware bug from causing the outage, and so the necessary steps were never taken.
The TIGTA report also found that the IRS’s contract with the external vendor was not specific enough: it lacked detailed service level objectives for performance monitoring and incident management. And on the day of the outage, the vendor did not meet several of the objectives that were specified, either. Its acknowledgement of the problem, resolution plan, resolution, availability, and maximum unplanned downtime all fell significantly short of the contracted objectives.
As a result, the vendor was slapped with liquidated damages. Perhaps the money will help pay for the nearly $6M upgrade needed to properly populate and configure the IRS’s storage environment.
Thus, at the heart of the problem was poor management and communication, rather than poor technology.
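To make the idea of measuring a vendor against contracted objectives concrete, here is a minimal sketch in Python. It compares observed incident times against time-based objectives of the kind the TIGTA report lists (acknowledgement, resolution plan, resolution, maximum unplanned downtime). The names `Objective` and `missed_objectives` are made up for this example, and every number is hypothetical except the 11-hour downtime; none of the limits come from the actual IRS contract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    name: str
    limit_minutes: float  # contracted maximum (hypothetical value)


def missed_objectives(objectives, observed):
    """Return the names of objectives that were exceeded or never measured."""
    missed = []
    for obj in objectives:
        actual = observed.get(obj.name)
        # An objective with no measurement counts as missed: it cannot be verified.
        if actual is None or actual > obj.limit_minutes:
            missed.append(obj.name)
    return missed


# Hypothetical contract limits, for illustration only.
OBJECTIVES = [
    Objective("acknowledgement", 15),
    Objective("resolution_plan", 60),
    Objective("resolution", 240),
    Objective("max_unplanned_downtime", 240),
]

# Only the 11-hour downtime is taken from the article; the rest is invented.
OBSERVED = {
    "acknowledgement": 45,
    "resolution_plan": 180,
    "resolution": 11 * 60,
    "max_unplanned_downtime": 11 * 60,
}

if __name__ == "__main__":
    for name in missed_objectives(OBJECTIVES, OBSERVED):
        print(f"missed: {name}")
```

A check like this only works if the contract spells out each objective as a number, which is exactly what TIGTA found lacking for performance monitoring and incident management.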
No substitutes for resilient architecture and SLA-compliant processes
It’s surprising, but apparently, standard procedures for redundant architecture in high-availability environments were not followed to the letter at the IRS, leading to an outage that lasted for 11 hours. In complex, high-capacity, high-demand environments such as that of the IRS, it’s standard to have full geographic redundancy and complete redundancy for each of the components (compute, storage, network, power) located in each primary, secondary, and tertiary datacenter.
Ideally, full redundancy and failover capabilities are backed by automatic verification of the environment’s resilience, conducted routinely to ensure that there are no single points of failure and that all systems, storage, servers, and other components are configured correctly to keep working if one or more of them fail for whatever reason. An important aspect of verification is confirming that, in the event of an outage, the environment will fail over seamlessly and meet its SLA objectives. Here, too, the IRS was at a disadvantage: for some objectives, its SLA did not accurately reflect its needs, and for others, objectives were not spelled out.
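As a rough illustration of what such automatic verification might look for, the Python sketch below scans a component inventory for single points of failure and for datacenters that lack one of the four component types. The inventory, the `Component` fields, and the rule "at least two replicas plus automatic failover" are all assumptions made for this example; they do not describe the IRS environment or any particular product.

```python
from dataclasses import dataclass

REQUIRED_KINDS = ("compute", "storage", "network", "power")


@dataclass(frozen=True)
class Component:
    name: str
    kind: str            # one of REQUIRED_KINDS
    site: str            # e.g. "primary", "secondary", "tertiary"
    replicas: int        # independent instances able to carry the load
    auto_failover: bool  # True if failover needs no manual intervention


def single_points_of_failure(components):
    """Names of components with no spare replica or no automatic failover."""
    return [c.name for c in components if c.replicas < 2 or not c.auto_failover]


def missing_kinds(components, site):
    """Component kinds that a given site does not have at all."""
    present = {c.kind for c in components if c.site == site}
    return [kind for kind in REQUIRED_KINDS if kind not in present]


# Hypothetical inventory, for illustration only.
INVENTORY = [
    Component("tier1-storage", "storage", "primary", replicas=1, auto_failover=False),
    Component("app-servers", "compute", "primary", replicas=4, auto_failover=True),
    Component("core-switch", "network", "primary", replicas=2, auto_failover=True),
    Component("ups", "power", "primary", replicas=2, auto_failover=True),
]

if __name__ == "__main__":
    print("single points of failure:", single_points_of_failure(INVENTORY))
    print("missing at secondary:", missing_kinds(INVENTORY, "secondary"))
```

In this made-up inventory, the storage tier is flagged because it has a single replica and no automatic failover, and the secondary site is flagged because nothing is deployed there. A real verification tool would read live configuration rather than a hand-written list, but the questions it asks are the same.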
It’s possible that for convenience’s sake, the official reason for the outage was condensed into four words: “firmware bug” and “storage failure.” And, while we’ve seen that technically, this is what happened, it entirely misrepresents the true nature of the incident and the full spectrum of circumstances which led to the outage.
“Firmware bug” is another story told to explain away an 11-hour outage on the most critical day of tax season.
At Continuity Software, in 2019, we’re going to be focusing more on what can happen when you don’t know whether your IT environment – on-premises, private, public or hybrid – is resilient, and how your enterprise can achieve resilience.