Service outages, whether in cloud-based infrastructure, data centers, enterprise networks, or other IT systems large and small, are the scourge of IT teams. Occasionally, outages are caused by highly unusual incidents, such as the monkey in Kenya that tripped a transformer at the country’s main power station and triggered a national blackout. Slightly less rare are hurricanes and earthquakes. But most IT outages have far more mundane roots, such as the software or human errors behind a recent East Coast phone outage.
These causes are very tricky to eliminate altogether. Human error is often unavoidable when overstretched IT teams have to handle ever-increasing complexity across multiple IT infrastructure layers. Therefore, instead of waiting for support teams to be overwhelmed by a barrage of emails and calls from angry customers, IT managers should come up with a proactive and comprehensive strategy for dealing with service outages.
Ideally, the strategy should include the following four steps:
Tackling an outage is like dealing with any other emergency. Teams tasked with resolving the situation need to have a solid action plan in place. The IT team manager should distribute the necessary remediation tasks to the right team members. Care is needed to avoid response actions that may affect other users and cause even further service disruption.
During the chaos and panic that often accompany service outages, IT teams tend to focus solely on solving technical issues at the expense of handling calls and emails from angry customers. Although this is a natural response, it often makes matters worse. The best course of action is to communicate with customers and let them know the situation. Being honest and keeping users in the loop will increase transparency, earn sympathy, and ultimately minimize the effect on the bottom line.
Once the crisis is over, it’s important to skip the blame game, or at least minimize it as much as possible. Documenting the incident and reporting to management and the board is a standard measure. But the real benefit comes from gaining a deeper understanding of the root causes and putting in place systems and processes that will prevent these types of mundane service outages from occurring in the first place.
Having a plan to tackle IT outages is a step in the right direction. But better still is complete prevention. This can be achieved by implementing an automated IT risk detection solution that analyzes an IT ecosystem with its multiple components and alerts IT teams in advance about potential problems, such as an upgrade that may have altered configuration files, inconsistent settings, and many other issues. In fact, such solutions can prevent most outages altogether. Their implementation will also put IT teams back in control, enabling them to devote more time and resources to supporting the organization’s core activities and long-term objectives.
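To make the idea of detecting inconsistent settings more concrete, here is a minimal sketch in Python. It is not how any particular risk detection product works. It assumes that each host's settings have already been collected into a dictionary, and the function name `find_inconsistent_settings`, the host names, and the setting names are all hypothetical, chosen only for illustration.

```python
def find_inconsistent_settings(configs):
    """Return {setting: {host: value}} for every setting whose value
    differs between hosts. A host that lacks a setting is reported
    with the value None.
    """
    # Collect every setting name that appears on at least one host.
    all_keys = set()
    for settings in configs.values():
        all_keys.update(settings)

    drift = {}
    for key in sorted(all_keys):
        # Look up this setting on each host (None if it is missing).
        values = {host: settings.get(key) for host, settings in configs.items()}
        # More than one distinct value means the hosts disagree.
        if len(set(values.values())) > 1:
            drift[key] = values
    return drift


# Hypothetical settings gathered from three nodes that should be identical.
cluster = {
    "web-01": {"max_connections": "500", "tls_version": "1.2", "timeout_s": "30"},
    "web-02": {"max_connections": "500", "tls_version": "1.2", "timeout_s": "30"},
    "web-03": {"max_connections": "200", "tls_version": "1.2"},
}

for setting, per_host in find_inconsistent_settings(cluster).items():
    print(f"WARNING: '{setting}' differs across hosts: {per_host}")
```

In this example, the check flags `max_connections` and `timeout_s` because web-03 disagrees with the other two nodes, perhaps after an upgrade. A real solution would also need to gather the settings, know which differences are intentional, and cover far more than simple key-value pairs.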