[vc_row][vc_column][vc_column_text]Service outages are no laughing matter. Far from being a benign issue as the term might suggest, they are every IT department’s worst nightmare, with a potential to inflict a heavy toll on an organization’s bottom line.
A quick review of the tech news confirms this. Delta Air Lines’ global fleet was grounded by a data center problem. A ‘glitch’ at Barclays.com affected hundreds of thousands of customers, and a one-day service outage at Salesforce.com cost the company $20 million. In the UK, a series of very public service outages at leading high street banks led a senior government regulator to exclaim, “We can’t carry on like this.”
Service outages are commonplace and painfully expensive. According to IDC, infrastructure failure can cost large enterprises $100,000 per hour, while critical failures can cost as much as $1 million an hour.
Aiming to determine what happened, why it happened, and how it was fixed, a major University of Chicago study reviewed 516 unplanned outages and identified a list of the main causes of service outages at online services companies. The lessons are equally valuable for IT departments.
Some of the main causes included:
Upgrades: Responsible for 15% of service outages. Upgraded systems were either not fully tested in the offline environment, or not tested thoroughly enough to verify that they would meet the demands of the full ecosystem.
Misconfiguration issues: Responsible for 10% of service outages. While IT workers are often responsible for misconfiguration, it’s not always their fault. Often, new software or upgrades to existing applications throw things out of whack elsewhere. The result: components across the ecosystem end up with conflicting views of what is correct.
Other causes of service outages include undue stress on an ecosystem due to traffic issues, power outages, security issues, and human error.
But the biggest category was ‘unknown’, with researchers unable to identify the root cause in 48% of the 516 cases in the study. This has serious implications: if you can’t figure out what the problem is, how can you fix it?
One approach to tackling the issue is to use automated big data analytics to identify potential outages. These systems constantly evaluate network elements, analyzing the relationships between hardware, software, configuration files, network connections, and everything else that makes up an IT system. IT staff can’t do this work manually; there is simply too much information to keep track of.
The systems identify risky deviations from industry best practices and vendor recommendations while providing early warning capabilities to help administrators understand the impact of any change. So, when the time comes to install new software, for example, analytics systems can send out alerts about the implications of the installation, what services and functions will be affected, and what steps should be taken to prevent the risk of an outage.
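At its core, this kind of validation is rule-based: each setting in a live configuration is compared against a best-practice expectation, and deviations generate warnings before a change causes damage. The sketch below illustrates the idea in Python; the rule names, configuration fields, and messages are hypothetical, invented for illustration, whereas real analytics products ship with large, vendor-maintained rule sets.

```python
# Minimal sketch of rule-based configuration validation.
# All rule names and config fields here are hypothetical examples.

def check_config(config, rules):
    """Return warnings for settings that deviate from best practice."""
    warnings = []
    for key, (expected, message) in rules.items():
        actual = config.get(key)
        if actual != expected:
            warnings.append(
                f"{key}: found {actual!r}, expected {expected!r} ({message})"
            )
    return warnings

# Hypothetical best-practice rules (illustrative only).
RULES = {
    "ntp_enabled": (True, "hosts must share a common time source"),
    "ha_admission_control": (True, "disabling it risks failover capacity"),
    "max_snapshot_chain": (2, "long snapshot chains degrade performance"),
}

# A host configuration with two risky deviations.
host_config = {
    "ntp_enabled": False,
    "ha_admission_control": True,
    "max_snapshot_chain": 5,
}

for warning in check_config(host_config, RULES):
    print(warning)
```

In a real deployment the rule set is far larger and is updated continuously from vendor recommendations, but the principle is the same: detect the deviation early, explain why it matters, and give administrators a chance to act before users notice.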
Organizations upgrading from vSphere 5.5 to 6.x, for example, will find that there are so many issues to consider that it’s nearly impossible for IT workers to ensure that all bases have been properly covered. A single missed step can significantly hamper operations or even cause an outage. With a good operations analytics solution, users can leverage the power of automatic configuration validation to complete the job faster and more reliably.
The right analytics system can help IT teams prevent problems from cropping up before they inflict any damage. Surely, in today’s complex IT environment, this is welcome help.
For the original article, visit http://www.networkworld.com/article/3106491/network-management/anatomy-of-a-service-outage-how-did-we-get-here.html