A European Telco Adopts Proactive Configuration Validation to Meet Resilience Goals and Prevent Outages
About the Company
This European telco is a large and diversified company. It offers mobile and fixed-line telephony, internet, and digital TV; is a leading provider of IT services; builds and maintains telephony infrastructure; and transmits broadcasting signals. It also offers a range of web-based solutions, services and applications to the banking, energy, entertainment, advertising and healthcare sectors.
The telco is a multi-faceted company with a very large, complex and varied IT infrastructure. Its environment is hybrid, spanning on-premises, private cloud and public cloud (AWS) infrastructure that uses networking, computing, and storage systems from different providers and runs on different virtualization platforms. Its operations run on 30,000 servers across five datacenters.
Preventing outages and maintaining business stability are critical objectives for the telco’s IT leadership team.
In the past, the company’s IT environments had experienced downtime and outages. This was of great concern because of the mission-critical services the company provides to customers in prominent fields such as banking, healthcare and energy. The systems and application operations team was determined to find a resilience assurance solution: “a proactive tool to find and fix” the misconfigurations that were degrading system performance and even causing outages. Their aim was to provide “stability and quality” to their customers. Once they discovered and got started with Continuity Software’s AvailabilityGuard NXG™, their vision became reality.
The European telco team was determined to deliver the highest levels of availability for the company’s new mission-critical AWS and SAP environments. Specifically, the team needed to verify the resilience of the Dell EMC VxBlock converged infrastructure underlying their SAP implementation.
In discussing the challenge of ensuring availability and preventing IT outages, the telco recognized that: “Even new systems can be incorrectly configured and harbor points of failure which won’t be discovered until there’s a problem. Add to that the steady stream of new services and apps by the cloud providers as well as changes, updates and tweaks made by in-house and third-party IT teams and you understand that maintaining the highest levels of IT resilience gets harder day-by-day. And, clearly, you want to prevent outages because they’re so costly on many levels.”
Owing to their experiences with service disruptions and downtime, these seasoned professionals understood that risks to availability were plentiful and they needed to be addressed and repaired on a continuous basis. After dealing with and recovering from situations of data loss and system outages, they saw how “tricky it was to check the systems” in order to find each error and misconfiguration. It was not a task they could perform manually.
The AvailabilityGuard NXG™ Implementation: Confidence and Operational Resilience
Continuity Software’s AvailabilityGuard NXG solution starter package was initially installed to assure resilience on the European telco’s new and critical VxBlock-based SAP Cloud environment, comprising VMware ESXi clusters, Cisco UCS blades, a range of Dell EMC storage systems (including VMAX, VPLEX, VNX), and Cisco MDS switches.
The telco wanted to ensure a stable roll out of their new SAP implementation by validating that it was misconfiguration-free, at least on day one. They knew it wouldn’t remain that way and that they needed to commit to a new system of continual checks and repairs of misconfigurations in order to keep their services running at top performance levels.
AvailabilityGuard NXG scanned and analyzed configuration data from all IT infrastructure layers. Its initial scan of the SAP stack led to the identification of close to 90 issues of varying severity including some that “you’d never even search for and wouldn’t find even if you did,” according to the team. The most significant findings were cross-layer misconfigurations between the UCS and VMware layers; additional issues were found in the VMware and VPLEX environments.
Based on the successful initial results, the telco decided to expand the AvailabilityGuard NXG deployment to protect additional business services, including critical applications deployed on AWS (using EC2, S3, VPC and CloudFront services) and their large digital TV environment.
The Secret Sauce? Our Deep Knowledgebase
AvailabilityGuard NXG scans and analyzes configuration data from all IT infrastructure layers and environments. The information gathered in the scan process is compared against the solution’s unique, proprietary knowledgebase of close to 8,000 vendor, industry and community-driven best practices and recommendations. The issues detected are deviations from best practices that would lead to performance disruptions and/or outages if not remediated.
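The general idea of comparing collected configuration data against a base of best-practice rules can be illustrated with a short Python sketch. This is not AvailabilityGuard NXG code and does not reflect its knowledgebase: the rule IDs, configuration fields (`storage_paths`, `hypervisor_build`, `cluster_build`) and host names are hypothetical, chosen only to show the pattern.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Rule:
    """A best-practice rule; `check` returns True when a config complies."""
    rule_id: str
    description: str
    severity: str
    check: Callable[[Dict[str, Any]], bool]


@dataclass(frozen=True)
class Finding:
    """A deviation from a rule, detected on one host."""
    rule_id: str
    host: str
    severity: str
    description: str


# Hypothetical rules, for illustration only.
RULES: List[Rule] = [
    Rule(
        "HYP-001",
        "Storage should be reachable over at least two paths",
        "high",
        lambda cfg: cfg.get("storage_paths", 0) >= 2,
    ),
    Rule(
        "HYP-002",
        "Host hypervisor build should match the cluster build",
        "medium",
        lambda cfg: cfg.get("hypervisor_build") == cfg.get("cluster_build"),
    ),
]


def scan(configs: Dict[str, Dict[str, Any]], rules: List[Rule] = RULES) -> List[Finding]:
    """Evaluate every rule against every host config and collect deviations."""
    findings = []
    for host, cfg in configs.items():
        for rule in rules:
            if not rule.check(cfg):
                findings.append(
                    Finding(rule.rule_id, host, rule.severity, rule.description)
                )
    return findings


# Hypothetical collected configuration for two hosts.
EXAMPLE_CONFIGS = {
    "esx-01": {"storage_paths": 2, "hypervisor_build": "b100", "cluster_build": "b100"},
    "esx-02": {"storage_paths": 1, "hypervisor_build": "b099", "cluster_build": "b100"},
}

if __name__ == "__main__":
    for finding in scan(EXAMPLE_CONFIGS):
        print(finding.host, finding.rule_id, finding.severity)
```

In this sketch each rule is a small predicate over one host's configuration. A real rule base would also need rules that span several layers at once, such as the cross-layer UCS and VMware issues found at the telco.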
The use of a massive and comprehensive knowledgebase that continually accumulates more best practices and user input assures that risks to uptime inherent in increasingly complex multi-cloud and hybrid environments can be easily pinpointed and resolved.
As the telco’s systems and application operations team leader put it: “We liked Continuity Software’s approach because their solution is vendor-agnostic and is based on what I’ve come to think of as their secret weapon – a huge knowledgebase of technology vendors’ best practices that also contains input from the user community. This is unique in the field.”
AvailabilityGuard NXG automatically alerts the relevant IT personnel or business owners and provides them with protocols for resolution. The solution integrates with most incident management systems. At the telco, the findings are automatically forwarded as tickets directly into their JIRA issue tracking system. Since not all incidents are created equal, to help IT teams understand the implications of each deviation from recommended practice, the AvailabilityGuard NXG tickets include information about the severity and potential impact of the issue and its urgency. The telco found this prioritization feature useful in its uphill battle for resilience.
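As a rough illustration of how findings with a severity can be turned into prioritized tickets, the Python sketch below orders findings by severity and builds one ticket payload per finding. The payload loosely follows the general shape of a Jira issue-creation request, but the project key `OPS`, the issue type, the severity-to-priority mapping and the finding fields are all assumptions; nothing here describes the actual integration used at the telco, and no request is sent.

```python
from typing import Any, Dict, List

# Assumed mapping from finding severity to a ticket priority name.
PRIORITY_BY_SEVERITY = {
    "critical": "Highest",
    "high": "High",
    "medium": "Medium",
    "low": "Low",
}

# Lower number means the ticket is listed first.
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def to_ticket(finding: Dict[str, str], project_key: str = "OPS") -> Dict[str, Any]:
    """Build a ticket payload (hypothetical field values) for one finding."""
    return {
        "fields": {
            "project": {"key": project_key},
            "issuetype": {"name": "Task"},
            "summary": f"[{finding['severity'].upper()}] {finding['title']}",
            "description": (
                f"Affected component: {finding['component']}\n"
                f"Potential impact: {finding['impact']}"
            ),
            "priority": {"name": PRIORITY_BY_SEVERITY[finding["severity"]]},
        }
    }


def build_tickets(findings: List[Dict[str, str]], project_key: str = "OPS") -> List[Dict[str, Any]]:
    """Return ticket payloads with the most severe findings first."""
    ordered = sorted(findings, key=lambda f: SEVERITY_ORDER[f["severity"]])
    return [to_ticket(f, project_key) for f in ordered]


# Hypothetical findings, for illustration only.
EXAMPLE_FINDINGS = [
    {"title": "Host build differs from cluster", "severity": "medium",
     "component": "esx-02", "impact": "Failover may behave inconsistently"},
    {"title": "Single storage path", "severity": "high",
     "component": "esx-02", "impact": "One path failure cuts off storage"},
]

if __name__ == "__main__":
    for ticket in build_tickets(EXAMPLE_FINDINGS):
        print(ticket["fields"]["priority"]["name"], "-", ticket["fields"]["summary"])
```

Keeping the severity and impact in the ticket itself means the team triaging the queue does not have to return to the scanning tool to decide what to fix first.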
Using AvailabilityGuard NXG, the European telco teams were able to quickly validate and improve the operational resilience of their critical environment. And, thanks to automated daily scans of the environment by AvailabilityGuard NXG, the telco ensures that the highest levels of resilience are continuously maintained.
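One simple way to think about what repeated daily scans add is to compare each scan's findings with those of the previous scan. The Python sketch below does this with set operations. It is only an illustration under assumed data: findings are represented as hypothetical `(component, rule_id)` pairs, and this is not a description of how AvailabilityGuard NXG tracks issues internally.

```python
from typing import Dict, Set, Tuple

# A finding is identified here by (component, rule_id); both are hypothetical.
FindingKey = Tuple[str, str]


def diff_scans(previous: Set[FindingKey], current: Set[FindingKey]) -> Dict[str, Set[FindingKey]]:
    """Classify findings as new, resolved, or still open between two scans."""
    return {
        "new": current - previous,        # appeared since the last scan
        "resolved": previous - current,   # no longer detected
        "open": current & previous,       # detected in both scans
    }


# Hypothetical results of two consecutive daily scans.
YESTERDAY = {("esx-02", "HYP-001"), ("esx-02", "HYP-002")}
TODAY = {("esx-02", "HYP-002"), ("esx-03", "HYP-001")}

if __name__ == "__main__":
    for status, keys in diff_scans(YESTERDAY, TODAY).items():
        print(status, sorted(keys))
```

A comparison like this highlights misconfigurations introduced by the previous day's changes, and confirms which earlier findings have actually been fixed.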
The team concluded that AvailabilityGuard NXG helps them to “be more confident about our infrastructure.”
The IT scenario at this large, multi-faceted European telco is characteristic of the modern enterprise, where the core of business operations resides in complex and interconnected hybrid and multi-cloud IT environments that are prone to misconfigurations and outages. The profusion of changes, updates, fixes and upgrades that routinely occur in such environments makes attaining and maintaining resilience a real challenge.
AvailabilityGuard NXG meets this challenge and establishes a process that enables IT, DevOps and development teams to gain control over assuring resilience and availability in AWS cloud and hybrid IT environments.
Prevent downtime, data loss and cyber resilience risks in hybrid IT infrastructure