Preventing outages and maintaining business stability are critical objectives for the telco’s IT leadership team. In the past, the company’s IT environments had experienced downtime and outages, and the team was concerned about the impact of such incidents on the mission-critical services the company provided to customers in prominent fields such as banking, healthcare and energy.
The telco’s cloud operations team was determined to deliver the highest levels of reliability for the company’s new mission-critical AWS and SAP environments. Specifically, they needed to verify the reliability of the Dell EMC VxBlock converged infrastructure underlying their SAP implementation. They recognized that: “Even new systems can be incorrectly configured and harbor points of failure which won’t be discovered until there’s a problem. Add to that the steady stream of new services and apps by the cloud providers as well as changes, updates and tweaks made by in-house and third-party IT teams and you understand that maintaining the highest levels of reliability gets harder day-by-day. And, clearly, you want to prevent outages because they’re so costly on many levels.”
Due to their experiences with service disruptions and downtime, the telco operations team understood that risks to reliability were plentiful and needed to be addressed and repaired on a continuous basis. After dealing with and recovering from situations of data loss and system outages, they saw how “tricky it was to check the systems” in order to find each error and misconfiguration. It was not a task they could perform manually.
As a result, the operations team was determined to find a reliability assurance solution – “a proactive tool to find and fix” misconfigurations that were disrupting their systems’ performance and even causing them to go down. They were decisive in their aim to provide “stability and quality” to their customers. This led them to Continuity Software’s Coral™ for AWS.
The telco installs Coral
Coral was initially installed to assure reliability on the telco’s new and critical VxBlock-based SAP cloud environment, comprising VMware ESXi clusters, Cisco UCS blades, a range of Dell EMC storage systems (including VMAX, VPLEX, VNX), and Cisco MDS switches.
The telco wanted to ensure a stable rollout of their new SAP implementation by validating that it was misconfiguration-free, at least on day one. They knew it wouldn’t remain that way and that they needed to commit to a cloud management tool that continuously checks and repairs infrastructure reliability misconfigurations in order to keep their services running at top performance levels.
Results: The solution detects dozens of misconfigurations prior to rollout
Coral scanned and analyzed configuration data from all IT infrastructure layers. Its initial scan of the SAP stack led to the identification of close to 90 risks of varying severity including some that “you’d never even search for and wouldn’t find even if you did,” according to the team. The most significant findings were cross-layer misconfigurations between the UCS and VMware layers; additional issues were found in the VMware and VPLEX environments.
Repairing misconfigurations and errors
The solution, which integrates with most incident management systems, automatically alerts the relevant IT personnel or business owners and provides them with step-by-step protocols for resolution. At the telco, the findings are automatically forwarded as tickets directly into their JIRA issue tracking system. Not all incidents are created equal, so each ticket includes the severity, potential impact and urgency of the issue, which helps IT and DevOps teams understand the implications of each deviation from recommended practice. The telco found this prioritization feature useful in its uphill battle for reliability.
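To make the ticket flow described above more concrete, here is a minimal Python sketch of how findings might be turned into prioritized ticket payloads. It is an illustration only and not Coral's actual integration: the `ReportedRisk` class, the field names, the severity ranking and the sample findings are all assumptions invented for this example, and the payload is a plain dictionary rather than a call to any real JIRA API.

```python
from dataclasses import dataclass

# Hypothetical severity scale; lower rank means "handle first".
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass(frozen=True)
class ReportedRisk:
    """One detected misconfiguration (all fields are assumed for illustration)."""
    summary: str
    severity: str          # one of the SEVERITY_RANK keys
    impact: str            # potential business impact, in plain language
    urgency_days: int      # suggested number of days in which to fix it
    resolution_steps: tuple


def to_ticket(risk: ReportedRisk) -> dict:
    """Build a generic ticket payload carrying severity, impact and urgency."""
    steps = "\n".join(
        f"{number}. {step}"
        for number, step in enumerate(risk.resolution_steps, start=1)
    )
    return {
        "summary": f"[{risk.severity.upper()}] {risk.summary}",
        "description": (
            f"Potential impact: {risk.impact}\n"
            f"Fix within: {risk.urgency_days} day(s)\n\n"
            f"Resolution steps:\n{steps}"
        ),
        "labels": ["reliability", risk.severity],
    }


def prioritized_tickets(risks: list) -> list:
    """Order risks by severity, then by urgency, and convert them to tickets."""
    ordered = sorted(
        risks, key=lambda r: (SEVERITY_RANK[r.severity], r.urgency_days)
    )
    return [to_ticket(r) for r in ordered]


# Invented sample findings, not taken from the telco's environment.
sample_risks = [
    ReportedRisk(
        summary="Backup bucket has versioning disabled",
        severity="medium",
        impact="Overwritten objects cannot be restored",
        urgency_days=14,
        resolution_steps=("Enable versioning", "Verify with a test upload"),
    ),
    ReportedRisk(
        summary="Host has a single path to shared storage",
        severity="high",
        impact="One path failure takes the host's workloads offline",
        urgency_days=3,
        resolution_steps=("Add a second path", "Re-scan to confirm redundancy"),
    ),
]

tickets = prioritized_tickets(sample_risks)
for ticket in tickets:
    print(ticket["summary"])
```

Sorting before ticket creation is one simple way to reflect the idea that severity and urgency should drive the order in which teams work through findings. A real integration would map these fields onto whatever the tracking system expects.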
The telco extends Coral coverage to additional critical services deployed on AWS
Based on the successful initial results, the telco decided to expand the Coral for AWS deployment to protect additional business services, including critical applications deployed on AWS (using EC2, S3, VPC and CloudFront services) and their large digital TV environment.
Coral for AWS secret sauce: A deep knowledge base
Coral for AWS scans and collects configuration data from all IT infrastructure layers, runs analyses against an unequalled knowledge base of 300+ (and growing) best-practice rules from AWS, vendors, industry and power users, and employs machine learning algorithms to gain visibility into the AWS environment and its configuration. These processes pinpoint problem areas and enable their repair before they erupt into costly disruptions to business. The risks detected are deviations from best practices that would lead to performance disruptions and/or outages if not remediated. The continually expanding knowledge base enables quick, reliable detection and repair of risks to uptime.
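The general idea of checking collected configuration data against a set of best-practice rules can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of how Coral works internally: the two rules, their identifiers, the severity labels and the shape of the configuration dictionary are hypothetical, and the sketch leaves out the machine learning component entirely.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    """A best-practice rule (hypothetical structure for illustration)."""
    rule_id: str
    description: str
    severity: str
    violated: Callable[[dict], bool]  # returns True when the config breaks the rule


@dataclass(frozen=True)
class Finding:
    rule_id: str
    severity: str
    description: str


def has_unversioned_bucket(config: dict) -> bool:
    """True if any S3 bucket in the collected data has versioning turned off."""
    return any(
        not bucket.get("versioning", False)
        for bucket in config.get("s3_buckets", [])
    )


def runs_in_single_zone(config: dict) -> bool:
    """True if every EC2 instance sits in the same availability zone."""
    zones = {instance["az"] for instance in config.get("ec2_instances", [])}
    return len(zones) == 1


# Two invented rules standing in for a much larger knowledge base.
RULES = [
    Rule("S3-VERSIONING", "S3 bucket has versioning disabled",
         "medium", has_unversioned_bucket),
    Rule("EC2-SINGLE-AZ", "All EC2 instances run in one availability zone",
         "high", runs_in_single_zone),
]


def scan(config: dict, rules: list = RULES) -> list:
    """Evaluate every rule against the configuration and collect violations."""
    return [
        Finding(rule.rule_id, rule.severity, rule.description)
        for rule in rules
        if rule.violated(config)
    ]


# Invented configuration snapshot; a real scan would gather this from the environment.
sample_config = {
    "s3_buckets": [
        {"name": "logs", "versioning": True},
        {"name": "backups", "versioning": False},
    ],
    "ec2_instances": [
        {"id": "app-1", "az": "eu-west-1a"},
        {"id": "app-2", "az": "eu-west-1a"},
    ],
}

findings = scan(sample_config)
for finding in findings:
    print(finding.severity, finding.rule_id, finding.description)
```

Keeping each rule as data plus a small predicate means the rule set can grow without changes to the scanning loop, which is one plausible reason a rule-driven design suits a knowledge base that expands over time.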
As the telco team leader put it: “We liked Continuity Software’s approach because their solution is hybrid and vendor-agnostic and is based on what I’ve come to think of as their secret weapon – a huge knowledge base of technology vendors’ best practices that also contains input from the user community. This is unique in the field.”
Using Coral for AWS, the European telco was able to quickly validate and improve the operational reliability of their critical applications deployed on AWS. And, thanks to the solution’s continuous scans of the AWS environment, the telco ensures that the highest levels of reliability are continuously maintained.
The team concluded that Coral for AWS helps them to “be more confident about our AWS infrastructure.”
The IT scenario at this large, multi-faceted European telco is characteristic of the modern enterprise, where the core of business operations resides in complex and interconnected hybrid and multi-cloud IT environments that are prone to misconfigurations and outages. The profusion of changes, updates, fixes and upgrades that routinely occur in such environments makes attaining and maintaining reliability a real challenge.
Coral meets this challenge by establishing a process that gives IT and DevOps teams visibility into, and control over, the reliability of hybrid environments and, in particular, of AWS.
The telco successfully:
- Identifies misconfigurations and prevents outages before they impact business
- Protects data in the AWS environment before it is lost, unrecoverable, damaged, or stolen
- Assures reliability of critical applications deployed on AWS
- Meets organizational reliability goals
- Increases stability and quality of AWS infrastructure
- Facilitates automated self-healing to achieve faster response time and decrease operational costs