About CloudZone
CloudZone helps enterprises move to the cloud, providing end-to-end cloud management services supported by DevOps engineers. CloudZone is committed to ensuring its customers adopt the most advanced technologies to improve reliability, optimize cloud infrastructure performance, increase data security and cut cloud costs. A significant part of achieving these aims is deploying the right solutions to automate tasks so that cloud operations run more efficiently.
CloudZone is an AWS Premier Consulting Partner and has been working with AWS for a decade.
The challenge
CloudZone knows the ins and outs of managing AWS infrastructure. The company’s IT environment is hosted on AWS, as are the environments of most of its managed services customers. CloudZone has deep and wide experience with AWS and adheres to the pillars of the AWS Well-Architected Framework. However, despite being a seasoned cloud managed service provider, the company was concerned about core operational issues: performance degradation, disruptions and outages, and imperfect data protection that could lead to data loss. The company runs critical applications on AWS, including billing, CRM and others, and wanted to be sure they were always available with all data intact.
How cloud reliability is impacted
In dynamic cloud environments, misconfigurations arise from rapid changes to services and apps that frequently are not validated before going live. A related source of misconfiguration is the multiplicity of maintenance teams handling changes at the cloud provider – teams that don’t always have knowledge of changes made by others. All these conditions can often lead to performance disruptions, outages and even data-loss incidents.
CloudZone searches for an automated third-party solution
CloudZone wanted to overcome the obstacles to reliability and data protection. They knew this came down to ensuring there were no misconfigurations in their environment that could lead to disruptions, particularly in many aspects of performance and data protection, but also in areas such as security, compliance, and others. They were searching for a proven solution that continually checks on the status of misconfigurations throughout their AWS environment and then notifies them of the results, provides recommendations and facilitates automated self-healing – in other words, a proactive solution. At the same time, a side benefit of locating the right solution for their own needs would be the addition of another solution they could confidently recommend to their customers.
The solution: AvailabilityGuard NXG™ for AWS
CloudZone was referred to Continuity Software’s built-for-the-cloud AvailabilityGuard NXG™ for AWS solution, used by some of the world’s leading enterprises to ensure resilience and reliability.
How AvailabilityGuard NXG for AWS works
A SaaS solution, AvailabilityGuard NXG addresses the reliability of AWS environments by continually scanning production workloads including AWS services such as virtual machines, containers, networks, load balancers, databases, storage, DNS, and more. The solution collects configuration data (metadata) from AWS via 20 (and growing) AWS native APIs using read-only privileges and employing secure and lightweight data collection. The scans do not and cannot change configurations and no agent is installed.
Used in conjunction with the scans is another key solution component, Continuity Software’s proprietary knowledgebase, which contains 300+ (and growing) rules covering the best practices needed to maintain reliability, protect data, and more. Configuration data collected by the scans are compared against information in the knowledgebase. Deviations from best practices, regulations (where relevant), SLAs, etc., become incident tickets to be repaired. Instructions for repair are provided, including automated self-healing functions to achieve faster response time and decrease operational costs. This proactive process is critical to maintaining continuously-available AWS environments as it prevents performance disruptions and outages and makes sure data is not vulnerable to loss, damage and theft, and is always recoverable.
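The comparison of collected configuration data against a set of rules can be pictured with a short sketch. This is an illustration only, not Continuity Software's implementation: the `Rule` and `Finding` types, the record shape, and the rule ID `EXAMPLE-001` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# One collected configuration record. The keys used here are hypothetical.
Resource = Dict[str, object]


@dataclass(frozen=True)
class Rule:
    rule_id: str                             # hypothetical identifier
    impact: str                              # e.g. "downtime" or "data loss"
    applies_to: str                          # resource type the rule inspects
    is_violated: Callable[[Resource], bool]  # True when best practice is not met
    remediation: str                         # repair instruction shown to the user


@dataclass(frozen=True)
class Finding:
    rule_id: str
    resource_id: str
    impact: str
    remediation: str


def evaluate(resources: List[Resource], rules: List[Rule]) -> List[Finding]:
    """Compare every scanned resource against every applicable rule."""
    findings = []
    for resource in resources:
        for rule in rules:
            if resource.get("type") == rule.applies_to and rule.is_violated(resource):
                findings.append(
                    Finding(rule.rule_id, str(resource["id"]), rule.impact, rule.remediation)
                )
    return findings


RULES = [
    Rule(
        rule_id="EXAMPLE-001",
        impact="data loss",
        applies_to="ec2_instance",
        is_violated=lambda r: not r.get("termination_protection", False),
        remediation="Enable termination protection.",
    ),
]

SCAN = [
    {"type": "ec2_instance", "id": "i-aaa", "termination_protection": False},
    {"type": "ec2_instance", "id": "i-bbb", "termination_protection": True},
    {"type": "rds_instance", "id": "db-1"},
]

findings = evaluate(SCAN, RULES)
for f in findings:
    print(f.rule_id, f.resource_id, f.impact, f.remediation)
```

Keeping each rule as data plus a predicate means new rules can be added without touching the evaluation loop, which is one plausible way a growing rule set could be maintained.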
The solution’s UI shows all identified risks and provides a detailed description of each problem and the recommended steps for resolving it. The dashboard shows an overall Health score and a breakdown of risks by region, urgency, impact, business entity and domain. Below is a sample screen:
CloudZone tests AvailabilityGuard NXG for AWS
In January 2020, CloudZone conducted a 14-day trial of the AvailabilityGuard NXG for AWS solution. Four AWS accounts containing up to 500 nodes deployed in the US-East (N. Virginia) region comprised the scanned environment.
Business services (critical workload) included:
The main AWS services in use were: EC2, S3, ELB, ASG, CloudFront, ECS, Redshift, SQS.
Results
During the test period, 136 configuration risks were uncovered, the majority of which could lead to downtime, data loss or security exposure. The breakdown of risks is seen below:
Below are the risks detected by severity.
The following chart shows that downtime and data loss risks affected all infrastructure layers including virtual machines, databases, scaling groups, load balancers, CloudFront, ECS and others.
Risks detected and their repair
As seen above, AvailabilityGuard NXG for AWS detected severe risks of downtime and data loss. This section goes into greater detail about the risks and describes how to repair them. Two data loss and two downtime risks are described.
Risk #1: Data loss; high risk urgency
Problem: Unprotected EC2 instance with delete-on-termination enabled and no termination protection. Accidental instance termination through the console, the API, or the CLI can cause downtime and even data loss.
Affected business entity: Monitoring system
Impact: If termination protection is not enabled, there is a risk of accidentally terminating EC2 instances, which can lead to downtime. In addition, when the delete-on-termination attribute is set to true, there is the risk of data loss. Enabling this attribute causes the volumes associated with the EC2 instance to be deleted when the instance is terminated.
Resolution
Make sure that static EC2 instances with delete-on-termination that are provisioned outside an auto-scaling group have the termination protection safety feature enabled. This protects the EC2 instances from accidental termination and possible downtime and data loss.
To prevent data loss during termination, set the delete-on-termination attribute to false.
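The condition behind this risk can also be checked in code. The sketch below is a minimal illustration, assuming instance records shaped like the output of the EC2 `DescribeInstances` API and a separately fetched termination-protection flag; the function name `is_at_risk` and the sample data are hypothetical.

```python
def is_at_risk(instance: dict, termination_protected: bool) -> bool:
    """Flag a static instance whose volumes would be deleted by an
    accidental termination that nothing prevents."""
    tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
    # Instances launched by an Auto Scaling group carry this AWS-managed tag;
    # they are expected to be replaceable, so they are skipped here.
    if "aws:autoscaling:groupName" in tags:
        return False
    deletes_volumes = any(
        mapping.get("Ebs", {}).get("DeleteOnTermination", False)
        for mapping in instance.get("BlockDeviceMappings", [])
    )
    return deletes_volumes and not termination_protected


# Hypothetical sample records in the DescribeInstances shape.
static_instance = {
    "InstanceId": "i-0123456789abcdef0",
    "Tags": [{"Key": "Name", "Value": "monitoring"}],
    "BlockDeviceMappings": [
        {"DeviceName": "/dev/xvda", "Ebs": {"DeleteOnTermination": True}}
    ],
}
asg_instance = {
    "InstanceId": "i-0fedcba9876543210",
    "Tags": [{"Key": "aws:autoscaling:groupName", "Value": "web-asg"}],
    "BlockDeviceMappings": [
        {"DeviceName": "/dev/xvda", "Ebs": {"DeleteOnTermination": True}}
    ],
}

print(is_at_risk(static_instance, termination_protected=False))
print(is_at_risk(static_instance, termination_protected=True))
print(is_at_risk(asg_instance, termination_protected=False))
```

With boto3, the protection flag would typically be read with `describe_instance_attribute(Attribute="disableApiTermination")` and set with `modify_instance_attribute(DisableApiTermination={"Value": True})`; treat the exact wiring as an assumption to verify against your own environment.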
To enable the termination protection feature using the AWS Management Console:
Risk #2: Data loss; high risk urgency
Problem: RDS without manual snapshot: two large databases in the us-east-1a Availability Zone (US-East-1 region).
Affected business entity: CRM application
Impact: When an RDS instance is deleted, all the automatic snapshots are deleted as well. If there is no manual snapshot, the data will not be recoverable. This means that an accidental deletion of the instance results in having no snapshots; recovery is not possible.
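A check for this condition might look like the sketch below. It assumes snapshot records shaped like the output of the RDS `DescribeDBSnapshots` API (`DBInstanceIdentifier`, `SnapshotType`, `Status`); the function name and the sample identifiers are hypothetical.

```python
from typing import Iterable, List


def instances_without_manual_snapshot(
    instance_ids: Iterable[str], snapshots: Iterable[dict]
) -> List[str]:
    """Return the instances that have no completed manual snapshot."""
    protected = {
        snap["DBInstanceIdentifier"]
        for snap in snapshots
        # Only manual snapshots count as protection for this check.
        if snap.get("SnapshotType") == "manual" and snap.get("Status") == "available"
    }
    return [db for db in instance_ids if db not in protected]


# Hypothetical sample data in the DescribeDBSnapshots shape.
snapshots = [
    {"DBInstanceIdentifier": "crm-db-1", "SnapshotType": "automated", "Status": "available"},
    {"DBInstanceIdentifier": "crm-db-2", "SnapshotType": "manual", "Status": "available"},
    {"DBInstanceIdentifier": "crm-db-3", "SnapshotType": "manual", "Status": "creating"},
]

unprotected = instances_without_manual_snapshot(
    ["crm-db-1", "crm-db-2", "crm-db-3"], snapshots
)
print(unprotected)
```

In boto3, a manual snapshot can be requested with `create_db_snapshot(DBSnapshotIdentifier=..., DBInstanceIdentifier=...)`; the identifiers are left for the reader to supply.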
Resolution
A manual snapshot must be created using the console. To do so:
Risk #3: Downtime; high risk urgency
Problem: Internet-facing load balancer (aws-lb-nhs) with a private subnet; implications for application availability.
Affected business entity: Notification application
Impact: If a subnet of an internet-facing load balancer is private – i.e., no route to an internet gateway – incoming traffic for this subnet is dropped. This can lead to application downtime.
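The definition of a private subnet used above (no route to an internet gateway) translates directly into a check. The sketch below is illustrative: it assumes a load balancer record shaped like the ELBv2 `DescribeLoadBalancers` output and a hypothetical `routes_by_subnet` mapping in which each subnet's effective routes have already been resolved, including any fallback to the VPC's main route table.

```python
from typing import Dict, List


def has_internet_gateway_route(routes: List[dict]) -> bool:
    """A subnet is public when any of its routes targets an internet gateway."""
    return any(str(r.get("GatewayId", "")).startswith("igw-") for r in routes)


def private_subnets_of_lb(lb: dict, routes_by_subnet: Dict[str, List[dict]]) -> List[str]:
    """List the subnets of an internet-facing load balancer that cannot
    receive traffic from the internet."""
    if lb.get("Scheme") != "internet-facing":
        return []  # internal load balancers are expected to use private subnets
    return [
        zone["SubnetId"]
        for zone in lb.get("AvailabilityZones", [])
        if not has_internet_gateway_route(routes_by_subnet.get(zone["SubnetId"], []))
    ]


# Hypothetical sample data; subnet and gateway IDs are made up.
load_balancer = {
    "LoadBalancerName": "aws-lb-nhs",
    "Scheme": "internet-facing",
    "AvailabilityZones": [{"SubnetId": "subnet-public"}, {"SubnetId": "subnet-private"}],
}
routes_by_subnet = {
    "subnet-public": [
        {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
        {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0abc"},
    ],
    "subnet-private": [
        {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
        {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0def"},
    ],
}

print(private_subnets_of_lb(load_balancer, routes_by_subnet))
```

A subnet that routes outbound traffic through a NAT gateway is still private for this purpose, which is why the check looks only at `GatewayId` values that start with `igw-`.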
Resolution
Change the subnet of a load balancer using the AWS Management Console:
Risk #4: Downtime; high risk urgency
Problem: ECS service rolling update resource issue; ECS service crm-w1 in the crmclus cluster will not be able to perform a rolling update successfully.
The service configuration has two conflicting settings: the update procedure needs to start new tasks without stopping the current ones, but each container instance has enough memory to run only one task. This might result in a long downtime during task upgrades.
Affected business entity: CRM application
Impact: If the rolling update procedure cannot stop tasks (due to the minimum healthy percentage configuration) and cannot start additional tasks (due to capacity issues in the ECS cluster), you will have to completely stop the application to upgrade the service. This may result in lengthy downtime.
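The deadlock described above can be expressed as a small calculation. The sketch below is a simplified model, assuming the documented ECS rounding behavior (the minimum healthy percent is rounded up to a task count, the maximum percent is rounded down); the function name and the `spare_task_slots` parameter, which stands for how many additional tasks the cluster has room for, are hypothetical.

```python
def rolling_update_can_progress(
    desired_count: int,
    minimum_healthy_percent: int,
    maximum_percent: int,
    spare_task_slots: int,
) -> bool:
    """Return True if a rolling update can take at least one step."""
    # Fewest tasks that must keep running during the deployment (rounded up).
    min_running = (desired_count * minimum_healthy_percent + 99) // 100
    # Most tasks allowed to run at once during the deployment (rounded down).
    max_running = (desired_count * maximum_percent) // 100

    # Option 1: stop an old task first, which needs headroom below desired_count.
    can_stop_first = min_running < desired_count
    # Option 2: start a new task first, which needs both permission and capacity.
    can_start_first = max_running > desired_count and spare_task_slots > 0
    return can_stop_first or can_start_first


# The conflicting configuration: nothing may stop, and nothing fits.
print(rolling_update_can_progress(4, 100, 200, spare_task_slots=0))
# Allowing some tasks to stop lets the update proceed.
print(rolling_update_can_progress(4, 50, 200, spare_task_slots=0))
# Adding cluster capacity also lets the update proceed.
print(rolling_update_can_progress(4, 100, 200, spare_task_slots=1))
```

The model ignores placement constraints and per-instance resource fragmentation, so a True result is a necessary condition rather than a guarantee.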
Resolution
Use one of these methods:
Conclusion: The bottom line
The many critical misconfigurations detected by AvailabilityGuard NXG for AWS, together with its user-friendliness, convinced CloudZone to adopt the solution. A misconfiguration can appear as minor as forgetting to cross a “t” or dot an “i,” yet its implications can be major, leading to downtime and data loss. The solution’s automation and proactivity were therefore especially important and appealing to CloudZone, as were the clear, easy-to-follow repair instructions.
Customer value
AvailabilityGuard NXG for AWS provides CloudZone with a reliability and data protection assurance solution it can count on. The company benefits from a solution that: