Certsys Assures Adherence to AWS Well-Architected

Certsys Adopts Continuity Software’s Coral™ for AWS to Assure Adherence to the AWS Well-Architected Framework

About Certsys

São Paulo-based Certsys serves the Brazilian market with solutions for enterprise that ease and optimize their transition to the cloud. The technology solutions it provides help companies get the most out of the cloud, making them more efficient and innovative as well as agile and competitive, while ensuring their proprietary information is kept secure. Certsys counts a range of both public and private sector enterprises among its clientele.

The Challenge

As a provider of effective solutions for enterprise, Certsys was well aware of the difficulty of assuring the resilience of IT environments, including its own. Part of their environment is hosted on AWS, so they were quite familiar with the AWS Well-Architected Framework, which is made up of five pillars: Operational Excellence, Security, Reliability, Performance Efficiency and Cost Optimization. They believe that one major way of achieving and maintaining the resilience of their AWS environment is to adhere to these five pillars and thus create a best-of-breed infrastructure, ensure that the AWS infrastructure delivers maximum benefit, and prevent common technical pitfalls. When they moved their environment to AWS, they followed the Framework’s best practices and guidelines in architecting their infrastructure.

Nonetheless, even when cloud environments are correctly architected and follow best practices, IT environments are dynamic and this leads to a decline in their resilience, reliability and security over time. This is precisely what Certsys experienced and the reason they were in search of a way to avoid performance disruptions, service unavailability, outages and data-loss incidents.

It’s true that companies whose IT environments are hosted on AWS see fewer of these disruptions, and the AWS Well-Architected Framework is one of the reasons why. Still, misconfigurations and single points of failure in IT environments are the main causes of disruptions and outages.

Why resilience decreases over time

Many factors contribute to the complexity of cloud environments and their propensity for misconfigurations, including the high velocity of changes in an AWS environment, knowledge gaps between the people and teams that maintain the environment and make changes, insufficient controls, and lack of visibility. Together, these create fertile ground for configuration errors and risks.

Certsys’ motivation

Certsys needed the assurance of knowing their environment was always in good standing with respect to the AWS Well-Architected Framework, the standard for resilience and reliability the company set for itself. They were certain that this would be the key to meeting their goals for disruption-free 24x7x365 availability and security. They turned to Continuity Software for its Coral™ for AWS solution.

Certsys tests Coral™ for AWS

Recognizing Coral’s potential, Certsys put the solution to the test.

Certsys knew that its AWS environment’s adherence to the AWS Well-Architected Framework had to be handled automatically and proactively and that they could achieve that by using Coral for AWS.

In early 2020, Certsys conducted a 14-day trial of Coral™ for AWS on a representative subset of the production environment of 500 nodes deployed in two regions: US East (Ohio) and South America (São Paulo). The nodes covered critical applications such as their CRM system. The main AWS services used by the company include EC2 instances, RDS instances, Auto Scaling groups (ASG), IAM, and VPC.

How Coral for AWS achieves and maintains IT resilience

Coral is a SaaS solution deployed on AWS that automatically and proactively detects misconfigurations and risks across all components of AWS environments including virtual machines, containers, networks, load balancers, databases, cloud storage, DNS, and more.

To identify these risks the solution accesses its proprietary knowledge base containing hundreds of rules covering the best practices needed to maintain the AWS Well-Architected Framework for each of the five pillars. This process allowed Certsys to gain visibility into their AWS environment and configuration, pinpoint problem areas and enable their repair before they lead to a security breach, costly disruptions or outages and impact business.

The Coral UI shows all identified risks and provides a detailed description of each problem and the recommended steps for resolving it. The dashboard shows an overall Health score and a breakdown of risks by region, urgency, impact, and domain.

Results

The pie charts below show the breakdown of risks to the various AWS Well-Architected Framework pillars detected by the solution.

It’s important to note that since the Coral trial was conducted on only a portion of Certsys’ nodes, it must be assumed that there are additional risks in the portion of the production and staging environments not scanned.

This chart shows that the risks detected span all infrastructure layers including virtual machines, databases, scaling groups, identity, access management, and more.

Types of risks detected and their repair

The AWS Well-Architected Framework covers five pillars. For each pillar AWS provides a list of questions to which an organization should be able to provide the answers that will ensure it complies with the Framework. Below we present four of the risks discovered by Coral for AWS during Certsys’ 14-day trial period. Two risks relate to the reliability pillar and two to the security pillar.

Note that not complying with these Well-Architected Framework standards can lead to disruptions and outages.

Risk #1

Reliability Pillar, Question 2 (on the AWS Well-Architected Tool): How do you manage your network topology?
Possible answer: Use highly available connectivity between private addresses in public clouds and on-premises environment.

Which rule was violated?
Site-to-site VPN tunnel redundancy: site-to-site VPN connections have only one active tunnel.

Description
Site-to-site VPN connections in region sa-east-1 have one inactive tunnel. In case of an outage or planned maintenance of the devices at the AWS endpoint, this could lead to network failure and application downtime.

Impact and implication
A site-to-site VPN connection provides two tunnels to help ensure connectivity in case one of the tunnels becomes unavailable. Having only one available tunnel represents a single point of failure that may lead to network disconnects between the data center (or network) and the VPC.

Resolution
Ensure both tunnels are available.

If a static VPN is used, verify that the on-premises firewall is properly configured.
If a dynamic VPN with BGP is used and "IPSEC IS UP" is indicated in the Details column, be sure to configure BGP properly on the firewall.
In addition, be sure to enable route propagation in the VPC route table.
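As a rough illustration of the tunnel-redundancy check described above, the following Python sketch flags VPN connections with fewer than two tunnels reporting UP. It is not Coral's implementation. It assumes the input is a dictionary shaped like the response of the EC2 DescribeVpnConnections API (for example, as returned by boto3's describe_vpn_connections), and the connection IDs in the sample are hypothetical.

```python
from typing import Any


def vpn_connections_lacking_redundancy(response: dict[str, Any]) -> list[str]:
    """Return IDs of available VPN connections with fewer than two tunnels UP."""
    flagged = []
    for conn in response.get("VpnConnections", []):
        if conn.get("State") != "available":
            continue  # ignore connections that are pending or being deleted
        tunnels_up = sum(
            1 for tunnel in conn.get("VgwTelemetry", [])
            if tunnel.get("Status") == "UP"
        )
        if tunnels_up < 2:
            flagged.append(conn["VpnConnectionId"])
    return flagged


# Hypothetical sample data; real input would come from the AWS API.
sample_vpn_response = {
    "VpnConnections": [
        {"VpnConnectionId": "vpn-example-a", "State": "available",
         "VgwTelemetry": [{"Status": "UP"}, {"Status": "DOWN"}]},
        {"VpnConnectionId": "vpn-example-b", "State": "available",
         "VgwTelemetry": [{"Status": "UP"}, {"Status": "UP"}]},
    ]
}

print(vpn_connections_lacking_redundancy(sample_vpn_response))  # ['vpn-example-a']
```

Keeping the check as a pure function over the response dictionary means it can be tried on saved API output without AWS credentials.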

Risk #2

Reliability Pillar, Question 7 (on the AWS Well-Architected Tool): How does your system withstand component failures?
Possible answer: Deploy the workload to multiple locations.

Which rule was violated?
RDS instances deployed in a single availability zone.

Description
RDS instances in the sa-east-1 region are deployed in a single availability zone. This could lead to unplanned downtime during a database outage.

Impact and implication
Amazon RDS with multi-AZ deployment maintains a synchronous standby replica in a different availability zone than that of the active DB instance. This configuration provides data redundancy, high availability, and failover support. It also eliminates I/O freezes and minimizes latency spikes during system backups. RDS configuration without multi-AZ deployment is vulnerable to outages that could lead to application downtime.

Resolution
Modify the RDS instance to enable multi-AZ deployment. Use the following steps:

Open the Amazon RDS console at https://console.aws.amazon.com/rds/
Click Databases
Click on the database you want to reconfigure
Click Modify
Select yes for multi-AZ deployment
Click Continue, then click Modify DB instance
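The same check and change can also be scripted. The sketch below is illustrative only and assumes input shaped like the response of the RDS DescribeDBInstances API (as returned by boto3's describe_db_instances). The second function only builds the keyword arguments one might pass to boto3's modify_db_instance; it does not call AWS. The instance names are hypothetical.

```python
from typing import Any


def single_az_instances(response: dict[str, Any]) -> list[str]:
    """Return identifiers of DB instances that are not Multi-AZ."""
    return [
        db["DBInstanceIdentifier"]
        for db in response.get("DBInstances", [])
        if not db.get("MultiAZ", False)
    ]


def multi_az_request(db_id: str, apply_immediately: bool = False) -> dict[str, Any]:
    """Build arguments for a modify_db_instance call that enables Multi-AZ.

    With apply_immediately=False the change waits for the next maintenance window.
    """
    return {
        "DBInstanceIdentifier": db_id,
        "MultiAZ": True,
        "ApplyImmediately": apply_immediately,
    }


# Hypothetical sample data; real input would come from the AWS API.
sample_rds_response = {
    "DBInstances": [
        {"DBInstanceIdentifier": "example-db-1", "MultiAZ": False},
        {"DBInstanceIdentifier": "example-db-2", "MultiAZ": True},
    ]
}

for db_id in single_az_instances(sample_rds_response):
    print(multi_az_request(db_id))
```

With a configured boto3 RDS client, each printed dictionary could be passed as keyword arguments to modify_db_instance.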

Risk #3

Security Pillar, Question 6 (on the AWS Well-Architected Tool): How do you protect your networks?
Possible answer: Limit exposure.

Which rule was violated?
EC2 with unrestricted access: There are public EC2 instances with unrestricted TCP access.

Description
EC2 instances in region sa-east-1 with a Public IP address have unrestricted TCP access. This poses a security risk.

Impact and implication
Allowing unrestricted (0.0.0.0/0 or ::/0) access makes the system vulnerable to malicious activity such as hacking, man-in-the-middle and brute-force attacks. The recommendation is to restrict access to specific security groups or IP addresses that require it and to implement the principle of least privilege in order to reduce the possibility of a breach.

Resolution
To restrict network access using the AWS Management Console:

Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/
In the navigation pane, under NETWORK & SECURITY, click on Security Groups
Select the security group that you would like to reconfigure
On the Inbound tab, click Edit
Change the source for any inbound rules that allow unrestricted access
Click Save
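To find the rules that need changing before opening the console, a check along the following lines may help. This is a minimal sketch rather than the rule Coral applies. It assumes input shaped like the response of the EC2 DescribeSecurityGroups API (as returned by boto3's describe_security_groups), treats the protocol value "-1" (all protocols) as including TCP, and does not verify whether a group is attached to an instance with a public IP address. The group IDs are hypothetical.

```python
from typing import Any, Optional

OPEN_CIDRS = {"0.0.0.0/0", "::/0"}

Finding = tuple[str, Optional[int], Optional[int], str]


def open_tcp_rules(response: dict[str, Any]) -> list[Finding]:
    """Return (group_id, from_port, to_port, cidr) for inbound rules open to the world."""
    findings = []
    for group in response.get("SecurityGroups", []):
        for perm in group.get("IpPermissions", []):
            if perm.get("IpProtocol") not in ("tcp", "-1"):
                continue  # only TCP, or rules that allow every protocol
            cidrs = [r.get("CidrIp") for r in perm.get("IpRanges", [])]
            cidrs += [r.get("CidrIpv6") for r in perm.get("Ipv6Ranges", [])]
            for cidr in cidrs:
                if cidr in OPEN_CIDRS:
                    # Port fields are absent when the rule covers all protocols.
                    findings.append(
                        (group["GroupId"], perm.get("FromPort"), perm.get("ToPort"), cidr)
                    )
    return findings


# Hypothetical sample data; real input would come from the AWS API.
sample_sg_response = {
    "SecurityGroups": [
        {"GroupId": "sg-example-1", "IpPermissions": [
            {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
             "IpRanges": [{"CidrIp": "0.0.0.0/0"}], "Ipv6Ranges": []},
            {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
             "IpRanges": [{"CidrIp": "10.0.0.0/16"}], "Ipv6Ranges": []},
        ]},
        {"GroupId": "sg-example-2", "IpPermissions": [
            {"IpProtocol": "-1", "IpRanges": [],
             "Ipv6Ranges": [{"CidrIpv6": "::/0"}]},
        ]},
    ]
}

for finding in open_tcp_rules(sample_sg_response):
    print(finding)
```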

Risk #4

Security Pillar, Question 9 (on the AWS Well-Architected Tool): How do you protect your data at rest?
Possible answer: Enforce encryption at rest.

Which rule was violated?
RDS without storage encryption.

Description
RDS instances in region sa-east-1 do not have storage encryption. This poses a data security risk. If storage encryption is needed at a later time, it can lead to downtime (see next section).

Impact and implication
Amazon RDS DB instances and snapshots at rest can be encrypted by enabling the encryption option for the relevant Amazon RDS DB instances. Data that is encrypted at rest includes the underlying storage for the DB instance, as well as its automated backups, read replicas and snapshots. Storage encryption cannot be enabled on an existing DB instance that was created without it. Therefore, if encryption is needed at a later time, it will be necessary to back up the database and restore it to an encrypted instance, an action that results in extended downtime.

Resolution
Make sure that all RDS instances use storage encryption.
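A sketch of how this could be checked and planned in Python follows. It assumes input shaped like the response of the RDS DescribeDBInstances API. The plan function only lists the boto3 RDS client operations commonly used for the back-up-and-restore path (snapshot, encrypted snapshot copy, restore) together with their arguments; it makes no AWS calls. The snapshot names, the "-encrypted" suffix, and the KMS key alias are hypothetical.

```python
from typing import Any


def unencrypted_instances(response: dict[str, Any]) -> list[str]:
    """Return identifiers of DB instances without storage encryption."""
    return [
        db["DBInstanceIdentifier"]
        for db in response.get("DBInstances", [])
        if not db.get("StorageEncrypted", False)
    ]


def encryption_migration_plan(db_id: str, kms_key_id: str) -> list[tuple[str, dict[str, str]]]:
    """List (operation, arguments) steps for moving a DB to encrypted storage."""
    plain_snapshot = f"{db_id}-plain"          # hypothetical naming scheme
    encrypted_snapshot = f"{db_id}-encrypted-snap"
    return [
        ("create_db_snapshot",
         {"DBInstanceIdentifier": db_id, "DBSnapshotIdentifier": plain_snapshot}),
        ("copy_db_snapshot",  # supplying a KMS key makes the copy encrypted
         {"SourceDBSnapshotIdentifier": plain_snapshot,
          "TargetDBSnapshotIdentifier": encrypted_snapshot,
          "KmsKeyId": kms_key_id}),
        ("restore_db_instance_from_db_snapshot",
         {"DBInstanceIdentifier": f"{db_id}-encrypted",
          "DBSnapshotIdentifier": encrypted_snapshot}),
    ]


# Hypothetical sample data; real input would come from the AWS API.
sample_encryption_response = {
    "DBInstances": [
        {"DBInstanceIdentifier": "example-db-1", "StorageEncrypted": False},
        {"DBInstanceIdentifier": "example-db-2", "StorageEncrypted": True},
    ]
}

for db_id in unencrypted_instances(sample_encryption_response):
    for operation, arguments in encryption_migration_plan(db_id, "alias/example-key"):
        print(operation, arguments)
```

Each step must finish before the next begins, and applications have to be repointed to the restored instance, which is where the downtime noted above comes from.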

Conclusion: The bottom line

Certsys adopted Coral for AWS. They were able to quickly repair the urgent misconfigurations and worked the others into their repair schedule. The trial demonstrated that they could rely on the solution to keep their environment aligned with the AWS Well-Architected Framework’s best practices. The solution’s ease of use also persuaded them: it proactively detects risks and provides simple-to-follow steps to repair misconfigurations and errors, making it easier to comply with the AWS Well-Architected Framework, a goal they see as crucial for maintaining a secure environment free of disruptions and outages.

Certsys also finds it useful that detected risks can be reported as incidents in the ITSM tools they already use. The key is to discover issues in time, before they cause outages or service unavailability. Using this proactive solution, Certsys also protects data against loss, damage, and theft and ensures data is always recoverable.

Customer Value

Coral for AWS provides Certsys with visibility and confidence, a consistent approach to evaluating its architecture, ongoing IT resilience, and the ability to swiftly add new apps and services while meeting the five pillars defined by the AWS Well-Architected Framework.

With Coral for AWS, Certsys:

Assures continuous reliability and resilience in their AWS environments
Ensures adherence to all five pillars of the AWS Well-Architected Framework
Prevents IT outages and data loss incidents before they impact business
Maximizes benefits achieved through cloud adoption
Achieves faster response times and reduces operational costs using automatic healing capabilities

“By continuously assuring reliability, operational excellence and overall resilience, Continuity Software’s Coral for AWS plays a critical role in our AWS operations.”

João Paulo Teixeira

Chief Customer Officer