Early, Proactive Testing Assures Resilience in the Cloud
Shifting resilience testing left, that is, testing early in the application development cycle, is a rational move that helps avoid outages and data loss. Our recent blog and July webinar discussed early, proactive resilience testing and strongly advocated this practice.
Today’s post will look at the kinds of threats to availability and security that arise from misconfigurations and single points of failure in your public cloud environments – AWS and Azure – threats you should avoid by shifting resilience testing left.
Ideally, every software change, version update, or new capability introduced by a cloud provider would be followed by validation tests that identify which configurations need to change. However, the typical development sequence rarely includes such tests. And realistically, it’s impossible to manually keep track of every reconfiguration needed to accommodate each application change, version, capability or service.
In our latest webinar, “Shifting Left on Cloud Infrastructure Availability,” we reviewed a few examples of such misconfigurations. Here’s a recap:
- EBS volume size limit is too low – Potential disruption or downtime on AWS: Five scaling groups are using the same EBS volume on a general purpose SSD. Only 307 GB is currently available to them, but when they scale up they’ll need close to 500 GB, which exceeds that capacity.
- EC2 with unrestricted SSH access – Potential security breach on AWS: EC2 instances with public IP addresses allow SSH access from any address worldwide, which is a major security risk.
- Unavailable scaling group resources – Potential disruption or downtime on AWS: A block device snapshot required by the launch configuration is missing. When an auto-scaling group needs to add an instance, the missing snapshot will prevent the launch.
- Incomplete VM snapshot – Potential data loss on AWS and Azure: A newer disk added to a VM has no snapshot. After a cyberattack or any other cause of data loss, the data on that disk would not be recoverable.
- Insufficient subnet addresses – Potential disruption or downtime on Azure: Four VM scale set (VMSS) groups are deployed on a subnet with a limit of 256 addresses. When the VMSS groups need additional instances, they will exceed the number of addresses available.
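To make a few of these checks concrete, here is a minimal Python sketch. It is an illustration only, not how any particular product implements its checks. The function names and the sample data are hypothetical. The security group dictionaries are assumed to follow the shape of the EC2 `describe_security_groups` response, and the per-scale-set instance count is invented for the example, since the webinar recap does not give one.

```python
import ipaddress


def capacity_shortfall(available, required):
    """Return how far demand exceeds capacity (0 when capacity is sufficient)."""
    return max(0, required - available)


def groups_with_open_ssh(security_groups):
    """Return IDs of security groups that allow port 22 from any address.

    Each group is assumed to look like an entry in the EC2
    describe_security_groups response (IpPermissions, IpRanges, ...).
    """
    flagged = []
    for group in security_groups:
        for perm in group.get("IpPermissions", []):
            protocol = perm.get("IpProtocol")
            if protocol == "-1":  # "-1" means all protocols and all ports
                covers_ssh = True
            elif protocol == "tcp":
                covers_ssh = perm.get("FromPort", 0) <= 22 <= perm.get("ToPort", 65535)
            else:
                covers_ssh = False
            open_v4 = any(r.get("CidrIp") == "0.0.0.0/0" for r in perm.get("IpRanges", []))
            open_v6 = any(r.get("CidrIpv6") == "::/0" for r in perm.get("Ipv6Ranges", []))
            if covers_ssh and (open_v4 or open_v6):
                flagged.append(group["GroupId"])
                break  # one offending rule is enough to flag the group
    return flagged


def usable_addresses(cidr, reserved=0):
    """Return the address count of a subnet minus provider-reserved addresses."""
    return ipaddress.ip_network(cidr).num_addresses - reserved


# Hypothetical sample data for illustration.
SAMPLE_GROUPS = [
    {"GroupId": "sg-open", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]},
    {"GroupId": "sg-office", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "203.0.113.0/24"}]}]},
]

# EBS example from the list above: 307 GB available, close to 500 GB needed.
ebs_gap_gb = capacity_shortfall(307, 500)          # 193

# Subnet example: a /24 holds 256 addresses; assume 4 scale sets of 70 instances.
subnet_gap = capacity_shortfall(usable_addresses("10.0.0.0/24"), 4 * 70)  # 24

exposed = groups_with_open_ssh(SAMPLE_GROUPS)      # ["sg-open"]
```

Note that cloud providers reserve some addresses in every subnet (Azure documents five per subnet), so passing `reserved=5` gives a more realistic headroom figure than the raw limit of 256.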
Our AvailabilityGuard NXG™ solution checks for hundreds of potential misconfigurations and single points of failure. The risks above are only a handful of the scenarios that can occur. Because native tools such as AWS Trusted Advisor conduct only about 20 checks, and Azure’s only about 5, there’s a very high likelihood that resilience issues that can cause disruption or downtime won’t be detected before they erupt.
AvailabilityGuard NXG scans the enterprise’s entire IT stack for configuration information and compares it against a constantly updated knowledge base of vendor and user best practices. This is how issues critical to resilience are discovered in time, before they cause disruption or downtime.
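The general idea of comparing collected configuration against a knowledge base of rules can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of AvailabilityGuard NXG’s internals: the `Rule` type, the rule IDs, the `scan` function and the layout of the configuration dictionary are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class Rule:
    """One knowledge-base entry: an ID, its impact, and a check function."""
    rule_id: str
    impact: str
    check: Callable[[dict], List[str]]


def disks_without_snapshot(config: dict) -> List[str]:
    """Flag every VM disk that has no snapshot at all."""
    findings = []
    for vm in config.get("vms", []):
        for disk in vm.get("disks", []):
            if not disk.get("snapshots"):
                findings.append(f"{vm['name']}: disk {disk['name']} has no snapshot")
    return findings


def launch_configs_missing_snapshots(config: dict) -> List[str]:
    """Flag launch configurations that reference a snapshot that does not exist."""
    existing = set(config.get("snapshot_ids", []))
    findings = []
    for lc in config.get("launch_configs", []):
        for snapshot_id in lc.get("snapshot_ids", []):
            if snapshot_id not in existing:
                findings.append(f"{lc['name']}: snapshot {snapshot_id} not found")
    return findings


# A tiny, hypothetical knowledge base. New rules are added as data,
# so the scanner itself does not change when best practices are updated.
KNOWLEDGE_BASE = (
    Rule("incomplete-vm-snapshot", "data loss", disks_without_snapshot),
    Rule("missing-launch-snapshot", "downtime", launch_configs_missing_snapshots),
)


def scan(config: dict, rules: Sequence[Rule] = KNOWLEDGE_BASE) -> Dict[str, List[str]]:
    """Run every rule against the configuration and collect non-empty findings."""
    report = {}
    for rule in rules:
        findings = rule.check(config)
        if findings:
            report[rule.rule_id] = findings
    return report
```

Keeping rules as data separate from the scanner is one way to let the rule set grow without touching the scanning code. A scan like this could run after every change in a delivery pipeline, which is the shift-left practice this post advocates.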