Here’s a glimpse into some issues that cause disruptions or downtime. They are not, however, a five-step blueprint for resilience; much more is involved.
Back in November, we presented a webinar on some common misconfigurations that can lead to outages in enterprises’ public cloud environments. We mentioned:
- Service limits on AWS EC2 resources: Exceeding the limit for a specific EC2 resource type in a region may impact performance or lead to downtime
- Unavailable scaling group resources: An application that needs to scale up but lacks the resources to do so will suffer downtime
- Availability Zone failure: Kubernetes affinity rules may assign pods to a specific node type that does not reside in all Availability Zones. An Availability Zone failure will then result in pod downtime, as the pod will not be able to restart in a different Availability Zone.
- Missing snapshots: Data loss may occur when not all the disks on the VM have a snapshot; this makes it impossible to revert to a specific point in time
- Insufficient number of subnet addresses on a virtual network: This may lead to application downtime when the number of subnet addresses needed exceeds the subnet’s limit
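To make the Availability Zone item concrete, here is a minimal sketch of the kind of check involved. It assumes you have already exported an inventory of nodes as `(node_type, zone)` pairs; the function name, the data shape and the zone names are hypothetical and do not come from any particular tool or cloud API.

```python
from typing import Iterable, List, Tuple


def zones_missing_node_type(
    nodes: Iterable[Tuple[str, str]],
    required_type: str,
    all_zones: Iterable[str],
) -> List[str]:
    """Return the zones that have no node of the required type.

    `nodes` is a hypothetical inventory of (node_type, zone) pairs.
    A pod pinned to `required_type` by an affinity rule can only be
    rescheduled into a zone that actually contains such a node.
    """
    zones_with_type = {zone for node_type, zone in nodes if node_type == required_type}
    return sorted(set(all_zones) - zones_with_type)


# Illustrative inventory: the "gpu" node type exists in only one zone.
inventory = [
    ("gpu", "zone-a"),
    ("general", "zone-a"),
    ("general", "zone-b"),
]
gaps = zones_missing_node_type(inventory, "gpu", ["zone-a", "zone-b"])
print(gaps)  # zones where a pod pinned to "gpu" nodes could not restart
```

If the returned list is not empty, a pod bound to that node type has nowhere to restart should its only zone fail.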
Are these the five most common misconfigurations? No, not necessarily.
But if you make sure these issues are taken care of, will you avoid downtime? No, not really, since there are hundreds of other ever-evolving potential misconfigurations and points of failure that can impact performance or lead to service disruptions or an outage. And with public cloud providers introducing new services on a daily basis, the number of potential risks is also growing.
This makes maintaining resilience in public cloud environments a complex challenge. As we know, enterprises are running an increasing percentage of their workloads on public clouds (yes, clouds, plural). According to RightScale’s 2019 State of the Cloud report, companies use an average of two public clouds and are experimenting with another two. And whereas in 2006 companies ran about 2% of their workloads on the cloud, they currently run 38%. There’s good reason for this 19-fold increase.
The cloud provides many advantages including cost savings, scalability, flexibility, agility, efficiency and the ability to innovate.
One thing it does not provide, however, is comprehensive resilience assurance. This may come as a surprise to some, since AWS, Azure and other cloud providers offer basic tools to help the enterprise follow best practices for availability, security and performance. But their coverage extends, at most, to tens of misconfiguration issues, and only within their own cloud environment. For effective resilience assurance, all potential points of failure must be addressed in each public cloud where the enterprise’s environments reside.
Maintaining resilience is meticulous work. As mentioned, there are hundreds of potential faults and, as the common misconfigurations presented above show, even a “little miscalculation” in the number of subnet addresses needed is enough to bring an application down. So, while the public cloud is certainly robust, it can also be fragile; enterprises must “cross every ‘t’ and dot every ‘i’” in order to maintain the resilience of their environments.
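The subnet arithmetic is easy to get wrong by a handful of addresses, so a small sanity check can help. The sketch below uses Python’s standard `ipaddress` module; the function names are ours, and the default of five reserved addresses per subnet reflects how AWS and Azure each hold back five addresses, so treat it as an assumption to adjust for your provider.

```python
import ipaddress


def usable_addresses(cidr: str, reserved: int = 5) -> int:
    """Addresses left for workloads after the provider's reserved ones.

    `reserved` defaults to 5 (an assumption; adjust per provider).
    """
    network = ipaddress.ip_network(cidr, strict=True)
    return max(network.num_addresses - reserved, 0)


def has_capacity(cidr: str, needed: int, reserved: int = 5) -> bool:
    """True if the subnet can hold `needed` addresses."""
    return needed <= usable_addresses(cidr, reserved)


# A /24 holds 256 addresses, but only 251 are usable with 5 reserved.
print(usable_addresses("10.0.1.0/24"))   # 251
print(has_capacity("10.0.1.0/24", 252))  # False
```

Sizing a /24 for “about 256” instances is exactly the sort of small miscalculation that leaves an application unable to scale.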
Comprehensiveness is the only option for assuring resilience. Continuity Software’s proven AvailabilityGuard™ (AG) resilience assurance solution scans all the IT layers in the enterprise’s environments – not only those on public clouds but also on private clouds, legacy, on-prem and hybrid environments – in a proactive search to detect potential faults. AG automatically conducts hundreds of checks on AWS and Azure, utilizing its proprietary knowledgebase of roughly 8,000 vendor best practices for private cloud and on-prem environments to discover where misconfigurations lie, and to provide guidance on how to repair them.
With its machine learning capabilities, AvailabilityGuard can better analyze trends and crowd knowledge, expand and optimize the checks it executes, and detect risks to assure resilience. Furthermore, enterprises can easily add their own custom checks and automate internal policies.
AG is a single pane of glass with in-depth visibility through which the enterprise learns about and manages resilience in all its environments.
View the webinar recording here.