Modern enterprises are always in search of more agility and productivity – IT included. To a large extent, cloud and virtualization technologies have provided both. Dramatic innovation in cloud computing, microservice architecture, software-defined “anything” and workload orchestration fuels the change. For several years now, blade systems, hypervisors, and container orchestration platforms have made it relatively simple to create an elastic pool of compute resources. Network function and storage virtualization were much slower to mature, but are now finally enterprise-ready.
One of the more exciting Software-Defined Storage offerings out there is Dell-EMC’s ScaleIO. It allows the local storage resources of multiple servers to be pooled together into an intelligent storage fabric that supports high performance and elasticity. Both storage and IO capacity can scale linearly by adding more servers to the network.
Like other software-defined storage solutions, ScaleIO simplifies storage operations in several ways, particularly by eliminating the need for a dedicated Storage Area Network, since storage transport is carried over standard Ethernet. Provisioning and mapping storage involves fewer moving parts and fewer hardware and software layers to implement and configure.
Resiliency is achieved by implementing:
- Redundant network connections for storage nodes and clients
- Mirroring each data block across two nodes, coupled with massive striping of volumes. As the scale grows, the probability of any single node failure affecting data availability diminishes – and the rebuild time for missing mirrors shortens (as all nodes in the pool are used in parallel for reconstruction). So the larger the network, the more resilient it gets and the faster it heals
- Careful definition of node “Fault Sets” to help ScaleIO make sure a single failure will not result in the loss of both copies of any block. For example, it would make sense to configure nodes that are likely to fail together as members of the same Fault Set (e.g., blades in the same chassis, or nodes at a single site). ScaleIO will always make sure mirrored copies are stored on different Fault Sets
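The claim that larger pools heal faster can be illustrated with a back-of-envelope model. The sketch below is not ScaleIO's actual rebuild algorithm. It assumes the failed node's data is spread evenly across the pool and that every surviving node rebuilds its share in parallel at the same fixed rate. The function name and the figures in the usage loop are hypothetical.

```python
def rebuild_hours(node_count, data_per_node_tb, rebuild_mb_per_s_per_node):
    """Idealized time to re-create the mirrors lost when one node fails.

    Assumptions (illustrative only, not ScaleIO internals):
    - the failed node's data is striped evenly over all other nodes
    - each surviving node rebuilds its share in parallel at a fixed rate
    """
    survivors = node_count - 1
    if survivors < 2:
        # Two copies on different nodes need at least two surviving nodes.
        raise ValueError("model needs at least two surviving nodes")
    data_mb = data_per_node_tb * 1_000_000  # decimal TB to MB
    seconds = data_mb / (survivors * rebuild_mb_per_s_per_node)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical pool sizes; 10 TB per node, 100 MB/s rebuild rate per node.
    for nodes in (4, 11, 51, 101):
        print(nodes, "nodes:", round(rebuild_hours(nodes, 10, 100), 2), "hours")
```

Under these assumptions the rebuild time falls in proportion to the number of surviving nodes, which is the intuition behind "the larger the network, the faster it heals". Real rebuild times also depend on network capacity, throttling and competing application IO, none of which this model captures.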
Providing the right conditions for built-in resilience features to work
Of course, no matter how innovative and inherently well-designed a technology is, it is still our responsibility as users to utilize it wisely. A well-designed and well-maintained deployment will ensure seamless service. However, if something goes wrong, either in the design phase or because of insufficiently thought-out maintenance, expansion or upgrade activity, you can still end up with an unstable or unsafe deployment. To give just a few examples of things that could go wrong:
- Clients for which network connectivity to ScaleIO volumes is not fully redundant. A single failure at the NIC, cable, port or switch level will result in downtime. It’s important to “dig deep” – since the hidden point-of-failure may not be obvious. Consider, for instance, a scenario in which two different NICs are configured, and connected to two different switches or interconnects, but with just one of them actually active due to an inconsistent configuration of VLAN IDs across the host and Layer 2 network.
- Incorrect alignment between configured Fault Sets and the actual infrastructure. Consider a scenario where a typo results in nodes of the same blade chassis being configured in more than one Fault Set.
- Failure to meet best-practices applicable to your specific environment. There are quite a few vendor-specific best practices, and they are updated from time to time. For example, when VMware VMs are used to implement critical ScaleIO functions, such as storage nodes or metadata management, such VMs must: (1) use only local ESX datastores (otherwise they’ll fail to restart…), (2) meet specific tuning requirements (e.g., “UUID” must be set to “enabled”), etc.
- Cross-domain issues – for example, inconsistencies between ScaleIO infrastructure and the compute layer. Consider a scenario where you use ScaleIO snapshots, relying on ScaleIO consistency groups to take safe snapshots for different applications at different times. What if one of your database servers were to use volumes from different consistency groups for the same database instance? Those volumes would be captured at different points in time, so the snapshots taken will, obviously, be corrupt.
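Two of the checks in the list above lend themselves to simple automation. The sketch below is illustrative only: it does not call any ScaleIO API, the input dictionaries are hypothetical inventories you would have to export from your own environment, and the function and sample names are made up for this example.

```python
from collections import defaultdict


def chassis_split_across_fault_sets(node_chassis, node_fault_set):
    """Return {chassis: fault sets} for each chassis spanning several Fault Sets.

    A node missing from node_fault_set is treated as belonging to the
    fault set None, so it is also reported as a mismatch.
    """
    fault_sets_by_chassis = defaultdict(set)
    for node, chassis in node_chassis.items():
        fault_sets_by_chassis[chassis].add(node_fault_set.get(node))
    return {c: fs for c, fs in fault_sets_by_chassis.items() if len(fs) > 1}


def databases_spanning_groups(db_volumes, volume_group):
    """Return {database: groups} for databases using more than one consistency group."""
    offenders = {}
    for database, volumes in db_volumes.items():
        groups = {volume_group.get(volume) for volume in volumes}
        if len(groups) > 1:
            offenders[database] = groups
    return offenders


if __name__ == "__main__":
    # Hypothetical inventory: sds-02 sits in chassis-A but was typed into fs-2.
    node_chassis = {"sds-01": "chassis-A", "sds-02": "chassis-A", "sds-03": "chassis-B"}
    node_fault_set = {"sds-01": "fs-1", "sds-02": "fs-2", "sds-03": "fs-2"}
    print(chassis_split_across_fault_sets(node_chassis, node_fault_set))

    # Hypothetical mapping: orders-db keeps data and log volumes in different groups.
    db_volumes = {"orders-db": ["vol-data", "vol-log"], "hr-db": ["vol-hr"]}
    volume_group = {"vol-data": "cg-1", "vol-log": "cg-2", "vol-hr": "cg-1"}
    print(databases_spanning_groups(db_volumes, volume_group))
```

Both functions return an empty dictionary when nothing is wrong, so they can be run after every change and treated as a failure whenever the result is non-empty. Collecting accurate inventories across the storage, blade and database layers is the hard part, and this sketch leaves it out.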
Correct configuration is key
Successful deployment, therefore, depends on ongoing validation of your configuration. At a minimum, you should review both the design and the end result after each material change (e.g., initial deployment, upgrade, capacity increase, addition of nodes or physical storage, network-related update, etc.).
We’re proud to offer an automated method of validating your configuration. AvailabilityGuard™, our flagship product, fully inspects your ScaleIO configuration, as well as the adjacent layers (blade systems, hypervisors, OS, database, etc.) so that at any given point in time, you know if you need to make changes for your deployment to be safe. AvailabilityGuard is also business-service aware, so when a misconfiguration is uncovered, it immediately flags it on its dashboard and allows you to correctly gauge the criticality and impact. Through deep integration with service management tools (e.g., ServiceNow, Remedy, Patrol, Tivoli, HP Service Manager, etc.), it helps your team become much more proactive while boosting awareness of best-practices and correct design guidelines.