Availability Risks Contest

Find the most risks and win an iPad Air 2

InfrastructureMania

The contest is now closed. How many risks did you find?

The following infrastructure diagram includes (at least) 24 single-points-of-failure and misconfigurations that could cause unplanned outages. Scroll down to see the results.


These are the risks

Our Sample Gap: The production RAC nodes (A, B) at site A are connected to an 8Gbps FC network (AD, AC) while the corresponding DR RAC nodes (K, L) at site B are connected to a 4Gbps FC network (AA, AB). This configuration can lead to degraded I/O performance in the event of fail-over to site B, which will affect the response time of the database and related applications, backup processes and other storage-dependent services.

Risk #1: ESX Server (E) has a single SAN I/O path. The ESX Server has a single point of failure in its storage access configuration. Should this single path fail, the virtual machines on the host will fail and service will be disrupted.

Risk #2: vMotion traffic not enabled on vMotion-intended network on an ESX server (D). Virtual machines running on hosts with vMotion traffic disabled will not be able to migrate automatically. In a DRS-enabled cluster, this will result in an imbalanced cluster and affect the stability of the virtual machines.

Risk #3: VLAN ID misconfiguration for a port group of an ESX Server (D). An incorrect VLAN setting can impact VM network availability and security following VM fail-over or vMotion to another host.

Risk #4: Incorrect port group network label on an ESX server (E). An incorrect port group network label on an ESX server will cause a virtual machine to fail after it has been relocated to the host through fail-over, restart, vMotion or other means.

Risk #5: Inconsistent number of SAN I/O paths between ESX cluster nodes (C), (D), (E) and (F). Servers (D) and (F) have two paths while Server (C) has eight paths. Virtual machines running on Servers (D) and (F) will suffer from reduced I/O load balancing capabilities. This may lead to performance degradation after relocating virtual machines from host (C) to other hosts.
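A check for this kind of inconsistency can be sketched in a few lines of Python. This is only an illustration: the path counts are the ones stated above (with host (E)'s single path taken from Risk #1), and `find_path_outliers` is a hypothetical helper, not part of any VMware tool.

```python
# Path counts per ESX host as described in the text (illustrative data,
# not collected from a real environment).
path_counts = {"C": 8, "D": 2, "E": 1, "F": 2}


def find_path_outliers(counts):
    """Return the hosts that have fewer SAN I/O paths than the best-connected host."""
    best = max(counts.values())
    return {host: n for host, n in counts.items() if n < best}


outliers = find_path_outliers(path_counts)
print(outliers)
```

Comparing against the cluster maximum is one possible baseline; a site standard (for example "at least four paths per host") would work equally well.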

Risk #6: SAN switch single point of failure for a RAC node (B). While node (B) has two HBAs, it is connected to only one SAN switch (AD). Should switch (AD) fail, node (B) will lose storage access and suffer downtime.
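The reasoning here (redundant HBAs do not help if they all land on the same switch) can be expressed as a small check. The sketch below assumes a simple cabling map; the HBA names and the second node ("X") are hypothetical and only there for contrast.

```python
# Hypothetical HBA-to-switch cabling. Only node (B)'s layout (two HBAs,
# both on switch AD) comes from the text; node "X" is made up for contrast.
cabling = {
    "B": {"hba0": "AD", "hba1": "AD"},
    "X": {"hba0": "AC", "hba1": "AD"},
}


def switch_spof(cabling_map):
    """Return {node: switch} for nodes whose HBAs all terminate on one switch."""
    result = {}
    for node, ports in cabling_map.items():
        switches = set(ports.values())
        if len(switches) == 1:
            result[node] = next(iter(switches))
    return result


print(switch_spof(cabling))
```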

Risk #7: Database running on Oracle RAC nodes (A, B) is only partially replicated to the remote site (for DR). Nodes (A) and (B) access 47 EMC VMAX LUNs, of which only 45 are replicated with SRDF. In case of disaster, data will be lost, and recovery from tape will be required.
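Replication coverage gaps like this one come down to a set difference between the LUNs a host uses and the LUNs that are replicated. A minimal sketch, assuming made-up LUN identifiers (the text gives only the counts, 47 and 45):

```python
def unreplicated(accessed, replicated):
    """Return the LUNs that are in use but have no replica, sorted by name."""
    return sorted(set(accessed) - set(replicated))


# Hypothetical LUN names; only the totals (47 accessed, 45 replicated)
# are taken from the text.
accessed = [f"lun{i:02d}" for i in range(47)]
replicated = accessed[:45]

missing = unreplicated(accessed, replicated)
print(len(missing), missing)
```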

Risk #8: Inconsistent point-in-time copy (AK) for Oracle database. The data files and archived redo log files of the Oracle database on RAC nodes (A) and (B) are stored on the same volume groups. It is therefore not possible to follow the Oracle best practice of flushing and cloning the archived redo log files only after initiating the data files clone. The outcome can be loss of transactions and recovery issues.

Risk #9: HBA single point of failure for a RAC node (A). RAC node (A) is using a single HBA port. Should this HBA fail, the node will lose storage access and suffer downtime. The same issue exists in the DR environment for nodes (K) and (L).

Risk #10: HBA running at suboptimal speed on RAC node (A). The HBA of RAC node (A) is operating at 4Gbps while its peer node and the connected switches are running at 8Gbps. As a result, the node may suffer from degraded performance.

Risk #11: Oracle database on RAC nodes (A, B) is replicated through two different EMC RDF groups. In the event of disaster, one group may fail and cease to replicate while the other remains active longer. As a result, some of the replicated database LUNs will be more current than others, rendering the entire copy inconsistent and unusable, and leading to complete data loss.

Risk #12: Production WebLogic server (R) is installed with an outdated Java release, and misaligned with its peers. WebLogic nodes (O), (Q) and (P) are installed with Java version 6 Update 6 while node (R) is outdated and installed with Update 1. This issue may impact the stability of the applications managed by the servers which rely on Java.
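Version drift of this kind can be detected by comparing every node against the most common version in the cluster. The sketch below uses the Java versions given above, encoded as hypothetical `(major, update)` tuples; `version_drift` is an illustrative helper, not a WebLogic API.

```python
from collections import Counter

# Java versions per WebLogic node as (major, update), from the text:
# (O), (P) and (Q) run 6 Update 6, (R) runs 6 Update 1.
java_versions = {"O": (6, 6), "P": (6, 6), "Q": (6, 6), "R": (6, 1)}


def version_drift(versions):
    """Return the majority version and the nodes that deviate from it."""
    baseline, _ = Counter(versions.values()).most_common(1)[0]
    drifted = {node: v for node, v in versions.items() if v != baseline}
    return baseline, drifted


baseline, drifted = version_drift(java_versions)
print(baseline, drifted)
```

Using the majority as the baseline is an assumption; it flags the odd node out but cannot tell whether the majority itself is up to date.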

Risk #13: Incorrect domain and DNS settings on ESX Server (H). Server (H) is configured with “dns0” instead of “dns02”, unlike its peers. Thus, it uses a single DNS server (dns01), which constitutes a single point of failure for the host; should dns01 fail, host (H) will not be able to resolve names, and the host and the services on it may behave unexpectedly. Furthermore, the domain is incorrectly set to D01 instead of D1.

Risk #14: Timekeeper (NTP) service is off or misconfigured on ESX Server (I). It’s a best practice to make sure time is synchronized between nodes of the same cluster. Incorrect timestamps may lead to unexpected results, including service disruption and data corruption for applications running on virtual machines on the host.

Risk #15: Development virtual machines (N, D) are running on a vSphere HA cluster that is used for Production VMs. Mixing QA/Development virtual machines with Production VMs on the same cluster is considered a dangerous practice. QA and development environments are by nature not as stable as production, and consequently may impact the stability of the production environment if resources are shared, for instance through excessive resource consumption or recurring vMotion events that overload the network.

Risk #16: RAC nodes (K, L) at site B mount NFS file systems from NetApp filers (Y) at Site A. In the event of a disaster at Site A, these file systems will become unavailable to nodes (K, L). It’s a best practice to configure the DR RAC servers to access the local (replicated) copy and ensure smooth transition to the remote site in case of disaster.

Risk #17: Only 41 out of 45 LUNs are presented and accessible to the Standby RAC node (K) at Site B. In case of fail-over to the RAC nodes at site B, node (K) will not be able to mount all the required volumes and start the database. The result will be an outage.

Risk #18: SAN Switch (AA) is connected to the storage FA through the wrong VSAN. Cisco switch (AA) is connected to the EMC storage volumes (AE) through VSAN2 instead of VSAN1. This may impact the overall storage access redundancy, since the (AE) storage volumes will de facto be accessible only from the (AB) switch, thus creating a single point of failure.

Risk #19: Unequal source volume (AI) and target replicated storage volume (AG) size. In case of fail-over to site B, it will not be possible to fail back to site A without a significant storage re-configuration effort.

Risk #20: Suboptimal SAN I/O policy (algorithm) which does not balance load on ESX hosts (G, J). While ESX hosts (G, H) have multiple FC adapters and I/O paths, the path selection policy configured for the storage volume (“fixed”) uses only a single path at any given time and does not perform load balancing. Other nodes in the cluster are configured with improved policies (though note they are still not configured consistently: MRU vs. Round-Robin).

Risk #21: Inconsistent and suboptimal disk queue depth setting between RAC nodes (A, B). The physical volume I/O queue depth is set to 32 on node (A) but to 8 on node (B). Misconfiguration of the queue depth can dramatically impact performance levels and lead to service disruptions, backup errors and more.

Risk #22: Missing / outdated WebLogic deployment binaries on node (S). All the other WebLogic nodes are running 3 deployments while node (S) has only one deployment. The other two deployments are missing, or the deployment binaries are not consistent with the other cluster nodes. This issue may lead to insufficient redundancy and load balancing, consequently impacting the availability of the application.

Risk #23: All the Site A WebLogic application servers are running on a single host (F). All the WebLogic application servers at site A are running on a single host (F). Should this host fail, all the WebLogic application servers will become unavailable.
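A placement check for this risk might look like the sketch below. It assumes a mapping from application group to VM-to-host placement; the group names and VM names are hypothetical, and only the fact that every Site A WebLogic server sits on host (F) comes from the text.

```python
def single_host_groups(placement):
    """Return {group: host} for groups whose VMs all run on the same host."""
    result = {}
    for group, vms in placement.items():
        hosts = set(vms.values())
        # A group with one VM is trivially on one host, so only flag
        # groups that were meant to be redundant (two or more VMs).
        if len(vms) > 1 and len(hosts) == 1:
            result[group] = next(iter(hosts))
    return result


# Hypothetical VM names; host letters follow the diagram's labelling.
placement = {
    "weblogic-site-a": {"wls1": "F", "wls2": "F", "wls3": "F"},
    "other-app": {"vm1": "C", "vm2": "D"},
}

print(single_host_groups(placement))
```

In a vSphere cluster the usual remedy is a VM anti-affinity rule, so that the servers of one application tier are kept on different hosts.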