The side effects of failover in a cluster

June 19, 2011

by Yaniv Valik
Sr. DR Specialist, DR Assurance Group

It happens all the time. You’ve decided to manually switch to a different node in the cluster, or maybe your active node crashed. Luckily, production services started running on the formerly passive node (well, sometimes they won’t…). Everything is up and running, but something has changed, and not for the better… usually it’s performance.

If you’re a database/system/storage administrator or an IT manager, you’re probably all too familiar with this scenario. It doesn’t matter whether you’re using Veritas Cluster, Microsoft Cluster, AIX HACMP, HP-UX MC/ServiceGuard or Sun/Linux clusters. Finding the root cause, if you ever do, can take weeks. Database, server and storage configuration (and everything in between) is so complex in today’s datacenters that there could be thousands of potential causes. Even when you have a “suspect”, testing it may result in additional side effects, or worse – downtime.

Here are a few examples of things that can go wrong and degrade performance… but it’s really just the tip of the iceberg:

Example I: Reduced I/O Settings

  • The standby/passive node has fewer I/O paths to the SAN volumes, so it can carry less I/O load. There are dozens of possible variations on this theme…
  • The standby has the same number of paths, but they are not distributed across Fibre Channel adapters and ports as well as they are on the active node
  • I/O mode differences – round-robin or another multi-path load-balancing algorithm is configured on the active node, while the standby is configured for path failover only (no load balancing)
  • Different I/O queue depth configured per device and/or per HBA (see the sketch after this list)
  • And so on
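
To make the queue-depth point concrete, here is a minimal sketch (not a recommendation of any particular tool) that compares per-device queue depth between two nodes. It assumes Linux nodes reachable over SSH with the standard sysfs layout; the host names are placeholders.

    #!/usr/bin/env python3
    # Sketch: compare per-device SCSI queue depth between two cluster nodes.
    # Assumes Linux nodes reachable over SSH and the standard sysfs layout
    # (/sys/block/<dev>/device/queue_depth). Host names are placeholders.
    import subprocess

    NODES = ("active-node", "passive-node")  # hypothetical host names

    def queue_depths(host):
        """Return {device: queue_depth} for every block device that reports one."""
        cmd = ["ssh", host, "grep -H . /sys/block/*/device/queue_depth"]
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        depths = {}
        for line in out.splitlines():
            # Lines look like: /sys/block/sdb/device/queue_depth:32
            path, value = line.rsplit(":", 1)
            depths[path.split("/")[3]] = int(value)
        return depths

    active, passive = (queue_depths(node) for node in NODES)

    # Report devices that exist on only one node or whose depths differ.
    for dev in sorted(set(active) | set(passive)):
        if active.get(dev) != passive.get(dev):
            print(f"{dev}: active={active.get(dev)} passive={passive.get(dev)}")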

Example II: Different Server Configuration

In this category the list of examples is really endless… here are a few:

  • The passive node is configured with different performance settings (for example, on Microsoft Windows, processor scheduling is adjusted for best performance of programs on the passive node but for background services on the active node)
  • The passive node does not have the latest system or application patch, service pack or version installed (see the sketch after this list)
  • The passive node does not use network interface load balancing while the active node does
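
For the patch-level point, a small sketch along the same lines, assuming RPM-based Linux nodes reachable over SSH (for Windows or Debian-style systems the inventory command would differ); host names are again placeholders:

    #!/usr/bin/env python3
    # Sketch: flag package-level drift between the active and passive node.
    # Assumes RPM-based Linux nodes reachable over SSH; adapt the query for
    # dpkg or Windows patch inventories. Host names are placeholders.
    import subprocess

    def installed_packages(host):
        """Return {package_name: version-release} as reported by rpm."""
        cmd = ["ssh", host, r"rpm -qa --qf '%{NAME} %{VERSION}-%{RELEASE}\n'"]
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        return dict(line.split(None, 1) for line in out.splitlines() if line.strip())

    active = installed_packages("active-node")    # hypothetical host name
    passive = installed_packages("passive-node")  # hypothetical host name

    # Packages present on only one node, or at different versions, are drift.
    for pkg in sorted(set(active) | set(passive)):
        if active.get(pkg) != passive.get(pkg):
            print(f"{pkg}: active={active.get(pkg, 'missing')} "
                  f"passive={passive.get(pkg, 'missing')}")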

Example III: Uneven Database-related Configuration

  • The standby/passive node is configured with reduced values for critical system parameters affecting database performance – such as shared memory parameters, semaphores, file limits, and so on (see the sketch after this list)
  • The standby/passive node has different performance-related database configuration (e.g., maximum number of processes/threads/sessions, memory pool sizes, operation mode, transaction logging settings, software logging settings, and so on)
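
For the kernel-parameter point, a similar sketch, again assuming Linux nodes reachable over SSH; the parameter list is only a small sample of values that commonly matter to databases, and the host names are placeholders:

    #!/usr/bin/env python3
    # Sketch: compare database-critical kernel parameters across cluster nodes.
    # Assumes Linux nodes reachable over SSH; the parameter list is only a small
    # sample of values that commonly affect database performance.
    import subprocess

    PARAMS = ["kernel.shmmax", "kernel.shmall", "kernel.sem", "fs.file-max"]
    NODES = ("active-node", "passive-node")  # hypothetical host names

    def sysctl_values(host):
        """Return {parameter: value} for the parameters listed above."""
        cmd = ["ssh", host, "sysctl " + " ".join(PARAMS)]
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        values = {}
        for line in out.splitlines():
            key, _, value = line.partition("=")  # e.g. "kernel.shmmax = 68719476736"
            values[key.strip()] = value.strip()
        return values

    active, passive = (sysctl_values(node) for node in NODES)
    for param in PARAMS:
        if active.get(param) != passive.get(param):
            print(f"{param}: active={active.get(param)} passive={passive.get(param)}")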

Can this be avoided?

Do not wait for the next failover event. Why have end users and application teams breathing down your neck? Verify on an ongoing basis that your clusters follow the vendor’s best practices and that all nodes are aligned in terms of software, kernel parameters, operating system settings, limits, configuration files, hardware-related configuration… and the list goes on.

Automation is required. Automated monitoring that identifies gaps between a cluster’s active and passive nodes is the only practical solution.
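
As a purely illustrative example of where such monitoring might start (this is not how any specific product works), a bare-bones drift check could run the same read-only commands on both nodes and flag any differences in their output; the commands and host names below are hypothetical:

    #!/usr/bin/env python3
    # Sketch: a bare-bones active/passive drift report. Runs the same read-only
    # commands on both nodes over SSH and prints any output differences.
    # Commands and host names are placeholders; a real monitoring setup would
    # collect far more, run on a schedule, and keep history.
    import subprocess

    NODES = ("active-node", "passive-node")  # hypothetical host names
    CHECKS = {  # label -> read-only command whose output should match on both nodes
        "multipath policy": "multipath -ll | grep -c round-robin",
        "huge pages": "sysctl vm.nr_hugepages",
        "bonding mode": "grep Mode /proc/net/bonding/bond0 2>/dev/null",
    }

    def run(host, command):
        """Run a command on a host via SSH and return whatever output it produced."""
        result = subprocess.run(["ssh", host, command], capture_output=True, text=True)
        return result.stdout.strip() or result.stderr.strip()

    for label, command in CHECKS.items():
        outputs = {host: run(host, command) for host in NODES}
        if len(set(outputs.values())) > 1:  # the nodes disagree
            print(f"DRIFT in {label}:")
            for host, output in outputs.items():
                print(f"  {host}: {output!r}")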

A failover is more than just getting everything up and running as fast as possible. Without keeping the same service levels, operations are still impaired and money is lost. RecoverGuard by Continuity Software addresses these challenges by intelligently identifying risks and vulnerabilities that may result in downtime or reduced performance in case of failover.

P.S.
Readers, it would be great if you could share your experiences with failover-related troubles with me.
