Datacenter Management Series – Downtime Avoidance in HA Clusters

Note: This is a transcript of a webinar that was originally presented in March 2010.

“Hi, and thank you for joining us today. While people are still joining, we already have a pretty large attendance, and I am very pleased and humbled by that. We shall begin, and hopefully those who arrive later will catch up.

And so, let me just begin by introducing myself. My name is Doron Pinhas; I am the CTO of Continuity Software, and I will be walking you through this presentation about avoiding downtime in high availability clusters. Let me remind you that after this formal session we will have an informal question and answer session, and at the end of this session I will show you how to take part in it through your panel.

 Slide2 - WebHA
So, without further ado, and before we actually dive into the real content, just a few words about us; I promise to keep this session as free as possible from any marketing references. We at Continuity Software are a team of professionals in the area of high availability and DR. When we gathered around 2004, we all came to the realization that too often disaster recovery, high availability and cluster solutions do not work as expected. Rather than putting the blame on the tools themselves, we wanted to understand what makes it so – because obviously many of the tools out there are mature enough – and so why do they break?

Eventually we came up with an answer – and I am going to reveal some of our findings today in this session. Basically, it’s a management issue. Our mission in life is to provide the right set of management tools that allow customers to put their current investment in clustering, high availability, storage, and other data center technologies into the best possible use. We have an award-winning technology and we work closely with a large number of enterprises, many of which are Fortune 500 companies, which puts us in a very pleasing position: as an unbiased vendor we do not actually produce any clustering tool, but rather management software, so we have the rare opportunity to listen and collect reflections, thoughts and pieces of experience from many customers. Our way of giving back to the community today is to reiterate some of the most common things we typically hear. I hope you will enjoy this presentation.

 Slide3 - WebHA

To put the discussion in the right frame, I will start by stating the obvious: no one really wants to lose data, and no one really wants to suffer unplanned downtime. That is obviously true for the home user, and as enterprises go it typically involves the following approach: we put lots of high availability into every one of our datacenters – that involves clustering, which is the main topic of today’s session – as well as other technologies such as storage arrays, SAN, multipathing, virtualization, etc.

Possibly, the next logical step – once we have achieved local redundancy – is to look at the more geographical scenario by introducing all sorts of DR tools that involve replication and, in our case, geo-clustering and so on. So, basically, once this investment is completed we are theoretically in a state in which we should have automated failover with a high level of data protection in case of real-life problems. Reality shows that this is too often not the case, and I want to elaborate a little bit on why that is so. Our analysis came to a rather obvious description of the problem, and we see more and more customers referring to the issue of configuration drift – which I will elaborate on in just a while – as the main reason for clustering and high availability systems getting derailed, or not working as expected when needed.

Essentially, we now have a data center with a certain portion of the systems actually running the production role, and a relatively large number of storage devices and servers that serve as standby and are just waiting out there for the right moment in which we will test them, or in which there will be a real failure event. Basically, on the production end, we tend to make changes all the time. There is a variety of possible changes – from adding a new user, applying a patch, upgrading old software, replacing an old server (and these are relatively minor-scale changes), to adding more storage, and so on and so forth. Some changes are larger in scale, such as replacing a storage array, which may involve changes on hundreds of servers, and sometimes even larger – such as a data center migration. So we have change going on all the time, and the nature of change in production is such that we pretty much know that it works. Grudgingly, perhaps, you will get some time to test it before it is actually brought into production; and even if something still goes wrong, our users will let us know that it’s broken, so we will fix it. Generally speaking, the production portion is in good shape.

Now, when we look at all the standby systems, often enough they also need to change as a result of the corresponding change in production. And here, to begin with, there is a chance that some things may go unnoticed. We may not notice that a certain system also needs to be updated on the standby side, and even if we do remember, we have very little opportunity to test it frequently enough. No one can really flip the switch at the end of each day just to make sure that the systems still work. Which brings us to a very frustrating position in which we apply very complex changes at times – and have zero ability to test them frequently enough, and this in itself introduces a very significant opportunity for misconfiguration to slip in. In today’s session I intend to show you quite a few examples of those possible areas.

 Slide6 - WebHA

And finally, before I move to the main topic of the discussion and begin to frame it, a few words about clustering in general. From the functional perspective, a typical clustering solution – and we are talking, by the way, about clustering from multiple vendors, all in general suffering from the same ailments, even though most of these [solutions] are mature and great products (Microsoft, HP, IBM, Sun and Linux have their own clusters; there are independent vendors such as Veritas; and there are other flavors out there, including database clusters such as RAC, and so on) – all of these are built around the concept of defining a certain recovery logic. It begins by defining a certain container, typically called a service group. Basically we can create several service groups, and each one of these will capture the collection of definitions that let our clustering software know how to start applications on the various nodes that comprise the cluster. So basically we would tie in and create resources – some of these will be virtual, such as virtual host names and addresses, that are properties of that particular application – along with the correct software configuration and storage, and then we define dependencies and basically end up with the following sample configuration, in which we have a certain application modeled by the cluster. Now – if we do it correctly – the cluster software will be able to tell exactly what needs to be done to restart the application in an organized fashion, and it will automatically do what needs to be done when the application cannot run anymore on one of the nodes and needs to move somewhere else.

There are some other configuration activities that need to be done at the cluster level, such as defining the [cluster] behavior; meaning, how should we fail over, which service group should fail over when there is a certain problem, what should happen when the primary node returns, and so on and so forth. There are also some advanced settings that mainly revolve around the way the cluster tool works internally: guidelines for exception handling, setting up the internal communication network, heartbeat and locking. We will discuss many of these aspects pretty soon.

That’s the logical function: we now have a set of service groups that can gracefully move from one node to the other in case of failover, or if we just want to do it manually. From the physical perspective, obviously we begin with a set of servers. When setting up those servers, we need to consider several factors. The first is the hardware, and most vendors will have certain guidelines as to the required architecture, how many interfaces we need for communication, and so on. We top those with an operating system, and here there is a wide range of settings that need to be taken care of, such as software versions, the operating system itself, patches, kernel parameters, user definitions, storage definitions, etc.

On top of that – we need to install the applications the cluster needs to take care of – and these should be installed consistently across all nodes, keeping in mind that we need to keep the same versions, parameters etc.

And finally we can configure the cluster software itself, which takes care of managing all those resources. In addition, most clusters out there will also have a shared storage architecture in which all nodes must access the same storage devices – typically using some sort of SAN configuration (more often today we see iSCSI entering the picture as well) – and there are various configuration requirements that need to be taken care of just to make sure all the cluster nodes will see the storage in the correct way (we will touch upon some of these pretty soon).

Finally we will need to set up the correct networking.  Clustering vendors will have specific recommendations and requirements as to how the network should be configured in order to facilitate a successful failover.  We would typically have redundant network links for internal communications as well as redundant network links for external communications.

So far, we have discussed only a basic cluster setup – but there are some more complex configurations, which I will also try to address today. For example – a stretch cluster, which is useful when we want to split our cluster across two relatively nearby facilities, up to several miles apart (some will argue up to 40 miles and perhaps even more – but [definitely] not much more than that). We can basically stretch our cluster and apply LVM mirroring between different storage arrays at each site, so that each node can now see a local copy which is mirrored to the remote site. In the event of a site failure we will always have a set of cluster nodes that can still access a complete mirror of the data – and that allows us greater flexibility.

When we need a greater distance or more flexibility, we can even move to geo-clusters, or replicated clusters (there are some other terminologies for these kinds of solutions) – and this entails actually setting up two sets of clusters.

More often than not, a particular change to our cluster may require the joint work of many individuals, so any miscommunication may be grounds for additional configuration drift.

 Slide10 - WebHA

Basically, with that I will move to showing you some of the most common examples we have encountered in our experience over the last couple of years. I will begin with the clustering layer. This is a nice example to start with – a very innocent misconfiguration. In this case we are showing the example on Unix, but basically the very same problem can happen on Windows as well. So we have a cluster with two nodes, the service groups are defined, and we are just adding a new file system because we need, perhaps, more storage space – making sure we create the right resources and dependencies, and everything is just perfect. But in Unix – unlike in Windows – we need to actually associate the file system with a directory that will serve as a mount point. If we forget to do it on one of the nodes (or rather do it but misspell it) – a manual process that is not managed by the cluster software – then we have a hidden risk that will prevent failover from occurring correctly. When the production server fails, the standby will try to calculate the sequence of events necessary to restart everything gracefully. At one of the early stages it will try to mount the required resources, and it will fail because there is no mount point. And so now we have our investment in clustering failing us, in the sense that someone will need to manually fix this problem, which is not really what we intended.
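
To make this concrete, here is a minimal sketch (not from the original webinar) of how such a check could be automated: it verifies that every node has the mount-point directories the cluster expects. The node names, the mount-point list and the reliance on passwordless SSH are all assumptions to adapt to your environment.

    #!/usr/bin/env python3
    """Minimal sketch: verify that every cluster node has the mount-point
    directories the cluster expects. Node names and mount points are
    illustrative assumptions; substitute your own inventory."""
    import subprocess

    NODES = ["node-a", "node-b"]                      # hypothetical cluster nodes
    MOUNT_POINTS = ["/oradata", "/oralogs", "/apps"]  # directories the cluster mounts

    def directory_exists(node: str, path: str) -> bool:
        # Relies on passwordless SSH to each node (an assumption of this sketch).
        result = subprocess.run(["ssh", node, "test", "-d", path],
                                capture_output=True)
        return result.returncode == 0

    if __name__ == "__main__":
        for node in NODES:
            for mount_point in MOUNT_POINTS:
                if not directory_exists(node, mount_point):
                    print(f"RISK: {node} is missing mount point {mount_point}")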

Here is another example about resource configuration (and this is true for most clusters out there – I would say for all but, perhaps, RAC and true Active-Active cluster configurations) in which some of the nodes are standby, and therefore we choose not to let the operating system automatically mount our file system – we would rather let the cluster software handle that. But, by default, when we create a new file system, most operating systems will try to mount it automatically on boot – so we have to disable that. If we fail to do that for one of the file systems, we now have a situation in which the standby nodes can actually write data to file systems that are supposedly owned by the active node, which may lead to cluster-wide data corruption. Therefore it’s definitely something you want to avoid.
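
A similarly hedged sketch for the boot-time mount issue on Linux: it scans /etc/fstab for cluster-managed file systems that are not marked noauto. The list of cluster-owned mount points is an illustrative assumption.

    #!/usr/bin/env python3
    """Minimal sketch: flag cluster-managed file systems that the OS would
    still mount automatically at boot. The cluster-owned mount-point list
    is an illustrative assumption."""

    CLUSTER_MOUNT_POINTS = {"/oradata", "/oralogs"}   # hypothetical cluster-owned mounts

    def check_fstab(path: str = "/etc/fstab") -> None:
        with open(path) as fstab:
            for line in fstab:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                fields = line.split()
                if len(fields) < 4:
                    continue
                mount_point, options = fields[1], fields[3].split(",")
                if mount_point in CLUSTER_MOUNT_POINTS and "noauto" not in options:
                    print(f"RISK: {mount_point} will be auto-mounted at boot "
                          f"(options: {','.join(options)})")

    if __name__ == "__main__":
        check_fstab()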

Other areas of concern around the cluster configuration are incorrect resource and resource-dependency issues. Theoretically this should not happen as long as you manage your cluster with the supplied management interfaces; but in many cases, when you apply changes, some of the nodes may not be up and the changes may not be reflected there. And even though we feel it’s not the best practice, we still see people manually changing cluster configuration files; so if you are missing a critical resource definition on one of the nodes, you will not be able to fail over gracefully. If we cross our dependencies – and this is something we see every now and then – the file system resource that should point to a specific logical volume might actually point to the wrong one. This can actually be hazardous to our cluster: if you want to fail over, say, service group A, you would actually (a) fail to do so, because we do not point to the right resource, and (b) possibly seize the [incorrect] resource from another service group happily running somewhere else. This is something to avoid.

And finally, there are hanging or loosely connected chains of resources that are not tied together correctly all the way through.

Another area is [cluster] state issues. Clusters do keep careful track of their state. The cluster is a state machine that is supposed to provide ways to move from one phase to another when something changes in our configuration. Some clusters are more sensitive than others, so once we have our first failure in the cluster, the next failure might result in unpredictable behavior. The best practice is to [every now and then, or frequently enough] take a close look at our cluster state and make sure it is always in the “zero” state, which means it is perfectly configured. If we fail to observe that, we may end up with a cluster that does not function as expected. Sometimes we actually suspend (or freeze, or bring down) some of the resources just to facilitate maintenance on some of the nodes, and simply forget to release that state – so it can remain there for a long time.

And finally there are some other bad states, which include “down” components, “paused” and “jeopardy”, that are flagged [by the cluster] but not always propagated to our management console – so the cluster may be in an undesirable state and we may not know about it.
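
One rough way to catch lingering bad states is to poll the cluster’s own status command and flag anything suspicious, as in the sketch below. The command shown assumes VCS (hastatus -sum); for Pacemaker you might parse crm_mon -1 instead, and the keyword list is a crude heuristic that will need tuning for your stack.

    #!/usr/bin/env python3
    """Minimal sketch: poll the cluster's status command and flag resources
    or groups reported in a bad state. The command and the keyword set are
    assumptions; adjust them to your cluster stack."""
    import subprocess

    STATUS_COMMAND = ["hastatus", "-sum"]             # assumed VCS; replace as needed
    BAD_STATE_KEYWORDS = {"FAULTED", "FROZEN", "JEOPARDY"}

    def flag_bad_states() -> None:
        output = subprocess.run(STATUS_COMMAND, capture_output=True,
                                text=True, check=False).stdout
        for line in output.splitlines():
            # Crude keyword grep; may need refinement to avoid header lines.
            if any(keyword in line.upper() for keyword in BAD_STATE_KEYWORDS):
                print(f"ATTENTION: {line.strip()}")

    if __name__ == "__main__":
        flag_bad_states()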

Another area, which is rather notorious, is Quorum devices, voting devices and I/O fencing – which are multiple names for the same idea – and we will discuss it a little bit later.

Last, but not least, is misconfiguration of the advanced agents that are needed. For example – in replicated clusters or geo-clusters we will typically need to deploy cluster-wide DNS agents and replication agents that manage cross-site transitions. If these are not installed or not configured correctly, we may have problems.

Moving to the application layer, here we may have a variety of possible issues. The most common ones are missing software packages: we don’t have the same software, or a necessary software component, installed. Misconfigured applications can differ in parameters or configuration files, and required network objects may not be defined, or defined but not declared correctly. Last but not least is licensing for the application itself. Imagine a failover for an application that is licensed for 1,000 users while on our standby on the other side there is no license; that may go unnoticed in testing, but in real life it won’t get us far.

One of the richer areas of potential problems is the operating system and hardware layer. We will definitely need to keep our hardware, to a certain degree, in a similar or compatible configuration. Cluster nodes do not have to be identical by all means (although I personally believe this is a good practice), but a certain ratio should be maintained between nodes. We definitely want to avoid using different CPU architectures, like 64-bit vs. 32-bit, or keeping too large a difference in the number of CPUs, memory, HBAs and network interface cards. If you have only 2 or 3 clusters, that’s probably not something you would experience; but when we are talking about dozens and hundreds, this is something we start to see popping up.

Slide16 - WebHA

The other area is software configuration – and in this case it’s the operating system and central utilities. It is not uncommon to find cluster nodes that do not have the same operating system version or service pack; patches that do not have the same versions or are not installed at all; as well as differences in required infrastructure applications such as web servers, Java and essential storage utilities. There might also be differences between critical kernel parameters. Imagine we had to tune our I/O queue depth to get the right kind of performance and we just forgot to do it on one of our standbys. Six months can pass before a failure, and when it does happen no one remembers anymore what was done to fix the problem – we just know that it came back. This is something worthy of regular auditing, and there are many other areas which I am going to show in the next slides.

This one [the example in the next slide] has more to do with Windows, but the same thing can happen on Linux as well: DNS misconfiguration. One of the most common issues is that, in a multi-site failover, we actually have the same DNS configuration across both ends, and if we don’t keep a redundant DNS setting then it will probably fail to fail over as expected. So, again, this is an area that needs to be taken care of. Some other issues that may be considered misconfigurations at the operating system and hardware level are incorrect user definitions. [Another] pretty common problem is that we have different passwords, or that certain users needed for the [smooth] operation of the application were added, but not on all nodes; and licensing, which we mentioned earlier – that’s true also for the operating system.

There are specific issues for the more advanced cluster configurations, for example in metro clusters, in which we use LVM mirroring to a wide extent. These can get into crazy corners. Say, for example, a configuration in which one of our logical volumes or file systems is actually mirrored within one of the sites instead of across the two [sites]; if it happens at one end only, then a failure of that particular end will result in data loss. Sometimes we find it happening at both sites, so practically speaking no cross-site failover will be possible.
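
To make the kernel-parameter point concrete, here is a minimal sketch that compares a handful of parameters across nodes and reports drift; the node names, the parameter list and the reliance on passwordless SSH are all placeholder assumptions.

    #!/usr/bin/env python3
    """Minimal sketch: compare a handful of kernel parameters across
    cluster nodes and report drift. Nodes and parameters are illustrative."""
    import subprocess

    NODES = ["node-a", "node-b"]                          # hypothetical nodes
    PARAMETERS = ["kernel.shmmax", "fs.aio-max-nr", "net.core.rmem_max"]

    def read_parameter(node: str, parameter: str) -> str:
        result = subprocess.run(["ssh", node, "sysctl", "-n", parameter],
                                capture_output=True, text=True)
        return result.stdout.strip()

    if __name__ == "__main__":
        for parameter in PARAMETERS:
            values = {node: read_parameter(node, parameter) for node in NODES}
            if len(set(values.values())) > 1:
                print(f"DRIFT: {parameter} differs across nodes: {values}")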

Some cluster solutions will require you to tag the mirrored members and let the cluster know where each mirror physically resides, and if you fail to do it correctly you may end up with either poor performance (at best) or a failed recovery (at worst).

Again, I/O fencing devices or Quorum devices must be visible to all nodes in the cluster through at least one path – but better yet, through redundant paths. If some of the nodes are not configured to see all the Quorum devices, or see them through a reduced configuration, then we may end up in situations where failover is required but our standby is either not able to lock the resources it needs or cannot make sure it has the right to take over – so it will just freeze.

Another area, and perhaps the last for this portion of the discussion, is the definition of exports and mounts – if you use those. Though some customers are reluctant to use them because of security concerns, others deploy them to a wide extent. If you need to export material or mount network file-based data, you should probably take care of incompatible NFS versions, or different permission models, modes, option sets, etc. If that is not taken care of, you could end up with unexpected failures.
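
As a rough illustration, the sketch below compares the NFS entries in each node’s /etc/fstab and flags mismatched options; the node names and the passwordless-SSH access are assumptions, and in practice you might compare the live mounts as well.

    #!/usr/bin/env python3
    """Minimal sketch: compare NFS mount definitions across cluster nodes
    and report mismatched options. Node names are illustrative."""
    import subprocess

    NODES = ["node-a", "node-b"]     # hypothetical cluster nodes

    def nfs_mounts(node: str) -> dict:
        """Return {mount_point: options} for every NFS entry in the node's fstab."""
        output = subprocess.run(["ssh", node, "cat", "/etc/fstab"],
                                capture_output=True, text=True).stdout
        mounts = {}
        for line in output.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            fields = line.split()
            if len(fields) >= 4 and fields[2].startswith("nfs"):
                mounts[fields[1]] = fields[3]
        return mounts

    if __name__ == "__main__":
        per_node = {node: nfs_mounts(node) for node in NODES}
        all_mount_points = set().union(*per_node.values())
        for mount_point in sorted(all_mount_points):
            options = {node: per_node[node].get(mount_point, "<missing>") for node in NODES}
            if len(set(options.values())) > 1:
                print(f"MISMATCH: {mount_point}: {options}")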

The networking layer – I will try to keep it brief. Basically the most common issue is failing to meet the vendor’s requirements as to the number of interfaces per network (public or private), or just failing to keep the private networks [and these are critical]. A very common issue is that some of our nodes do not have their interface teaming correctly configured, or are not running at the correct speed. If you do not observe that, then upon failure you might experience unexpected behavior.

A very common issue is that we have some single point of failure out there between our switches or VLANs, so it’s worth auditing every now and then, just to make sure all of our paths are actually redundant. Some clusters – the more advanced ones – use proprietary, low-level, low-latency stacks just to ensure communication between nodes is done in a timely manner. But these are sometimes sensitive to misconfiguration, and some of them are not even routable, which is an area where people don’t necessarily pay enough attention.

Finally, in firewalled networks we must make sure that all of our cluster nodes have the same permissions to pass between the different networks.

Moving into the storage area, which is one of the weakest links in the chain: none of the clustering tools out there have sufficient control over the various storage equipment we may have in our datacenters (that can include EMC, IBM, HDS or NetApp and many others).

 Slide22 - WebHA

Basically, here is a very simple example in which we add a new shared storage device to the cluster, and for some reason one of our nodes cannot really see that device. In more than 95% of the clusters out there, the passive nodes will refrain from actually accessing the device until it’s called for, so basically they cannot tell that this problem exists until they are called upon to take over the service group. And that’s probably too late, because now failover will not work; and again, even though we expected a very quick and automatic recovery, someone will have to fix this manually, and this might take hours. All of this can be avoided by periodically auditing that all of our nodes can see all of the shared volumes.
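
A minimal sketch of such a periodic audit, assuming Linux nodes with stable /dev/disk/by-id names and passwordless SSH: it compares the set of SAN device identifiers visible from each node and reports anything a node cannot see.

    #!/usr/bin/env python3
    """Minimal sketch: verify every node sees the same set of shared SAN
    devices by comparing /dev/disk/by-id entries. Node names and the
    identifier filter are assumptions to adapt to your environment."""
    import subprocess

    NODES = ["node-a", "node-b"]     # hypothetical cluster nodes

    def visible_devices(node: str) -> set:
        output = subprocess.run(["ssh", node, "ls", "/dev/disk/by-id"],
                                capture_output=True, text=True).stdout
        # Keep only SCSI/WWN identifiers, which are stable across nodes.
        return {name for name in output.split() if name.startswith(("scsi-", "wwn-"))}

    if __name__ == "__main__":
        per_node = {node: visible_devices(node) for node in NODES}
        union = set().union(*per_node.values())
        for node, devices in per_node.items():
            missing = union - devices
            if missing:
                print(f"RISK: {node} cannot see {len(missing)} device(s): {sorted(missing)}")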

Moving into a different area – and this can be hidden even in a small, two-node cluster, and definitely in larger ones. We have a certain resource (it could be a file system or a database) which is striped across several storage volumes, and some of these are configured for path fault tolerance, meaning they have more than one SAN I/O path to more than one array port – while one device is using just one port, or one path. That is bad for performance but, worse, it’s bad for redundancy.
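
On a Linux node that uses device-mapper multipath, one rough way to spot such single-path devices is to walk sysfs and count the underlying paths of each multipath map, as in the sketch below; the minimum-path threshold is an assumption.

    #!/usr/bin/env python3
    """Minimal sketch: flag multipath devices that have only a single
    underlying SAN path, by walking sysfs on the local node. Run it on
    each cluster node; the path-count threshold is an assumption."""
    import os

    MINIMUM_PATHS = 2

    def check_multipath_devices(sys_block: str = "/sys/block") -> None:
        for entry in sorted(os.listdir(sys_block)):
            if not entry.startswith("dm-"):
                continue
            dm_dir = os.path.join(sys_block, entry, "dm")
            try:
                with open(os.path.join(dm_dir, "uuid")) as f:
                    uuid = f.read().strip()
                with open(os.path.join(dm_dir, "name")) as f:
                    name = f.read().strip()
            except FileNotFoundError:
                continue
            if not uuid.startswith("mpath-"):
                continue  # not a multipath map (could be LVM, crypt, etc.)
            slaves = os.listdir(os.path.join(sys_block, entry, "slaves"))
            if len(slaves) < MINIMUM_PATHS:
                print(f"RISK: {name} has only {len(slaves)} path(s): {slaves}")

    if __name__ == "__main__":
        check_multipath_devices()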

 Slide23 - WebHA

The next area is quite intriguing and, curiously enough, very common: an unauthorized host which is not part of the cluster but might have access to our storage devices. We actually show here both possible scenarios – it can happen at our production site or at our remote recovery site. Both [scenarios] can be rather unpleasant, and this can happen more easily than people might think. If a host which is not part of the cluster is actually able to access one of the devices, then if that particular host writes to that device, it will blow away our cluster, and no amount of failover automation will help us because the shared data is now corrupt.

 Slide24 - WebHA

As I said, this can happen very easily. [For example] Two of our customers reported they had just moved an HBA as a result of an upgrade (they wanted to improve performance), and the old HBA still retained its firmware definitions (no one had removed those). Several months later that HBA was connected to a QA lab machine, and perhaps a day or two later it blew away a critical production system. So that’s something also worth reviewing every now and then.

Slide25 - WebHA

Another area common to replicated or geo-clusters is that you add more storage devices and for some reason these are not replicated. It’s not always as simple as it may seem: perhaps our storage admin did provision replicated devices, but we, as server admins, misused them or just picked the wrong device. Or the server admin may have performed well, but the DBA added a new data file on the wrong file system. Either way, we end up with incompletely replicated striped data that can lead to data loss. Likewise, there is a variety of opportunities to end up with a replica that is not guaranteed to be consistent, as in this particular illustrated sample.
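
Conceptually, this audit boils down to a set difference between the devices the cluster uses and the devices the array replicates. The sketch below is a deliberately simplified, hypothetical illustration; in practice both sets would be collected from the volume manager and the replication configuration.

    #!/usr/bin/env python3
    """Minimal sketch: cross-check that every storage device backing a
    clustered resource is also in the replicated set. Both inventories
    are hypothetical placeholders."""

    # Devices (e.g. WWNs) that back the clustered file systems / volume groups.
    DEVICES_IN_USE = {"wwn-0x6001", "wwn-0x6002", "wwn-0x6003"}

    # Devices the storage array reports as members of the replication group.
    REPLICATED_DEVICES = {"wwn-0x6001", "wwn-0x6002"}

    if __name__ == "__main__":
        unreplicated = DEVICES_IN_USE - REPLICATED_DEVICES
        for device in sorted(unreplicated):
            print(f"RISK: {device} is in use by the cluster but is not replicated")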

 Slide26 - WebHA

And, again, setting up storage consistency groups is done at the storage level, and it has to be a coordinated effort between our storage admin, server admin and, at times, even our database admin. If we fail to do it [in a coordinated fashion], we will have replicas that usually look OK whenever we test the cluster (because the failover test will gracefully disconnect network links and everything will look perfect) – yet will typically fail in real life (because different consistency groups are usually bound to specific network links; these will fail seconds or minutes apart and we will end up with a totally useless copy).

 Slide27 - WebHA

Again, this is a very simple example where we have just mixed storage tiers. This is relatively harmless – it’s just bad for performance, or perhaps wasteful if you don’t need high performance – but there are variations such as this one, where we actually mix SAN storage with local storage, possibly even in a replicated environment; and most likely, if it happens in a cluster that does not fail over too often, we won’t notice it. Obviously there is no chance failover will actually succeed, and again this is something we see every now and then.

 Slide28 - WebHA

So, there are some more areas around storage and, regretfully, time will not allow us to explore them all; I will just try to hint at some. One of the areas we find is often overlooked is having enough storage control devices mapped to our servers. Many of the cluster vendors have certain best practices around that – for example, EMC has gatekeepers (and different vendors use other names for control devices) – but these must be assigned to a cluster in sufficient numbers. Some vendors will insist that you have double the number of your device groups plus one, and if you don’t have that, then when the cluster attempts to seize control of your shared storage devices it may actually reach a point where everything freezes. So that’s an area which is often overlooked. Another is misconfigured replication agents on metro clusters, replicated clusters or geo-clusters.
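
The sizing rule just mentioned (twice the number of device groups, plus one) is easy to audit once the counts are collected. Below is a trivial sketch with hypothetical inputs standing in for values you would pull from the array and cluster configuration.

    #!/usr/bin/env python3
    """Minimal sketch of the '2 x device groups + 1' sizing rule for
    storage control devices (e.g. EMC gatekeepers) quoted in the talk.
    The counts are hypothetical inputs."""

    def required_control_devices(device_groups: int) -> int:
        # Rule of thumb quoted in the talk: twice the device groups, plus one.
        return 2 * device_groups + 1

    def audit(node: str, device_groups: int, mapped_control_devices: int) -> None:
        needed = required_control_devices(device_groups)
        if mapped_control_devices < needed:
            print(f"RISK: {node} has {mapped_control_devices} control devices, "
                  f"needs at least {needed} for {device_groups} device group(s)")

    if __name__ == "__main__":
        audit("node-a", device_groups=3, mapped_control_devices=5)   # hypothetical counts
        audit("node-b", device_groups=3, mapped_control_devices=7)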

Or the notorious Quorum and voting issues: sometimes the [quorum] devices are not visible [to all nodes], sometimes not [seen] redundantly enough, and then, when we have a network failure, [in this case] some of our hosts are rendered useless and can’t really seize control of the storage resources. Other possible issues are dead replicas, dead paths or frozen replication – all totally outside the reach of our cluster. A geo-cluster would not be aware of any of these issues and, when it attempts to fail over, it would not succeed.

Here is a very short marketing segment before moving on to the conclusions part.

 Slide31 - WebHA_0

 So how do we at Continuity Software help mankind?

We have a software tool called RecoverGuard. It’s a completely agent-less tool which is very easy to install; it usually takes an hour to set up and an hour to configure, and immediately after that you can obtain a very detailed report showing you whether your configuration is consistent or whether you have any risks. The way we manage to do that is by scheduling RecoverGuard to run once a day, or as frequently as you want. It will collect information from storage, servers and databases, correlate all the collected information to automatically discover servers, storage, clusters, replicas and geo-clusters – so it’s totally free of human intervention – and then take a very close look at the configuration to see whether any of the problems we have discussed, plus a very large number of additional issues, actually exists in your environment. If a match is found, it will alert and show you exactly what needs to be done to fix the problem. So it offers you a way to proactively monitor your environment and make sure all of those best practices are observed and all undesirable mistakes are avoided. This knowledge base is the result of a large community effort over several years; most contributions come from our customers. It’s a community- and vendor-driven knowledge base and it keeps getting updated. Essentially we cover all [areas discussed] and many more. RecoverGuard has a very interactive UI, providing you with a dashboard view showing in which areas you have risks to your data or your availability; when you drill down into an issue you will see a very detailed description of how it looks, what happened, what the root cause is, what the impact is and how to fix the problem. It’s very fun to use, so if anyone would like more details I will show you how to contact us later.

 Slide33 - WebHA

With that, the marketing section ends and we reach the final stage of this formal part, which is the conclusions. I will try to keep it as short as possible.

 Slide36 - WebHA

Personally, I believe that one of the areas not observed carefully enough is having the right collaboration. The more successful customer environments we have seen are those in which DBAs, storage admins and server admins get together at least on a monthly basis to review critical system configurations and try to understand whether there are new best practices that need to be discussed, which may affect the work of the different teams. If something is found relevant, document it. Having the right documentation is the basis of a successful recovery.

Perhaps most importantly, I would urge you to consider automating the testing and auditing of issues such as the ones I have described. Some clusters come furnished with free testing tools; by all means use those to run automated tests every now and then – it will make your life easier and increase your confidence. Of course, automated testing by itself will not find all the possible issues, and for this reason you also need to automate auditing of the environment. It could be either script-based, or you can perhaps use tools such as RecoverGuard; either way, try to automatically collect the configuration information required to understand whether you have those issues or not. It’s not practical to do that manually – you really have to automate the data collection. Try to make sure that the information the cluster provides you, such as alerts about bad states or missing control devices, is actually caught by your network management system. That’s one of the areas that may be relevant for your clusters, and in many cases it is not monitored.
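
For those who go the script-based route, here is a bare-bones sketch of such an audit runner; the two check functions are placeholders (not from the webinar) to be replaced with real collection and comparison logic like the examples shown earlier, and the daily interval mirrors the once-a-day scheduling mentioned above.

    #!/usr/bin/env python3
    """Minimal sketch of a script-based audit runner: run a set of
    configuration checks on a schedule and report anything they flag.
    The sample checks are placeholders for audits such as mount points,
    fstab options, kernel parameters or shared-device visibility."""
    import time

    def check_mount_points() -> list:
        # Placeholder: return a list of human-readable findings.
        return []

    def check_fstab_options() -> list:
        return []

    CHECKS = [check_mount_points, check_fstab_options]
    INTERVAL_SECONDS = 24 * 60 * 60   # once a day

    def run_once() -> None:
        findings = []
        for check in CHECKS:
            findings.extend(check())
        if findings:
            for finding in findings:
                print(f"RISK: {finding}")
        else:
            print("All audited configurations look consistent.")

    if __name__ == "__main__":
        while True:          # or run it once from cron instead of looping
            run_once()
            time.sleep(INTERVAL_SECONDS)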

Finally, I urge you to send us feedback! We will make sure to publish it on our site and our blog, and I will provide you with the links momentarily.

With that, I would like to thank you for your time in this session, and we are now moving to an open question and answer session.”