Private Cloud: Top 10 Downtime and Data Loss Risks and How to Avoid Them

Julie: Welcome to Continuity Software’s Webinar. Today’s webinar is called The Private Cloud: Top 10 Downtime and Data Loss Risks and How to Avoid Them. I am Julie Shafiki and I’ll be the organizer on the call. I’d like to welcome everyone who’s taken the time to be with us today. The presentation itself will be about 20 minutes followed by a Q&A session. I’d like now to introduce our first presenter, Mr. Gil Hecht, founder and CEO of Continuity Software. Doron Pinhas, Continuity’s CTO will also be presenting. I now turn the floor over to Gil.


Gil: Thank you, Julie. Today we’re going to talk about the private cloud, and specifically we’re going to focus on downtime and data loss risks in the private cloud. After that, we will look at some risk examples and try to analyze why they happen. We’ll have some tips on how to avoid those risks, and then we will do a 5-minute introduction to Continuity Software. For those of you who are interested, we will also have a Q&A session towards the end of the presentation.


So what is the private cloud? If you look at all kinds of dictionaries you see the definition is a “scalable pool of computer resources that are dedicated to my business.” Now, in more simple terms, and with your permission, I’m going to use the VMware “jargon” during this webinar, although the same issues happen with other types of virtualization technologies as well. So, in more simple terms, the private cloud is essentially a couple of ESX servers, which are connected to shared storage and are running multiple virtual machines. Since the private cloud is becoming a very important piece of every enterprise (simply because it’s running so many virtual machines), disaster recovery is actually becoming a very important part of the private cloud as well. And when many people discuss the private cloud, they consider it as two remote environments: production and disaster recovery, which are both active, of course. Now, in order to get all the great things we can get from the private cloud, the private cloud is designed for multi-tenancy, which makes it so cost effective, and that’s the reason we all go there. It’s also designed to allow us to have relatively painless virtual machine relocations, which allows for much simpler and more effective management. In order to achieve that, the technology had to develop such that there would be more layers in the technology stack, and I’m talking about things such as the virtual machine, of course, VMFS, and other layers as well. In addition, there is also a need to hide the physical infrastructure. The ability to move a VM from one place to another is actually tied to this ability to hide the infrastructure and provide the virtual infrastructure. Now, those things create challenges. They do lots of good things, and the private cloud is, without question, the right way to go. But it does create certain challenges. And, generally speaking, I would say that we can take those challenges and break them into three areas.


One area is the area of the virtual infrastructure itself, meaning those ESX servers that are clustered together and connected to shared storage, along with their own configuration and the network and storage connections below them. So that’s the virtual infrastructure, and the next couple of slides will give you examples of specific risks that can happen in the virtual infrastructure. Another area is the area of the virtual machines, or each virtual machine by itself. There we’re talking about the fact that, since you separated the physical infrastructure from the virtual machine, you’ll find that many problems can arise in the virtual machine layer that couldn’t happen before. Those issues can lead to data loss, to downtime, or to sub-optimal performance. And again, the next couple of slides will look at some examples of that as well.

Last, but not least, as we said, disaster recovery, or really the ability to recover the private cloud as necessary, is a very important part. There as well, there are a lot of complexities that relate to how people do disaster recovery in the cloud, and we will give you some examples of risks in this area as well. So, with that, I want to hand the stage over to Doron Pinhas. Doron is the CTO of Continuity Software, and he will take you through some of those examples.


Doron: I want to pick up from where Gil left off. Basically, when we drill down more closely, we break each of those layers into sub-areas. This slide represents what we have encountered [in the field] and what our customers report. Having all those layers and sub-layers managed by different teams is often one of the reasons for problems to occur. Let’s just drill down into each one of those layers and each one of those sub-layers. It was a struggle to choose only a single representative sample from each one, and the following slides do suggest additional issues that can happen. Those of you who wish to download the presentation can take a closer look.

With that, I’m just going to lead with the first example, which is on the virtual infrastructure layer.


This one is storage related, and what we see here is an ESX cluster spanning several nodes. The problem is that one of the nodes has no SAN path redundancy, which we would expect from all nodes serving the cluster. What it means is that all virtual machines currently running on that particular node will either suffer from a single point of failure or from reduced performance (or both), which can be a big issue, because machines can move about when DRS or HA is deployed.

This could also explain why some machines demonstrate performance hits every now and then, which then go away and so forth, as they move onto and off the under-provisioned node.

Of course, there could be many other risks at the storage level – some can actually lead to data-corruption that may affect multiple machines.
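
To make this kind of check concrete, here is a minimal sketch in Python of a SAN-path redundancy audit; the host names and path counts are hypothetical sample data, not a real inventory API or RecoverGuard’s implementation.

    # Minimal sketch: flag cluster nodes whose SAN path count falls below policy.
    # The inventory dictionary below is hypothetical sample data.
    MIN_PATHS = 2

    san_paths_per_host = {
        "esx01": 4,
        "esx02": 4,
        "esx03": 1,   # single path: single point of failure / reduced performance
    }

    for host, paths in san_paths_per_host.items():
        if paths < MIN_PATHS:
            print(f"RISK: {host} has only {paths} SAN path(s); "
                  f"VMs placed here by DRS/HA lose path redundancy")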

Moving to another example on the virtual infrastructure layer, this one relates to the network settings and is actually quite similar to the previous one:


What we see here is that one of the cluster nodes has only a single network connection to the shared or public network. This is a single point of failure, and it can also result in poor performance, particularly because this slide also suggests the existence of some noise on the network interface. So again, we have a single point of failure and extremely poor performance for all the machines that are currently running on top of that particular server. And like the previous example, the fact that machines are moving about can make it very difficult to understand why we have fluctuations in performance.

Network issues, of course, are not limited to redundancy; they also include other differences between nodes, such as DNS settings, routing, and many others.
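
A similar audit can cover the network side, checking uplink redundancy and flagging per-node differences such as DNS settings; again, this is a minimal sketch over hypothetical sample data, not any product’s actual check.

    # Minimal sketch: check NIC redundancy on the shared/public network and
    # flag per-node differences (e.g. DNS) that should match across the cluster.
    # All data below is hypothetical sample data.
    hosts = {
        "esx01": {"public_nics": 2, "dns": ["10.0.0.53", "10.0.0.54"]},
        "esx02": {"public_nics": 1, "dns": ["10.0.0.53"]},  # single NIC, DNS drift
    }

    reference_dns = hosts["esx01"]["dns"]
    for name, cfg in hosts.items():
        if cfg["public_nics"] < 2:
            print(f"RISK: {name} has a single uplink to the public network")
        if cfg["dns"] != reference_dns:
            print(f"RISK: {name} DNS settings differ from the rest of the cluster")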



We are still on the virtual infrastructure layer, but now we are moving to a different area, which is the configuration of the cluster itself.

Here is a very basic issue in which we don’t have a similar configuration between nodes. In this particular case, it relates both to having different versions of ESX on different nodes and to having different OS options configured. This can actually present risks to the data of virtual machines running on the cluster.

Of course, there could be many other differences at that layer including hardware differences, firmware differences, and other options.
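
A drift check like this can be sketched as a comparison of per-node settings; the node names, builds, and options below are hypothetical illustrations only.

    # Minimal sketch: detect configuration drift between cluster nodes
    # (ESX build, selected OS options). Hypothetical sample data.
    from collections import defaultdict

    node_config = {
        "esx01": {"esx_build": "4.1-260247", "scsi_timeout": 60},
        "esx02": {"esx_build": "4.1-260247", "scsi_timeout": 60},
        "esx03": {"esx_build": "4.0-208167", "scsi_timeout": 30},  # drifted node
    }

    values_seen = defaultdict(set)
    for cfg in node_config.values():
        for key, value in cfg.items():
            values_seen[key].add(value)

    for key, values in values_seen.items():
        if len(values) > 1:
            print(f"RISK: cluster nodes disagree on '{key}': {sorted(values, key=str)}")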


The last sub-category, which is quite broad, contains an eclectic collection of issues – I’ve chosen to present one of the “nicer” examples, but the slide also suggests some of the others.

What we see here is a common configuration in which the vCenter application is installed inside a virtual machine, which is actually a good practice. The problem we see here, which we’ve managed to find at several customer sites, is that the virtual machine running vCenter is configured with fully automated DRS. This practically means that we can’t tell in advance on which particular physical node it is going to run at any given time. If this machine panics or hangs, we will lose control over our cluster, and we cannot tell where to go in order to restart that machine. If you have a very large cluster environment you can spend a couple of hours figuring out how to revive vCenter.

Of course, you’ll usually find this out when you actually need to do something urgent, so it’s highly inconvenient.

There are many other potential risks in this broad category; some of these are suggested below.
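
For the vCenter example above, the check itself is simple once the relevant settings are collected; here is a minimal sketch over a hypothetical record, not the vSphere API.

    # Minimal sketch: make sure the VM hosting vCenter is not left in fully
    # automated DRS, so you always know which host to visit to restart it.
    # The record below is hypothetical sample data.
    vcenter_vm = {
        "name": "vcenter01",
        "drs_automation": "fullyAutomated",   # should be "disabled" or "manual"
        "pinned_host": None,
    }

    if vcenter_vm["drs_automation"] == "fullyAutomated" or not vcenter_vm["pinned_host"]:
        print("RISK: the vCenter VM can land on any host; if it hangs, "
              "you won't know where to go to restart it")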


Moving to the next layer, which is the VM layer.

As Gil has suggested, the abstraction layer added by virtualization can actually make it very difficult to identify potential vulnerabilities or risks.

Here is a very simple example that relates to virtual storage allocation. What we see here is a virtual machine running Oracle (or any other database for that matter), currently supported by an ESX cluster containing several nodes, with the cluster using different tiers of storage assigned to different data stores. As you can see, some of these are provisioned for very high performance and some (probably for archiving) are provisioned with lower-performance RAID 5.

The Oracle admin has no way to know which tier the storage allocated to the database belongs to. Here we focus on the temporary database files (all vendors recommend storing them on the most performance-capable storage, because otherwise they will slow down the entire database). It so happens that this particular database has its temporary files stored on a [virtual OS] file system which is, in turn, virtually mapped to a LUN on an incorrect pool – which may result in horrible performance degradation.

As mentioned, it’s not really easy for the Oracle admin or the virtual machine server admin to understand why this is happening.

Some of the other storage allocation problems can actually result in data loss, like mixing RDM and non-RDM devices, or devices from different tiers.
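
A tier-mismatch check of this sort can be sketched by mapping each database file through its datastore to the underlying storage tier; the file names, datastores, and tiers below are hypothetical sample data.

    # Minimal sketch: map a database file to its datastore tier and flag files
    # that landed on the wrong tier. Hypothetical sample data.
    datastore_tier = {"ds_gold": "RAID10-SSD", "ds_archive": "RAID5-SATA"}

    db_files = [
        {"file": "temp01.dbf", "required_tier": "RAID10-SSD", "datastore": "ds_archive"},
        {"file": "users01.dbf", "required_tier": "RAID5-SATA", "datastore": "ds_archive"},
    ]

    for f in db_files:
        actual = datastore_tier[f["datastore"]]
        if actual != f["required_tier"]:
            print(f"RISK: {f['file']} needs {f['required_tier']} but sits on "
                  f"{f['datastore']} ({actual})")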


With that, I’ll move to the next configuration, which is something that people are not always aware of when they start using the infrastructure but becomes apparent later on. We typically use builds or “base” images to provision new machines. So, what we see here is that over time we have added more VM application servers. What happened here is that the new devices, or new VMs, were not provisioned based on the same image. In this case, this takes the form of different versions of the operating system, which can actually result in security risks, performance issues, or unexpected behavior.

Over time, as the environment grows larger and larger, it becomes more of a challenge to make sure that all of the virtual machines that belong to the same application are actually consistently provisioned.
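
Checking provisioning consistency reduces to comparing the OS build of every VM that belongs to the same application; here is a minimal sketch over hypothetical sample data.

    # Minimal sketch: verify all VMs of one application were provisioned from
    # the same base image / OS build. Hypothetical sample data.
    app_vms = {
        "web01": "RHEL 5.5 (build 2010-09)",
        "web02": "RHEL 5.5 (build 2010-09)",
        "web03": "RHEL 5.3 (build 2009-01)",   # provisioned from an old image
    }

    builds = set(app_vms.values())
    if len(builds) > 1:
        print(f"RISK: application VMs are not consistently provisioned: {sorted(builds)}")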


We’re now moving to another area, which relates to reliability (and I’m showing one of the most obvious examples). Here we have the same application comprising two virtual machines running off two different physical nodes. As a result of routine maintenance or an unplanned outage, one of the machines has to fail over or relocate to a different server – and the problem is that now we have both (and only) application servers running on the same physical node. So, from now on, we have a single point of failure that may impact the entire application. There’s no clear way to track this on an ongoing basis, but such issues can definitely cripple a production application.
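
This placement risk can be detected by grouping VMs by application and counting the distinct physical hosts they run on; the placement data below is a hypothetical illustration.

    # Minimal sketch: flag applications whose redundant VMs have ended up on the
    # same physical host after a relocation. Hypothetical sample data.
    from collections import defaultdict

    vm_placement = {               # vm -> (application, current host)
        "app1-node1": ("billing", "esx02"),
        "app1-node2": ("billing", "esx02"),   # both halves on one host
        "app2-node1": ("crm", "esx01"),
        "app2-node2": ("crm", "esx03"),
    }

    hosts_per_app = defaultdict(set)
    for app, host in vm_placement.values():
        hosts_per_app[app].add(host)

    for app, app_hosts in hosts_per_app.items():
        if len(app_hosts) < 2:
            print(f"RISK: all VMs of '{app}' run on {app_hosts.pop()}; single point of failure")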


We’re now moving to the final layer, which is the DR configuration.

I’m going to show three representative examples (actually four…). The first two do not necessarily relate to SRM implementations; they can happen even if you rely on manually configured failover.

First, let’s look at the replication process itself, which must be tightly aligned with your recovery goals. What we see here are a couple of examples. In the top one, one of our data stores is not fully replicated. Obviously, all the virtual machines that depend on that data store will not be able to recover. The second one is a bit more complex: everything is replicated, but not according to the storage vendor’s and VMware’s best practices, which imply that you should use the same storage consistency group for all the devices behind a data store. So now we do have a [complete] copy, but it is very likely to be corrupt. Of course, there are multiple other permutations that can lead to replication faults, and these are not really monitored by the infrastructure itself.
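
Both replication risks above can be expressed as simple checks over the LUNs backing each datastore: is every LUN replicated, and do they all share one consistency group? The datastore and LUN records below are hypothetical sample data.

    # Minimal sketch: check that every LUN behind a datastore is replicated and
    # that all of them share one consistency group. Hypothetical sample data.
    datastores = {
        "ds_prod1": [
            {"lun": "LUN_10", "replicated": True, "consistency_group": "cg_a"},
            {"lun": "LUN_11", "replicated": False, "consistency_group": None},
        ],
        "ds_prod2": [
            {"lun": "LUN_20", "replicated": True, "consistency_group": "cg_a"},
            {"lun": "LUN_21", "replicated": True, "consistency_group": "cg_b"},  # split CG
        ],
    }

    for ds, luns in datastores.items():
        if not all(lun["replicated"] for lun in luns):
            print(f"RISK: {ds} is only partially replicated; its VMs cannot recover")
        elif len({lun["consistency_group"] for lun in luns}) > 1:
            print(f"RISK: {ds} spans multiple consistency groups; the replica may be corrupt")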


Moving to another storage-related issue, this time about the mapping of the replicas to our recovery site. In this example we see a classical issue in which one of our designated recovery ESX hosts does not have a storage path configured to one of the replicas. So – unlike the previous example – the data is there, but it is not accessible by the recovery host, which means that if we attempt a failover we will find that all the virtual machines that depend on that particular data store will fail as well. Again, the last two slides are generic and relevant both to SRM and non-SRM implementations.
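
The mapping check can be sketched as a set comparison between the replicas a recovery host can see and the replicas it will need at failover; the LUN and host names below are hypothetical.

    # Minimal sketch: verify every designated recovery host can see every replica
    # LUN it will need at failover time. Hypothetical sample data.
    required_replicas = {"LUN_10_R", "LUN_20_R"}

    recovery_host_visibility = {
        "dr-esx01": {"LUN_10_R", "LUN_20_R"},
        "dr-esx02": {"LUN_10_R"},            # missing a path to LUN_20_R
    }

    for host, visible in recovery_host_visibility.items():
        missing = required_replicas - visible
        if missing:
            print(f"RISK: {host} has no path to {sorted(missing)}; "
                  f"dependent VMs will fail to recover")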

Now, for those of you who are using SRM or are considering using SRM, there are some more challenges to face. You must be able to make sure that SRM is fully aligned with your production configuration.

SRM is definitely the way to go! It allows you to automate failover, but it is relatively vulnerable to configuration changes. You have to make sure that you keep your configuration in good shape at all times.


Here is one of the most trivial examples (and I’ll suggest more complex ones later on), in which we have created a Data Store group and a Protection Group that contains the VMs on that particular Data Store group. I’m assuming we have properly configured SRM to support a recovery process at the recovery site. The problem is that, over time, new VMs – and, in this case, important ones – were added to that Data Store group. If we do not manually refresh the SRM configuration, it will not be aware of those additions, and the result is that during failover we may have either a partial recovery or a failed recovery.
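
This SRM discrepancy boils down to a set difference between the VMs on the replicated Data Store group and the VMs covered by the Protection Group; here is a minimal sketch with hypothetical VM names.

    # Minimal sketch: compare the VMs actually living on a replicated datastore
    # group with those covered by the SRM protection group. Hypothetical data.
    vms_on_datastore_group = {"erp-db", "erp-app", "erp-web", "erp-reports"}
    vms_in_protection_group = {"erp-db", "erp-app", "erp-web"}

    unprotected = vms_on_datastore_group - vms_in_protection_group
    if unprotected:
        print(f"RISK: not covered by the SRM protection group: {sorted(unprotected)}; "
              f"failover will be partial or will fail")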


Of course, this is one of the most obvious discrepancies that can happen. But SRM should be aligned [to production] in many other aspects that relate to the physical layer and the VM layer. With that, I’m going to give control back to Gil, who will take us to the conclusion of this presentation.


Gil: Thank you, Doron. Doron shared about 10 examples of downtime and data loss issues that can happen in the cloud. Now, the database of issues that we chose from actually contains thousands of examples of other problems that can also happen in the private cloud. I think one thing that becomes clear, when we talk about so many possible risks and about how important the cloud is, is that no one who actually cares about their data or about availability can afford to overlook those risks. Overlooking those risks means you’re waiting for disaster to happen; it means that you will lose either your entire cloud or just some of it at a given moment sometime in the next couple of years. So, let’s try to analyze what can be done. There are really two options. One option, which we don’t like much but is an option, is simply to do very frequent manual testing. The problem with frequent manual testing is that, first, it is amazingly expensive: it takes a lot of manpower and lots of hours, and it is a significant investment. In addition to that, the testing itself, if you’re doing a live test, can actually cause downtime and data loss, because when you’re running a real test with people you essentially need to move VMs from one place to another or fail VMs over to disaster recovery. And what if it doesn’t work? If it doesn’t work, it means you just lost availability or potentially even lost the data they’re sitting on. So, I think it’s pretty clear that automation is the right way to go. And we believe that the best method is to combine frequent automatic risk detection with very infrequent manual testing. In other words, do a test once a year with users and real systems, start moving VMs around, do a failover to disaster recovery and all that; and find a way to do automatic testing, which is completely non-intrusive and takes no manpower, literally every day. That is exactly what Continuity is all about. So, with your permission, I want to spend just two minutes telling you what Continuity is doing, and then we’ll go to the questions and answers.


Continuity is a company that was started in 2004 and is focused solely on finding downtime and data loss risks in the data center. We have numerous large enterprise customers, some of whom you can see here; many of them, obviously, are not on this board. And that’s our business. Our business is to help companies avoid downtime and data loss. The way we do it is with a product called RecoverGuard.


RecoverGuard is essentially a product that sits in one place in the network. It scans the entire data center once a day, and it scans everything: storage, servers, virtual machines, disaster recovery, high-availability clusters – everything. It automatically understands the configuration; it automatically understands what is a cloud, what is a database, and how things are connected. Then it uses its gap detection capability to find problems that can cause downtime or data loss. Today we have over 4,000 example [risk] signatures in the database, which is community driven. And when I say “community driven” I mean that our customers and the vendors we partner with – which are all the large enterprise vendors – contribute their knowledge, allowing us to put it in the database and help customers prevent downtime and data loss. So, I’m not going to take you through all the details. You can go to our website and check it out, but I will say that we cover today all the fields of data protection and availability.
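
Conceptually, signature-driven gap detection can be pictured as running a library of predicates against the collected configuration; the sketch below is an illustration of the idea only, with made-up signatures, and not RecoverGuard’s actual engine or signature format.

    # Minimal sketch of signature-driven gap detection: each signature is a
    # predicate over the collected configuration. Hypothetical signatures and data.
    signatures = [
        ("SAN path redundancy", lambda cfg: cfg["san_paths"] >= 2),
        ("Public network redundancy", lambda cfg: cfg["public_nics"] >= 2),
    ]

    scanned_hosts = [
        {"name": "esx01", "san_paths": 4, "public_nics": 2},
        {"name": "esx03", "san_paths": 1, "public_nics": 1},
    ]

    for host in scanned_hosts:
        for title, check in signatures:
            if not check(host):
                print(f"{host['name']}: failed signature '{title}'")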


We can really help customers who care about downtime and data loss.

I’ll just show you the product for 30 seconds and then we will move to the questions and answers.


When you install the product, that’s what you see; that’s our dashboard. It will show you the list of business services and the risk level associated with data protection, availability, optimization, and SLA. If you click on one of those areas you’ll see the list of risks below.


And if you click on one of them you’ll see all the details required to understand and resolve the risk. It will show you the relevant technology, the description of the problem, how it happened, when the change occurred, how to resolve it, the history of the problem, etc. Now I want to hand it over to Julie. Please feel free to ask questions in the web interface; you have a questions module.


Julie: Thank you, Gil. Thank you, Doron. We would now like to move to the Q&A session of the webinar. For those of you interested in asking questions, you can use the question module in the GoToWebinar viewer. We will respond to questions either via the chat or out loud if the question is one that the team here thinks is relevant for everyone to hear.


Gil: First question is, “Isn’t VMware SRM already solving some part of this problem?”

Doron: Well, as I said earlier, SRM is actually a great way to automate failover processes. What it doesn’t do is, first, look into the production configuration itself. You might have noticed that some of the risks we discussed can actually affect the performance or availability of our primary, or production, data center, not just the configuration of the protection site. SRM will definitely not catch those. What’s also apparent is that SRM manages processes, but it will not help you understand configuration issues that relate to the way the infrastructure itself is configured – for example, if you have a replication issue or a SAN mapping issue, and so on. So, while it is a great way to automate processes, it lacks a mechanism to audit the validity of your implementation. It doesn’t do health checking, which is basically what our product is all about.

Gil: Thank you, Doron. With that, we have another question: “How does RecoverGuard operate in a disaster situation?”

Doron: Basically, the value of the tool is apparent any given day of the year, not just during a disaster, because it will find all the issues you want to fix prior to the disaster. But, having said that, when you do have a disaster it is extremely important to have a faithful repository that documents the way your environment looked just the minute prior to the failure, and to understand what kind of exposures you hadn’t yet managed to fix prior to the failure. So, by placing RecoverGuard at the recovery site, you can have that repository available. If something is broken and doesn’t work when you attempt to fail over, you have that reference, and it will probably suggest what went wrong. That’s a valuable asset during a disaster!

Gil: Thank you, Doron. We have another question for you, “Isn’t the product a little bit like CMDB?”

Doron: That’s an interesting question. RecoverGuard does have its own internal CMDB engine, that’s true. But that’s definitely not the core value of the tool. Of course, if you do have other CMDB tools, we can seamlessly integrate the two, and RecoverGuard can supply information that “traditional” CMDB tools may be missing. But that’s not the main reason to use this tool. What most CMDB tools are lacking (or I should say all of them) is the business intelligence, or the risk-detection capability, that RecoverGuard demonstrates. So, while we do have some sort of an internal CMDB engine to capture all of that information, the main strength of the tool is actually analyzing the configuration rather than collecting data. That hopefully sheds some light on that issue.

Gil: Thank you. And here is a question that is coming from Arizona. “Who is typically the user of such a testing solution?”

Doron: Obviously the tool will reveal very refined technical issues – it is a powerful tool for the VMware administrator as well as for adjacent subject matter experts like the storage and network teams, and so on, who will be very happy to subscribe to the outputs of the tool and will understand how vulnerabilities can affect the business. But if your organization also has a business continuance department, then they will probably love to be users, because [RecoverGuard] gives them control over what’s going on. They won’t have to rely on manual communication, and they can provide management with ongoing information as it relates to recoverability and recovery goals, RPOs, and so on. So, it’s relevant both to business continuance managers and to the technical managers of the infrastructure itself.

Gil: Thank you, and we’ll choose one more technical question and then maybe move to a different class of questions. The question is, “What about non-virtualized, non-VMware environments?”

Gil: So this becomes a little bit of a demonstration of the difference between the private cloud and the public cloud. We are hosting this webinar on the public cloud. We were hoping it was going to be very reliable. Unfortunately we have had repeated audio issues and we just lost Doron again. So let’s take a couple of other types of areas that we get questions on. One question that I see here is about licensing of the software and also about pricing of the software. I’ll try to answer both.

Essentially the product is priced based on the number of physical servers you have. So, if you have 20 ESX servers and 400 VMs, then the product is priced based on those 20 ESX servers, and you multiply that by the price of the product. And I believe Doron is back, so let’s go back to the previous question. Doron, the question is for you: “What about non-virtualized, non-VMware environments?”

Doron (back again after our webinar host lost the voice bridge again…): I’ll take it in two quick parts. It [RecoverGuard] was originally conceived to protect physical environments, so it will do a perfect job of analyzing your clusters, your storage, and your physical boxes. But as it also relates to the cloud, as most of you are probably keenly aware, the virtualized environment is not fully isolated. It still depends on some physical elements, whether some of the servers remain physical or some of the core infrastructure elements like directory services, DNS, and so on are still physical – so you also want to understand whether you have dependencies on physical elements that are not protected well enough. The best approach is to use this technology not only to monitor the private cloud, but also any physical element that is configured for high availability or DR, geo-clusters and so on, which are, of course, supported by the tool.

Gil: Thank you, and one more interesting question. “Can’t I just manually check VM relocation from time to time?”

Doron: The answer is yes, that’s definitely something that is doable. The problem with this approach is twofold. First, it relies on [labor-intensive] human effort, which it makes much more sense to automate. Second, if you try to do that, you’ll quickly come to the realization that it requires a couple of hours a day, per test and per VM. This is an effort that is much better served by having a machine run it. Now, tracing the locations of virtual machines is definitely just one aspect, albeit an important one, and hopefully we’ve demonstrated that there are other issues to consider. So if you go about looking into each one of those potential problems manually, you will end up investing too much effort. Our belief is that such routine tests must be automated, thereby saving you the effort and increasing the reliability of your environment and your control. So the short answer is yes, but why should you?

Gil: Thank you. And now we will choose the last question for today. Obviously, after we finish this webinar we will continue to stay online, and you are welcome to use the questions module; we will answer your questions directly. So, the last question that we will choose is – I am having a hard time choosing – OK: “Are there any other testing solutions out there for such an environment?”

Doron: It’s important to distinguish, or to clarify, the term “testing.” If you are using SRM, for example, then you are probably well aware of the fact that SRM has some testing capabilities of its own. Basically it can simulate failover scenarios and thereby bring up protection groups as you choose, based on copies. That definitely tests at least some aspects of recoverability, but it does not test issues that relate to your production site configuration, to HA, to the dependencies between the virtual infrastructure and storage, to replication, and so on. It also does not really show whether the recovery environment can withstand sustained load once activated for a prolonged time. So, while such a test is viable, it leaves many areas still exposed. There are not too many other testing tools in the private cloud environment… Of course, in the physical world you have additional, similar tools, but again, all of them simulate just the failover process and do not analyze possible discrepancies or misconfigurations of the infrastructure. So, if you want to be really proactive and make sure your environment is risk-free, you will have to look into those areas as well, meaning, “Is my failover solution configured correctly? Is it optimal? Does it make sense?” And so on. To do that, you should probably look into all the [other layers of the] environment in addition to testing.

Gil: Thank you, Doron. So, we have about 45 more questions to answer, but unfortunately we’re not going to do them publicly. We will just answer them directly in the system. I want to thank everyone for joining us, and we will make the presentation and the recording of this webinar available to you on our website; that will take us about a week. Obviously you’re welcome to go to our website, where you can find many more signatures and examples of problems that can happen in the data center, in virtualized and non-virtualized environments. So thank you, Doron. Thank you, Julie. And see you all soon.

Julie: Thank you, Gil. And thank you to everyone who has taken the time to join today. Have a good evening, good day, and please do let us know if you have any questions. On the last slide as you can see there are emails