Best practices for meeting SLAs in a NetApp environment
by Yaniv Valik
SR DR Specialist, DR Assurance Group
Following many conversations with storage admins, system admins and business continuity personnel who rely on NetApp environments, I decided to post a short piece about best practices and recommendations which are specific to Network Appliance filers.
Common concerns raised by the people we speak with are:
“How do I know that all my data is replicated?”
“How can I guarantee that I will be able to recover data successfully using my SnapMirror / snapshot / SnapVault copies?”
“Is there a way to know the actual RPO for a specific server / filer / database?”
“I have policies related to the number of snapshots, frequency of replication, retention, etc. How can I get a clear picture of my datacenter’s status in terms of these policies? How can I detect SLA breaches as soon as they happen?”
The plain and unpleasant answer is: It’s very difficult to maintain SLA targets, especially with NetApp filers.
- There are many best practices. Too many. The vast majority of them are critical for ensuring high data availability and successful recovery in case of disaster. On top of the common best practices, each vendor has its own specific best practices. Network Appliance has plenty of those.
- Many of these crucial best practices are cross-domain / cross-platform. For example, NetApp and Sun Solaris have their own required settings. So do NetApp and Microsoft Windows, NetApp and VCS (Veritas Cluster Server), etc.
- A NetApp filer is a hybrid array, supporting both SAN and NAS configurations. Each has different recommendations and guidelines. This may lead to more complexity and more configuration mistakes. It also makes SLA measurements far more complicated.
- Truth be told, good tools for SLA management have not yet evolved within Network Appliance.
- Lastly, database, server, switch and filer configurations are dynamic and change every day. Every change may affect availability and recoverability. Verifying every single day that all best practices and guidelines are being followed on all the servers, databases, etc. is not an achievable task for humans. Moreover, however hard you try, vulnerabilities sometimes arise between teams, for example when a new database file is stored on an unreplicated file system.
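To make that last gap concrete, here is a minimal Python sketch of the kind of cross-team check involved. It assumes you can already export three lists: the database data files, the server's mount points, and the mount points that sit on replicated storage. The function name and all sample paths are hypothetical, and how the lists are collected is out of scope here.

```python
from pathlib import PurePosixPath


def unreplicated_files(data_files, mount_points, replicated_mounts):
    """Return the data files whose file system is not in the replicated set."""
    # Try the deepest mount points first, so /oradata/new wins over /oradata.
    mounts = sorted(
        (PurePosixPath(m) for m in mount_points),
        key=lambda p: len(p.parts),
        reverse=True,
    )
    replicated = {PurePosixPath(m) for m in replicated_mounts}
    gaps = []
    for name in data_files:
        path = PurePosixPath(name)
        # The first (deepest) mount point containing the file holds it.
        owner = next((m for m in mounts if m == path or m in path.parents), None)
        if owner not in replicated:
            gaps.append(name)
    return gaps


# Hypothetical sample: a new data file was placed on a file system
# that nobody added to the replication setup.
gaps = unreplicated_files(
    data_files=["/oradata/system01.dbf", "/oradata/new/users02.dbf"],
    mount_points=["/", "/oradata", "/oradata/new"],
    replicated_mounts=["/oradata"],
)
print(gaps)
```

The check itself is trivial. The hard part in practice is keeping the three input lists current across teams, every day.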
So, what can be done to meet SLA goals with NetApp?
As you understand by now, it’s impossible to list all known best practices in this post. Nevertheless, I have tried to list a few of the important best practices that, in my experience, are frequently violated over time, eventually resulting in SLA breaches:
- As much as possible, store the files of each business unit, database, etc. on a single NetApp volume. This significantly reduces the risk of snapshot/SnapMirror data inconsistency in case of disaster. It also makes the problem of forgetting to replicate some of the volumes less likely to occur. In addition, test on a daily basis that all required volumes, qtrees and LUNs have the correct number / type of copies and are up-to-date as expected.
- Use RAID protection. RAID protection allows overcoming disk failures with no downtime or loss of data. Moreover, have sufficient spare disks – at least one hot spare for each type of disk drive.
- Check that snapshot size does not exceed the snapshot reserve space. Exceeding it can impact operations in various ways, such as production downtime, RPO SLA violation, etc. Use auto-delete if possible. Also, use space-reserved LUNs in production.
- Check SnapMirror status on a regular basis, based on your replication frequency. Detect stuck replication processes as soon as possible, since they may result in an RPO SLA violation at best and in having no valid copy for DR at worst.
- Separate volatile data from critical data. Do not store swap files, temporary databases and database tablespaces on the same LUNs/volumes used to store production data. In a replicated environment, doing so wastes bandwidth, which may result in never-ending replication processes, RPO SLA violation, backup window issues, snapshot space filling up quickly, etc. In a synchronous replication environment it will also dramatically slow down server operations. For the same reasons, if you use storage replication for DR, do not also replicate daily database exports.
- Take point-in-time copies (snapshots, etc.) using consistency technologies. When creating a PiT copy for a file system, make sure to sync it before generating the snapshot. The use of tools such as SnapDrive is recommended. For databases, careful and accurate use of procedures such as Oracle hot backup and DB2 suspend I/O is mandatory to ensure recoverability and usable copies.
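As a rough illustration of what checking two of these practices looks like, the Python sketch below flags volumes whose replication lag exceeds the RPO target or whose snapshots have outgrown the snapshot reserve. The `VolumeStatus` record, its field names and the sample values are all hypothetical. The sketch assumes the figures have already been collected from the filers by whatever means you use, and it does not call any NetApp interface.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class VolumeStatus:
    """Hypothetical per-volume facts collected from the filer."""
    name: str
    replication_lag: timedelta   # time since the last successful transfer
    snapshot_used_bytes: int     # space currently consumed by snapshots
    snapshot_reserve_bytes: int  # space set aside for snapshots


def find_breaches(volumes, rpo_target):
    """Return one finding per violated policy, in input order."""
    findings = []
    for vol in volumes:
        if vol.replication_lag > rpo_target:
            findings.append(
                f"{vol.name}: replication lag {vol.replication_lag} "
                f"exceeds RPO target {rpo_target}"
            )
        if vol.snapshot_used_bytes > vol.snapshot_reserve_bytes:
            findings.append(f"{vol.name}: snapshots exceed the snapshot reserve")
    return findings


GIB = 2**30
# Hypothetical sample data: one healthy volume, one lagging, one over reserve.
sample = [
    VolumeStatus("vol_erp", timedelta(minutes=20), 40 * GIB, 100 * GIB),
    VolumeStatus("vol_crm", timedelta(hours=5), 30 * GIB, 100 * GIB),
    VolumeStatus("vol_mail", timedelta(minutes=10), 120 * GIB, 100 * GIB),
]
breaches = find_breaches(sample, rpo_target=timedelta(hours=1))
for finding in breaches:
    print(finding)
```

Note that a sketch like this covers only two rules on data that is assumed to be complete and current. It says nothing about consistency of the copies or about volumes that were never added to the list.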
Now – a moment of truth about the above six best practices:
Can you verify them on a daily basis? On all databases, servers, filers?
How much time will it take? Do you have the manpower?
What will be the cost?
Finally, can you do the same for hundreds of additional NetApp-specific guidelines and potential vulnerabilities, and thousands of non-specific best practices?
No one can. For this reason, we recommend the use of automatic monitoring tools such as RecoverGuard.
Of course, there are many additional questions. I will try to answer them in future posts. Stay tuned!