
SQL Server Clustering Tips


Tips for SQL Server Clusters:
 
 
A server cluster allows you to connect a number of physical servers—or nodes—that act as failover partners for each other. The redundancy a cluster provides spells more uptime for your critical
operations. I’ve implemented lots of clusters in my 13 years working with SQL Server™, and each had its own set of issues. That experience allowed me to gather a number of tips that can help make your clustering efforts easy and successful.
Server clusters take advantage of the built-in clustering capabilities of the Enterprise Editions of the Windows Server® family. In fact, for clustering purposes, Windows Server 2003 is significantly better than Windows 2000 Advanced Server. To maximize the benefits you’ll gain from clustering, you need the right hardware, and that involves some expense.
It’s not sufficient to slap a couple of servers together with a shared disk, and you can’t rely on the fact that individual hardware components may be in the Windows® Catalog (formerly known as the Hardware Compatibility List). The system as a whole must be in the Windows Catalog. But don’t worry: there are some approved, lower-cost cluster solutions available.
Figure 1 shows a typical cluster configuration.
Figure 1 A typical cluster
Of course, there’s more to clustering than hardware; you also need to choose the right edition of SQL Server 2005. The Enterprise Edition enables clustering, as well as other useful features, such as the ability to leverage more CPUs, distributed and updateable partitioned views, built-in log shipping, and automatic use of indexed views. If you already have an Enterprise Edition license, you should consider clustering, whether or not you have the two to eight servers necessary to form a traditional cluster (we’ll talk about one-node clusters in a minute). If you have SQL Server 2005 Standard Edition, you can install a two-node cluster.
Windows Server 2003 Enterprise and Datacenter Editions come with clustering built in. All you have to do is run Cluster Administrator to set up your cluster. You can add all nodes at once or one at a time. Similarly, when you install SQL Server, you can elect to install on an individual, non-clustered server, or you can install a virtual instance on a cluster. If you elect to install a virtual instance, you can install on all nodes of the cluster, just some nodes, or even only one node.
Finally, to achieve the true aim of clustering, high availability, you need qualified people and well-rehearsed procedures to follow if bad things happen. Although clustering is good insurance against hardware failure, it doesn’t prevent user errors, such as the dropping of a critical table by a sleep-deprived DBA in the middle of the night.
 
 
 
One-Node Clusters
 
Even if all you need at the moment is a single server, consider creating a one-node cluster. This gives you the option of upgrading to a cluster later, thus avoiding a rebuild. Just be sure that the hardware you choose is on the cluster portion of the Windows Catalog.
It’s not merely for high availability that you’d want the option to add a node at a later date. Consider what happens if you find that your server just doesn’t have the necessary capacity. That translates to a migration, and that takes time and effort. If you have a one-node cluster, migration is easier, with far less downtime. You add the new node to the cluster, add the SQL Server binaries and service packs to the new node, and then fail over to the new node. Then you add any post-service pack updates and, finally, evict the old node. The downtime is only the time it takes to fail over and add the updates (if any).
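If you go this route, it’s handy to be able to confirm from T-SQL which node currently owns the instance and which nodes are available as failover partners. Here’s a minimal sanity check, assuming SQL Server 2005 (fn_virtualservernodes also exists in SQL Server 2000):

    -- Which physical node is hosting this clustered (virtual) instance right now?
    SELECT SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS CurrentNode,
           SERVERPROPERTY('IsClustered') AS IsClustered;

    -- List the nodes on which this virtual server can run
    SELECT * FROM ::fn_virtualservernodes();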
 
 
 
Adding Nodes
 
Since all nodes in a cluster must be the same, you’ll want to act sooner, rather than later, to get that extra node. If you wait too long, that model of server may go out of production. On one project, I had to rebuild a node in a SQL Server 2000 cluster. I had the OS/network admin handle the basic machine build, then I jumped in to add it back to the cluster and prepare it for service as a SQL Server node. All went well until I failed over to the new node. Much to my dismay, it failed right back. To make a long story short, although I had prepared a detailed document on building a new cluster, including adding the cluster service and SQL Server service accounts to both nodes, the document wasn’t followed explicitly. The admin didn’t add those service accounts to the rebuilt node, so the privileges they had before the rebuild no longer existed.
It took me a long time to track that one down. One day it occurred to me to look at local group membership. Once I added the two accounts, failover went smoothly. Then I got to thinking. Rebuilding a node is something you don’t do frequently and, if you do, it’s an emergency. Although I had a document in place, it wasn’t used. We could have automated the security part by simply writing a brief script to add those two accounts and make any other necessary customizations. Things have improved in SQL Server 2005, though. The installer requires you to set domain-level groups for the SQL Server service accounts.
Of course, this got me thinking even more. You can create scripts that invoke CLUSTER.EXE to add the node to your Microsoft® Cluster Server (MSCS) cluster. All you have to do is feed the script the name of the node and it can handle the rest. In an emergency, automation is really your friend.
 
 
 
N+1 Clusters
 
Sometimes, the reason for adding a node to a cluster isn’t that you’re replacing a node. You could be adding more SQL Server instances to your cluster, and each instance needs separate disk resources. Though multiple instances can run on a single node, they would be sharing CPU and RAM, and that could spell poor performance. Ideally, only a single instance should run on a single node. How do you ensure that when you fail over? Simple: one node runs no services, while each of the other nodes runs one SQL Server instance. In fact, that’s the definition of an N+1 cluster: N instances running on N+1 nodes. The extra node is the backup.
 
 
 
Upgrading SQL Server
 
Upgrading a clustered instance of SQL Server is not for the faint of heart: it’s clustered for a reason, because you need uptime. But SQL Server 2005 offers a number of enhancements you’ll want to take advantage of, so if and when you’ve got to do it, you’ll want to proceed without a lot of downtime.
What are your choices? Let’s look at the most expensive solution first: creating a whole new cluster, which means new servers and perhaps a new storage area network (SAN). You can probably keep the existing network switches, but that’s about all. Obviously, this approach isn’t cheap but it has advantages. New hardware generally performs much better than old, with disk capacity and speed increasing at an ever-growing rate. Thus, you’ll get a performance boost with new hardware alone. You may even want to lease your equipment just to stay ahead of the curve.
Once you have the hardware in place, you can create your new virtual SQL Server on this setup, copy your production databases over, and then put the new system through its paces, leaving plenty of time to shake out the bugs before cutover day. Just be sure to script out the logins from your existing server. (Check out support.microsoft.com/kb/246133. It’s also a good idea to update your login build script in case of catastrophic failure.)
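Once you’ve created the sp_help_revlogin procedure from that KB article in master, generating the login script is a one-liner. The inventory query is a sketch assuming SQL Server 2005:

    -- Generate CREATE LOGIN statements that preserve SIDs and password hashes
    -- (requires sp_help_revlogin from KB 246133 to be installed in master)
    EXEC master.dbo.sp_help_revlogin;

    -- Quick inventory of logins and SIDs to compare old and new servers
    SELECT name, sid, type_desc
    FROM sys.server_principals
    WHERE type IN ('S', 'U', 'G');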
To minimize downtime, you’ll likely have to use log shipping, unless your databases are quite small and you have a period of time in which no users are connected. You can log-ship right up to just before cutover. Then, kick the users out, cut and ship the final log, and point the app at the new instance. (Check out the database mirroring section below for an interesting alternative to log shipping.) If you use DNS aliases, you probably won’t even need to point the apps to the new instance; just update the DNS alias instead. This approach has the advantage that if you get partway through the migration and have to revert to the original, at least you still have the original.
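The final cut and ship might look something like this sketch (the database name and share path are made up, and this assumes the standby has been restoring logs WITH NORECOVERY all along):

    -- On the old instance, after disconnecting users: back up the tail of the log.
    -- WITH NORECOVERY leaves the old database in a restoring state, so no
    -- stray writes can occur after the final backup.
    BACKUP LOG MyAppDB
        TO DISK = N'\\FileShare\LogShip\MyAppDB_tail.trn'
        WITH NORECOVERY;

    -- On the new instance: apply the tail and bring the database online.
    RESTORE LOG MyAppDB
        FROM DISK = N'\\FileShare\LogShip\MyAppDB_tail.trn'
        WITH RECOVERY;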
You can take a less expensive route, but it requires more advance planning. A cluster can support more than one SQL Server instance, but each instance must have its own disk resources. So when you’re carving up your SAN, set one LUN aside for a future upgrade. To perform the upgrade, install SQL Server binaries on this disk resource. You can exercise the system and, when you’re ready, shut down the current SQL Server, move the disk resources from the old SQL Server group, update the dependencies, and bring the new SQL Server instance online. Attach the databases from the old instance, and you’re up and running. (You did back everything up ahead of time, right?)
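The detach/attach step itself is straightforward; here’s a minimal sketch (database name and file paths are illustrative):

    -- On the old instance: detach the database
    EXEC sp_detach_db @dbname = N'MyAppDB';

    -- On the new instance: attach the data and log files
    -- (once attached here, the database is upgraded and can't go back)
    CREATE DATABASE MyAppDB
    ON (FILENAME = N'S:\SQLData\MyAppDB.mdf'),
       (FILENAME = N'T:\SQLLogs\MyAppDB_log.ldf')
    FOR ATTACH;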
That’s the less-expensive approach, and it carries some risk. If something goes bad, you can’t detach the databases from the new instance and put them back; once a database has been attached to (and upgraded by) the newer instance, it can’t be reattached to the old one. You’re reduced to restoring from backups, and that can mean some serious downtime.
An alternative is putting two instances of SQL Server on your SAN, assuming you have enough space. You restore production backups (and log ship) to the new instance, and proceed much as I described earlier. However, now you have a fallback. Once you’ve done the migration, you can free up the SAN resources from the old instance. It’ll cost you only the price of the extra disks.
Load Balancing
 
Let’s start by debunking a common misconception. You use MSCS clustering for high availability, not for load balancing. Also, SQL Server does not have any built-in, automatic load-balancing capability. You have to load balance through your application’s physical design. What does that mean?
As a table grows, you can expect to see some degradation in performance, particularly when table scans are involved. When you get into the millions or billions of rows, the traditional solution has been to use partitioned views, which are made up of tables with identical schemas hooked together with UNION ALL statements. CHECK constraints are put in place to differentiate the member tables, which prevents data duplication across the partitioned view. If the column used in the CHECK constraint is also part of the primary key, the view is updatable.
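Here’s a minimal sketch of a local partitioned view, partitioned by year (table and column names are made up):

    -- Member tables with CHECK constraints on the partitioning column,
    -- which is also part of the primary key
    CREATE TABLE dbo.Orders2005 (
        OrderYear int NOT NULL CHECK (OrderYear = 2005),
        OrderID   int NOT NULL,
        Amount    money NOT NULL,
        PRIMARY KEY (OrderYear, OrderID)
    );
    CREATE TABLE dbo.Orders2006 (
        OrderYear int NOT NULL CHECK (OrderYear = 2006),
        OrderID   int NOT NULL,
        Amount    money NOT NULL,
        PRIMARY KEY (OrderYear, OrderID)
    );
    GO
    -- Because OrderYear is CHECK-constrained and part of the primary key,
    -- the view is updatable
    CREATE VIEW dbo.Orders
    AS
    SELECT OrderYear, OrderID, Amount FROM dbo.Orders2005
    UNION ALL
    SELECT OrderYear, OrderID, Amount FROM dbo.Orders2006;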
If the member tables are on their own filegroups, you may get better disk performance if the files in those filegroups are on separate physical drives. The tables can even be in separate databases. In SQL Server 2005, however, as long as all of the data is in the same database, you can use table partitioning, which is far easier to implement.
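By contrast, SQL Server 2005 table partitioning needs only a partition function, a partition scheme, and a table created on that scheme. A sketch (the filegroup names are assumptions and must already exist):

    -- Partition function: splits rows at the year boundary values
    CREATE PARTITION FUNCTION pfOrderYear (int)
    AS RANGE RIGHT FOR VALUES (2005, 2006);

    -- Partition scheme: maps each partition to a filegroup
    CREATE PARTITION SCHEME psOrderYear
    AS PARTITION pfOrderYear TO (FG2004, FG2005, FG2006);

    -- The table is created directly on the scheme
    CREATE TABLE dbo.Orders (
        OrderYear int NOT NULL,
        OrderID   int NOT NULL,
        Amount    money NOT NULL
    ) ON psOrderYear (OrderYear);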
But let’s say you’ve tried your best with table partitioning or (local) partitioned views and still things are slow. If you have SQL Server 2000 or SQL Server 2005, you can use distributed partitioned views. The major difference is that the member tables can reside on different instances of SQL Server and those instances can be installed on an N+1 cluster. Why is this a good idea? If any one member table goes offline in a partitioned view, the entire view goes offline. Making those members part of a cluster then gives you the reliability you need to support performance and provide load balancing.
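A distributed partitioned view looks just like the local one, except that some member tables are reached through linked servers using four-part names. A sketch, assuming linked servers Server2 and Server3 are already configured and hold the remote member tables (all names are illustrative):

    -- Local member table plus remote members via linked servers
    CREATE VIEW dbo.OrdersDPV
    AS
    SELECT OrderYear, OrderID, Amount FROM dbo.Orders2005
    UNION ALL
    SELECT OrderYear, OrderID, Amount FROM Server2.Sales.dbo.Orders2006
    UNION ALL
    SELECT OrderYear, OrderID, Amount FROM Server3.Sales.dbo.Orders2007;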
Do You Really Need a Cluster?
 
Perhaps you have some spare servers lying around, but they aren’t in the Windows Catalog for clusters. It’s a shame to have to go out and buy new ones just to support a cluster when those servers are available.
Database mirroring may be an attractive alternative to clustering. Mirroring involves three elements: the instance that houses the mirrored database is known as the principal; the backup server is known as the mirror; and, if you want automatic failover, a third server, known as the witness, is required. Briefly, a transaction in a database on the principal gets run again in the mirror. If the principal goes down, the database can fail over to the mirror, automatically if you have a witness. You have to set up mirroring for each of your application databases, and you can’t mirror system databases.
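Setting up a basic mirroring session boils down to creating mirroring endpoints and pointing the partners at each other. A sketch (server names, port, and database name are all made up; the mirror database must first be restored WITH NORECOVERY):

    -- On each of the three servers: create a mirroring endpoint
    CREATE ENDPOINT Mirroring
        STATE = STARTED
        AS TCP (LISTENER_PORT = 5022)
        FOR DATABASE_MIRRORING (ROLE = ALL);
    GO
    -- On the mirror (run this first):
    ALTER DATABASE MyAppDB SET PARTNER = 'TCP://principal.example.com:5022';
    -- On the principal:
    ALTER DATABASE MyAppDB SET PARTNER = 'TCP://mirror.example.com:5022';
    -- Optional witness, for automatic failover (set from the principal):
    ALTER DATABASE MyAppDB SET WITNESS = 'TCP://witness.example.com:5022';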
The mirror is a separate instance of SQL Server, unlike in a cluster, and can be located thousands of miles away. Its caches get populated by the update activity that occurs as a result of the transactions duplicated from the principal. (This assumes, of course, that there is no activity on the mirror other than receiving the mirrored transactions from the principal.) Failover is generally quicker than in a cluster, since SQL Server is already running on the mirror. Because the caches are at least partially primed, the initial performance is not as sluggish as it might be in the clustered scenario. And note that when a mirrored database fails over, the roles of principal and mirror are reversed.
The downside of database mirroring is that you need double the total disk capacity of a cluster. You’ll also need more CPU power if you go for synchronous mode with no data loss. As I said, high availability isn’t cheap.
A Combined Approach
 
Since a mirror can be located quite remotely from the principal, it’s a good choice for Disaster Recovery (DR) plans. Your cluster can be your first line of defense, but what happens if you employ both clustering and mirroring? In a cluster failover, if you have a witness as part of your mirroring configuration, the mirror will become the principal while the clustered SQL Server is coming online. However, note that failover from the new principal back to the (clustered) new mirror is not automatic. Consequently, it’s better not to enable automatic failover for your mirrored databases when used in conjunction with a cluster.
DR isn’t the only reason you’d use mirroring; it’s also useful if you have to apply a service pack or hotfix to your principal, in which case you can manually fail over to your mirror. While you’re applying the service pack or hotfix, the old principal server is temporarily offline, and the committed transactions occurring at the new principal are queued up, waiting to be sent back to the new mirror (the old principal). Once the service pack or hotfix installation is complete, synchronization takes place and eventually the two servers are completely in sync. Now you can switch the roles of principal and mirror back. Downtime was just the couple of seconds it takes to fail over and back. You can also use this approach to migrate your SQL Server to another box; just don’t fail back.
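The manual role swap itself is a single statement, run on the current principal once the mirror is synchronized (the database name is illustrative):

    -- Manual failover: the mirror becomes the principal
    ALTER DATABASE MyAppDB SET PARTNER FAILOVER;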

 

Virtual Server Adds Flexibility
 
Virtualization allows you to run one or more operating systems concurrently on a single physical server. Virtualization software adds another layer of capabilities to the cluster concept because you can cluster the virtualization software itself. Consequently, if the server on which the host is running fails, the host, and its guest OSs, fail over to a backup node. This can be a very easy way to migrate a guest server. Plus, the guest OS does not have to be cluster-capable. Thus, you could run SQL Server Workgroup Edition inside a guest Windows Server 2003, running on Microsoft Virtual Server 2005 on a cluster.
Indirectly, you have essentially clustered Workgroup Edition (see Figure 2).
Figure 2 Using a virtual server

 

In Control
 
If you’re in charge of a SQL Server implementation, you need to know that your server is always available. Server clustering helps to ensure that’s the case, and this article provides some hard-earned tips to help you get started.

High Availability and Disaster Recovery Process


What Do The Terms High Availability And Disaster Recovery Mean?

Before we get too deep into this, I figured I would define the terms High Availability and Disaster Recovery since they are quite often used together and sometimes, mistakenly, interchangeably.

High Availability – Protects against hardware/software issues that may cause downtime.  An example could be a bad processor or memory issue.

Disaster Recovery – The ability to recover from a catastrophic event, such as a natural disaster or, on a more local level, a data center disaster (fire, flood, cooling outage, etc) and be able to continue business operations with minimal loss and minimal downtime.

These two concepts really go hand in hand in ensuring business continuity and both should be built into a business continuity solution.

High Availability/Disaster Recovery Options In SQL Server

SQL Server has a number of native features built in to provide some basic measure of high availability/disaster recovery.  These include:

  • Database backups – this is probably the most basic form of disaster recovery for SQL Server and one that should be practiced in every situation, regardless of what other HA/DR solutions are in place (see the backup sketch after this list).
  • Clustering – this provides a way of binding two or more like Windows servers together in what is known as a cluster.  Each of the servers in the cluster is considered a node and, typically, one node is “active” (processing transactions) while the other nodes are “passive”.  A private network runs between the nodes; if the active node fails to deliver a “heartbeat” that the other node(s) can detect, an automatic failover is invoked and one of the passive nodes is promoted to active.
    • Pros of Clustering:  Clustering provides redundancy in the case of server hardware failure and provides a fairly quick (within 5 minutes), automatic solution to move processing to another server.
    • Cons of Clustering
      • Does not protect against disk issues since all nodes share the database on the same disk.
      • Only protects against issues with that specific server, not data center-wide issues, since all nodes are located in the same data center.
      • Only addresses availability, not disaster recovery
  • Database Mirroring – new in SQL Server 2005, database mirroring offers a way to mirror a database to another server (and disk).  All transactions are sent to the mirror server as they are committed on the production server.  Depending on how it is implemented, it can automate failover, similar to clustering.
    • Pros of Database Mirroring
      • Provides some form of both HA and DR, since the mirror can be located in another data center, thus protecting you from hardware failure and disaster.
      • Fast: the mirror is updated virtually instantly.
    • Cons of Database Mirroring
      • Only done at the database level, not the instance level, and only user databases can be mirrored, not system databases.  This means that some other form of synchronizing logins and other system database objects has to be devised.
      • To get all features of database mirroring, Enterprise Edition has to be used.
      • Any SQL Agent jobs must be manually enabled on the mirror if a failover takes place.
  • Log Shipping – this is one of the oldest forms of DR available in SQL Server.  It involves setting up a warm standby server with a copy of the user database that is to be protected; backups of the transaction log from the production database are periodically shipped to the standby server and applied.
    • Pros of Log Shipping:
      • Tried and true technology that has been around for a long time.
      • At the database level, can provide both HA and DR protection because the warm standby can be located in another data center.
    • Cons of Log Shipping:
      • Amount of potential data loss is higher than with the other options because logs are usually shipped no more frequently than every 5 minutes, and typically more like every 30 minutes to an hour.
      • Failover is fairly manual and time-intensive.  It takes longer than the other options to bring the warm standby online.
      • Like database mirroring, this only protects a database, not the entire instance.
      • For SQL Server 2000, this feature is only available in Enterprise Edition.  Available in Standard Edition from SQL Server 2005 forward.
      • Does not transfer non-logged transactions or schema changes (security, addition of database objects, etc.).
  • Replication – while not necessarily intended as an HA/DR solution, replication can be used in both scenarios.
    • Pros of Replication:
      • Real-time data updates.  Minimal data loss, if any.
      • Can be used for both HA/DR, as the publisher and subscriber can be in different data centers.
    • Cons of Replication:
      • Complicated to set up and maintain.
      • No provided failover mechanism; this has to be created as part of the solution.
      • Again, a database-specific solution, not an instance-specific one.
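As promised above, here is a minimal backup sketch covering the first option, since everything else builds on it (database name and paths are made up):

    -- Full backup, plus periodic log backups to limit data loss
    BACKUP DATABASE MyAppDB
        TO DISK = N'E:\Backup\MyAppDB_full.bak'
        WITH INIT, CHECKSUM;

    BACKUP LOG MyAppDB
        TO DISK = N'E:\Backup\MyAppDB_log.trn'
        WITH CHECKSUM;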

Given that these native solutions really only operate at the database level and not the instance level, we chose to look at third-party options.  The product that we settled on was Double-Take.  While certainly not an easy solution to implement, Double-Take was attractive to us because it allowed us to set up a standby server in our hot site for each of our mission-critical SQL Servers and then continuously replicate the entire instance to the standby server.  It also provides for either automated (if the server stops responding) or manual failover (we have opted for manual) through a control applet that automatically swaps DNS entries between the production and the standby server when a failover is initiated.

Double-Take:  How It Works

Both the production and the standby server have to have the exact same SQL Server configuration (edition, build, directory structure/location, instances, etc.) installed.  The Double-Take software is then installed on both the production and standby servers and then, through the Double-Take software, the production server is configured as the source and the standby server is configured as the target.

During the configuration process, you can configure Double-Take to compress the data before it replicates it over the WAN to the standby server.  This can save a ton of bandwidth and helps ensure that transactions are queued on the target server as quickly as possible, minimizing data loss in the event of a failure.

Additionally, Double-Take will also generate scripts for the failover, failback, and restore of the databases back to the production server when it is back in commission.  These scripts and/or the replication can be customized by overriding the automatically generated rules that Double-Take creates.
