High Availability and Disaster Recovery process


What Do The Terms High Availability And Disaster Recovery Mean?

Before we get too deep into this, I figured I would define the terms High Availability and Disaster Recovery since they are quite often used together and sometimes, mistakenly, interchangeably.

High Availability – Protects against hardware/software issues that may cause downtime.  An example could be a bad processor or memory issue.

Disaster Recovery – The ability to recover from a catastrophic event, such as a natural disaster or, on a more local level, a data center disaster (fire, flood, cooling outage, etc) and be able to continue business operations with minimal loss and minimal downtime.

These two concepts really go hand in hand in ensuring business continuity and both should be built into a business continuity solution.

High Availability/Disaster Recovery Options In SQL Server

SQL Server has a number of native features built in to provide some basic measure of high availability/disaster recovery.  These include:

  • Database backups – this is probably the most basic form of disaster recovery for SQL Server and one that should be practiced in every situation, regardless of what other HA/DR solutions are in place.
  • Clustering – this provides a way of binding two or more like Windows servers together in what is known as a cluster.  Each of the servers in the cluster is considered a node and, typically, one node is “active” (processing transactions) and the other nodes are “passive”.  There is a private network that runs between the nodes so that if the active node fails to deliver a “heartbeat” that the other node(s) can detect, an automatic failover is invoked and one of the passive nodes is promoted to active.
    • Pros of Clustering:  Clustering provides redundancy in the case of server hardware failure and provides a fairly quick (within 5 minutes), automatic solution to move processing to another server.
    • Cons of Clustering
      • Does not protect against disk issues since all nodes share the database on the same disk.
      • Only protects against issues with that specific server, not data center-wide since all nodes are located in same data center.
      • Only addresses availability, not disaster recovery
  • Database Mirroring – new in SQL Server 2005, database mirroring offers a way to mirror a database to another server (and disk).  All transactions are sent to the mirror server as they are committed on the production server.  Depending on how it is implemented, can automate failover, similar to clustering.
    • Pros of Database Mirroring
      • Provides some form of both HA and DR since mirror can be located in another data center, thus protecting you from hardware failure and disaster.
      • Fast.  Mirror is updated virtually instantly
    • Cons of Database Mirroring
      • Only done at the database level, not the instance level and only user databases can be mirrored, not system databases.  This means that some other form of synchronizing logins and other system database objects has to be devised.
      • To be get all features of database mirroring, Enterprise Edition has to be used. 
      • Any SQL Agent jobs must be manually enabled on the mirror if a failover takes place.
  • Log Shipping – this is one of the oldest forms DR available in SQL Server and involves setting up a warm standby server with a copy of the user database on it that is to be protected and backups of the transaction log from the production database are periodically shipped to the standby server and applied.
    • Pros of Log Shipping:
      • Tried and true technology that has been around for a long time.
      • At the database level, can provide both HA and DR protection because warm standby can be located in another data center.
    • Cons of Log Shipping:
      • Amount of potential data loss is higher than with the other options because logs are usually shipped no more frequently than every 5 minutes and typically, more like every 30 minutes to an hour.
      • Failover is fairly manual and time intensive.  Takes longer than other options to bring warm standby online.
      • Like database mirroring, this only protects a database, not the entire instance.
      • For SQL Server 2000, this feature is only available in Enterprise Edition.  Available in Standard Edition from SQL Server 2005 forward.
      • Does not transfer non-logged transactions or schema changes (security, addition of database objects
  • Replication – while not necessarily intended as an HA/DR solution replication can be used in both scenarios.
    • Pros of Replication:
      • Real-time data updates.  Minimal data loss, if at all.
      • Can be used for both HA/DR as publisher and subscriber can be in different data centers.
    • Cons of Replication:
      • Complicated to setup and maintain.
      • No provided failover mechanism.  This has to be created as part of solution.
      • Again, only database specific solution, not instance specific.

Given that these native solutions were really only database and not instance based, we chose to look at third party options.  The product that we settled on was Double-Take.  While certainly not an easy solution to implement, Double-Take was attractive to us because it allowed us to set up a stand-by server in our hot site for each of our mission critical SQL Servers and then continuously replicate the entire instance to the stand-by server.  It also provides for either automated (if the server stops responding) or manual failover (we have opted for manual) through a control applet that automatically swaps DNS entries between the production and the standby server when a failover is initiated.

Double-Take:  How It Works

Both the production and the standby server have to have the exact same SQL Server configuration (Edition, build, directory structure/location, instances, etc) installed.  The Double-Take software is then installed on both the production and standby server and then, through the Double-Take software, the production server is configured as the source and the standby server is configured as the target.

During the configuration process, you can configure Double-Take to compress the data before it replicates it over the WAN to the stand-by server.  This can save a ton of bandwidth and makes sure that the transactions are queued on the target server as quickly as possible ensuring minimal data loss in the event of a failure.

Additionally, Double-Take will also generate scripts for the failover, failback, and restore of the databases back to the production server when it is back in commission.  These scripts and/or the replication can be customized by overriding the automatically generated rules that Double-Take creates.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: