CHAPTER 56 SQL Server Disaster Recovery Planning

A few things may cause issues for this pattern, such as the need to make sure that no application keeps “state” from one transaction to the next. Additionally, the application and/or web tier must be able to route user connections (the load) to either site in some type of balanced or round-robin method. This is often done with big IP routers that use round-robin routing algorithms, for example, to determine which site to direct connections to. Active/active configurations can be created using peer-to-peer continuous data replication as well as other multi-updating subscriber replication topologies.

A slight twist to having two primary sites is to have one primary site and a secondary site that doesn’t process transactions but is actively used for reporting, testing, and other tasks (just no processing that changes anything). In the event of a primary failure, the secondary site can quickly take over full primary site responsibilities. This is essentially active/passive, with active “secondary usage” on the passive site (following the first active/passive DR pattern described previously). This type of configuration can take advantage of database mirroring and database snapshots (for the reporting). This variation has plenty of advantages: it distributes the workload widely and moves you up the DR pyramid.

Active Multisite DR Pattern

An active multisite DR configuration contains three or more active sites, with the intention of using any one of them as the DR site for another (as shown in Figure 56.4). This pattern allows you to distribute your applications redundantly between any pair of sites, but not to all three (or more). For instance, you could have half of Primary Site 1’s applications on Primary Site 2 and the other half on Primary Site 3. This way, you spread out the risk further and increase your odds of uninterrupted processing.
Again, having “stateless” applications is critical here, as is some smart routing of all connections to the right sites. Using continuous data replication and the database mirroring options allows you to easily create such a DR topology. And, again, you also have the secondary usage variation available to you if one or more alternative sites are passive (with secondary usage supporting reporting, for example).

Choosing a Disaster Recovery Pattern

We reduce these to patterns because, at the foundational level, they represent what you need to do to support the level of business continuity your company demands. Some companies can tolerate different levels of loss because of the nature of their business; others cannot. At the highest levels, it is fairly easy to match these patterns to what your business requires. In this chapter, we look at what SQL Server capabilities are available to help you implement these patterns.

Often, global companies devise a DR configuration that reserves each major data center site in their regions as the active or passive DR site for another region. Figure 56.5 shows one large high-tech company’s global data center locations. Its Alexandria, Virginia, site is also the passive DR site for its Phoenix, Arizona, site. Its Paris, France, regional site is also the DR site for its Alexandria, Virginia, site, and so on.

For companies that have multiple data center sites but only need to support the active/passive DR pattern, a very popular variation can be used. This variation is called reciprocal DR. As you can see in Figure 56.6, there are two sites (Site 1 and Site 2).
Each is active for some applications (Applications 1, 3, and 5 on Site 1 and Applications 2, 4, and 6 on Site 2). Site 1’s applications are passively supported on Site 2, and Site 2’s applications are passively supported on Site 1. Rolling out the configuration this way eliminates the “stateless” application issue completely and is fairly easy to implement. It is also possible to make the passive applications’ data available via database snapshots at the other reciprocal site (for free!), which further distributes the workload geographically. This configuration also reduces the risk of losing all applications if one site is ever lost (as in a disaster). Again, the Microsoft options that help you achieve this DR pattern variation are data replication to the DR site, log shipping, and even asynchronous database mirroring, with database snapshots available to help with some distributed reporting. As we noted previously, third-party products such as Symantec’s Veritas Volume Replicator can be used to push physical byte-level changes to the passive (hot) DR site at the physical tier level.

FIGURE 56.4 Active multisite DR pattern. (Primary Site 1, Primary Site 2, through Primary Site n are all active; each has a web and application tier, a SQL Server database tier, and a physical storage tier, kept “in sync” through bidirectional synchronization, with database snapshots at each site.)

FIGURE 56.6 Reciprocal DR. (Primary Site 1 is active for Applications 1, 3, and 5 and passive, with snapshots, for Applications 2, 4, and 6; Primary Site 2 is the reverse.)
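As a rough illustration of the reporting snapshots mentioned for the reciprocal sites, the following T-SQL sketch creates a database snapshot on a passive copy and points reporting queries at it. Everything named here is hypothetical and not taken from the text: the database App1DB, its logical data file name App1DB_Data, the snapshot name, and the file path. The sketch also assumes the passive copy is a mirror database in a synchronized mirroring session and that the edition in use supports database snapshots.

```sql
-- Hypothetical names throughout: App1DB, App1DB_Data, the snapshot name,
-- and the file path are placeholders for illustration only.
-- Run on the server that hosts the passive (mirror) copy.
CREATE DATABASE App1DB_Reporting_Snap
ON
(
    NAME = App1DB_Data,  -- must match the logical data file name of App1DB
    FILENAME = N'D:\Snapshots\App1DB_Reporting_Snap.ss'
)
AS SNAPSHOT OF App1DB;
GO

-- Reporting users query the snapshot, not the passive database itself.
USE App1DB_Reporting_Snap;
GO
SELECT name, create_date
FROM sys.tables;
GO

-- A snapshot is a static, point-in-time view. To give reports fresher
-- data, drop it and create a new one.
USE master;
GO
DROP DATABASE App1DB_Reporting_Snap;
GO
```

Because each snapshot is fixed at the moment it was created, a common design choice is to create new snapshots on a schedule and direct reporting connections to the most recent one.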
FIGURE 56.5 Using active regional sites for passive DR. (Sites shown: Phoenix, AZ; Alexandria, VA; Paris, France; Mumbai, India.)

Recovery Objectives

You need to understand two main recovery objectives: the point in time to which data must be restored to be able to successfully resume processing (called the recovery point objective) and the acceptable amount of downtime that is tolerable (called the recovery time objective). The recovery point objective (RPO) is often thought of as the time between the last backup and the point when the outage occurred. It indicates the amount of data that will be lost. The recovery time objective (RTO) is determined based on the acceptable downtime in case of a disruption of operations. It indicates the latest point in time at which the business operations must resume after a disaster (that is, how much time can elapse).

The RPO and RTO form the basis on which a data protection strategy is developed. Together they provide a picture of the total time that a business may lose due to a disaster, and both are very important requirements when designing a solution. Let’s put these terms in the form of formulas.

RTO is the difference between the time of the disaster and the time the system is operational again:

RTO = Time operational (up) - Time disaster occurred (down)

RPO is the time since the last backup of complete transactions, representing data that must be re-acquired or re-entered:

RPO = Time disaster occurred (down) - Time of last usable data backup

Therefore:

Total lost business time = RTO + RPO = Time operational (up) - Time of last usable data backup

Knowing your RPO and RTO requirements is essential in determining what DR pattern to use and what Microsoft options to utilize.

A Data-Centric Approach to Disaster Recovery

Disaster recovery is a complex undertaking unto itself. However, it isn’t really necessary to recover every system or application in the event of a disaster.
Priorities must be set to determine exactly which systems or applications must be recovered. These are typically the revenue-generating applications (such as order entry, order fulfillment, and invoicing) that your business relies on to do basic business with its customers. Therefore, you set the highest priorities for DR with those revenue-generating systems. Then the next level of recovery is for the second-priority applications (such as HR systems).

After you prioritize which applications should be part of your DR plans, you need to fully understand what must be included in recovery to ensure that these priority applications are fully functional. The best way is to take a data-centric approach, which focuses on what data is needed to bring up the application. Data comes in many flavors, as Figure 56.7 shows:

• Metadata: The data that describes structures, files, XSDs, and so on that the applications, middleware, or back end needs.
• Configuration data: The data that the application needs to define what it must do, or that the middleware needs to execute with, and so on.
• Application data values: The data itself within your database files that represents the transactional data in your systems.

FIGURE 56.7 Types of data and where the data resides. (Types of data: metadata, configuration data, and application data values. Location of the data, by tier: applications such as ERP, HR, and SFA; middleware such as EAI, ETL, and WS; back end such as SQL Server, files, and other; systems such as hardware, OS, and network. Components A and B within the tiers are marked as tightly or loosely coupled.)

As just mentioned, you first identify which applications you must include in your DR plans, and then you must make sure you back up and are able to recover that application’s data (metadata, configuration data, and application data).
As part of this exercise, you must determine how tightly or loosely coupled the data is to other applications. In other words (as you can also see in Figure 56.7), if on the back-end tier, Database A has the orders transactions and Database B has the invoicing data, both must be included in the DR plans (because they are tightly coupled). In addition, you must also know how tightly or loosely coupled the application stack components are with each layer. In other words (again looking at Figure 56.7), if the ERP application (in the application tier) requires some type of middleware to be present to handle all its messaging, that middleware tier component is tightly coupled with the ERP application, and so on.

Microsoft SQL Server Options for Disaster Recovery

You have seen the fundamental DR patterns you will be targeting, and you know how to identify the highest-priority applications and their tightly coupled components for DR. Now let’s look again at the specific Microsoft options available to implement various DR solutions. These options include data replication, log shipping, database mirroring, and database snapshots.

Data Replication

One of the strongest and most stable Microsoft options that can be leveraged for disaster recovery is data replication. Not all variations of data replication fit this bill, though. In particular, the central publisher model using either continuous or very frequently scheduled distribution is very good for creating a hot spare of a SQL Server database across almost any geographical distance, as shown in Figure 56.8. The primary site is the only one actively processing transactions (updates, inserts, deletes) in this configuration, with all transactions being replicated to the subscriber, usually in a continuous replication mode.
The subscriber at the DR site is as up-to-date as the last distributed (replicated) transaction from the publisher, which is usually near real-time. The subscriber can be used for read-only processing, provided that it is controlled properly and the read-only access does not hinder the replication processing or put your DR pattern at risk.

FIGURE 56.8 Central publisher data replication configuration for active/passive DR. (A SQL Server 2008 publication server at the active primary site and a remote distribution server feed continuous, transactional replication of the AdventureWorks database to a SQL Server 2008 subscription server at the passive DR site, which acts as an active read-only “hot spare.”)

The newer peer-to-peer replication option provides a viable active/active capability that keeps both primaries in sync as transactions flow into each server’s database, as shown in Figure 56.9. Both sites contain a full copy of the database, with transactions being consumed and then replicated simultaneously between them.

The complete setup of these data replication configurations is covered in Chapter 19, “Replication.”

Log Shipping

As you can see in Figure 56.10, log shipping is readily usable for the active/passive DR pattern. You must understand that log shipping is only as good as the last successful transaction log shipment. The frequency of these log shipments is critical to the RTO and RPO aspects of DR. This is not a real-time solution. Even if you are using continuous log shipping mode, there is a lag of some duration due to the file movement and log application on the destination.
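To make the lag described above concrete, here is a minimal T-SQL sketch of the three steps that log shipping automates: back up the log on the source, copy the file, and restore it on the destination. It is an illustration under stated assumptions, not the configuration the built-in log shipping jobs generate. CallOne is used only as a sample database name, and the drive letters, folders, and file name are hypothetical.

```sql
-- Hypothetical paths and file names; CallOne is a sample database name.
-- Assumes the secondary copy was initialized from a full backup of the
-- primary that was restored WITH NORECOVERY.

-- 1. On the primary (source) server: back up the transaction log.
BACKUP LOG CallOne
TO DISK = N'E:\Backup\CallOne_tlog_example.trn';
GO

-- 2. Copy the backup file to the secondary server. This happens outside
--    T-SQL (for example, through a scheduled file copy job).

-- 3. On the secondary (destination) server: apply the log and leave the
--    database able to accept further log restores.
RESTORE LOG CallOne
FROM DISK = N'F:\LogShare\CallOne_tlog_example.trn'
WITH NORECOVERY;
GO

-- 4. At failover time only: bring the secondary database online.
RESTORE DATABASE CallOne WITH RECOVERY;
GO

-- When the built-in log shipping feature is configured, its monitoring
-- table in msdb records how far behind the secondary is. Run this on the
-- secondary or monitor server.
SELECT secondary_database,
       last_copied_date,
       last_restored_date,
       last_restored_latency  -- minutes between log backup and its restore
FROM msdb.dbo.log_shipping_monitor_secondary;
GO
```

The data at risk at any moment is whatever has been committed on the primary since the last log backup that reached the secondary, which is why the backup, copy, and restore intervals together bound the achievable RPO.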
FIGURE 56.9 Peer-to-peer data replication configuration for active/active DR. (Two active primary sites, one North American and one in Asia, each run a SQL Server 2008 publication server and distribution server for the AdventureWorks database.)

FIGURE 56.10 Log shipping configuration for active/passive DR. (Transaction log backups of the CallOne database on the SQL Server 2008 primary “source” server at the active primary site are copied to, and restored on, the secondary “destination” server at the passive DR site; a SQL Server 2008 monitor server tracks the delay since the last log was shipped and the delay between logs loaded.)

FIGURE 56.11 Database mirroring and database snapshots for active/passive DR. (A SQL Server 2008 principal server at the active primary site, a witness server, and a mirror server at the passive DR site hold the AdventureWorks database; reporting users work against a database snapshot at the DR site.)

Remember, log shipping is reportedly destined to be deprecated by Microsoft (this has been announced only unofficially). So it is perhaps not a good idea to plan a future DR implementation around a feature that may go away.

Database Mirroring and Snapshots

Database mirroring is rapidly becoming the new, viable DR option from Microsoft. In either high-availability mode (synchronous) or performance mode (asynchronous), this capability can help minimize data loss and time to recover (RPO and RTO). As you can see in Figure 56.11, database mirroring can be used across any reasonable network connection that may exist from one site to another.
It effectively creates a mirror image that is completely intact for failover purposes if a site is lost. It is viable in both an active/passive pattern and in an active/active pattern (where a database snapshot is created from the otherwise unavailable mirror database and is used for active reporting).

NOTE
It is likely Microsoft will rapidly enhance database mirroring to support all DR patterns over time.

Setup and configuration of database mirroring are covered in Chapter 20, “Database Mirroring,” and full details of database snapshots are in Chapter 32, “Database Snapshots.”

Now, to complete the DR planning for your SQL Server platform, you must do much more homework and preparation. The next section explains an overall disaster recovery approach that includes pulling together all the right information and executing on a DR plan (and testing it thoroughly).

The Overall Disaster Recovery Process

In general, a handful of things need to be put together (that is, defined and executed upon) as the basis for an overall disaster recovery process or plan. The following list identifies where you need to start:

1. Create a disaster recovery execution tasks/run book. This should include all steps to take to recover from a disaster and cover all system components that need to be recovered.
2. Arrange for or procure a server/site to recover to. This should be a configuration that can house what is needed to get you back online.
3. Guarantee that a complete database backup/recovery mechanism is in place (including offsite/alternate site archive and retrieval of databases).
4. Guarantee that an application backup/recovery mechanism is in place (for example, COM+ applications, .NET applications, web services, other application components, and so on).
5. Make sure you can completely re-create and resynchronize your security (Microsoft Active Directory, domain accounts, SQL Server logins/passwords, and so on).
We call this “security resynchronization readiness.”
6. Make sure you can completely configure and open up network/communication lines. This also includes ensuring that routers are configured properly, IP addresses are made available, and so on.
7. Train your support personnel on all elements of recovery. You can never know enough ways to recover a system. And it seems that a system never recovers the same way twice.
8. Plan and execute an annual or bi-annual disaster recovery simulation. The one or two days that you do this will pay you back a hundred times over if a disaster actually occurs. And, remember, disasters come in many flavors.

Many organizations have gone to the concept of having hot alternate sites available via stretch clustering or log shipping techniques. Costs can be high for some of these advanced and highly redundant solutions.

The Focus of Disaster Recovery

If you create some very solid, time-tested mechanisms for re-creating your SQL Server environment, they will serve you well when you need them most. Following are the things to focus on for disaster recovery:

• Always generate scripts for as much of your work as possible (anything created using a wizard, SSMS, and so on). These scripts will save your hide. They should include the following:
    • Complete replication buildup/breakdown scripts
    • Complete database creation scripts (DB, tables, indexes, views, and so on)
    • Complete SQL login, database user ID, and password scripts (including roles and other grants)
    • Linked/remote server setup (linked servers, remote logins)
    • Log shipping setup (source, target, and monitor servers)
    • Any custom SQL Agent tasks
    • Backup/restore scripts
    • Potentially other scripts, depending on what you have built on SQL Server
• Make sure you document all aspects of SQL database maintenance plans being used.
This includes frequencies, alerts, email addresses being notified when errors occur, backup file/device locations, and so on.
• Document all hardware/software configurations used:
    • Leverage sqldiag.exe for this (as described in the next section).
    • Record what accounts were used to start up the SQL Agent service for an instance and the MS Distributed Transaction Coordinator (MS DTC) service. This step is especially important if you’re using distributed transactions and data replication.
    • The favorite SQL Server implementation characteristics that we script and record for a SQL Server instance are
        • select @@SERVERNAME: Provides the full network name of the SQL Server and instance.
        • select @@SERVICENAME: Provides the Registry key under which Microsoft SQL Server is running.
        • select @@VERSION: Provides the date, version, and processor type for the current installation of Microsoft SQL Server.
        • exec sp_helpserver: Provides the server name; the server’s network name; the server’s replication status; and the server’s identification number, collation name, and time-out values for connecting to, or queries against, linked servers.
        • exec sp_helplogins: Provides information about logins and the associated users in each database.
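The instance characteristics listed above can be gathered in one pass and kept with the run book. The following is a minimal sketch that uses only the statements named in the list. The column aliases are arbitrary choices for readability, and saving the output to a file is an assumption about how you might archive it, not a step from the text.

```sql
-- Capture basic instance characteristics for the DR run book.
-- The column aliases are illustrative; the functions and procedures are
-- the ones named in the list above.
SELECT
    @@SERVERNAME  AS server_name,   -- network name of the server and instance
    @@SERVICENAME AS service_name,  -- Registry key the instance runs under
    @@VERSION     AS version_info;  -- date, version, and processor type
GO

-- Server name, network name, replication status, identification number,
-- collation name, and linked-server time-out values.
EXEC sp_helpserver;
GO

-- Logins and the users associated with them in each database.
EXEC sp_helplogins;
GO
```

One way to keep a dated copy is to save the script to a file and run it with sqlcmd, for example `sqlcmd -S <server> -E -i <script file> -o <output file>`, where the server and file names are placeholders you supply.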