2 – The SQL Server landscape

This chapter has shown how to document a SQL Server infrastructure, using some T-SQL scripts, a DBA Repository, SSIS and SSRS. If you are interested in deploying this solution in your environment, it is available to download from: http://www.simple-talk.com/RedGateBooks/RodneyLandrum/SQL_Server_Tacklebox_Code.zip. It can easily be modified to suit your needs.

With this repository, or something similar, in place you will have a means to easily get to know, and monitor, your SQL Server environment. Next, it is time to consider the sort of tasks that this SQL Server environment needs to do, and what others want it to do. A common request, just when you've got the environment nice and settled down, is to move the data around a bit. The migratory data, as it is often referred to, at least by me, is the subject of the next chapter.

CHAPTER 3: THE MIGRATORY DATA

When someone, usually a developer or project manager, enters my office and states, "We are going to need to move data from X to Y…" there usually follows a short inquisition, starting with the question, "Why?" Of course, I can probably guess why, as it is such a common request. As a data store grows, it often becomes necessary to "offload" certain processes in order to maintain performance levels. Reporting is usually the first to go, and this can involve simply creating a mirror image of the original source data on another server, or migrating and transforming the data for a data warehouse. QA and Dev environments also need to be refreshed regularly, sometimes daily.

As a DBA, I have to consider many factors before deciding how to allocate the resources required to build a data migration solution. As Einstein may have posited, it is mostly a matter of space and time or, more correctly, spacetime. The other, less scientific, variable is cost. Moving data is expensive.
Each copy you need to make of a 1.5 Terabyte, 500-table production database is not only going to double the cost of storage space, but also the time required for backup and restore, or to push the data to target systems for reporting, and so on.

In this chapter, I am going to cover the myriad ways to push, pull, or pour data from one data store to another, assessing each in terms of space, time and cost criteria. These data migration solutions fall into three broad categories:

• Bulk Data Transfer solutions – this includes tools such as Bulk Copy Program (BCP) and SSIS.
• Data Comparison solutions – using third-party tools such as Red Gate's SQL Data Compare, a built-in free tool such as TableDiff, or perhaps a homegrown T-SQL script that uses the new MERGE statement in SQL 2008.
• "High Availability" solutions – using tools for building highly available systems, such as log shipping, replication, database mirroring and database snapshots.

I'll review some of the available tools in each category, so that you're aware of the options when you come to choose the best fit for you and your organization, and I will provide sample solutions for BCP, SSIS and TableDiff. I will also cover log shipping in some detail as, in my experience, it continues to be one of the most effective data migration solutions, in terms of cost, space and time.

Mapping out the data migration solution

As always, the most appropriate solution will depend on your exact requirements, and each option varies in terms of complexity and the time it will take you, as DBA, to plan and implement it. A common mistake of the novice DBA is to declare recklessly to a manager, on receiving a relatively straightforward data migration request, that moving the data "should only take 10 or 15 minutes". I can say, with utmost confidence, that there is absolutely no data migration solution that takes 10-15 minutes to design, document and implement.
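As a taste of the "homegrown T-SQL script" option mentioned above, the sketch below shows the general shape of a MERGE-based refresh in SQL Server 2008. Note that the table and column names here are purely hypothetical, for illustration only; they are not part of the DBA Repository, and a real script would use your own key and comparison columns.

```sql
-- Minimal sketch of a MERGE-based refresh (SQL Server 2008+).
-- SourceDB, TargetDB, Customers and its columns are hypothetical.
MERGE INTO TargetDB.dbo.Customers AS tgt
USING SourceDB.dbo.Customers AS src
    ON tgt.CustomerID = src.CustomerID
-- Update rows that exist in both but have changed on the source
WHEN MATCHED AND tgt.LastModified < src.LastModified THEN
    UPDATE SET tgt.Name = src.Name,
               tgt.LastModified = src.LastModified
-- Insert rows that exist only on the source
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerID, Name, LastModified)
    VALUES (src.CustomerID, src.Name, src.LastModified)
-- Remove rows that have disappeared from the source
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;
```

The appeal of this approach is that one statement handles inserts, updates and deletes in a single pass, which is exactly the work a data comparison tool performs under the covers.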
There is also the question of monetary cost. Some of the tools are comparatively expensive whereas others, such as log shipping and BCP, are essentially "free". However, that can be misleading too: free is never really free. There is no free lunch, and no free data. One thing is for sure, though: regardless of cost, most data migration requests are approved because they satisfy an important business need, and this means that a DBA will be tasked with moving data at some point, regardless of cost, space or time.

When the topic of moving data from one location to another arises, I turn to my trusty wall-sized whiteboard and plethora of dry erase markers. I quickly assess the data migration needs with a series of probing questions, which I map out in a flowchart format on the whiteboard. Here are some of the typical questions, with some typical answers:

• "How much data are we talking about?" ["Roughly 15 Gigs worth of data a month"]
• "How often do you need the data refreshed?" ["Daily"]
• "Do you need the whole database(s) or a subset of the data?" ["A subset of data"]
• "Who is going to need access to the data?" ["Developers/Analysts"]
• "Does the data need to be modified on the target, or do we need to apply indexes on the target?" ["We need to apply indexes independent of the source"]
• "What version of SQL Server are you using on the source, and can the target be a different version and edition?" ["SQL Server 2000 and 2005 … unclear on the edition … that is your job, Mr. or Mrs. DBA"]

At this point, I have the information that I need. There are several possible solutions in this case, and which one you choose largely depends on cost. Log shipping is a solution that has served me well in my career, across the space, time and cost boundaries. However, this solution would not allow us to add indexes on the target system.
In addition, it is not possible to log ship between different versions of SQL Server, say SQL 2000 to 2005, and reap all of the benefits for a reporting instance, because you will be unable to leave the target database in Standby mode and therefore cannot access the database. There are many potential solutions to the "once-a-day refresh" requirement. Database snapshots may be a viable option, but they require Enterprise Edition, and the snapshot must reside on the same SQL Server instance as the source database.

While, on our imaginary whiteboard, we might cross off log shipping and snapshots as potential solutions for the time being, it would be a mistake to rule them out entirely. As I mentioned before, log shipping has served me well in similar scenarios, and it's possible that some future criteria will drive the decision toward such a solution. Bear in mind also that, with log shipping in place, it is possible to use your log shipped target instance both as a hot standby server for disaster recovery and as a server to offload reporting processes. However, for now let's assume that another solution, such as SSIS, BCP or TableDiff, would be more appropriate. Over the following sections, I'll demonstrate how to implement a solution using these tools, noting along the way how, with slight modifications to the criteria, other data migration solutions could easily fit the need.

The data source

Most of the examples for data migration will use the DBA repository database, DBA_Rep, discussed in the previous chapter, as the data source. The data that I will be working with for bulk loading via BCP, and data comparisons using TableDiff.exe, comes from the SQL_Conn table, whose schema is defined in Listing 3.1.
CREATE TABLE [dbo].[SQL_Conn](
    [Run_Date] [datetime] NULL,
    [Server] [varchar](100) NULL,
    [spid] [int] NULL,
    [blocked] [bit] NULL,
    [waittime] [int] NULL,
    [name] [nvarchar](128) NULL,
    [lastwaittype] [nvarchar](150) NULL,
    [cpu] [int] NULL,
    [login_time] [datetime] NULL,
    [last_batch] [datetime] NULL,
    [status] [nvarchar](50) NULL,
    [hostname] [nvarchar](128) NULL,
    [program_name] [nvarchar](150) NULL,
    [cmd] [nvarchar](60) NULL,
    [loginame] [nvarchar](128) NULL,
    [duration] [datetime] NULL
) ON [PRIMARY]

Listing 3.1: Schema for the SQL_Conn table.

This table is a heap; in other words, it has no indexes. It is populated using an SSIS job that collects connection information from each SQL Server instance defined in the DBA Repository, and merges this data together.

NOTE
For a full article describing the process of gathering this data, please refer to: http://www.simple-talk.com/sql/database-administration/using-ssis-to-monitor-sql-server-databases-/

I chose this table only because it provides an example of the sort of volume of data that you might be faced with as a DBA. Over time, when executing the scheduled job every hour against many tens of servers, the table can grow quite large. However, as a side note, the data is worth gathering, as it offers many insights into how your servers are being utilized.

TIP
If you would like to view a sample data file that is otherwise too large to open in Notepad, use tail.exe to view the last n lines of the file. Tail.exe is available in the Windows 2003 Resource Kit.

Bulk data transfer tools

The bulk loading of data is not a new concept. It has been around since the very early days of SQL Server. Loading data in bulk typically involves taking a subset of data, say all of the data in one or more tables, dumping it out to a flat file and subsequently loading it into a secondary source, such as another database on a secondary server.
Though the terminology has changed somewhat, from "bulk loading" to "fast loading", this basic concept remains the same across all versions. Such a process is generally effective both in terms of cost and time: in the former case, because the tools available to do it, such as BCP and SSIS, are freely distributed with SQL Server; in the latter case, because these methods can bypass logging, in certain circumstances, and so are extraordinarily efficient. You can expect as much as 20K records per second in some cases, depending on the hardware subsystem that performs the reads and writes.
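To give a flavor of what a BCP round-trip of the SQL_Conn table looks like, here is a minimal command-line sketch. The server names and file path are assumptions for illustration; `-n` requests native format, `-T` uses a trusted (Windows) connection, and `-S` names the instance.

```shell
REM Export SQL_Conn from the source server to a native-format data file,
REM then load that file into the same table on a target server.
REM SourceServer, TargetServer and C:\Temp are hypothetical.
bcp DBA_Rep.dbo.SQL_Conn out C:\Temp\SQL_Conn.dat -n -T -S SourceServer
bcp DBA_Rep.dbo.SQL_Conn in  C:\Temp\SQL_Conn.dat -n -T -S TargetServer
```

Because SQL_Conn is a heap with no indexes, the load side of this pair is a good candidate for minimally logged bulk insert, which is where the throughput figures quoted above come from.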