Nielsen c29.tex V4 - 07/23/2009 4:59pm Page 682 Part IV Developing with SQL Server Summary Writing dynamic SQL is definitely taking your game to a higher level. Code that creates code. Cool. Remember to always be aware of whether or not the parameters can be used to enable SQL injection. A few key points from this chapter: ■ Build up the @SQLStr variable from the inside out — start with the dynamic list and then append the SELECT prolog. ■ If the WHERE clause is dynamic, chances are good the FROM clause will be also. ■ When writing dynamic SQL, add a PRINT statement to output the @SQLStr variable during development. It makes debugging much easier. ■ Use sp_executesql. ■ If there’s a different way to make the code flexible, such as the parse and join method mentioned in the best practice, do that instead of dynamic SQL. ■ All the standard database integrity features (e.g., foreign keys) help defend against SQL injection. ■ Always think like a hacker. Where can an SQL injection string be used to alter the intention of the code? ■ Dynamic SQL is not necessarily ad-hoc SQL. Never permit ad-hoc SQL to your database. This concludes a ten-chapter discussion on T-SQL development that began with ‘‘what is a batch’’ and progressed to the point of code-generating batches. If you’re up to writing code-generating code in T-SQL, you’re doing well as a SQL Server database developer. I congratulate you. 682 www.getcoolebook.com Nielsen p05.tex V4 - 07/10/2009 4:25pm Page 683 Data Connectivity IN THIS PART Chapter 30 Bulk Operations Chapter 31 Executing Distributing Queries Chapter 32 Programming with ADO.NET Chapter 33 Sync Framework Chapter 34 LINQ Chapter 35 Asynchronous Messaging with Service Broker Chapter 36 Replicating Data Chapter 37 Performing ETL with Integration Services Chapter 38 Access as a Front End to SQL Server A s much as I’d like to think that Management Studio is the ultimate UI and there’s no need for any other interface to SQL Server, the truth is that SQL Server needs to connect to nearly any possible data conduit. Other than Chapter 5, ‘‘Client Connectivity,’’ all the code so far has occurred inside SQL Server. Part V focuses on myriad ways that data can be brought into and synchronized with SQL Server. Some of the connectivity technologies are well known and familiar technologies like the simple but mighty bulk insert, distributed queries and linked servers, ADO.NET, replication, and Microsoft Access. Other connectivity technologies are newer. Integration services replaced DTA with SQL Server 2005. Service Broker was also introduced with SQL Server 2005. LINQ and Synch are new with SQL Server 2008. If SQL Server is the box, then this part busts out of the box and pumps data in and out of SQL Server. www.getcoolebook.com Nielsen p05.tex V4 - 07/10/2009 4:25pm Page 684 www.getcoolebook.com Nielsen c30.tex V4 - 07/21/2009 1:25pm Page 685 Bulk Operations IN THIS CHAPTER Mass inserts from comma-delimited files T-SQL ETL processing Bulk insert BCP Performing bulk operations O ften, DBAs need to load copious amounts of data quickly — whether it’s a nightly data load or a conversion from comma-delimited text files. When a few hundred megabytes of data need to get into SQL Server in a limited time frame, a bulk operation is the way to get the heavy lifting done. XML’s popularity may be growing, but its file sizes seem to be growing even faster. XML’s data tags add significant bloat to a data file, sometimes quadrupling the file size or more. For very large files, IT organizations are sticking with CSV (also known as comma-delimited) files. For these old standby files, the best way to insert that data is a bulk operation. In SQL Server, bulk operations pump data directly to the data file according to the following models: ■ Simple recovery model: No problem with recovery; the transaction log is used for current transactions only. ■ Bulk-logged recovery model: No problem with recovery; the bulk operation transaction bypasses the log, but then the entire bulk opera- tion’s data is still written to the log. One complication with bulk-logged recovery is that if bulk operations are undertaken, point-in-time recov- ery is not possible for the time period covered by the transaction log. To regain point-in-time recovery, the log has to be backed up. As extent allocations are logged for bulk operations, a log backup after bulk operations will contain all the pages from extents that have been added, which results in a large transaction log backup. ■ Full recovery model: In full recovery model, bulk operations are not performed; the engine does full logging of inserts. To restart the transac- tion log recoverability process, following the bulk operation, perform a complete backup, and restart the transaction logs. 685 www.getcoolebook.com Nielsen c30.tex V4 - 07/21/2009 1:25pm Page 686 Part V Data Connectivity For more details on recovery models and how to set them, see Chapter 41, ‘‘Recov- ery Planning.’’ Details on the transaction log are covered in Chapter 66, ‘‘Managing Transactions, Locking, and Blocking.’’ Technically, the SELECT INTO syntax is also a bulk-logged operation, and it too bypasses the transaction log. SELECT INTO creates a table from the results of a SELECT statement; it is discussed in Chapter 15, ‘‘Modifying Data.’’ Bulk insert operations are normally one step of an ETL (extract-transform-load) nightly process. While developing these ETL processes in T-SQL is perfectly acceptable, Integration Services is a strong alterna- tive, and it includes bulk operations. For more details about developing Integration Services solutions, see Chapter 37, ‘‘Performing ETL with Integration Services.’’ Bulk insert is extremely fast, and I’ve had good success using it in production environ- ments. My one word of caution is that the data must be clean. Variations in data type, irregular columns, and missing columns will cause trouble. Bulk operations can be performed with a command prompt using BCP (a command-prompt utility to copy data to and from SQL Server), within T-SQL using the BULK INSERT command, or using Integration Services. Bulk Insert The BULK INSERT command can be used within any T-SQL script or stored procedure to import data into SQL Server. The parameters of the command specify the table receiving the data, the location of the source file, and the options. To test the BULK INSERT command, use the Address.csv file that’s part of the build script to load the AdventureWorks sample database. It’s probably already on your hard drive or it can be downloaded from MSDN. The 4MB file has 19,614 rows of address data — that’s small by ETL norms. The following batch bulk inserts from the Address.csv file in the AdventureWorks directory into the AWAddress table: Use Tempdb; CREATE TABLE AWAddressStaging ( ID INT, Address VARCHAR(500), City VARCHAR(500), Region VARCHAR(500), PostalCode VARCHAR(500), GUID VARCHAR(500), Updated DATETIME ); 686 www.getcoolebook.com Nielsen c30.tex V4 - 07/21/2009 1:25pm Page 687 Bulk Operations 30 BULK INSERT AWAddressStaging FROM ‘C:\Program Files\Microsoft SQL Server\90\Tools\Samples\ AdventureWorks OLTP\Address.csv’ WITH (FIRSTROW = 1,ROWTERMINATOR =’\n’); On my Dell notebook, the BULK INSERT completes in less than a half-second. The first thing to understand about BULK INSERT is that every column from the source table is sim- ply inserted directly into the destination table using a one-to-one mapping. The first column from the source file is dumped into the first column of the destination table. Each column lines up. If there are too many columns in the destination table, then it will fail. However, if there are not enough columns in the destination table, then BULK INSERT will work, as the extra data is placed into the bit bucket and simply discarded. The BULK INSERT command won’t accept a string concatenation or variable in the FROM parameter, so if you’re assembling the string of the file location and the filename, then you need to assemble a dynamic SQL statement to execute the BULK INSERT. Building and executing dynamic SQL is covered in Chapter 29, ‘‘Dynamic SQL and Code Generation,’’ which contains many examples. Best Practice B ecause BULK INSERT is dependent on the column position of both the source file and the destination table, it is best practice to use a view as an abstraction layer between the BULK INSERT command and the table. If the structure of either the source file or the destination table is altered, then modifying the view can keep the BULK INSERT running without having to change the other object’s structure. Another best practice is to BULK INSERT the data into a staging table, check the data, and then perform the rest of the transformations as you merge the data into the permanent tables. As long as you don’t mind copying the data twice, this works well. Bulk Insert Options In practice, I’ve always needed to use some options when using BULK INSERT: ■ Field Terminator specifies the character used to delimit or separate columns in the source file. The default, of course, is a comma, but I’ve also seen the pipe character (|) used in production. ■ Row Terminator specifies the character that ends a row in the source file. ‘ \n’ means end of row and is the typical setting. However, files from mainframes or other systems sometimes don’t use a clean end of line. In these cases, use a hex editor to view the actual end of line characters and specify the row terminator in hex. For example, a hex value of ‘ 0A’ is coded as follows: ROWTERMINATOR = ‘0x0A’ 687 www.getcoolebook.com Nielsen c30.tex V4 - 07/21/2009 1:25pm Page 688 Part V Data Connectivity ■ FirstRow is useful when specifying whether the incoming file has column headers. If the file does have column headers, then use this option to indicate that the first row of data is actually the second row of the file. ■ TabLock places an exclusive lock on the entire table and saves SQL Server the trouble of having to lock the table’s data pages being created. This option can dramatically improve performance, but at the cost of blocking data readers during the bulk insert. If the bulk insert is part of an ETL into a staging table, then there’s no problem, but if it’s a bulk insert into a production system with potential users selecting data, then this might not be a good idea. Multiple bulk-import streams can potentially block each other. To prevent this, SQL Server provides a special internal lock, called a bulk-update (BU) lock. To get a BU lock, you need to specify the TABLOCK option with each bulk-import stream without blocking other bulk-import streams. ■ Rows per Batch tells SQL Server to insert n number of rows in a single batch, rather than the entire file. Tweaking the batch size can improve performance. I’ve found that beginning with 100 and then experimenting to find the best size for the particular set of data works best. This helps performance because the logging is done less often. Too many rows, however, often exceed memory cache and may create waits. In my experience, 2,000 rows is often the best number. ■ Max Errors specifics how many rows can fail before the bulk insert fails. Depending on the business requirement for the data, you may need to set this to zero. ■ The Errorfile option points to a file that will collect any rows not accepted by the bulk insert operation. This is a great idea and should be used with every BULK INSERT command in production. Other options, which I’ve never needed in production, include Check_Constraints, CodePage, DataFileType, Fire_Triggers, KeepIdentity, KeepNulls, Kilobytes_per_batch,and Order. (The best practice of bulk inserting into a staging table and then performing the ETL merge into the permanent tables makes these commands less useful.) Best Practice BULK INSERT handles columns in the order they appear in the source comma-delimited file, and the columns must be in the same order in the receiving SQL table. Bulk inserting into a view provides a data abstraction layer so that any changes in column order won’t break the BULK INSERT code. When developing a BULK INSERT statement, it’s generally useful to open the source file using Excel and examine the data. Excel often reformats data, so it’s best not to save files in Excel. Sorting the data by the columns can help find data formatting anomalies. 688 www.getcoolebook.com Nielsen c30.tex V4 - 07/21/2009 1:25pm Page 689 Bulk Operations 30 BCP BCP, short for bulk copy program (or bulk copy Porsche — a reference among DBAs to its speed), is a command-line variation of bulk operations. BCP differs from BULK INSERT in that it is command-line executed and can import or export data. It uses many of the same options as BULK INSERT.Thebasic syntax is as follows: BCP destination table direction datafile options For the destination, use the server name along with the complete three-part name (server and database.schema.object). For a complete listing of the syntax, just type BCP at the command prompt. Because this is an external program, it needs authorization to connect to SQL Server. You have two options: Use the -P password option and hard-code your password into the batch file script, or omit the -P, in which case it will prompt for a password. Neither is a very good option. You can also use integrated security, which is usually considered the best practice. For straightforward ETL operations, I prefer using T-SQL and BULK INSERT.Forcomplex ETL loads, Integration Services is great. To be frank, I have little use for automating ETL processes using DOS batch scripts and BCP, although Powershell may make a believer of me yet. Summary This chapter explained a specific T-SQL command: BULK INSERT. Bulk operations provide the additional horsepower needed to import massive amounts of data by ignoring the transaction log and pumping the data directly to the table. The downside is that it complicates the recovery plan. The best way to perform a bulk operation is with the BULK INSERT T-SQL command. The next chapter continues the theme of moving data with a personal favorite of mine, distributed queries — when one SQL Server instance executes a query against a second SQL Server instance. 689 www.getcoolebook.com Nielsen c30.tex V4 - 07/21/2009 1:25pm Page 690 www.getcoolebook.com Nielsen c31.tex V4 - 07/21/2009 2:03pm Page 691 Executing Distributed Queries IN THIS CHAPTER Understanding distributed queries Making the connection with remote data sources T-SQL distributed queries Pass-through queries Two-phase commits and distributed transactions D ata is seldom in one place. In today’s distributed world, most new projects enhance, or at least connect to, existing data. That’s not a problem; SQL Server can read and write data to most other data sources. Heterogeneous joins can even merge SQL Server data with an Excel spreadsheet. SQL Server offers several methods for accessing data external to the cur- rent database. From simply referencing another local database to executing pass-through queries that engage an Oracle server, SQL Server can handle it. Distributed Query Concepts Linking to an external data source is nothing more than configuring the name of the linked server, along with the necessary location and login information, so that SQL Server can access data on the linked server. Linking is a one-way configuration, as illustrated in Figure 31-1. If Server A links to Server B, then it means that Server A knows how to access and log into Server B. As far as Server B is concerned, Server A is just another user. If linking a server is a new concept to you, then it could easily be confused with registering a server in Management Studio. As illustrated in Figure 31-1, Man- agement Studio is only communicating with the servers as a client application. Linking the servers enables SQL Server instance A to communicate directly with SQL Server instance B. Links can be established in Management Studio or with T-SQL code (which could, of course, be created by configuring the link in Management Studio and then generating a script.) The latter has the advantage of repeatability in case a rebuild is necessary, although building the links in code requires more steps. 691 www.getcoolebook.com . also introduced with SQL Server 2005. LINQ and Synch are new with SQL Server 2008. If SQL Server is the box, then this part busts out of the box and pumps data in and out of SQL Server. www.getcoolebook.com Nielsen. Front End to SQL Server A s much as I’d like to think that Management Studio is the ultimate UI and there’s no need for any other interface to SQL Server, the truth is that SQL Server needs to. data. That’s not a problem; SQL Server can read and write data to most other data sources. Heterogeneous joins can even merge SQL Server data with an Excel spreadsheet. SQL Server offers several methods