CHAPTER 43 Transact-SQL Programming Guidelines, Tips, and Tricks

    select @rowcnt = @@ROWCOUNT, @error = @@ERROR
    if @rowcnt = 0
        print 'no rows updated'
    if @error <> 0
        raiserror ('Update of titles failed', 16, 1)
    return

NOTE
Error processing was improved in SQL Server 2005 with the introduction of the TRY...CATCH construct in T-SQL. It provides a much more robust method of error handling than checking @@ERROR for error conditions. The TRY...CATCH construct is discussed in more detail later in this chapter.

De-Duping Data with Ranking Functions

One common problem encountered with imported data is unexpected duplicate data rows, especially if the data is being consolidated from multiple sources. In previous versions of SQL Server, de-duping the data often involved the use of cursors and temp tables. Since the introduction of the ROW_NUMBER ranking function and common table expressions in SQL Server 2005, you are able to de-dupe data with a single statement.

To demonstrate this approach, Listing 43.27 shows how to create an authors_import table and populate it with some duplicate rows.

LISTING 43.27 Script to Create and Populate the authors_import Table

    USE bigpubs2008
    GO
    CREATE TABLE dbo.authors_import(
        au_id dbo.id NOT NULL,
        au_lname varchar(30) NOT NULL,
        au_fname varchar(20) NOT NULL)
    go
    INSERT INTO dbo.authors_import(au_id, au_lname, au_fname)
        VALUES('681-61-9588', 'Ahlberg', 'Allan')
    INSERT INTO dbo.authors_import(au_id, au_lname, au_fname)
        VALUES('739-35-5165', 'Ahlberg', 'Janet')
    INSERT INTO dbo.authors_import(au_id, au_lname, au_fname)
        VALUES('499-84-5672', 'Alexander', 'Lloyd')
    INSERT INTO dbo.authors_import(au_id, au_lname, au_fname)
        VALUES('499-84-5672', 'Alexander', 'Lloyd')
    INSERT INTO dbo.authors_import(au_id, au_lname, au_fname)
        VALUES('432-31-3829', 'Bate', 'W. Jackson')
    INSERT INTO dbo.authors_import(au_id, au_lname, au_fname)
        VALUES('432-31-3829', 'Bate', 'W. Jackson')
    INSERT INTO dbo.authors_import(au_id, au_lname, au_fname)
        VALUES('432-31-3829', 'Bate', 'W. Jackson')
    INSERT INTO dbo.authors_import(au_id, au_lname, au_fname)
        VALUES('437-99-3329', 'Bauer', 'Caroline Feller')
    INSERT INTO dbo.authors_import(au_id, au_lname, au_fname)
        VALUES('378-33-9373', 'Benchley', 'Nathaniel')
    INSERT INTO dbo.authors_import(au_id, au_lname, au_fname)
        VALUES('378-33-9373', 'Benchley', 'Nate')
    INSERT INTO dbo.authors_import(au_id, au_lname, au_fname)
        VALUES('409-56-7008', 'Bennet', 'Abraham')
    GO

You can see in the data for Listing 43.27 that there are two identical rows for au_id 499-84-5672 and three for au_id 432-31-3829. To start identifying the duplicates, you can write a query using the ROW_NUMBER() function to generate a unique row ID for each data row, as shown in Listing 43.28.

LISTING 43.28 Using the ROW_NUMBER() Function to Generate Unique Row IDs

    SELECT ROW_NUMBER() OVER (ORDER BY au_id, au_lname, au_fname) AS ROWID, *
    FROM dbo.authors_import
    go

    ROWID  au_id        au_lname   au_fname
    1      378-33-9373  Benchley   Nate
    2      378-33-9373  Benchley   Nathaniel
    3      409-56-7008  Bennet     Abraham
    4      432-31-3829  Bate       W. Jackson
    5      432-31-3829  Bate       W. Jackson
    6      432-31-3829  Bate       W. Jackson
    7      437-99-3329  Bauer      Caroline Feller
    8      499-84-5672  Alexander  Lloyd
    9      499-84-5672  Alexander  Lloyd
    10     681-61-9588  Ahlberg    Allan
    11     739-35-5165  Ahlberg    Janet

Now you can use the query shown in Listing 43.28 to build a common table expression to find the duplicate rows. In this case, we keep the first row found. To make sure it works correctly, write the query first as a SELECT statement to verify that it is identifying the correct rows, as shown in Listing 43.29.
LISTING 43.29 Using a Common Table Expression to Identify Duplicate Rows

    WITH authors_import AS
        (SELECT ROW_NUMBER() OVER (ORDER BY au_id, au_lname, au_fname) AS ROWID, *
         FROM dbo.authors_import)
    select * FROM authors_import
    WHERE ROWID NOT IN
        (SELECT MIN(ROWID) FROM authors_import
         GROUP BY au_id, au_fname, au_lname);
    GO

    ROWID  au_id        au_lname   au_fname
    5      432-31-3829  Bate       W. Jackson
    6      432-31-3829  Bate       W. Jackson
    9      499-84-5672  Alexander  Lloyd

Now you simply change the final SELECT statement in Listing 43.29 into a DELETE statement, and it removes the duplicate rows from authors_import:

    WITH authors_import AS
        (SELECT ROW_NUMBER() OVER (ORDER BY au_id, au_lname, au_fname) AS ROWID, *
         FROM dbo.authors_import)
    delete FROM authors_import
    WHERE ROWID NOT IN
        (SELECT MIN(ROWID) FROM authors_import
         GROUP BY au_id, au_fname, au_lname);
    GO
    select * from authors_import
    go

    au_id        au_lname   au_fname
    681-61-9588  Ahlberg    Allan
    739-35-5165  Ahlberg    Janet
    499-84-5672  Alexander  Lloyd
    432-31-3829  Bate       W. Jackson
    437-99-3329  Bauer      Caroline Feller
    378-33-9373  Benchley   Nathaniel
    378-33-9373  Benchley   Nate
    409-56-7008  Bennet     Abraham

If you want to retain the last duplicate record and delete the previous ones, you can replace the MIN function with the MAX function in the DELETE statement.

Notice that what counts as a duplicate is determined by the columns specified in the GROUP BY clause of the subquery. Notice also that there are still two records for au_id 378-33-9373 remaining in the final record set. The duplicates removed were based on au_id, au_lname, and au_fname. Because the first name is different for each of the two instances of au_id 378-33-9373, both Nathaniel Benchley and Nate Benchley remain in the authors_import table. If you remove au_fname from the GROUP BY clause, only the row with the lowest ROWID for that au_id would remain. Because the ROW_NUMBER() ordering includes au_fname, that is the Nate Benchley row (ROWID 1 in Listing 43.28), and the Nathaniel Benchley row would be removed.
However, this result may or may not be desirable. You would probably want to resolve the disparity between Nathaniel and Nate and confirm manually that they are duplicate rows before deleting them. Running the query in Listing 43.29 with au_fname removed from the GROUP BY clause helps you better determine what your final record set would look like.

In Case You Missed It: New Transact-SQL Features in SQL Server 2005

SQL Server 2005 introduced some new features and changes to the Transact-SQL (T-SQL) language:

• The xml data type
• The max specifier for the varchar and varbinary data types
• TOP enhancements
• The OUTPUT clause
• Common table expressions (CTEs)
• Ranking functions
• PIVOT and UNPIVOT
• The APPLY operator
• TRY-CATCH logic for error handling
• The TABLESAMPLE clause

NOTE
Unless stated otherwise, all examples in this chapter make use of tables in the bigpubs2008 database.

The xml Data Type

SQL Server 2005 introduced a new xml data type that supports storing XML documents and fragments in database columns or variables. The xml data type can be used with local variable declarations, as the output of user-defined functions, as input parameters to stored procedures and functions, and much more. The results of a FOR XML statement can easily be stored in a column, stored procedure parameter, or local variable.

XML data is stored in an internal binary format and can be up to 2GB in size. XML instances stored in xml columns can contain up to 128 levels of nesting.

xml columns can also be used to store code files such as XSLT, XSD, XHTML, and any other well-formed content. These files can then be retrieved by user-defined functions written in managed code hosted by SQL Server. (See Chapter 45, “SQL Server and the .NET Framework,” for a full review of SQL Server managed hosting.)
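As a minimal sketch of storing FOR XML output in an xml variable, the following batch captures a query result as XML and then reads one value back with the value() method. It assumes the dbo.authors_import table from Listing 43.27 exists; the variable name @authors and the element name author are illustrative choices only.

```sql
-- Sketch only: assumes dbo.authors_import from Listing 43.27 exists.
declare @authors xml

-- The TYPE directive makes FOR XML return an xml value rather than a string.
-- RAW('author') emits one <author .../> element per row, with columns as attributes.
set @authors = (select au_id, au_lname, au_fname
                from dbo.authors_import
                order by au_id
                for xml raw('author'), type)

-- Return the stored XML as a whole.
select @authors as authors_xml

-- Extract the au_lname attribute of the first <author> element.
select @authors.value('(/author/@au_lname)[1]', 'varchar(30)') as first_lname
go
```

The ORDER BY inside the subquery is there only to make "first element" predictable; without it, the order of the generated elements is not guaranteed.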
For more information and detailed examples on using the xml data type, see Chapter 47, “Using XML in SQL Server 2008.”

The max Specifier

In SQL Server 2000, the most data that could be stored in a varchar, nvarchar, or varbinary column was 8,000 bytes. If you needed to store a larger value in a single column, you had to use the large object (LOB) data types: text, ntext, or image. The main disadvantage of using the LOB data types is that they cannot be used in many places where varchar or varbinary data types can be used (for example, as local variables, as arguments to SQL Server string manipulation functions such as REPLACE, and in string concatenation operations).

SQL Server 2005 introduced the max specifier for the varchar, nvarchar, and varbinary data types. This specifier expands the storage capabilities of these data types to store up to 2^31-1 bytes of data, which is the same maximum size as the text and image data types. The main difference is that these large value data types can be used just like the smaller varchar, nvarchar, and varbinary data types. The large value data types can be used in functions where LOB objects cannot (such as the REPLACE function), as data types for Transact-SQL variables, and in string concatenation operations. They can also be used in the DISTINCT, ORDER BY, and GROUP BY clauses of a SELECT statement as well as in aggregates, joins, and subqueries.
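A short sketch of these capabilities follows; the variable name @big and the sample strings are illustrative only. It builds a value longer than 8,000 bytes in a varchar(max) variable, then passes it to REPLACE and uses it in a concatenation. Note the CAST inside REPLICATE: when its input is not a max type, REPLICATE truncates its result at 8,000 bytes.

```sql
declare @big varchar(max)

-- Cast the input to varchar(max) so REPLICATE is not truncated at 8,000 bytes.
set @big = replicate(cast('1234567890' as varchar(max)), 1000)

-- All three expressions operate on the full value, not just the first 8,000 bytes.
select len(@big)                       as total_length,
       len(replace(@big, '0', 'zero')) as replaced_length,
       len('xxx' + @big)               as concatenated_length
go
```

The same statements would fail or be impossible to write with a text value, because text cannot be declared as a local variable or used with REPLACE or the + operator.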
The following example shows a local variable being defined using the varchar(max) data type:

    declare @maxvar varchar(max)
    go

However, a similar variable cannot be defined using the text data type:

    declare @textvar text
    go

    Msg 2739, Level 16, State 1, Line 2
    The text, ntext, and image data types are invalid for local variables.

The remaining examples in this section make use of the following table to demonstrate the differences between a varchar(max) column and a text column:

    create table maxtest (maxcol varchar(max), textcol text)
    go
    -- populate the columns with some sample data
    insert maxtest
        select replicate('1234567890', 1000), replicate('1234567890', 1000)
    go

In the following example, you can see that the substring function works with both varchar(max) and text data types:

    select substring (maxcol, 1, 10), substring (textcol, 1, 10)
    from maxtest
    go

    maxcol      textcol
    1234567890  1234567890

However, in this example, you can see that while a varchar(max) column can be used for string concatenation, the text data type cannot:

    select substring('xxx' + maxcol, 1, 10) from maxtest
    go

    xxx1234567

    select substring('xxx' + textcol, 1, 10) from maxtest
    go

    Msg 402, Level 16, State 1, Line 1
    The data types varchar and text are incompatible in the add operator.

With the introduction of the max specifier, the large value data types are able to store data with the same maximum size as the LOB data types, but with the ability to be used just as their smaller varchar, nvarchar, and varbinary counterparts. It is recommended that you use the max data types instead of the LOB data types because the LOB data types are deprecated and will be removed in a future release of SQL Server.

TOP Enhancements

The TOP clause allows you to specify the number or percentage of rows to be returned by a SELECT statement. SQL Server 2005 introduced the capability for the TOP clause to also be used in INSERT, UPDATE, and DELETE statements.
The syntax was also enhanced to allow the use of a numeric expression for the number value rather than having to use a hard-coded number.

The syntax for the TOP clause is as follows:

    SELECT [TOP (numeric_expression) [PERCENT] [WITH TIES]] ...
        FROM table_name [ORDER BY ...]
    DELETE [TOP (numeric_expression) [PERCENT]] FROM table_name ...
    UPDATE [TOP (numeric_expression) [PERCENT]] table_name SET ...
    INSERT [TOP (numeric_expression) [PERCENT]] INTO table_name ...

numeric_expression must be specified in parentheses. Specifying constants without parentheses is supported in SELECT queries only for backward compatibility. The parentheses around the expression are always required when TOP is used in UPDATE, INSERT, or DELETE statements.

If you do not specify the PERCENT option, the numeric expression must be implicitly convertible to the bigint data type. If you specify the PERCENT option, the numeric expression must be implicitly convertible to float and fall within the range of 0 to 100. The WITH TIES option with the ORDER BY clause is supported only with SELECT statements.

The following example shows the use of a local variable as the numeric expression for the TOP clause to limit the number of rows returned by a SELECT statement:

    declare @rows int
    select @rows = 5
    select top (@rows) * from sales
    go

    stor_id  ord_num               ord_date                 qty  payterms  title_id
    6380     6871                  2007-09-14 00:00:00.000  5    Net 60    BU1032
    6380     722a                  2007-09-13 00:00:00.000  3    Net 60    PS2091
    6380     ONFFFFFFFFFFFFFFFFFF  2007-08-09 00:00:00.000  852  Net 30    FI1980
    7066     A2976                 2006-05-24 00:00:00.000  50   Net 30    PC8888
    7066     ONAAAAAAAAAA          2007-01-13 00:00:00.000  948  Net 60    CH2480

Allowing the use of a numeric expression rather than a constant for the TOP command is especially useful when the number of requested rows is passed as a parameter to a stored procedure or function.
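As a sketch of that stored procedure scenario, the following hypothetical procedure passes its parameter directly to TOP. The procedure name get_top_sales, the @rows parameter, and its default value are assumptions made for illustration; they are not part of the bigpubs2008 database.

```sql
-- Hypothetical procedure; assumes the sales table used in the example above.
create procedure get_top_sales
    @rows int = 10              -- number of rows the caller wants returned
as
    select top (@rows) stor_id, ord_num, ord_date, qty
    from sales
    order by qty desc           -- ORDER BY determines which rows count as "top"
go

-- Request the five largest orders by quantity.
exec get_top_sales @rows = 5
go
```

Because the row count is an ordinary parameter, the caller can change it on each execution without resorting to dynamic SQL or SET ROWCOUNT.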
When you use a subquery as the numeric expression, it must be self-contained; it cannot refer to columns of a table in the outer query. Using a self-contained subquery allows you to more easily develop queries for dynamic requests, such as “calculate the average number of titles published per month and return that many titles that were most recently published”:

    SELECT TOP(SELECT COUNT(*)/DATEDIFF(month, MIN(pubdate), MAX(pubdate))
               FROM titles)
        title_id, pub_id, pubdate
    FROM titles
    ORDER BY pubdate DESC
    go

    title_id  pub_id  pubdate
    CH9009    9903    2009-05-31 00:00:00.000
    PC9999    1389    2009-03-31 00:00:00.000
    FI0375    9901    2008-09-24 00:00:00.000
    DR4250    9904    2008-09-21 00:00:00.000
    BI4785    9914    2008-09-20 00:00:00.000
    BI0194    9911    2008-09-19 00:00:00.000
    BI3224    9905    2008-09-18 00:00:00.000
    FI0435    9917    2008-09-17 00:00:00.000
    FI0792    9907    2008-09-13 00:00:00.000

NOTE
Be aware that the TOP keyword does not necessarily speed up a query if the query also contains an ORDER BY clause. Unless an index can return the rows in the requested order, the qualifying rows must be sorted before the top N rows in the ordered result set can be returned.

When using the TOP keyword, you can also add the WITH TIES option to specify that additional rows should be returned from the result set if duplicate values of the columns specified in the ORDER BY clause exist within the last values returned. The WITH TIES option can be specified only if an ORDER BY clause is specified. The following query returns the top four most expensive books:

    SELECT TOP 4 price, title
    FROM titles
    ORDER BY price DESC
    go

    price    title
    17.1675  But Is It User Friendly?
    17.0884  Is Anger the Enemy?
    15.9329  Emotional Security: A New Algorithm
    15.894   You Can Combat Computer Stress!
If you use WITH TIES, you can see that there is an additional row with the same price (15.894) as the last row returned by the previous query:

    SELECT TOP 4 WITH TIES price, title
    FROM titles
    ORDER BY price DESC
    go

    price    title
    17.1675  But Is It User Friendly?
    17.0884  Is Anger the Enemy?
    15.9329  Emotional Security: A New Algorithm
    15.894   The Gourmet Microwave
    15.894   You Can Combat Computer Stress!

In versions of SQL Server prior to 2005, if you wanted to limit the number of rows affected by an UPDATE statement or a DELETE statement, you had to use the SET ROWCOUNT statement:

    set rowcount 100
    DELETE sales
    where ord_date < (select dateadd(year, 1, min(ord_date)) from sales)
    set rowcount 0

SET ROWCOUNT often was used in this way to allow backing up and pruning of the transaction log during a purge process and also to prevent lock escalation. The problem with SET ROWCOUNT is that it applies to the entire current user session. You have to remember to set the rowcount back to 0 to be sure you don’t limit the rows affected by subsequent statements. With TOP, you can more easily specify the desired number of rows for each individual statement:

    DELETE top (100) sales
    where ord_date < (select dateadd(year, 1, min(ord_date)) from sales)

    UPDATE top (100) titles
    set royalty = royalty * 1.25

You may be thinking that using TOP in INSERT statements is not really necessary because you can always specify it in a SELECT query, as shown in Listing 43.30.
LISTING 43.30 Limiting Rows for Insert with TOP in a SELECT Statement

    CREATE TABLE top_sales
        (stor_id char(4),
         ord_num varchar(20),
         ord_date datetime NOT NULL,
         qty smallint NOT NULL,
         payterms varchar(12),
         title_id dbo.tid NOT NULL)
    go
    insert top_sales
    select top 100 * from sales
    where qty > 1700
    order by qty desc

However, you may find using the TOP clause in an INSERT statement useful when inserting the result of an EXEC command or the result of a UNION operation, as shown in Listing 43.31.

LISTING 43.31 Using TOP in an Insert with a UNION ALL Query

    insert top (50) into top_sales
    select stor_id, ord_num, ord_date, qty, payterms, title_id
    from sales
    where qty >= 1800
    union all
    select stor_id, ord_num, ord_date, qty, payterms, title_id
    from sales_big
    where qty >= 1800
    order by qty desc

When a TOP (n) clause is used with DELETE, UPDATE, or INSERT, the selection of rows on which the operation is performed is not guaranteed. If you want the TOP (n) clause to operate on rows in a meaningful chronological order, you must use TOP together with ORDER BY in a subselect statement. The following query deletes the 10 rows of the sales_big table that have the earliest order dates:

    delete from sales_big
    where sales_id in (select top 10 sales_id
                       from sales_big
                       order by ord_date)

To ensure that only 10 rows are deleted, the column specified in the subselect statement (sales_id) must be the primary key of the table. Using a nonkey column in the subselect statement could result in the deletion of more than 10 rows if the specified column matched duplicate values.

NOTE
SQL Server Books Online states that when you use TOP (n) with INSERT, UPDATE, and DELETE operations, the rows affected should be a random selection of the TOP (n) rows from the underlying table. In practice, this behavior has not been observed. Using TOP (n) with INSERT, UPDATE, and DELETE appears to affect only the first n matching rows.
However, because the row selection is not guaranteed, it is still recommended that you use TOP together with ORDER BY in a subselect to ensure the expected result.

The OUTPUT Clause

By default, a DML statement such as INSERT, UPDATE, or DELETE does not return any results indicating which rows changed; at most, you can check @@ROWCOUNT to determine the number of rows affected. In SQL Server 2005, the INSERT, UPDATE, and DELETE statements were enhanced to support an OUTPUT clause that identifies the actual rows affected by the DML statement.

The OUTPUT clause allows you to return data from a modification statement (INSERT, UPDATE, or DELETE). This data can be returned as a result set to the caller or returned into a table variable or an output table. To capture information on the affected rows, the OUTPUT clause provides access to the inserted and deleted virtual tables that are normally accessible ...
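A minimal sketch of capturing affected rows with the OUTPUT clause follows. It assumes the top_sales table created in Listing 43.30 exists and has been populated; the table variable @purged is a hypothetical name introduced only for illustration.

```sql
-- Table variable to hold the rows removed by the DELETE (hypothetical name).
declare @purged table
    (stor_id char(4),
     ord_num varchar(20),
     qty     smallint)

-- Delete up to 10 rows and capture what was removed
-- by reading from the deleted virtual table.
delete top (10) from top_sales
output deleted.stor_id, deleted.ord_num, deleted.qty into @purged
where qty > 1700

-- Inspect the rows that were actually deleted.
select * from @purged
go
```

Omitting the INTO @purged portion would instead return the deleted rows directly to the caller as a result set.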