Nielsen c12.tex V4 - 07/21/2009 12:46pm Page 312
Part II Manipulating Data With Select

The CASE expression method's query execution plan is identical to the plan generated by the PIVOT method:

    SELECT
        CASE GROUPING(Category)
          WHEN 0 THEN Category
          WHEN 1 THEN 'All Categories'
        END AS Category,
        SUM(CASE WHEN Region = 'MidWest' THEN Amount ELSE 0 END) AS MidWest,
        SUM(CASE WHEN Region = 'NorthEast' THEN Amount ELSE 0 END) AS NorthEast,
        SUM(CASE WHEN Region = 'South' THEN Amount ELSE 0 END) AS South,
        SUM(CASE WHEN Region = 'West' THEN Amount ELSE 0 END) AS West,
        SUM(Amount) AS Total
      FROM RawData
      GROUP BY ROLLUP (Category)
      ORDER BY COALESCE(Category, 'ZZZZ')

Result:

    Category        MidWest  NorthEast  South  West  Total
    X               24       0          165    36    225
    Y               38       181        287    0     506
    Z               83       55         33     44    215
    All Categories  145      236        485    80    946

Dynamic crosstab queries

The rows of a crosstab query are generated dynamically by the aggregation at runtime; however, in both the PIVOT method and the CASE expression method, the crosstab columns (region in this example) must be hard-coded in the SQL statement.

The only way to create a crosstab query with dynamic columns is to determine the columns at execution time and assemble a dynamic SQL command to execute the crosstab query. While it could be done with a cursor, the following example uses a multiple-assignment variable SELECT to build the list of regions in the @SQLStr variable.
A little string manipulation to assemble the pivot statement and an sp_executesql command complete the job:

    DECLARE @SQLStr NVARCHAR(1024)
    SELECT @SQLStr = COALESCE(@SQLStr + ',', '') + [a].[Column]
      FROM (SELECT DISTINCT Region AS [Column]
              FROM RawData) AS a
    SET @SQLStr = 'SELECT Category, '
      + @SQLStr
      + ' FROM (Select Category, Region, Amount from RawData) sq '
      + ' PIVOT (Sum (Amount) FOR Region IN ('
      + @SQLStr
      + ')) AS pt'
    PRINT @SQLStr
    EXEC sp_executesql @SQLStr

Result:

    SELECT Category, MidWest,NorthEast,South,West FROM (Select
    Category, Region, Amount from RawData) sq PIVOT (Sum (Amount)
    FOR Region IN (MidWest,NorthEast,South,West)) AS pt

    Category  MidWest  NorthEast  South  West
    X         24       NULL       165    36
    Y         38       181        287    NULL
    Z         83       55         33     44

This example is only to demonstrate the technique for building a dynamic crosstab query. Any time you're working with dynamic SQL, be sure to guard against SQL injection, which is discussed in Chapter 29, "Dynamic SQL and Code Generation."

An Analysis Services cube is basically a dynamic crosstab query on steroids. For more about designing these high-performance interactive cubes, turn to Chapter 71, "Building Multidimensional Cubes with Analysis Services."

Unpivot

The inverse of a crosstab query is the UNPIVOT command, which is extremely useful for normalizing denormalized data. Starting with a table that looks like the result of a crosstab, the UNPIVOT command will twist the data back to a normalized list. Of course, UNPIVOT can only normalize the data supplied to it, so if the pivoted data is an aggregate summary, that's all that will be normalized. The details that created the aggregate summary won't magically reappear.
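To make that last point concrete, here is a minimal sketch, assuming SQL Server 2008 or later. The table variables @Detail and @Summary and their sample values are hypothetical, invented only for illustration; they are not part of the chapter's RawData sample.

```sql
-- Hypothetical detail rows (illustrative values only).
DECLARE @Detail TABLE (Category CHAR(1), Region VARCHAR(10), Amount INT);
INSERT INTO @Detail (Category, Region, Amount)
  VALUES ('X', 'South', 100), ('X', 'South', 65), ('X', 'West', 36);

-- Pivot: the two South detail rows collapse into one aggregated cell.
DECLARE @Summary TABLE (Category CHAR(1), South INT, West INT);
INSERT INTO @Summary (Category, South, West)
  SELECT Category, South, West
    FROM (SELECT Category, Region, Amount FROM @Detail) AS sq
    PIVOT (SUM(Amount) FOR Region IN (South, West)) AS pt;

-- Unpivot: one row per non-NULL cell of the summary,
-- not the three detail rows that produced it.
SELECT Category, Region, Measure
  FROM @Summary
  UNPIVOT (Measure FOR Region IN (South, West)) AS up;
```

The final SELECT returns two rows (South with 165 and West with 36), so the two original South rows of 100 and 65 cannot be recovered from the summary.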
The following script sets up a table populated with crosstab data:

    IF OBJECT_ID('Ptable') IS NOT NULL
      DROP TABLE Ptable
    go
    SELECT Category, MidWest, NorthEast, South, West
      INTO PTable
      FROM (SELECT Category, MidWest, NorthEast, South, West
              FROM (SELECT Category, Region, Amount
                      FROM RawData) sq
              PIVOT
                (SUM(Amount)
                 FOR Region IN (MidWest, NorthEast, South, West)
                ) AS pt
           ) AS Q

    SELECT * FROM PTable

Result:

    Category  MidWest  NorthEast  South  West
    X         24       NULL       165    36
    Y         38       181        287    NULL
    Z         83       55         33     44

The UNPIVOT command can now pick apart the Ptable data and convert it back into a normalized form:

    SELECT *
      FROM PTable
      UNPIVOT
        (Measure
         FOR Region IN (South, NorthEast, MidWest, West)
        ) AS sq

Result:

    Category  Measure  Region
    X         165      South
    X         24       MidWest
    X         36       West
    Y         287      South
    Y         181      NorthEast
    Y         38       MidWest
    Z         33       South
    Z         55       NorthEast
    Z         83       MidWest
    Z         44       West

Cumulative Totals (Running Sums)

There are numerous reasons for calculating cumulative totals, or running sums, in a database, such as account balances and inventory quantity on hand, to name only two. Of course, it's easy to just pump the data to a reporting tool and let the report control calculate the running sum, but those calculations are then lost. It's much better to calculate the cumulative total in the database and then report from consistent numbers.

Cumulative totals are one area that defies the norm for SQL. As a rule, SQL excels at working with sets, but calculating a cumulative total for a set of data is based on comparing individual rows, so an iterative row-based cursor solution performs much better than a set-based operation.

Correlated subquery solution

First, here's the set-based solution. For every row in the outer query, the correlated subquery sums every row from the first row through that row.
The first row sums from the first row to the first row. The second row sums from the first row to the second row. The third row sums from the first row to the third row, and so on until the hundred-thousandth row sums from the first row to the hundred-thousandth row.

For a small set this solution works well enough, but as the data set grows, the correlated subquery method becomes dramatically slower, because the total number of rows summed grows roughly with the square of the row count. That's why, whenever someone blogs about this cool solution, the sample code tends to have a TOP(100) in the SELECT:

    USE AdventureWorks2008;
    SET NOCOUNT ON;

    SELECT OuterQuery.SalesOrderID, OuterQuery.TotalDue,
        (SELECT SUM(InnerQuery.TotalDue)
           FROM Sales.SalesOrderHeader AS InnerQuery
           WHERE InnerQuery.SalesOrderID
                   <= OuterQuery.SalesOrderID) AS CT
      FROM Sales.SalesOrderHeader AS OuterQuery
      ORDER BY OuterQuery.SalesOrderID;

On Maui (my Dell 6400 notebook), the best time achieved for that query was 2 minutes, 19 seconds to process 31,465 rows. Youch!

T-SQL cursor solution

With this solution, the cursor fetches the next row, does a quick add, and updates the value in the row. Therefore, it's doing more work than the previous SELECT: it's writing the cumulative total value back to the table.

The first couple of statements add a CumulativeTotal column and make sure the table isn't fragmented.
From there, the cursor runs through the update:

    USE AdventureWorks;
    SET NoCount ON;

    ALTER TABLE Sales.SalesOrderHeader
      ADD CumulativeTotal MONEY NOT NULL
        CONSTRAINT dfSalesOrderHeader DEFAULT(0);

    ALTER INDEX ALL ON Sales.SalesOrderHeader
      REBUILD WITH (FILLFACTOR = 100, SORT_IN_TEMPDB = ON);

    DECLARE
      @SalesOrderID INT,
      @TotalDue MONEY,
      @CumulativeTotal MONEY = 0;

    DECLARE cRun CURSOR STATIC
      FOR
      SELECT SalesOrderID, TotalDue
        FROM Sales.SalesOrderHeader
        ORDER BY SalesOrderID;

    OPEN cRun;

    -- prime the cursor
    FETCH cRun INTO @SalesOrderID, @TotalDue;

    WHILE @@Fetch_Status = 0
      BEGIN
        SET @CumulativeTotal += @TotalDue;

        UPDATE Sales.SalesOrderHeader
          SET CumulativeTotal = @CumulativeTotal
          WHERE SalesOrderID = @SalesOrderID;

        -- fetch next
        FETCH cRun INTO @SalesOrderID, @TotalDue;
      END;

    CLOSE cRun;
    DEALLOCATE cRun;
    go

    SELECT SalesOrderID, TotalDue, CumulativeTotal
      FROM Sales.SalesOrderHeader
      ORDER BY OrderDate, SalesOrderID;
    go

    ALTER TABLE Sales.SalesOrderHeader
      DROP CONSTRAINT dfSalesOrderHeader;

    ALTER TABLE Sales.SalesOrderHeader
      DROP COLUMN CumulativeTotal;

The T-SQL cursor with the additional update functionality pawned the set-based solution with an execution time of 15 seconds! w00t! That's nearly an order of magnitude difference. Go cursor!

Multiple assignment variable solution

Another solution was posted on my blog (http://tinyurl.com/ajs3tr) in response to a screencast I did on cumulative totals and cursors. The multiple assignment variable accumulates data in a variable iteratively during a set-based operation.
It's fast. The following multiple assignment variable update solves the cumulative total problem in about one second:

    DECLARE @CumulativeTotal MONEY = 0

    UPDATE Sales.SalesOrderHeader
      SET @CumulativeTotal = CumulativeTotal
            = @CumulativeTotal + ISNULL(TotalDue, 0)

With SQL Server 2008, the multiple assignment variable seems to respect the ORDER BY clause, so I'm cautiously optimistic about using this solution. However, it's not documented or supported by Microsoft, so if the order is critical, and it certainly is to a cumulative totals problem, then I recommend the T-SQL cursor solution. If you do choose the multiple assignment variable solution, be sure to test it thoroughly with every new service pack.

Summary

SQL Server excels in aggregate functions, with the proverbial rich suite of features, and it is very capable of calculating sums and aggregates to suit nearly any need. From the simple COUNT() aggregate function to the complex dynamic crosstab query and the new PIVOT command, these query methods enable you to create powerful data analysis queries for impressive reports.

The most important points to remember about aggregation are as follows:

■ Aggregate queries generate a single summary row, so every column has to be an aggregate function.
■ There's no performance difference between COUNT(*) and COUNT(pk).
■ Aggregate functions, such as COUNT(column) and AVG(column), ignore nulls, which can be a good thing, and a reason why nulls make life easier for the database developer.
■ GROUP BY queries divide the data source into several segmented data sets and then generate a summary row for each group. For GROUP BY queries, the GROUP BY columns can and should be in the column list.
■ In the logical flow of the query, the GROUP BY occurs after the FROM clause and the WHERE clause, so when coding the query, get the data properly selected and then add the GROUP BY.
■ Complex aggregations (e.g., nested aggregations) often require CTEs or subqueries. Design the query from the inside out; that is, design the aggregate subquery first and then add the outer query.
■ GROUP BY's ROLLUP and CUBE options have a new syntax, and they can be as powerful as Analysis Services cubes.
■ There are several ways to code a crosstab query. I recommend using a GROUP BY and CASE expressions, rather than the PIVOT syntax.
■ Dynamic crosstabs are possible only with dynamic SQL.
■ Calculating cumulative totals (running sums) is one of the few problems best solved by a cursor.

The next chapter continues working with summary data using the windowing and ranking technology of the OVER() clause.

Windowing and Ranking

IN THIS CHAPTER
Creating an independent sort of the result set
Grouping result sets
Calculating ranks, row numbers, and ntiles

Have you ever noticed the hidden arrow in the FedEx logo? Once you know that it's there, it's obvious, but in an informal poll of FedEx drivers, not one of them was aware of the arrow. Sometimes, just seeing things in a different perspective can help clarify the picture.

That's what SQL's windowing and ranking does: the windowing (using the OVER() clause) provides a new perspective on the data. The ranking functions then use that perspective to provide additional ways to manipulate the query results.

Windowing and ranking are similar to the last chapter's aggregate queries, but they belong in their own chapter because they work with an independent sort order separate from the query's ORDER BY clause, and should be thought of as a different technology from traditional aggregate queries.

Windowing

Before the ranking functions can be applied to the query, the window must be established.
Even though the SQL query syntax places these two steps together, logically it's easier to think through the window and then add the ranking function.

Referring back to the logical sequence of the query in Chapter 8, "Introducing Basic Query Flow," the OVER() clause occurs in the latter half of the logical flow of the query in step 6 after the column expressions and ORDER BY but before any verbs (OUTPUT, INSERT, UPDATE, DELETE, or UNION).

What's New with Windowing and Ranking?

The functionality was introduced in SQL Server 2005, and I had hoped it would be expanded for 2008. Windowing and ranking hold so much potential, and there's much more functionality in the ANSI SQL specification, but unfortunately, there's nothing new with windowing and ranking in SQL Server 2008.

All the examples in this chapter use the AdventureWorks2008 sample database.

The OVER() clause

The OVER() clause creates a new window on the data (think of it as a new perspective, or independent ordering, of the rows), which may or may not be the same as the sort order of the ORDER BY clause. In a way, the windowing capability creates an alternate flow to the query with its own sort order and ranking functions, as illustrated in Figure 13-1. The results of the windowing and ranking are passed back into the query before the ORDER BY clause.

FIGURE 13-1
The windowing and ranking functions can be thought of as a parallel query process with an independent sort order.
[Diagram: Windowing and Ranking Query Flow. Labels: Data Source(s), From, Where, Col(s)/Expr(s), Order By, Windowing Sort, Ranking Functions, Predicate]

The complete syntax is OVER(ORDER BY columns). The columns may be any available column or expression, just like the ORDER BY clause; but unlike the ORDER BY clause, the OVER() clause won't accept a column ordinal position, e.g., 1, 2.
Also, like the ORDER BY clause, it can be ascending (ASC), the default, or descending (DESC); and it can be sorted by multiple columns.

The window's sort order will take advantage of indexes and can be very fast, even if the sort order is different from the main query's sort order.

In the following query, the OVER() clause creates a separate view to the data sorted by OrderDate (ignore the ROW_NUMBER() function for now):

    USE AdventureWorks2008;

    SELECT ROW_NUMBER() OVER(ORDER BY OrderDate) as RowNumber,
        SalesOrderID, OrderDate
      FROM Sales.SalesOrderHeader
      WHERE SalesPersonID = 280
      ORDER BY RowNumber;

Result (abbreviated, and note that OrderDate does not include time information, so the results might vary within a given date):

    RowNumber  SalesOrderID  OrderDate
    1          43664         2001-07-01 00:00:00.000
    2          43860         2001-08-01 00:00:00.000
    3          43866         2001-08-01 00:00:00.000
    4          43867         2001-08-01 00:00:00.000
    5          43877         2001-08-01 00:00:00.000

Partitioning within the window

The OVER() clause normally creates a single sort order, but it can divide the windowed data into partitions, which are similar to groups in an aggregate GROUP BY query. This is dramatically powerful because the ranking functions will be able to restart with every partition.

The next query example uses the OVER() clause to create a sort order of the query results by OrderDate, and then partition the data by YEAR() and MONTH().
Notice that the syntax is the opposite of the logical flow; the PARTITION BY goes before the ORDER BY within the OVER() clause:

    SELECT ROW_NUMBER()
             OVER(Partition By Year(OrderDate), Month(OrderDate)
                  ORDER BY OrderDate) as RowNumber,
        SalesOrderID, OrderDate
      FROM Sales.SalesOrderHeader
      WHERE SalesPersonID = 280
      ORDER BY OrderDate;

Result (abbreviated):

    RowNumber  SalesOrderID  OrderDate
    1          43664         2001-07-01 00:00:00.000
    1          43860         2001-08-01 00:00:00.000
    2          43866         2001-08-01 00:00:00.000
    3          43867         2001-08-01 00:00:00.000
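The restart behavior is easier to see with the partitioned and unpartitioned numbering side by side. The following is a minimal sketch, assuming SQL Server 2008 or later; the @Orders table variable and its four rows are hypothetical stand-ins for Sales.SalesOrderHeader, and OrderID is added to each ORDER BY only to make the numbering deterministic.

```sql
-- Hypothetical sample rows (not AdventureWorks2008 data).
DECLARE @Orders TABLE (OrderID INT, OrderDate DATETIME);
INSERT INTO @Orders (OrderID, OrderDate)
  VALUES (1, '20010701'), (2, '20010801'),
         (3, '20010815'), (4, '20010902');

SELECT OrderID, OrderDate,
    -- One window over the whole result: numbers run 1..n.
    ROW_NUMBER() OVER(ORDER BY OrderDate, OrderID) AS OverallRow,
    -- One partition per year and month: numbering restarts at 1.
    ROW_NUMBER() OVER(PARTITION BY YEAR(OrderDate), MONTH(OrderDate)
                      ORDER BY OrderDate, OrderID) AS RowInMonth
  FROM @Orders
  ORDER BY OrderDate, OrderID;
```

With these four rows, OverallRow counts 1, 2, 3, 4, while RowInMonth comes back as 1, 1, 2, 1, because July, August, and September each form their own partition.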