Nielsen c12.tex V4 - 07/21/2009 12:46pm Page 302 Part II Manipulating Data With Select

Here, the fourth quarter is included in the result despite the lack of data for the fourth quarter of 2009. GROUP BY ALL includes the fourth quarter because there is data for the fourth quarter of 2008:

  SELECT DATEPART(qq, SalesDate) AS [Quarter],
      COUNT(*) AS [Count],
      SUM(Amount) AS [Sum],
      AVG(Amount) AS [Avg]
    FROM RawData
    WHERE YEAR(SalesDate) = 2009
    GROUP BY ALL DATEPART(qq, SalesDate);

Result:

  Quarter  Count  Sum   Avg
  1        6      218   36
  2        6      369   61
  3        7      217   72
  4        0      NULL  NULL

The real problem with the GROUP BY ALL solution is that it depends on data being present in the table but outside the current WHERE clause filter. If fourth-quarter data didn't exist for some year other than 2009, then the query would not have listed the fourth quarter at all.

A better solution to listing all data in a GROUP BY is to left outer join with a known set of complete data. In the following case, the ValueList subquery sets up a list of quarters. The LEFT OUTER JOIN includes all the rows from the ValueList subquery and matches up any rows with values from the aggregate query:

  SELECT ValueList.Quarter, Agg.[Count], Agg.[Sum], Agg.[Avg]
    FROM ( VALUES (1), (2), (3), (4) ) AS ValueList (Quarter)
      LEFT JOIN (SELECT DATEPART(qq, SalesDate) AS [Quarter],
                     COUNT(*) AS [Count],
                     SUM(Amount) AS [Sum],
                     AVG(Amount) AS [Avg]
                   FROM RawData
                   WHERE YEAR(SalesDate) = 2009
                   GROUP BY DATEPART(qq, SalesDate)) Agg
        ON ValueList.Quarter = Agg.Quarter
    ORDER BY ValueList.Quarter;

Result:

  Quarter  Count  Sum   Avg
  1        6      218   36
  2        6      369   61
  3        7      217   72
  4        0      NULL  NULL

In my testing, the fixed values list solution is slightly faster than the deprecated GROUP BY ALL solution.

Nesting aggregations

Aggregated data is often useful, and it can be even more useful to perform secondary aggregations on aggregated data.
For example, an aggregate query can easily SUM() each category and year/quarter within a subquery, but which category has the max value for each year/quarter? An obvious MAX(SUM()) doesn't work because there's not enough information to tell SQL Server how to nest the aggregation groupings.

Solving this problem requires a subquery to create a record set from the first aggregation, and an outer query to perform the second level of aggregation. For example, the following query sums by quarter and category, and then the outer query uses a MAX() to determine which sum is the greatest for each quarter:

  SELECT Y, Q, MAX(Total) AS MaxSum
    FROM ( -- Calculate Sums
          SELECT Category,
              YEAR(SalesDate) AS Y,
              DATEPART(q, SalesDate) AS Q,
              SUM(Amount) AS Total
            FROM RawData
            GROUP BY Category, YEAR(SalesDate), DATEPART(q, SalesDate)
         ) AS sq
    GROUP BY Y, Q
    ORDER BY Y, Q;

If it's easier to read, here's the same query using a common table expression (CTE) instead of a derived table subquery:

  WITH sq AS ( -- Calculate Sums
    SELECT Category,
        YEAR(SalesDate) AS Y,
        DATEPART(q, SalesDate) AS Q,
        SUM(Amount) AS Total
      FROM RawData
      GROUP BY Category, YEAR(SalesDate), DATEPART(q, SalesDate))
  SELECT Y, Q, MAX(Total) AS MaxSum
    FROM sq
    GROUP BY Y, Q
    ORDER BY Y, Q;

Result:

  Y     Q  MaxSum
  2008  4  79
  2009  1  147
  2009  2  215
  2009  3  280

Including detail descriptions

While it's nice to report the MAX(SUM()) of 147 for the first quarter of 2009, who wants to manually look up which category matches that sum? The next logical step is to include descriptive information about the aggregate data.
To add descriptive information for the detail columns, join with a subquery on the detail values:

  SELECT MaxQuery.Y, MaxQuery.Q, AllQuery.Category, MaxQuery.MaxSum AS MaxSum
    FROM ( -- Find Max Sum Per Year/Quarter
          SELECT Y, Q, MAX(Total) AS MaxSum
            FROM ( -- Calculate Sums
                  SELECT Category,
                      YEAR(SalesDate) AS Y,
                      DATEPART(q, SalesDate) AS Q,
                      SUM(Amount) AS Total
                    FROM RawData
                    GROUP BY Category, YEAR(SalesDate), DATEPART(q, SalesDate)
                 ) AS sq
            GROUP BY Y, Q
         ) AS MaxQuery
      INNER JOIN ( -- All Data Query
          SELECT Category,
              YEAR(SalesDate) AS Y,
              DATEPART(q, SalesDate) AS Q,
              SUM(Amount) AS Total
            FROM RawData
            GROUP BY Category, YEAR(SalesDate), DATEPART(q, SalesDate)
         ) AS AllQuery
        ON MaxQuery.Y = AllQuery.Y
          AND MaxQuery.Q = AllQuery.Q
          AND MaxQuery.MaxSum = AllQuery.Total
    ORDER BY MaxQuery.Y, MaxQuery.Q;

Result:

  Y     Q  Category  MaxSum
  2008  4  Y         79
  2009  1  Y         147
  2009  2  Z         215
  2009  3  Y         280

While the query appears complex at first glance, it's actually just an extension of the preceding query, which appears here as the derived table with the alias MaxQuery. The second subquery (with the alias AllQuery) finds the sum of every category and year/quarter. Joining MaxQuery with AllQuery on the sum and year/quarter locates the category and returns the descriptive value along with the aggregate data.

In this case, the CTE solution really starts to pay off, as the subquery doesn't have to be repeated.
The following query is exactly equivalent to the preceding one (same results, same execution plan, same performance), but shorter, easier to understand, and cheaper to maintain:

  WITH AllQuery AS ( -- All Data Query
    SELECT Category,
        YEAR(SalesDate) AS Y,
        DATEPART(qq, SalesDate) AS Q,
        SUM(Amount) AS Total
      FROM RawData
      GROUP BY Category, YEAR(SalesDate), DATEPART(qq, SalesDate))
  SELECT MaxQuery.Y, MaxQuery.Q, AllQuery.Category, MaxQuery.MaxSum
    FROM ( -- Find Max Sum Per Year/Quarter
          SELECT Y, Q, MAX(Total) AS MaxSum
            FROM AllQuery
            GROUP BY Y, Q
         ) AS MaxQuery
      INNER JOIN AllQuery
        ON MaxQuery.Y = AllQuery.Y
          AND MaxQuery.Q = AllQuery.Q
          AND MaxQuery.MaxSum = AllQuery.Total
    ORDER BY MaxQuery.Y, MaxQuery.Q;

Another alternative is to use the ranking functions and the OVER() clause, introduced in SQL Server 2005. The Ranked CTE refers to the AllQuery CTE. The following query produces the same result and is slightly more efficient:

  WITH AllQuery AS ( -- All Data Query
    SELECT Category,
        YEAR(SalesDate) AS Y,
        DATEPART(qq, SalesDate) AS Q,
        SUM(Amount) AS Total
      FROM RawData
      GROUP BY Category, YEAR(SalesDate), DATEPART(qq, SalesDate)),
  Ranked AS ( -- All data ranked after summing
    SELECT Category, Y, Q, Total,
        RANK() OVER (PARTITION BY Y, Q ORDER BY Total DESC) AS rn
      FROM AllQuery)
  SELECT Y, Q, Category, Total AS MaxSum
    FROM Ranked
    WHERE rn = 1
    ORDER BY Y, Q;

Ranking functions and the OVER() clause are explained in the next chapter, "Windowing and Ranking."

OLAP in the Park

While Reporting Services can easily add subtotals and totals without any extra work by the query, and Analysis Services builds beautiful cubes, such feats of data contortion are not exclusive to OLAP tools. The relational engine can take a lap in that park as well.

The ROLLUP and CUBE extensions to GROUP BY generate OLAP-type summaries of the data with subtotals and totals.
The columns to be totaled are defined similarly to how grouping sets can define GROUP BY columns.

The older, non-ANSI-standard WITH ROLLUP and WITH CUBE options are deprecated. The syntax still works for now, but it will be removed from a future version of SQL Server. This section covers only the newer syntax, which is much cleaner and offers more control. I think you'll like it.

The ROLLUP and CUBE extensions generate subtotals and grand totals as separate rows, and supply a NULL in the GROUP BY column to mark the row as a total. ROLLUP generates subtotal rows hierarchically, working through the GROUP BY columns from left to right. CUBE extends that capability by generating subtotal rows for every combination of the GROUP BY columns. ROLLUP and CUBE queries also automatically generate a grand total row. A special GROUPING() function returns 1 when the row is a subtotal or grand total row for the column, and 0 otherwise.

Rollup subtotals

The ROLLUP option, placed after the GROUP BY keywords, instructs SQL Server to generate an additional total row. In this example, the GROUPING() function is used by a CASE expression to convert the total row to something understandable:

  SELECT GROUPING(Category) AS 'Grouping',
      Category,
      CASE GROUPING(Category)
        WHEN 0 THEN Category
        WHEN 1 THEN 'All Categories'
      END AS CategoryRollup,
      SUM(Amount) AS Amount
    FROM RawData
    GROUP BY ROLLUP(Category);

Result:

  Grouping  Category  CategoryRollup  Amount
  0         X         X               225
  0         Y         Y               506
  0         Z         Z               215
  1         NULL      All Categories  946

The previous example had one column in the GROUP BY ROLLUP(), but just as GROUP BY can organize by multiple columns, so can GROUP BY ROLLUP().
The next example builds a more detailed summary of the data, with subtotals for each grouping of category and region:

  SELECT
      CASE GROUPING(Category)
        WHEN 0 THEN Category
        WHEN 1 THEN 'All Categories'
      END AS Category,
      CASE GROUPING(Region)
        WHEN 0 THEN Region
        WHEN 1 THEN 'All Regions'
      END AS Region,
      SUM(Amount) AS Amount
    FROM RawData
    GROUP BY ROLLUP(Category, Region);

Result:

  Category        Region       Amount
  X               MidWest      24
  X               NorthEast    NULL
  X               South        165
  X               West         36
  X               All Regions  225
  Y               MidWest      38
  Y               NorthEast    181
  Y               South        287
  Y               All Regions  506
  Z               MidWest      83
  Z               NorthEast    55
  Z               South        33
  Z               West         44
  Z               All Regions  215
  All Categories  All Regions  946

But wait, there's more. Multiple columns can be combined into a single grouping level. The following query places Category and Region in parentheses inside the ROLLUP parentheses and thus treats each combination of category and region as a single group, so the only total row is the grand total:

  SELECT
      CASE GROUPING(Category)
        WHEN 0 THEN Category
        WHEN 1 THEN 'All Categories'
      END AS Category,
      CASE GROUPING(Region)
        WHEN 0 THEN Region
        WHEN 1 THEN 'All Regions'
      END AS Region,
      SUM(Amount) AS Amount
    FROM RawData
    GROUP BY ROLLUP((Category, Region));

Result:

  Category        Region       Amount
  X               MidWest      24
  X               NorthEast    NULL
  X               South        165
  X               West         36
  Y               MidWest      38
  Y               NorthEast    181
  Y               South        287
  Z               MidWest      83
  Z               NorthEast    55
  Z               South        33
  Z               West         44
  All Categories  All Regions  946

Cube queries

A cube query is the next logical progression beyond a rollup query: It adds subtotals for every possible grouping in a multidimensional manner, just like Analysis Services.
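One way to picture "every possible grouping" is to spell a cube out as explicit grouping sets. The following is only a minimal sketch, assuming the same RawData table with Category, Region, and Amount columns used in the rollup examples. Under that assumption, GROUP BY CUBE(Category, Region) should produce the same groups as listing the four grouping sets by hand:

```sql
-- Sketch only: assumes the RawData table (Category, Region, Amount)
-- from the rollup examples.
SELECT Category, Region, SUM(Amount) AS Amount
  FROM RawData
  GROUP BY GROUPING SETS (
      (Category, Region),  -- detail rows: one per category/region pair
      (Category),          -- subtotal per category (Region is NULL)
      (Region),            -- subtotal per region (Category is NULL)
      ()                   -- grand total (both columns are NULL)
  );
```

By the same reasoning, ROLLUP(Category, Region) corresponds to only the first, second, and last of these sets, which is why a rollup has category subtotals but no region subtotals.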
Using the same example, the rollup query had subtotals for each category; the cube query has subtotals for each category and each region:

  SELECT
      CASE GROUPING(Category)
        WHEN 0 THEN Category
        WHEN 1 THEN 'All Categories'
      END AS Category,
      CASE GROUPING(Region)
        WHEN 0 THEN Region
        WHEN 1 THEN 'All Regions'
      END AS Region,
      SUM(Amount) AS Amount
    FROM RawData R
    GROUP BY CUBE(Category, Region)
    ORDER BY COALESCE(R.Category, 'ZZZZ'),
      COALESCE(R.Region, 'ZZZZ');

Result:

  Category        Region       Amount
  X               MidWest      24
  X               NorthEast    NULL
  X               South        165
  X               West         36
  X               All Regions  225
  Y               MidWest      38
  Y               NorthEast    181
  Y               South        287
  Y               All Regions  506
  Z               MidWest      83
  Z               NorthEast    55
  Z               South        33
  Z               West         44
  Z               All Regions  215
  All Categories  MidWest      145
  All Categories  NorthEast    236
  All Categories  South        485
  All Categories  West         80
  All Categories  All Regions  946

Building Crosstab Queries

Crosstab queries take the power of the previous cube query and give it more impact. Although an aggregate query can GROUP BY multiple columns, the result is still columnar and less than perfect for scanning numbers quickly. The cross-tabulation, or crosstab, query pivots the second GROUP BY column (or dimension) values counterclockwise 90 degrees and turns them into the crosstab columns, as shown in Figure 12-4. The limitation, of course, is that while a columnar GROUP BY query can have multiple aggregate functions, a crosstab query has difficulty displaying more than a single measure.

The term crosstab query describes the result set, not the method of creating the crosstab, because there are multiple programmatic methods for generating a crosstab query, some better than others. The following sections describe ways to create the same result.

Pivot method

Microsoft introduced the PIVOT method for coding crosstab queries with SQL Server 2005.
The pivot method deviates from the normal logical query flow by performing the aggregate GROUP BY function and generating the crosstab results as a data source within the FROM clause.

If you think of PIVOT as a table-valued function that's used as a data source, then it accepts two parameters. The first parameter is the aggregate function for the crosstab's values. The second parameter lists the pivoted columns. In the following example, the aggregate function sums the Amount column, and the pivoted columns are the regions. Because PIVOT is part of the FROM clause, the data set needs a named range or table alias:

  SELECT Category, MidWest, NorthEast, South, West
    FROM RawData
      PIVOT
        (SUM(Amount)
          FOR Region IN (South, NorthEast, MidWest, West)
        ) AS pt

FIGURE 12-4 Pivoting the second GROUP BY column creates a crosstab query. Here, the previous GROUP BY CUBE query's region values are pivoted to become the crosstab query columns.

Result:

  Category  MidWest  NorthEast  South  West
  Y         NULL     NULL       12     NULL
  Y         NULL     NULL       24     NULL
  Y         NULL     NULL       15     NULL
  Y         NULL     28         NULL   NULL
  X         NULL     NULL       11     NULL
  X         24       NULL       NULL   NULL
  X         NULL     NULL       NULL   36
  Y         NULL     NULL       47     NULL
  Y         38       NULL       NULL   NULL
  Y         NULL     62         NULL   NULL
  Z         NULL     NULL       33     NULL
  Z         83       NULL       NULL   NULL
  Z         NULL     NULL       NULL   44
  Z         NULL     55         NULL   NULL
  X         NULL     NULL       68     NULL
  X         NULL     NULL       86     NULL
  Y         NULL     NULL       54     NULL
  Y         NULL     NULL       63     NULL
  Y         NULL     NULL       72     NULL
  Y         NULL     91         NULL   NULL
  Y         NULL     NULL       NULL   NULL
  Z         NULL     NULL       NULL   NULL
  X         NULL     NULL       NULL   NULL
  X         NULL     NULL       NULL   NULL

The result is not what was expected! This doesn't look at all like the crosstab result shown in Figure 12-4. That's because the PIVOT function used every column provided to it.
Because the Amount and Region columns are specified, it assumed that every remaining column should be used for the GROUP BY, so it grouped by Category and SalesDate. There's no way to explicitly define the GROUP BY for the PIVOT; it uses an implicit GROUP BY. The solution is to use a subquery to select only the columns that should be submitted to the PIVOT command:

  SELECT Category, MidWest, NorthEast, South, West
    FROM (SELECT Category, Region, Amount
            FROM RawData) sq
      PIVOT
        (SUM(Amount)
          FOR Region IN (MidWest, NorthEast, South, West)
        ) AS pt

Result:

  Category  MidWest  NorthEast  South  West
  X         24       NULL       165    36
  Y         38       181        287    NULL
  Z         83       55         33     44

Now the result looks closer to the crosstab result in Figure 12-4.

Case expression method

The CASE expression method starts with a normal GROUP BY query generating a row for each value in the GROUP BY column. Adding a ROLLUP function to the GROUP BY adds a nice grand totals row to the crosstab.

To generate the crosstab columns, a CASE expression filters the data summed by the aggregate function. For example, if the region is "south," then the SUM() will see the amount value, but if the region isn't "south," then the CASE expression passes a 0 to the SUM() function. It's beautifully simple.

The CASE expression has three clear advantages over the PIVOT method, making it easier both to code and to maintain:

■ The GROUP BY is explicit. There's no guessing which columns will generate the rows.
■ The crosstab columns are defined only once.
■ It's easy to add a grand totals row.
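The CASE expression method described above can be sketched as a single query. Treat this as a hedged illustration, not a definitive listing: it assumes the same RawData table (Category, Region, Amount) and the four region values seen in the earlier results, and the CategoryRollup and Total aliases are names chosen only for this example.

```sql
-- Sketch only: assumes RawData(Category, Region, Amount) and the four
-- region values shown in the earlier results. CategoryRollup and Total
-- are illustrative alias names.
SELECT
    CASE GROUPING(Category)
      WHEN 0 THEN Category
      WHEN 1 THEN 'All Categories'
    END AS CategoryRollup,
    -- Each crosstab column sums only the rows for its own region;
    -- rows for any other region contribute 0.
    SUM(CASE WHEN Region = 'MidWest'   THEN Amount ELSE 0 END) AS MidWest,
    SUM(CASE WHEN Region = 'NorthEast' THEN Amount ELSE 0 END) AS NorthEast,
    SUM(CASE WHEN Region = 'South'     THEN Amount ELSE 0 END) AS South,
    SUM(CASE WHEN Region = 'West'      THEN Amount ELSE 0 END) AS West,
    SUM(Amount) AS Total
  FROM RawData
  GROUP BY ROLLUP(Category)   -- explicit grouping plus a grand totals row
  ORDER BY GROUPING(Category), Category;
```

Because rows for other regions pass a 0 instead of being skipped, a category with no amounts for a region should show 0 in that column where the PIVOT version shows NULL.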