Nielsen c12.tex V4 - 07/21/2009 12:46pm Page 292
Part II Manipulating Data With Select

Because SQL is now returning information from a set, rather than building a record set of rows, as soon as a query includes an aggregate function, every column (in the column list, in the expression, or in the ORDER BY) must participate in an aggregate function. This makes sense because if a query returned the total number of order sales, then it could not return a single order number on the summary row.

Because aggregate functions are expressions, the result will have a null column name. Therefore, use an alias to name the column in the results.

To demonstrate the mathematical aggregate functions, the following query produces a SUM(), AVG(), MIN(), and MAX() of the Amount column. SQL Server warns in the result that null values are ignored by aggregate functions, which are examined in more detail soon:

    SELECT SUM(Amount) AS [Sum],
           AVG(Amount) AS [Avg],
           MIN(Amount) AS [Min],
           MAX(Amount) AS [Max]
      FROM RawData;

Result:

    Sum  Avg  Min  Max
    946  47   11   91

    Warning: Null value is eliminated by an aggregate or other SET operation.

There's actually more to the COUNT() function than appears at first glance. The next query exercises four variations of the COUNT() aggregate function:

    SELECT COUNT(*) AS CountStar,
           COUNT(RawDataID) AS CountPK,
           COUNT(Amount) AS CountAmount,
           COUNT(DISTINCT Region) AS Regions
      FROM RawData;

Result:

    CountStar  CountPK  CountAmount  Regions
    24         24       20           4

    Warning: Null value is eliminated by an aggregate or other SET operation.

To examine this query in detail, the first column, COUNT(*), counts every row, regardless of any values in the row. COUNT(RawDataID) counts all the rows with a non-null value in the primary key. Because primary keys, by definition, can't have any nulls, this column also counts every row. These two methods of counting rows have the same query execution plan, same performance, and same result.
The third column, COUNT(Amount), demonstrates why every aggregate query includes a warning. It counts the number of rows with an actual value in the Amount column, and it ignores any rows with a null value in the Amount column. Because there are four rows with null amounts, this COUNT(Amount) finds only 20 rows.

COUNT(DISTINCT Region) is the oddball of this query. Instead of counting rows, it counts the unique values in the Region column. The RawData table data has four regions: MidWest, NorthEast, South, and West. Therefore, COUNT(DISTINCT Region) returns 4. Note that COUNT(DISTINCT *) is invalid; it requires a specific column.

Aggregates, averages, and nulls

Aggregate functions ignore nulls, which creates a special situation when calculating averages. A SUM() or AVG() aggregate function will not error out on a null, but simply skip the row with a null. For this reason, a SUM()/COUNT(*) calculation may provide a different result from an AVG() function. The COUNT(*) function includes every row, whereas the AVG() function might divide using a smaller count of rows.

To test this behavior, the next query uses three methods of calculating the average amount, and each method generates a different result:

    SELECT AVG(Amount) AS [Integer Avg],
           SUM(Amount) / COUNT(*) AS [Manual Avg],
           AVG(CAST((Amount) AS NUMERIC(9, 5))) AS [Numeric Avg]
      FROM RawData;

Result:

    Integer Avg  Manual Avg  Numeric Avg
    47           39          47.300000

The first column performs the standard AVG() aggregate function and divides the sum of the amount (946) by the number of rows with a non-null value for the amount (20). The SUM(Amount)/COUNT(*) calculation in column two actually divides 946 by the total number of rows in the table (24), yielding a different answer. The last column provides the best answer. It uses the AVG() function so it ignores null values, but it also improves the precision of the answer.
The trick is that the precision of the aggregate function is determined by the data type precision of the source values. The CAST() first converts the Amount values to a numeric(9,5) data type, and SQL Server then passes the converted values to the AVG() function.

Using aggregate functions within the Query Designer

When using Management Studio's Query Designer (select a table in the Object Explorer ➪ Context Menu ➪ Edit Top 200 Rows), a query can be converted into an aggregate query using the Group By toolbar button, as illustrated in Figure 12-2.

FIGURE 12-2 Performing an aggregate query within Management Studio's Query Designer. The aggregate function for the column is selected using the drop-down box in the Group By column.

For more information on using the Query Designer to build and execute queries, turn to Chapter 6, "Using Management Studio."

Beginning statistics

Statistics is a large and complex field of study, and while SQL Server does not pretend to replace a full statistical analysis software package, it does calculate standard deviation and variance, both of which are important for understanding the bell-curve spread of numbers.

An average alone is not sufficient to summarize a set of values (in the lexicon of statistics, a "set" is referred to as a population). The value in the exact middle of a population is the median (which can differ from the average, or arithmetic mean). How widely the values are dispersed from the mean is called the population's variance. For example, the populations (1, 2, 3, 4, 5, 6, 7, 8, 9) and (4, 4, 5, 5, 5, 5, 6, 6) both average to 5, but the values in the first set vary widely from the mean, whereas the second set's values are all close to the mean.
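As a rough illustration of how two populations can share an average yet differ in spread, the following sketch loads the sets (1 through 9) and (4, 4, 5, 5, 5, 5, 6, 6) into a table variable and compares them. The table variable @Pop and its columns SetName and Val are hypothetical names invented for this example; they are not part of the chapter's sample database.

```sql
-- Hypothetical table variable; @Pop, SetName, and Val are illustrative names only
DECLARE @Pop TABLE (SetName VARCHAR(10) NOT NULL, Val INT NOT NULL);

INSERT @Pop (SetName, Val)
VALUES ('Wide', 1), ('Wide', 2), ('Wide', 3), ('Wide', 4), ('Wide', 5),
       ('Wide', 6), ('Wide', 7), ('Wide', 8), ('Wide', 9),
       ('Narrow', 4), ('Narrow', 4), ('Narrow', 5), ('Narrow', 5),
       ('Narrow', 5), ('Narrow', 5), ('Narrow', 6), ('Narrow', 6);

-- Widely dispersed set: average of 5, large population variance
SELECT AVG(Val)    AS [Avg],
       VARP(Val)   AS [VarP],
       STDEVP(Val) AS [StDevP]
  FROM @Pop
 WHERE SetName = 'Wide';

-- Tightly clustered set: same average, small population variance
SELECT AVG(Val)    AS [Avg],
       VARP(Val)   AS [VarP],
       STDEVP(Val) AS [StDevP]
  FROM @Pop
 WHERE SetName = 'Narrow';
```

Run as a single batch (a table variable lives only for the batch that declares it), both queries should report an average of 5, while the population variance works out to about 6.67 for the first set (60/9) and 0.5 for the second (4/8).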
The standard deviation is the square root of the variance and describes the shape of the bell curve formed by the population.

The following query uses the StDevP() and VarP() functions to return the statistical variance and the standard deviation of the entire population of the RawData table:

    SELECT StDevP(Amount) AS [StDev],
           VarP(Amount) AS [Var]
      FROM RawData;

Result:

    StDev             Var
    24.2715883287435  589.11

To perform extensive statistical data analysis, I recommend exporting the query result set to Excel and tapping Excel's broad range of statistical functions.

The statistical formulas differ slightly when calculating variance and standard deviation from the entire population versus a sampling of the population. If the aggregate query includes the entire population, then use the StDevP() and VarP() aggregate functions, which use the biased, or n, method of calculating the deviation. However, if the query is using a sampling or subset of the population, then use the StDev() and Var() aggregate functions so that SQL Server will use the unbiased, or n-1, statistical method. Because GROUP BY queries slice the population into subsets, these queries should always use StDevP() and VarP() functions.

All of these aggregate functions also work with the OVER() clause; see Chapter 13, "Windowing and Ranking."

Grouping within a Result Set

Aggregate functions are all well and good, but how often do you need a total for an entire table? Most aggregate requirements will include a date range, department, type of sale, region, or the like. That presents a problem. If the only tool to restrict the aggregate function were the WHERE clause, then database developers would waste hours replicating the same query, or writing a lot of dynamic SQL queries and the code to execute the aggregate queries in sequence.

Fortunately, aggregate functions are complemented by the GROUP BY clause, which automatically partitions the data set into subsets based on the values in certain columns.
Once the data set is divided into subgroups, the aggregate functions are performed on each subgroup. The final result is one summation row for each group, as shown in Figure 12-3.

A common example is grouping the sales result by salesperson. A SUM() function without the grouping would produce the SUM() of all sales. Writing a query for each salesperson would provide a SUM() for each person, but maintaining that over time would be cumbersome. The GROUP BY clause automatically creates a subset of data grouped for each unique salesperson, and then the SUM() function is calculated for each salesperson's sales. Voilà.

FIGURE 12-3 The GROUP BY clause slices the data set into multiple subgroups.
[Diagram labels: Data Source(s), From, Where, Col(s), Expr(s), Data Set, group, row, Having, Order By, Predicate]

Simple groupings

Some queries use descriptive columns for the grouping, so the data used by the GROUP BY clause is the same data you need to see to understand the groupings. For example, the next query groups by category:

    SELECT Category,
           Count(*) AS Count,
           Sum(Amount) AS [Sum],
           Avg(Amount) AS [Avg],
           Min(Amount) AS [Min],
           Max(Amount) AS [Max]
      FROM RawData
      GROUP BY Category;

Result:

    Category  Count  Sum  Avg  Min  Max
    X         5      225  45   11   86
    Y         15     506  46   12   91
    Z         4      215  53   33   83

The first column of this query returns the Category column. While this column does not have an aggregate function, it still participates within the aggregate because that's the column by which the query is being grouped. It may therefore be included in the result set because, by definition, there can be only a single category value in each group. Each row in the result set summarizes one category, and the aggregate functions now calculate the row count, sum, average, minimum value, and maximum value for each category.
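To make the slicing concrete, here is a minimal sketch that contrasts a per-salesperson WHERE query with the equivalent GROUP BY query. It assumes a hypothetical table variable @Sales with invented names and values; it is not the RawData sample table.

```sql
-- Hypothetical sample data; @Sales, SalesPerson, and the values are illustrative only
DECLARE @Sales TABLE (SalesPerson VARCHAR(20) NOT NULL, Amount INT NULL);

INSERT @Sales (SalesPerson, Amount)
VALUES ('Ann', 10), ('Ann', 30),
       ('Bob', 20), ('Bob', NULL),
       ('Cy', 50);

-- One query per salesperson: works, but must be repeated for every name
SELECT SUM(Amount) AS [Sum]
  FROM @Sales
 WHERE SalesPerson = 'Ann';

-- GROUP BY produces one summary row per salesperson in a single query
SELECT SalesPerson,
       COUNT(*)    AS [Count],
       SUM(Amount) AS [Sum],
       AVG(Amount) AS [Avg]
  FROM @Sales
 GROUP BY SalesPerson
 ORDER BY SalesPerson;
```

With this data the grouped query should return three rows. Bob's row is worth a look: COUNT(*) counts both of his rows, but SUM() and AVG() skip the null amount, so his sum and average are both 20.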
SQL is not limited to grouping by a column. It's possible to group by an expression, but note that the exact same expression must be used in the SELECT list, not the individual columns used to generate the expression.

Nor is SQL limited to grouping by a single column or expression. Grouping by multiple columns and expressions is quite common. The following query is an example of grouping by two expressions that calculate year number and quarter from SalesDate:

    SELECT Year(SalesDate) AS [Year],
           DatePart(q, SalesDate) AS [Quarter],
           Count(*) AS Count,
           Sum(Amount) AS [Sum],
           Avg(Amount) AS [Avg],
           Min(Amount) AS [Min],
           Max(Amount) AS [Max]
      FROM RawData
      GROUP BY Year(SalesDate), DatePart(q, SalesDate);

Result:

    Year  Quarter  Count  Sum  Avg  Min  Max
    2009  1        6      218  36   11   62
    2009  2        6      369  61   33   86
    2009  3        8      280  70   54   91
    2008  4        4      79   19   12   28

For the purposes of a GROUP BY, null values are considered equal to other nulls and are grouped together into a single result row.

Grouping sets

Normally, SQL Server groups by every unique combination of values in every column listed in the GROUP BY clause. Grouping sets is a variation of that theme that's new for SQL Server 2008. With grouping sets, a summation row is generated for each unique value in each set. You can think of grouping sets as executing several GROUP BY queries (one for each grouping set) and then combining, or unioning, the results.

For example, the following two queries produce the same result.
The first query uses two GROUP BY queries unioned together; the second query uses the new grouping set feature:

    SELECT NULL AS Category,
           Region,
           COUNT(*) AS Count,
           SUM(Amount) AS [Sum],
           AVG(Amount) AS [Avg],
           MIN(Amount) AS [Min],
           MAX(Amount) AS [Max]
      FROM RawData
      GROUP BY Region
    UNION
    SELECT Category,
           NULL,
           COUNT(*) AS Count,
           SUM(Amount) AS [Sum],
           AVG(Amount) AS [Avg],
           MIN(Amount) AS [Min],
           MAX(Amount) AS [Max]
      FROM RawData
      GROUP BY Category;

    SELECT Category,
           Region,
           COUNT(*) AS Count,
           SUM(Amount) AS [Sum],
           AVG(Amount) AS [Avg],
           MIN(Amount) AS [Min],
           MAX(Amount) AS [Max]
      FROM RawData
      GROUP BY GROUPING SETS (Category, Region);

Result (same for both queries):

    Category  Region     Count  Sum  Avg  Min  Max
    NULL      MidWest    3      145  48   24   83
    NULL      NorthEast  6      236  59   28   91
    NULL      South      12     485  44   11   86
    NULL      West       3      80   40   36   44
    X         NULL       7      225  45   11   86
    Y         NULL       12     506  46   12   91
    Z         NULL       5      215  53   33   83

There's more to grouping sets than merging multiple GROUP BY queries; they're also used with ROLLUP and CUBE, covered later in this chapter.

Filtering grouped results

When combined with grouping, filtering can be a problem. Are the row restrictions applied before the GROUP BY or after the GROUP BY? Some databases use nested queries to properly filter before or after the GROUP BY. SQL, however, uses the HAVING clause to filter the groups. At the beginning of this chapter, you saw the simplified order of the SQL SELECT statement's execution. A more complete order is as follows:

1. The FROM clause assembles the data from the data sources.
2. The WHERE clause restricts the rows based on the conditions.
3. The GROUP BY clause assembles subsets of data.
4. Aggregate functions are calculated.
5. The HAVING clause filters the subsets of data.
6. Any remaining expressions are calculated.
7. The ORDER BY sorts the results.

Continuing with the RawData sample table, the following query removes from the analysis any grouping "having" an average of less than or equal to 25 by accepting only those summary rows with an average greater than 25:

    SELECT Year(SalesDate) AS [Year],
           DatePart(q, SalesDate) AS [Quarter],
           Count(*) AS Count,
           Sum(Amount) AS [Sum],
           Avg(Amount) AS [Avg]
      FROM RawData
      GROUP BY Year(SalesDate), DatePart(q, SalesDate)
      HAVING Avg(Amount) > 25
      ORDER BY [Year], [Quarter];

Result:

    Year  Quarter  Count  Sum  Avg
    2009  1        6      218  36
    2009  2        6      369  61
    2009  3        8      280  70

Without the HAVING clause, the fourth quarter of 2008, with an average of 19, would have been included in the result set.

Aggravating Queries

A few aspects of GROUP BY queries can be aggravating when developing applications. Some developers simply avoid aggregate queries and make the reporting tool do the work, but the Database Engine will be more efficient than any client tool. Here are four typical aggravating problems and my recommended solutions.

Including group by descriptions

The previous aggregate queries all executed without error because every column participated in the aggregate purpose of the query.
To test the rule, the following script adds a category table and then attempts to return a column that isn't included as an aggregate function or GROUP BY column:

    CREATE TABLE RawCategory (
      RawCategoryID CHAR(1) NOT NULL PRIMARY KEY,
      CategoryName VARCHAR(25) NOT NULL
    );

    INSERT RawCategory (RawCategoryID, CategoryName)
    VALUES ('X', 'Sci-Fi'),
           ('Y', 'Philosophy'),
           ('Z', 'Zoology');

    ALTER TABLE RawData
      ADD CONSTRAINT FT_Category
      FOREIGN KEY (Category)
      REFERENCES RawCategory(RawCategoryID);

    -- including data outside the aggregate function or group by
    SELECT R.Category,
           C.CategoryName,
           Sum(R.Amount) AS [Sum],
           Avg(R.Amount) AS [Avg],
           Min(R.Amount) AS [Min],
           Max(R.Amount) AS [Max]
      FROM RawData AS R
        INNER JOIN RawCategory AS C
          ON R.Category = C.RawCategoryID
      GROUP BY R.Category;

As expected, including CategoryName in the column list causes the query to return an error message:

    Msg 8120, Level 16, State 1, Line 1
    Column 'RawCategory.CategoryName' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

Here are three solutions for including non-aggregate descriptive columns. Which solution performs best depends on the size and mix of the data and indexes.
The first solution is to simply include the additional columns in the GROUP BY clause:

    SELECT R.Category,
           C.CategoryName,
           Sum(R.Amount) AS [Sum],
           Avg(R.Amount) AS [Avg],
           Min(R.Amount) AS [Min],
           Max(R.Amount) AS [Max]
      FROM RawData AS R
        INNER JOIN RawCategory AS C
          ON R.Category = C.RawCategoryID
      GROUP BY R.Category, C.CategoryName
      ORDER BY R.Category, C.CategoryName;

Result:

    Category  CategoryName  Sum  Avg  Min  Max
    X         Sci-Fi        225  45   11   86
    Y         Philosophy    506  46   12   91
    Z         Zoology       215  53   33   83

Another simple solution might be to include the descriptive column in an aggregate function that accepts text, such as MIN() or MAX(). This solution returns the descriptor while avoiding grouping by an additional column:

    SELECT Category,
           MAX(CategoryName) AS CategoryName,
           SUM(Amount) AS [Sum],
           AVG(Amount) AS [Avg],
           MIN(Amount) AS [Min],
           MAX(Amount) AS [Max]
      FROM RawData R
        JOIN RawCategory C
          ON R.Category = C.RawCategoryID
      GROUP BY Category
      ORDER BY Category, CategoryName;

Another possible solution, although more complex, is to embed the aggregate function in a subquery and then include the additional columns in the outer query. In this solution, the subquery does the grunt work of the aggregate function and GROUP BY, leaving the outer query to handle the JOIN and bring in the descriptive column(s). For larger data sets, this may be the best-performing solution:

    SELECT sq.Category,
           C.CategoryName,
           sq.[Sum], sq.[Avg], sq.[Min], sq.[Max]
      FROM (SELECT Category,
                   Sum(Amount) AS [Sum],
                   Avg(Amount) AS [Avg],
                   Min(Amount) AS [Min],
                   Max(Amount) AS [Max]
              FROM RawData
              GROUP BY Category) AS sq
        INNER JOIN RawCategory AS C
          ON sq.Category = C.RawCategoryID
      ORDER BY sq.Category, C.CategoryName;

Which solution performs best depends on the data mix. If it's an ad hoc query, then the simplest query to write is probably the first solution.
If the query is going into production as part of a stored procedure, then I recommend testing all three solutions against a full data load to determine which solution actually performs best. Never underestimate the optimizer.

Including all group by values

The GROUP BY clause is processed after the WHERE clause in the logical order of the query. This can present a problem if the query needs to report all of the GROUP BY column values even though the data needs to be filtered. For example, a report might need to include all the months even though there's no data for a given month. A GROUP BY query won't return a summary row for a group that has no data.

The simple solution is to use the GROUP BY ALL option, which includes all GROUP BY values regardless of the WHERE clause. However, it has a limitation: It only works well when grouping by a single expression. A more severe limitation is that Microsoft lists it as deprecated, meaning it will be removed from a future version of SQL Server. Nonetheless, here's an example.
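A minimal sketch of the syntax follows, assuming a hypothetical table variable @Sales with invented columns and values rather than the chapter's RawData sample table. It runs the same filtered query twice, once with a plain GROUP BY and once with the deprecated GROUP BY ALL, so the difference in the returned groups is visible side by side.

```sql
-- Hypothetical sample data; @Sales and its columns are illustrative names only
DECLARE @Sales TABLE (SalesMonth INT NOT NULL,
                      Region VARCHAR(10) NOT NULL,
                      Amount INT NULL);

INSERT @Sales (SalesMonth, Region, Amount)
VALUES (1, 'South', 10), (1, 'West', 20),
       (2, 'West', 30),
       (3, 'South', 40);

-- Plain GROUP BY: month 2 has no 'South' rows, so it drops out of the result
SELECT SalesMonth,
       COUNT(Amount) AS [Count],
       SUM(Amount)   AS [Sum]
  FROM @Sales
 WHERE Region = 'South'
 GROUP BY SalesMonth;

-- GROUP BY ALL (deprecated): every SalesMonth value in the table is reported,
-- including month 2, whose rows were all removed by the WHERE clause
SELECT SalesMonth,
       COUNT(Amount) AS [Count],
       SUM(Amount)   AS [Sum]
  FROM @Sales
 WHERE Region = 'South'
 GROUP BY ALL SalesMonth;
```

On a version of SQL Server that still accepts the option, the second query should return a row for month 2 even though none of its rows pass the filter; the aggregates for that row are computed over no rows. Because the option is deprecated, treat this as a demonstration of the behavior rather than a pattern for new code.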