Nielsen c11.tex V4 - 07/23/2009 1:54pm Page 282
Part II Manipulating Data With Select

relational division query would list only those students who passed the required courses and no others. A relational division with a remainder, also called an approximate divide, would list all the students who passed the required courses and include students who passed any additional courses. Of course, that example is both practical and academic.

Relational division is more complex than a join. A join simply finds any matches between two data sets. Relational division finds exact matches between two data sets. Joins/subqueries and relational division solve different types of questions. For example, the following questions apply to the sample databases and compare the two methods:

■ Joins/subqueries:
    ■ CHA2: Who has ever gone on a tour?
    ■ CHA2: Who lives in the same region as a base camp?
    ■ CHA2: Who has attended any event in his or her home region?
■ Exact relational division:
    ■ CHA2: Who has gone on every tour in his or her home state but no tours outside it?
    ■ OBXKites: Who has purchased every kite but nothing else?
    ■ Family: Which women (widows or divorcees) have married the same husbands as each other, but no other husbands?
■ Relational division with remainders:
    ■ CHA2: Who has gone on every tour in his or her home state, and possibly other tours as well?
    ■ OBXKites: Who has purchased every kite and possibly other items as well?
    ■ Family: Which women have married the same husbands and may have married other men as well?

Relational division with a remainder

Relational division with a remainder essentially extracts the quotient while allowing some leeway for rows that meet the criteria but contain additional data as well. In real-life situations this type of division is typically more useful than an exact relational division.
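As a rough illustration of the student example, the following sketch contrasts the two kinds of division using nested NOT EXISTS subqueries. The table variables, column names, and rows are hypothetical (they are not part of the sample databases), and the multi-row VALUES syntax assumes SQL Server 2008 or later:

```sql
-- Hypothetical data: which courses are required, and who passed what.
DECLARE @Required TABLE (Course varchar(20) PRIMARY KEY);
DECLARE @Passed   TABLE (Student varchar(20), Course varchar(20),
                         PRIMARY KEY (Student, Course));

INSERT INTO @Required (Course) VALUES ('Algebra'), ('Biology');
INSERT INTO @Passed (Student, Course)
VALUES ('Ann', 'Algebra'), ('Ann', 'Biology'),    -- exactly the required courses
       ('Bob', 'Algebra'), ('Bob', 'Biology'),
       ('Bob', 'Chemistry'),                      -- required courses plus an extra
       ('Cy',  'Algebra');                        -- missing Biology

-- Division with a remainder: students for whom no required course is missing.
SELECT DISTINCT p.Student
  FROM @Passed AS p
  WHERE NOT EXISTS
    (SELECT *
       FROM @Required AS r
       WHERE NOT EXISTS
         (SELECT *
            FROM @Passed AS p2
            WHERE p2.Student = p.Student
              AND p2.Course = r.Course));

-- Exact division: the same test, plus no passed course outside the required list.
SELECT DISTINCT p.Student
  FROM @Passed AS p
  WHERE NOT EXISTS
    (SELECT *
       FROM @Required AS r
       WHERE NOT EXISTS
         (SELECT *
            FROM @Passed AS p2
            WHERE p2.Student = p.Student
              AND p2.Course = r.Course))
    AND NOT EXISTS
    (SELECT *
       FROM @Passed AS p3
       WHERE p3.Student = p.Student
         AND p3.Course NOT IN (SELECT Course FROM @Required));
```

With these rows, the first query should return Ann and Bob, and the second only Ann. Cy drops out of both because Biology is missing.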
The previous OBX Kites sales question ("Who has purchased every kite and possibly other items as well?") is a good one to use to demonstrate relational division. Because it takes five tables to go from contact to product category, and because the question refers to the join between OrderDetail and Product, this question involves enough complexity that it simulates a real-world relational-database problem.

The toy category serves as a good example category because it contains only two toys and no one has purchased a toy in the sample data, so the query will answer the question "Who has purchased at least one of every toy sold by OBX Kites?" (Yes, my kids volunteered to help test this query.)

First, the following data will mock up a scenario in the OBXKites database. The only toys are ProductCode 1049 and 1050. The OBXKites database uses unique identifiers for primary keys and therefore uses stored procedures for all inserts. The first Order and OrderDetail inserts will list the stored procedure parameters so the following stored procedure calls are easier to understand:

  USE OBXKites;
  DECLARE @OrderNumber INT;

The first person, ContactCode 110, orders exactly all toys:

  EXEC pOrder_AddNew
    @ContactCode = '110',
    @EmployeeCode = '120',
    @LocationCode = 'CH',
    @OrderDate = '2002/6/1',
    @OrderNumber = @OrderNumber output;
  EXEC pOrder_AddItem
    @OrderNumber = @OrderNumber,
    @Code = '1049',
    @NonStockProduct = NULL,
    @Quantity = 12,
    @UnitPrice = NULL,
    @ShipRequestDate = '2002/6/1',
    @ShipComment = NULL;
  EXEC pOrder_AddItem @OrderNumber, '1050', NULL, 3, NULL, NULL, NULL;

The second person, ContactCode 111, orders exactly all toys, and toy 1050 twice:

  EXEC pOrder_AddNew '111', '119', 'JR', '2002/6/1', @OrderNumber output;
  EXEC pOrder_AddItem @OrderNumber, '1049', NULL, 6, NULL, NULL, NULL;
  EXEC pOrder_AddItem @OrderNumber, '1050', NULL, 6, NULL, NULL, NULL;
  EXEC pOrder_AddNew '111', '119', 'JR', '2002/6/1', @OrderNumber output;
  EXEC pOrder_AddItem @OrderNumber, '1050', NULL, 6, NULL, NULL, NULL;

The third person, ContactCode 112, orders all toys plus some other products:

  EXEC pOrder_AddNew '112', '119', 'JR', '2002/6/1', @OrderNumber output;
  EXEC pOrder_AddItem @OrderNumber, '1049', NULL, 6, NULL, NULL, NULL;
  EXEC pOrder_AddItem @OrderNumber, '1050', NULL, 5, NULL, NULL, NULL;
  EXEC pOrder_AddItem @OrderNumber, '1001', NULL, 5, NULL, NULL, NULL;
  EXEC pOrder_AddItem @OrderNumber, '1002', NULL, 5, NULL, NULL, NULL;

The fourth person, ContactCode 113, orders one toy:

  EXEC pOrder_AddNew '113', '119', 'JR', '2002/6/1', @OrderNumber output;
  EXEC pOrder_AddItem @OrderNumber, '1049', NULL, 6, NULL, NULL, NULL;

In other words, only customers 110 and 111 order all the toys and nothing else. Customer 112 purchases all the toys, as well as some kites. Customer 113 is an error check because she bought only one toy.

At least a couple of methods exist for coding a relational-division query. The original method, proposed by Chris Date, involves using nested correlated subqueries to locate rows in and out of the sets. A more direct method has been popularized by Joe Celko: It involves comparing the row count of the dividend and divisor data sets.

Basically, Celko's solution is to rephrase the question as "For whom is the number of toys ordered equal to the number of toys available?"

The query is asking two questions. The outer query will group the orders with toys for each contact, and the subquery will count the number of products in the toy product category.
The outer query's HAVING clause will then compare the distinct count of contact products ordered that are toys against the count of products that are toys:

  -- Is number of toys ordered
  SELECT Contact.ContactCode
    FROM dbo.Contact
      JOIN dbo.[Order]
        ON Contact.ContactID = [Order].ContactID
      JOIN dbo.OrderDetail
        ON [Order].OrderID = OrderDetail.OrderID
      JOIN dbo.Product
        ON OrderDetail.ProductID = Product.ProductID
      JOIN dbo.ProductCategory
        ON Product.ProductCategoryID = ProductCategory.ProductCategoryID
    WHERE ProductCategory.ProductCategoryName = 'Toy'
    GROUP BY Contact.ContactCode
    HAVING COUNT(DISTINCT Product.ProductCode) =
      -- equal to number of toys available?
      (SELECT Count(ProductCode)
         FROM dbo.Product
           JOIN dbo.ProductCategory
             ON Product.ProductCategoryID = ProductCategory.ProductCategoryID
         WHERE ProductCategory.ProductCategoryName = 'Toy');

Result:

  ContactCode
  110
  111
  112

Some techniques in the previous query (namely, GROUP BY, HAVING, and COUNT()) are explained in the next chapter, "Aggregating Data."

Exact relational division

Exact relational division finds exact matches without any remainder. It takes the basic question of relational division with remainder and tightens the method so that the dividend will have no extra rows that cause a remainder. In practical terms it means that the example question now asks, "Who has ordered only every toy?"

If you address this query with a modified form of Joe Celko's method, the pseudocode becomes "For whom is the number of toys ordered equal to the number of toys available, and also equal to the total number of products ordered?" If a customer has ordered additional products other than toys, then the third part of the question eliminates that customer from the result set.

The SQL code contains two primary changes to the previous query.
One, the outer query must find both the number of toys ordered and the number of all products ordered. It does this by finding the toys purchased in a derived table and joining the two data sets. Two, the HAVING clause must be modified to compare the number of toys available with both the number of toys purchased and the number of all products purchased, as follows:

  -- Exact Relational Division
  -- Is number of all products ordered
  SELECT Contact.ContactCode
    FROM dbo.Contact
      JOIN dbo.[Order]
        ON Contact.ContactID = [Order].ContactID
      JOIN dbo.OrderDetail
        ON [Order].OrderID = OrderDetail.OrderID
      JOIN dbo.Product
        ON OrderDetail.ProductID = Product.ProductID
      JOIN dbo.ProductCategory P1
        ON Product.ProductCategoryID = P1.ProductCategoryID
      JOIN
        -- and number of toys ordered
        (SELECT Contact.ContactCode, Product.ProductCode
           FROM dbo.Contact
             JOIN dbo.[Order]
               ON Contact.ContactID = [Order].ContactID
             JOIN dbo.OrderDetail
               ON [Order].OrderID = OrderDetail.OrderID
             JOIN dbo.Product
               ON OrderDetail.ProductID = Product.ProductID
             JOIN dbo.ProductCategory
               ON Product.ProductCategoryID = ProductCategory.ProductCategoryID
           WHERE ProductCategory.ProductCategoryName = 'Toy'
        ) ToysOrdered
        ON Contact.ContactCode = ToysOrdered.ContactCode
    GROUP BY Contact.ContactCode
    HAVING COUNT(DISTINCT Product.ProductCode) =
      -- equal to number of toys available?
      (SELECT Count(ProductCode)
         FROM dbo.Product
           JOIN dbo.ProductCategory
             ON Product.ProductCategoryID = ProductCategory.ProductCategoryID
         WHERE ProductCategory.ProductCategoryName = 'Toy')
      -- AND equal to the total number of any product ordered?
      AND COUNT(DISTINCT ToysOrdered.ProductCode) =
      (SELECT Count(ProductCode)
         FROM dbo.Product
           JOIN dbo.ProductCategory
             ON Product.ProductCategoryID = ProductCategory.ProductCategoryID
         WHERE ProductCategory.ProductCategoryName = 'Toy');

The result is a list of contacts for whom the number of toys purchased (2) and the number of total products purchased (2) both equal the number of toys available (2):

  ContactCode
  110
  111

Composable SQL

Composable SQL, also called select from output or DML table source (in SQL Server BOL), is the ability to pass data from an insert, update, or delete's output clause to an outer query. This is a very powerful new way to build subqueries, and it can significantly reduce the amount of code and improve the performance of code that needs to write to one table, and then, based on that write, write to another table.

To track the evolution of composable SQL (illustrated in Figure 11-3), SQL Server has always had DML triggers, which include the inserted and deleted virtual tables. Essentially, these are a view to the DML modification that fired the triggers. The deleted table holds the before image of the data, and the inserted table holds the after image.

Since SQL Server 2005, any DML statement that modifies data (INSERT, UPDATE, DELETE, and, as of SQL Server 2008, MERGE) can have an optional OUTPUT clause that can SELECT from the virtual inserted and deleted tables. The OUTPUT clause can pass the data to the client or insert it directly into a table.

The inserted and deleted virtual tables are covered in Chapter 26, "Creating DML Triggers," and the output clause is detailed in Chapter 15, "Modifying Data."

In SQL Server 2008, composable SQL can place the DML statement and its OUTPUT clause in a subquery and then select from that subquery.
The primary benefit of composable SQL, as opposed to just using the OUTPUT clause to insert into a table, is that OUTPUT clause data may be further filtered and manipulated by the outer query.

FIGURE 11-3 Composable SQL is an evolution of the inserted and deleted tables.
[Diagram: DML (Insert, Update, Delete, Merge) feeding the inserted and deleted tables (SQL 2000), the OUTPUT clause to the client, table variable, temp tables, or tables (SQL 2005), and select from output in a subquery (SQL 2008).]

The following script first creates a table and then has a composable SQL query. The subquery has an UPDATE command with an OUTPUT clause. The OUTPUT clause passes the oldvalue and newvalue columns to the outer query. The outer query inserts them into the CompSQL table, and a follow-up SELECT filters out TestData:

  CREATE TABLE CompSQL (oldvalue varchar(50), newvalue varchar(50));

  INSERT INTO CompSQL (oldvalue, newvalue)
    SELECT oldvalue, newvalue
      FROM (UPDATE HumanResources.Department
              SET GroupName = 'Composable SQL Test'
              OUTPUT Deleted.GroupName AS 'oldvalue',
                     Inserted.GroupName AS 'newvalue'
              WHERE Name = 'Sales') Q;

  SELECT oldvalue, newvalue
    FROM CompSQL
    WHERE newvalue <> 'TestData';

Result:

  oldvalue               newvalue
  Sales and Marketing    Composable SQL Test

Note several restrictions on composable SQL:

■ The update DML in the subquery must modify a local table and cannot be a partitioned view.
■ The composable SQL query cannot include nested composable SQL, aggregate functions, subqueries, ranking functions, full-text features, user-defined functions that perform data access, or the textptr function.
■ The target table must be a local base table with no triggers, no foreign keys, no merge replication, or updatable subscriptions for transactional replication.
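The filtering benefit mentioned earlier can be sketched by putting a WHERE clause on the outer query itself. This is an illustrative variation only: it assumes the AdventureWorks sample database, dbo.GroupNameAudit is a hypothetical table, and the choice of rows to skip is arbitrary:

```sql
-- Hypothetical audit table; the name and columns are illustrative only.
CREATE TABLE dbo.GroupNameAudit (
  DepartmentName nvarchar(50),
  oldvalue       nvarchar(50),
  newvalue       nvarchar(50)
);

INSERT INTO dbo.GroupNameAudit (DepartmentName, oldvalue, newvalue)
  SELECT Q.DepartmentName, Q.oldvalue, Q.newvalue
    FROM (UPDATE HumanResources.Department
            SET GroupName = 'Composable SQL Test'
            OUTPUT Inserted.Name      AS DepartmentName,
                   Deleted.GroupName  AS oldvalue,
                   Inserted.GroupName AS newvalue
            WHERE GroupName = 'Sales and Marketing') AS Q
    -- The outer filter decides which OUTPUT rows are written to the audit table.
    WHERE Q.DepartmentName <> 'Marketing';
```

Note that the UPDATE still changes every department in the group; the outer WHERE only controls which OUTPUT rows reach the audit table. Running the script inside a transaction that is rolled back keeps the sample data unchanged.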
Summary

While the basic nuts and bolts of subqueries may appear simple, they open a world of possibilities, as they enable you to build complex nested queries that pull and twist data into the exact shape that is needed to solve a difficult problem. As you continue to play with subqueries, I think you'll agree that herein lies the power of SQL, and if you're still developing primarily with the GUI tools, this might provide the catalyst to move you to developing SQL using the query text editor.

A few key points from this chapter:

■ Simple subqueries are executed once and the results are inserted into the outer query.
■ Subqueries can be used in nearly every portion of the query, not just as derived tables.
■ Correlated subqueries refer to the outer query, so they can't be executed by themselves. Conceptually, the outer query is executed and the results are passed to the correlated subquery, which is executed once for every row in the outer query.
■ You don't need to memorize how to code relational division; just remember that if you need to join not on any row but every row, then relational division is the set-based solution to do the job.
■ Composable SQL is useful if you need to write to multiple tables from a single transaction, but there are plenty of limitations.

The previous chapters established the foundation for working with SQL, covering the SELECT statement, expressions, joins, and unions, while this chapter expanded the SELECT with powerful subqueries and CTEs. If you're reading through this book sequentially, congratulations: you are now over the hump of learning SQL. If you can master relational algebra and subqueries, the rest is a piece of cake.

The next chapter continues to describe the repertoire of data-retrieval techniques with aggregation queries, where using subqueries pays off.
Nielsen c12.tex V4 - 07/21/2009 12:46pm Page 289

Aggregating Data

IN THIS CHAPTER
Calculating sums and averages
Statistical analysis
Grouping data within a query
Solving aggravating aggregation problems
Generating cumulative totals
Building crosstab queries with the case, pivot, and dynamic methods

The Information Architecture Principle in Chapter 2 implies that information, not just data, is an asset. Turning raw lists of keys and data into useful information often requires summarizing data and grouping it in meaningful ways. While summarization and analysis can certainly be performed with other tools, such as Reporting Services, Analysis Services, or an external tool such as SAS, SQL is a set-based language, and a fair amount of summarizing and grouping can be performed very well within the SQL SELECT statement.

SQL excels at calculating sums, max values, and averages for the entire data set or for segments of data. In addition, SQL queries can create cross-tabulations, commonly known as pivot tables.

Simple Aggregations

The premise of an aggregate query is that instead of returning all the selected rows, SQL Server returns a single row of computed values that summarizes the original data set, as illustrated in Figure 12-1. More complex aggregate queries can slice the selected rows into subsets and then summarize every subset. The types of aggregate calculations range from totaling the data to performing basic statistical operations.

It's important to note that in the logical order of the SQL query, the aggregate functions (indicated by the Summing function in the diagram) occur following the FROM clause and the WHERE filters. This means that the data can be assembled and filtered prior to being summarized without needing to use a subquery, although sometimes a subquery is still needed to build more complex aggregate queries (as detailed later in the "Aggravating Queries" section in this chapter).
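To make the logical order concrete, here is a minimal sketch using hypothetical inline rows (a table value constructor, which assumes SQL Server 2008 or later). The WHERE clause removes the North row before COUNT and SUM ever see it, so no subquery is needed:

```sql
-- Hypothetical rows; the column names are illustrative only.
SELECT COUNT(*)      AS RowsSummarized,
       SUM(s.Amount) AS TotalAmount
  FROM (VALUES ('South', 10),
               ('South', 20),
               ('North', 40)) AS s (Region, Amount)
  WHERE s.Region = 'South';
-- Returns one row: RowsSummarized = 2, TotalAmount = 30
```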
What's New with Query Aggregations?

Microsoft continues to evolve T-SQL's ability to aggregate data. SQL Server 2005 included the capability to roll your own aggregate functions using the .NET CLR. SQL Server 2008 expands this feature by removing the 8,000-byte limit on intermediate results for CLR user-defined aggregate functions.

The most significant enhancement to query aggregation in SQL Server 2008 is the ability to use grouping sets to further define the CUBE and ROLLUP functions with the GROUP BY clause.

WITH ROLLUP and WITH CUBE have been deprecated, as they are non-ISO-compliant syntax for special cases of the ISO-compliant syntax. They are replaced with the new, more powerful, syntax for ROLLUP and CUBE.

FIGURE 12-1 The aggregate function produces a single row result from a data set.
[Diagram labels: Data Source(s), From, Where, Col(s), Expr(s), Summing, Single Row]

Basic aggregations

SQL includes a set of aggregate functions, listed in Table 12-1, which can be used as expressions in the SELECT statement to return summary data.

ON the WEBSITE
The code examples for this chapter use a small table called RawData. The code to create and populate this data set is at the beginning of the chapter's script. You can download the script from www.SQLServerBible.com.

  CREATE TABLE RawData (
    RawDataID INT NOT NULL IDENTITY PRIMARY KEY,
    Region VARCHAR(10) NOT NULL,
    Category CHAR(1) NOT NULL,
    Amount INT NULL,
    SalesDate Date NOT NULL
  );

TABLE 12-1 Basic Aggregate Functions

Aggregate Function | Data Type Supported | Description
sum() | Numeric | Totals all the non-null values in the column
avg() | Numeric | Averages all the non-null values in the column. The result has the same data type as the input, so the input is often converted to a higher precision, such as avg(cast(col as float)).
min() | Numeric, string, datetime | Returns the smallest number or the first datetime or the first string according to the current collation from the column
max() | Numeric, string, datetime | Returns the largest number or the last datetime or the last string according to the current collation from the column
Count[_big](*) | Any data type (row-based) | Performs a simple count of all the rows in the result set up to 2,147,483,647. The count_big() variation uses the bigint data type and can handle up to 2^63-1 rows.
Count[_big]([distinct] column) | Any data type (row-based) | Performs a simple count of all the rows with non-null values in the column in the result set up to 2,147,483,647. The distinct option eliminates duplicate values from the count. Will not count blobs.

This simple aggregate query counts the number of rows in the table and totals the Amount column. In lieu of returning the actual rows from the RawData table, the query returns the summary row with the row count and total. Therefore, even though there are 24 rows in the RawData table, the result is a single row:

  SELECT COUNT(*) AS Count,
         SUM(Amount) AS [Sum]
    FROM RawData;

Result:

  Count    Sum
  24       946
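The null-handling and data-type notes in Table 12-1 can be checked with a small experiment. The following sketch uses a hypothetical table variable rather than RawData, and the multi-row VALUES syntax assumes SQL Server 2008 or later:

```sql
-- Hypothetical sample: two values and one NULL.
DECLARE @Sample TABLE (Amount int NULL);
INSERT INTO @Sample (Amount) VALUES (1), (2), (NULL);

SELECT COUNT(*)                   AS AllRows,      -- 3: counts every row
       COUNT(Amount)              AS NonNullRows,  -- 2: the NULL is skipped
       SUM(Amount)                AS Total,        -- 3
       AVG(Amount)                AS IntAvg,       -- 1: integer result, 3/2 truncated
       AVG(CAST(Amount AS float)) AS FloatAvg      -- 1.5
  FROM @Sample;
```

Depending on session settings, SQL Server may also report a warning that a null value was eliminated by an aggregate; the results are unaffected.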