Nielsen c10.tex V4 - 07/21/2009 12:42pm Page 242 Part II Manipulating Data With Select

    INSERT dbo.One(OnePK, Thing1)
      VALUES (2, 'New Thing');
    INSERT dbo.One(OnePK, Thing1)
      VALUES (3, 'Red Thing');
    INSERT dbo.One(OnePK, Thing1)
      VALUES (4, 'Blue Thing');

    INSERT dbo.Two(TwoPK, OnePK, Thing2)
      VALUES (1, 0, 'Plane');
    INSERT dbo.Two(TwoPK, OnePK, Thing2)
      VALUES (2, 2, 'Train');
    INSERT dbo.Two(TwoPK, OnePK, Thing2)
      VALUES (3, 3, 'Car');
    INSERT dbo.Two(TwoPK, OnePK, Thing2)
      VALUES (4, NULL, 'Cycle');

FIGURE 10-9 The Red Thing Blue Thing example has data to view every type of join.

An inner join between table One and table Two will return only the two matching rows:

    SELECT Thing1, Thing2
      FROM dbo.One
        INNER JOIN dbo.Two
          ON One.OnePK = Two.OnePK;

Result:

    Thing1      Thing2
    New Thing   Train
    Red Thing   Car

A left outer join will extend the inner join and include the rows from table One without a match:

    SELECT Thing1, Thing2
      FROM dbo.One
        LEFT OUTER JOIN dbo.Two
          ON One.OnePK = Two.OnePK;

All the rows are now returned from table One, but two rows are still missing from table Two:

    Thing1      Thing2
    Old Thing   NULL
    New Thing   Train
    Red Thing   Car
    Blue Thing  NULL

A full outer join will retrieve every row from both tables, regardless of a match between the tables:

    SELECT Thing1, Thing2
      FROM dbo.One
        FULL OUTER JOIN dbo.Two
          ON One.OnePK = Two.OnePK;

The plane and cycle from table Two are now listed along with every row from table One:

    Thing1      Thing2
    Old Thing   NULL
    New Thing   Train
    Red Thing   Car
    Blue Thing  NULL
    NULL        Plane
    NULL        Cycle

As this example shows, full outer joins are an excellent tool for finding all the data, even bad data. Set difference queries, explored later in this chapter, build on outer joins to zero in on bad data.
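The remark about finding bad data can be made concrete with a short sketch. It assumes dbo.One and dbo.Two exist and are populated as in the examples above, and it is only an illustration, not the chapter's own set difference query. Adding a WHERE clause to the full outer join keeps just the rows that failed to find a partner on either side:

```sql
-- Sketch: keep only the unmatched rows from a full outer join.
-- Assumes dbo.One and dbo.Two exist and are populated as in the examples above.
SELECT Thing1, Thing2
  FROM dbo.One
    FULL OUTER JOIN dbo.Two
      ON One.OnePK = Two.OnePK
  WHERE One.OnePK IS NULL    -- row from table Two with no partner in table One
     OR Two.TwoPK IS NULL;   -- row from table One with no partner in table Two
```

With the sample data, this should leave Old Thing and Blue Thing from table One, and Plane and Cycle from table Two, which are exactly the rows that carry a NULL in the full outer join result.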
Placing the conditions within outer joins

When working with inner joins, a condition has the same effect whether it’s in the JOIN clause or the WHERE clause, but that’s not the case with outer joins:

■ When the condition is in the JOIN clause, SQL Server includes all rows from the outer table and then uses the condition to include rows from the second table.
■ When the restriction is placed in the WHERE clause, the join is performed and then the WHERE clause is applied to the joined rows.

The following two queries demonstrate the effect of the placement of the condition. In the first query, the left outer join includes all rows from table One and then joins those rows from table Two where OnePK is equal in both tables and Thing1’s value is New Thing. The result is all the rows from table One, and rows from table Two that meet both join restrictions:

    SELECT Thing1, Thing2
      FROM dbo.One
        LEFT OUTER JOIN dbo.Two
          ON One.OnePK = Two.OnePK
          AND One.Thing1 = 'New Thing';

Result:

    Thing1      Thing2
    Old Thing   NULL
    New Thing   Train
    Red Thing   NULL
    Blue Thing  NULL

The second query first performs the left outer join without the AND condition, producing the four rows of the plain left outer join shown earlier. The WHERE clause then restricts that result to those rows where Thing1 is equal to New Thing. The net effect is the same as when an inner join was used (but it might take more execution time):

    SELECT Thing1, Thing2
      FROM dbo.One
        LEFT OUTER JOIN dbo.Two
          ON One.OnePK = Two.OnePK
      WHERE One.Thing1 = 'New Thing';

Result:

    Thing1      Thing2
    New Thing   Train

Multiple outer joins

Coding a query with multiple outer joins can be tricky. Typically, the order of data sources in the FROM clause doesn’t matter, but here it does. The key is to code them in a sequential chain. Think through it this way:

1. Grab all the customers regardless of whether they’ve placed any orders.
2. Then grab all the orders regardless of whether they’ve shipped.
3. Then grab all the ship details.

When chaining multiple outer joins, stick to left outer joins, as mixing left and right outer joins becomes very confusing very fast. Be sure to unit test the query with a small sample set of data to ensure that the outer join chain is correct.

Self-Joins

A self-join is a join that refers back to the same table. This type of unary relationship is often used to extract data from a reflexive (also called a recursive) relationship, such as organizational charts (employee to boss). Think of a self-join as a table being joined with a temporary copy of itself.

The Family sample database uses two self-joins between a child and his or her parents, as shown in the database diagram in Figure 10-10. The mothers and fathers are also people, of course, and are listed in the same table. They link back to their parents, and so on. The sample database is populated with five fictitious generations that can be used for sample queries.

FIGURE 10-10 The database diagram of the Family database includes two unary relationships (children to parents) on the left and a many-to-many unary relationship (husband to wife) on the right.

The key to constructing a self-join is to include a second reference to the table using a table alias. Once the table is available twice to the SELECT statement, the self-join functions much like any other join.
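The employee-to-boss case mentioned above makes a compact illustration of the alias technique. The following is a minimal sketch using a hypothetical temporary table, #Employee, with invented names and columns; it is not part of the Family sample database:

```sql
-- Hypothetical table and data, for illustration only.
CREATE TABLE #Employee (
  EmployeeID INT PRIMARY KEY,
  Name       NVARCHAR(50) NOT NULL,
  ManagerID  INT NULL            -- points back to #Employee.EmployeeID
);

INSERT #Employee (EmployeeID, Name, ManagerID) VALUES (1, 'Pat',   NULL);
INSERT #Employee (EmployeeID, Name, ManagerID) VALUES (2, 'Chris', 1);
INSERT #Employee (EmployeeID, Name, ManagerID) VALUES (3, 'Sam',   1);

-- The same table is referenced twice, once under each alias.
SELECT Emp.Name AS Employee, Boss.Name AS Boss
  FROM #Employee AS Emp
    LEFT OUTER JOIN #Employee AS Boss
      ON Emp.ManagerID = Boss.EmployeeID
  ORDER BY Emp.EmployeeID;

DROP TABLE #Employee;
```

Chris and Sam are each paired with Pat, while Pat, who has no manager, is kept by the left outer join with a NULL in the Boss column. An inner join would drop that top-level row.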
Switching over to the Family sample database, the following query locates the children of Audry Halloway. The dbo.Person table is referenced twice, using the table aliases Child and Mother:

    USE Family;

    SELECT Child.PersonID, Child.FirstName,
        Child.MotherID, Mother.PersonID
      FROM dbo.Person AS Child
        INNER JOIN dbo.Person AS Mother
          ON Child.MotherID = Mother.PersonID
      WHERE Mother.LastName = 'Halloway'
        AND Mother.FirstName = 'Audry';

The query uses the Person table twice. The first reference (aliased as Child) is joined with the second reference (aliased as Mother), which is restricted by the WHERE clause to only Audry Halloway. Only the rows with a MotherID that points back to Audry will be included in the inner join. Audry’s PersonID is 6 and her children are as follows:

    PersonID  FirstName  MotherID  PersonID
    8         Melanie    6         6
    7         Corwin     6         6
    9         Dara       6         6
    10        James      6         6

While the previous query adequately demonstrates a self-join, it would be more useful if the mother weren’t hard-coded in the WHERE clause, and if more information were provided about each birth, as follows:

    SELECT CONVERT(NVARCHAR(15), C.DateOfBirth, 1) AS Date,
        C.FirstName AS Name, C.Gender AS G,
        ISNULL(F.FirstName + ' ' + F.LastName, ' * unknown *') AS Father,
        M.FirstName + ' ' + M.LastName AS Mother
      FROM dbo.Person AS C
        LEFT OUTER JOIN dbo.Person AS F
          ON C.FatherID = F.PersonID
        INNER JOIN dbo.Person AS M
          ON C.MotherID = M.PersonID
      ORDER BY C.DateOfBirth;

This query makes three references to the Person table: the child, the father, and the mother, with mnemonic one-letter aliases.
The result is a better listing:

    Date     Name     G  Father            Mother
    5/19/22  James    M  James Halloway    Kelly Halloway
    8/05/28  Audry    F  Bryan Miller      Karen Miller
    8/19/51  Melanie  F  James Halloway    Audry Halloway
    8/30/53  James    M  James Halloway    Audry Halloway
    2/12/58  Dara     F  James Halloway    Audry Halloway
    3/13/61  Corwin   M  James Halloway    Audry Halloway
    3/13/65  Cameron  M  Richard Campbell  Elizabeth Campbell

For more ideas about working with hierarchies and self-joins, refer to Chapter 17, “Traversing Hierarchies.”

Cross (Unrestricted) Joins

The cross join, also called an unrestricted join, is a pure relational algebra multiplication of the two source tables. Without a join condition restricting the result set, the result set includes every possible combination of rows from the data sources. Each row in data set one is matched with every row in data set two. For example, if the first data source has five rows and the second data source has four rows, a cross join between them would result in 20 rows. This type of result set is referred to as a Cartesian product.

Using the One/Two sample tables, a cross join is constructed in Management Studio by omitting the join condition between the two tables, as shown in Figure 10-11.

FIGURE 10-11 A graphical representation of a cross join is simply two tables without a join condition.
In code, this type of join is specified by the keywords CROSS JOIN and the lack of an ON condition:

    SELECT Thing1, Thing2
      FROM dbo.One
        CROSS JOIN dbo.Two;

The result of a join without restriction is that every row in table One matches with every row from table Two:

    Thing1      Thing2
    Old Thing   Plane
    New Thing   Plane
    Red Thing   Plane
    Blue Thing  Plane
    Old Thing   Train
    New Thing   Train
    Red Thing   Train
    Blue Thing  Train
    Old Thing   Car
    New Thing   Car
    Red Thing   Car
    Blue Thing  Car
    Old Thing   Cycle
    New Thing   Cycle
    Red Thing   Cycle
    Blue Thing  Cycle

Sometimes cross joins are the result of someone forgetting to draw the join in a graphical-query tool; however, they are useful for populating databases with sample data, or for creating empty “pigeonhole” rows for population during a procedure.

Understanding how a cross join multiplies data is also useful when studying relational division, the inverse of relational multiplication. Relational division requires subqueries, so it’s explained in the next chapter.

Exotic Joins

Nearly all joins are based on a condition of equality between the primary key of a primary table and the foreign key of a secondary table, which is why the inner join is sometimes called an equi-join. Although it’s commonplace to base a join on a single equal condition, it is not a requirement. The condition between the two columns is not necessarily equal, nor is the join limited to one condition.

The ON condition of the join is in reality nothing more than a WHERE condition restricting the product of the two joined data sets. WHERE clause conditions may be very flexible and powerful, and the same is true of join conditions. This understanding of the ON condition enables the use of three powerful techniques: θ (theta) joins, multiple-condition joins, and non-key joins.
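The claim that an ON condition is just a WHERE condition over the product of the two data sets can be checked directly. The following sketch, which assumes the dbo.One and dbo.Two sample tables from earlier in the chapter, writes the inner join as a cross join followed by a filter:

```sql
-- Sketch: an inner join expressed as a filtered cross join.
-- Assumes dbo.One and dbo.Two from the earlier examples.
SELECT Thing1, Thing2
  FROM dbo.One
    CROSS JOIN dbo.Two              -- all 16 combinations
  WHERE One.OnePK = Two.OnePK;      -- keep only the matching pairs
```

Logically, this returns the same two rows as the earlier inner join (New Thing with Train, and Red Thing with Car). The explicit INNER JOIN ... ON form is still the more readable way to state the intent.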
Multiple-condition joins

If a join is nothing more than a condition between two data sets, then it makes sense that multiple conditions are possible at the join. In fact, multiple-condition joins and θ joins go hand-in-hand. Without the ability to use multiple-condition joins, θ joins would be of little value.

If the database schema uses natural primary keys, then there are probably tables with composite primary keys, which means queries must use multiple-condition joins.

Join conditions can refer to any table in the FROM clause, enabling interesting three-way joins:

    FROM A
      INNER JOIN B
        ON A.col = B.col
      INNER JOIN C
        ON B.col = C.col
        AND A.col = C.col;

The first query in the previous section, “Placing the Conditions within Outer Joins,” was a multiple-condition join.

θ (theta) joins

A theta join (depicted throughout as θ) is a join based on a non-equal ON condition. In relational theory, conditional operators (=, >, <, >=, <=, <>) are called θ operators. While the equals condition is technically a θ operator, it is commonly used, so only joins with conditions other than equal are referred to as θ joins.

The θ condition may be set within Management Studio’s Query Designer using the join Properties dialog, as previously shown in Figure 10-7.

Non-key joins

Joins are not limited to primary and foreign keys. The join can match a row in one data source with a row in another data source using any column, as long as the columns share compatible data types and the data match.

For example, an inventory allocation system would use a non-key join to find products that are expected to arrive from the supplier before the customer’s required ship date. A non-key join between the PurchaseOrder and OrderDetail tables with a θ condition between PO.DateExpected and OD.DateRequired will filter the join to those products that can be allocated to the customer’s orders.
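Because the PurchaseOrder and OrderDetail tables are not part of any sample database, a small runnable sketch may help when experimenting with this kind of condition. The temporary tables, their columns, and the rows below are all hypothetical and exist only for illustration:

```sql
-- Hypothetical tables and data, for illustration only.
CREATE TABLE #OrderDetail (
  OrderID      INT,
  ProductID    INT,
  DateRequired DATETIME
);
CREATE TABLE #PurchaseOrder (
  POID         INT,
  ProductID    INT,
  DateExpected DATETIME
);

INSERT #OrderDetail (OrderID, ProductID, DateRequired)
  VALUES (1, 100, '20090315');
INSERT #OrderDetail (OrderID, ProductID, DateRequired)
  VALUES (2, 200, '20090301');

INSERT #PurchaseOrder (POID, ProductID, DateExpected)
  VALUES (10, 100, '20090310');   -- arrives before order 1 is required
INSERT #PurchaseOrder (POID, ProductID, DateExpected)
  VALUES (11, 200, '20090320');   -- arrives after order 2 is required

SELECT OD.OrderID, OD.ProductID, PO.POID
  FROM #OrderDetail AS OD
    INNER JOIN #PurchaseOrder AS PO
      ON OD.ProductID = PO.ProductID           -- equality on a shared column
      AND OD.DateRequired > PO.DateExpected;   -- non-key, non-equal condition

DROP TABLE #OrderDetail;
DROP TABLE #PurchaseOrder;
```

Only order 1 is returned, paired with purchase order 10. Order 2 matches purchase order 11 on ProductID, but the shipment is expected after the required date, so the date condition rejects the pair.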
The following code demonstrates the non-key join (this is not in a sample database):

    SELECT OD.OrderID, OD.ProductID, PO.POID
      FROM OrderDetail AS OD
        INNER JOIN PurchaseOrder AS PO
          ON OD.ProductID = PO.ProductID
          AND OD.DateRequired > PO.DateExpected;

When working with inner joins, non-key join conditions can be placed in the WHERE clause or in the JOIN. Because the conditions compare similar values between two joined tables, I often place these conditions in the JOIN portion of the FROM clause, rather than the WHERE clause. The critical difference depends on whether you view the conditions as a part of creating the record set upon which the rest of the SQL SELECT statement is acting, or as a filtering task that follows the FROM clause. Either way, the query-optimization plan is identical, so use the method that is most readable and seems most logical to you. Note that when constructing outer joins, the placement of the condition in the JOIN or in the WHERE clause yields different results, as explained earlier in the section “Placing the Conditions within Outer Joins.”

Asking the question, “Who are twins?” of the Family sample database uses all three exotic join techniques in the join between person and twin. The join contains three conditions. The Person.PersonID <> Twin.PersonID condition is a θ join that prevents a person from being considered his or her own twin. The join condition on MotherID, while a foreign key, is nonstandard because it is being joined with another foreign key.
The DateOfBirth condition is definitely a non-key join condition:

    SELECT Person.FirstName + ' ' + Person.LastName AS Person,
        Twin.FirstName + ' ' + Twin.LastName AS Twin,
        Person.DateOfBirth
      FROM dbo.Person
        INNER JOIN dbo.Person AS Twin
          ON Person.PersonID <> Twin.PersonID
          AND Person.MotherID = Twin.MotherID
          AND Person.DateOfBirth = Twin.DateOfBirth;

The following is the same query, this time with the exotic join condition moved to the WHERE clause. Not surprisingly, SQL Server’s Query Optimizer produces the exact same query execution plan for each query:

    SELECT Person.FirstName + ' ' + Person.LastName AS Person,
        Twin.FirstName + ' ' + Twin.LastName AS Twin,
        Person.DateOfBirth
      FROM dbo.Person
        INNER JOIN dbo.Person AS Twin
          ON Person.MotherID = Twin.MotherID
          AND Person.DateOfBirth = Twin.DateOfBirth
      WHERE Person.PersonID <> Twin.PersonID;

Result:

    Person          Twin            DateOfBirth
    Abbie Halloway  Allie Halloway  1979-10-14 00:00:00.000
    Allie Halloway  Abbie Halloway  1979-10-14 00:00:00.000

The difficult query scenarios at the end of the next chapter also demonstrate exotic joins, which are often used with subqueries.

Set Difference Queries

A query type that’s useful for analyzing the correlation between two data sets is a set difference query, sometimes called a left (or right) anti-semi join, which finds the difference between the two data sets based on the conditions of the join. In relational algebra terms, it removes the divisor from the dividend, leaving the difference. This type of query is the inverse of an inner join. Informally, it’s called a find unmatched rows query.

Set difference queries are great for locating out-of-place data or data that doesn’t match, such as rows that are in data set one but not in data set two (see Figure 10-12).

FIGURE 10-12 The set difference query finds data that is outside the intersection of the two data sets.
Left set difference query

A left set difference query finds all the rows on the left side of the join without a match on the right side of the join.

Using the One and Two sample tables, the following query locates all rows in table One without a match in table Two, removing set two (the divisor) from set one (the dividend). The result will be the rows from set one that do not have a match in set two.

The outer join already includes the rows outside the intersection, so to construct a set difference query use an OUTER JOIN with an IS NULL restriction on the second data set’s primary key. This will return all the rows from table One that do not have a match in table Two:

    USE tempdb;

    SELECT Thing1, Thing2
      FROM dbo.One
        LEFT OUTER JOIN dbo.Two
          ON One.OnePK = Two.OnePK
      WHERE Two.TwoPK IS NULL;

Table One’s difference is as follows:

    Thing1      Thing2
    Old Thing   NULL
    Blue Thing  NULL
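The same pattern can be pointed the other way. The following sketch is an illustration rather than code taken from the chapter, and it assumes the dbo.One and dbo.Two sample tables used above. It makes table Two the outer table and tests table One’s primary key for NULL, so it finds the rows in table Two that have no match in table One:

```sql
-- Sketch: rows in table Two without a match in table One.
-- Assumes dbo.One and dbo.Two from the earlier examples.
SELECT Thing2
  FROM dbo.Two
    LEFT OUTER JOIN dbo.One
      ON Two.OnePK = One.OnePK
  WHERE One.OnePK IS NULL;   -- no partner row was found in table One
```

With the sample data, this should return Plane (whose OnePK of 0 matches nothing in table One) and Cycle (whose OnePK is NULL and therefore cannot satisfy the equality).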