342 CHAPTER 17: THE SELECT STATEMENT

However, I could write:

    SELECT sup_id, sup_name,
           (SELECT COUNT(*)
              FROM Orders
             WHERE Suppliers.sup_id = Orders.sup_id)
      FROM Suppliers;

instead of writing:

    SELECT Suppliers.sup_id, sup_name, COUNT(Orders.sup_id)
      FROM Suppliers
           LEFT OUTER JOIN
           Orders
           ON Suppliers.sup_id = Orders.sup_id
     GROUP BY Suppliers.sup_id, sup_name;

The OUTER JOIN version counts Orders.sup_id rather than using COUNT(*), so that a supplier with no orders shows zero instead of one for its single NULL-padded row.

17.2.2 NULLs and OUTER JOINs

The NULLs generated by the OUTER JOIN can occur in columns derived from source table columns that have been declared to be NOT NULL. Even if you tried to avoid all the problems with NULLs by making every column in every table of your database schema NOT NULL, they could still occur in OUTER JOIN and OLAP function results. However, a table can have NULLs and still be used in an OUTER JOIN. Consider different JOINs on the following two tables, which have NULLs in the common column:

    T1              T2
    a   x           b   x
    ========        ========
    1   'r'         7   'r'
    2   'v'         8   's'
    3   NULL        9   NULL

A natural INNER JOIN on column x can only match those values that are equal to each other. But NULLs do not match anything, even other NULLs. Thus, there is one row in the result, on the value 'r' in column x in both tables.

17.2 OUTER JOINs 343

    T1 INNER JOIN T2 ON (T1.x = T2.x)
    a   T1.x   b   T2.x
    ========================
    1   'r'    7   'r'

Now do a LEFT OUTER JOIN on the tables, which will preserve table T1, and you get:

    T1 LEFT OUTER JOIN T2 ON (T1.x = T2.x)
    a   T1.x   b      T2.x
    ===========================
    1   'r'    7      'r'
    2   'v'    NULL   NULL
    3   NULL   NULL   NULL

Again, there are no surprises. The original INNER JOIN row is still in the results. The other two rows of T1 that were not in the equi-JOIN do show up in the results, and the columns derived from table T2 are filled with NULLs. The RIGHT OUTER JOIN would behave the same way.
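The INNER JOIN and LEFT OUTER JOIN results above can be checked with a short script. This is a minimal sketch, assuming Python's standard sqlite3 module and a throwaway in-memory database; SQLite is not a product discussed in this chapter, and the column types are guesses, since the text gives only the table contents.

```python
import sqlite3

# In-memory database: an assumption of this sketch, not part of the text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T1 (a INTEGER NOT NULL, x CHAR(1));
    CREATE TABLE T2 (b INTEGER NOT NULL, x CHAR(1));
    INSERT INTO T1 VALUES (1, 'r'), (2, 'v'), (3, NULL);
    INSERT INTO T2 VALUES (7, 'r'), (8, 's'), (9, NULL);
""")

# Only equal, non-NULL values of x can match.
inner_rows = conn.execute("""
    SELECT T1.a, T1.x, T2.b, T2.x
      FROM T1 INNER JOIN T2 ON T1.x = T2.x
     ORDER BY T1.a
""").fetchall()

# T1 is preserved; unmatched rows get NULLs in the T2 columns.
left_rows = conn.execute("""
    SELECT T1.a, T1.x, T2.b, T2.x
      FROM T1 LEFT OUTER JOIN T2 ON T1.x = T2.x
     ORDER BY T1.a
""").fetchall()

print(inner_rows)
print(left_rows)
```

Python reports an SQL NULL as None, so the row for a = 3 comes back as (3, None, None, None): its original NULL in x and the two generated NULLs look exactly alike.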
The problems start with the FULL OUTER JOIN, which looks like this:

    T1 FULL OUTER JOIN T2 ON (T1.x = T2.x)
    a      T1.x   b      T2.x
    ============================
    1      'r'    7      'r'
    2      'v'    NULL   NULL
    3      NULL   NULL   NULL
    NULL   NULL   8      's'
    NULL   NULL   9      NULL

The way this result is constructed is worth explaining in detail. First do an INNER JOIN on T1 and T2, using the ON clause condition, and put those rows (if any) in the results. Then all rows in T1 that could not be joined are padded out with NULLs in the columns derived from T2 and inserted into the results. Finally, take the rows in T2 that could not be joined, pad them out with NULLs, and insert them into the results.

The bad news is that the original tables cannot be reconstructed from an OUTER JOIN. Look at the results of the FULL OUTER JOIN, which we will call R1, and SELECT the first columns from it:

    SELECT T1.a, T1.x FROM R1

    a      x
    =============
    1      'r'
    2      'v'
    3      NULL
    NULL   NULL
    NULL   NULL

The created NULLs remain and cannot be differentiated from the original NULLs. But you cannot throw out those duplicate rows, because they may be in the original table T1.

17.2.3 NATURAL versus Searched OUTER JOINs

It is worth mentioning in passing that Standard SQL has a NATURAL LEFT OUTER JOIN, but it is not implemented in most current versions of SQL. Even those that have the syntax are actually creating an ON clause with equality tests, like the examples we have been using in this chapter.

A NATURAL JOIN has only one copy of the common column pairs in its result. The searched OUTER JOIN has both of the original columns, with their table-qualified names. The NATURAL JOIN has to have a correlation name for the result table to identify the shared columns.
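The one-copy versus two-copy difference can be seen by inspecting the result column names. The sketch below assumes Python's sqlite3 module and SQLite, which happens to accept the NATURAL LEFT OUTER JOIN syntax; other products may behave differently, and the column types are guesses.

```python
import sqlite3

# In-memory database: an assumption of this sketch, not part of the text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T1 (a INTEGER NOT NULL, x CHAR(1));
    CREATE TABLE T2 (b INTEGER NOT NULL, x CHAR(1));
    INSERT INTO T1 VALUES (1, 'r'), (2, 'v'), (3, NULL);
    INSERT INTO T2 VALUES (7, 'r'), (8, 's'), (9, NULL);
""")

# NATURAL join: the shared column x should appear once in the result.
natural = conn.execute("SELECT * FROM T1 NATURAL LEFT OUTER JOIN T2")
natural_cols = [d[0] for d in natural.description]

# Searched join: both copies of x are kept.
searched = conn.execute(
    "SELECT * FROM T1 LEFT OUTER JOIN T2 ON T1.x = T2.x")
searched_cols = [d[0] for d in searched.description]

# In the NATURAL join the shared column can be named without a qualifier.
natural_rows = conn.execute("""
    SELECT x, a, b
      FROM T1 NATURAL LEFT OUTER JOIN T2
     ORDER BY a
""").fetchall()

print(natural_cols, searched_cols)
print(natural_rows)
```

The searched join reports x twice, which is why its columns need table-qualified names, while the NATURAL join lets x be referenced on its own.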
We can build a NATURAL LEFT OUTER JOIN by using the COALESCE() function to combine the common column pairs into a single column and put the results into a VIEW where the columns can be properly named, thus:

    CREATE VIEW NLOJ12 (x, a, b)
    AS SELECT COALESCE(T1.x, T2.x), T1.a, T2.b
         FROM T1 LEFT OUTER JOIN T2 ON T1.x = T2.x;

    NLOJ12
    x      a   b
    ================
    'r'    1   7
    'v'    2   NULL
    NULL   3   NULL

Unlike the NATURAL JOINs, the searched OUTER JOIN does not have to use a simple one-column equality as the JOIN search condition. The search condition can have several predicates, use other comparisons, and so forth. For example,

    T1 LEFT OUTER JOIN T2 ON (T1.x < T2.x)
    a   T1.x   b      T2.x
    ===========================
    1   'r'    8      's'
    2   'v'    NULL   NULL
    3   NULL   NULL   NULL

as compared to:

    T1 LEFT OUTER JOIN T2 ON (T1.x > T2.x)
    a   T1.x   b      T2.x
    ===========================
    1   'r'    NULL   NULL
    2   'v'    7      'r'
    2   'v'    8      's'
    3   NULL   NULL   NULL

Again, so much of current OUTER JOIN behavior is vendor-specific that the programmer should experiment with his own particular product to see what actually happens.

17.2.4 Self OUTER JOINs

There is no rule that forbids an OUTER JOIN on the same table. In fact, this kind of self-join is a good trick for “flattening” a normalized table into a horizontal report. To illustrate the method, start with a table defined as

    CREATE TABLE Credits
    (student_nbr INTEGER NOT NULL,
     course_name CHAR(8) NOT NULL,
     PRIMARY KEY (student_nbr, course_name));

This table represents student IDs and a course name for each class they have taken. However, our rules say that students cannot get credit for CS-102 until they have taken its prerequisite, CS-101; they cannot get credit for CS-103 until they have taken its prerequisite, CS-102; and so forth. Let’s first load the table with some sample values.
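One way to load the four sample rows used in this section and try a flattening self-join is sketched below. It assumes Python's sqlite3 module and an in-memory database. The query here puts the CS-101 test on the preserved copy C1 in a WHERE clause; that placement is an assumption of this sketch, chosen so that only students who have CS-101 appear in the report.

```python
import sqlite3

# In-memory database: an assumption of this sketch, not part of the text.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Credits
    (student_nbr INTEGER NOT NULL,
     course_name CHAR(8) NOT NULL,
     PRIMARY KEY (student_nbr, course_name))
""")

# The sample values: student 1 has both courses, student 2 only the
# first, and student 3 skipped the prerequisite.
sample_rows = [(1, 'CS-101'), (1, 'CS-102'), (2, 'CS-101'), (3, 'CS-102')]
conn.executemany("INSERT INTO Credits VALUES (?, ?)", sample_rows)

# C1 is the preserved copy; C2 supplies the second course when it exists.
flat = conn.execute("""
    SELECT C1.student_nbr, C1.course_name, C2.course_name
      FROM Credits AS C1
           LEFT OUTER JOIN
           Credits AS C2
           ON C1.student_nbr = C2.student_nbr
              AND C2.course_name = 'CS-102'
     WHERE C1.course_name = 'CS-101'
     ORDER BY C1.student_nbr
""").fetchall()

print(flat)
```

Student 3 does not appear, because there is no CS-101 row for the preserved copy of the table to start from.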
Notice that student 1 has both courses, student 2 has only the first of the series, and student 3 jumped ahead of sequence and therefore cannot get credit for his CS-102 course until he goes back and takes CS-101 as a prerequisite.

    Credits
    student_nbr   course_name
    ==========================
    1             'CS-101'
    1             'CS-102'
    2             'CS-101'
    3             'CS-102'

What we want is basically a histogram (bar chart) for each student, showing how far he or she has gone in his or her degree program. Assume that we are only looking at two courses; the result of the desired query might look like this (NULL is used to represent a missing value):

    (1, 'CS-101', 'CS-102')
    (2, 'CS-101', NULL)

Clearly, this will need a self-JOIN, since the last two columns come from the same table, Credits. You have to give correlation names to both uses of the Credits table in the OUTER JOIN operator when you construct a self OUTER JOIN, just as you would with any other self-JOIN, thus:

    SELECT C1.student_nbr, C1.course_name, C2.course_name
      FROM Credits AS C1
           LEFT OUTER JOIN
           Credits AS C2
           ON C1.student_nbr = C2.student_nbr
              AND C2.course_name = 'CS-102'
     WHERE C1.course_name = 'CS-101';

The test for 'CS-101' goes in the WHERE clause because C1 is the preserved table: a condition on C1 in the ON clause would not remove any of its rows, so the CS-102 rows of C1 would also show up, padded with NULLs.

17.2.5 Two or More OUTER JOINs

Some relational purists feel that every operator should have an inverse, and therefore they do not like the OUTER JOIN. Others feel that the created NULLs are fundamentally different from the explicit NULLs in a base table and should have a special token. SQL uses its general-purpose NULLs and leaves things at that. Getting away from theory, you will also find that vendors have often done strange things with the ways their products work.

A major problem is that OUTER JOIN operators do not have the same properties as INNER JOIN operators. The order in which FULL OUTER JOINs are executed will change the results (a mathematician would say that they are not associative).
To show some of the problems that can come up when you have more than two tables, let us use three very simple two-column tables. Notice that some of the column values match and some do not match, but the three tables have all possible pairs of column names in them.

    CREATE TABLE T1 (a INTEGER NOT NULL, b INTEGER NOT NULL);
    INSERT INTO T1 VALUES (1, 2);

    CREATE TABLE T2 (a INTEGER NOT NULL, c INTEGER NOT NULL);
    INSERT INTO T2 VALUES (1, 3);

    CREATE TABLE T3 (b INTEGER NOT NULL, c INTEGER NOT NULL);
    INSERT INTO T3 VALUES (2, 100);

Now let’s try some of the possible orderings of the three tables in a chain of LEFT OUTER JOINs. The problem is that a table can be preserved or unpreserved in the immediate JOIN and in the opposite state in the containing JOIN.

    SELECT T1.a, T1.b, T3.c
      FROM ((T1 NATURAL LEFT OUTER JOIN T2)
            NATURAL LEFT OUTER JOIN T3);

    Result
    a   b   c
    =============
    1   2   NULL

    SELECT T1.a, T1.b, T3.c
      FROM ((T1 NATURAL LEFT OUTER JOIN T3)
            NATURAL LEFT OUTER JOIN T2);

    Result
    a   b   c
    =============
    1   2   100

    SELECT T1.a, T1.b, T3.c
      FROM ((T2 NATURAL LEFT OUTER JOIN T3)
            NATURAL LEFT OUTER JOIN T1);

    Result
    a      b      c
    ==================
    NULL   NULL   NULL

Even worse, the choice of column in the SELECT list can change the output. Instead of displaying T3.c, use T2.c and you will get:

    SELECT T1.a, T1.b, T2.c
      FROM ((T2 NATURAL LEFT OUTER JOIN T3)
            NATURAL LEFT OUTER JOIN T1);

    Result
    a      b      c
    ==================
    NULL   NULL   3

17.2.6 OUTER JOINs and Aggregate Functions

At the start of this chapter, we had a table of orders and a table of suppliers, which were to be used to build a report to tell us how much business we did with each supplier. The query that will do this is:

    SELECT Suppliers.sup_id, sup_name, SUM(order_amt)
      FROM Suppliers
           LEFT OUTER JOIN
           Orders
           ON Suppliers.sup_id = Orders.sup_id
     GROUP BY Suppliers.sup_id, sup_name;

Some suppliers’ totals include credits for returned merchandise, so that our total business with them worked out to zero dollars.
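The report query can be tried on a small data set. The sketch below assumes Python's sqlite3 module; the supplier names, the order_nbr column, and all amounts are hypothetical, since the text does not give the Suppliers and Orders declarations. Supplier S2 has an order and a matching credit, while S3 has no orders at all.

```python
import sqlite3

# In-memory database and all sample data are assumptions of this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Suppliers
    (sup_id CHAR(2) NOT NULL PRIMARY KEY,
     sup_name CHAR(10) NOT NULL);
    CREATE TABLE Orders
    (order_nbr INTEGER NOT NULL PRIMARY KEY,  -- hypothetical key column
     sup_id CHAR(2),
     order_amt INTEGER NOT NULL);
    INSERT INTO Suppliers VALUES ('S1', 'Acme'), ('S2', 'Bolt'), ('S3', 'Crate');
    INSERT INTO Orders VALUES (1, 'S1', 500), (2, 'S2', 100), (3, 'S2', -100);
""")

# Plain SUM() next to a COALESCE-wrapped SUM() for comparison.
report = conn.execute("""
    SELECT Suppliers.sup_id, sup_name,
           SUM(order_amt),
           COALESCE(SUM(order_amt), 0)
      FROM Suppliers
           LEFT OUTER JOIN
           Orders
           ON Suppliers.sup_id = Orders.sup_id
     GROUP BY Suppliers.sup_id, sup_name
     ORDER BY Suppliers.sup_id
""").fetchall()

for row in report:
    print(row)
```

In SQLite, SUM() over nothing but NULLs returns NULL, so S3 is reported with None in the third column, while S2 has a genuine total of zero. Once COALESCE() turns the NULL into zero, the two cases can no longer be told apart from the total alone.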
Each supplier with which we did no business will have a NULL in its order_amt column in the OUTER JOIN. The usual rules for aggregate functions with NULL values apply: SUM() drops the NULLs before adding, and when nothing is left it returns NULL, so these suppliers show a NULL total unless you write COALESCE(SUM(order_amt), 0) to report zero. It is also possible to use a function inside an aggregate function, so you could write SUM(COALESCE(T1.x, T2.x)) for the common column pairs.

If you need to tell the difference between a true sum of zero and a zero produced from the NULLs of an OUTER JOIN, use the MIN() or MAX() function on the questionable column. These functions both return a NULL result when every input is NULL, so an expression wrapped around the MAX() function could be used to print a message: COALESCE(CAST(MAX(order_amt) AS VARCHAR(10)), 'No Orders'), for example. Likewise, these functions could be used in a HAVING clause, but that would defeat the purpose of an OUTER JOIN.

17.2.7 FULL OUTER JOIN

The FULL OUTER JOIN is a mix of the LEFT and RIGHT OUTER JOINs, with preserved rows constructed from both tables. The statement takes two tables and puts them in one result table. Again, this is easier to explain with an example than with a formal definition. It is also a way to show how to form a query that will perform the same function. Using Suppliers and Orders again, we find that we have suppliers with whom we have done no business, but we also have orders for which we have not decided on suppliers.
To get all orders and all suppliers in one result table, we could use the SQL-89 query:

    SELECT Suppliers.sup_id, sup_name, order_amt  -- regular INNER JOIN
      FROM Suppliers, Orders
     WHERE Suppliers.sup_id = Orders.sup_id
    UNION ALL
    SELECT sup_id, sup_name, CAST (NULL AS INTEGER)  -- preserved rows of LEFT JOIN
      FROM Suppliers
     WHERE NOT EXISTS
           (SELECT *
              FROM Orders
             WHERE Suppliers.sup_id = Orders.sup_id)
    UNION ALL
    SELECT CAST (NULL AS CHAR(2)), CAST (NULL AS CHAR(10)), order_amt  -- preserved rows of RIGHT JOIN
      FROM Orders
     WHERE NOT EXISTS
           (SELECT *
              FROM Suppliers
             WHERE Suppliers.sup_id = Orders.sup_id);

The same thing in Standard SQL would be:

    SELECT Suppliers.sup_id, sup_name, order_amt
      FROM Orders
           FULL OUTER JOIN
           Suppliers
           ON (Suppliers.sup_id = Orders.sup_id);

The FULL OUTER JOIN is not used as much as a LEFT or RIGHT OUTER JOIN. When you are doing a report, it is usually done from a viewpoint that leads to preserving only one side of the JOIN. That is, you might ask “What suppliers got no business from us?” or ask “What orders have not been assigned a supplier?” but a combination of the two questions is not likely to be in the same report.

17.2.8 WHERE Clause OUTER JOIN Operators

As we have seen, SQL engines that use special operators in the WHERE clause for OUTER JOIN syntax get strange results. But with the Standard SQL syntax for OUTER JOINs, the programmer has to be careful in the WHERE clause to qualify the JOIN columns of the same name to be sure that he picks up the preserved column. Both of these are legal queries:

    SELECT *
      FROM T1 LEFT OUTER JOIN T2 ON T1.a = T2.a
     WHERE T1.a = 15;

versus

    SELECT *
      FROM T1 LEFT OUTER JOIN T2 ON T1.a = T2.a
     WHERE T2.a = 15;

However, the second one will reject the rows with generated NULLs in them. If that is what you wanted, then why bother with an OUTER JOIN in the first place?
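The difference between the two queries shows up as soon as the value 15 exists only in the preserved table. The sketch below assumes Python's sqlite3 module and two hypothetical one-column tables; the sample values are invented for the illustration.

```python
import sqlite3

# In-memory database and sample values are assumptions of this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T1 (a INTEGER NOT NULL);
    CREATE TABLE T2 (a INTEGER NOT NULL);
    INSERT INTO T1 VALUES (15), (20);
    INSERT INTO T2 VALUES (20);
""")

# Filter on the preserved table: the NULL-padded row survives.
on_preserved = conn.execute("""
    SELECT T1.a, T2.a
      FROM T1 LEFT OUTER JOIN T2 ON T1.a = T2.a
     WHERE T1.a = 15
""").fetchall()

# Filter on the unpreserved table: the generated NULL fails the test,
# so the padded row is rejected.
on_unpreserved = conn.execute("""
    SELECT T1.a, T2.a
      FROM T1 LEFT OUTER JOIN T2 ON T1.a = T2.a
     WHERE T2.a = 15
""").fetchall()

print(on_preserved)
print(on_unpreserved)
```

The first query returns the row for 15 with a generated NULL beside it; the second returns nothing, which is exactly what an INNER JOIN with the same WHERE clause would give.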
There is also a UNION JOIN in the SQL-92 Standard, which returns the results of a FULL OUTER JOIN without the rows that were in the INNER JOIN of the two tables. No product has implemented it as of 2005. Figure 17.1 shows the various JOINs.

Figure 17.1 SQL JOIN Functions.

17.3 Old versus New JOIN Syntax

One of the classics of software engineering is a short paper by the late Edsger Dijkstra entitled “Go To Statement Considered Harmful” (Dijkstra 1968, pp. 147-148). Dijkstra argued for dropping the GOTO statement from programming languages in favor of what we now call structured programming. One of his observations was that programs that used blocks, WHILE loops, and IF-THEN-ELSE statements were easier to read and maintain. Programs that jumped around via GOTO statements were harder to follow, because the execution path could have arrived at a statement label from anywhere in the code.

With the SQL-92 Standard, we added a set of infixed join operators to SQL, making the syntax closer to the way that relational algebra looks. The infixed OUTER JOIN syntax was meant to replace several different vendor options, which all had different syntax and semantics. It was absolutely needed. But while we were fixing that problem, we also added a few more options because they were easy to define. Most of them have not been