CHAPTER 17: THE SELECT STATEMENT

17.1 SELECT and JOINs

2  'Smyth'   'F'  17  |  8  'A'
3  'Smoot'   'F'  16  |  1  'A'
3  'Smoot'   'F'  16  |  2  'B'
3  'Smoot'   'F'  16  |  3  'C'
3  'Smoot'   'F'  16  |  4  'D'
3  'Smoot'   'F'  16  |  5  'A'
3  'Smoot'   'F'  16  |  6  'A'
3  'Smoot'   'F'  16  |  7  'A'
3  'Smoot'   'F'  16  |  8  'A'
4  'Adams'   'F'  17  |  1  'A'
4  'Adams'   'F'  17  |  2  'B'
4  'Adams'   'F'  17  |  3  'C'
4  'Adams'   'F'  17  |  4  'D'
4  'Adams'   'F'  17  |  5  'A'
4  'Adams'   'F'  17  |  6  'A'
4  'Adams'   'F'  17  |  7  'A'
4  'Adams'   'F'  17  |  8  'A'
5  'Jones'   'M'  16  |  1  'A'
5  'Jones'   'M'  16  |  2  'B'
5  'Jones'   'M'  16  |  3  'C'
5  'Jones'   'M'  16  |  4  'D'
5  'Jones'   'M'  16  |  5  'A'
5  'Jones'   'M'  16  |  6  'A'
5  'Jones'   'M'  16  |  7  'A'
5  'Jones'   'M'  16  |  8  'A'
6  'Celko'   'M'  17  |  1  'A'
6  'Celko'   'M'  17  |  2  'B'
6  'Celko'   'M'  17  |  3  'C'
6  'Celko'   'M'  17  |  4  'D'
6  'Celko'   'M'  17  |  5  'A'
6  'Celko'   'M'  17  |  6  'A'
6  'Celko'   'M'  17  |  7  'A'
6  'Celko'   'M'  17  |  8  'A'
7  'Vennor'  'F'  16  |  1  'A'
7  'Vennor'  'F'  16  |  2  'B'
7  'Vennor'  'F'  16  |  3  'C'
7  'Vennor'  'F'  16  |  4  'D'
7  'Vennor'  'F'  16  |  5  'A'
7  'Vennor'  'F'  16  |  6  'A'
7  'Vennor'  'F'  16  |  7  'A'
7  'Vennor'  'F'  16  |  8  'A'
8  'Murray'  'M'  18  |  1  'A'
8  'Murray'  'M'  18  |  2  'B'
8  'Murray'  'M'  18  |  3  'C'
8  'Murray'  'M'  18  |  4  'D'
8  'Murray'  'M'  18  |  5  'A'
8  'Murray'  'M'  18  |  6  'A'
8  'Murray'  'M'  18  |  7  'A'
8  'Murray'  'M'  18  |  8  'A'

There are two predicates in the WHERE clause. The first predicate, grade = 'A', needs only the Gradebook table, since that is where the grade column lives. In fact, an optimizer in a real SQL engine would have removed those rows in the Gradebook table that failed the test before doing the CROSS JOIN. The second predicate is Students.stud_nbr = Gradebook.stud_nbr, which requires both tables and the constructed row. Predicates that use values from two tables are called JOIN conditions for obvious reasons. Now remove the rows that do not meet the conditions.
After the WHERE clause, the result table looks like this:

Cross Join after WHERE clause

          Students              |  Gradebook
stud_nbr  stud_name  sex  age   |  stud_nbr  grade
=======================================================
       1  'Smith'    'M'   16   |         1  'A'
       5  'Jones'    'M'   16   |         5  'A'
       6  'Celko'    'M'   17   |         6  'A'
       7  'Vennor'   'F'   16   |         7  'A'
       8  'Murray'   'M'   18   |         8  'A'

We have a GROUP BY clause that will group the working table by sex, thus:

by sex

          Students              |  Gradebook
stud_nbr  stud_name  sex  age   |  stud_nbr  grade
=======================================================
       1  'Smith'    'M'   16   |         1  'A'     sex = 'M' group
       5  'Jones'    'M'   16   |         5  'A'
       6  'Celko'    'M'   17   |         6  'A'
       8  'Murray'   'M'   18   |         8  'A'
       7  'Vennor'   'F'   16   |         7  'A'     sex = 'F' group

And the aggregate functions in the SELECT clause are computed for each group:

Aggregate functions

sex  COUNT(*)  AVG(age)  (MAX(age) - MIN(age)) AS age_range
================================================================
'F'         1     16.00  (16 - 16) = 0
'M'         4     16.75  (18 - 16) = 2

The HAVING clause is applied to each group, the SELECT clause is applied last, and we get the final results:

Full query with HAVING clause

sex  COUNT(*)  AVG(age)  age_range
=======================================
'M'         4     16.75          2

Obviously, no real implementation actually produces these intermediate tables; that would be insanely expensive. They are just models to demonstrate how a statement works. The FROM clause can have joins and other operators that create working tables, but the same steps are followed in this order in a nested fashion. Subqueries in the WHERE clause are parsed and expanded the same way.

17.1.2 Correlated Subqueries in a SELECT Statement

A correlated subquery is a subquery that references columns in the tables of its containing query. This is a way to “hide a loop” in SQL. Consider a query to find all the students who are younger than the oldest student of their gender:

    SELECT *
      FROM Students AS S1
     WHERE age < (SELECT MAX(age)
                    FROM Students AS S2
                   WHERE S1.sex = S2.sex);

1.
A copy of the table is made for each correlation name, S1 and S2.

Students AS S1:

stud_nbr  stud_name  sex  age
=================================
       1  'Smith'    'M'   16
       2  'Smyth'    'F'   17
       3  'Smoot'    'F'   16
       4  'Adams'    'F'   17
       5  'Jones'    'M'   16
       6  'Celko'    'M'   17
       7  'Vennor'   'F'   16
       8  'Murray'   'M'   18

2. When you get to the WHERE clause and find the innermost query, you will see that you need to get data from the containing query. The model of execution says that each outer row has the subquery executed on it in parallel with the other rows. Assume we are working on student (1, 'Smith'), who is male. The query in effect becomes:

    SELECT 1, 'Smith', 'M', 16
      FROM Students AS S1
     WHERE 16 < (SELECT MAX(age)
                   FROM Students AS S2
                  WHERE 'M' = S2.sex);

3. The subquery can now be calculated for male students; the maximum age is 18. When we expand this out for all the other rows, this will give us:

    SELECT 1, 'Smith', 'M', 16 FROM Students AS S1 WHERE 16 < 18;
    SELECT 2, 'Smyth', 'F', 17 FROM Students AS S1 WHERE 17 < 17;
    SELECT 3, 'Smoot', 'F', 16 FROM Students AS S1 WHERE 16 < 17;
    SELECT 4, 'Adams', 'F', 17 FROM Students AS S1 WHERE 17 < 17;
    SELECT 5, 'Jones', 'M', 16 FROM Students AS S1 WHERE 16 < 18;
    SELECT 6, 'Celko', 'M', 17 FROM Students AS S1 WHERE 17 < 18;
    SELECT 7, 'Vennor', 'F', 16 FROM Students AS S1 WHERE 16 < 17;
    SELECT 8, 'Murray', 'M', 18 FROM Students AS S1 WHERE 18 < 18;

4. These same steps have been done for each row in the containing query. The model is that all of the subqueries are resolved at once, but again, no implementation really does it that way. The usual approach is to build procedural loops in the database engine that scan through both tables. The optimizer decides what table is in what loop.
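The "procedural loops" idea in step 4 can be sketched in a few lines of Python. This is only an illustration of the naive nested-loop model, not how any particular engine works; the function name and the tuple layout are assumptions made for the example, and the rows are the Students rows shown above.

```python
# Rows of the Students table from the text: (stud_nbr, stud_name, sex, age).
students = [
    (1, "Smith", "M", 16),
    (2, "Smyth", "F", 17),
    (3, "Smoot", "F", 16),
    (4, "Adams", "F", 17),
    (5, "Jones", "M", 16),
    (6, "Celko", "M", 17),
    (7, "Vennor", "F", 16),
    (8, "Murray", "M", 18),
]


def younger_than_oldest_of_same_sex(rows):
    """Hypothetical helper: naive nested-loop evaluation of the correlated subquery."""
    result = []
    for s1 in rows:  # outer loop: one pass per row of S1
        # Inner loop plays the role of:
        #   SELECT MAX(age) FROM Students AS S2 WHERE S1.sex = S2.sex
        max_age = max(s2[3] for s2 in rows if s2[2] == s1[2])
        if s1[3] < max_age:  # WHERE age < (subquery result)
            result.append(s1)
    return result


for row in younger_than_oldest_of_same_sex(students):
    print(row)
```

The inner scan is repeated for every outer row, which is exactly the cost an optimizer tries to avoid, for example by computing the maximum age per sex once and reusing it.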
The final results are:

stud_nbr  stud_name  sex  age
=================================
       1  'Smith'    'M'   16
       3  'Smoot'    'F'   16
       5  'Jones'    'M'   16
       6  'Celko'    'M'   17
       7  'Vennor'   'F'   16

Again, no real product works this way, but it has to produce the same results as this process.

17.1.3 SELECT Statement Syntax

SQL-92 added new syntax for JOINs using infixed operators in the FROM clause. The JOIN operators are quite general and flexible, allowing you to do things in a single statement that you could not do in the older notation. The basic syntax is:

    <joined table> ::=
        <cross join> | <qualified join> | (<joined table>)

    <cross join> ::=
        <table reference> CROSS JOIN <table reference>

    <qualified join> ::=
        <table reference> [NATURAL] [<join type>] JOIN
            <table reference> [<join specification>]

    <join specification> ::=
        <join condition> | <named columns join>

    <join condition> ::= ON <search condition>

    <named columns join> ::= USING (<join column list>)

    <join type> ::=
        INNER | <outer join type> [OUTER] | UNION

    <outer join type> ::= LEFT | RIGHT | FULL

    <join column list> ::= <column name list>

    <table reference> ::=
          <table name> [[AS] <correlation name> [(<derived column list>)]]
        | <derived table> [AS] <correlation name> [(<derived column list>)]
        | <joined table>

    <derived table> ::= <table subquery>

    <column name list> ::=
        <column name> [{ <comma> <column name> }...]

An INNER JOIN is done by forming the CROSS JOIN and then removing the rows that do not meet the JOIN specification given in the ON clause, as we just demonstrated in the last section. The ON clause can be as elaborate as you want to make it, as long as it refers to tables and columns within its scope. If a <qualified join> is used without a <join type>, INNER is implicit. However, in the real world, most INNER JOINs are done using equality tests on columns with the same names in different tables, rather than on elaborate predicates.
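As a minimal sketch of the equivalence just described, the following Python script runs the same join two ways: as a CROSS JOIN filtered by a WHERE clause, and as an INNER JOIN with an ON clause. The use of the standard-library sqlite3 module, the column types, and the variable names are assumptions made for illustration; the text does not name a product. The rows are the Students and Gradebook rows from the worked example in this section.

```python
import sqlite3

# In-memory SQLite database; an illustrative choice, not the book's product.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Students
(stud_nbr INTEGER NOT NULL PRIMARY KEY,
 stud_name TEXT NOT NULL,
 sex TEXT NOT NULL,
 age INTEGER NOT NULL);

CREATE TABLE Gradebook
(stud_nbr INTEGER NOT NULL,
 grade TEXT NOT NULL);

INSERT INTO Students VALUES
 (1, 'Smith', 'M', 16), (2, 'Smyth', 'F', 17),
 (3, 'Smoot', 'F', 16), (4, 'Adams', 'F', 17),
 (5, 'Jones', 'M', 16), (6, 'Celko', 'M', 17),
 (7, 'Vennor', 'F', 16), (8, 'Murray', 'M', 18);

INSERT INTO Gradebook VALUES
 (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D'),
 (5, 'A'), (6, 'A'), (7, 'A'), (8, 'A');
""")

# Older notation: build the CROSS JOIN, then filter it in the WHERE clause.
cross_rows = conn.execute("""
    SELECT S.stud_nbr, S.stud_name, G.grade
      FROM Students AS S CROSS JOIN Gradebook AS G
     WHERE S.stud_nbr = G.stud_nbr
     ORDER BY S.stud_nbr
""").fetchall()

# Infixed notation: the same condition expressed in the ON clause.
inner_rows = conn.execute("""
    SELECT S.stud_nbr, S.stud_name, G.grade
      FROM Students AS S INNER JOIN Gradebook AS G
        ON S.stud_nbr = G.stud_nbr
     ORDER BY S.stud_nbr
""").fetchall()

print(cross_rows == inner_rows)
for row in inner_rows:
    print(row)
```

Both statements describe the same result; which physical plan is used to produce it is left to the optimizer.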
Equi-JOINs are so common that Standard SQL has two shorthand ways of specifying them. The USING (c1, c2, ..., cn) clause takes the column names in the list and replaces them with the clause ON ((T1.c1, T1.c2, ..., T1.cn) = (T2.c1, T2.c2, ..., T2.cn)). Likewise, the NATURAL option is shorthand for a USING() clause that is a list of all the column names that are common to both tables. If NATURAL is specified, a JOIN specification cannot be given; it is already there.

A strong warning: do not use NATURAL JOIN in production code. Any change to the column names will change the join at run time. For the same reason, do not use SELECT * in production code. But the NATURAL JOIN is more dangerous. As Daniel Morgan pointed out, a NATURAL JOIN between two tables with a column named comments can give you a meaningless join on a column containing kilobytes or megabytes of formatted text.

The same sort of warning applies to the USING clause. Neither of these options is widely implemented or used as of 2005. What’s sad about this is that in a properly designed data model, they would work just fine. If you found out that product_id, product_nbr, and upc were all used for the same data element in your schema, you would do a global change to make sure that one data element has one and only one name. In this case, you would use the better industry standard name upc for this data element.

The UNION JOIN and OUTER JOIN are topics in themselves and will be covered in separate sections.

17.1.4 The ORDER BY Clause

Contrary to popular belief, the ORDER BY clause is not part of the SELECT statement; it is part of a CURSOR declaration. The reason people think it is part of the SELECT statement is that the only way you can get to the result set of a query in a host language is via a cursor. When a vendor tool builds a cursor under the covers for you, it usually allows you to include an ORDER BY clause on the query.
Most optimizers will look at the result set and see from the query whether it is already in sorted order as a result of fetches done with an index, thus avoiding a redundant sorting operation. The bad news is that many programmers have written code that depended on the way that their particular release of a particular brand of SQL product presented the result. When an index is dropped or changed, or when the database is upgraded to a new release or has to be ported to another product, this automatic ordering can disappear.

As part of a cursor, the ORDER BY clause has some properties that you probably did not know existed. Here is the standard syntax.

    <order by clause> ::=
        ORDER BY <sort specification list>

    <sort specification list> ::=
        <sort specification> [{ <comma> <sort specification> }...]

    <sort specification> ::=
        <sort key> [<collate clause>] [<ordering specification>]

    <sort key> ::= <column name> | <scalar expression>

    <ordering specification> ::= ASC | DESC

The first thing to note is that the sort keys are either column names that must appear in the SELECT clause, or scalar expressions. The use of the positional number of a column is a deprecated feature in Standard SQL. Deprecation is a term in the standards world that means this feature will be removed from the next release of the standard; therefore, it should not be used, and your old code must be updated.

These are illegal sorts:

    SELECT a, (b+c) AS d
      FROM Foobar
     ORDER BY a, b, c; -- illegal!!

The columns b and c simply do not exist in the result set of the cursor, so there is no way to sort on them. However, in SQL-99 you were allowed to use a computation in the ORDER BY:

    SELECT a, b, c -- illegal!!
      FROM Foobar
     ORDER BY a, b, (b+c)

The correct way to do this is to put the function calls or expressions in the SELECT list, name that column, and use the name in the ORDER BY clause. This practice lets the user see what values the sorting is done on.
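The recommended pattern, naming the computed sort key in the SELECT list and sorting on that name, might look like the following sketch. It uses Python's standard-library sqlite3 module purely as a convenient way to run the statement; the three-column Foobar table and its sample rows are assumptions invented for the example.

```python
import sqlite3

# In-memory SQLite database; the table layout and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Foobar
(a INTEGER NOT NULL,
 b INTEGER NOT NULL,
 c INTEGER NOT NULL);

INSERT INTO Foobar VALUES (1, 5, 5), (1, 1, 2), (2, 0, 1), (1, 3, 4);
""")

# The expression (b + c) is exposed in the SELECT list under the name d,
# and the ORDER BY clause refers only to columns of the result: a and d.
rows = conn.execute("""
    SELECT a, (b + c) AS d
      FROM Foobar
     ORDER BY a, d
""").fetchall()

for row in rows:
    print(row)
```

Because d is part of the result, whoever reads the output can see the value each row was sorted on.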
Think about it: what good is a report or display when you have no idea how it was sorted? Furthermore, the sorting columns pass information to middle-tier machines that can sort the data again before distributing it to other front-end clients.

The sort order is based on the collation sequence of the column to which it is attached. The collation can be defined in the schema on character columns, but in most SQL products today collation is either ASCII or EBCDIC. You can expect Unicode to become more popular.

The ORDER BY and NULLs

Whether a sort key value that is NULL is considered greater or less than a non-NULL value is implementation-defined, but all sort key values that are NULL shall either be considered greater than all non-NULL values, or be considered less than all non-NULL values. There are SQL products that do it either way.

In March 1999, Chris Farrar brought up a question from one of his developers that caused him to examine a part of the SQL Standard that I thought I understood. Chris found some differences between the general understanding and the actual wording of the specification. The situation can be described as follows: a table, Sortable, with two integer columns, a and b, contains two rows that happen to be in this physical order:

Sortable
a     b
============
NULL  8
NULL  4

Given the following pseudo-query:

    SELECT a, b
      FROM Sortable
     ORDER BY a, b;

The first question is whether it is legal SQL for the cursor to produce the result sequence:

Cursor Result Sequence
a     b
============
NULL  8
NULL  4

The problem is that while the SQL Standard set up a rule to make the NULLs group together either before or after the known values, we never said that they have to act as if they were equal to each other. What is missing is a statement that says when comparing NULL to NULL, the result in the context of ORDER BY is that NULL is equal to NULL, just as it is in a GROUP BY.
This was the intent of the committee, so the expected result should have been:

Cursor Result Sequence
a     b
============
NULL  4
NULL  8

Phil Shaw, former IBM representative and one of the oldest members of the committee, dug up the section of the SQL-89 Standard that answered this problem. In SQL-89, the last General Rule of <comparison predicate> specified this:

    Although “x = y” is unknown if both x and y are NULL values, in the context of GROUP BY, ORDER BY, and DISTINCT, a NULL value is identical to or is a duplicate of another NULL value.

That is the rule that causes all NULLs to go into the same group, rather than each in its own group. Apply that rule, and then apply the rules for ORDER BY; the NULL values of column a of the two rows are equal, so you have to order the rows by the columns to the right in the ORDER BY, which is what every SQL product does.

The sort keys are applied from left to right, and a column name can appear only once in the list. But there is no obligation on the part of SQL to use a stable (sequence-preserving) sort. A stable sort on cities, followed by a stable sort on states, would result in a list with cities sorted within each state, and the states sorted. While stability is a nice property, the nonstable sorts are generally much faster.

You can use computed columns to get specialized sorting orders. For example, construct a table with a character column and an integer column. The goal is to order the results so that it first sorts the integer column descending, but then groups the related character column within the integers.
This is much easier to show with an example:

    CREATE TABLE Foobar
    (fruit CHAR(10) NOT NULL,
     score INTEGER NOT NULL,
     PRIMARY KEY (fruit, score));

    INSERT INTO Foobar VALUES ('Apples', 2);
    INSERT INTO Foobar VALUES ('Apples', 1);
    INSERT INTO Foobar VALUES ('Oranges', 5);
    INSERT INTO Foobar VALUES ('Apples', 5);
    INSERT INTO Foobar VALUES ('Banana', 2);

I want to order the results as the following:

    ('Apples', 5)
    ('Apples', 2)
    ('Apples', 1)
    ('Oranges', 5)
    ('Banana', 2)

In the above, the first pass of the sort would have produced this by sorting on the integer column: