CHAPTER 16: QUANTIFIED SUBQUERY PREDICATES

When you get a single FALSE result, the whole predicate is FALSE. As long as <table expression> has cardinality greater than zero and all non-NULL values, you will get a result of TRUE or FALSE. That sounds reasonable so far.

Now let EmptyTable be an empty table (no rows, cardinality zero) and NullTable be a table with only NULLs in its rows (cardinality greater than zero). The rules for SQL say that <value expression> <comp op> ALL NullTable always returns UNKNOWN, and likewise <value expression> <comp op> ANY NullTable always returns UNKNOWN. This makes sense, because every row comparison test in the expansion would return UNKNOWN, so the series of OR and AND operators would behave in the usual way.

However, <value expression> <comp op> ALL EmptyTable always returns TRUE, and <value expression> <comp op> ANY EmptyTable always returns FALSE. Most people have no trouble seeing why the ANY predicate works that way; you cannot find a match, so the result is FALSE. But most people have lots of trouble seeing why the ALL predicate is TRUE. This convention is called existential import, and I have just discussed it in Chapter 15. If I were to walk into a bar and announce that I can beat any pink elephant in the bar, that would be a true statement. The fact that there are no pink elephants in the bar merely shows that the problem is reduced to the minimum case.

If this seems unnatural, then convert the ALL and ANY predicates into EXISTS predicates and look at the way that this rule preserves the properties that:

1. (∀x P(x)) = ¬(∃x ¬P(x))
2. (∃x P(x)) = ¬(∀x ¬P(x))

The Table1.x <comp op> ALL (SELECT y FROM Table2 WHERE <search condition>) predicate converts to:

    NOT EXISTS
    (SELECT *
       FROM Table2
      WHERE NOT (Table1.x <comp op> Table2.y)
        AND <search condition>)

The Table1.x <comp op> ANY (SELECT y FROM Table2 WHERE <search condition>) predicate converts to:

    EXISTS
    (SELECT *
       FROM Table2
      WHERE Table1.x <comp op> Table2.y
        AND <search condition>)

In both forms, Table1.x is an outer reference to the row being tested, and the negation in the ALL form applies to the comparison, not to the search condition. Because EXISTS can return only TRUE or FALSE, these forms agree with the quantified predicates only when no NULLs are involved.

Of the two quantified predicates, the <comp op> ALL predicate is used more. The ANY predicate is more easily replaced and more naturally written with an EXISTS() predicate or an IN() predicate. In fact, the standard defines the IN() predicate as shorthand for = ANY and the NOT IN() predicate as shorthand for <> ALL, which is how most people would construct them in English.

The <comp op> ALL predicate is probably the more useful of the two, since, apart from <> ALL, it cannot be written in terms of an IN() predicate. The trick with it is to make sure that its subquery defines the set of values in which you are interested. For example, to find the authors whose books all sell for $19.95 or more, you could write:

    SELECT *
      FROM Authors AS A1
     WHERE 19.95
           <= ALL (SELECT price
                     FROM Books AS B1
                    WHERE A1.author_name = B1.author_name);

The best way to think of this is to reverse the usual English sentence “Show me all x that are y” in your mind so that it says “y is the value of all x” instead.

16.3 The ALL Predicate and Extrema Functions

It is counterintuitive at first that these two predicates are not the same in SQL:

    x >= (SELECT MAX(y) FROM Table1)
    x >= ALL (SELECT y FROM Table1)

But you have to remember the rules for the extrema functions: they drop out all the NULLs before returning the greatest or least values. The ALL predicate does not drop NULLs, so you can get them in the results.
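The difference between the two forms, and the EmptyTable and NullTable rules, can be traced by hand. The following is a minimal sketch in Python rather than SQL, under the assumption that `None` stands in for both NULL and UNKNOWN; the helper names (`compare`, `and3`, `or3`, `all_predicate`, `any_predicate`, `sql_max`) are illustrative and do not come from any SQL product.

```python
import operator
from typing import Callable, Iterable, List, Optional

# Assumption: None models SQL's NULL (as a value) and UNKNOWN (as a truth value).
Truth = Optional[bool]


def compare(op: Callable[[int, int], bool],
            x: Optional[int], y: Optional[int]) -> Truth:
    """One row comparison: UNKNOWN if either operand is NULL."""
    if x is None or y is None:
        return None
    return op(x, y)


def and3(results: Iterable[Truth]) -> Truth:
    """Three-valued AND over a series of results; an empty series is TRUE."""
    items = list(results)
    if any(r is False for r in items):
        return False
    if any(r is None for r in items):
        return None
    return True


def or3(results: Iterable[Truth]) -> Truth:
    """Three-valued OR over a series of results; an empty series is FALSE."""
    items = list(results)
    if any(r is True for r in items):
        return True
    if any(r is None for r in items):
        return None
    return False


def all_predicate(x: Optional[int], op, column: List[Optional[int]]) -> Truth:
    """Model of: x <comp op> ALL (SELECT y FROM ...)."""
    return and3(compare(op, x, y) for y in column)


def any_predicate(x: Optional[int], op, column: List[Optional[int]]) -> Truth:
    """Model of: x <comp op> ANY (SELECT y FROM ...)."""
    return or3(compare(op, x, y) for y in column)


def sql_max(column: List[Optional[int]]) -> Optional[int]:
    """Model of MAX(): drop NULLs first; an empty set yields NULL."""
    non_null = [y for y in column if y is not None]
    return max(non_null) if non_null else None


if __name__ == "__main__":
    ge = operator.ge
    with_null = [1, 2, None]
    print("x >= MAX(y):", compare(ge, 3, sql_max(with_null)))
    print("x >= ALL(y):", all_predicate(3, ge, with_null))
    print("ALL on empty:", all_predicate(3, ge, []))
    print("ANY on empty:", any_predicate(3, ge, []))
```

In this model, with x = 3 and a column holding 1, 2, and NULL, the MAX() form is TRUE while the ALL form is UNKNOWN, because the NULL row survives into the series of ANDs. For an empty column the ALL form is TRUE, while the MAX() form is UNKNOWN because MAX() of an empty set is NULL.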
However, if you know that there are no NULLs in a column, or are willing to drop the NULLs yourself, then you can use the ALL predicate to construct single queries to do work that would otherwise be done by two queries. For example, we could use the table of products and store managers we used earlier in this chapter and find which manager handles the largest number of products. To do this, we would first construct a grouped VIEW and then aggregate it again:

    CREATE VIEW TotalProducts (manager_name, product_tally)
    AS SELECT manager_name, COUNT(*)
         FROM Stores
        GROUP BY manager_name;

    SELECT manager_name
      FROM TotalProducts
     WHERE product_tally
           = (SELECT MAX(product_tally)
                FROM TotalProducts);

But Alex Dorfman found a single query solution instead:

    SELECT manager_name, COUNT(*)
      FROM Stores
     GROUP BY manager_name
    HAVING COUNT(*) + 1
           > ALL (SELECT DISTINCT COUNT(*)
                    FROM Stores
                   GROUP BY manager_name);

The use of the SELECT DISTINCT in the subquery is to guarantee that we do not get duplicate rows when two managers handle the same number of products. You can also add a WHERE dept IS NOT NULL clause to the subquery to get the effect of a true MAX() aggregate function.

16.4 The UNIQUE Predicate

The UNIQUE predicate is a test for the absence of duplicate rows in a subquery. The UNIQUE keyword is also used as a table or column constraint, and this predicate is used to define that constraint. The UNIQUE column constraint is implemented in many SQL implementations with a CREATE UNIQUE INDEX <indexname> ON <table>(<column list>) statement hidden under the covers.

The syntax for this predicate is:

    <unique predicate> ::= UNIQUE <table subquery>

If any two rows in the subquery are equal to each other, the predicate is FALSE. However, the definition in the standard is worded in the negative, so that NULLs get the benefit of the doubt.
The query can be written as a NOT EXISTS predicate that counts rows, thus:

    NOT EXISTS
    (SELECT <column list>
       FROM <subquery>
      WHERE (<column list>) IS NOT NULL
      GROUP BY <column list>
     HAVING COUNT(*) > 1);

The predicate is always TRUE for an empty subquery, since you cannot find two rows, and therefore duplicates do not exist. This makes sense on the face of it.

NULLs are easier to explain with an example: say a table with only two rows, ('a', 'b') and ('a', NULL). The first columns of each row are non-NULL and are equal to each other, so we have a match so far. The second column in the second row is NULL and cannot compare to anything, so the pair cannot be counted as duplicates, and the test is TRUE. This is giving the NULLs the benefit of the doubt, since the NULL in the second row could become 'b' some day and give us a duplicate row.

Now consider the case where the subquery has two rows, ('a', NULL) and ('a', NULL). The predicate is still TRUE, because the NULLs do not test equal or unequal to each other, not because we are making NULLs equal to each other.

As you can see, it is a good idea to avoid NULLs in UNIQUE constraints.

CHAPTER 17: THE SELECT STATEMENT

The good news about SQL is that the programmer only needs to learn the SELECT statement to do almost all his work. The bad news is that the statement can have so many nested clauses that it looks like a Victorian novel! The SELECT statement is used to query the database. It combines one or more tables, can do some calculations, and finally puts the results into a result table that can be passed on to the host language.

I have not spent much time on the simple one-table SELECT statements you see in introductory books. I am assuming that the readers are experienced SQL programmers and got enough of those queries when they were learning SQL.

17.1 SELECT and JOINs

There is an order to the execution of the clauses of an SQL SELECT statement that does not seem to be covered in most beginning SQL books.
It explains why some things work in SQL and others do not.

17.1.1 One-Level SELECT Statement

The simplest possible SELECT statement is just “SELECT * FROM Sometable;” which returns the entire table as it stands. You can actually write this as “TABLE Sometable” in Standard SQL, but nobody seems to use that syntax. Though the syntax rules say that all you need are the SELECT and FROM clauses, in practice there is almost always a WHERE clause.

Let’s look at the SELECT statement in detail. The syntax for the statement is:

    SELECT [ALL | DISTINCT] <scalar expression list>
      FROM <table expression>
    [WHERE <search condition>]
    [GROUP BY <grouping column list>]
    [HAVING <group condition>];

The order of execution is as follows:

1. Execute the FROM <table expression> clause and construct the working result table defined in that clause. The FROM can have all sorts of other table expressions, but the point is that they return a working table as a result. We will get into the details of those expressions later, with particular attention to the JOIN operators. The result table preserves the order of the tables, and the order of the columns within each, in the result. The result table is different from other tables in that each column retains the table name from which it was derived. Thus if table A and table B both have a column named x, there will be a column A.x and a column B.x in the results of the FROM clause. When the FROM clause lists several tables, the conceptual working table is their CROSS JOIN. No product actually uses a CROSS JOIN to construct the intermediate table, because the working table would get too large too fast. For example, a 1,000-row table and a 1,000-row table would CROSS JOIN to get a 1,000,000-row working table. This is just the conceptual model we use to describe behavior.

2. If there is a WHERE clause, apply the search condition in it to each row of the FROM clause result table. The rows that test TRUE are retained; the rows that test FALSE or UNKNOWN are deleted from the working set.
The WHERE clause is where the action is. The predicate can be quite complex and have nested subqueries. The syntax of a subquery is a SELECT statement, which is inside parentheses; failure to use parentheses is a common error for new SQL programmers. Subqueries are where the original SQL got the name “Structured English Query Language”; the ability to nest SELECT statements was the “structured” part. We will deal with those in another section.

3. If there is a GROUP BY <grouping column list> clause, execute it next. It uses the FROM and WHERE clause working table and breaks these rows into groups where the columns in the <grouping column list> all have the same value. NULLs are treated as if they were all equal to each other, and form their own group. Each group is then reduced to a single row in a new result table that replaces the old one. Each row represents information about its group. Standard SQL does not allow you to use the name of a calculated column such as “(salary + commission) AS total_pay” in the GROUP BY clause, because that column is computed and named in the SELECT clause of this query. It does not exist yet. However, you will find products that allow it because they create a result table first, using names in the SELECT clause, then fill the result table with rows created by the query. There are ways to get the same result by using VIEWs and derived table expressions, which we will discuss later. Only four things make sense as group characteristics: the columns that define it, the aggregate functions that summarize group characteristics, function calls and constants, and expressions built from those three things.

4. If there is a HAVING clause, apply it to each of the groups. The groups that test TRUE are retained; the groups that test FALSE or UNKNOWN are deleted. If there is no GROUP BY clause, the HAVING clause treats the whole table as a single group. It is not true that there must be a GROUP BY clause.
Standard SQL prohibits correlated queries in a HAVING clause, but there are workarounds that use derived tables. The <group condition> must apply to columns in the grouped working table or to group properties, not to the individual rows that originally built the group. Aggregate functions used in the HAVING clause usually appear in the SELECT clause, but that is not part of the standard. Nor does the SELECT clause have to include all the grouping columns.

5. Finally, apply the SELECT clause to the result table. If a column does not appear in the <scalar expression list>, it is dropped from the final results. Expressions can be constants or column names, or they can be calculations made from constants, columns, functions, and scalar subqueries. If the SELECT clause has the DISTINCT option, redundant duplicate rows are deleted from the final result table. The phrase “redundant duplicate” means that one copy of the row is retained. If the SELECT clause has the explicit ALL option or is missing the [ALL | DISTINCT] option, then all duplicate rows are preserved in the final results table. (Frankly, although it is legal syntax, nobody really uses the SELECT ALL option.) Finally, the results are returned.

Let us carry an example out in detail, with a two-table join.
    SELECT sex, COUNT(*), AVG(age), (MAX(age) - MIN(age)) AS age_range
      FROM Students, Gradebook
     WHERE grade = 'A'
       AND Students.stud_nbr = Gradebook.stud_nbr
     GROUP BY sex
    HAVING COUNT(*) > 3;

The two starting tables look like this:

    CREATE TABLE Students
    (stud_nbr INTEGER NOT NULL PRIMARY KEY,
     stud_name CHAR(10) NOT NULL,
     sex CHAR(1) NOT NULL,
     age INTEGER NOT NULL);

    Students
    stud_nbr  stud_name  sex  age
    ===============================
    1         'Smith'    'M'  16
    2         'Smyth'    'F'  17
    3         'Smoot'    'F'  16
    4         'Adams'    'F'  17
    5         'Jones'    'M'  16
    6         'Celko'    'M'  17
    7         'Vennor'   'F'  16
    8         'Murray'   'M'  18

    CREATE TABLE Gradebook
    (stud_nbr INTEGER NOT NULL PRIMARY KEY
       REFERENCES Students(stud_nbr),
     grade CHAR(1) NOT NULL);

    Gradebook
    stud_nbr  grade
    =================
    1         'A'
    2         'B'
    3         'C'
    4         'D'
    5         'A'
    6         'A'
    7         'A'
    8         'A'

The CROSS JOIN in the FROM clause looks like this:

    Cross Join working table
    Students                        | Gradebook
    stud_nbr  stud_name  sex  age   | stud_nbr  grade
    ====================================================
    1         'Smith'    'M'  16    | 1         'A'
    1         'Smith'    'M'  16    | 2         'B'
    1         'Smith'    'M'  16    | 3         'C'
    1         'Smith'    'M'  16    | 4         'D'
    1         'Smith'    'M'  16    | 5         'A'
    1         'Smith'    'M'  16    | 6         'A'
    1         'Smith'    'M'  16    | 7         'A'
    1         'Smith'    'M'  16    | 8         'A'
    2         'Smyth'    'F'  17    | 1         'A'
    2         'Smyth'    'F'  17    | 2         'B'
    2         'Smyth'    'F'  17    | 3         'C'
    2         'Smyth'    'F'  17    | 4         'D'
    2         'Smyth'    'F'  17    | 5         'A'
    2         'Smyth'    'F'  17    | 6         'A'
    2         'Smyth'    'F'  17    | 7         'A'
    . . .
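The conceptual steps for this example can be checked mechanically. The sketch below is only an illustration under stated assumptions: it loads the two sample tables into an in-memory SQLite database through Python's standard sqlite3 module (a choice made here, not a product the text discusses), and the function name `run_example` is hypothetical. It reports the row count of the conceptual CROSS JOIN, the row count after the WHERE clause, and the final grouped result.

```python
import sqlite3

# Sample rows copied from the Students and Gradebook tables in the text.
STUDENTS = [
    (1, "Smith", "M", 16), (2, "Smyth", "F", 17),
    (3, "Smoot", "F", 16), (4, "Adams", "F", 17),
    (5, "Jones", "M", 16), (6, "Celko", "M", 17),
    (7, "Vennor", "F", 16), (8, "Murray", "M", 18),
]
GRADES = [(1, "A"), (2, "B"), (3, "C"), (4, "D"),
          (5, "A"), (6, "A"), (7, "A"), (8, "A")]


def run_example():
    """Return (cross join size, rows kept by WHERE, final result rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Students
        (stud_nbr INTEGER NOT NULL PRIMARY KEY,
         stud_name CHAR(10) NOT NULL,
         sex CHAR(1) NOT NULL,
         age INTEGER NOT NULL);

        CREATE TABLE Gradebook
        (stud_nbr INTEGER NOT NULL PRIMARY KEY
           REFERENCES Students(stud_nbr),
         grade CHAR(1) NOT NULL);
    """)
    conn.executemany("INSERT INTO Students VALUES (?, ?, ?, ?)", STUDENTS)
    conn.executemany("INSERT INTO Gradebook VALUES (?, ?)", GRADES)

    # Step 1: size of the conceptual working table built by the FROM clause.
    cross_rows = conn.execute(
        "SELECT COUNT(*) FROM Students, Gradebook").fetchone()[0]

    # Step 2: rows that survive the WHERE clause.
    where_rows = conn.execute("""
        SELECT COUNT(*)
          FROM Students, Gradebook
         WHERE grade = 'A'
           AND Students.stud_nbr = Gradebook.stud_nbr""").fetchone()[0]

    # Steps 3 to 5: grouping, HAVING, and the SELECT list.
    final_rows = conn.execute("""
        SELECT sex, COUNT(*), AVG(age), (MAX(age) - MIN(age)) AS age_range
          FROM Students, Gradebook
         WHERE grade = 'A'
           AND Students.stud_nbr = Gradebook.stud_nbr
         GROUP BY sex
        HAVING COUNT(*) > 3""").fetchall()

    conn.close()
    return cross_rows, where_rows, final_rows


if __name__ == "__main__":
    cross_rows, where_rows, final_rows = run_example()
    print("cross join rows:", cross_rows)
    print("rows after WHERE:", where_rows)
    print("final result:", final_rows)
```

With the sample data, the conceptual working table has 8 x 8 = 64 rows, five of them survive the WHERE clause, and only the 'M' group has more than three members, so a single row comes back. A real engine would not materialize the 64-row table; the counts only trace the conceptual model.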