352 CHAPTER 17: THE SELECT STATEMENT

implemented in actual products yet, and nobody seems to be missing the OUTER UNION or CORRESPONDING clause.

The INNER JOIN operator did get to be popular. This was fairly easy to implement, since vendors only had to extend the parser without having to add more functionality. Additionally, it is a binary operator, and programmers are used to binary operators: add, subtract, multiply, and divide are all binary operators. E-R diagrams use lines between tables to show a relational schema.

But this leads to a linear approach to problem solving that might not be such a good thing in SQL. Consider this statement, which would have been written in the traditional syntax as:

    SELECT a, b, c
      FROM Foo, Bar, Flub
     WHERE Foo.y BETWEEN Bar.x AND Flub.z;

With the infixed syntax, I can write this same statement in any of several ways. For example:

    SELECT a, b, c
      FROM Foo
           INNER JOIN
           Bar
           ON Foo.y >= Bar.x
           INNER JOIN
           Flub
           ON Foo.y <= Flub.z;

Humans tend to see things that are close together as a unit or as having a relationship. The extra reserved words in the infixed notation tend to work against that perception.

The infixed notation invites a programmer to add one table at a time to the chain of joins. First I built and tested the Foo-Bar join, and when I was happy with the results, I added Flub. “Step-wise” program refinement was one of the mantras of structured programming.

But look at the code; can you see that there is a BETWEEN relationship among the three tables? It is not easy, is it? In effect, you see only pairs of tables and not the whole problem. SQL is an “all-at-once” set-oriented language, not a “step-wise” language.

Technically, the SQL engine is supposed to perform the infixed joins in left to right order as they appear in the FROM clause. It is free to rearrange the order of the joins, if the rearrangement does not change the results.
Order of execution does not make a difference with INNER JOINs, but it is very important with OUTER JOINs.

17.4 Scope of Derived Table Names

Another problem is that many SQL programmers do not fully understand the rules for the scope of names. If an infixed join is given a derived table name, then all of the table names inside it are hidden from containing expressions. For example, this will fail:

    SELECT a, b, c -- wrong!
      FROM (Foo
            INNER JOIN
            Bar
            ON Foo.y >= Bar.x) AS Foobar (x, y)
           INNER JOIN
           Flub
           ON Foo.y <= Flub.z;

It fails because the table name Foo is not available to the second INNER JOIN. However, this will work:

    SELECT a, b, c
      FROM (Foo
            INNER JOIN
            Bar
            ON Foo.y >= Bar.x) AS Foobar (x, y)
           INNER JOIN
           Flub
           ON Foobar.y <= Flub.z;

If you start nesting lots of derived table expressions, you can force an order of execution in the query. It is generally not a good idea to try to outguess the optimizer.

So far, I have shown fully qualified column names. It is a good programming practice, but it is not required. Assume that Foo and Bar both have a column named w. These statements will produce an ambiguous name error:

    SELECT a, b, c
      FROM Foo
           INNER JOIN
           Bar ON y >= x
           INNER JOIN
           Flub ON y <= w;

    SELECT a, b, c
      FROM Foo, Bar, Flub
     WHERE y BETWEEN x AND w;

But this statement will work; it is resolved from inside the parentheses first, and the outermost INNER JOIN is done last.

    SELECT a, b, c
      FROM Foo
           INNER JOIN
           (Bar
            INNER JOIN
            Flub ON y <= w)
           ON y >= x;

If Bar did not have a column named w, then the parser would go to the next containing expression, find Foo.w, and use it.

As an aside, there is a myth among new SQL programmers that the join conditions must be in the ON clause, and the search argument predicates (SARGs) must be in the WHERE clause. It is a nice programming style and isolates the search arguments to one location for easy changes. But it is not a requirement.

Am I against infixed joins?
No, but they are a bit more complicated than they first appear, and if there are some OUTER JOINs in the mix, things can be very complicated. Just be careful with the new toys, kids.

17.5 JOINs by Function Calls

JOINs can also be done inside functions that relate columns from one or more tables in their parameters. This is easier to explain with an actual example, from John Botibol of Deverill plc in Dorset, U.K. His problem was how to “flatten” legacy data stored in a flat file database into a relational format for a data warehouse. The data included a vast amount of demographic information on people, related to their subjects of interest. The subjects of interest were selected from a list; some subjects required just one answer, and others allowed multiple selections.

The problem was that the data for multiple selections was stored as a string with a one or a zero in positional places to indicate “interested” or “not interested” in that item. The actual list of products was stored in another file as a list. Thus, for one person we might have something like '101110' together with a list like 1 = Bananas, 2 = Apples, 3 = Bread, 4 = Fish, 5 = Meat, 6 = Butter, if the subject area was foods.

The data was first moved into working tables like this:

    CREATE TABLE RawSurvey
    (rawkey INTEGER NOT NULL PRIMARY KEY,
     rawstring CHAR(20) NOT NULL);

    CREATE TABLE SurveyList
    (survey_id INTEGER NOT NULL PRIMARY KEY,
     surveytext CHAR(30) NOT NULL);

There was always the correct number of ones and zeros for the number of question options in any group (thus, in this case, the answer strings always have six characters), and the list was in the correct order to match the positions in the string.

The data had to be ported into SQL, which meant that each survey had to be broken down into a row for each response.
    CREATE TABLE Surveys
    (survey_id INTEGER NOT NULL,
     surveytext CHAR(30) NOT NULL,
     ticked INTEGER DEFAULT 0 NOT NULL
       CONSTRAINT tick_mark
       CHECK (ticked IN (0, 1)),
     PRIMARY KEY (survey_id, surveytext));

This table can be loaded with the query:

    INSERT INTO Surveys (survey_id, surveytext, ticked)
    SELECT rawkey, surveytext,
           CAST(SUBSTRING(rawstring FROM survey_id FOR 1) AS INTEGER)
      FROM RawSurvey, SurveyList;

The tables are joined in the SUBSTRING() function, instead of with a theta operator. The SUBSTRING() function returns an empty string if survey_id goes beyond the end of the string. The query will always return a number of rows that is equal to or less than the number of characters in rawstring. The technique will adjust itself correctly for any number of possible survey answers.

In the real problem, the table SurveyList always contained exactly the right number of entries for the length of the string to be exploded, and the string to be exploded always had exactly the right number of characters, so you did not need a WHERE clause to check for bad data.

17.6 The UNION JOIN

The UNION JOIN was defined in Standard SQL, but I know of no SQL product that has implemented it. As the name implies, it is a cross between a UNION and a FULL OUTER JOIN. The definition followed easily from the other infixed JOIN operators. The syntax has no searched clause:

    <table expression 1> UNION JOIN <table expression 2>

The statement takes two dissimilar tables and puts them into one result table. It preserves all the rows from both tables and does not try to consolidate them. Columns that do not exist in one table are simply padded out with NULLs in the result rows. Columns with the same names in the tables have to be renamed differently in the result. It is equivalent to:

    <table expression 1>
    FULL OUTER JOIN
    <table expression 2>
    ON 1 = 2;

Any searched expression that is always FALSE will work.
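The FULL OUTER JOIN form can be tried out directly in a product that supports FULL OUTER JOIN. What follows is only a minimal sketch: the tables Alpha and Beta, their columns, and their rows are hypothetical, invented for this illustration.

```sql
-- Hypothetical tables, invented for this sketch only.
CREATE TABLE Alpha
(a_id INTEGER NOT NULL PRIMARY KEY,
 a_text CHAR(5) NOT NULL);

CREATE TABLE Beta
(b_id INTEGER NOT NULL PRIMARY KEY,
 b_text CHAR(5) NOT NULL);

INSERT INTO Alpha (a_id, a_text) VALUES (1, 'one'), (2, 'two');
INSERT INTO Beta (b_id, b_text) VALUES (1, 'uno');

-- The ON condition can never be TRUE, so no Alpha row pairs with a Beta row.
-- Each row from either side is preserved and padded out with NULLs.
SELECT Alpha.a_id, Alpha.a_text, Beta.b_id, Beta.b_text
  FROM Alpha
       FULL OUTER JOIN
       Beta
       ON 1 = 2;
```

With these rows, the result should have three rows: the two Alpha rows with NULLs in the Beta columns, and the one Beta row with NULLs in the Alpha columns. The two key columns were given different names here so that nothing has to be renamed in the result.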
As an example of a UNION JOIN, you might want to combine the medical records of male and female patients into one table with this query:

    SELECT *
      FROM (SELECT 'male', prostate FROM Males)
           UNION JOIN
           (SELECT 'female', pregnancy FROM Females);

to get a result table like this:

    Result
    male    prostate  female    pregnancy
    =====================================
    'male'  no        NULL      NULL
    'male'  no        NULL      NULL
    'male'  yes       NULL      NULL
    'male'  yes       NULL      NULL
    NULL    NULL      'female'  no
    NULL    NULL      'female'  no
    NULL    NULL      'female'  yes
    NULL    NULL      'female'  yes

Frédéric Brouard came up with a nice trick for writing a similar join; that is, a join on one table, say a basic table of student data, with either a table of data particular to domestic students or another table of data particular to foreign students, based on the value of a parameter. This differs from a true UNION JOIN in that it must have a “root” table to use for the outer joins.

    CREATE TABLE Students
    (student_nbr INTEGER NOT NULL PRIMARY KEY,
     student_type CHAR(1) DEFAULT 'D' NOT NULL
       CHECK (student_type IN ('D', 'F')));

    CREATE TABLE DomesticStudents
    (student_nbr INTEGER NOT NULL PRIMARY KEY
       REFERENCES Students(student_nbr));

    CREATE TABLE ForeignStudents
    (student_nbr INTEGER NOT NULL PRIMARY KEY
       REFERENCES Students(student_nbr));

    SELECT Students.*, DomesticStudents.*, ForeignStudents.*
      FROM Students
           LEFT OUTER JOIN
           DomesticStudents
           ON CASE Students.student_type
              WHEN 'D' THEN 1 ELSE NULL END = 1
           LEFT OUTER JOIN
           ForeignStudents
           ON CASE Students.student_type
              WHEN 'F' THEN 1 ELSE NULL END = 1;

17.7 Packing Joins

We can relate two tables together based on quantities in each of them. The simplest example is filling customer orders from our inventories at various stores. To make life easier, let’s assume that we have only one product, we process orders in increasing customer_id order, and we draw from store inventory by increasing store_id.
    CREATE TABLE Inventory
    (store_id INTEGER NOT NULL PRIMARY KEY,
     item_qty INTEGER NOT NULL CHECK (item_qty >= 0));

    INSERT INTO Inventory (store_id, item_qty)
    VALUES (10, 2), (20, 3), (30, 2);

    CREATE TABLE Orders
    (customer_id CHAR(5) NOT NULL PRIMARY KEY,
     item_qty INTEGER NOT NULL CHECK (item_qty > 0));

    INSERT INTO Orders (customer_id, item_qty)
    VALUES ('Bill', 4), ('Fred', 2);

What we want to do is fill Bill’s order for four units by taking two units from store 10 and two units from store 20. Next we process Fred’s order with the one unit left in store 20, and one unit from store 30.

    SELECT I.store_id, O.customer_id,
           (CASE WHEN O.end_running_qty <= I.end_running_qty
                 THEN O.end_running_qty
                 ELSE I.end_running_qty END
            - CASE WHEN O.start_running_qty >= I.start_running_qty
                   THEN O.start_running_qty
                   ELSE I.start_running_qty END)
           AS items_consumed_tally
      FROM (SELECT I1.store_id,
                   SUM(I2.item_qty) - I1.item_qty,
                   SUM(I2.item_qty)
              FROM Inventory AS I1, Inventory AS I2
             WHERE I2.store_id <= I1.store_id
             GROUP BY I1.store_id, I1.item_qty)
           AS I (store_id, start_running_qty, end_running_qty)
           INNER JOIN
           (SELECT O1.customer_id,
                   SUM(O2.item_qty) - O1.item_qty,
                   SUM(O2.item_qty) AS end_running_qty
              FROM Orders AS O1, Orders AS O2
             WHERE O2.customer_id <= O1.customer_id
             GROUP BY O1.customer_id, O1.item_qty)
           AS O (customer_id, start_running_qty, end_running_qty)
           ON O.start_running_qty < I.end_running_qty
              AND O.end_running_qty > I.start_running_qty
     ORDER BY store_id, customer_id;

This can also be done with the new SQL-99 OLAP operators.

17.8 Dr. Codd’s T-Join

Dr. E. F. Codd introduced a set of new theta operators, called T-operators, which were based on the idea of a best-fit or approximate equality (Codd 1990). The algorithm for the operators is easier to understand with an example modified from Dr. Codd (Codd 1990).

The problem is to assign the classes to the available classrooms. We want (class_size < room_size) to be true after the assignments are made.
This will allow us a few empty seats in each room for late students. We can do this in one of two ways. The first way is to sort the tables in ascending order by classroom size and the number of students in a class. We start with the following tables:

    CREATE TABLE Rooms
    (room_nbr CHAR(2) PRIMARY KEY,
     room_size INTEGER NOT NULL);

    CREATE TABLE Classes
    (class_nbr CHAR(2) PRIMARY KEY,
     class_size INTEGER NOT NULL);

These tables have the following rows in them:

    Classes
    class_nbr  class_size
    =====================
    'c1'       80
    'c2'       70
    'c3'       65
    'c4'       55
    'c5'       50
    'c6'       40

    Rooms
    room_nbr  room_size
    ===================
    'r1'      70
    'r2'      40
    'r3'      50
    'r4'      85
    'r5'      30
    'r6'      65
    'r7'      55

The goal of the T-Join problem is to assign each class to a classroom that is larger than the class (class_size < room_size). Dr. Codd gives two approaches to the problem.

1. Ascending Order Algorithm: Sort both tables into ascending order. Reading from the top of the Rooms table, match each class with the first room that will fit.

    Classes                  Rooms
    class_nbr  class_size    room_nbr  room_size
    =====================    ===================
    'c6'       40            'r5'      30
    'c5'       50            'r2'      40
    'c4'       55            'r3'      50
    'c3'       65            'r7'      55
    'c2'       70            'r6'      65
    'c1'       80            'r1'      70
                             'r4'      85

    Results
    class_nbr  class_size  room_nbr  room_size
    ==========================================
    'c2'       70          'r4'      85
    'c3'       65          'r1'      70
    'c4'       55          'r6'      65
    'c5'       50          'r7'      55
    'c6'       40          'r3'      50

2. Descending Order Algorithm: Sort both tables into descending order. Reading from the top of the Classes table, match each class with the first room that will fit.

    Classes                  Rooms
    class_nbr  class_size    room_nbr  room_size
    =====================    ===================
    'c1'       80            'r4'      85
    'c2'       70            'r1'      70
    'c3'       65            'r6'      65
    'c4'       55            'r7'      55
    'c5'       50            'r3'      50
    'c6'       40            'r2'      40
                             'r5'      30

    Results
    class_nbr  class_size  room_nbr  room_size
    ==========================================
    'c1'       80          'r4'      85
    'c3'       65          'r1'      70
    'c4'       55          'r6'      65
    'c5'       50          'r7'      55
    'c6'       40          'r3'      50

Notice that the answers are different! Dr.
Codd has never given a definition in relational algebra of the T-Join, so I propose that we need one. Informally, for each class, we want the smallest room that will hold it, while maintaining the T-Join condition. Or for each room, we want the largest class that will fill it, while maintaining the T-Join condition. These can be two different things, so you must decide which table is the driver. But either way, I advocate a “best fit” over Codd’s “first fit” approach. In effect, the Swedish and Croatian solutions given later in this section use my definition instead of Dr. Codd’s; the Colombian solution is true to the algorithmic approach.

Other theta conditions can be used in place of the “less than” shown here. If “less than or equal” is used, all the classes are assigned to a room in this case, but not in all cases. This is left to the reader as an exercise.

The first attempts in Standard SQL are variations on grouped queries. They can, however, produce some rows that would be left out of the answers Dr. Codd was expecting. The first JOIN can be written as