CHAPTER 19: PARTITIONING DATA IN QUERIES

19.1.1 Partitioning by Ranges

A common problem in data processing is classifying things by where they fall in a range on a numeric or alphabetic scale. The best approach to translating a code into a value when ranges are involved is to set up a table that holds the high and the low values for each translated value. Any missing values will easily be detected, and the table can be validated for completeness. This is covered in more detail in the chapter on auxiliary tables, Chapter 22, but here is a quick review.

For example, we could create a table of ZIP code ranges and two-character state abbreviation codes, like this:

    CREATE TABLE StateZip
    (state_code CHAR(2) NOT NULL PRIMARY KEY,
     low_zip CHAR(5) NOT NULL UNIQUE,
     high_zip CHAR(5) NOT NULL UNIQUE,
     CONSTRAINT zip_order_okay CHECK(low_zip < high_zip));

Here is a query that looks up the state code from the ZIP code in the AddressBook table to complete a mailing label with a simple JOIN:

    SELECT A1.name, A1.street, A1.city, SZ.state_code, A1.zip
      FROM StateZip AS SZ, AddressBook AS A1
     WHERE A1.zip BETWEEN SZ.low_zip AND SZ.high_zip;

You must be careful with this predicate. If any of the three columns involved has a NULL in it, the BETWEEN predicate evaluates to UNKNOWN, and the WHERE clause rejects the row. If you design the table of range values with the high value in one row equal to or greater than the low value in another row, both of those rows will be returned when the test value falls on the overlap.

Single-Column Range Tables

If you know that you have a partitioning in the range value tables, you can write a query in SQL that will let you use a table with only the high value and the translation code. The grading system table would have ((100%, ‘A’), (89%, ‘B’), (79%, ‘C’), (69%, ‘D’), and (59%, ‘F’)) as its rows.
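As a rough illustration of such a single-column table, the sketch below loads the grading rows into an in-memory SQLite database through Python's sqlite3 module. It finds the letter grade as the row with the smallest high value that is still greater than or equal to the score. The table name Grades, its column names, and the helper letter_grade are assumptions made for this example; they do not come from the text, and SQLite merely stands in for whatever SQL product you use.

```python
import sqlite3

# Hypothetical single-column range table: only the high value and the code.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Grades
(high_score INTEGER NOT NULL PRIMARY KEY,
 grade CHAR(1) NOT NULL);

INSERT INTO Grades (high_score, grade)
VALUES (100, 'A'), (89, 'B'), (79, 'C'), (69, 'D'), (59, 'F');
""")


def letter_grade(score):
    """Return the grade whose high_score is the least upper bound of score."""
    row = conn.execute(
        """
        SELECT grade
          FROM Grades
         WHERE high_score = (SELECT MIN(high_score)
                               FROM Grades
                              WHERE high_score >= ?)
        """,
        (score,),
    ).fetchone()
    # A score above every high value has no upper bound in the table.
    return row[0] if row else None


if __name__ == "__main__":
    for score in (95, 89, 60, 12):
        print(score, letter_grade(score))
```

A score above the largest high value finds no row at all, which is one way a gap in the partitioning can show up.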
Likewise, a table of the state code and the highest ZIP code in that state could do the same job as the BETWEEN predicate in the previous query.

    CREATE TABLE StateZip2
    (high_zip CHAR(5) NOT NULL,
     state CHAR(2) NOT NULL,
     PRIMARY KEY (high_zip, state));

We want to write a query to give us the greatest lower bound or least upper bound on those values. The greatest lower bound (glb) operator finds the largest number in one column that is less than or equal to the target value in the other column. The least upper bound (lub) operator finds the smallest number greater than or equal to the target number. Unfortunately, this is not a good trade-off, because the subquery is fairly complex and slow. The “high and low” columns are a better solution in most cases. Here is a second version of the AddressBook query, using only the high_zip column from the StateZip2 table:

    SELECT A1.name, A1.street, A1.city, SZ.state, A1.zip
      FROM StateZip2 AS SZ, AddressBook AS A1
     WHERE SZ.state
           = (SELECT SZ2.state
                FROM StateZip2 AS SZ2
               WHERE SZ2.high_zip
                     = (SELECT MIN(SZ3.high_zip)
                          FROM StateZip2 AS SZ3
                         WHERE A1.zip <= SZ3.high_zip));

If you want to allow for multiple-row matches by not requiring that the lookup table have unique values, the equality subquery predicate should be converted to an IN() predicate.

19.1.2 Partition by Functions

It is also possible to use a function that will partition the table into subsets that share a particular property. Consider the cases where you have to add a column with the function result to the table, because the function is too complex to be reasonably written in SQL. One common example of this technique is the Soundex function, in products where it is not a vendor extension; the Soundex family assigns codes to names that are phonetically alike. The complex calculations in engineering and scientific databases that involve functions SQL does not have are another example of this technique. SQL was never meant to be a computational language.
However, many vendors allow a query to access functions in the libraries of other programming languages. You must know the cost in execution time for your product before doing this. One version of SQL uses a threaded-code approach to carry parameters over to the other language’s libraries and return the results on each row; the execution time is horrible. Some versions of SQL can compile and link another language’s library into the SQL.

Although this is a generalization, the safest technique is to unload the parameter values to a file in a standard format that can be read by the other language. Then use that file in a program to find the function results and create INSERT INTO statements that will load a table in the database with the parameters and the results. You can then use this working table to load the result column in the original table.

19.1.3 Partition by Sequences

We are looking for patterns over a history that has a sequential ordering to it. This ordering could be temporal, or via sequence numbering. For example, given a payment history, we want to break it into groupings of behavior, say, whether the payments were on time or late.

    CREATE TABLE PaymentHistory
    (payment_nbr INTEGER NOT NULL PRIMARY KEY,
     paid_on_time CHAR(1) DEFAULT 'Y' NOT NULL
       CHECK(paid_on_time IN ('Y', 'N')));

    INSERT INTO PaymentHistory
    VALUES (1006, 'Y'), (1005, 'Y'), (1004, 'N'),
           (1003, 'Y'), (1002, 'Y'), (1001, 'Y'),
           (1000, 'N');

The results we want assign a grouping number to each run of on-time/late payments, thus:

    grp  payment_nbr  paid_on_time
    ===============================
     1      1006         'Y'
     1      1005         'Y'
     2      1004         'N'
     3      1003         'Y'
     3      1002         'Y'
     3      1001         'Y'
     4      1000         'N'

The following solution from Hugo Kornelis depends on the payments always being numbered consecutively.
    SELECT (SELECT COUNT(*)
              FROM PaymentHistory AS H2, PaymentHistory AS H3
             WHERE H3.payment_nbr = H2.payment_nbr + 1
               AND H3.paid_on_time <> H2.paid_on_time
               AND H2.payment_nbr >= H1.payment_nbr) + 1 AS grp,
           payment_nbr, paid_on_time
      FROM PaymentHistory AS H1;

This is very useful when looking for patterns in a history. A more complex version of the same problem would involve more than two categories. Consider a table with a sequential numbering and a list of products that have been received. What we want is the average quality score value for a sequential grouping of the same product. For example, I need an average of entries 1, 2, and 3, because this is the first grouping of the same product type, but I do not want that average to include entry 8, which is also product A, but in a different “group.”

    CREATE TABLE ProductTests
    (batch_nbr INTEGER NOT NULL PRIMARY KEY,
     prod_code CHAR(1) NOT NULL,
     prod_quality DECIMAL(8,4) NOT NULL);

    INSERT INTO ProductTests (batch_nbr, prod_code, prod_quality)
    VALUES (1, 'A', 80), (2, 'A', 70), (3, 'A', 80),
           (4, 'B', 60), (5, 'B', 90),
           (6, 'C', 80),
           (7, 'D', 80),
           (8, 'A', 50),
           (9, 'C', 70);

The query then becomes:

    SELECT X.prod_code, MIN(X.batch_nbr) AS start_batch_nbr,
           X.end_batch_nbr, AVG(B4.prod_quality) AS avg_prod_quality
      FROM (SELECT B1.prod_code, B1.batch_nbr,
                   MAX(B2.batch_nbr) AS end_batch_nbr
              FROM ProductTests AS B1, ProductTests AS B2
             WHERE B1.batch_nbr <= B2.batch_nbr
               AND B1.prod_code = B2.prod_code
               AND B1.prod_code
                   = ALL (SELECT prod_code
                            FROM ProductTests AS B3
                           WHERE B3.batch_nbr BETWEEN B1.batch_nbr
                                                  AND B2.batch_nbr)
             GROUP BY B1.prod_code, B1.batch_nbr) AS X
           INNER JOIN
           ProductTests AS B4 -- join to get the quality measurements
           ON B4.batch_nbr = X.batch_nbr -- X has one row per batch
     GROUP BY X.prod_code, X.end_batch_nbr;

Results

    prod_code  start_batch_nbr  end_batch_nbr  avg_prod_quality
    ==========================================================
    'A'               1               3            76.6666
    'B'               4               5            75.0000
    'C'               6               6            80.0000
    'D'               7               7            80.0000
    'A'               8               8            50.0000
    'C'               9               9            70.0000

19.2 Relational Division

Relational division is one of the eight basic operations in Codd’s relational algebra. The idea is that a divisor table is used to partition a dividend table and produce a quotient or results table. The quotient table is made up of those values of one column for which a second column had all of the values in the divisor.

This is easier to explain with an example. We have a table of pilots and the planes they can fly (dividend); we have a table of planes in the hangar (divisor); we want the names of the pilots who can fly every plane in the hangar (quotient). To get this result, we divide the PilotSkills table by the planes in the hangar.

    CREATE TABLE PilotSkills
    (pilot CHAR(15) NOT NULL,
     plane CHAR(15) NOT NULL,
     PRIMARY KEY (pilot, plane));

    PilotSkills
    pilot      plane
    =========================
    'Celko'    'Piper Cub'
    'Higgins'  'B-52 Bomber'
    'Higgins'  'F-14 Fighter'
    'Higgins'  'Piper Cub'
    'Jones'    'B-52 Bomber'
    'Jones'    'F-14 Fighter'
    'Smith'    'B-1 Bomber'
    'Smith'    'B-52 Bomber'
    'Smith'    'F-14 Fighter'
    'Wilson'   'B-1 Bomber'
    'Wilson'   'B-52 Bomber'
    'Wilson'   'F-14 Fighter'
    'Wilson'   'F-17 Fighter'

    CREATE TABLE Hangar
    (plane CHAR(15) NOT NULL PRIMARY KEY);

    Hangar
    plane
    =============
    'B-1 Bomber'
    'B-52 Bomber'
    'F-14 Fighter'

    PilotSkills DIVIDED BY Hangar
    pilot
    =============================
    'Smith'
    'Wilson'

In this example, Smith and Wilson are the two pilots who can fly everything in the hangar. Notice that Higgins and Celko know how to fly a Piper Cub, but we don’t have one right now. In Codd’s original definition of relational division, having more rows than are called for is not a problem.

The important characteristic of a relational division is that the CROSS JOIN of the divisor and the quotient produces a valid subset of rows from the dividend.
This is where the name comes from, since the CROSS JOIN acts like a multiplication operator.

19.2.1 Division with a Remainder

There are two kinds of relational division. Division with a remainder allows the dividend table to have more values than the divisor, which was Dr. Codd’s original definition. For example, if a pilot can fly more planes than just those we have in the hangar, this is fine with us. The query can be written as

    SELECT DISTINCT pilot
      FROM PilotSkills AS PS1
     WHERE NOT EXISTS
           (SELECT *
              FROM Hangar
             WHERE NOT EXISTS
                   (SELECT *
                      FROM PilotSkills AS PS2
                     WHERE (PS1.pilot = PS2.pilot)
                       AND (PS2.plane = Hangar.plane)));

The quickest way to explain what is happening in this query is to imagine a World War II movie, where a cocky pilot has just walked into the hangar, looked over the fleet, and announced, “There ain’t no plane in this hangar that I can’t fly!” We want to find the pilots for whom there does not exist a plane in the hangar for which they have no skills. The use of the NOT EXISTS() predicates is for speed. Most SQL implementations will look up a value in an index rather than scan the whole table.

This query for relational division was made popular by Chris Date in his textbooks, but it is neither the only method nor always the fastest. Another version of the division can be written so as to avoid three levels of nesting. While it didn’t originate with me, I have made it popular in my books.

    SELECT PS1.pilot
      FROM PilotSkills AS PS1, Hangar AS H1
     WHERE PS1.plane = H1.plane
     GROUP BY PS1.pilot
    HAVING COUNT(PS1.plane) = (SELECT COUNT(plane) FROM Hangar);

There is a serious difference between the two methods. Burn down the hangar, so that the divisor is empty. Because of the NOT EXISTS() predicates in Date’s query, all pilots are returned from a division by an empty set. Because of the COUNT() functions in my query, no pilots are returned from a division by an empty set.
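The difference on an empty divisor is easy to check for yourself. The sketch below is only an illustration: it assumes an in-memory SQLite database driven from Python's sqlite3 module, loads the PilotSkills and Hangar rows shown earlier, and runs both division queries before and after the hangar is emptied. The names NESTED, COUNTED, and pilots are invented for this example, and an ORDER BY has been added to each query solely to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PilotSkills
(pilot CHAR(15) NOT NULL,
 plane CHAR(15) NOT NULL,
 PRIMARY KEY (pilot, plane));

CREATE TABLE Hangar
(plane CHAR(15) NOT NULL PRIMARY KEY);

INSERT INTO PilotSkills (pilot, plane)
VALUES ('Celko', 'Piper Cub'),
       ('Higgins', 'B-52 Bomber'), ('Higgins', 'F-14 Fighter'),
       ('Higgins', 'Piper Cub'),
       ('Jones', 'B-52 Bomber'), ('Jones', 'F-14 Fighter'),
       ('Smith', 'B-1 Bomber'), ('Smith', 'B-52 Bomber'),
       ('Smith', 'F-14 Fighter'),
       ('Wilson', 'B-1 Bomber'), ('Wilson', 'B-52 Bomber'),
       ('Wilson', 'F-14 Fighter'), ('Wilson', 'F-17 Fighter');

INSERT INTO Hangar (plane)
VALUES ('B-1 Bomber'), ('B-52 Bomber'), ('F-14 Fighter');
""")

# Nested NOT EXISTS version of division with a remainder.
NESTED = """
SELECT DISTINCT pilot
  FROM PilotSkills AS PS1
 WHERE NOT EXISTS
       (SELECT *
          FROM Hangar
         WHERE NOT EXISTS
               (SELECT *
                  FROM PilotSkills AS PS2
                 WHERE PS1.pilot = PS2.pilot
                   AND PS2.plane = Hangar.plane))
 ORDER BY pilot
"""

# COUNT() version of division with a remainder.
COUNTED = """
SELECT PS1.pilot
  FROM PilotSkills AS PS1, Hangar AS H1
 WHERE PS1.plane = H1.plane
 GROUP BY PS1.pilot
HAVING COUNT(PS1.plane) = (SELECT COUNT(plane) FROM Hangar)
 ORDER BY PS1.pilot
"""


def pilots(sql):
    """Run one of the division queries and return the pilot names."""
    return [row[0] for row in conn.execute(sql)]


full_nested = pilots(NESTED)
full_counted = pilots(COUNTED)

# Burn down the hangar: the divisor is now empty.
conn.execute("DELETE FROM Hangar")
empty_nested = pilots(NESTED)
empty_counted = pilots(COUNTED)

if __name__ == "__main__":
    print("full hangar: ", full_nested, full_counted)
    print("empty hangar:", empty_nested, empty_counted)
```

Which behavior is right for an empty divisor depends on what the question means in your application, so it is worth deciding that before choosing a form of the query.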
In the sixth edition of his book, Introduction to Database Systems (Date 1995), Chris Date defined another operator (DIVIDEBY PER), which produces the same results as my query, but with more complexity.

19.2.2 Exact Division

The second kind of relational division is exact relational division. The dividend table must match the values of the divisor exactly, without any extra values.

    SELECT PS1.pilot
      FROM PilotSkills AS PS1
           LEFT OUTER JOIN
           Hangar AS H1
           ON PS1.plane = H1.plane
     GROUP BY PS1.pilot
    HAVING COUNT(PS1.plane) = (SELECT COUNT(plane) FROM Hangar)
       AND COUNT(H1.plane) = (SELECT COUNT(plane) FROM Hangar);

This query stipulates that a pilot must have the same number of certificates as there are planes in the hangar, and that these certificates all match to a plane in the hangar, not something else. The “something else” is shown by a created NULL from the LEFT OUTER JOIN.

Please do not make the mistake of trying to reduce the HAVING clause with a little algebra to:

    HAVING COUNT(PS1.plane) = COUNT(H1.plane)

It does not work; it tells you only that every plane the pilot is certified for is in the hangar, not that the pilot’s planes and the hangar’s planes are the same set.

19.2.3 Note on Performance

The nested EXISTS() predicates version of relational division was made popular by Chris Date’s textbooks, while the author is associated with popularizing the COUNT(*) version of relational division. The Winter 1996 edition of DB2 Magazine had an article entitled “Powerful SQL: Beyond the Basics” by Sheryl Larsen (www.db2mag.com/db_area/archives/1996/q4/9601lar.shtml), which gave the results of testing both methods. Her conclusion for DB2 was that the nested EXISTS() version is better when the quotient has fewer than 25% of the dividend table’s rows, and the COUNT(*) version is better when the quotient is more than 25% of the dividend table.

On the other hand, Matthew W.
Spaulding at SnapOn Tools reported his test on SQL Server 2000 with the opposite results. He had a table with two million rows for the dividend and around 1,000 rows in the divisor, yielding a quotient of around 1,000 rows as well. The COUNT method completed in well under one second, whereas the nested NOT EXISTS query took roughly five seconds to run. The moral of the story is to test both methods on your particular product.

19.2.4 Todd’s Division

A relational division operator proposed by Stephen Todd is defined on two tables with common columns that are joined together, dropping the JOIN column and retaining only those non-JOIN columns that meet a criterion.

We are given a table, JobParts(job_nbr, part_nbr), and another table, SupParts(sup_nbr, part_nbr), of suppliers and the parts that they provide. We want to get the supplier-and-job pairs such that supplier s_n supplies all of the parts needed for job j_n. This is not quite the same thing as getting the supplier-and-job pairs such that job j_n requires all of the parts provided by supplier s_n.

You want to divide the JobParts table by the SupParts table. A rule of thumb: the remainder comes from the dividend, but all values in the divisor are present.
    JobParts          SupParts          Result = JobSups
    job   pno         sno   pno         job   sno
    ==========        ==========        ==========
    'j1'  'p1'        's1'  'p1'        'j1'  's1'
    'j1'  'p2'        's1'  'p2'        'j1'  's2'
    'j2'  'p2'        's1'  'p3'        'j2'  's1'
    'j2'  'p4'        's1'  'p4'        'j2'  's4'
    'j2'  'p5'        's1'  'p5'        'j3'  's1'
    'j3'  'p2'        's1'  'p6'        'j3'  's2'
                      's2'  'p1'        'j3'  's3'
                      's2'  'p2'        'j3'  's4'
                      's3'  'p2'
                      's4'  'p2'
                      's4'  'p4'
                      's4'  'p5'

Pierre Mullin submitted the following query to carry out the Todd division, shown here with the column names used in the tables above:

    SELECT DISTINCT JP1.job, SP1.sno
      FROM JobParts AS JP1, SupParts AS SP1
     WHERE NOT EXISTS
           (SELECT *
              FROM JobParts AS JP2
             WHERE JP2.job = JP1.job
               AND JP2.pno
                   NOT IN (SELECT SP2.pno
                             FROM SupParts AS SP2
                            WHERE SP2.sno = SP1.sno));

This is really a modification of the query for Codd’s division, extended to use a JOIN on both tables in the outermost SELECT statement. The NOT IN predicate in the second subquery can be replaced with a NOT EXISTS predicate; it might run a bit faster, depending on the optimizer.

Another related query is finding the pairs of suppliers who sell the same parts. In this data, that would be the pairs (s1, p2), (s3, p1), (s4, p1), (s5, p1):

    SELECT S1.sup, S2.sup
      FROM SupParts AS S1, SupParts AS S2