CHAPTER 27: SUBSETS

       = (SELECT MIN(keycol) + (MAX(keycol) - MIN(keycol)) * RANDOM()
            FROM LotteryDrawing AS L2);

Here is a version that uses the COUNT(*) function and a self-join instead.

SELECT L1.*
  FROM LotteryDrawing AS L1
 WHERE CEILING((SELECT COUNT(*) FROM LotteryDrawing) * RANDOM())
       = (SELECT COUNT(*)
            FROM LotteryDrawing AS L2
           WHERE L1.keycol <= L2.keycol);

The rounding away from zero is important, since we are in effect numbering the rows from one. The idea is to use the decimal fraction to hit the row that is that far into the table when the rows are ordered by the key.

Having shown you this code, I have to warn you that the pure SQL has a good number of self-joins, and they will be expensive to run.

27.3 The CONTAINS Operators

Set theory has two symbols for subsets. One, ⊂, means that set A is contained within set B; this is sometimes said to denote a proper subset. The other, ⊆, means “is contained in or equal to,” and is sometimes called just a subset or containment operator.

Standard SQL has never had an operator to compare tables against each other for equality or containment. Several college textbooks on relational databases mention a CONTAINS predicate, which does not exist in Standard SQL. This predicate existed in the original System R, IBM’s first experimental SQL system, but it was dropped from later SQL implementations because of the expense of running it.

27.3.1 Proper Subset Operators

The IN predicate is a test for membership. For those of you who remember your high school set theory, membership is shown with a stylized epsilon with the containing set on the right side: a ∈ A. Membership is for one element, whereas a subset is itself a set, not just an element.

As an example of a subset predicate, consider a query to tell you the names of each employee who works on all of the projects in department 5. Using the System R syntax:

SELECT name  -- Not valid SQL!
  FROM Personnel
 WHERE (SELECT project_nbr
          FROM WorksOn
         WHERE Personnel.emp_nbr = WorksOn.emp_nbr)
       CONTAINS
       (SELECT project_nbr
          FROM Projects
         WHERE dept_nbr = 5);

In the second SELECT statement of the CONTAINS predicate, we build a table of all the projects in department 5. In the first SELECT statement of the CONTAINS predicate, we have a correlated subquery that will build a table of all the projects each employee works on. If the table of the employee’s projects is equal to or a superset of the department 5 table, the predicate is TRUE.

You must first decide what you are going to do about duplicate rows in either or both tables. That is, does the set { a, b, c } contain the multiset { a, b, b } or not? Some SQL set operations, such as SELECT and UNION, have options to remove or keep duplicates from the results (e.g., UNION ALL and SELECT DISTINCT).

I would argue that duplicates should be ignored, and that the multiset is a subset of the other. For our example, let us use a table of employees and another table with the names of the company bowling team members, which should be a proper subset of the Personnel table.

For the bowling team to be contained in the set of employees, each bowler must be an employee; or, to put it another way, there must be no bowler who is not an employee.

NOT EXISTS
(SELECT *
   FROM Bowling AS B1
  WHERE B1.emp_nbr NOT IN (SELECT emp_nbr FROM Personnel))

27.3.2 Table Equality

How can I find out if two tables are equal to each other? This is a common programming problem, and the specification sounds obvious. When two sets, A and B, are equal, then we know that:

1. Both have the same number of elements
2. No elements in A are not in B
3. No elements in B are not in A
4. Set A is equal to the intersection of A and B
5. Set B is equal to the intersection of A and B
6. Set B is a subset of A
7. Set A is a subset of B

as well as probably a few other things vaguely remembered from an old math class.
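Facts 6 and 7 in the list say that equality is simply containment in both directions, so the NOT EXISTS containment test used for the bowling team can be applied twice. The following is only a minimal sketch: it assumes Python's built-in sqlite3 module as a stand-in SQL engine, and the sample rows and the helper name is_contained are hypothetical, chosen for illustration. It also assumes emp_nbr is declared NOT NULL, since NOT IN behaves differently once NULLs appear.

```python
import sqlite3

# In-memory database with hypothetical sample rows (not from the text).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Personnel (emp_nbr INTEGER NOT NULL PRIMARY KEY);
    CREATE TABLE Bowling   (emp_nbr INTEGER NOT NULL PRIMARY KEY);
    INSERT INTO Personnel VALUES (1), (2), (3), (4);
    INSERT INTO Bowling   VALUES (2), (4);
""")


def is_contained(inner, outer):
    """Return True when no emp_nbr in `inner` is missing from `outer`."""
    # Table names are interpolated only because they are fixed,
    # trusted literals in this sketch; never do this with user input.
    sql = f"""
        SELECT NOT EXISTS
               (SELECT *
                  FROM {inner} AS I
                 WHERE I.emp_nbr NOT IN (SELECT emp_nbr FROM {outer}))
    """
    return bool(conn.execute(sql).fetchone()[0])


bowling_in_personnel = is_contained("Bowling", "Personnel")
personnel_in_bowling = is_contained("Personnel", "Bowling")

# Equality is containment both ways; a proper subset is one way only.
tables_equal = bowling_in_personnel and personnel_in_bowling
proper_subset = bowling_in_personnel and not personnel_in_bowling
```

With these sample rows the bowling team is a proper subset of Personnel, so the two tables are not equal.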
But equality is not as easy as it sounds in SQL, because the language is based on multisets or bags, which allow duplicate elements, and the language has NULLs. Given this list of multisets, which pairs are equal to each other?

S0 = {a, b, c}
S1 = {a, b, NULL}
S2 = {a, b, b, c, c}
S3 = {a, b, NULL}
S4 = {a, b, c}
S5 = {x, y, z}

Everyone will agree that S0 = S4, because they are identical. Everyone will agree that S5 is not equal to any other set because it has no elements in common with any of them. How do you handle redundant duplicates? If you ignore them, then S0 = S2. Should NULLs be given the benefit of the doubt and matched to any known value or not? If so, then S0 = S1 and S0 = S3. But then do you want to say that S1 = S3 because we can pair up the NULLs with each other?

To make matters even worse: are two rows equal if they match on just their keys, on a particular subset of their columns, or on all their columns? The reason this question comes up in practice is that you often have to match up data from two sources that have slightly different versions of the same information (i.e., “Joe F. Celko” and “Joseph Frank Celko” are probably the same person).

The good part about matching things on the keys is that you do have a true set: keys are unique and cannot have NULLs. If you go back to the list of set equality tests that I gave at the start of this article, you can see some possible ways to code a solution.

If you use facts 2 and 3 in the list, then you might use NOT EXISTS() predicates.
WHERE NOT EXISTS
      (SELECT *
         FROM A
        WHERE A.keycol
              NOT IN (SELECT keycol
                        FROM B
                       WHERE A.keycol = B.keycol))
  AND NOT EXISTS
      (SELECT *
         FROM B
        WHERE B.keycol
              NOT IN (SELECT keycol
                        FROM A
                       WHERE A.keycol = B.keycol))

This query can also be written as:

WHERE NOT EXISTS
      ((SELECT * FROM A
        EXCEPT [ALL]
        SELECT * FROM B)
       UNION
       (SELECT * FROM B
        EXCEPT [ALL]
        SELECT * FROM A))

The use of the optional EXCEPT ALL operators will determine how duplicates are handled.

However, if you look at 1, 4, and 5, you might come up with this answer:

WHERE (SELECT COUNT(*) FROM A)
      = (SELECT COUNT(*)
           FROM A INNER JOIN B
                ON A.keycol = B.keycol)
  AND (SELECT COUNT(*) FROM B)
      = (SELECT COUNT(*)
           FROM A INNER JOIN B
                ON A.keycol = B.keycol)

The following query will produce a list of the unmatched values; you might want to keep them in two columns instead of coalescing them as I have shown here.

SELECT DISTINCT COALESCE(A.keycol, B.keycol) AS non_matched_key
  FROM A FULL OUTER JOIN B
       ON A.keycol = B.keycol
 WHERE A.keycol IS NULL
    OR B.keycol IS NULL;

Eventually, you will be able to handle this with the INTERSECT [ALL] and UNION [ALL] operators in Standard SQL and tune the query to whatever definition of equality you wish to use.

Unfortunately, these examples are for just comparing the keys. What do we do if we have tables without keys, or if we want to compare all the columns?

GROUP BY, DISTINCT, and a few other things in SQL treat NULLs as if they were equal to each other. This is probably the definition of equality we would like to use.

Remember that if one table has more columns or more rows than the other, we can stop right there, since they cannot possibly be equal under that definition. We have to assume that the tables have the same number of columns, of the same type, and in the same positions. But row counts look useful.

Imagine that there are two children, each with a bag of candy.
To determine that both bags are identical, the first child can start by pulling a piece of candy out and asking the other, “How many red ones do you have?” If the two counts disagree, we know that the bags are different. Now ask about the green pieces. We do not have to match each particular piece of candy in one bag with a particular piece of candy in the other bag. The counts are enough information only if they differ. If the counts are the same, more work needs to be done. We could each have one brown piece of candy, but mine could be an M&M, and yours could be a malted milk ball.

Now, generalize that idea. Let’s combine the two tables into one big table, with an extra column, x0, to show from where each row originally came. Now form groups based on all the original columns. Within each group, count the number of rows from one table and the number of rows from the second table. If the counts are different, there are unmatched rows.

This will handle redundant duplicate rows within one table. This query does not require that the tables have keys. The assumption in a GROUP BY clause is that all NULLs are treated as if they were equals. Here is the final query.

-- COUNT() skips NULLs but would count a zero, so each CASE
-- returns NULL for rows that came from the other table.
SELECT x1, x2, ..., xn,
       COUNT(CASE WHEN x0 = 'A' THEN 1 ELSE NULL END) AS a_tally,
       COUNT(CASE WHEN x0 = 'B' THEN 1 ELSE NULL END) AS b_tally
  FROM (SELECT 'A', A.* FROM A
        UNION ALL
        SELECT 'B', B.* FROM B) AS X (x0, x1, x2, ..., xn)
 GROUP BY x1, x2, ..., xn
HAVING COUNT(CASE WHEN x0 = 'A' THEN 1 ELSE NULL END)
    <> COUNT(CASE WHEN x0 = 'B' THEN 1 ELSE NULL END);

You might want to think about the differences that changing the expression for the derived table X can make. If you use a UNION instead of a UNION ALL, then the row count for each group in both tables will be at most one. If you use a SELECT DISTINCT instead of a SELECT, then the row count in just that table will be at most one for each group.
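The tally comparison can be tried on two small keyless tables. This is a sketch under stated assumptions rather than a definitive implementation: it uses Python's built-in sqlite3 module as the SQL engine, the column names c1 and c2 and all sample rows are hypothetical, and the derived table's columns are named by aliases in the first SELECT because SQLite does not accept a column list after the derived table's alias.

```python
import sqlite3

# Two keyless tables with hypothetical rows; NULLs and duplicates allowed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (c1 TEXT, c2 INTEGER);
    CREATE TABLE B (c1 TEXT, c2 INTEGER);
    INSERT INTO A VALUES ('a', 1), ('b', 2), ('b', 2), (NULL, 3);
    INSERT INTO B VALUES ('a', 1), ('b', 2), (NULL, 3), ('c', 4);
""")

# COUNT() skips NULLs, so each CASE yields NULL for rows that came
# from the other table; an ELSE 0 would be counted like any other value.
TALLY_SQL = """
    SELECT x1, x2,
           COUNT(CASE WHEN x0 = 'A' THEN 1 ELSE NULL END) AS a_tally,
           COUNT(CASE WHEN x0 = 'B' THEN 1 ELSE NULL END) AS b_tally
      FROM (SELECT 'A' AS x0, c1 AS x1, c2 AS x2 FROM A
            UNION ALL
            SELECT 'B', c1, c2 FROM B) AS X
     GROUP BY x1, x2
    HAVING COUNT(CASE WHEN x0 = 'A' THEN 1 ELSE NULL END)
        <> COUNT(CASE WHEN x0 = 'B' THEN 1 ELSE NULL END)
     ORDER BY x1, x2
"""

# Each returned row is a group whose tallies differ between A and B.
mismatches = conn.execute(TALLY_SQL).fetchall()

# The tables are equal as multisets only when no group disagrees.
tables_equal = not mismatches
```

Here the group ('b', 2) appears twice in A and once in B, and ('c', 4) appears only in B, so both are reported. The (NULL, 3) rows fall into one group and match, which is the GROUP BY treatment of NULLs described above.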
27.3.3 Subset Equality

A surprisingly usable version of set equality is finding identical subsets within the same table. These identical subsets can build partitions that are known as equivalence classes in set theory. Let’s use Chris Date’s suppliers-and-parts table to find pairs of suppliers who provide exactly the same parts; that is, the set of parts from one supplier is equal to the set of parts from the other supplier.

CREATE TABLE SupParts
(sup_nbr CHAR(2) NOT NULL,
 part_nbr CHAR(2) NOT NULL,
 PRIMARY KEY (sup_nbr, part_nbr));

The usual way of proving that two sets are equal is to show that set A contains set B and set B contains set A. Any of the methods given above can be modified to handle two copies of the same table under aliases.

Instead, consider another approach. First JOIN one supplier to another on their common parts, eliminating the situation where the first supplier is also the second supplier, so that you have the intersection of the two subsets. If the intersection has the same number of pairs as each of the two subsets has elements, the two subsets are equal.

SELECT SP1.sup_nbr, SP2.sup_nbr, COUNT(*) AS part_count
  FROM SupParts AS SP1
       INNER JOIN
       SupParts AS SP2
       ON SP1.part_nbr = SP2.part_nbr
          AND SP1.sup_nbr < SP2.sup_nbr
 GROUP BY SP1.sup_nbr, SP2.sup_nbr
HAVING COUNT(*) = (SELECT COUNT(*)
                     FROM SupParts AS SP3
                    WHERE SP3.sup_nbr = SP1.sup_nbr)
   AND COUNT(*) = (SELECT COUNT(*)
                     FROM SupParts AS SP4
                    WHERE SP4.sup_nbr = SP2.sup_nbr);

If there is an index on the supplier number in the SupParts table, it can provide the counts directly, as well as helping with the JOIN operation.

The only problem with this answer is that it is hard to see the groups of suppliers among the pairs. The part_count column helps a bit, but it does not assign a grouping identifier to the rows.

27.4 Picking a Representative Subset

This problem and its solution are due to Ross Presser.
The problem is to find a subset of rows such that each value in each of two columns appears in at least one row. The purpose is to produce a set of samples from a large table. The table has a club_name column and an ifc column; I want a set of samples that contains at least one of each club_name and at least one of each ifc, but no more than necessary.

CREATE TABLE Memberships
(member_id INTEGER NOT NULL PRIMARY KEY,
 club_name CHAR(7) NOT NULL,
 ifc CHAR(4) NOT NULL);

CREATE TABLE Samples
(member_id INTEGER NOT NULL PRIMARY KEY,
 club_name CHAR(7) NOT NULL,
 ifc CHAR(4) NOT NULL);

INSERT INTO Memberships
VALUES (6401715, 'aarprat', 'ic17'),
       (1058337, 'aarprat', 'ic17'),
       (459443, 'aarpprt', 'ic25'),
       (4018210, 'aarpbas', 'ig21'),
       (2430656, 'aarpbas', 'ig21'),
       (6802081, 'aarpprd', 'ig29'),
       (4236511, 'aarpprd', 'ig29'),
       (2162104, 'aarpbas', 'ig21'),
       (2073679, 'aarpprd', 'ig29'),
       (8148891, 'aarpbas', 'ig21'),
       (1868445, 'aarpbas', 'ig21'),
       (6749213, 'aarpbas', 'ig21'),
       (8363621, 'aarppup', 'ig29'),
       (9999, 'aarppup', 'ic17');  -- redundant

To help frame the problem better, consider this subset, which has a row with both a redundant club_name value and ifc value.

Non-Minimal subset
member_id  club_name  ifc
=========================
     9999  aarppup    ic17  <== redundant row
  1058337  aarprat    ic17  <== ifc
   459443  aarpprt    ic25
  1868445  aarpbas    ig21
  2073679  aarpprd    ig29
  8363621  aarppup    ig29  <== club_name

There can be more than one minimal solution. But we would be happy to simply find a near-minimal solution.

David Portas came up with a query that gives a near-minimal solution. This will produce a sample containing at least one row of each value in the two columns. It isn’t guaranteed to give the minimum subset, but it should contain at most (c + i − 1) rows, where (c) is the number of distinct clubs and (i) the number of distinct ifcs.
SELECT member_id, club_name, ifc
  FROM Memberships AS M
 WHERE member_id
       IN (SELECT MIN(member_id)
             FROM Memberships
            GROUP BY club_name
           UNION ALL
           SELECT MIN(member_id)
             FROM Memberships AS M2
            GROUP BY ifc
           HAVING NOT EXISTS
                  (SELECT *
                     FROM Memberships
                    WHERE member_id
                          IN (SELECT MIN(member_id)
                                FROM Memberships
                               GROUP BY club_name)
                      AND ifc = M2.ifc));

I am not sure it’s possible to find the minimum subset every time, unless you use an iterative solution. The results are very dependent on the exact data involved.

Ross Presser’s iterative solution used the six-step system below, and found that the number of rows resulting depended on both the order of the insert queries and on whether we used MAX() or MIN(). That said, the resulting row count only varied from 403 to 410 rows on a real run of 52,776 invoices for a set where (c = 325) and (i = 117). Portas’s query gave a result of 405 rows, which is worse but not fatally worse.

-- first step: unique clubs
INSERT INTO Samples (member_id, club_name, ifc)
SELECT MIN(member_id), club_name, MIN(ifc)
  FROM Memberships
 GROUP BY club_name
HAVING COUNT(*) = 1;

-- second step: unique ifcs where club_name not already there
INSERT INTO Samples (member_id, club_name, ifc)
SELECT MIN(Memberships.member_id), MIN(Memberships.club_name),
       Memberships.ifc
  FROM Memberships
 GROUP BY Memberships.ifc
HAVING MIN(Memberships.club_name)
       NOT IN (SELECT club_name FROM Samples)
   AND COUNT(*) = 1;

-- intermezzo: views for missing ifcs, missing clubs
CREATE VIEW MissingClubs (club_name)
AS
SELECT Memberships.club_name
  FROM Memberships
       LEFT OUTER JOIN
       Samples
       ON Memberships.club_name = Samples.club_name
 WHERE Samples.club_name IS NULL
 GROUP BY Memberships.club_name;

CREATE VIEW MissingIfcs (ifc)
AS
SELECT Memberships.ifc
  FROM Memberships
       LEFT OUTER JOIN
       Samples
       ON Memberships.ifc = Samples.ifc
 WHERE Samples.ifc IS NULL
 GROUP BY Memberships.ifc;

-- third step: distinct missing clubs that are also missing ifcs
INSERT INTO Samples
       (member_id, club_name, ifc)
SELECT MIN(Memberships.member_id),
       Memberships.club_name,
       MIN(Memberships.ifc)
  FROM Memberships, MissingClubs, MissingIfcs
 WHERE Memberships.club_name = MissingClubs.club_name
   AND Memberships.ifc = MissingIfcs.ifc
 GROUP BY Memberships.club_name;