Joe Celko's SQL for Smarties - Advanced SQL Programming P63


CHAPTER 26: SET OPERATIONS

For the rest of this discussion, let us create two tables with the same structure, which we can use for examples.

    CREATE TABLE S1 (a1 CHAR(1));
    INSERT INTO S1 VALUES ('a'), ('a'), ('b'), ('b'), ('c');

    CREATE TABLE S2 (a2 CHAR(1));
    INSERT INTO S2 VALUES ('a'), ('b'), ('b'), ('b'), ('c'), ('d');

26.1 UNION and UNION ALL

UNIONs have been supported since SQL-86 with this infixed syntax:

    <table expression> UNION [ALL] <table expression>

The two versions of the UNION statement take two tables and build a result table from them. The two tables must be union-compatible, which means that they have exactly the same number of columns, and that each column in the first table has the same data type as (or can be automatically cast to the type of) the column in the same position in the second table. That is, their rows must have the same structure, so they can be put in the same final result table. Most implementations will do some data type conversions to create the result table, but this can depend on your implementation, and you should check it out for yourself.

There are two forms of the UNION statement: the UNION and the UNION ALL. The simple UNION is the same operator you had in high school set theory; it returns the rows that appear in either or both tables and removes redundant duplicates from the result table. The phrase "redundant duplicates" sounds funny, but it means that you leave one copy of the row in the table. The sample tables will yield:

    (SELECT a1 FROM S1
     UNION
     SELECT a2 FROM S2)
    ============
    'a'
    'b'
    'c'
    'd'

Many early SQL implementations did this removal by merge-sorting the two tables and discarding duplicates during the sort. This had the side effect that the result table was sorted, but you could not depend on that. Later implementations use hashing, indexing, and parallel processing to find the duplicates.

The UNION ALL preserves the duplicates from both tables in the result table.
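The difference between the two operators can be checked directly on the sample tables. The following is a minimal sketch that assumes Python's built-in `sqlite3` module as a stand-in SQL engine (the chapter itself names no product); the helper `column` is a hypothetical convenience, and the `ORDER BY` is added only to make the printed output deterministic, since neither form of UNION guarantees an ordering.

```python
import sqlite3

# In-memory database holding the chapter's two sample tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE S1 (a1 CHAR(1));
    INSERT INTO S1 VALUES ('a'), ('a'), ('b'), ('b'), ('c');
    CREATE TABLE S2 (a2 CHAR(1));
    INSERT INTO S2 VALUES ('a'), ('b'), ('b'), ('b'), ('c'), ('d');
""")

def column(sql):
    """Hypothetical helper: run a query and return its first column as a list."""
    return [row[0] for row in conn.execute(sql)]

# UNION removes redundant duplicates; UNION ALL keeps every row from both tables.
union_rows = column("SELECT a1 FROM S1 UNION SELECT a2 FROM S2 ORDER BY 1")
union_all_rows = column("SELECT a1 FROM S1 UNION ALL SELECT a2 FROM S2 ORDER BY 1")

print(union_rows)           # ['a', 'b', 'c', 'd']
print(len(union_all_rows))  # 11, that is, 5 rows from S1 plus 6 rows from S2
```

Other engines may differ in how they coerce column types across the two operands, so this is worth repeating on whatever product is actually in use.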
Most early implementations simply appended one table to the other in physical storage. They used file systems based on physically contiguous storage, so this was easy and used the file system code. But, again, you cannot depend on any ordering in the results of either version of the UNION statement. Again, the sample tables will yield:

    (SELECT a1 FROM S1
     UNION ALL
     SELECT a2 FROM S2)
    ====
    'a'
    'a'
    'a'
    'b'
    'b'
    'b'
    'b'
    'b'
    'c'
    'c'
    'd'

You can assign names to the columns by using the AS operator to make the result set into a derived table, thus:

    SELECT rent, utilities, phone
      FROM (SELECT a, b, c
              FROM OldLocations
             WHERE city = 'Boston'
            UNION
            SELECT x, y, z
              FROM NewLocations
             WHERE city = 'New York')
           AS Cities (rent, utilities, phone);

A few SQL products will attempt to optimize UNIONs if they are made on the same table. Those UNIONs can often be replaced with ORed predicates. For example:

    SELECT city_name, 'Western'
      FROM Cities
     WHERE market_code = 't'
    UNION ALL
    SELECT city_name, 'Eastern'
      FROM Cities
     WHERE market_code = 'v';

This could be rewritten (probably more efficiently) as:

    SELECT city_name,
           CASE market_code
                WHEN 't' THEN 'Western'
                WHEN 'v' THEN 'Eastern' END
      FROM Cities
     WHERE market_code IN ('v', 't');

A system architecture based on domains rather than tables is necessary to optimize UNIONs if they are made on different tables. Doing a UNION to the same table is the same as a SELECT DISTINCT, but the SELECT DISTINCT will probably run faster and preserve the column names too.

26.1.1 Order of Execution

UNION and UNION ALL operators are executed from left to right, unless parentheses change the order of execution. Since the UNION operator is associative and commutative, the order of a chain of UNIONs will not affect the results. However, order and grouping can affect performance. Consider two small tables that have many duplicates between them.
If the optimizer does not consider table sizes, consider this query:

    (SELECT * FROM SmallTable1)
    UNION
    (SELECT * FROM BigTable)
    UNION
    (SELECT * FROM SmallTable2);

It will merge SmallTable1 into BigTable, then merge SmallTable2 into that first result. If the rows of SmallTable1 are spread out in the first result table, locating duplicates from SmallTable2 will take longer than if we had written the query thus:

    (SELECT * FROM SmallTable1)
    UNION
    (SELECT * FROM SmallTable2)
    UNION
    (SELECT * FROM BigTable);

Again, optimization of UNIONs is highly product-dependent, so you should experiment with it.

26.1.2 Mixed UNION and UNION ALL Operators

If you know that there are no duplicates, or that duplicates are not a problem in your situation, use UNION ALL instead of UNION for speed. For example, if we are sure that BigTable has no duplicates in common with SmallTable1 and SmallTable2, this query will produce the same results as before, but should run much faster:

    ((SELECT * FROM SmallTable1)
     UNION
     (SELECT * FROM SmallTable2))
    UNION ALL
    (SELECT * FROM BigTable);

But be careful when mixing UNION and UNION ALL operators. The left-to-right order of execution will cause the last operator in the chain to have an effect on the results. For example, a trailing UNION removes duplicates from everything accumulated so far, including rows that an earlier UNION ALL preserved.

26.1.3 UNION of Columns from the Same Table

A useful trick for building the union of columns from the same table is to use a CROSS JOIN and a CASE expression:

    SELECT CASE WHEN S1.seq_nbr = 1 THEN F1.col1
                WHEN S1.seq_nbr = 2 THEN F1.col2
                ELSE NULL END
      FROM Foobar AS F1
           CROSS JOIN
           Sequence AS S1(seq_nbr)
     WHERE S1.seq_nbr IN (1, 2);

This query acts like the UNION ALL statement, but change the SELECT to SELECT DISTINCT and you have a UNION. The advantage of this statement over the more obvious UNION is that it makes only one pass through the table. If you are working with a large table, that can be important for good performance.
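The equivalence claimed for the CROSS JOIN and CASE trick can be checked on a small table. This is a sketch under stated assumptions: Python's built-in `sqlite3` module stands in for the SQL engine, the contents of `Foobar` and `Sequence` are made up for illustration, and the `AS S1(seq_nbr)` column-alias list is dropped because SQLite does not accept it (the column is simply named `seq_nbr` in the table itself). It compares results only; whether the engine really makes a single pass is product-dependent and is not measured here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical data: the rows below are invented for this illustration.
    CREATE TABLE Foobar (col1 INTEGER, col2 INTEGER);
    INSERT INTO Foobar VALUES (1, 2), (2, 3), (3, 3);
    CREATE TABLE Sequence (seq_nbr INTEGER PRIMARY KEY);
    INSERT INTO Sequence VALUES (1), (2), (3);
""")

# The trick from the text; {distinct} is filled with '' or 'DISTINCT'.
CASE_TRICK = """
    SELECT {distinct} CASE WHEN S1.seq_nbr = 1 THEN F1.col1
                           WHEN S1.seq_nbr = 2 THEN F1.col2
                           ELSE NULL END AS val
      FROM Foobar AS F1
           CROSS JOIN Sequence AS S1
     WHERE S1.seq_nbr IN (1, 2)
     ORDER BY val
"""

def column(sql):
    """Hypothetical helper: run a query and return its first column as a list."""
    return [row[0] for row in conn.execute(sql)]

trick_all = column(CASE_TRICK.format(distinct=""))
trick_distinct = column(CASE_TRICK.format(distinct="DISTINCT"))
plain_union_all = column(
    "SELECT col1 FROM Foobar UNION ALL SELECT col2 FROM Foobar ORDER BY 1")
plain_union = column(
    "SELECT col1 FROM Foobar UNION SELECT col2 FROM Foobar ORDER BY 1")

print(trick_all)       # [1, 2, 2, 3, 3, 3]
print(trick_distinct)  # [1, 2, 3]
```

Each Foobar row is paired with sequence numbers 1 and 2, and the CASE expression picks a different column for each pairing, which is why the row count doubles just as it does under UNION ALL.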
26.2 INTERSECT and EXCEPT

Intersection and set difference are part of Standard SQL, but few products have implemented them yet. The INTERSECT and EXCEPT set operators take two tables and build a new table from them. The two tables must be "union-compatible," which means that they have the same number of columns, and that each column in the first table has the same data type (or automatically casts to it) as the column in the same position in the second table. That is, their rows have the same structure, so they can be put in the same final result table. Most implementations will do some data type conversions to create the result table, but this is very implementation dependent, and you should check it out for yourself.

Like the UNION, the result of an INTERSECT or EXCEPT should use an AS operator if you want to have names for the result table and its columns. Oracle was the first major vendor to have the EXCEPT operator with the keyword MINUS.

The set difference is the rows in the first table, except for those that also appear in the second table. It answers requests like "Give me all the employees except the salesmen" in a natural manner.

Let's take our two multisets and use them to explain the basic model, by making a mapping between them:

    S1 = {a, a, b, b,    c   }
          |     |  |     |
    S2 = {a,    b, b, b, c, d}

The INTERSECT and EXCEPT operators remove all duplicates from both sets, so we would have:

    S1 = {a, b, c   }
          |  |  |
    S2 = {a, b, c, d}

Therefore,

    S1 INTERSECT S2 = {a, b, c}

and

    S2 EXCEPT S1 = {d}
    S1 EXCEPT S2 = {}

When you add the ALL option, things are trickier. The mapped pairs become the unit of work. The INTERSECT ALL keeps each pairing, so that:

    S1 INTERSECT ALL S2 = {a, b, b, c}

The EXCEPT ALL throws them away, retaining what is left in the first set, thus:

    S2 EXCEPT ALL S1 = {b, d}

Trying to write the INTERSECT and EXCEPT with other operators is trickier than it looks.
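The pairing model above can be mimicked outside SQL to check the listed results. The sketch below is an illustration only: representing each table as a Python `collections.Counter` (a multiset) is an assumption made here, not something the text prescribes, and it ignores NULLs entirely.

```python
from collections import Counter

# The two sample multisets, one character per row.
s1 = Counter("aabbc")    # S1 = {a, a, b, b, c}
s2 = Counter("abbbcd")   # S2 = {a, b, b, b, c, d}

# With ALL: mapped pairs are the unit of work.
intersect_all = s1 & s2      # keeps each pairing (minimum of the two counts)
s2_except_all_s1 = s2 - s1   # discards the pairs, keeps what is left of S2
s1_except_all_s2 = s1 - s2   # same, but keeping what is left of S1

# Without ALL: duplicates are removed from both sides first.
intersect = set(s1) & set(s2)
s2_except_s1 = set(s2) - set(s1)
s1_except_s2 = set(s1) - set(s2)

print(sorted(intersect_all.elements()))     # ['a', 'b', 'b', 'c']
print(sorted(s2_except_all_s1.elements()))  # ['b', 'd']
print(sorted(s1_except_all_s2.elements()))  # ['a']
print(sorted(intersect), sorted(s2_except_s1), sorted(s1_except_s2))
```

`Counter` subtraction drops any value whose count would fall to zero or below, which is the same "what is left in the first set" behavior described for EXCEPT ALL.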
Any such rewrite must be general enough to handle situations where there is no key available and the number of columns is not known.

Standard SQL defines the actions for duplicates in terms of the count of duplicates of matching rows. Let (m) be the number of rows of one kind in S1 and (n) be the number in S2. The UNION ALL will have (m+n) copies of the row. The INTERSECT ALL will have LEAST(m, n) copies. EXCEPT ALL will have the greater of either the first table's count minus the second table's count, or zero copies.

The immediate impulse of a programmer is to write the code with EXISTS() predicates. The bad news is that it does not work because of NULLs. This is easier to show with code. Let's redo our two sample tables.

    CREATE TABLE S1 (a1 CHAR(1));
    INSERT INTO S1
    VALUES ('a'), ('a'), ('b'), ('b'), ('c'), (NULL), (NULL);

    CREATE TABLE S2 (a2 CHAR(1));
    INSERT INTO S2
    VALUES ('a'), ('b'), ('b'), ('b'), ('c'), ('d'), (NULL);

Now build a view to hold the tally of each value in each table.

    CREATE VIEW DupCounts (a, s1_dup, s2_dup)
    AS
    SELECT S.a, SUM(s1_dup), SUM(s2_dup)
      FROM (SELECT S1.a1, 1, 0 FROM S1
            UNION ALL
            SELECT S2.a2, 0, 1 FROM S2) AS S(a, s1_dup, s2_dup)
     GROUP BY S.a;  -- one row per value, holding both tallies

The GROUP BY will put the NULLs into a separate group, giving them the right tallies. Now the code is a straightforward implementation of the definitions in Standard SQL.

S1 EXCEPT ALL S2:

    SELECT DISTINCT D1.a, (s1_dup - s2_dup) AS dups
      FROM DupCounts AS D1, Sequence AS S1
     WHERE S1.seq_nbr <= (s1_dup - s2_dup);

S1 INTERSECT ALL S2:

    SELECT DISTINCT D1.a,
           CASE WHEN s1_dup <= s2_dup
                THEN s1_dup ELSE s2_dup END AS tally
      FROM DupCounts AS D1, Sequence AS S1
     WHERE S1.seq_nbr <= CASE WHEN s1_dup <= s2_dup
                              THEN s1_dup ELSE s2_dup END;

Notice that we had to use SELECT DISTINCT. Without it, the sample data will produce this table:
    a      tally
    ===========
    NULL   1
    a      1
    b      2
    b      2   <== redundant row
    c      1

The nonduplicated versions are easy to write from the definitions in the Standards. In effect, their duplication tallies are set to one.

S1 INTERSECT S2:

    SELECT D1.a
      FROM DupCounts AS D1
     WHERE s1_dup > 0
       AND s2_dup > 0;

S1 EXCEPT S2:

    SELECT D1.a
      FROM DupCounts AS D1
     WHERE s1_dup > 0
       AND s2_dup = 0;

S2 EXCEPT S1:

    SELECT D1.a
      FROM DupCounts AS D1
     WHERE s2_dup > 0
       AND s1_dup = 0;

26.2.1 INTERSECT and EXCEPT without NULLs and Duplicates

INTERSECT and EXCEPT are much easier if neither of the two tables contains NULLs or duplicate values. Intersection is simply done thus:

    SELECT *
      FROM S1
     WHERE EXISTS
           (SELECT *
              FROM S2
             WHERE S1.a1 = S2.a2);

or

    SELECT *
      FROM S2
     WHERE EXISTS
           (SELECT *
              FROM S1
             WHERE S1.a1 = S2.a2);

You can also use the following:

    SELECT DISTINCT S2.*
      FROM (S2 INNER JOIN S1 ON S1.a1 = S2.a2);

This is given as a motivation for the next piece of code, but you may find that some SQL engines do joins faster than EXISTS() predicates, and vice versa, so it is a good idea to have more than one trick in your bag.

The set difference can be written with an OUTER JOIN operator. This code, which computes S2 EXCEPT S1, is due to Jim Panttaja.

    SELECT DISTINCT S2.*
      FROM (S2 LEFT OUTER JOIN S1
            ON S1.a1 = S2.a2)
     WHERE S1.a1 IS NULL;

26.2.2 INTERSECT and EXCEPT with NULLs and Duplicates

These versions of INTERSECT and EXCEPT are due to Itzik Ben-Gan. They make very good use of the UNION and DISTINCT operators to implement set theory definitions.
S1 INTERSECT S2:

    SELECT D.a
      FROM (SELECT DISTINCT a1 FROM S1
            UNION ALL
            SELECT DISTINCT a2 FROM S2) AS D(a)
     GROUP BY D.a
    HAVING COUNT(*) > 1;

S1 INTERSECT ALL S2:

    SELECT D2.a
      FROM (SELECT D1.a, MIN(cnt) AS mincnt
              FROM (SELECT a1, COUNT(*)
                      FROM S1
                     GROUP BY a1
                    UNION ALL
                    SELECT a2, COUNT(*)
                      FROM S2
                     GROUP BY a2) AS D1(a, cnt)
             GROUP BY D1.a
            HAVING COUNT(*) > 1) AS D2
           INNER JOIN
           Sequence
           ON seq_nbr <= mincnt;

S1 EXCEPT ALL S2:

    SELECT D2.a
      FROM (SELECT D1.a, SUM(cnt)
              FROM (SELECT a1, COUNT(*)
                      FROM S1
                     GROUP BY a1
                    UNION ALL
                    SELECT a2, -COUNT(*)
                      FROM S2
                     GROUP BY a2) AS D1(a, cnt)
             GROUP BY D1.a
            HAVING SUM(cnt) > 0) AS D2(a, dups)
           INNER JOIN
           Sequence
           ON seq_nbr <= D2.dups;

The Sequence table is discussed in other places in this book. It is a table of integers from 1 to (n) that is used to replace iteration and counting in SQL. Obviously, (n) must be large enough for these statements to work.

26.3 A Note on ALL and SELECT DISTINCT

Here is a series of observations about the relationship between the ALL option in set operations and the SELECT DISTINCT options in a query from Beught Gunne. Given two tables with duplicate values:

    CREATE TABLE A (i INTEGER NOT NULL);
    INSERT INTO A VALUES (1), (1), (2), (2), (4), (4);

Date posted: 06/07/2014, 09:20
