Joe Celko's SQL for Smarties - Advanced SQL Programming, P64

    CREATE TABLE B (i INTEGER NOT NULL);
    INSERT INTO B VALUES (2), (2), (3), (3);

The UNION and INTERSECT operations have regular behavior in that:

    (A UNION B) = SELECT DISTINCT (A UNION ALL B) = ((1), (2), (3))

and

    (A INTERSECT B) = SELECT DISTINCT (A INTERSECT ALL B) = (2)

However,

    (A EXCEPT B) <> SELECT DISTINCT (A EXCEPT ALL B)

Or, more literally, (1) <> ((1), (2)) for the tables given in the example. Likewise, we have:

    (B EXCEPT A) = SELECT DISTINCT (B EXCEPT ALL A) = (3)

by a coincidence of the particular values used in these tables.

26.4 Equality and Proper Subsets

At one point, when SQL was still in the laboratory at IBM, there was a CONTAINS operator that would tell you if one table was a subset of another. It disappeared in later versions of the language, and no vendor picked it up. Set equality was never part of SQL as an operator, so you would have had to use the two expressions ((A CONTAINS B) AND (B CONTAINS A)) to find out.

Today, you can use the methods shown in the section on Relational Division to determine containment or equality. However, Itzik Ben-Gan came up with a novel approach for finding containment and equality that is worth a mention.

    SELECT SUM(DISTINCT match_col)
      FROM (SELECT CASE
                   WHEN S1.col IN (SELECT S2.col FROM S2)
                   THEN 1 ELSE -1 END
              FROM S1) AS X(match_col)
    HAVING SUM(DISTINCT match_col) = :n;

You can set (:n) to 1, 0, or -1 for each particular test. When I find a row in S1 that has a match in S2, I get a +1; when I find a row in S1 with no match, I get a -1. Because the sum is over DISTINCT values, each of the two values is counted at most once. If both values occur, they sum to zero: some rows of S1 are in S2 and some are not, so the tables overlap but S1 is not contained in S2. If they sum to +1, every row of S1 is in S2, so S1 is a subset of S2; run the same test with the tables swapped to tell equality (+1 both ways) from a proper subset. If they sum to -1, the tables are disjoint.

CHAPTER 27 Subsets

I am defining subset operations as queries that extract a particular subset from a given set, as opposed to set operations, which work among sets.
The obvious way to extract a subset from a table is just to use a WHERE clause, which will pull out the rows that meet that criterion. But not all the subsets we want are easily defined by such a simple predicate. This chapter is a collection of tricks for constructing useful, but not obvious, subsets from a table.

27.1 Every nth Item in a Table

SQL is a set-oriented language, which cannot identify individual rows by their physical positions in a disk file that holds a table. Instead, a unique logical key is detected by logical expressions, and a row is retrieved. If you are given a file of employees in which the ordering of the file is based on their employee numbers, and you want to pick out every nth employee record for a survey, the job is easy. You write a procedure that loops through the file and writes every nth one to a second file.

The immediate thought of how this should be done in SQL is to simply compute MOD (emp_nbr, :n), where MOD() is the modulo function found in most SQL implementations, and save those employee rows where this function is zero. The trouble is that employees are not issued consecutive identification numbers; the identification numbers are only guaranteed to be unique.

Vendor extensions often include an exposed physical row locator that gives a sequential numbering to the physical records; this sequential numbering can be used to perform these functions. This practice is a complete violation of Dr. Codd's definition of a relational database, and it requires that the underlying physical implementation use a contiguous sequential record for each row. Such things are highly proprietary, but because these features are so low-level, they will run very fast on that one particular product.

Row numbers have more problems than being nonstandard. If the physical storage is rearranged, then the row numbers have to change.
Users logged on and looking at the same base table through different VIEWs may or may not get the same row number for the same physical row. One of the advantages of an RDBMS was supposed to be that the logical view of the data would be consistent, even when the physical storage changed.

You can get similar results with a self-JOIN on the Personnel table to partition it into a nested series of grouped tables, just as we did for the "top n" problem. You then pick out the largest value in each group. There may be an index or a uniqueness constraint on the emp_nbr column to ensure uniqueness, so the EXISTS predicate will get a performance boost.

    SELECT P1.emp_nbr
      FROM Personnel AS P1
     WHERE EXISTS
           (SELECT MAX(emp_nbr)
              FROM Personnel AS P2
             WHERE P1.emp_nbr >= P2.emp_nbr
            HAVING MOD (COUNT(*), :n) = 0);

A nonnested version of the same query looks like this:

    SELECT P1.emp_nbr
      FROM Personnel AS P1, Personnel AS P2
     WHERE P1.emp_nbr >= P2.emp_nbr
     GROUP BY P1.emp_nbr
    HAVING MOD (COUNT(*), :n) = 0;

Both queries count the number of P2 rows with a value less than or equal to that of the P1 row.

27.2 Picking Random Rows from a Table

The answer is that, basically, you cannot directly pick a set of random rows from a table in SQL. There is no randomize operator in the standard, and you don't often find the same pseudo-random number generator function in various vendor extensions, either.

Picking random rows from a table for a statistical sample is a handy thing, and you do it in other languages with a pseudo-random number generator. There are two kinds of random drawings from a set: with or without replacement. If SQL had random number functions, I suppose they would be shown as RANDOM(x) and RANDOM(DISTINCT x). But there is no such function in SQL, and none is planned. Examples from the real world include dealing a poker hand (a random with no replacement situation) and shooting craps (a random with replacement situation).
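The two kinds of drawing can be illustrated outside SQL. The following is a minimal sketch in Python, not anything from the standard or from a vendor product; the function names `deal_hand` and `roll_dice` and the card encoding are assumptions made up for this illustration.

```python
import random


def deal_hand(rng, hand_size=5):
    """Draw WITHOUT replacement: no card can appear twice in a hand."""
    # Hypothetical encoding: (rank 1..13, suit letter).
    deck = [(rank, suit) for suit in "CDHS" for rank in range(1, 14)]
    return rng.sample(deck, hand_size)


def roll_dice(rng, rolls):
    """Draw WITH replacement: every roll is independent and may repeat."""
    return [(rng.randint(1, 6), rng.randint(1, 6)) for _ in range(rolls)]


if __name__ == "__main__":
    rng = random.Random(2024)  # fixed seed so the run is repeatable
    print(deal_hand(rng))
    print(roll_dice(rng, 5))
```

With only 36 possible outcomes for a pair of dice, any run of 37 rolls must contain a repeat, while a dealt hand never does.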
If two players in a poker game get identical cards, you are using a pinochle deck. In a craps game, each roll of the dice is independent of the previous one and can repeat it.

The problem is that SQL is a set-oriented language, and wants to do an operation "all at once" on a well-defined set of rows. Random sets are, by definition, produced by a nondeterministic procedure instead of a deterministic logic expression.

The SQL/PSM language does have an option to declare or create a procedure that is DETERMINISTIC or NOT DETERMINISTIC. The DETERMINISTIC option means that the optimizer can compute this function once for a set of input parameter values and then use that result everywhere in the current SQL statement that a call to the procedure with those parameters appears. The NOT DETERMINISTIC option means that, given the same parameters, you might not get the same results for each call to the procedure within the same SQL statement.

Unfortunately, most SQL products do not have this feature in their proprietary procedural languages. Thus, the random number function in Oracle is nondeterministic and the one in SQL Server is deterministic. For example,

    CREATE TABLE RandomNbrs
    (seq_nbr INTEGER NOT NULL PRIMARY KEY,
     randomizer FLOAT NOT NULL);

    INSERT INTO RandomNbrs
    VALUES (1, RANDOM()), (2, RANDOM()), (3, RANDOM());

This statement will result in the three rows all getting the same value in the randomizer column in a version of SQL Server, but three different numbers in a version of Oracle.

While subqueries are not allowed in DEFAULT clauses, system-related functions such as CURRENT_TIMESTAMP and CURRENT_USER are allowed. In some SQL implementations, this includes the RANDOM() function.

    CREATE TABLE RandomNbrs2
    (seq_nbr INTEGER PRIMARY KEY,
     randomizer FLOAT -- warning !! not standard SQL
       DEFAULT ((CASE (CAST(RANDOM() + 0.5 AS INTEGER) * -1)
                 WHEN 0.0 THEN 1.0 ELSE -1.0 END)
                * MOD (CAST(RANDOM() * 100000 AS INTEGER), 10000)
                * RANDOM())
       NOT NULL);

    INSERT INTO RandomNbrs2
    VALUES (1, DEFAULT), (2, DEFAULT), (3, DEFAULT), (4, DEFAULT),
           (5, DEFAULT), (6, DEFAULT), (7, DEFAULT), (8, DEFAULT),
           (9, DEFAULT), (10, DEFAULT);

Here is a sample output from an SQL Server 7.0 implementation.

    seq_nbr  randomizer
    ============================
     1       -121.89758452446999
     2       -425.61113508053933
     3       3918.1554683876675
     4       9335.2668286173412
     5       54.463890640027664
     6       -5.0169085346410522
     7       -5430.63417246276
     8       915.9835973796487
     9       28.109161998753301
    10       741.79452047043048

The best way to do this is to add a column to the table to hold a random number, then use an external language with a good pseudo-random number generator in its function library to load the new column with random values with a cursor in a host language.

You have to do it this way, because random number generators work differently from other function calls. They start with an initial value called a "seed" (shown as Random[0] in the rest of this discussion) provided by the user or the system clock. The seed is used to create the first number in the sequence, Random[1]. Then each call to the function uses the previous number, Random[n], to generate the next one, Random[n+1]. There is no way to do a sequence of actions in SQL without a cursor, so you are in procedural code.

The term "pseudo-random number generator" is often shortened to just "random number generator," but this is technically wrong. All of the generators will eventually return a value that appeared in the sequence earlier, and the procedure will hang in a cycle. Procedures are deterministic, and we are living in a mathematical heresy when we try to use them to produce truly random results.
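The seed-driven, cyclic behavior described above can be seen with a deliberately tiny generator. This is only a sketch: `make_generator`, `cycle_length`, and the toy recurrence `tiny_step` are hypothetical names and values chosen for illustration, not a generator anyone should use.

```python
def tiny_step(x):
    """Toy recurrence (an arbitrary assumption): 16 possible states only."""
    return (5 * x + 3) % 16


def make_generator(seed, step):
    """Return a function whose every call derives the next value from the last."""
    state = seed

    def next_value():
        nonlocal state
        state = step(state)
        return state

    return next_value


def cycle_length(seed, step):
    """Count how many values the sequence yields before it starts repeating."""
    first_seen = {}
    state, position = seed, 0
    while state not in first_seen:
        first_seen[state] = position
        state = step(state)
        position += 1
    return position - first_seen[state]


if __name__ == "__main__":
    gen_a = make_generator(0, tiny_step)
    gen_b = make_generator(0, tiny_step)
    # Same seed, same sequence: the procedure is fully deterministic.
    print([gen_a() for _ in range(5)], [gen_b() for _ in range(5)])
    print(cycle_length(0, tiny_step))
```

Two generators started from the same seed produce identical sequences, and this toy one repeats after only 16 values, which is far too short a cycle to be of any practical use.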
However, if the sequence has a very long cycle and meets some other tests for randomness over the range of the cycle, then we can use it.

There are many kinds of generators. The linear congruential pseudo-random number generator family has generator formulas of the form:

    Random[n+1] := MOD ((x * Random[n] + y), m);

There are restrictions on the relationships among x, y, and m that deal with their relative primality. Knuth gives a proof that if

    Random[0] is not a multiple of 2 or 5
    m = 10^e where (e >= 5)
    y = 0
    MOD (x, 200) is in the set (3, 11, 13, 19, 21, 27, 29, 37, 53, 59, 61, 67, 77, 83, 91, 109, 117, 123, 131, 133, 139, 141, 147, 163, 171, 173, 179, 181, 187, 189, 197)

then the period will be 5 * 10^(e-2).

There are old favorites that many C programmers use from this family, such as:

    Random[n+1] := (Random[n] * 1103515245) + 12345;
    Random[n+1] := MOD ((16807 * Random[n]), ((2^31) - 1));

The first formula has the advantage of not requiring a MOD function, so it can be written in standard SQL. However, the simplest generator that can be recommended (Park and Miller) uses:

    Random[n+1] := MOD ((48271 * Random[n]), ((2^31) - 1));

Notice that the modulus is a prime number; this is important. The period of this generator is ((2^31) - 2), which is 2,147,483,646, or more than two billion numbers before this generator repeats. You must determine whether this is long enough for your application.

If you have an XOR function in your SQL, then you can also use shift register algorithms. The XOR is the bitwise exclusive OR that works on an integer as it is stored in the hardware; I would assume 32 bits on most small computers.
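Returning to the Park and Miller recurrence given above, here is a minimal sketch of it in Python, the kind of host-language routine that could feed a random-number column. The function name `park_miller` and the seed check are assumptions for this illustration; only the multiplier 48271 and the modulus (2^31) - 1 come from the text.

```python
from itertools import islice

MODULUS = 2**31 - 1   # prime modulus from the text
MULTIPLIER = 48271    # Park and Miller multiplier from the text


def park_miller(seed):
    """Yield Random[1], Random[2], ... starting from Random[0] = seed."""
    if not 0 < seed < MODULUS:
        # A seed of 0 (or any multiple of the modulus) would stick at 0 forever.
        raise ValueError("seed must be between 1 and MODULUS - 1")
    state = seed
    while True:
        state = (MULTIPLIER * state) % MODULUS
        yield state


if __name__ == "__main__":
    # First five values of the sequence for seed 1.
    print(list(islice(park_miller(1), 5)))
    # Scale to a fraction between 0 and 1 if that is more convenient.
    print([value / MODULUS for value in islice(park_miller(1), 3)])
```

Python integers do not overflow, so the multiplication needs no special handling here; in a language with 32-bit integers the product has to be computed in a wider type.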
Some usable shift register algorithms, built on XOR, are:

    Random[n+1] := Random[n-103] XOR Random[n-250];
    Random[n+1] := Random[n-1063] XOR Random[n-1279];

One method for writing a random number generator on the fly when the vendor's library does not have one is to pick a seed using one or more key columns and a call to the system clock's fractional seconds, such as RANDOM(keycol + EXTRACT (SECOND FROM CURRENT_TIME)) * 1000. This avoids problems with patterns in the keys, while the key column values ensure uniqueness of the seed values.

Another method is to use a PRIMARY KEY or UNIQUE column(s) and apply a hashing algorithm. You can pick one of the random number generator functions already discussed and use the unique value, as if it were the seed, as a quick way to get a hashing function. Hashing algorithms try to be uniformly distributed, so if you can find a good one, you will approach nearly unique random selection. The trick is that the hashing algorithm has to be simple enough to be written in the limited math available in SQL.

Once you have a column of random numbers, you can convert the random numbers into a randomly ordered sequence by replacing each value with its rank:

    UPDATE RandomNbrs
       SET randomizer
           = (SELECT COUNT(*)
                FROM RandomNbrs AS R1
               WHERE R1.randomizer <= RandomNbrs.randomizer);

To get one random row from a table, you can use this approach:

    CREATE VIEW LotteryDrawing (keycol, ..., spin)
    AS SELECT LotteryTickets.*,
              (RANDOM(<keycol> + <fractional seconds from clock>))
         FROM LotteryTickets
        GROUP BY spin
       HAVING COUNT(*) = 1;

Then simply use this query:

    SELECT *
      FROM LotteryDrawing
     WHERE spin = (SELECT MAX(spin)
                     FROM LotteryDrawing);

The pseudo-random number function is not standard SQL, but it is common enough. Using the keycol as the seed probably means that you will get a different value for each row, but we can avoid duplicates with the GROUP BY ... HAVING.
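The idea behind the drawing (give every row a random spin, then keep the row with the largest one) can be tried out with a host-language generator loading the spin column, as recommended earlier. The sketch below is an assumption-laden illustration using Python's built-in sqlite3 module: the function `draw_winner`, the two-column table layout, and the use of Python's generator in place of a vendor RANDOM() are all choices made here, and it does not reproduce the VIEW itself.

```python
import random
import sqlite3


def draw_winner(ticket_ids, seed=None):
    """Load a random spin per ticket, then return the (keycol, spin) with MAX(spin)."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE LotteryTickets "
        "(keycol INTEGER NOT NULL PRIMARY KEY, spin REAL NOT NULL)"
    )
    # The host language supplies the random values, one per row.
    conn.executemany(
        "INSERT INTO LotteryTickets (keycol, spin) VALUES (?, ?)",
        [(key, rng.random()) for key in ticket_ids],
    )
    # Same shape as the MAX(spin) query in the text. A tie on spin is
    # extremely unlikely with floating-point values; fetchone() would
    # simply return one of the tied rows.
    row = conn.execute(
        "SELECT keycol, spin FROM LotteryTickets "
        "WHERE spin = (SELECT MAX(spin) FROM LotteryTickets)"
    ).fetchone()
    conn.close()
    return row


if __name__ == "__main__":
    print(draw_winner(range(1, 11), seed=7))
```

Passing the same seed reproduces the same drawing, which is useful for testing; leaving it as None lets Python seed the generator itself.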
Adding the fractional seconds will change the result every time, but it might be illegal in some SQL products, which disallow variable elements in VIEW definitions.

Let's assume you have a function called RANDOM() that returns a random number between 0.00 and 1.00. If you just want one random row out of the table, and you have a numeric key column, Tom Moreau proposed that you could find the MAX() and MIN(), then calculate a random number between them.

    SELECT L1.*
      FROM LotteryDrawing AS L1
     WHERE col_1 ...
