Joe Celko's SQL for Smarties - Advanced SQL Programming P55

CHAPTER 23: STATISTICS IN SQL

This would return the result set ('red', 'green') for the example table, and would not change to ('green') until the ratio of 'red' to 'green' tipped by two percentage points.

Likewise, you can use a derived table, named in a WITH clause, to get the mode.

    WITH P1 (salary, occurs)
    AS (SELECT salary, COUNT(*)
          FROM Payroll
         GROUP BY salary)
    SELECT salary
      FROM P1
     WHERE P1.occurs = (SELECT MAX(occurs) FROM P1);

This is probably the best approach, since the WITH clause will materialize the P1 table and can locate the MAX() while doing so.

23.2 The AVG() Function

One problem is that SQL likes to maintain the data types, so if x is an INTEGER, you may get an integer result. You can avoid this by writing AVG(1.0 * x) or AVG(CAST(x AS FLOAT)) or AVG(CAST(x AS DECIMAL(p, s))) to be safe. This is implementation-defined, so check your product first.

Newbies tend to forget that the built-in aggregate functions drop the rows with NULLs before doing the computations. This means that (SUM(x)/COUNT(*)) is not the same as AVG(x). Consider SUM(x * 1.0)/COUNT(*) and AVG(COALESCE(x * 1.0, 0.0)) as versions of the mean that, unlike AVG(x), count each NULL as a zero instead of dropping the row.

Sample and population means are slightly different. A sample needs to use frequencies to adjust the estimate of the mean. The formula SUM(x * 1.0 * abs_perc/100) AS mean_p needs the VIEW we had at the start of this section. The name “mean_p” is to remind us that it is a population mean and not the simple AVG() of the sample data in the table.

23.3 The Median

The median is defined as the value for which there are just as many cases with a value below it as above it. If such a value exists in the data set, this value is called the statistical median by some authors. If no such value exists in the data set, the usual method is to divide the data set into two halves of equal size such that all values in one half are lower than any value in the other half.
The median is then the average of the highest value in the lower half and the lowest value in the upper half, and is called the financial median by some authors. The financial median is the most common term used for this median, so we will stick to it.

Let us use Date’s famous Parts table, from several of his textbooks (Date 1983, 1995a), which has a column for weight in it, like this:

    Parts
    part_nbr  part_name  part_color  weight  city_name
    ==================================================
    'p1'      'Nut'      'Red'       '12'    'London'
    'p2'      'Bolt'     'Green'     '17'    'Paris'
    'p3'      'Cam'      'Blue'      '12'    'Paris'
    'p4'      'Screw'    'Red'       '14'    'London'
    'p5'      'Cam'      'Blue'      '12'    'Paris'
    'p6'      'Cog'      'Red'       '19'    'London'

First, sort the table by weights and find the three rows in the lower half of the table. The greatest value in the lower half is 12; the smallest value in the upper half is 14; their average, and therefore the median, is 13. If the table had an odd number of rows, we would have looked at only one row after the sorting.

The median is a better measure of central tendency than the average, but it is also harder to calculate without sorting. This is a disadvantage of SQL, compared with procedural languages, and it might be the reason that the median is not a common vendor extension in SQL implementations. The variance and standard deviation are quite common, probably because they require no sorting and are therefore much easier to calculate; however, they are less useful to commercial users.

23.3.1 Date’s First Median

Date proposed two different solutions for the median (Date 1992a; Celko and Date 1993). His first solution was based on the fact that if you duplicate every row in a table, the median will stay the same. The duplication will guarantee that you always work with a table that has an even number of rows. The first version that appeared in his column was wrong and drew some mail from me and from others who had different solutions.
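The claim that duplicating every row leaves the median unchanged can be checked with a small sketch. It assumes Python's sqlite3 and statistics modules as a test bed, which the text itself does not use; doubled_weights is a hypothetical helper name, and the weights are the ones from the Parts table.

```python
import sqlite3
import statistics

# Weights from the Parts table in the text.
WEIGHTS = [12, 17, 12, 14, 12, 19]


def doubled_weights(weights):
    """Return the weights after duplicating every row with UNION ALL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Parts (weight INTEGER NOT NULL)")
    conn.executemany(
        "INSERT INTO Parts (weight) VALUES (?)", [(w,) for w in weights]
    )
    rows = conn.execute(
        "SELECT weight FROM Parts "
        "UNION ALL "
        "SELECT weight FROM Parts "
        "ORDER BY weight"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


# Even number of rows: the median is the average of the two middle values.
original_median = statistics.median(WEIGHTS)
doubled_median = statistics.median(doubled_weights(WEIGHTS))

# Odd number of rows (first five weights): doubling makes the count even,
# but the two middle rows are then copies of the same value.
odd_median = statistics.median(WEIGHTS[:5])
odd_doubled_median = statistics.median(doubled_weights(WEIGHTS[:5]))
```

For the six weights both medians come out as 13, and for the five-row case both come out as 12, so the doubled table always has an even row count without moving the median.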
Here is a corrected version of his first solution:

    CREATE VIEW Temp1
    AS SELECT weight FROM Parts
       UNION ALL
       SELECT weight FROM Parts;

    CREATE VIEW Temp2
    AS SELECT weight
         FROM Temp1
        WHERE (SELECT COUNT(*) FROM Parts)
              <= (SELECT COUNT(*)
                    FROM Temp1 AS T1
                   WHERE T1.weight >= Temp1.weight)
          AND (SELECT COUNT(*) FROM Parts)
              <= (SELECT COUNT(*)
                    FROM Temp1 AS T2
                   WHERE T2.weight <= Temp1.weight);

    SELECT AVG(DISTINCT weight) AS median
      FROM Temp2;

This involves the construction of a doubled table of values, which can be expensive in terms of both time and storage space. The use of AVG(DISTINCT x) is important, because leaving it out would return the simple average instead of the median. Consider the set of weights (12, 17, 17, 14, 12, 19). The doubled table, Temp1, is then (12, 12, 12, 12, 14, 14, 17, 17, 17, 17, 19, 19). But because of the duplicated values, Temp2 becomes (14, 14, 17, 17, 17, 17), not just (14, 17). The simple average is (96 / 6.0) = 16; it should be (31 / 2.0) = 15.5 instead.

23.3.2 Celko’s First Median

A slight modification of Date’s solution will avoid the use of a doubled table, but it depends on a CEILING() function.
    SELECT MIN(weight) -- smallest value in upper half
      FROM Parts
     WHERE weight
           IN (SELECT P1.weight
                 FROM Parts AS P1, Parts AS P2
                WHERE P2.weight >= P1.weight
                GROUP BY P1.weight
               HAVING COUNT(*)
                      <= (SELECT CEILING(COUNT(*) / 2.0)
                            FROM Parts))
    UNION
    SELECT MAX(weight) -- largest value in lower half
      FROM Parts
     WHERE weight
           IN (SELECT P1.weight
                 FROM Parts AS P1, Parts AS P2
                WHERE P2.weight <= P1.weight
               HAVING COUNT(*)
                      <= (SELECT CEILING(COUNT(*) / 2.0)
                            FROM Parts));

Alternately, using the same idea and a CASE expression:

    SELECT AVG(DISTINCT CAST(weight AS FLOAT)) AS median
      FROM (SELECT MAX(weight)
              FROM Parts AS B1
             WHERE (SELECT COUNT(*) + 1
                      FROM Parts
                     WHERE weight < B1.weight)
                   <= (SELECT CEILING(COUNT(*) / 2.0)
                         FROM Parts)
            UNION ALL
            SELECT MAX(weight)
              FROM Parts AS B
             WHERE (SELECT COUNT(*) + 1
                      FROM Parts
                     WHERE weight < B.weight)
                   <= CASE (SELECT MOD(COUNT(*), 2)
                              FROM Parts)
                      WHEN 0
                      THEN (SELECT CEILING(COUNT(*) / 2.0) + 1
                              FROM Parts)
                      ELSE (SELECT CEILING(COUNT(*) / 2.0)
                              FROM Parts)
                      END) AS Medians(weight);

Older versions of SQL allow a HAVING clause only with a GROUP BY; this may not work with your SQL. The CEILING() function is included to be sure that if there is an odd number of rows in Parts, the two halves will overlap on that value. Again, truncation and rounding in division are implementation-defined, so you will need to experiment with your product.
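The CASE-based query can be tried out with a minimal sketch, assuming Python's sqlite3 module as a test bed (the text does not use it). Three things here are assumptions rather than part of the original: CEILING() and MOD() are registered as Python stand-ins, because a given SQLite build may not provide them; the derived column is named with AS weight in each branch, because SQLite, as far as I know, does not accept a column list after a derived-table alias; and case_median is a hypothetical helper name.

```python
import math
import sqlite3

# The CASE-based median from the text, adapted for SQLite (see assumptions).
MEDIAN_SQL = """
SELECT AVG(DISTINCT CAST(weight AS FLOAT)) AS median
  FROM (SELECT MAX(B1.weight) AS weight
          FROM Parts AS B1
         WHERE (SELECT COUNT(*) + 1
                  FROM Parts AS P
                 WHERE P.weight < B1.weight)
               <= (SELECT CEILING(COUNT(*) / 2.0) FROM Parts)
        UNION ALL
        SELECT MAX(B.weight) AS weight
          FROM Parts AS B
         WHERE (SELECT COUNT(*) + 1
                  FROM Parts AS P
                 WHERE P.weight < B.weight)
               <= CASE (SELECT MOD(COUNT(*), 2) FROM Parts)
                  WHEN 0
                  THEN (SELECT CEILING(COUNT(*) / 2.0) + 1 FROM Parts)
                  ELSE (SELECT CEILING(COUNT(*) / 2.0) FROM Parts)
                  END) AS Medians
"""


def case_median(weights):
    """Load the weights into a throwaway Parts table and run the query."""
    conn = sqlite3.connect(":memory:")
    # Stand-ins for functions the SQLite build may not ship with.
    conn.create_function("CEILING", 1, math.ceil)
    conn.create_function("MOD", 2, lambda a, b: a % b)
    conn.execute("CREATE TABLE Parts (weight INTEGER NOT NULL)")
    conn.executemany(
        "INSERT INTO Parts (weight) VALUES (?)", [(w,) for w in weights]
    )
    (median,) = conn.execute(MEDIAN_SQL).fetchone()
    conn.close()
    return median


even_median = case_median([12, 17, 12, 14, 12, 19])  # six rows
odd_median = case_median([12, 17, 12, 14, 12])       # five rows
```

With six rows the two branches pick 12 and 14, and the average is 13; with five rows both branches pick 12, which is why AVG(DISTINCT ...) leaves a single value.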
23.3.3 Date’s Second Median

Date’s second solution (Date 1995b) was based on Celko’s median, folded into one query:

    SELECT AVG(DISTINCT Parts.weight) AS median
      FROM Parts
     WHERE Parts.weight IN
           (SELECT MIN(weight)
              FROM Parts
             WHERE Parts.weight IN
                   (SELECT P2.weight
                      FROM Parts AS P1, Parts AS P2
                     WHERE P2.weight <= P1.weight
                     GROUP BY P2.weight
                    HAVING COUNT(*)
                           <= (SELECT CEILING(COUNT(*) / 2.0)
                                 FROM Parts))
            UNION
            SELECT MAX(weight)
              FROM Parts
             WHERE Parts.weight IN
                   (SELECT P2.weight
                      FROM Parts AS P1, Parts AS P2
                     WHERE P2.weight >= P1.weight
                     GROUP BY P2.weight
                    HAVING COUNT(*)
                           <= (SELECT CEILING(COUNT(*) / 2.0)
                                 FROM Parts)));

Date mentions that this solution will return a NULL for an empty table and that it assumes there are no NULLs in the column. If there are NULLs, the WHERE clauses should be modified to remove them.

23.3.4 Murchison’s Median

Rory Murchison of the Aetna Institute has a solution that modifies Date’s first method by concatenating the key to each value to make sure that every value is seen as a unique entity. Selecting the middle values is then a special case of finding the nth item in the table.

    SELECT AVG(weight)
      FROM Parts AS P1
     WHERE EXISTS
           (SELECT COUNT(*)
              FROM Parts AS P2
             WHERE CAST(P2.weight AS CHAR(5)) || P2.part_nbr
                   >= CAST(P1.weight AS CHAR(5)) || P1.part_nbr
            HAVING COUNT(*) = (SELECT FLOOR(COUNT(*) / 2.0)
                                 FROM Parts)
                OR COUNT(*) = (SELECT CEILING(COUNT(*) / 2.0)
                                 FROM Parts));

This method depends on being able to have a HAVING clause without a GROUP BY, which is part of the ANSI standard but often missed by new programmers.

Another handy trick, if you don’t have FLOOR() and CEILING() functions, is to use (COUNT(*) + 1) / 2.0 and COUNT(*) / 2.0 + 1 to handle the odd-and-even-elements problem. Just to work it out, consider the case where the COUNT(*) returns 8 for an answer: (8 + 1) / 2.0 = (9 / 2.0) = 4.5 and (8 / 2.0) + 1 = 4 + 1 = 5. The 4.5 will round to 4 in DB2 and other SQL implementations.
The case where the COUNT(*) returns 9 would work like this: (9 + 1) / 2.0 = (10 / 2.0) = 5 and (9 / 2.0) + 1 = 4.5 + 1 = 5.5, which will likewise round to 5 in DB2.

23.3.5 Celko’s Second Median

This is another method for finding the median that uses a working table with the values, as well as a tally of their occurrences from the original table. This working table should be quite a bit smaller than the original table, and it should be very fast to construct if there is an index on the target column. The Parts table will serve as an example, thus:

    -- construct Working table of occurrences by weight
    CREATE TABLE Working
    (weight REAL NOT NULL,
     occurs INTEGER NOT NULL);

    INSERT INTO Working (weight, occurs)
    SELECT weight, COUNT(*)
      FROM Parts
     GROUP BY weight;

Now that we have this table, we want to use it to construct a summary table that has the number of occurrences of each weight and the total number of data elements before and after we add them to the working table.

    -- construct table of cumulative tallies
    CREATE TABLE Summary
    (weight REAL NOT NULL,
     occurs INTEGER NOT NULL,      -- number of occurrences
     pre_tally INTEGER NOT NULL,   -- cumulative tally before
     post_tally INTEGER NOT NULL); -- cumulative tally after

    INSERT INTO Summary
    SELECT S2.weight, S2.occurs,
           SUM(S1.occurs) - S2.occurs,
           SUM(S1.occurs)
      FROM Working AS S1, Working AS S2
     WHERE S1.weight <= S2.weight
     GROUP BY S2.weight, S2.occurs;

Let (n / 2.0) be the middle position in the table. There are two mutually exclusive situations. In the first case, the median lies in a position between the pre_tally and post_tally of one weight value. In the second case, the median lies on the pre_tally of one row and the post_tally of another. The middle position can be calculated by the scalar subquery (SELECT MAX(post_tally) / 2.0 FROM Summary).
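To see what the Summary table holds for the Parts weights, here is a minimal sketch that runs the two INSERT statements in SQLite through Python's sqlite3 module. The test bed and the helper name build_summary are assumptions made for illustration; the SQL is the text's own.

```python
import sqlite3


def build_summary(weights):
    """Build Working and Summary and return (summary rows, middle position)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Parts (weight REAL NOT NULL);
        CREATE TABLE Working
        (weight REAL NOT NULL,
         occurs INTEGER NOT NULL);
        CREATE TABLE Summary
        (weight REAL NOT NULL,
         occurs INTEGER NOT NULL,      -- number of occurrences
         pre_tally INTEGER NOT NULL,   -- cumulative tally before
         post_tally INTEGER NOT NULL); -- cumulative tally after
        """
    )
    conn.executemany(
        "INSERT INTO Parts (weight) VALUES (?)", [(w,) for w in weights]
    )
    conn.execute(
        "INSERT INTO Working (weight, occurs) "
        "SELECT weight, COUNT(*) FROM Parts GROUP BY weight"
    )
    conn.execute(
        "INSERT INTO Summary "
        "SELECT S2.weight, S2.occurs, "
        "       SUM(S1.occurs) - S2.occurs, "
        "       SUM(S1.occurs) "
        "  FROM Working AS S1, Working AS S2 "
        " WHERE S1.weight <= S2.weight "
        " GROUP BY S2.weight, S2.occurs"
    )
    rows = conn.execute(
        "SELECT weight, occurs, pre_tally, post_tally "
        "FROM Summary ORDER BY weight"
    ).fetchall()
    (middle,) = conn.execute(
        "SELECT MAX(post_tally) / 2.0 FROM Summary"
    ).fetchone()
    conn.close()
    return rows, middle


summary_rows, middle_position = build_summary([12, 17, 12, 14, 12, 19])
for row in summary_rows:
    print(row)
print("middle position:", middle_position)
```

For these weights the middle position is 3.0, which is both the post_tally of weight 12 and the pre_tally of weight 14. That is the second situation described above, where the median falls between two weights.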
    SELECT AVG(S3.weight) AS median
      FROM Summary AS S3
     WHERE (S3.post_tally > (SELECT MAX(post_tally) / 2.0 FROM Summary)
            AND S3.pre_tally < (SELECT MAX(post_tally) / 2.0 FROM Summary))
        OR S3.pre_tally = (SELECT MAX(post_tally) / 2.0 FROM Summary)
        OR S3.post_tally = (SELECT MAX(post_tally) / 2.0 FROM Summary);

The first predicate, with the AND operator, handles the case where the median falls inside one weight value; the other two predicates handle the case where the median is between two weights. A BETWEEN predicate will not work in this query.

These tables can be used to compute percentiles, deciles, and quartiles simply by changing the scalar subquery. For example, to find the highest tenth (first decile), the subquery would be (SELECT 9 * MAX(post_tally) / 10 FROM Summary); to find the highest two-tenths, (SELECT 8 * MAX(post_tally) / 10 FROM Summary). In general, to find the highest n-tenths, (SELECT (10 - n) * MAX(post_tally) / 10 FROM Summary).

23.3.6 Vaughan’s Median with VIEWs

Philip Vaughan of San Jose, California proposed a simple median technique based on all of these methods. It derives a VIEW with unique weights and number of occurrences, and then derives a VIEW of the middle set of weights.

    CREATE VIEW ValueSet(weight, occurs)
    AS SELECT weight, COUNT(*)
         FROM Parts
        GROUP BY weight;

The MiddleValues VIEW is used to get the median by taking an average. The clever part of this code is the way it handles empty result sets in the outermost WHERE clause that result from having only one value for all weights in the table. Empty sets sum to NULL, because there is no element to map the index.
    CREATE VIEW MiddleValues(weight)
    AS SELECT weight
         FROM ValueSet AS VS1
        WHERE (SELECT SUM(VS2.occurs)/2.0 + 0.25
                 FROM ValueSet AS VS2)
              > (SELECT SUM(VS2.occurs)
                   FROM ValueSet AS VS2
                  WHERE VS1.weight <= VS2.weight) - VS1.occurs
          AND (SELECT SUM(VS2.occurs)/2.0 + 0.25
                 FROM ValueSet AS VS2)
              > (SELECT SUM(VS2.occurs)
                   FROM ValueSet AS VS2
                  WHERE VS1.weight >= VS2.weight) - VS1.occurs;

    SELECT AVG(weight) AS median
      FROM MiddleValues;

23.3.7 Median with Characteristic Function

Anatoly Abramovich, Yelena Alexandrova, and Eugene Birger presented a series of articles in SQL Forum magazine on computing the median (SQL Forum 1993, 1994). They define a characteristic function, which they call delta, using the Sybase SIGN() function. The delta or characteristic function accepts a Boolean expression as an argument, and returns one if it is TRUE and zero if it is FALSE or UNKNOWN. We can construct the delta function easily with a CASE expression.

The authors also distinguish between the statistical median, whose value must be a member of the set, and the financial median, whose value is the average of the middle two members of the set. A statistical median exists when the number of items in the set is odd. If the number of items is even, you must decide whether you want to use the highest value in the lower half (they call this the left median) or the lowest value in the upper half (they call this the right median).

The left statistical median of a unique column can be found with this query, if you assume that we have a column called bin that represents the storage location of a part.

    SELECT P1.bin
      FROM Parts AS P1, Parts AS P2
     GROUP BY P1.bin
    HAVING SUM(CASE WHEN (P2.bin <= P1.bin) THEN 1 ELSE 0 END)
           = (COUNT(*) / 2.0);

Changing the direction of the theta test in the HAVING clause will allow you to pick the right statistical median if a central element does not exist in the set.
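The CASE form of the delta function and the effect of flipping the theta test can be sketched as follows. The sketch assumes Python's sqlite3 module as a test bed; the bin values (10 through 60) and the helper name statistical_median are made up for illustration and do not come from the text.

```python
import sqlite3

# The query from the text, with the theta operator left as a placeholder.
# CASE ... THEN 1 ELSE 0 END plays the role of the delta function: it yields
# 1 when the comparison is TRUE and 0 when it is FALSE or UNKNOWN.
STAT_MEDIAN_SQL = """
SELECT P1.bin
  FROM Parts AS P1, Parts AS P2
 GROUP BY P1.bin
HAVING SUM(CASE WHEN (P2.bin {theta} P1.bin) THEN 1 ELSE 0 END)
       = (COUNT(*) / 2.0)
"""


def statistical_median(bins, theta):
    """Return the bins selected with '<=' (left) or '>=' (right)."""
    if theta not in ("<=", ">="):
        raise ValueError("theta must be '<=' or '>='")
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Parts (bin INTEGER NOT NULL UNIQUE)")
    conn.executemany("INSERT INTO Parts (bin) VALUES (?)", [(b,) for b in bins])
    rows = conn.execute(STAT_MEDIAN_SQL.format(theta=theta)).fetchall()
    conn.close()
    return [row[0] for row in rows]


even_bins = [10, 20, 30, 40, 50, 60]
left_median = statistical_median(even_bins, "<=")
right_median = statistical_median(even_bins, ">=")

# With an odd number of rows COUNT(*) / 2.0 is not a whole number,
# so the equality test as written selects no row.
odd_result = statistical_median([10, 20, 30, 40, 50], "<=")
```

With six unique bins the left median is 30 and the right median is 40. With five bins the equality test matches nothing, so this form of the query is only useful when no central element exists.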
You will also notice something else about the median of a set of unique values: it is usually meaningless. What does the median bin number mean, anyway? A good rule of thumb is that if it does not make sense as an average, it does not make sense as a median.

The statistical median of a column with duplicate values can be found with a query based on the same ideas, but you have to adjust the HAVING clause to allow for overlap; thus, the left statistical median is found by:

    SELECT P1.weight
      FROM Parts AS P1, Parts AS P2
     GROUP BY P1.weight
    HAVING SUM(CASE WHEN P2.weight <= P1.weight
                    THEN 1 ELSE 0 END)
           >= (COUNT(*) / 2.0)
       AND SUM(CASE WHEN P2.weight >= P1.weight
                    THEN 1 ELSE 0 END)
           >= (COUNT(*) / 2.0);

Notice that here the left and right medians can be the same, so there is no need to pick one over the other in many of the situations where you have an even number of items. Switching the comparison operators in the two CASE expressions will give you the right statistical median.

The authors’ query for the financial median depends on some Sybase features that cannot be found in other products, so I would recommend using a combination of the right and left statistical medians to return a set of values about the center of the data, and then averaging them. Using a derived table, we can write the query as:

    SELECT AVG(DISTINCT weight)
      FROM (SELECT P1.weight
              FROM Parts AS P1, Parts AS P2
             GROUP BY P1.weight
            HAVING (SUM(CASE WHEN P2.weight <= P1.weight
                             THEN 1 ELSE 0 END)
                    >= ((COUNT(*)) / 2.0)
                    AND SUM(CASE WHEN P2.weight >= P1.weight
                                 THEN 1 ELSE 0 END)
                        >= (COUNT(*) / 2.0)));

In doing this, we can gain some additional control over the calculations. This version will use one copy of the left and right median to compute the statistical median. However, by simply changing the AVG(DISTINCT weight) to AVG(weight), the median will favor the direction with the most occurrences. This might be easier to see with an example.
Assume that we have weights (13, 13, 13, 14) in the Parts table. A pure statistical median would be (13 + 14) / 2.0 = 13.5; however, weighting it would give (13 + 13 + 13 + 14) / 4.0 = 13.25, which is more representative of central tendency.

Another version of the financial median, which uses the CASE expression in both of its forms, is:
