SELECT CASE MOD(COUNT(*), 2)
       WHEN 0 -- even-sized table
       THEN (P1.weight
             + MIN(CASE WHEN P2.weight > P1.weight
                        THEN P2.weight ELSE NULL END)) / 2.0
       ELSE P1.weight -- odd-sized table
       END
  FROM Parts AS P1, Parts AS P2
 GROUP BY P1.weight
HAVING COUNT(CASE WHEN P1.weight >= P2.weight
                  THEN 1 ELSE NULL END)
       = (COUNT(*) + 1) / 2;

This answer is due to Ken Henderson.

23.3.8 Celko's Third Median

Another approach involves looking at a picture of a line of sorted values and seeing where the median would fall. Every value in the column weight partitions the table into three sections: values that are less than weight, values that are equal to weight, and values that are greater than weight. We can get a profile of each value with a tabular subquery expression.

Now the question is how to define a median in terms of the partitions. Clearly, the definition of a median means that if (lesser = greater), then weight is the median. Now look at Figure 23.1 for the other situations.

Figure 23.1  Defining a Median.

If there are more elements among the greater values than half the size of the table, then weight cannot be a median. Likewise, if there are more elements among the lesser values than half the size of the table, then weight cannot be a median. If (lesser + equal) = greater, then weight is a left-hand median; likewise, if (greater + equal) = lesser, then weight is a right-hand median. However, if weight is the median, then both lesser and greater must have tallies of less than half the size of the table. That translates into the following SQL:

SELECT AVG(DISTINCT weight)
  FROM (SELECT P1.part_nbr, P1.weight,
               SUM(CASE WHEN P2.weight < P1.weight
                        THEN 1 ELSE 0 END),
               SUM(CASE WHEN P2.weight = P1.weight
                        THEN 1 ELSE 0 END),
               SUM(CASE WHEN P2.weight > P1.weight
                        THEN 1 ELSE 0 END)
          FROM Parts AS P1, Parts AS P2
         GROUP BY P1.part_nbr, P1.weight)
       AS Partitions (part_nbr, weight, lesser, equal, greater)
 WHERE lesser = greater
    OR (lesser <= (SELECT COUNT(*) FROM Parts)/2.0
        AND greater <= (SELECT COUNT(*) FROM Parts)/2.0);

The reason for not expanding the derived table in the FROM clause into a tabular subquery expression is that it can be reused for other partitionings of the table, such as quintiles. It is also worth noting that you can use either AVG(DISTINCT weight) or AVG(weight) in the SELECT clause. The AVG(DISTINCT weight) will return the usual median when there are two middle values. This happens when you have an even number of rows and a partition in the middle, such as (1, 2, 2, 3, 3, 3), which has (2, 3) in the middle and gives us 2.5 for the median. The AVG(weight) will return the weighted median instead. The weighted median looks at the set of middle values and skews in favor of the more common of the two values. The table with (1, 2, 2, 3, 3, 3) would return (2, 2, 3, 3, 3) in the middle, which gives us 2.6 for the weighted median. The weighted median is a more accurate description of the data.
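To make the partition logic concrete, here is a minimal sketch of the profile for the six-row multiset (1, 2, 2, 3, 3, 3) used above; the table name Toy is hypothetical and simply stands in for Parts.

CREATE TABLE Toy
(part_nbr INTEGER NOT NULL PRIMARY KEY,
 weight INTEGER NOT NULL);

INSERT INTO Toy (part_nbr, weight)
VALUES (1, 1), (2, 2), (3, 2), (4, 3), (5, 3), (6, 3);

SELECT T1.part_nbr, T1.weight,
       SUM(CASE WHEN T2.weight < T1.weight
                THEN 1 ELSE 0 END) AS lesser,
       SUM(CASE WHEN T2.weight = T1.weight
                THEN 1 ELSE 0 END) AS equal,
       SUM(CASE WHEN T2.weight > T1.weight
                THEN 1 ELSE 0 END) AS greater
  FROM Toy AS T1, Toy AS T2
 GROUP BY T1.part_nbr, T1.weight;

-- weight 1 profiles as (lesser, equal, greater) = (0, 1, 5) and fails
-- the median test; weight 2 profiles as (1, 2, 3) and weight 3 as
-- (3, 3, 0), and both pass, so AVG(DISTINCT weight) = 2.5 and
-- AVG(weight) = 2.6, just as described above.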
I sent this first attempt to Richard Romley, who invented the method of first working with groups when designing a query. It made it quite a bit simpler, but let me take you through the steps so you can see the reasoning. Look at the WHERE clause. It could use some algebra, and since it deals only with aggregate functions and scalar subqueries, you could move it into a HAVING clause. Moving things from the WHERE clause into the HAVING clause in a grouped query is important for performance, but it is not always possible.

First, though, let's do some algebra on this expression in the WHERE clause:

lesser <= (SELECT COUNT(*) FROM Parts)/2.0

Since we already have lesser, equal, and greater for every row in the derived table Partitions, and since the sum of lesser, equal, and greater must always be exactly the total number of rows in the Parts table, we can replace the scalar subquery with this expression:

lesser <= (lesser + equal + greater)/2.0

But this is the same as:

2.0 * lesser <= lesser + equal + greater

which becomes:

2.0 * lesser - lesser <= equal + greater

which becomes:

lesser <= equal + greater

So the query becomes:

SELECT AVG(DISTINCT weight)
  FROM (SELECT P1.part_nbr, P1.weight,
               SUM(CASE WHEN P2.weight < P1.weight
                        THEN 1 ELSE 0 END),
               SUM(CASE WHEN P2.weight = P1.weight
                        THEN 1 ELSE 0 END),
               SUM(CASE WHEN P2.weight > P1.weight
                        THEN 1 ELSE 0 END)
          FROM Parts AS P1, Parts AS P2
         GROUP BY P1.part_nbr, P1.weight)
       AS Partitions (part_nbr, weight, lesser, equal, greater)
 WHERE lesser = greater
    OR (lesser <= equal + greater
        AND greater <= equal + lesser);

We can rewrite the WHERE clause with a bit more algebra:

WHERE lesser = greater
   OR (equal >= lesser - greater
       AND equal >= greater - lesser)

But this is the same as:

WHERE lesser = greater
   OR equal >= ABS(lesser - greater)

But if the first condition (lesser = greater) is true, the second must necessarily also be true, since it reduces to equal >= 0; the first predicate is therefore redundant and can be eliminated completely:

WHERE equal >= ABS(lesser - greater)

So much for algebra. Instead of a WHERE clause operating on the columns of the derived table, why not perform the same test in a HAVING clause on the inner query that derives Partitions? This eliminates all but one column from the derived table, it will run much faster, and it simplifies the query to this:

SELECT AVG(DISTINCT weight)
  FROM (SELECT P1.weight
          FROM Parts AS P1, Parts AS P2
         GROUP BY P1.part_nbr, P1.weight
        HAVING SUM(CASE WHEN P2.weight = P1.weight
                        THEN 1 ELSE 0 END)
               >= ABS(SUM(CASE WHEN P2.weight < P1.weight THEN 1
                               WHEN P2.weight > P1.weight THEN -1
                               ELSE 0 END)))
       AS Partitions;

If you prefer to use functions instead of a CASE expression, then use this version of the query, in which SUM(1 - ABS(SIGN(P1.weight - P2.weight))) counts the equal rows and SUM(SIGN(P1.weight - P2.weight)) is the difference (lesser - greater):

SELECT AVG(DISTINCT weight)
  FROM (SELECT P1.weight
          FROM Parts AS P1, Parts AS P2
         GROUP BY P1.part_nbr, P1.weight
        HAVING SUM(1 - ABS(SIGN(P1.weight - P2.weight)))
               >= ABS(SUM(SIGN(P1.weight - P2.weight))))
       AS Partitions;

23.3.9 Ken Henderson's Median

In many SQL products, the fastest way to find the median is to use a cursor and just go to the middle of the sorted table. Ken Henderson published a version of this solution with a cursor that can be translated into SQL/PSM. Assume that we wish to find the median of column x in table Foobar.

BEGIN
DECLARE cnt INTEGER;
DECLARE idx INTEGER;
DECLARE median NUMERIC(20,5);
DECLARE median2 NUMERIC(20,5);
DECLARE Median_Cursor SCROLL CURSOR FOR -- SCROLL allows FETCH ABSOLUTE
        SELECT x FROM Foobar ORDER BY x
        FOR READ ONLY;

SET cnt = (SELECT COUNT(*) FROM Foobar);
SET idx = CASE WHEN MOD(cnt, 2) = 0
               THEN cnt/2 -- even-sized table: lower middle row
               ELSE (cnt/2) + 1 -- odd-sized table: middle row
          END;
OPEN Median_Cursor;
FETCH ABSOLUTE idx FROM Median_Cursor INTO median;
IF MOD(cnt, 2) = 0 -- even-sized table: average the two middle rows
THEN FETCH NEXT FROM Median_Cursor INTO median2;
     SET median = (median + median2)/2;
END IF;
CLOSE Median_Cursor;
END;

The cursor might not be the fastest method in other products. Some of them have a median function that uses balanced tree indexes to locate the middle of the distribution.
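Where your SQL product supports the ordered-set aggregate functions added in SQL:2003, the median can also be requested directly; here is a minimal sketch, reusing the Foobar table from above:

SELECT PERCENTILE_CONT(0.5)
       WITHIN GROUP (ORDER BY x) AS median
  FROM Foobar;

PERCENTILE_CONT interpolates between the two middle values in an even-sized table, so it agrees with the usual median returned by the AVG(DISTINCT weight) version shown earlier.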
If the distribution is symmetrical and has only a single peak, then the mode, median, and mean are the same value. If not, then the distribution is skewed in some way. If (mode < median < mean), then the distribution is skewed to the right; if (mode > median > mean), then the distribution is skewed to the left.

23.4 Variance and Standard Deviation

The standard deviation is a measure of how far away from the average the values in a normally distributed population are. It is hard to calculate in SQL, because it involves a square root, and standard SQL has only the four basic arithmetic operators. Many vendors will allow you to use other math functions, but in all fairness, most SQL databases are in commercial applications and have little or no need for engineering or statistical calculations. The usual trick is to load the raw data into an appropriate host language, such as FORTRAN, and do the work there.

The formula for the sample standard deviation is:

s = SQRT(((n * SUM(x^2)) - (SUM(x))^2) / (n * (n - 1)))

where (n) is the number of items in the sample set, and the xs are the values of the items. The variance is defined as the standard deviation squared, so we can avoid taking a square root and keep the calculations in pure SQL. The queries look like this:

CREATE TABLE Samples (x REAL NOT NULL);

INSERT INTO Samples (x)
VALUES (64.0), (48.0), (55.0), (68.0), (72.0),
       (59.0), (57.0), (61.0), (63.0), (60.0),
       (60.0), (43.0), (67.0), (70.0), (65.0),
       (55.0), (56.0), (64.0), (61.0), (60.0);

SELECT ((COUNT(*) * SUM(x*x)) - (SUM(x) * SUM(x)))
       / (COUNT(*) * (COUNT(*) - 1)) AS variance
  FROM Samples;

If you want to check this on your own SQL product, the correct answer is 48.9894..., or just 49, depending on how you handle rounding. If your SQL product has a standard deviation operator, use it instead.
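If your product does provide a square root function (SQRT() is common, though it is not part of standard SQL-92), the standard deviation falls out of the same expression; a sketch:

SELECT SQRT(((COUNT(*) * SUM(x*x)) - (SUM(x) * SUM(x)))
            / (COUNT(*) * (COUNT(*) - 1.0))) AS std_dev
  FROM Samples;

The constant 1.0 forces decimal division, guarding against integer truncation; for the sample data above, this returns approximately 6.9992, the square root of the variance just shown.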
23.5 Average Deviation

If you have a version of SQL with an absolute value function, ABS(), you can also compute the average deviation following this pattern:

BEGIN
SELECT AVG(x) INTO :average FROM Samples;

SELECT SUM(ABS(x - :average)) / COUNT(x) AS aver_deviation
  FROM Samples;
END;

This is a measure of how far data values drift away from the average, without any consideration of the direction of the drift.

23.6 Cumulative Statistics

A cumulative or running statistic looks at each data value and how it is related to the whole data set. The most common examples involve changes in an aggregate value over time or on some other well-ordered dimension. A bank balance, which changes with each deposit or withdrawal, is a running total over time. The total weight of a delivery truck as we add each package is a running total over the set of packages. But since two packages can have the same weight, we need a way to break ties; for example, use the arrival dates of the packages, and if that fails, use the alphabetical order of the last names of the shippers. In SQL, this means that we need a table with a key that we can use to order the rows.

Computer people classify reports as one-pass or two-pass reports, a terminology that comes from the number of times the computer used to have to read the data file to produce the desired results. These are really cumulative aggregate statistics. Most report writers can produce a listing with totals and other aggregated descriptive statistics after each grouping (e.g., "Give me the total amount of sales broken down by salesmen within territories"). Such reports are called banded reports or control-break reports, depending on the vendor. The closest thing to such reports that the SQL language has is the GROUP BY clause used with aggregate functions.

The two-pass report involves finding out something about the group as a whole in the first pass, and then using that information in the second pass to produce the results for each row in the group. The most common two-pass reports order the groups against each other ("Show me the total sales in each territory, ordered from high to low") or show the cumulative totals or cumulative percentages within a group ("Show me what percentage each customer contributes to total sales").

23.6.1 Running Totals

Running totals keep track of changes, which usually occur over time, but they could be changes on some other well-ordered dimension. A common example we all know is a bank account, for which we record withdrawals and deposits in a checkbook register. The running total is the balance of the account after each transaction. The query for the checkbook register is simply:

SELECT B0.transaction, B0.trans_date,
       SUM(B1.amount) AS balance
  FROM BankAccount AS B0, BankAccount AS B1
 WHERE B1.trans_date <= B0.trans_date
 GROUP BY B0.transaction, B0.trans_date;

You can use a scalar subquery instead:

SELECT B0.transaction, B0.trans_date,
       (SELECT SUM(B1.amount)
          FROM BankAccount AS B1
         WHERE B1.trans_date <= B0.trans_date) AS balance
  FROM BankAccount AS B0;

Which version will work better depends on your SQL product. Notice that these queries handle both deposits (positive numbers) and withdrawals (negative numbers).

There is a problem with running totals when two items occur at the same time. In this example, the transaction code keeps the transactions unique, but it is possible to have a withdrawal and a deposit on the same day that will be aggregated together. If we showed the withdrawals before the deposits on that day, the balance could fall below zero, which might trigger some actions we do not want. The rule in banking is that deposits are credited before withdrawals on the same day, so simply extend the transaction date to show all deposits with a time before all withdrawals to fool the query. But remember that not all situations have a clearly defined policy like this.

Here is another version of the cumulative total problem that attempts to reduce the work done in the outermost query. Assume we have a table with data on the amount of sales to customers. We want to see each amount and the cumulative total, in order by the amount, at which the customer gave us more than $500.00 in business.

SELECT C1.cust_id, C1.sales_amt,
       SUM(C2.sales_amt) AS cumulative_amt
  FROM Customers AS C1
       INNER JOIN
       Customers AS C2
       ON C1.sales_amt <= C2.sales_amt
 WHERE C1.sales_amt
       >= (SELECT MAX(X.sales_amt)
             FROM (SELECT C3.sales_amt
                     FROM Customers AS C3
                          INNER JOIN
                          Customers AS C4
                          ON C3.sales_amt <= C4.sales_amt
                    GROUP BY C3.cust_id, C3.sales_amt
                   HAVING SUM(C4.sales_amt) >= 500.00)
                  AS X(sales_amt))
 GROUP BY C1.cust_id, C1.sales_amt;

This query limits the processing that must be done in the outer query by first calculating the cutoff point for each customer. This sort of trick is best for larger tables, where the self-join is often very slow.
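Where your SQL product supports the window functions added in SQL:2003, a running total can also be written without the self-join; here is a minimal sketch for the checkbook register above:

SELECT transaction, trans_date,
       SUM(amount)
           OVER (ORDER BY trans_date, transaction
                 ROWS UNBOUNDED PRECEDING) AS balance
  FROM BankAccount;

Putting the transaction code into the ORDER BY subclause supplies the tie-breaking that the self-join versions lack: two rows with the same trans_date get distinct running balances instead of being aggregated together.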
23.6.2 Running Differences

Another kind of statistic, related to running totals, is running differences. In this case, we have the actual amount of something at various points in time, and we want to compute the change since the last reading.

Here is a quick scenario: we have a clipboard and a paper form on which we record the quantity of a chemical in a tank, read from a gauge, at different points in time. We need to report the time, the gauge reading, and the difference between each reading and the preceding one. Here is some sample result data, showing the calculation we need:

tank   time                 quantity  difference
==================================================
'50A'  '2005-02-01-07:30'       300        NULL  -- starting data
'50A'  '2005-02-01-07:35'       500         200
'50A'  '2005-02-01-07:45'      1200         700
'50A'  '2005-02-01-07:50'       800        -400
'50A'  '2005-02-01-08:00'      NULL        NULL
'50A'  '2005-02-01-09:00'      1300         500
'51A'  '2005-02-01-07:20'      6000        NULL  -- starting data
'51A'  '2005-02-01-07:22'      8000        2000
'51A'  '2005-02-01-09:30'      NULL        NULL
'51A'  '2005-02-01-10:45'      5000       -3000
'51A'  '2005-02-01-11:00'      2500       -2500

The NULL quantities mean that we missed taking a reading. The trick is a correlated subquery expression that computes the difference between the quantity in the current row and the quantity in the most recent earlier row for the same tank that has a known quantity; the IS NOT NULL test is what lets the query skip over the missed readings, as the sample output requires.

SELECT tank, time,
       (quantity
        - (SELECT quantity
             FROM Deliveries AS D1
            WHERE D1.tank = D0.tank -- same tank
              AND D1.time
                  = (SELECT MAX(D2.time) -- most recent known reading
                       FROM Deliveries AS D2
                      WHERE D2.tank = D0.tank -- same tank
                        AND D2.time < D0.time
                        AND D2.quantity IS NOT NULL)))
       AS difference
  FROM Deliveries AS D0;

This is a modification of the running-totals query, but it is more elaborate, since it cannot use the sum of the prior history.

23.6.3 Cumulative Percentages

Cumulative percentages are a bit more complex than running totals or differences. They show what percentage of the whole set of data values the current subset of data values represents. Again, this is easier to show with an example than to say in words. You are given a table of the sales made by your sales force.
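As a sketch of how the same correlated-subquery pattern extends to a cumulative percentage, assume a hypothetical Sales table with columns salesman and sales_amt (both names are illustrative, standing in for whatever your sales table actually uses). Accumulating from the largest amount down:

SELECT S0.salesman, S0.sales_amt,
       (SELECT SUM(S1.sales_amt) -- running total, largest amounts first
          FROM Sales AS S1
         WHERE S1.sales_amt >= S0.sales_amt)
       * 100.00
       / (SELECT SUM(S2.sales_amt) -- grand total of all sales
            FROM Sales AS S2) AS cumulative_pct
  FROM Sales AS S0;

As with the bank account example, ties on sales_amt are aggregated together, so a real report would need a tie-breaker, such as the salesman's name, before the running figures are unambiguous.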