CHAPTER 21: AGGREGATE FUNCTIONS

    'Chester'   'A.'    'Arthur'      'R'  1881  1885
    'Grover'    ' '     'Cleveland'   'D'  1885  1889
    'Benjamin'  ' '     'Harrison'    'R'  1889  1893
    'Grover'    ' '     'Cleveland'   'D'  1893  1897
    'William'   ' '     'McKinley'    'R'  1897  1901
    'Theodore'  ' '     'Roosevelt'   'R'  1901  1909
    'William'   'H.'    'Taft'        'R'  1909  1913
    'Woodrow'   ' '     'Wilson'      'D'  1913  1921
    'Warren'    'G.'    'Harding'     'R'  1921  1923
    'Calvin'    ' '     'Coolidge'    'R'  1923  1929
    'Herbert'   'C.'    'Hoover'      'R'  1929  1933
    'Franklin'  'D.'    'Roosevelt'   'D'  1933  1945
    'Harry'     'S.'    'Truman'      'D'  1945  1953
    'Dwight'    'D.'    'Eisenhower'  'R'  1953  1961
    'John'      'F.'    'Kennedy'     'D'  1961  1963
    'Lyndon'    'B.'    'Johnson'     'D'  1963  1969
    'Richard'   'M.'    'Nixon'       'R'  1969  1974
    'Gerald'    'R.'    'Ford'        'R'  1974  1977
    'James'     'E.'    'Carter'      'D'  1977  1981
    'Ronald'    'W.'    'Reagan'      'R'  1981  1989
    'George'    'H.W.'  'Bush'        'R'  1989  1993
    'William'   'J.'    'Clinton'     'D'  1993  2001
    'George'    'W.'    'Bush'        'R'  2001  NULL

Your civics teacher has just asked you to tell her how many people have been President of the United States. So you write the query as

    SELECT COUNT(*) FROM Presidents;

and get the wrong answer. For those of you who have been out of high school too long, more than one Adams, more than one John, and more than one Roosevelt have served as president. Many people have had more than one term in office, and Grover Cleveland served two discontinuous terms. In short, this database is not a simple one-row, one-person system. What you really want is not COUNT(*), but something that is able to look at unique combinations of multiple columns. You cannot do this in one column, so you need to construct an expression that is unique. The point is that you need to be very sure that the expression you are using as a parameter is really what you wanted to count.

The COUNT([ALL] <value expression>) returns the number of members in the <value expression> set. The NULLs are thrown away before the counting takes place, and an empty set returns zero.
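As a minimal sketch of how the counting forms differ, the following runs them against a four-row excerpt of the table. Python's sqlite3 module is used only to make the example self-contained; the column names start_year and end_year are assumptions (the text names only first_name, initial, and last_name), and other SQL products may differ in details.

```python
import sqlite3


def count_presidents():
    """Return (row count, known end years, distinct people) for a small excerpt."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Presidents ("
        " first_name TEXT NOT NULL,"
        " initial TEXT NOT NULL,"
        " last_name TEXT NOT NULL,"
        " start_year INTEGER NOT NULL,"  # assumed column name
        " end_year INTEGER)"             # assumed column name; NULL = still in office
    )
    conn.executemany(
        "INSERT INTO Presidents VALUES (?, ?, ?, ?, ?)",
        [
            ("Grover", " ", "Cleveland", 1885, 1889),
            ("Benjamin", " ", "Harrison", 1889, 1893),
            ("Grover", " ", "Cleveland", 1893, 1897),
            ("George", "W.", "Bush", 2001, None),
        ],
    )
    row = conn.execute(
        "SELECT COUNT(*),"          # every row, NULLs included
        "       COUNT(end_year),"   # only rows with a known end_year
        "       COUNT(DISTINCT first_name || ' ' || initial || ' ' || last_name)"
        "  FROM Presidents"
    ).fetchone()
    conn.close()
    return row


if __name__ == "__main__":
    print(count_presidents())
```

On this excerpt the three counts come out as 4 rows, 3 known end years, and 3 distinct people, because Grover Cleveland appears twice and one end year is NULL.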
The best way to read COUNT(<value expression>) is: “Count the number of known values in this expression,” with stress on the word known. In this example you might use COUNT(first_name || ' ' || initial || ' ' || last_name).

The COUNT(DISTINCT <value expression>) returns the number of unique members in the <value expression> set. The NULLs are thrown away before the counting takes place, and then all redundant duplicates are removed (i.e., we keep one copy). Again, an empty set returns a zero, just as with the other counting functions. Applying this function to a key or a unique column is the same as using the COUNT(*) function, but the optimizer may not be smart enough to spot it.

Notice that the use of the keywords ALL and DISTINCT follows the same pattern here as in the [ALL | DISTINCT] options in the SELECT clause of the query expressions.

21.2 SUM() Functions

This function works only with numeric values. You should also consult your particular product’s manuals to find out the precision of the results for exact and approximate numeric data types.

SUM([ALL] <value expression>) returns the numeric total of all known values. The NULLs are removed before the summation takes place. An empty set returns an empty result set, not a zero. If there are other columns in the SELECT list, then that empty set will be converted into a NULL.

SUM(DISTINCT <value expression>) returns the numeric total of all known, unique values. The NULLs and all redundant duplicates are removed before the summation takes place. Again, an empty set returns an empty result set, not a zero. That last rule is hard for people to understand. If there are other columns in the SELECT list, then that empty result set will be converted into a NULL.
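The handling of NULLs, duplicates, and an empty table alongside another column can be sketched as follows. The table name Foo and its contents are hypothetical, and the run uses SQLite through Python's sqlite3 module, so the output shows one product's behavior and is not a statement about every implementation.

```python
import sqlite3


def sum_examples():
    """Return ((SUM, SUM DISTINCT) over Foo, (COUNT(*), SUM) over an empty table)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Foo (x INTEGER)")         # hypothetical table
    conn.execute("CREATE TABLE EmptyTable (x INTEGER)")  # deliberately left empty
    conn.executemany(
        "INSERT INTO Foo VALUES (?)",
        [(10,), (10,), (None,), (5,)],
    )
    # The NULL is dropped before summing; DISTINCT also drops the second 10.
    totals = conn.execute("SELECT SUM(x), SUM(DISTINCT x) FROM Foo").fetchone()
    # With COUNT(*) in the SELECT list, the empty set shows up as a NULL sum.
    empty = conn.execute("SELECT COUNT(*), SUM(x) FROM EmptyTable").fetchone()
    conn.close()
    return totals, empty


if __name__ == "__main__":
    print(sum_examples())
```

Here the totals are 25 and 15, and the empty table yields a count of zero next to a NULL (None in Python) sum.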
The empty-set rule holds for the rest of the Standard aggregate functions:

    -- no rows
    SELECT SUM(x) FROM EmptyTable;

    -- one row with (0, NULL) in it
    SELECT COUNT(*), SUM(x) FROM EmptyTable;

The summation of a set of numbers looks as though it should be easy, but it is not. Make two tables with the same set of positive and negative approximate numeric values, but put one in random order and have the other sorted by absolute value. The sorted table will give more accurate results. The reason is simple: positive and negative values of the same magnitude will be added together and will get a chance to cancel each other out. There is also less chance of an overflow or underflow error during calculations. Most PC SQL implementations and a lot of mainframe implementations do not bother with this trick, because it would require a sort for every SUM() statement, which would take a long time.

Whenever an exact or approximate numeric value is assigned to an exact numeric column, it may not fit into the storage allowed for it. SQL says that the database engine will use an approximation that preserves leading significant digits of the original number after rounding or truncating. The choice of whether to truncate or round is implementation-defined, however. This can lead to some surprises when you have to shift data among SQL implementations, or move storage values from a host language program into an SQL table. It is probably a good idea to create the columns with one more decimal place than you think you need.

Truncation is defined as truncation toward zero; this means that 1.5 would truncate to 1, and −1.5 would truncate to −1. This is not true for all programming languages; everyone agrees on truncation toward zero for the positive numbers, but you will find that negative numbers may truncate away from zero (e.g., −1.5 would truncate to −2). SQL is also wishy-washy on rounding, leaving the implementation free to determine its method.
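The two truncation conventions described above can be sketched in a host language. Python is used here only as an illustration; the function names are hypothetical, and the sketch says nothing about which convention a given SQL product or host language applies.

```python
import math


def truncate_toward_zero(x):
    """Drop the fractional part, moving toward zero (the SQL definition)."""
    return math.trunc(x)


def truncate_toward_negative_infinity(x):
    """Always move down, so negative values end up farther from zero."""
    return math.floor(x)


if __name__ == "__main__":
    for value in (1.5, -1.5):
        print(value,
              truncate_toward_zero(value),
              truncate_toward_negative_infinity(value))
```

Both functions agree on 1.5, which becomes 1; they disagree on −1.5, which becomes −1 under the first convention and −2 under the second.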
There are two major types of rounding, the scientific method and the commercial method, which are discussed in Section 3.2.1 on rounding and truncation math in SQL.

21.3 AVG() Functions

AVG([ALL] <value expression>) returns the average of the values in the <value expression> set. An empty set returns an empty result set. A set of all NULLs will become an empty set. Remember that in general, AVG(x) is not the same as (SUM(x)/COUNT(*)); the SUM(x) function has thrown away the NULLs, but the COUNT(*) has not.

Likewise, AVG(DISTINCT <value expression>) returns the average of the distinct known values in the <value expression> set. Applying this function to a key or a unique column is the same as using the AVG(<value expression>) function.

Remember that in general AVG(DISTINCT x) is not the same as AVG(x) or (SUM(DISTINCT x)/COUNT(*)). The SUM(DISTINCT x) function has thrown away the duplicate values and NULLs, but the COUNT(*) has not. An empty set returns an empty result set.

The SQL engine is probably using the same code for the totaling in the AVG() that it used in the SUM() function. This leads to the same problems with rounding and truncation, so you should experiment a little with your particular product to find out what happens.

But even more troublesome than those problems is the problem with the average itself, because it does not really measure central tendency and can be very misleading. Consider the chart below, from Darrell Huff’s superlative little book, How to Lie with Statistics (Huff 1954).

The Sample Company has 25 employees, earning the following salaries:

    Number of Employees   Salary    Statistic
    ==========================================
            12            $2,000    Mode, Minimum
             1            $3,000    Median
             4            $3,700
             3            $5,000
             1            $5,700    Average
             2           $10,000
             1           $15,000
             1           $45,000    Maximum

The average salary (or, more properly, the arithmetic mean) is $5,700. When the boss is trying to look good to the unions, he uses this figure.
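The statistics labeled in the chart can be checked with a short sketch. It uses Python's statistics module purely as a calculator; the PAYROLL list simply restates the chart, and the function name is hypothetical.

```python
from statistics import mean, median, mode

# (number of employees, salary) pairs copied from the chart.
PAYROLL = [
    (12, 2000), (1, 3000), (4, 3700), (3, 5000),
    (1, 5700), (2, 10000), (1, 15000), (1, 45000),
]


def salary_statistics():
    """Expand the chart into one salary per employee and summarize it."""
    salaries = [salary for headcount, salary in PAYROLL for _ in range(headcount)]
    return {
        "employees": len(salaries),
        "mean": mean(salaries),
        "median": median(salaries),
        "mode": mode(salaries),
        "minimum": min(salaries),
        "maximum": max(salaries),
    }


if __name__ == "__main__":
    print(salary_statistics())
```

The sketch reproduces the chart: 25 employees, a mean of 5,700, a median of 3,000, and a mode equal to the minimum of 2,000.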
When the unions are trying to look impoverished, they use the mode, which is the most frequently occurring value, to show that the exploited workers are making $2,000 (which is also the minimum salary in this case).

A better measure in this case is the median, which will be discussed later: the salary of the employee with just as many cases above him as below him. That gives us $3,000. The rule for calculating the median is that if there is no actual entity with that value, you fake it. Most people take an average of the two values on either side of where the median would be; others jump to the higher or lower value.

The mode also has a problem, because not every distribution of values has one mode. Imagine a country in which there are as many very poor people as there are very rich people, and there is nobody in between. This would be a bimodal distribution. If there were sharp classes of incomes, that would be a multimodal distribution.

Some SQL products have median and mode aggregate functions as extensions, but they are not part of the standard. We will discuss in detail how to write them in pure SQL in Chapter 23.

21.3.1 Averages with Empty Groups

The query used here is a bit tricky, so this section can be skipped on your first reading. Sometimes you need to count an empty set as part of the population when computing an average.

This is easier to explain with an example that was posted on CompuServe. A fish and game warden is sampling different bodies of water for fish populations. Each sample falls into one or more groups (muddy bottoms, clear water, still water, and so on) and she is trying to find the average of something that is not there. This is neither quite as strange as it first sounds, nor quite as simple, either.
She is collecting sample data on fish in a table like this:

    CREATE TABLE Samples
    (sample_id INTEGER NOT NULL,
     fish CHAR(20) NOT NULL,
     found_cnt INTEGER NOT NULL,
     PRIMARY KEY (sample_id, fish));

    CREATE TABLE SampleGroups
    (group_id INTEGER NOT NULL,
     sample_id INTEGER NOT NULL,
     PRIMARY KEY (group_id, sample_id));

Assume some of the data looks like this:

    Samples
    sample_id  fish       found_cnt
    ================================
        1      'Seabass'     14
        1      'Minnow'      18
        2      'Seabass'     19

    SampleGroups
    group_id  sample_id
    ====================
        1         1
        1         2
        2         2

She needs to get the average number of each species of fish in the sample groups. For example, using sample group 1 as shown, which has samples 1 and 2, we could use the parameters :my_fish = 'Minnow' and :my_group = 1 to find the average number of minnows in sample group 1, thus:

    SELECT fish, AVG(found_cnt)
      FROM Samples
     WHERE sample_id
           IN (SELECT sample_id
                 FROM SampleGroups
                WHERE group_id = :my_group)
       AND fish = :my_fish
     GROUP BY fish;

But this query will give us an average of 18 minnows, which is wrong. There were no minnows for sample_id = 2, so the average is ((18 + 0)/2) = 9. The other way is to do several steps to get the correct answer: first use a SELECT statement to get the number of samples involved, then another SELECT to get the sum, and then manually calculate the average.

The obvious answer is to enter a count of zero for each species under each sample_id, instead of letting it be missing, so you can use the original query. You can create the missing rows with:

    INSERT INTO Samples
    SELECT DISTINCT M1.sample_id, M2.fish, 0  -- DISTINCT: each missing pair only once
      FROM Samples AS M1, Samples AS M2
     WHERE NOT EXISTS
           (SELECT *
              FROM Samples AS M3
             WHERE M1.sample_id = M3.sample_id
               AND M2.fish = M3.fish);

Unfortunately, it turns out that we have over 100,000 different species of fish and thousands of samples. This trick will fill up more disk space than we have on the machine.
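On the three sample rows above, the difference between the naive average and the zero-filled one can be sketched as follows. The run uses SQLite through Python's sqlite3 module as an assumed test bed, and the DISTINCT in the INSERT is a precaution added here so that the same missing (sample_id, fish) pair is never generated twice.

```python
import sqlite3

NAIVE_AVG = """
SELECT fish, AVG(found_cnt)
  FROM Samples
 WHERE sample_id IN (SELECT sample_id
                       FROM SampleGroups
                      WHERE group_id = :my_group)
   AND fish = :my_fish
 GROUP BY fish
"""

ZERO_FILL = """
INSERT INTO Samples
SELECT DISTINCT M1.sample_id, M2.fish, 0
  FROM Samples AS M1, Samples AS M2
 WHERE NOT EXISTS (SELECT *
                     FROM Samples AS M3
                    WHERE M1.sample_id = M3.sample_id
                      AND M2.fish = M3.fish)
"""


def minnow_averages():
    """Return the naive average before and after the zero rows are inserted."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Samples
        (sample_id INTEGER NOT NULL,
         fish CHAR(20) NOT NULL,
         found_cnt INTEGER NOT NULL,
         PRIMARY KEY (sample_id, fish));
        CREATE TABLE SampleGroups
        (group_id INTEGER NOT NULL,
         sample_id INTEGER NOT NULL,
         PRIMARY KEY (group_id, sample_id));
        INSERT INTO Samples VALUES (1, 'Seabass', 14), (1, 'Minnow', 18), (2, 'Seabass', 19);
        INSERT INTO SampleGroups VALUES (1, 1), (1, 2), (2, 2);
    """)
    params = {"my_group": 1, "my_fish": "Minnow"}
    before = conn.execute(NAIVE_AVG, params).fetchone()
    conn.execute(ZERO_FILL)  # adds the single missing row (2, 'Minnow', 0)
    after = conn.execute(NAIVE_AVG, params).fetchone()
    conn.close()
    return before, after


if __name__ == "__main__":
    print(minnow_averages())
```

Before the zero row exists the query reports 18 minnows; afterward it reports 9, at the cost of storing one row for every combination of sample and species.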
The best trick is to use this statement:

    SELECT fish,
           SUM(found_cnt) /
           (SELECT COUNT(sample_id)
              FROM SampleGroups
             WHERE group_id = :my_group)
      FROM Samples
     WHERE fish = :my_fish
       AND sample_id
           IN (SELECT sample_id
                 FROM SampleGroups
                WHERE group_id = :my_group)
     GROUP BY fish;

This query is using the rule that the average is the sum of values divided by the count of the set. The sum has to be restricted to the samples in the group, just as in the first query; otherwise, fish found in samples outside the group would be added to the total. Because both operands are integers, many products will truncate the quotient, so cast one of them to a decimal type if you need the fraction. Another way to do this would be to use an OUTER JOIN and preserve all the group IDs, but that would create NULLs for the fish that are not in some of the sample groups, and you would have to handle them.

21.3.2 Averages across Columns

The sum of several columns can be done with the COALESCE() function to effectively remove the NULLs by replacing them with zeros:

    SELECT (COALESCE(c1, 0.0)
          + COALESCE(c2, 0.0)
          + COALESCE(c3, 0.0)) AS c_total
      FROM Foobar;

Likewise, the minimum and maximum values of several columns can be done with a CASE expression, or the GREATEST() and LEAST() functions.

Taking an average across several columns is easy if none of the columns are NULL. You simply add the values and divide by the number of columns. However, getting rid of NULLs is a bit harder. The first trick is to count the NULLs:

    SELECT (COALESCE(c1-c1, 1)
          + COALESCE(c2-c2, 1)
          + COALESCE(c3-c3, 1)) AS null_cnt
      FROM Foobar;

The trick is to watch out for a row with all NULLs in it. This could lead to a division by zero error.

    SELECT CASE WHEN COALESCE(c1, c2, c3) IS NULL
                THEN NULL
                ELSE (COALESCE(c1, 0.0)
                    + COALESCE(c2, 0.0)
                    + COALESCE(c3, 0.0))
                   / (3 - (COALESCE(c1-c1, 1)
                         + COALESCE(c2-c2, 1)
                         + COALESCE(c3-c3, 1)))
           END AS horizontal_avg
      FROM Foobar;

21.4 Extrema Functions

The MIN() and MAX() functions are known as extrema functions in mathematics. They assume that the elements of the set have an ordering, so it makes sense to select a first or last element based on its value. SQL provides two simple extrema functions, and you can write queries to generalize these to (n) elements.
21.4.1 Simple Extrema Functions

MAX([ALL | DISTINCT] <value expression>) returns the greatest known value in the <value expression> set. This function will work on character and temporal values, as well as numeric values. An empty set returns an empty result set. Technically, you can write MAX(DISTINCT <value expression>), but it is the same as MAX(<value expression>); this form exists only for completeness, and nobody ever uses it.

MIN([ALL | DISTINCT] <value expression>) returns the smallest known value in the <value expression> set. This function will also work on character and temporal values, as well as numeric values. An empty set returns a NULL. Likewise, MIN(DISTINCT <value expression>) exists, but it is defined only for completeness and nobody ever uses it.

The MAX() for a set of numeric values is the largest. The MAX() for a set of temporal data types is the one closest to 9999-12-31, which is the final date in the ISO-8601 Standard. The MAX() for a set of character strings is the last one in the ascending sort order. Likewise, the MIN() for a set of numeric values is the smallest. The MIN() for a set of temporal data types is the one furthest from 9999-12-31. The MIN() for a set of character strings is the first one in the ascending sort order, but you have to know the collation used.

People have a hard time understanding the MAX() and MIN() aggregate functions when they are applied to temporal data types. They seem to expect the MAX() to return the date closest to the current date. Likewise, if the set has no dates before the current date, they seem to expect the MIN() function to return the date closest to the current date. Human psychology wants to use the current time as an origin point for temporal reasoning.

Consider the predicate “billing_date < (CURRENT_DATE - INTERVAL '90' DAY)” as an example. Most people have to stop and figure out that this is looking for billings that are over 90 days past due.
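A small sketch may make the ordering rules for MIN() and MAX() concrete. The Billings table, its customer_name column, and its rows are hypothetical, and the example assumes SQLite, which has no separate date type: dates are stored as ISO-8601 strings, which sort in calendar order under SQLite's default binary collation.

```python
import sqlite3


def extrema_examples():
    """Return (earliest date, latest date, first name, last name) from Billings."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Billings ("
        " customer_name TEXT NOT NULL,"  # hypothetical column
        " billing_date TEXT NOT NULL)"   # ISO-8601 text, e.g. '2020-01-15'
    )
    conn.executemany(
        "INSERT INTO Billings VALUES (?, ?)",
        [
            ("Baker", "1999-12-31"),
            ("Able", "2031-06-30"),
            ("Charlie", "2020-01-15"),
        ],
    )
    row = conn.execute(
        "SELECT MIN(billing_date), MAX(billing_date),"
        "       MIN(customer_name), MAX(customer_name)"
        "  FROM Billings"
    ).fetchone()
    conn.close()
    return row


if __name__ == "__main__":
    print(extrema_examples())
```

MAX(billing_date) is simply the date nearest the end of the calendar and MIN(billing_date) the one nearest its start, wherever the current date happens to fall; the names come back in collation order.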
The same confusion arises with the MIN() and MAX() functions.

SQL also has funny rules about comparing VARCHAR strings, which can cause problems. When two strings are compared for equality, the shorter one is right-padded with blanks; then they are compared position for position. Thus, the strings 'John' and 'John   ' are equal. You will have to check your implementation of SQL to see which string is returned as the MAX() and which as the MIN(), or whether there is any pattern to it at all.

There are some tricks with extrema functions in subqueries that differ from product to product. For example, to find the current employee status in a table of Salary Histories, the obvious query is:

    SELECT *
      FROM SalaryHistory AS S0
     WHERE S0.change_date
           = (SELECT MAX(S1.change_date)
                FROM SalaryHistory AS S1
               WHERE S0.emp_id = S1.emp_id);

But you can also write the query as:

    SELECT *
      FROM SalaryHistory AS S0
     WHERE NOT EXISTS
           (SELECT *
              FROM SalaryHistory AS S1
             WHERE S0.emp_id = S1.emp_id
               AND S0.change_date < S1.change_date);

The correlated subquery with a MAX() will be implemented by going to the subquery and building a working table, which is grouped by emp_id. Then for each group you will keep track of the maximum and save it for the final result. However, the NOT EXISTS version will find the first row that meets the criteria and, when found, return TRUE. Therefore, the NOT EXISTS() predicate might run faster.

21.4.2 Generalized Extrema Functions

This is known as the Top (or Bottom) (n) values problem, and it originally appeared in Explain magazine; it was submitted by Jim Wankowski of Hawthorne, CA (Wankowski n.d.). You are given a table of Personnel and their salaries. Write a single SQL query that will display the three highest salaries from that table. It is easy to find the maximum salary with the simple query

    SELECT MAX(salary) FROM Personnel;

but SQL does not have a maximum function that will return a group of high values from a column.
The trouble with this query is that the specification is bad, for several reasons.

1. How do we define “best salary” in terms of an ordering? Is it base pay or does it include commissions? For the rest of this section, assume that we are using a simple table with a column that has the salary for each employee.

2. What if we have three or fewer personnel in the company? Do we report all the personnel we do have? Or do we return a NULL, empty result set, or error message? This is the equivalent of calling the contest for lack of entries.

3. How do we handle two personnel who tied? Include them all and allow the result set to be bigger than three? Pick an arbitrary subset and exclude someone? Or do we return a NULL, empty result set, or error message?

To make these problems more explicit, consider this table:

    Personnel
    emp_name  salary
    ==================
    'Able'    1000.00
    'Baker'    900.00