532 CHAPTER 23: STATISTICS IN SQL

CREATE TABLE Sales
(salesman CHAR(10),
 client_name CHAR(10),
 sales_amt DECIMAL (9,2) NOT NULL,
 PRIMARY KEY (salesman, client_name));

The problem is to show each salesman, his client, the amount of that sale, what percentage of his total sales volume that one sale represents, and the cumulative percentage of his total sales we have reached at that point. We will sort the clients from the largest amount to the smallest.

This problem is based on a salesman’s report originally written for a small commercial printing company. The idea was to show the salesmen where their business was coming from and to persuade them to give up their smaller accounts (defined as the lower 20%) to new salesmen. The report lets the salesman run his finger down the page and see which customers represented the top 80% of his income.

We can use derived tables to build layers of aggregation in the same query. Ties in sales_amt are broken by client_name, so that each row gets its own running total.

SELECT S0.salesman, S0.client_name, S0.sales_amt,
       ((S0.sales_amt * 100.00) / ST.salesman_total)
          AS percent_of_total,
       ((SUM(S1.sales_amt) * 100.00) / ST.salesman_total)
          AS cum_percent
  FROM Sales AS S0
       INNER JOIN
       Sales AS S1
       ON S0.salesman = S1.salesman
          AND (S1.sales_amt > S0.sales_amt
               OR (S1.sales_amt = S0.sales_amt
                   AND S1.client_name >= S0.client_name))
       INNER JOIN
       (SELECT S2.salesman, SUM(S2.sales_amt)
          FROM Sales AS S2
         GROUP BY S2.salesman) AS ST(salesman, salesman_total)
       ON S0.salesman = ST.salesman
 GROUP BY S0.salesman, S0.client_name, S0.sales_amt,
          ST.salesman_total;

However, if your SQL allows subqueries in the SELECT clause but not in the FROM clause, you can fake it with this query:

23.6 Cumulative Statistics 533

SELECT S0.salesman, S0.client_name, S0.sales_amt,
       (S0.sales_amt * 100.00 /
        (SELECT SUM(S1.sales_amt)
           FROM Sales AS S1
          WHERE S0.salesman = S1.salesman))
          AS percentage_of_total,
       (SELECT SUM(S3.sales_amt)
          FROM Sales AS S3
         WHERE S0.salesman = S3.salesman
           AND (S3.sales_amt > S0.sales_amt
                OR (S3.sales_amt = S0.sales_amt
                    AND S3.client_name >= S0.client_name)))
        * 100.00 /
        (SELECT SUM(S2.sales_amt)
           FROM Sales AS S2
          WHERE S0.salesman = S2.salesman)
          AS cum_percent
  FROM Sales AS S0;

This query will probably run like glue.

23.6.4 Rankings and Related Statistics

Martin Tillinger posted this problem on the MSACCESS forum of CompuServe in early 1995. How do you rank your salesmen in each territory, given a SalesReport table that looks like this?

CREATE TABLE SalesReport
(salesman CHAR(20) NOT NULL PRIMARY KEY
    REFERENCES Salesforce(salesman),
 territory INTEGER NOT NULL,
 sales_tot DECIMAL (8,2) NOT NULL);

This statistic is called a ranking. A ranking is shown as integers that represent the ordinal values (first, second, third, and so on) of the elements of a set based on one of the values. In this case, sales personnel are ranked by their total sales within a territory. The one with the highest total sales is in first place, the next highest is in second place, and so forth.

The hard question is how to handle ties. The rule is that if two salespersons have the same value, they have the same ranking, and there are no gaps in the rankings. This is the nature of ordinal numbers: there cannot be a third place without a first and a second place. A query that will do this for us is:

SELECT S1.salesman, S1.territory, S1.sales_tot,
       (SELECT COUNT(DISTINCT sales_tot)
          FROM SalesReport AS S2
         WHERE S2.sales_tot >= S1.sales_tot
           AND S2.territory = S1.territory) AS rank
  FROM SalesReport AS S1;

You might also remember that this is really a version of the generalized extrema functions we already discussed. Another way to write this query is thus:

SELECT S1.salesman, S1.territory, MAX(S1.sales_tot),
       SUM(CASE WHEN S1.sales_tot < S2.sales_tot
                  OR (S1.sales_tot = S2.sales_tot
                      AND S1.salesman <= S2.salesman)
                THEN 1 ELSE 0 END) AS rank
  FROM SalesReport AS S1, SalesReport AS S2
 WHERE S1.salesman <> S2.salesman
   AND S1.territory = S2.territory
 GROUP BY S1.salesman, S1.territory;

This query uses the MAX() function on the nongrouping columns in the SalesReport to display them so that the aggregation will work. Because the WHERE clause excludes each salesman’s own row, this version numbers from zero, and it breaks ties in sales_tot by the salesman’s name instead of giving tied rows the same rank.
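As a point of comparison, and not part of the original solution, a product that implements the SQL:2003 window functions can express the same "ties share a rank, no gaps" rule directly. The sketch below assumes such a product and the SalesReport table declared above; the column name sales_rank is only an illustrative choice.

```sql
-- Sketch only: assumes support for SQL:2003 window functions.
-- DENSE_RANK() gives tied rows the same number and leaves no gaps,
-- which matches the COUNT(DISTINCT ...) >= subquery version.
SELECT salesman, territory, sales_tot,
       DENSE_RANK() OVER (PARTITION BY territory
                              ORDER BY sales_tot DESC) AS sales_rank
  FROM SalesReport;
```

Whether this runs faster than the correlated subquery depends on the product; the subquery form remains the portable choice where window functions are not available.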
It is worth looking at the four possible variations on the basic COUNT() subquery to see what each change does to the result set.

Version 1: COUNT(DISTINCT) and >= yields a ranking.

SELECT S1.salesman, S1.territory, S1.sales_tot,
       (SELECT COUNT(DISTINCT sales_tot)
          FROM SalesReport AS S2
         WHERE S2.sales_tot >= S1.sales_tot
           AND S2.territory = S1.territory) AS rank
  FROM SalesReport AS S1;

salesman    territory  sales_tot  rank
=============================================
'Wilson'        1        990.00     1
'Smith'         1        950.00     2
'Richard'       1        800.00     3
'Quinn'         1        700.00     4
'Parker'        1        345.00     5
'Jones'         1        345.00     5
'Hubbard'       1        345.00     5
'Date'          1        200.00     6
'Codd'          1        200.00     6
'Blake'         1        100.00     7

Version 2: COUNT(DISTINCT) and > yields a ranking, but it starts at zero.

SELECT S1.salesman, S1.territory, S1.sales_tot,
       (SELECT COUNT(DISTINCT sales_tot)
          FROM SalesReport AS S2
         WHERE S2.sales_tot > S1.sales_tot
           AND S2.territory = S1.territory) AS rank
  FROM SalesReport AS S1;

salesman    territory  sales_tot  rank
=============================================
'Wilson'        1        990.00     0
'Smith'         1        950.00     1
'Richard'       1        800.00     2
'Quinn'         1        700.00     3
'Parker'        1        345.00     4
'Jones'         1        345.00     4
'Hubbard'       1        345.00     4
'Date'          1        200.00     5
'Codd'          1        200.00     5
'Blake'         1        100.00     6

Version 3: COUNT(ALL) and >= yields a standing that starts at one.

SELECT S1.salesman, S1.territory, S1.sales_tot,
       (SELECT COUNT(sales_tot)
          FROM SalesReport AS S2
         WHERE S2.sales_tot >= S1.sales_tot
           AND S2.territory = S1.territory) AS standing
  FROM SalesReport AS S1;

salesman    territory  sales_tot  standing
=============================================
'Wilson'        1        990.00      1
'Smith'         1        950.00      2
'Richard'       1        800.00      3
'Quinn'         1        700.00      4
'Parker'        1        345.00      7
'Jones'         1        345.00      7
'Hubbard'       1        345.00      7
'Date'          1        200.00      9
'Codd'          1        200.00      9
'Blake'         1        100.00     10

Version 4: COUNT(ALL) and > yields a standing that starts at zero.
SELECT S1.salesman, S1.territory, S1.sales_tot,
       (SELECT COUNT(sales_tot)
          FROM SalesReport AS S2
         WHERE S2.sales_tot > S1.sales_tot
           AND S2.territory = S1.territory) AS standing
  FROM SalesReport AS S1;

salesman    territory  sales_tot  standing
==============================================
'Wilson'        1        990.00      0
'Smith'         1        950.00      1
'Richard'       1        800.00      2
'Quinn'         1        700.00      3
'Parker'        1        345.00      4
'Jones'         1        345.00      4
'Hubbard'       1        345.00      4
'Date'          1        200.00      7
'Codd'          1        200.00      7
'Blake'         1        100.00      9

Another system, used in some British schools and in horse racing, will also leave gaps in the numbers, but in a different direction. For example, given this set of marks:

marks  class_standing
======================
 100         1
  90         2
  90         2
  70         4

Both students with 90 were second because only one person had a higher mark. The student with 70 was fourth because there were three people ahead of him. With our data, that would be:

SELECT S1.salesman, S1.territory, S1.sales_tot,
       (SELECT COUNT(S2.sales_tot)
          FROM SalesReport AS S2
         WHERE S2.sales_tot > S1.sales_tot
           AND S2.territory = S1.territory) + 1 AS british
  FROM SalesReport AS S1;

salesman    territory  sales_tot  british
=============================================
'Wilson'        1        990.00      1
'Smith'         1        950.00      2
'Richard'       1        800.00      3
'Quinn'         1        700.00      4
'Parker'        1        345.00      5
'Jones'         1        345.00      5
'Hubbard'       1        345.00      5
'Date'          1        200.00      8
'Codd'          1        200.00      8
'Blake'         1        100.00     10

As an aside for the mathematicians among the readers, I always use the heuristic that it helps to solve an SQL problem by thinking in terms of sets. What we are looking for in these ranking queries is how to assign an ordinal number to a subset of the SalesReport table. This subset is the rows that have an equal or higher sales volume than the salesman at whom we are looking. Or in other words, one copy of the SalesReport table provides the elements of the subsets, and the other copy provides the boundary of the subsets. This count is really a sequence of nested subsets.
If you happen to have had a good set theory course, you would remember John von Neumann’s definition of the nth ordinal number; it is the set of all ordinal numbers less than the nth number.

23.6.5 Quintiles and Related Statistics

Once you have the ranking, it is fairly easy to classify the data set into percentiles, quintiles, or deciles. These are coarser versions of a ranking that use subsets of roughly equal size. A quintile is 1/5 of the population, a decile is 1/10 of the population, and a percentile is 1/100 of the population. I will present quintiles here, since whatever we do for them can be generalized to other partitionings. This statistic is popular with schools, so I will use the SAT scores for an imaginary group of students for my example.

SELECT T1.student_id, T1.score, T1.rank,
       CASE WHEN T1.rank <= 0.2 * T2.population_size THEN 1
            WHEN T1.rank <= 0.4 * T2.population_size THEN 2
            WHEN T1.rank <= 0.6 * T2.population_size THEN 3
            WHEN T1.rank <= 0.8 * T2.population_size THEN 4
            ELSE 5 END AS quintile
  FROM (SELECT S1.student_id, S1.score,
               (SELECT COUNT(*)
                  FROM SAT_Scores AS S2
                 WHERE S2.score >= S1.score)
          FROM SAT_Scores AS S1) AS T1(student_id, score, rank)
       CROSS JOIN
       (SELECT COUNT(*) FROM SAT_Scores) AS T2(population_size);

The idea is straightforward: compute the rank for each element and then put it into a bucket whose size is determined by the population size. There are the same problems with ties that we had with rankings, as well as problems about what to do when the population is skewed.

23.7 Cross Tabulations

A cross tabulation, or crosstab for short, is a common statistical report. It can be done in IBM’s QMF tool, using the ACROSS summary option, and in many other SQL-based reporting packages. SPSS, SAS, and other statistical packages have library procedures or language constructs for crosstabs. Many spreadsheets can load the results of SQL queries and perform a crosstab within the spreadsheet.
If you can use a reporting package on the server in a client/server system instead of the following method, do so. It will run faster and in less space than the method discussed here. However, if you have to use the reporting package on the client side, the extra time required to transfer data will make these methods on the server side much faster.

A one-way crosstab “flattens out” a table to display it in a report format. Assume that we have a table of sales by product and the dates the sales were made. We want to print out a report of the sales of products by years for a full decade. The solution is to create a table and populate it to look like an identity matrix (all elements on the diagonal are one, all others zero) with a rightmost column of all ones to give a row total, then JOIN the Sales table to it.

CREATE TABLE Sales
(product_name CHAR(15) NOT NULL,
 product_price DECIMAL(5,2) NOT NULL,
 qty INTEGER NOT NULL,
 sales_year INTEGER NOT NULL);

CREATE TABLE Crosstabs
(year INTEGER NOT NULL,
 year1 INTEGER NOT NULL,
 year2 INTEGER NOT NULL,
 year3 INTEGER NOT NULL,
 year4 INTEGER NOT NULL,
 year5 INTEGER NOT NULL,
 row_total INTEGER NOT NULL);

The table would be populated as follows:

year   year1  year2  year3  year4  year5  row_total
========================================================
1990     1      0      0      0      0       1
1991     0      1      0      0      0       1
1992     0      0      1      0      0       1
1993     0      0      0      1      0       1
1994     0      0      0      0      1       1

The query to produce the report table is

SELECT S1.product_name,
       SUM(S1.qty * S1.product_price * C1.year1),
       SUM(S1.qty * S1.product_price * C1.year2),
       SUM(S1.qty * S1.product_price * C1.year3),
       SUM(S1.qty * S1.product_price * C1.year4),
       SUM(S1.qty * S1.product_price * C1.year5),
       SUM(S1.qty * S1.product_price * C1.row_total)
  FROM Sales AS S1, Crosstabs AS C1
 WHERE S1.sales_year = C1.year
 GROUP BY S1.product_name;

Obviously, (S1.product_price * S1.qty) is the total dollar amount of each product in each year.
The year n column will be either a one or a zero. If it is a zero, the total dollar amount in the SUM() is zero; if it is a one, the total dollar amount in the SUM() is unchanged. This solution lets you adjust the time frame being shown in the report by replacing the values in the year column with whatever consecutive years you wish.

A two-way crosstab takes two variables and produces a spreadsheet with all values of one variable on the rows and all values of the other represented by the columns. Each cell in the table holds the COUNT of entities that have those values for the two variables. NULLs will not fit into a crosstab very well, unless you decide to make them a group of their own or to remove them.

Another trick is to use the POSITION() function to convert a string into a one or a zero. For example, assume we have a “day of the week” function that returns a three-letter abbreviation and we want to report the sales of items by day of the week in a horizontal list.

CREATE TABLE Weekdays
(day_name CHAR(3) NOT NULL PRIMARY KEY,
 mon INTEGER NOT NULL,
 tue INTEGER NOT NULL,
 wed INTEGER NOT NULL,
 thu INTEGER NOT NULL,
 fri INTEGER NOT NULL,
 sat INTEGER NOT NULL,
 sun INTEGER NOT NULL);

INSERT INTO Weekdays
VALUES ('MON', 1, 0, 0, 0, 0, 0, 0),
       ('TUE', 0, 1, 0, 0, 0, 0, 0),
       ('WED', 0, 0, 1, 0, 0, 0, 0),
       ('THU', 0, 0, 0, 1, 0, 0, 0),
       ('FRI', 0, 0, 0, 0, 1, 0, 0),
       ('SAT', 0, 0, 0, 0, 0, 1, 0),
       ('SUN', 0, 0, 0, 0, 0, 0, 1);

SELECT item,
       SUM(amt * qty * mon * POSITION('MON' IN DOW(sales_date))) AS mon_tot,
       SUM(amt * qty * tue * POSITION('TUE' IN DOW(sales_date))) AS tue_tot,
       SUM(amt * qty * wed * POSITION('WED' IN DOW(sales_date))) AS wed_tot,
       SUM(amt * qty * thu * POSITION('THU' IN DOW(sales_date))) AS thu_tot,
       SUM(amt * qty * fri * POSITION('FRI' IN DOW(sales_date))) AS fri_tot,
       SUM(amt * qty * sat * POSITION('SAT' IN DOW(sales_date))) AS sat_tot,
       SUM(amt * qty * sun * POSITION('SUN' IN DOW(sales_date))) AS sun_tot
  FROM Weekdays, Sales
 GROUP BY item;

A two-way crosstab also has totals for each column and each row, as well as a grand total.

Crosstabs of (n) variables are defined by building an n-dimensional spreadsheet. But you cannot easily print (n) dimensions on two-dimensional paper. The usual trick is to display the results as a two-dimensional grid with one or both axes as a tree structure. The way the values are nested on the axis is usually under program control; thus, “race within sex” shows sex broken down by race, whereas “sex within race” shows race broken down by sex.

Assume that we have a table, Personnel (emp_nbr, sex, race, job_nbr, salary_amt), keyed on employee number, with no NULLs in any columns. We wish to write a crosstab of employees by sex and race, which would look like this:

          asian  black  caucasian  latino  other  TOTALS
===========================================================
Male          3      2         12       5      5      27
Female        1     10         20       2      9      42
TOTAL         4     12         32       7     14      69

The first thought is to use a GROUP BY and write a simple query, thus:

SELECT sex, race, COUNT(*)
  FROM Personnel
 GROUP BY sex, race;
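The GROUP BY query returns one row per (sex, race) pair, not the grid shown in the report. As a rough sketch of one way to get the grid layout, and not necessarily the approach the chapter goes on to develop, each race value can be turned into its own column with a SUM(CASE ...) expression. The race literals below are assumptions taken from the column headings of the sample report; the actual encoding in Personnel is not given.

```sql
-- Sketch only: one counting column per race value.
-- The string literals are assumed encodings, not given in the text.
SELECT sex,
       SUM(CASE WHEN race = 'asian'     THEN 1 ELSE 0 END) AS asian,
       SUM(CASE WHEN race = 'black'     THEN 1 ELSE 0 END) AS black,
       SUM(CASE WHEN race = 'caucasian' THEN 1 ELSE 0 END) AS caucasian,
       SUM(CASE WHEN race = 'latino'    THEN 1 ELSE 0 END) AS latino,
       SUM(CASE WHEN race = 'other'     THEN 1 ELSE 0 END) AS other,
       COUNT(*) AS totals          -- row total for this sex
  FROM Personnel
 GROUP BY sex;
```

This yields the Male and Female rows with their row totals. The bottom TOTAL row would need the same SELECT list run without the GROUP BY clause, with a constant in place of the sex column.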