Joe Celko's SQL for Smarties - Advanced SQL Programming P58 pps


CHAPTER 23: STATISTICS IN SQL

23.7 Cross Tabulations

This approach works fine for two variables and would produce a table that could be sent to a report writer program to give a final version. But where are your column and row totals? This means you also need to write these two queries:

    SELECT race, COUNT(*) FROM Personnel GROUP BY race;
    SELECT sex, COUNT(*) FROM Personnel GROUP BY sex;

However, what I wanted was a table with a row for males and a row for females, with columns for each of the racial groups, just as I drew it.

But let us assume that we want to get this information broken down within a third variable, such as a job code. I want to see the job_nbr and the total by sex and race within each job code. Our query set starts to get bigger and bigger. A crosstab can also include other summary data, such as total or average salary within each cell of the table.

23.7.1 Crosstabs by Cross Join

A solution proposed by John M. Baird of Datapoint in San Antonio, Texas, involves creating a matrix table for each variable in the crosstab, thus:

    SexMatrix
    sex    Male  Female
    ===================
    'M'       1       0
    'F'       0       1

    RaceMatrix
    race       asian  black  caucasian  latino  Other
    =================================================
    asian          1      0          0       0      0
    black          0      1          0       0      0
    caucasian      0      0          1       0      0
    latino         0      0          0       1      0
    Other          0      0          0       0      1

The query then constructs the cells by using a CROSS JOIN (Cartesian product) and summation for each one, thus:

    SELECT job_nbr,
           SUM(asian * male) AS AsianMale,
           SUM(asian * female) AS AsianFemale,
           SUM(black * male) AS BlackMale,
           SUM(black * female) AS BlackFemale,
           SUM(caucasian * male) AS CaucMale,
           SUM(caucasian * female) AS CaucFemale,
           SUM(latino * male) AS LatinoMale,
           SUM(latino * female) AS LatinoFemale,
           SUM(other * male) AS OtherMale,
           SUM(other * female) AS OtherFemale
      FROM Personnel, SexMatrix, RaceMatrix
     WHERE (RaceMatrix.race = Personnel.race)
       AND (SexMatrix.sex = Personnel.sex)
     GROUP BY job_nbr;

Numeric summary data can be obtained from this table.
For example, the total salary for each cell can be computed by SUM(<race> * <sex> * salary) AS <cell name> in place of what we have here.

23.7.2 Crosstabs by Outer Joins

Another method, due to Jim Panttaja, uses a series of temporary tables or VIEWs and then combines them with OUTER JOINs.

    CREATE VIEW Guys (race, maletally)
    AS SELECT race, COUNT(*)
         FROM Personnel
        WHERE sex = 'M'
        GROUP BY race;

Correspondingly, you could have written:

    CREATE VIEW Dolls (race, femaletally)
    AS SELECT race, COUNT(*)
         FROM Personnel
        WHERE sex = 'F'
        GROUP BY race;

But they can be combined for a crosstab, without column and row totals, like this:

    SELECT Guys.race, maletally, femaletally
      FROM Guys LEFT OUTER JOIN Dolls
           ON Guys.race = Dolls.race;

The idea is to build a starting column in the crosstab, then progressively add columns to it. You use the LEFT OUTER JOIN to avoid missing-data problems.

23.7.3 Crosstabs by Subquery

Another method takes advantage of the orthogonality of correlated subqueries in SQL-92. Think about what each row or column in the crosstab wants.

    SELECT DISTINCT race,
           (SELECT COUNT(*)
              FROM Personnel AS P1
             WHERE P0.race = P1.race
               AND P1.sex = 'M') AS MaleTally,
           (SELECT COUNT(*)
              FROM Personnel AS P2
             WHERE P0.race = P2.race
               AND P2.sex = 'F') AS FemaleTally
      FROM Personnel AS P0;

An advantage of this approach is that you can attach another column to get the row tally by adding

    (SELECT COUNT(*)
       FROM Personnel AS P3
      WHERE P0.race = P3.race) AS RaceTally

Likewise, to get the column tallies, union the previous query with:

    SELECT 'Summary',
           (SELECT COUNT(*) FROM Personnel WHERE sex = 'M') AS GrandMaleTally,
           (SELECT COUNT(*) FROM Personnel WHERE sex = 'F') AS GrandFemaleTally,
           (SELECT COUNT(*) FROM Personnel) AS GrandTally
      FROM Personnel;

23.7.4 Crosstabs by CASE Expression

Probably the best method is to use the CASE expression.
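As a starting point, the basic CASE crosstab (one row per sex, one column per race, no totals) might look like the sketch below. It runs the SQL through Python's built-in sqlite3 module so it can be tried directly. The Personnel rows and the emp_id column are invented for illustration; only the sex and race columns come from the text.

```python
import sqlite3

# In-memory database with a hypothetical Personnel table.
# The rows are made-up sample data, not taken from the book.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Personnel "
    "(emp_id INTEGER PRIMARY KEY, sex CHAR(1) NOT NULL, race VARCHAR(10) NOT NULL)"
)
conn.executemany(
    "INSERT INTO Personnel (sex, race) VALUES (?, ?)",
    [
        ("M", "asian"), ("M", "black"), ("M", "black"),
        ("F", "asian"), ("F", "latino"),
        ("F", "caucasian"), ("F", "caucasian"),
    ],
)

# Each CASE expression yields 1 for rows in its race and 0 otherwise,
# so SUM() over a sex group counts the members of one crosstab cell.
crosstab_sql = """
SELECT sex,
       SUM(CASE race WHEN 'caucasian' THEN 1 ELSE 0 END) AS caucasian,
       SUM(CASE race WHEN 'black' THEN 1 ELSE 0 END) AS black,
       SUM(CASE race WHEN 'asian' THEN 1 ELSE 0 END) AS asian,
       SUM(CASE race WHEN 'latino' THEN 1 ELSE 0 END) AS latino,
       SUM(CASE race WHEN 'other' THEN 1 ELSE 0 END) AS other
  FROM Personnel
 GROUP BY sex
 ORDER BY sex
"""
crosstab = conn.execute(crosstab_sql).fetchall()
for row in crosstab:
    print(row)
```

Every cell is filled in a single pass over Personnel, with no helper tables, views, or correlated subqueries.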
If you need to get the final row of the traditional crosstab, you can add:

    SELECT sex,
           SUM(CASE race WHEN 'caucasian' THEN 1 ELSE 0 END) AS caucasian,
           SUM(CASE race WHEN 'black' THEN 1 ELSE 0 END) AS black,
           SUM(CASE race WHEN 'asian' THEN 1 ELSE 0 END) AS asian,
           SUM(CASE race WHEN 'latino' THEN 1 ELSE 0 END) AS latino,
           SUM(CASE race WHEN 'other' THEN 1 ELSE 0 END) AS other,
           COUNT(*) AS row_total
      FROM Personnel
     GROUP BY sex
    UNION ALL
    SELECT ' ',
           SUM(CASE race WHEN 'caucasian' THEN 1 ELSE 0 END),
           SUM(CASE race WHEN 'black' THEN 1 ELSE 0 END),
           SUM(CASE race WHEN 'asian' THEN 1 ELSE 0 END),
           SUM(CASE race WHEN 'latino' THEN 1 ELSE 0 END),
           SUM(CASE race WHEN 'other' THEN 1 ELSE 0 END),
           COUNT(*) AS column_total
      FROM Personnel;

23.8 Harmonic Mean and Geometric Mean

The harmonic mean is defined as the reciprocal of the arithmetic mean of the reciprocals of the values of a set. It is appropriate when dealing with rates and prices. Of limited use, it is found mostly in averaging rates.

    SELECT COUNT(*)/SUM(1.0/x) AS harmonic_mean
      FROM Foobar;

The geometric mean is the exponential of the mean of the logs of the data items. You can also express it as the nth root of the product of the (n) data items. This second form is more subject to rounding errors than the first. The geometric mean is sometimes a better measure of central tendency than the simple arithmetic mean when you are analyzing change over time.

    SELECT EXP (AVG (LOG (nbr))) AS geometric_mean
      FROM NumberTable;

If you have zero or negative numbers this will blow up, because the logarithm is not defined for values less than or equal to zero.

23.9 Multivariable Descriptive Statistics in SQL

More and more SQL products are adding more complicated descriptive statistics to their aggregate function library. For example, CA-Ingres comes with a very nice set of such tools. Many of the single-column aggregate functions for which we just gave code are built-in functions in such products.
If you have that advantage, then use them. They will have corrections for floating-point rounding errors and be more accurate.

Descriptive statistics are not all single-column computations. You often want to know relationships among several variables for prediction and description. Let’s pick one statistic that is representative of this class of functions and see what problems we have writing our own aggregate function for it.

23.9.1 Covariance

The covariance is defined as a measure of the extent to which two variables move together. Financial analysts use it to determine the degree to which return on two securities is related over time. A high covariance indicates similar movements. This code is due to Steve Kass:

    CREATE TABLE Samples
    (sample_nbr INTEGER NOT NULL PRIMARY KEY,
     x FLOAT NOT NULL,
     y FLOAT NOT NULL);

    INSERT INTO Samples
    VALUES (1, 3, 9), (2, 2, 7), (3, 4, 12), (4, 5, 15), (5, 6, 17);

    SELECT ((1.0/n) * SUM((x - xbar) * (y - ybar))) AS covariance
      FROM Samples
           CROSS JOIN
           (SELECT COUNT(*), AVG(x), AVG(y) FROM Samples) AS A (n, xbar, ybar)
     GROUP BY n;

23.9.2 Pearson’s r

One of the most useful covariants is Pearson’s r, or the linear correlation coefficient. It measures the strength of the linear association between two variables. In English, given a set of observations (x1, y1), (x2, y2), . . . , (xn, yn), I want to know: when one variable goes up or down, how well does the other variable follow it?

The correlation coefficient always takes a value between +1 and -1. Positive one means that the two variables match each other exactly. Negative one means that increasing values in one variable correspond to decreasing values in the other variable. A correlation value close to zero indicates no association between the variables. In the real world, you will not see +1 or -1 very often; this would mean that you are looking at a natural law, and not a statistical relationship.
The values in between are much more realistic, with 0.70 or greater being a strong correlation.

The formula translates into SQL in a straightforward manner.

    CREATE TABLE Samples
    (sample_name CHAR(3) NOT NULL PRIMARY KEY,
     x REAL,
     y REAL);

    INSERT INTO Samples
    VALUES ('a', 1.0, 2.0), ('b', 2.0, 5.0), ('c', 3.0, 6.0);

For this data, r = 0.9608.

    SELECT SUM((x - xbar) * (y - ybar))
           / SQRT(SUM((x - xbar)^2) * SUM((y - ybar)^2)) AS pearson_r
      FROM Samples
           CROSS JOIN
           (SELECT AVG(x), AVG(y) FROM Samples) AS A (xbar, ybar);

As in the covariance query, the averages come from a derived table, because aggregate functions cannot be nested inside one another. SQRT() is the square root function, which is quite common in SQL today, and ^2 is the square of the number. Some products use POWER(x, n) instead of the exponent notation. Alternatively, you can use repeated multiplication.

23.9.3 NULLs in Multivariable Descriptive Statistics

If (x, y) = (NULL, NULL), then the query will drop the pair in the aggregate functions, as per the usual rules of SQL. But what is the correct (or reasonable) behavior if (x, y) has one and only one NULL in the pair? We can make several arguments.

1. Drop the pairs that contain any NULLs. That is quick and easy with a “WHERE x IS NOT NULL AND y IS NOT NULL” clause added to the query. The argument is that if you don’t know one or both values, how can you know what their relationship is?

2. Convert (x, NULL) to (x, AVG(y)) and (NULL, y) to (AVG(x), y). The idea is to “smooth out” the missing values with a reasonable replacement that is based on the whole set from which known values were drawn. There might be better replacement values in a particular situation, but that idea would still hold.

3. Replace (NULL, NULL) with (a, a) for some value to say that the NULLs are in the same grouping. This kind of “pseudo-equality” is the basis for putting NULLs into one group in a GROUP BY operation. I am not sure what the correct practice for the (x, NULL) and (NULL, y) pairs is.

4. First calculate a linear regression with the known pairs, say y = (a + b*x), and then fill in the expected values.
   If you forgot your high school algebra, that would be y[i] = a + b * x[i] for the pair (x[i], NULL), and x[i] = (y[i] - a) / b for the pair (NULL, y[i]).

5. Catch the SQLSTATE warning code message (found in Standard SQL) to show that an aggregate function has dropped NULLs before doing the computations, and use the message to report to the user about the missing data.

I can also use COUNT(*) and COUNT(x+y) to determine how much data is missing. I think we would all agree that if I have a small subset of non-NULL pairs, then my correlation is less reliable than if I obtained it from a large subset of non-NULL pairs.

There is no right answer to this question. You will need to know the nature of your data to make a good decision.

CHAPTER 24
Regions, Runs, Gaps, Sequences, and Series

Tables do not have an ordering to their rows. Yes, the physical storage of the rows in many SQL products might be ordered if the product is built on an old file system. More modern implementations might not construct and materialize the result set rows until the end of the query execution.

The first rule in a relational database is that all relationships are shown in tables by values in columns. This means that things involving an ordering must have a table with at least two columns. One column, the sequence number, is the primary key; the other column has the value that holds that position in the sequence.

The sequence column has consecutive unique integers, without any gaps in the numbering. Examples of this sort of data would be ticket numbers, time series data taken at fixed intervals, and the like. The ordering of those identifiers carries some information, such as physical or temporal location.

A subsequence is a set of consecutive unique identifiers within a larger containing sequence that has some property. This property is usually consecutive numbering.
For example, given the data

    CREATE TABLE List
    (seq_nbr INTEGER NOT NULL UNIQUE,
     val INTEGER NOT NULL UNIQUE);

    INSERT INTO List
    VALUES (1, 99), (2, 10), (3, 11), (4, 12), (5, 13), (6, 14), (7, 0);

You can find subsequences of size three that follow the rule, namely (10, 11, 12), (11, 12, 13), and (12, 13, 14), but the longest sequence is (10, 11, 12, 13, 14), and it is of size five.

A run is like a sequence, but the numbers do not have to be consecutive, just increasing and contiguous. For example, given the run {(1, 1), (2, 2), (3, 12), (4, 15), (5, 23)}, you can find subruns of size three: (1, 2, 12), (2, 12, 15), and (12, 15, 23).

A region is contiguous, and all the values are the same. For example, {(1, 1), (2, 0), (3, 0), (4, 0), (5, 25)} has a region of zeros that is three items long.

In procedural languages, you would simply sort the data and scan it. In SQL, you have to define everything in terms of sets and nested sets. Some of these queries can be done with the OLAP addition to SQL-99, but they are not yet common in SQL products.

24.1 Finding Subregions of Size (n)

This example is adapted from SQL and Its Applications (Lorie and Daudenarde 1991). You are given a table of theater seats:

    CREATE TABLE Theater
    (seat_nbr INTEGER NOT NULL PRIMARY KEY, -- sequencing number
     occupancy_status CHAR(1) NOT NULL -- values
       CONSTRAINT valid_occupancy_status
       CHECK (occupancy_status IN ('A', 'S')));

In this table, an occupancy_status code of 'A' means available, and 'S' means sold. Your problem is to write a query that will return the subregions of (n) consecutive seats still available. Assume for the moment that consecutive seat_nbrs mean that the seats are also physically consecutive, ignoring rows of seating where seat_nbr(n) and seat_nbr((n) + 1) might be on different physical theater rows.
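For comparison with the set-oriented queries that follow, here is a minimal procedural sketch of the "sort and scan" approach mentioned above, applied to the theater problem. The function name available_subregions and the seat data are hypothetical, chosen only for illustration.

```python
def available_subregions(seats, n):
    """Return (first_seat, last_seat) for every window of n consecutive available seats.

    seats is an iterable of (seat_nbr, occupancy_status) pairs,
    where 'A' means available and 'S' means sold.
    """
    result = []
    run_length = 0   # size of the current stretch of consecutive available seats
    prev_nbr = None  # seat number seen on the previous iteration
    for seat_nbr, status in sorted(seats):
        if status != "A":
            run_length = 0
        elif run_length > 0 and seat_nbr == prev_nbr + 1:
            run_length += 1
        else:
            run_length = 1
        prev_nbr = seat_nbr
        # Once the stretch is at least n long, the last n seats form a subregion.
        if run_length >= n:
            result.append((seat_nbr - n + 1, seat_nbr))
    return result


# Hypothetical theater: seat 4 is sold, every other seat is available.
theater = [(1, "A"), (2, "A"), (3, "A"), (4, "S"),
           (5, "A"), (6, "A"), (7, "A"), (8, "A")]
subregions = available_subregions(theater, 3)
print(subregions)
```

The scan depends on visiting the rows in seat order, which is exactly what a table does not give you; the SQL versions have to state the "consecutive" condition explicitly instead.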
For (n) = 3, we can write a self-JOIN query, thus:

    SELECT T1.seat_nbr, T2.seat_nbr, T3.seat_nbr
      FROM Theater AS T1, Theater AS T2, Theater AS T3
     WHERE T1.occupancy_status = 'A'
       AND T2.occupancy_status = 'A'
       AND T3.occupancy_status = 'A'
       AND T2.seat_nbr = T1.seat_nbr + 1
       AND T3.seat_nbr = T2.seat_nbr + 1;

The trouble with this answer is that it works only for (n = 3). This pattern can be extended for any (n), but what we really want is a generalized query where we can use (n) as a parameter to the query.

The solution given by Lorie and Daudenarde starts with a given seat_nbr and looks at all the available seats between it and ((n) - 1) seats further up. The real trick is switching from the English-language statement “All seats between here and there are available” to the passive-voice version, “Available is the occupancy_status of all the seats between here and there,” so that you can see the query.

    SELECT seat_nbr, ' thru ', (seat_nbr + (:n - 1))
      FROM Theater AS T1
     WHERE occupancy_status = 'A'
       AND 'A' = ALL (SELECT occupancy_status
                        FROM Theater AS T2
                       WHERE T2.seat_nbr > T1.seat_nbr
                         AND T2.seat_nbr <= T1.seat_nbr + (:n - 1));

Please notice that this returns subregions. That is, if seats (1, 2, 3, 4, 5) are available, this query will return (1, 2, 3), (2, 3, 4), and (3, 4, 5) as its result set.

24.2 Numbering Regions

Instead of looking for a region, we want to number the regions in the order in which they appear. For example, given a view or table with a payment history, we want to break it into groupings of behavior, such as whether the payments were on time or late.

    CREATE TABLE PaymentHistory
    (payment_nbr INTEGER NOT NULL PRIMARY KEY,
     paid_on_time CHAR(1) DEFAULT 'Y' NOT NULL
       CHECK(paid_on_time IN ('Y', 'N')));

    INSERT INTO PaymentHistory VALUES (1006, 'Y'), (1005, 'Y'), .

Date posted: 06/07/2014, 09:20
