Joe Celko's SQL for Smarties: Advanced SQL Programming

CHAPTER 19: PARTITIONING DATA IN QUERIES

19.5 FIFO and LIFO Subsets

    SELECT (tot_cost - ((tot_qty_on_hand - :order_qty_on_hand) * unit_price)) AS cost
      FROM LIFO AS L1
     WHERE stock_date
           = (SELECT MAX(stock_date)
                FROM LIFO AS L2
               WHERE tot_qty_on_hand >= :order_qty_on_hand);

This is straight algebra and a little logic. Find the most recent date on which we had enough (or more than enough) quantity on hand to meet the order. If, by dumb blind luck, there is a day when the quantity on hand exactly matched the order, return the total cost as the answer. If the order was for more than we have in stock, then return nothing. If we go back to a day when we had more in stock than the order was for, then look at the unit price on that day, multiply it by the overage, and subtract the result from the total cost.

Alternatively, you can use a derived table and a CASE expression. The CASE expression computes the cost of units that have a running total quantity less than the :order_qty_on_hand, and then it does algebra on the final block of inventory, which would put the running total over the limit. The outer query does a sum on these blocks.

    SELECT SUM(R3.v) AS cost
      FROM (SELECT R1.unit_price
                   * CASE WHEN SUM(R2.qty_on_hand) <= :order_qty_on_hand
                          THEN R1.qty_on_hand
                          ELSE :order_qty_on_hand
                               - (SUM(R2.qty_on_hand) - R1.qty_on_hand) END
              FROM InventoryReceipts AS R1, InventoryReceipts AS R2
             WHERE R1.purchase_date <= R2.purchase_date
             GROUP BY R1.purchase_date, R1.qty_on_hand, R1.unit_price
            HAVING (SUM(R2.qty_on_hand) - R1.qty_on_hand) <= :order_qty_on_hand)
           AS R3(v);

FIFO can be done with a similar VIEW or derived table:

    CREATE VIEW FIFO (stock_date, unit_price, tot_qty_on_hand, tot_cost)
    AS
    SELECT R1.purchase_date, R1.unit_price,
           SUM(R2.qty_on_hand), SUM(R2.qty_on_hand * R2.unit_price)
      FROM InventoryReceipts AS R1, InventoryReceipts AS R2
     WHERE R2.purchase_date <= R1.purchase_date
     GROUP BY R1.purchase_date, R1.unit_price;

With the corresponding query:

    SELECT (tot_cost - ((tot_qty_on_hand - :order_qty_on_hand) * unit_price)) AS cost
      FROM FIFO AS F1
     WHERE stock_date
           = (SELECT MIN(stock_date)
                FROM FIFO AS F2
               WHERE tot_qty_on_hand >= :order_qty_on_hand);

CHAPTER 20: GROUPING OPERATIONS

I am separating the partitions and grouping operations based on the idea that a group has group properties that we are trying to find, so we get an answer back for each group. A partition is simply a way of subsetting the original table so that we get a table back as a result.

20.1 GROUP BY Clause

The GROUP BY clause is based on simple partitions. A partition of a set divides the set into subsets such that the union of the subsets returns the original set, and the intersection of the subsets is empty. Think of it as cutting up a pizza pie: each piece of pepperoni belongs to one and only one slice of pizza. When you get to the section on SQL-99 OLAP extensions, you will see “variations on a theme” in the ROLLUP and CUBE operators, but this is where it all starts.
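The partition property can be checked on a small table. The following is only a sketch: it runs the SQL through Python's built-in sqlite3 module so that it can be executed as-is, and the Pepperoni table, its columns, and its rows are hypothetical names invented for the illustration.

```python
import sqlite3

# Hypothetical data: six pepperoni pieces, each lying on exactly one slice.
PIECES = [
    (1, "slice_a"), (2, "slice_a"),
    (3, "slice_b"),
    (4, "slice_c"), (5, "slice_c"), (6, "slice_c"),
]


def group_sizes(rows):
    """Return {slice_name: piece_count} as computed by GROUP BY."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE Pepperoni ("
        " piece_nbr INTEGER NOT NULL PRIMARY KEY,"
        " slice_name TEXT NOT NULL)"
    )
    con.executemany("INSERT INTO Pepperoni VALUES (?, ?)", rows)
    # One result row per group; the grouping column is constant within a group.
    sizes = dict(
        con.execute(
            "SELECT slice_name, COUNT(*) FROM Pepperoni GROUP BY slice_name"
        )
    )
    con.close()
    return sizes


sizes = group_sizes(PIECES)
# The subsets are disjoint and together rebuild the table,
# so the group counts must add up to the number of rows.
covers_table = sum(sizes.values()) == len(PIECES)
print(sizes, covers_table)
```

Because every row lands in exactly one group, the per-group counts always sum to the row count of the table that the FROM and WHERE clauses produced.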
The GROUP BY clause takes the result of the FROM and WHERE clauses, then puts the rows into groups defined as having the same values for the columns listed in the GROUP BY clause. Each group is reduced to a single row in the result table. This result table is called a grouped table, and all operations are now defined on groups rather than on rows.

By convention, the NULLs are treated as one group. The order of the grouping columns in the GROUP BY clause does not matter, but since all or some of the column names have to appear in the SELECT list, you should probably use the same order in both lists for readability. Please note that the SELECT column names might be a subset of the GROUP BY clause column names, but never the other way around.

Let us construct a sample table called Villes to explain in detail how this works. The table is declared as:

    CREATE TABLE Villes
    (state_code CHAR(2) NOT NULL,  -- USPS codes
     city_name CHAR(25) NOT NULL,
     PRIMARY KEY (city_name, state_code));

We populate it with the names of cities that end in “-ville” in each state. The first problem is to find a count of the number of such cities by state_code. The immediate naïve query might be:

    SELECT state_code, city_name, COUNT(*)
      FROM Villes
     GROUP BY state_code;

The group for Tennessee would have the rows ('TN', 'Nashville') and ('TN', 'Knoxville'). The first position in the result is the grouping column, which has to be constant within the group. The third column in the SELECT clause is the COUNT(*) for the group, which is clearly two.

The city_name column is a problem. Since the table is grouped by states, there can be at most 50 groups, one for each state_code. The COUNT(*) is clearly a single value, and it applies to the group as a whole. But what possible single value could I output for a city_name in each group? Pick a typical city_name and use it? If all the cities have the same name, use that name; otherwise, output a NULL?
The worst possible choice would be to output both rows with the COUNT(*) of 2, since each row would imply that there are two cities named Nashville and two cities named Knoxville in Tennessee. Each row represents a single group, so anything in it must be a characteristic of the group, not of a single row in the group. This is why there is a rule that the SELECT list must be made up only of grouping columns with optional aggregate function expressions.

20.1.1 NULLs and Groups

SQL puts the NULLs into a single group, as if they were all equal. The other option, which was used in some of the first SQL implementations before the standard, was to put each NULL into a group by itself. That is not an unreasonable choice. But to make a meaningful choice between the two options, you would have to know the semantics of the data you are trying to model. SQL is a language based on syntax, not semantics.

For example, if a NULL is being used for a missing diagnosis in a medical record, you know that each patient will probably have a different disease when the NULLs are resolved. Putting the NULLs in one group would make sense if you wanted to consider unprocessed diagnosis reports as one group in a summary. Putting each NULL in its own group would make sense if you wanted to consider each unprocessed diagnosis report as an action item for treatment of the relevant class of diseases. Another example was a traffic ticket database that used NULL for a missing auto tag. Obviously, there is more than one car without a tag in the database.

The general scheme for getting separate groups for each NULL is straightforward:

    SELECT x, ...
      FROM Table1
     WHERE x IS NOT NULL
     GROUP BY x
    UNION ALL
    SELECT x, ...
      FROM Table1
     WHERE x IS NULL;

There will also be cases, such as the traffic tickets, where you can use another GROUP BY clause to form groups where the principal grouping columns are NULL.
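The two treatments of NULLs can be compared side by side. This is a minimal sketch, assuming SQLite (driven through Python's sqlite3 module) as a stand-in for a standard-conforming engine; the Tickets table and its rows are hypothetical, and COUNT(*) and the constant 1 are just one possible choice for the expressions elided in the scheme above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical traffic-ticket table; tag is NULL when the car had no plate.
con.execute(
    "CREATE TABLE Tickets ("
    " ticket_nbr INTEGER NOT NULL PRIMARY KEY,"
    " tag TEXT)"
)
con.executemany(
    "INSERT INTO Tickets VALUES (?, ?)",
    [(1, "ABC123"), (2, "ABC123"), (3, None), (4, None), (5, "XYZ789")],
)

# Standard behavior: every NULL tag falls into one shared group.
one_group = con.execute(
    "SELECT tag, COUNT(*) FROM Tickets GROUP BY tag"
).fetchall()

# UNION ALL scheme: group the known tags, then append each
# NULL-tagged row as a group of its own with a count of one.
separate = con.execute(
    """
    SELECT tag, COUNT(*) FROM Tickets WHERE tag IS NOT NULL GROUP BY tag
    UNION ALL
    SELECT tag, 1 FROM Tickets WHERE tag IS NULL
    """
).fetchall()

print(len(one_group), "groups with NULLs pooled")
print(len(separate), "groups with one group per NULL")
```

When the rows with a NULL key carry some other identifying column, that column can replace the second branch of the UNION ALL as a grouping key of its own.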
For example, the VIN (Vehicle Identification Number) is taken when the car is missing a tag, and it would provide a grouping column.

20.2 GROUP BY and HAVING

One of the biggest problems in working with the GROUP BY clause lies in not understanding how the WHERE and HAVING clauses work. Consider this query to find all departments with fewer than five programmers:

    SELECT dept_nbr
      FROM Personnel
     WHERE job_title = 'Programmer'
     GROUP BY dept_nbr
    HAVING COUNT(*) < 5;

The result of this query does not have a row for any department with no programmers. The WHERE clause is applied first, so employees whose job title is not 'Programmer' are never passed to the GROUP BY clause. You have missed data that you might want to trap.

The next query will also pick up those departments that have no programmers, because the COUNT(DISTINCT x) function will return a zero for an empty set.

    SELECT DISTINCT dept_nbr
      FROM Personnel AS P1
     WHERE 5 > (SELECT COUNT(DISTINCT P2.emp_nbr)
                  FROM Personnel AS P2
                 WHERE P1.dept_nbr = P2.dept_nbr
                   AND P2.job_title = 'Programmer');

If there is no GROUP BY clause, the HAVING clause will treat the entire table as a single group. Many early implementations of SQL required that the HAVING clause belong to a GROUP BY clause, so you might see old code written under that assumption.

Since the HAVING clause applies only to the rows of a grouped table, it can reference only the grouping columns and aggregate functions that apply to the group. That is why this query would fail:

    SELECT dept_nbr  -- Invalid Query!
      FROM Personnel
     GROUP BY dept_nbr
    HAVING COUNT(*) < 5
       AND job_title = 'Programmer';

When the HAVING clause is executed, job_title is not in the grouped table as a column; it is a property of a row, not of a group. Likewise, this query would fail for much the same reason:

    SELECT dept_nbr  -- Invalid Query!
      FROM Personnel
     WHERE COUNT(*) < 5
       AND job_title = 'Programmer'
     GROUP BY dept_nbr;

The COUNT(*) does not exist until after the departmental groups are formed.

20.2.1 Group Characteristics and the HAVING Clause

You can use the aggregate functions and the HAVING clause to determine certain characteristics of the groups formed by the GROUP BY clause. For example, given a simple grouped table with three columns:

    SELECT col1, col2
      FROM Foobar
     GROUP BY col1, col2
    HAVING ...;

You can determine the following properties of the groups with these HAVING clauses:

    HAVING COUNT(DISTINCT col_x) = COUNT(col_x)
col_x has all distinct values

    HAVING COUNT(*) = COUNT(col_x)
there are no NULLs in the column

    HAVING MIN(col_x - <const>) = -MAX(col_x - <const>)
col_x deviates above and below const by the same amount

    HAVING MIN(col_x) * MAX(col_x) < 0
either MIN or MAX is negative, not both

    HAVING MIN(col_x) * MAX(col_x) > 0
col_x is either all positive or all negative

    HAVING MIN(SIGN(col_x)) = MAX(SIGN(col_x))
col_x is all positive, all negative, or all zero

    HAVING MIN(ABS(col_x)) = 0
col_x has at least one zero

    HAVING MIN(ABS(col_x)) = MIN(col_x)
col_x >= 0 (although the WHERE clause can handle this, too)

    HAVING MIN(col_x) = -MAX(col_x)
col_x deviates above and below zero by the same amount

    HAVING MIN(col_x) * MAX(col_x) = 0
either one or both of MIN or MAX is zero

    HAVING MIN(col_x) < MAX(col_x)
col_x has more than one value (may be faster than COUNT(*) > 1)

    HAVING MIN(col_x) = MAX(col_x)
col_x has one value or NULLs

    HAVING (MAX(seq) - MIN(seq) + 1) = COUNT(seq)
the sequential numbers in seq have no gaps

Tom Moreau contributed most of these suggestions.

Let me remind you again that if there is no GROUP BY clause, the HAVING clause will treat the entire table as a single group. This means that if you wish to apply one of the tests given above to the whole table, you will need to use a constant in the SELECT list.
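A few of the group tests listed above can be tried on a small grouped table. This is a sketch only: the Readings table, its grp column, its sample rows, and the groups_having helper are all hypothetical, and SQLite (through Python's sqlite3 module) stands in for whatever SQL product you use.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical table: grp is the grouping column, col_x the tested column.
con.execute("CREATE TABLE Readings (grp TEXT NOT NULL, col_x INTEGER)")
con.executemany(
    "INSERT INTO Readings VALUES (?, ?)",
    [
        ("a", 1), ("a", 2), ("a", 3),   # distinct, no gaps
        ("b", 1), ("b", 2), ("b", 4),   # distinct, gap at 3
        ("c", 5), ("c", 5),             # duplicated single value
        ("d", -2), ("d", 2),            # symmetric around zero
        ("e", 7), ("e", None),          # one value plus a NULL
    ],
)


def groups_having(condition):
    """Return the grp values whose group satisfies the HAVING condition."""
    # The condition is a trusted literal from this script, never user input.
    sql = (
        "SELECT grp FROM Readings "
        f"GROUP BY grp HAVING {condition} ORDER BY grp"
    )
    return [row[0] for row in con.execute(sql)]


all_distinct = groups_having("COUNT(DISTINCT col_x) = COUNT(col_x)")
no_nulls = groups_having("COUNT(*) = COUNT(col_x)")
no_gaps = groups_having("(MAX(col_x) - MIN(col_x) + 1) = COUNT(col_x)")
symmetric = groups_having("MIN(col_x) = -MAX(col_x)")
single_value = groups_having("MIN(col_x) = MAX(col_x)")

for name, found in [
    ("all distinct", all_distinct),
    ("no NULLs", no_nulls),
    ("no gaps", no_gaps),
    ("symmetric around zero", symmetric),
    ("one value or NULLs", single_value),
]:
    print(name, found)
```

Group e passes the "no gaps" and "one value" tests even though it holds a NULL, because the aggregate functions skip NULLs; only the COUNT(*) comparison notices the missing value. Each predicate here is evaluated once per group, whereas without the GROUP BY clause it would be evaluated once against the table as a whole.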
This will be easier to see with an example. You are given a table with a column of unique sequential numbers that start at 1. When you go to insert a new row, you must use a sequence number that is not currently in the column; that is, fill the gaps. If there are no gaps, then and only then can you use the next highest integer in the sequence.

    CREATE TABLE Foobar
    (seq_nbr INTEGER NOT NULL PRIMARY KEY
             CHECK (seq_nbr > 0),
     junk CHAR(5) NOT NULL);

    INSERT INTO Foobar
    VALUES (1, 'Tom'), (2, 'Dick'), (4, 'Harry'), (5, 'Moe');

How do I find if I have any gaps?

    EXISTS (SELECT 'no gaps'
              FROM Foobar
            HAVING COUNT(*) = MAX(seq_nbr))

This predicate is TRUE only when there are no gaps: the numbers are unique positive integers, so the row count equals the largest number exactly when every value from 1 up to the maximum is present. You could not use “SELECT seq_nbr” because the column values will not be identical within the single group made from the table, so the subquery fails with a cardinality violation. Likewise, “SELECT *” fails because the asterisk is converted into a column name picked by the SQL engine. Here is the insertion statement:

    INSERT INTO Foobar (seq_nbr, junk)
    VALUES (CASE WHEN EXISTS  -- no gaps
                      (SELECT 'no gaps'
                         FROM Foobar
                       HAVING COUNT(*) = MAX(seq_nbr))
                 THEN (SELECT MAX(seq_nbr) FROM Foobar) + 1
                 ELSE (SELECT MIN(seq_nbr)  -- gaps
                         FROM Foobar
                        WHERE (seq_nbr - 1)
                              NOT IN (SELECT seq_nbr FROM Foobar)
                          AND seq_nbr > 1) - 1
            END,
            'Celko');

The ELSE clause has to handle a special situation when 1 is in the seq_nbr column, so that it does not return an illegal zero; the seq_nbr > 1 predicate keeps the row for 1 from being treated as the top of a gap. The only tricky part is waiting for the entire scalar subquery expression to compute before subtracting one; writing “MIN(seq_nbr - 1)” or “MIN(seq_nbr) - 1” in the SELECT list could disable the use of indexes in many SQL products.

20.3 Multiple Aggregation Levels

The rule in SQL is that you cannot nest aggregate functions, such as

    SELECT department, MIN(COUNT(items))  -- illegal syntax!
      FROM Foobar
     GROUP BY department;

The usual intent of this is to get multiple levels of aggregation; this example probably wanted the smallest count of items within each department.
But this makes no sense, because a department (i.e., a ...
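One common way to get a second level of aggregation in standard SQL is to compute the first level in a derived table and aggregate over it in the outer query. The sketch below assumes a hypothetical Stock table with an extra division column, so that there is a coarser level to take the minimum over; the table, its columns, and its rows are invented for the illustration, and SQLite (through Python's sqlite3 module) is used only so the example can be run.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical table: departments belong to divisions and stock items.
con.execute(
    "CREATE TABLE Stock ("
    " division TEXT NOT NULL,"
    " department TEXT NOT NULL,"
    " item TEXT NOT NULL,"
    " PRIMARY KEY (department, item))"
)
con.executemany(
    "INSERT INTO Stock VALUES (?, ?, ?)",
    [
        ("East", "Toys", "ball"), ("East", "Toys", "kite"),
        ("East", "Toys", "doll"), ("East", "Books", "atlas"),
        ("West", "Garden", "hose"), ("West", "Garden", "rake"),
        ("West", "Tools", "saw"), ("West", "Tools", "drill"),
        ("West", "Tools", "file"),
    ],
)

# Nesting the aggregates directly is rejected by the engine.
try:
    con.execute(
        "SELECT department, MIN(COUNT(item)) FROM Stock GROUP BY department"
    )
    nested_rejected = False
except sqlite3.Error:
    nested_rejected = True

# Level 1 (inner): item count per department.
# Level 2 (outer): smallest departmental count per division.
smallest = con.execute(
    """
    SELECT D.division, MIN(D.item_cnt)
      FROM (SELECT division, department, COUNT(item) AS item_cnt
              FROM Stock
             GROUP BY division, department) AS D
     GROUP BY D.division
     ORDER BY D.division
    """
).fetchall()

print(nested_rejected, smallest)
```

The inner query yields exactly one count per department, which is why a minimum only becomes meaningful once the outer query groups those counts at a coarser level.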

Date posted: 06/07/2014, 09:20
