Joe Celko's SQL for Smarties - Advanced SQL Programming P32 ppt


CHAPTER 13: BETWEEN AND OVERLAPS PREDICATES

13.2 OVERLAPS Predicate

... celebration. A little algebra tells you that the length of an event is (Event.finish_date - Event.start_date + INTERVAL '1' DAY) and that the length of a guest's stay is (Guest.depart_date - Guest.arrival_date + INTERVAL '1' DAY). Let's do one of those timeline charts again:

Figure 13.3 Timeline Diagram.

What we want is the part of the Guests interval that is inside the Celebrations interval. Guests 1 and 2 spent only part of their time at the celebration; Guest 3 spent all of his time at the celebration and Guest 4 stayed even longer than the celebration. That interval is defined by the two points (GREATEST(arrival_date, start_date), LEAST(depart_date, finish_date)).

Where GREATEST() and LEAST() are not available, you can instead use the aggregate functions in SQL to build a VIEW on a VIEW, like this:

    CREATE VIEW Working (guest_name, celeb_name, entered, exited)
    AS SELECT GE.guest_name, GE.celeb_name, start_date, finish_date
         FROM GuestCelebrations AS GE, Celebrations AS E1
        WHERE E1.celeb_name = GE.celeb_name
       UNION
       SELECT GE.guest_name, GE.celeb_name, arrival_date, depart_date
         FROM GuestCelebrations AS GE, Guests AS G1
        WHERE G1.guest_name = GE.guest_name;

VIEW Working
guest_name       celeb_name            entered       exited
=================================================================
'Dorothy Gale'   'Apple Month'         '2005-02-01'  '2005-02-28'
'Dorothy Gale'   'Apple Month'         '2005-02-01'  '2005-11-01'
'Dorothy Gale'   'Garlic Festival'     '2005-02-01'  '2005-11-01'
'Dorothy Gale'   'Garlic Festival'     '2005-01-15'  '2005-02-15'
'Dorothy Gale'   'St. Fred's Day'      '2005-02-01'  '2005-11-01'
'Dorothy Gale'   'St. Fred's Day'      '2005-02-24'  '2005-02-24'
'Dorothy Gale'   'Year of the Prune'   '2005-02-01'  '2005-11-01'
'Dorothy Gale'   'Year of the Prune'   '2005-01-01'  '2005-12-31'
'Indiana Jones'  'Apple Month'         '2005-02-01'  '2005-02-01'
'Indiana Jones'  'Apple Month'         '2005-02-01'  '2005-02-28'
'Indiana Jones'  'Garlic Festival'     '2005-02-01'  '2005-02-01'
'Indiana Jones'  'Garlic Festival'     '2005-01-15'  '2005-02-15'
'Indiana Jones'  'Year of the Prune'   '2005-02-01'  '2005-02-01'
'Indiana Jones'  'Year of the Prune'   '2005-01-01'  '2005-12-31'
'Don Quixote'    'Apple Month'         '2005-02-01'  '2005-02-28'
'Don Quixote'    'Apple Month'         '2005-01-01'  '2005-10-01'
'Don Quixote'    'Garlic Festival'     '2005-01-01'  '2005-10-01'
'Don Quixote'    'Garlic Festival'     '2005-01-15'  '2005-02-15'
'Don Quixote'    'National Pear Week'  '2005-01-01'  '2005-01-07'
'Don Quixote'    'National Pear Week'  '2005-01-01'  '2005-10-01'
'Don Quixote'    'New Year's Day'      '2005-01-01'  '2005-01-01'
'Don Quixote'    'New Year's Day'      '2005-01-01'  '2005-10-01'
'Don Quixote'    'St. Fred's Day'      '2005-02-24'  '2005-02-24'
'Don Quixote'    'St. Fred's Day'      '2005-01-01'  '2005-10-01'
'Don Quixote'    'Year of the Prune'   '2005-01-01'  '2005-12-31'
'Don Quixote'    'Year of the Prune'   '2005-01-01'  '2005-10-01'
'James T. Kirk'  'Apple Month'         '2005-02-01'  '2005-02-28'
'James T. Kirk'  'Garlic Festival'     '2005-02-01'  '2005-02-28'
'James T. Kirk'  'Garlic Festival'     '2005-01-15'  '2005-02-15'
'James T. Kirk'  'St. Fred's Day'      '2005-02-01'  '2005-02-28'
'James T. Kirk'  'St. Fred's Day'      '2005-02-24'  '2005-02-24'
'James T. Kirk'  'Year of the Prune'   '2005-02-01'  '2005-02-28'
'James T. Kirk'  'Year of the Prune'   '2005-01-01'  '2005-12-31'
'Santa Claus'    'Christmas Season'    '2005-12-01'  '2005-12-25'
'Santa Claus'    'Year of the Prune'   '2005-12-01'  '2005-12-25'
'Santa Claus'    'Year of the Prune'   '2005-01-01'  '2005-12-31'

This will put the earliest and latest points in both intervals into one column.
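For comparison, on a product that does offer GREATEST()/LEAST()-style functions, the two end points can be computed directly. The following is a minimal sketch under stated assumptions: it uses SQLite driven from Python's sqlite3 module, where the multi-argument scalar MAX() and MIN() play the role of GREATEST() and LEAST(), and it stores dates as ISO-8601 text so that they compare correctly as strings. The column types are an assumption; the sample rows are Dorothy Gale's stay and two of the celebrations listed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Column types are assumed; dates are ISO-8601 text so string order = date order.
CREATE TABLE Guests (guest_name TEXT PRIMARY KEY, arrival_date TEXT, depart_date TEXT);
CREATE TABLE Celebrations (celeb_name TEXT PRIMARY KEY, start_date TEXT, finish_date TEXT);
CREATE TABLE GuestCelebrations (guest_name TEXT, celeb_name TEXT);

INSERT INTO Guests VALUES ('Dorothy Gale', '2005-02-01', '2005-11-01');
INSERT INTO Celebrations VALUES ('Garlic Festival', '2005-01-15', '2005-02-15');
INSERT INTO Celebrations VALUES ('Year of the Prune', '2005-01-01', '2005-12-31');
INSERT INTO GuestCelebrations VALUES ('Dorothy Gale', 'Garlic Festival');
INSERT INTO GuestCelebrations VALUES ('Dorothy Gale', 'Year of the Prune');
""")

# In SQLite, MAX(a, b) and MIN(a, b) with two arguments are scalar functions,
# not aggregates, so they act like GREATEST(a, b) and LEAST(a, b).
rows = conn.execute("""
SELECT GE.guest_name, GE.celeb_name,
       MAX(G1.arrival_date, E1.start_date) AS entered,
       MIN(G1.depart_date, E1.finish_date) AS exited
  FROM GuestCelebrations AS GE
  JOIN Guests AS G1 ON G1.guest_name = GE.guest_name
  JOIN Celebrations AS E1 ON E1.celeb_name = GE.celeb_name
 ORDER BY GE.celeb_name
""").fetchall()

for row in rows:
    print(row)
```

Each output row is the overlap of one stay with one celebration, which is the same pair of points the grouped MAX(entered) and MIN(exited) approach arrives at without the vendor functions.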
Now we can construct a VIEW like this:

    CREATE VIEW Attendees (guest_name, celeb_name, entered, exited)
    AS SELECT guest_name, celeb_name, MAX(entered), MIN(exited)
         FROM Working
        GROUP BY guest_name, celeb_name;

VIEW Attendees
guest_name       celeb_name            entered       exited
=================================================================
'Dorothy Gale'   'Apple Month'         '2005-02-01'  '2005-02-28'
'Dorothy Gale'   'Garlic Festival'     '2005-02-01'  '2005-02-15'
'Dorothy Gale'   'St. Fred's Day'      '2005-02-24'  '2005-02-24'
'Dorothy Gale'   'Year of the Prune'   '2005-02-01'  '2005-11-01'
'Indiana Jones'  'Apple Month'         '2005-02-01'  '2005-02-01'
'Indiana Jones'  'Garlic Festival'     '2005-02-01'  '2005-02-01'
'Indiana Jones'  'Year of the Prune'   '2005-02-01'  '2005-02-01'
'Don Quixote'    'Apple Month'         '2005-02-01'  '2005-02-28'
'Don Quixote'    'Garlic Festival'     '2005-01-15'  '2005-02-15'
'Don Quixote'    'National Pear Week'  '2005-01-01'  '2005-01-07'
'Don Quixote'    'New Year's Day'      '2005-01-01'  '2005-01-01'
'Don Quixote'    'St. Fred's Day'      '2005-02-24'  '2005-02-24'
'Don Quixote'    'Year of the Prune'   '2005-01-01'  '2005-10-01'
'James T. Kirk'  'Apple Month'         '2005-02-01'  '2005-02-28'
'James T. Kirk'  'Garlic Festival'     '2005-02-01'  '2005-02-15'
'James T. Kirk'  'St. Fred's Day'      '2005-02-24'  '2005-02-24'
'James T. Kirk'  'Year of the Prune'   '2005-02-01'  '2005-02-28'
'Santa Claus'    'Christmas Season'    '2005-12-01'  '2005-12-25'
'Santa Claus'    'Year of the Prune'   '2005-12-01'  '2005-12-25'

The Attendees VIEW can be used to compute the total number of room days for each celebration. Assume that the difference between two dates will return an integer that is the number of days between them:

    SELECT celeb_name,
           SUM(exited - entered + INTERVAL '1' DAY) AS roomdays
      FROM Attendees
     GROUP BY celeb_name;

Result
celeb_name            roomdays
==============================
'Apple Month'         85
'Christmas Season'    25
'Garlic Festival'     63
'National Pear Week'  7
'New Year's Day'      1
'St. Fred's Day'      3
'Year of the Prune'   602

If you would like to get a count of the room days sold in the month of January, you could use this query, which avoids a BETWEEN or OVERLAPS predicate completely:

    SELECT SUM(CASE WHEN depart_date > DATE '2005-01-31'
                    THEN DATE '2005-01-31'
                    ELSE depart_date END
             - CASE WHEN arrival_date < DATE '2005-01-01'
                    THEN DATE '2005-01-01'
                    ELSE arrival_date END
             + INTERVAL '1' DAY) AS room_days
      FROM Guests
     WHERE depart_date >= DATE '2005-01-01'
       AND arrival_date <= DATE '2005-01-31';

CHAPTER 14: THE [NOT] IN() PREDICATE

The IN() predicate is very natural. It takes a value and sees whether that value is in a list of comparable values. Standard SQL allows value expressions in the list, or for you to use a query to construct the list. The syntax is:

    <in predicate> ::=
        <row value constructor> [NOT] IN <in predicate value>

    <in predicate value> ::= <table subquery> | (<in value list>)

    <in value list> ::=
        <row value expression> { <comma> <row value expression> }...

The expression <row value constructor> NOT IN <in predicate value> has the same effect as NOT (<row value constructor> IN <in predicate value>). This pattern for the use of the keyword NOT is found in most of the other predicates.

The expression <row value constructor> IN <in predicate value> has, by definition, the same effect as <row value constructor> = ANY <in predicate value>. Most optimizers will recognize this and execute the same code for both expressions. This means that if the <in predicate value> is empty, such as one you would get from a subquery that returns no rows, the result is FALSE, because = ANY over an empty table is FALSE. If the <in predicate value> holds nothing but NULLs, whether as rows from a subquery or as an explicit list of NULLs, the results will be equivalent to (<row value constructor> = (NULL, ..., NULL)), which is always evaluated to UNKNOWN. Please remember this difference between an empty table and a table with rows of all NULLs.
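The three-valued behavior of IN() is easy to check on a real engine. The sketch below is an illustration under assumptions, not a statement about every product: it uses SQLite through Python's sqlite3 module, and the table names EmptyList and NullList are made up for the test. On that engine an IN() over an empty subquery comes back FALSE rather than UNKNOWN, while a subquery or an explicit list made up only of NULLs gives UNKNOWN, which Python reports as None.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical tables: one with no rows, one whose rows are all NULL.
CREATE TABLE EmptyList (x INTEGER);
CREATE TABLE NullList (x INTEGER);
INSERT INTO NullList VALUES (NULL), (NULL);
""")

def truth(sql):
    """Run a one-value query; SQLite reports TRUE as 1, FALSE as 0, UNKNOWN as None."""
    return conn.execute(sql).fetchone()[0]

in_empty = truth("SELECT 1 IN (SELECT x FROM EmptyList)")
not_in_empty = truth("SELECT 1 NOT IN (SELECT x FROM EmptyList)")
in_null_rows = truth("SELECT 1 IN (SELECT x FROM NullList)")
in_null_list = truth("SELECT 1 IN (NULL, NULL)")
not_in_null_list = truth("SELECT 1 NOT IN (NULL, NULL)")

print("IN empty subquery:    ", in_empty)
print("NOT IN empty subquery:", not_in_empty)
print("IN all-NULL subquery: ", in_null_rows)
print("IN (NULL, NULL):      ", in_null_list)
print("NOT IN (NULL, NULL):  ", not_in_null_list)
```

Note that NOT applied to UNKNOWN is still UNKNOWN, so the NOT IN() form over a list of NULLs does not flip to TRUE.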
IN() predicates with a subquery can sometimes be converted into EXISTS predicates, but there are some problems and differences in the predicates. The conversion to an EXISTS predicate is often a good way to improve performance, but it will not be as easy to read as the original IN() predicate. An EXISTS predicate can use indexes to find (or fail to find) a single value that confirms (or denies) the predicate, whereas the IN() predicate often has to build the results of the subquery in a working table.

14.1 Optimizing the IN() Predicate

Most database engines have no statistics about the relative frequency of the values in a list of constants, so they will scan them in the order in which they appear in the list. People like to order lists alphabetically or by magnitude, but it would be better to order the list from most frequently occurring values to least frequent. It is also pointless to have duplicate values in the constant list, since the predicate will return TRUE if it matches the first duplicate it finds, and never get to the second occurrence. Likewise, if the predicate is FALSE for that value, it wastes computer time to traverse a needlessly long list.

Many SQL engines perform an IN() predicate with a subquery by building the result set of the subquery first as a temporary working table, then scanning that result table from left to right. This can be expensive in many cases; for example, in a query to find employees in a city with a major sports team (we want them to get tickets for us), we could write (assuming that city names are unique):

    SELECT *
      FROM Personnel
     WHERE city_name IN (SELECT city_name FROM SportTeams);

But let us further assume that our personnel are located in (n) cities and the sports teams are in (m) cities, where (m) is much greater than (n).
If the matching cities appear near the front of the list generated by the subquery expression, it will perform much faster than if they appear at the end of the list. In the case of a subquery expression, you have no control over how the subquery is presented back in the containing query. However, you can order the expressions in a list in the order in which they are most likely to occur, such as:

    SELECT *
      FROM Personnel
     WHERE city_name IN ('New York', 'Chicago', 'Atlanta', ..., 'Austin');

Incidentally, Standard SQL allows row expression comparisons, so if you have a Standard SQL implementation with separate columns for the city and state, you could write:

    SELECT *
      FROM Personnel
     WHERE (city_name, state)
           IN (SELECT city_name, state FROM SportTeams);

Teradata did not get correlated subqueries until 1996, so they often used this syntax as a workaround. I am not sure if you should count them as being ahead or behind the technology for that.

Today, all major versions of SQL remove duplicates in the result table of the subquery, so you do not have to use a SELECT DISTINCT in the subquery. You might see this in legacy code.

A trick that can work for large lists on some products is to force the engine to construct a list ordered by frequency. This involves first constructing a VIEW that has an ORDER BY clause; this practice is not part of the SQL standard, which does not allow a VIEW to have an ORDER BY clause. For example, a paint company wants to find all the products offered by their competitors who use the same color as one of their products.
First construct a VIEW that orders the colors by frequency of appearance:

    CREATE VIEW PopColor (color, tally)
    AS SELECT color, COUNT(*) AS tally
         FROM Paints
        GROUP BY color
        ORDER BY tally DESC;

Then go to the Competitor data and do a simple column SELECT on the VIEW, thus:

    SELECT *
      FROM Competitor
     WHERE color IN (SELECT color FROM PopColor);

The VIEW is grouped, so it will be materialized in sort order. The subquery will then be executed and (we hope) the sort order will be maintained and passed along to the IN() predicate.

Another trick is to replace the IN() predicate with a JOIN operation. For example, you have a table of restaurant telephone numbers and a guidebook, and you want to pick out the four-star places, so you write this query:

    SELECT restaurant_name, phone_nbr
      FROM Restaurants
     WHERE restaurant_name
           IN (SELECT restaurant_name
                 FROM QualityGuide
                WHERE stars = 4);

If there is an index on QualityGuide.stars, the SQL engine will probably build a temporary table of the four-star places and pass it on to the outer query. The outer query will then handle it as if it were a list of constants. However, this is not the sort of column that you would normally index. Without an index on stars, the engine will simply do a sequential search of the QualityGuide table. This query can be replaced with a JOIN query, thus:

    SELECT Restaurants.restaurant_name, phone_nbr
      FROM Restaurants, QualityGuide
     WHERE stars = 4
       AND Restaurants.restaurant_name = QualityGuide.restaurant_name;

This query should run faster, since restaurant_name is a key for both tables and will be indexed to ensure uniqueness. However, this can return duplicate rows in the result table that you can handle with a SELECT DISTINCT.

Consider a more budget-minded query, where we want places with a meal that costs $10 or less, and the menu guidebook lists all the meals.
The query looks about the same:

    SELECT restaurant_name, phone_nbr
      FROM Restaurants
     WHERE restaurant_name
           IN (SELECT restaurant_name
                 FROM MenuGuide
                WHERE price <= 10.00);

And you would expect to be able to replace it with:

    SELECT Restaurants.restaurant_name, phone_nbr
      FROM Restaurants, MenuGuide
     WHERE price <= 10.00
       AND Restaurants.restaurant_name = MenuGuide.restaurant_name;

But every item in Murphy's Two-Dollar Hash House will get a line in the results of the JOINed version. This can be fixed by changing SELECT Restaurants.restaurant_name, phone_nbr to SELECT DISTINCT Restaurants.restaurant_name, phone_nbr, but it will cost more time to do a sort to remove the duplicates. There is no good general advice, except to experiment with your particular product.

The NOT IN() predicate is probably better replaced with a NOT EXISTS predicate. Using the restaurant example again, our friend John has a list of eateries and we want to see those that are not in the guidebook. The natural formation of the query is:

    SELECT *
      FROM JohnsBook
     WHERE restaurant_name
           NOT IN (SELECT restaurant_name
                     FROM QualityGuide);

But you can write the same query with a NOT EXISTS predicate and it will probably run faster:
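The excerpt stops before the rewritten query appears. One plausible form, offered as a sketch rather than as the author's text, is a correlated NOT EXISTS. The example below runs both versions in SQLite through Python's sqlite3 module; the correlation names J1 and Q1, the sample rows, and the lack of key constraints on the test tables are assumptions made for the demonstration. It says nothing about speed, which depends on the product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Test tables without key constraints (an assumption) so a NULL can be added later.
conn.execute("CREATE TABLE JohnsBook (restaurant_name TEXT)")
conn.execute("CREATE TABLE QualityGuide (restaurant_name TEXT, stars INTEGER)")
conn.executemany(
    "INSERT INTO JohnsBook VALUES (?)",
    [("Murphy's Two-Dollar Hash House",), ("Sample Diner",), ("Chez Hypothetical",)],
)
conn.execute("INSERT INTO QualityGuide VALUES ('Sample Diner', 4)")

NOT_IN = """
SELECT restaurant_name
  FROM JohnsBook
 WHERE restaurant_name NOT IN (SELECT restaurant_name FROM QualityGuide)
 ORDER BY restaurant_name
"""

# Hypothetical NOT EXISTS rewrite: the subquery is correlated on restaurant_name.
NOT_EXISTS = """
SELECT restaurant_name
  FROM JohnsBook AS J1
 WHERE NOT EXISTS (SELECT *
                     FROM QualityGuide AS Q1
                    WHERE Q1.restaurant_name = J1.restaurant_name)
 ORDER BY restaurant_name
"""

not_in_rows = conn.execute(NOT_IN).fetchall()
not_exists_rows = conn.execute(NOT_EXISTS).fetchall()
print(not_in_rows)
print(not_exists_rows)

# A single NULL in the subquery column makes the two forms diverge.
conn.execute("INSERT INTO QualityGuide VALUES (NULL, 3)")
not_in_with_null = conn.execute(NOT_IN).fetchall()
not_exists_with_null = conn.execute(NOT_EXISTS).fetchall()
print(not_in_with_null)
print(not_exists_with_null)
```

The two forms return the same rows only while QualityGuide.restaurant_name holds no NULLs. Once a NULL is present, NOT IN() evaluates to UNKNOWN for every unmatched name and returns nothing, whereas NOT EXISTS is unaffected, so the rewrite is safe when the column is a declared key, as the text assumes.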

Date posted: 06/07/2014, 09:20
