Database Management Systems, Part 5

Evaluation of Relational Operators

have the same value in the join attribute, there is a repeated pattern of access on the inner relation; we can maximize the repetition by sorting the outer relation on the join attributes.

12.9 POINTS TO REVIEW

Queries are composed of a few basic operators whose implementation impacts performance. All queries need to retrieve tuples from one or more input relations. The alternative ways of retrieving tuples from a relation are called access paths. An index matches selection conditions in a query if the index can be used to only retrieve tuples that satisfy the selection conditions. The selectivity of an access path with respect to a query is the total number of pages retrieved using the access path for this query. (Section 12.1)

Consider a simple selection query of the form $\sigma_{R.attr\ op\ value}(R)$. If there is no index and the file is not sorted, the only access path is a file scan. If there is no index but the file is sorted, a binary search can find the first occurrence of a tuple in the query. If a B+ tree index matches the selection condition, the selectivity depends on whether the index is clustered or unclustered and the number of result tuples. Hash indexes can be used only for equality selections. (Section 12.2)

General selection conditions can be expressed in conjunctive normal form, where each conjunct consists of one or more terms. Conjuncts that contain ∨ are called disjunctive. A more complicated rule can be used to determine whether a general selection condition matches an index. There are several implementation options for general selections. (Section 12.3)

The projection operation can be implemented by sorting and duplicate elimination during the sorting step. Another, hash-based implementation first partitions the file according to a hash function on the output attributes. Two tuples that belong to different partitions are guaranteed not to be duplicates because they have different hash values.
In a subsequent step each partition is read into main memory and within-partition duplicates are eliminated. If an index contains all output attributes, tuples can be retrieved solely from the index. This technique is called an index-only scan. (Section 12.4)

Assume that we join relations R and S. In a nested loops join, the join condition is evaluated between each pair of tuples from R and S. A block nested loops join performs the pairing in a way that minimizes the number of disk accesses. An index nested loops join fetches only matching tuples from S for each tuple of R by using an index. A sort-merge join sorts R and S on the join attributes using an external merge sort and performs the pairing during the final merge step. A hash join first partitions R and S using a hash function on the join attributes. Only partitions with the same hash values need to be joined in a subsequent step. A hybrid hash join extends the basic hash join algorithm by making more efficient use of main memory if more buffer pages are available. Since a join is a very expensive, but common operation, its implementation can have great impact on overall system performance. The choice of the join implementation depends on the number of buffer pages available and the sizes of R and S. (Section 12.5)

The set operations R ∩ S, R × S, R ∪ S, and R − S can be implemented using sorting or hashing. In sorting, R and S are first sorted and the set operation is performed during a subsequent merge step. In a hash-based implementation, R and S are first partitioned according to a hash function. The set operation is performed when processing corresponding partitions. (Section 12.6)

Aggregation can be performed by maintaining running information about the tuples. Aggregation with grouping can be implemented using either sorting or hashing with the grouping attribute determining the partitions.
If an index contains sufficient information for either simple aggregation or aggregation with grouping, index-only plans that do not access the actual tuples are possible. (Section 12.7)

The number of buffer pool pages available (influenced by the number of operators being evaluated concurrently) and their effective use has great impact on the performance of implementations of relational operators. If an operation has a regular pattern of page accesses, choice of a good buffer pool replacement policy can influence overall performance. (Section 12.8)

EXERCISES

Exercise 12.1 Briefly answer the following questions:

1. Consider the three basic techniques, iteration, indexing, and partitioning, and the relational algebra operators selection, projection, and join. For each technique-operator pair, describe an algorithm based on the technique for evaluating the operator.
2. Define the term most selective access path for a query.
3. Describe conjunctive normal form, and explain why it is important in the context of relational query evaluation.
4. When does a general selection condition match an index? What is a primary term in a selection condition with respect to a given index?
5. How does hybrid hash join improve upon the basic hash join algorithm?
6. Discuss the pros and cons of hash join, sort-merge join, and block nested loops join.
7. If the join condition is not equality, can you use sort-merge join? Can you use hash join? Can you use index nested loops join? Can you use block nested loops join?
8. Describe how to evaluate a grouping query with aggregation operator MAX using a sorting-based approach.
9. Suppose that you are building a DBMS and want to add a new aggregate operator called SECOND LARGEST, which is a variation of the MAX operator. Describe how you would implement it.
10. Give an example of how buffer replacement policies can affect the performance of a join algorithm.

Exercise 12.2 Consider a relation R(a,b,c,d,e) containing 5,000,000 records, where each data page of the relation holds 10 records. R is organized as a sorted file with dense secondary indexes. Assume that R.a is a candidate key for R, with values lying in the range 0 to 4,999,999, and that R is stored in R.a order. For each of the following relational algebra queries, state which of the following three approaches is most likely to be the cheapest:

Access the sorted file for R directly.
Use a (clustered) B+ tree index on attribute R.a.
Use a linear hashed index on attribute R.a.

1. $\sigma_{a<50,000}(R)$
2. $\sigma_{a=50,000}(R)$
3. $\sigma_{a>50,000 \wedge a<50,010}(R)$
4. $\sigma_{a \neq 50,000}(R)$

Exercise 12.3 Consider processing the following SQL projection query:

SELECT DISTINCT E.title, E.ename
FROM Executives E

You are given the following information:

Executives has attributes ename, title, dname, and address; all are string fields of the same length.
The ename attribute is a candidate key.
The relation contains 10,000 pages.
There are 10 buffer pages.

Consider the optimized version of the sorting-based projection algorithm: The initial sorting pass reads the input relation and creates sorted runs of tuples containing only attributes ename and title. Subsequent merging passes eliminate duplicates while merging the initial runs to obtain a single sorted result (as opposed to doing a separate pass to eliminate duplicates from a sorted result containing duplicates).

1. How many sorted runs are produced in the first pass? What is the average length of these runs? (Assume that memory is utilized well and that any available optimization to increase run size is used.) What is the I/O cost of this sorting pass?
2. How many additional merge passes will be required to compute the final result of the projection query? What is the I/O cost of these additional passes?
3. (a) Suppose that a clustered B+ tree index on title is available. Is this index likely to offer a cheaper alternative to sorting?
Would your answer change if the index were unclustered? Would your answer change if the index were a hash index?
(b) Suppose that a clustered B+ tree index on ename is available. Is this index likely to offer a cheaper alternative to sorting? Would your answer change if the index were unclustered? Would your answer change if the index were a hash index?
(c) Suppose that a clustered B+ tree index on ename, title is available. Is this index likely to offer a cheaper alternative to sorting? Would your answer change if the index were unclustered? Would your answer change if the index were a hash index?
4. Suppose that the query is as follows:

SELECT E.title, E.ename
FROM Executives E

That is, you are not required to do duplicate elimination. How would your answers to the previous questions change?

Exercise 12.4 Consider the join $R \bowtie_{R.a=S.b} S$, given the following information about the relations to be joined. The cost metric is the number of page I/Os unless otherwise noted, and the cost of writing out the result should be uniformly ignored.

Relation R contains 10,000 tuples and has 10 tuples per page.
Relation S contains 2,000 tuples and also has 10 tuples per page.
Attribute b of relation S is the primary key for S.
Both relations are stored as simple heap files.
Neither relation has any indexes built on it.
52 buffer pages are available.

1. What is the cost of joining R and S using a page-oriented simple nested loops join? What is the minimum number of buffer pages required for this cost to remain unchanged?
2. What is the cost of joining R and S using a block nested loops join? What is the minimum number of buffer pages required for this cost to remain unchanged?
3. What is the cost of joining R and S using a sort-merge join? What is the minimum number of buffer pages required for this cost to remain unchanged?
4. What is the cost of joining R and S using a hash join?
What is the minimum number of buffer pages required for this cost to remain unchanged?
5. What would be the lowest possible I/O cost for joining R and S using any join algorithm, and how much buffer space would be needed to achieve this cost? Explain briefly.
6. How many tuples will the join of R and S produce, at most, and how many pages would be required to store the result of the join back on disk?
7. Would your answers to any of the previous questions in this exercise change if you are told that R.a is a foreign key that refers to S.b?

Exercise 12.5 Consider the join of R and S described in Exercise 12.4.

1. With 52 buffer pages, if unclustered B+ indexes existed on R.a and S.b, would either provide a cheaper alternative for performing the join (using an index nested loops join) than a block nested loops join? Explain.
(a) Would your answer change if only five buffer pages were available?
(b) Would your answer change if S contained only 10 tuples instead of 2,000 tuples?
2. With 52 buffer pages, if clustered B+ indexes existed on R.a and S.b, would either provide a cheaper alternative for performing the join (using the index nested loops algorithm) than a block nested loops join? Explain.
(a) Would your answer change if only five buffer pages were available?
(b) Would your answer change if S contained only 10 tuples instead of 2,000 tuples?
3. If only 15 buffers were available, what would be the cost of a sort-merge join? What would be the cost of a hash join?
4. If the size of S were increased to also be 10,000 tuples, but only 15 buffer pages were available, what would be the cost of a sort-merge join? What would be the cost of a hash join?
5. If the size of S were increased to also be 10,000 tuples, and 52 buffer pages were available, what would be the cost of sort-merge join? What would be the cost of hash join?

Exercise 12.6 Answer each of the questions (if some question is inapplicable, explain why) in Exercise 12.4 again, but using the following information about R and S:

Relation R contains 200,000 tuples and has 20 tuples per page.
Relation S contains 4,000,000 tuples and also has 20 tuples per page.
Attribute a of relation R is the primary key for R.
Each tuple of R joins with exactly 20 tuples of S.
1,002 buffer pages are available.

Exercise 12.7 We described variations of the join operation called outer joins in Section 5.6.4. One approach to implementing an outer join operation is to first evaluate the corresponding (inner) join and then add additional tuples padded with null values to the result in accordance with the semantics of the given outer join operator. However, this requires us to compare the result of the inner join with the input relations to determine the additional tuples to be added. The cost of this comparison can be avoided by modifying the join algorithm to add these extra tuples to the result while input tuples are processed during the join. Consider the following join algorithms: block nested loops join, index nested loops join, sort-merge join, and hash join. Describe how you would modify each of these algorithms to compute the following operations on the Sailors and Reserves tables discussed in this chapter:

1. Sailors NATURAL LEFT OUTER JOIN Reserves
2. Sailors NATURAL RIGHT OUTER JOIN Reserves
3. Sailors NATURAL FULL OUTER JOIN Reserves

PROJECT-BASED EXERCISES

Exercise 12.8 (Note to instructors: Additional details must be provided if this exercise is assigned; see Appendix B.) Implement the various join algorithms described in this chapter in Minibase. (As additional exercises, you may want to implement selected algorithms for the other operators as well.)

BIBLIOGRAPHIC NOTES

The implementation techniques used for relational operators in System R are discussed in [88].
The implementation techniques used in PRTV, which utilized relational algebra transformations and a form of multiple-query optimization, are discussed in [303]. The techniques used for aggregate operations in Ingres are described in [209]. [275] is an excellent survey of algorithms for implementing relational operators and is recommended for further reading.

Hash-based techniques are investigated (and compared with sort-based techniques) in [93], [187], [276], and [588]. Duplicate elimination was discussed in [86]. [238] discusses secondary storage access patterns arising in join implementations. Parallel algorithms for implementing relational operations are discussed in [86, 141, 185, 189, 196, 251, 464].

13 INTRODUCTION TO QUERY OPTIMIZATION

This very remarkable man
Commends a most practical plan:
You can do what you want
If you don't think you can't,
So don't think you can't if you can.
Charles Inge

Consider a simple selection query asking for all reservations made by sailor Joe. As we saw in the previous chapter, there are many ways to evaluate even this simple query, each of which is superior in certain situations, and the DBMS must consider these alternatives and choose the one with the least estimated cost. Queries that consist of several operations have many more evaluation options, and finding a good plan represents a significant challenge.

A more detailed view of the query optimization and execution layer in the DBMS architecture presented in Section 1.8 is shown in Figure 13.1. Queries are parsed and then presented to a query optimizer, which is responsible for identifying an efficient execution plan for evaluating the query. The optimizer generates alternative plans and chooses the plan with the least estimated cost. To estimate the cost of a plan, the optimizer uses information in the system catalogs.
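
As a rough illustration of choosing the plan with the least estimated cost, the sketch below compares two candidate plans for joining Reserves and Sailors and keeps the cheaper one. The page counts (1,000 for Reserves, 500 for Sailors) are the ones this chapter assumes for its running example, and the cost formula for a page-oriented simple nested loops join (outer pages plus outer pages times inner pages) matches the arithmetic this text uses for that join. The names `Plan`, `nested_loops_cost`, and `choose_plan` are hypothetical; a real optimizer considers many more alternatives and takes its statistics from the system catalogs.

```python
from dataclasses import dataclass

# Page counts assumed for the chapter's running example.
PAGES = {"Reserves": 1000, "Sailors": 500}


@dataclass(frozen=True)
class Plan:
    """A join plan described only by its outer and inner relation (hypothetical)."""
    outer: str
    inner: str


def nested_loops_cost(plan: Plan) -> int:
    """Estimated page I/Os for a page-oriented simple nested loops join."""
    m, n = PAGES[plan.outer], PAGES[plan.inner]
    return m + m * n  # scan the outer once, scan the inner once per outer page


def choose_plan(plans):
    """Return the plan with the least estimated cost."""
    return min(plans, key=nested_loops_cost)


candidates = [Plan("Reserves", "Sailors"), Plan("Sailors", "Reserves")]
best = choose_plan(candidates)

for plan in candidates:
    print(plan, nested_loops_cost(plan))
print("chosen:", best)
```

Under these assumptions, putting the smaller relation on the outside is slightly cheaper, which is the kind of difference an optimizer is meant to detect before running anything.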

This chapter presents an overview of query optimization, some relevant background information, and a case study that illustrates and motivates query optimization. We discuss relational query optimizers in detail in Chapter 14.

Section 13.1 lays the foundation for our discussion. It introduces query evaluation plans, which are composed of relational operators; considers alternative techniques for passing results between relational operators in a plan; and describes an iterator interface that makes it easy to combine code for individual relational operators into an executable plan. In Section 13.2, we describe the system catalogs for a relational DBMS. The catalogs contain the information needed by the optimizer to choose between alternate plans for a given query. Since the costs of alternative plans for a given query can vary by orders of magnitude, the choice of query evaluation plan can have a dramatic impact on execution time. We illustrate the differences in cost between alternative plans through a detailed motivating example in Section 13.3.

[Figure 13.1 Query Parsing, Optimization, and Execution. Diagram labels: Query, Query Parser, Parsed query, Query Optimizer (Plan Generator, Plan Cost Estimator), Catalog Manager, Evaluation plan, Query Plan Evaluator.]

We will consider a number of example queries using the following schema:

Sailors(sid: integer, sname: string, rating: integer, age: real)
Reserves(sid: integer, bid: integer, day: dates, rname: string)

As in Chapter 12, we will assume that each tuple of Reserves is 40 bytes long, that a page can hold 100 Reserves tuples, and that we have 1,000 pages of such tuples. Similarly, we will assume that each tuple of Sailors is 50 bytes long, that a page can hold 80 Sailors tuples, and that we have 500 pages of such tuples.

13.1 OVERVIEW OF RELATIONAL QUERY OPTIMIZATION

The goal of a query optimizer is to find a good evaluation plan for a given query.
The space of plans considered by a typical relational query optimizer can be understood by recognizing that a query is essentially treated as a σ-π-× algebra expression, with the remaining operations (if any, in a given query) carried out on the result of the σ-π-× expression. Optimizing such a relational algebra expression involves two basic steps:

Enumerating alternative plans for evaluating the expression; typically, an optimizer considers a subset of all possible plans because the number of possible plans is very large.
Estimating the cost of each enumerated plan, and choosing the plan with the least estimated cost.

Commercial optimizers: Current RDBMS optimizers are complex pieces of software with many closely guarded details and typically represent 40 to 50 man-years of development effort!

In this section we lay the foundation for our discussion of query optimization by introducing evaluation plans. We conclude this section by highlighting IBM's System R optimizer, which influenced subsequent relational optimizers.

13.1.1 Query Evaluation Plans

A query evaluation plan (or simply plan) consists of an extended relational algebra tree, with additional annotations at each node indicating the access methods to use for each relation and the implementation method to use for each relational operator.

Consider the following SQL query:

SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid = S.sid AND R.bid = 100 AND S.rating > 5

This query can be expressed in relational algebra as follows:

$\pi_{sname}(\sigma_{bid=100 \wedge rating>5}(Reserves \bowtie_{sid=sid} Sailors))$

This expression is shown in the form of a tree in Figure 13.2. The algebra expression partially specifies how to evaluate the query: we first compute the natural join of Reserves and Sailors, then perform the selections, and finally project the sname field.
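
To make the evaluation order implied by the algebra expression concrete, here is a minimal Python sketch that applies the join, then the selections, then the projection to a handful of tuples. The sample rows and the function names `join`, `select`, and `project` are illustrative assumptions rather than anything taken from the text, and the nested loop is only the simplest way to compute the join.

```python
from collections import namedtuple

Sailor = namedtuple("Sailor", "sid sname rating age")
Reserve = namedtuple("Reserve", "sid bid day rname")

# Made-up sample data, for illustration only.
sailors = [
    Sailor(22, "Dustin", 7, 45.0),
    Sailor(31, "Lubber", 8, 55.5),
    Sailor(58, "Rusty", 3, 35.0),
]
reserves = [
    Reserve(22, 100, "10/10/96", "guppy"),
    Reserve(31, 101, "11/10/96", "yuppy"),
    Reserve(58, 100, "11/12/96", "dustin"),
]


def join(outer, inner):
    """Join on sid=sid by comparing every pair of tuples."""
    for r in outer:
        for s in inner:
            if r.sid == s.sid:
                yield r, s


def select(pairs):
    """Keep joined tuples with bid = 100 and rating > 5."""
    for r, s in pairs:
        if r.bid == 100 and s.rating > 5:
            yield r, s


def project(pairs):
    """Keep only the sname field."""
    for _, s in pairs:
        yield s.sname


# Evaluate bottom-up, in the order the expression specifies.
result = list(project(select(join(reserves, sailors))))
print(result)
```

Because each function is a generator, a joined tuple is filtered and projected as soon as it is produced, so the full join result is never held in memory at once.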

[Figure 13.2 Query Expressed as a Relational Algebra Tree. From the leaves up: Reserves and Sailors, joined on sid=sid; the selections bid=100 and rating > 5; the projection on sname.]

To obtain a fully specified evaluation plan, we must decide on an implementation for each of the algebra operations involved. For example, we can use a page-oriented simple nested loops join with Reserves as the outer relation and apply selections and projections to each tuple in the result of the join as it is produced; the result of the join before the selections and projections is never stored in its entirety. This query evaluation plan is shown in Figure 13.3.

[Figure 13.3 Query Evaluation Plan for Sample Query. The same tree annotated with methods: file scans of Reserves and Sailors, a simple nested loops join on sid=sid, and the selections and the projection applied on-the-fly.]

In drawing the query evaluation plan, we have used the convention that the outer relation is the left child of the join operator. We will adopt this convention henceforth.

13.1.2 Pipelined Evaluation

When a query is composed of several operators, the result of one operator is sometimes pipelined to another operator without creating a temporary relation to hold the intermediate result. The plan in Figure 13.3 pipelines the output of the join of Sailors and Reserves into the selections and projections that follow. Pipelining the output of an operator into the next operator saves the cost of writing out the intermediate result and reading it back in, and the cost savings can be significant. If the output of an operator is saved in a temporary relation for processing by the next operator, we say that the tuples are materialized. Pipelined evaluation has lower overhead costs than materialization and is chosen whenever the algorithm for the operator evaluation permits it.

There are many opportunities for pipelining in typical query plans, even simple plans that involve only selections. Consider a selection query in which only part of the selection condition matches an index.
We can think of such a query as containing two instances of the selection operator: The first contains the primary, or matching, part of the original selection condition, and the second contains the rest of the selection condition. We can evaluate such a query by applying the primary selection and writing the result to a temporary relation and then applying the second selection to the temporary relation. In contrast, a pipelined evaluation consists of applying the second selection to each tuple in the result of the primary selection as it is produced and adding tuples that qualify to the final result.

When the input relation to a unary [...]

... equiwidth and equidepth, respectively.

[Figure 14.4 Histograms Approximating Distribution D. Two histograms over the age values 0 to 14. Equiwidth bucket counts: Bucket 1 = 8, Bucket 2 = 4, Bucket 3 = 15, Bucket 4 = 3, Bucket 5 = 15. Equidepth bucket counts: Bucket 1 = 9, Bucket 2 = 10, Bucket 3 = 10, Bucket 4 = 7, Bucket 5 = 9.]

Consider...

... Consider the selection query age > 13 again and the first (equiwidth) histogram. We can estimate the size of the result to be 5 because the selected range includes a third of the range for Bucket 5. Since Bucket 5 represents a total of 15 tuples, the selected range corresponds to $\frac{1}{3} \ast 15 = 5$ tuples. As this example shows, we assume that the distribution within a histogram bucket is uniform. Thus, when we simply...

... which has 250 pages. The cost is 2 ∗ 4 ∗ 250 = 2,000 page I/Os. To merge the sorted versions of T1 and T2, we need to scan these relations, and the cost of this step is 10 + 250 = 260. The final projection is done on-the-fly, and by convention we ignore the cost of writing the final result. The total cost of the plan shown in Figure 13.6 is the sum of the cost of the selection (1,000 + 10 + 500 + 250 = 1,...
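
The equiwidth estimate for age > 13 in the histogram excerpt above can be reproduced with a short Python sketch. The bucket counts come from the excerpt; the inclusive bucket bounds (five equal buckets of three integer ages over 0 to 14) are an assumption inferred from the statement that the selected range covers a third of Bucket 5, and the function name `estimate_greater_than` is hypothetical.

```python
# Equiwidth histogram for age as (low, high, count) with inclusive bounds.
# Counts are from the excerpt; the bounds are an assumption (see lead-in).
EQUIWIDTH = [
    (0, 2, 8),
    (3, 5, 4),
    (6, 8, 15),
    (9, 11, 3),
    (12, 14, 15),
]


def estimate_greater_than(histogram, value):
    """Estimate how many tuples have an integer attribute greater than value.

    Within each bucket the distribution is assumed to be uniform, so a bucket
    contributes its count scaled by the fraction of its values that qualify.
    """
    estimate = 0.0
    for low, high, count in histogram:
        width = high - low + 1
        qualifying = max(0, high - max(low, value + 1) + 1)
        estimate += count * qualifying / width
    return estimate


estimate = estimate_greater_than(EQUIWIDTH, 13)
print(estimate)
```

Only Bucket 5 overlaps the selected range, and one of its three ages (14) qualifies, so the sketch arrives at the same figure of 5 tuples as the excerpt.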

... greater than 5, print the rating and the number of 20-year-old sailors with that rating, provided that there are at least two such sailors with different names.

The SQL version of this query is shown in Figure 14.5. Using the extended algebra ...

SELECT S.rating, COUNT (*)
FROM Sailors S
WHERE S.rating > 5 AND S.age = 20
GROUP BY S.rating
HAVING COUNT DISTINCT (S.sname) > 2

Figure 14.5 A Single-Relation...

... the low and high values for the age range (0 and 14 respectively) and the total count of all frequencies (which is 45 in our example).

[Figure 14.3 Uniform vs Nonuniform Distributions. Two panels over the age values 0 to 14: Distribution D, and the uniform distribution approximating D.]

Consider the selection age > 13. From the...

... the plan shown in Figure 13.3. The cost of the join is 1,000 + 1,000 ∗ 500 = 501,000 page I/Os. The selections and the projection are done on-the-fly and do not incur additional I/Os. Following the cost convention described in Section 12.1.2, we ignore the cost of writing out the final result. The total cost of this plan is therefore 501,000 page I/Os. This plan is admittedly naive; however, it is possible...

... that the number of pages in T1 is indeed 10. The cost of applying rating > 5 to Sailors is the cost of scanning Sailors (500 pages) plus the cost of writing out the result to a temporary relation, say T2. If we assume that ratings are uniformly distributed over the range 1 to 10, we can approximately estimate the size of T2 as 250 pages. To do a sort-merge join of T1 and T2, let us assume that a straightforward...

... choice of catalog relations and their schemas is not unique and is made by the implementor of the DBMS. Real systems vary in their catalog schema design, but the catalog is always implemented as a collection of relations, and it essentially describes all the data stored in the database.
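
One concrete system in which the catalog is itself a relation is SQLite, whose schema table `sqlite_master` can be queried like any other table. The sketch below uses Python's standard `sqlite3` module with an in-memory database. The column types chosen for Sailors and Reserves are assumptions made for illustration, and other systems lay out their catalogs differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Column types are assumptions; the text's schema uses integer, string, real, dates.
conn.execute("CREATE TABLE Sailors (sid INTEGER, sname TEXT, rating INTEGER, age REAL)")
conn.execute("CREATE TABLE Reserves (sid INTEGER, bid INTEGER, day TEXT, rname TEXT)")

# The catalog is queried with ordinary SQL, like any other relation.
catalog_rows = conn.execute(
    "SELECT type, name FROM sqlite_master ORDER BY name"
).fetchall()

# Per-column metadata; the second field of each returned row is the column name.
sailor_columns = [row[1] for row in conn.execute("PRAGMA table_info(Sailors)")]

print(catalog_rows)
print(sailor_columns)
conn.close()
```

Keeping the catalog in relations means the same query machinery that answers user queries can also answer questions about the schema.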

1 Some systems may store additional information in a non-relational form. For example, a system with a sophisticated...

... Reserves R, Sailors S
WHERE R.sid = S.sid AND R.bid = 100 AND S.rating > 5 AND R.day = '8/9/94'

A slight variant of the plan shown in Figure 13.7, designed to answer this query, is shown in Figure 13.8. The selection day='8/9/94' is applied on-the-fly to the result of the selection bid=100 on the Reserves relation.

[Figure 13.8, truncated in this preview. Surviving labels: sname, rating > 5, sid=sid, (On-the-fly), (Use hash index; do not write result to temp), Hash...]

... optimization? Why is it important?
2. Describe the advantages of pipelining.
3. Give an example in which pipelining cannot be used.
4. Describe the iterator interface and explain its advantages.
5. What role do statistics gathered from the database play in query optimization?
6. What information is stored in the system catalogs?
7. What are the benefits of making the system catalogs be relations?
8. What were the important...

Date posted: 08/08/2014, 18:22
