Chapter 21 Query Processing

Chapter Objectives

In this chapter you will learn:
• The objectives of query processing and optimization.
• Static versus dynamic query optimization.
• How a query is decomposed and semantically analyzed.
• How to create a relational algebra tree to represent a query.
• The rules of equivalence for the relational algebra operations.
• How to apply heuristic transformation rules to improve the efficiency of a query.
• The types of database statistics required to estimate the cost of operations.
• The different strategies for implementing the relational algebra operations.
• How to evaluate the cost and size of the relational algebra operations.
• How pipelining can be used to improve the efficiency of queries.
• The difference between materialization and pipelining.
• The advantages of left-deep trees.
• Approaches for finding the optimal execution strategy.
• How Oracle handles query optimization.

When the relational model was first launched commercially, one of the major criticisms often cited was inadequate performance of queries. Since then, a significant amount of research has been devoted to developing highly efficient algorithms for processing queries. There are many ways in which a complex query can be performed, and one of the aims of query processing is to determine which one is the most cost effective.

In first generation network and hierarchical database systems, the low-level procedural query language is generally embedded in a high-level programming language such as COBOL, and it is the programmer’s responsibility to select the most appropriate execution strategy. In contrast, with declarative languages such as SQL, the user specifies what data is required rather than how it is to be retrieved. This relieves the user of the responsibility of determining, or even knowing, what constitutes a good execution strategy and makes the language more universally usable.
Additionally, giving the DBMS the responsibility for selecting the best strategy prevents users from choosing strategies that are known to be inefficient and gives the DBMS more control over system performance.

There are two main techniques for query optimization, although the two strategies are usually combined in practice. The first technique uses heuristic rules that order the operations in a query. The other technique compares different strategies based on their relative costs and selects the one that minimizes resource usage. Since disk access is slow compared with memory access, disk access tends to be the dominant cost in query processing for a centralized DBMS, and it is the one that we concentrate on exclusively in this chapter when providing cost estimates.

Structure of this Chapter

In Section 21.1 we provide an overview of query processing and examine the main phases of this activity. In Section 21.2 we examine the first phase of query processing, namely query decomposition, which transforms a high-level query into a relational algebra query and checks that it is syntactically and semantically correct. In Section 21.3 we examine the heuristic approach to query optimization, which orders the operations in a query using transformation rules that are known to generate good execution strategies. In Section 21.4 we discuss the cost estimation approach to query optimization, which compares different strategies based on their relative costs and selects the one that minimizes resource usage. In Section 21.5 we discuss pipelining, which is a technique that can be used to further improve the processing of queries. Pipelining allows several operations to be performed in a parallel way, rather than requiring one operation to be complete before another can start. We also discuss how a typical query processor may choose an optimal execution strategy. In the final section,
we briefly examine how Oracle performs query optimization.

In this chapter we concentrate on techniques for query processing and optimization in centralized relational DBMSs, being the area that has attracted most effort and the model that we focus on in this book. However, some of the techniques are generally applicable to other types of system that have a high-level interface. Later, in Section 23.7, we briefly examine query processing for distributed DBMSs. In Section 28.5 we see that some of the techniques we examine in this chapter may require further consideration for the Object-Relational DBMS, which supports queries containing user-defined types and user-defined functions.

The reader is expected to be familiar with the concepts covered in Section 4.1 on the relational algebra and Appendix C on file organizations. The examples in this chapter are drawn from the DreamHome case study described in Section 10.4 and Appendix A.

21.1 Overview of Query Processing

Query processing: The activities involved in parsing, validating, optimizing, and executing a query.

The aims of query processing are to transform a query written in a high-level language, typically SQL, into a correct and efficient execution strategy expressed in a low-level language (implementing the relational algebra), and to execute the strategy to retrieve the required data.

Query optimization: The activity of choosing an efficient execution strategy for processing a query.

An important aspect of query processing is query optimization. As there are many equivalent transformations of the same high-level query, the aim of query optimization is to choose the one that minimizes resource usage. Generally, we try to reduce the total execution time of the query, which is the sum of the execution times of all individual operations that make up the query (Selinger et al., 1979). However, resource usage may also be
viewed as the response time of the query, in which case we concentrate on maximizing the number of parallel operations (Valduriez and Gardarin, 1984). Since the problem is computationally intractable with a large number of relations, the strategy adopted is generally reduced to finding a near optimum solution (Ibaraki and Kameda, 1984).

Both methods of query optimization depend on database statistics to evaluate properly the different options that are available. The accuracy and currency of these statistics have a significant bearing on the efficiency of the execution strategy chosen. The statistics cover information about relations, attributes, and indexes. For example, the system catalog may store statistics giving the cardinality of relations, the number of distinct values for each attribute, and the number of levels in a multilevel index (see Appendix C.5.4). Keeping the statistics current can be problematic. If the DBMS updates the statistics every time a tuple is inserted, updated, or deleted, this would have a significant impact on performance during peak periods. An alternative, and generally preferable, approach is to update the statistics on a periodic basis, for example nightly, or whenever the system is idle. Another approach taken by some systems is to make it the users’ responsibility to indicate when the statistics are to be updated. We discuss database statistics in more detail in Section 21.4.1.

As an illustration of the effects of different processing strategies on resource usage, we start with an example.

Example 21.1 Comparison of different processing strategies

Find all Managers who work at a London branch.

We can write this query in SQL as:

    SELECT *
    FROM Staff s, Branch b
    WHERE s.branchNo = b.branchNo AND
          (s.position = 'Manager' AND b.city = 'London');

Three equivalent relational algebra queries corresponding to this SQL statement are:

(1) $\sigma_{(\text{position} = \text{'Manager'}) \land (\text{city} = \text{'London'}) \land (\text{Staff.branchNo} = \text{Branch.branchNo})}(\text{Staff} \times \text{Branch})$
(2) $\sigma_{(\text{position} = \text{'Manager'}) \land (\text{city} = \text{'London'})}(\text{Staff} \bowtie_{\text{Staff.branchNo} = \text{Branch.branchNo}} \text{Branch})$
(3) $(\sigma_{\text{position} = \text{'Manager'}}(\text{Staff})) \bowtie_{\text{Staff.branchNo} = \text{Branch.branchNo}} (\sigma_{\text{city} = \text{'London'}}(\text{Branch}))$

For the purposes of this example, we assume that there are 1000 tuples in Staff, 50 tuples in Branch, 50 Managers (one for each branch), and 5 London branches. We compare these three queries based on the number of disk accesses required. For simplicity, we assume that there are no indexes or sort keys on either relation, and that the results of any intermediate operations are stored on disk. The cost of the final write is ignored, as it is the same in each case. We further assume that tuples are accessed one at a time (although in practice disk accesses would be based on blocks, which would typically contain several tuples), and that main memory is large enough to process entire relations for each relational algebra operation.

The first query calculates the Cartesian product of Staff and Branch, which requires (1000 + 50) disk accesses to read the relations, and creates a relation with (1000 * 50) tuples. We then have to read each of these tuples again to test them against the selection predicate at a cost of another (1000 * 50) disk accesses, giving a total cost of:

    (1000 + 50) + 2*(1000 * 50) = 101 050 disk accesses

The second query joins Staff and Branch on the branch number branchNo, which again requires (1000 + 50) disk accesses to read each of the relations. We know that the join of the two relations has 1000 tuples, one for each member of staff (a member of staff can only work at one branch). Consequently, the Selection operation requires 1000 disk accesses to read the result of the join, giving a total cost of:

    2*1000 + (1000 + 50) = 3050 disk accesses

The final query first reads each Staff tuple to determine the Manager tuples, which requires 1000 disk accesses and produces a relation with 50 tuples. The
second Selection operation reads each Branch tuple to determine the London branches, which requires 50 disk accesses and produces a relation with 5 tuples. The final operation is the join of the reduced Staff and Branch relations, which requires (50 + 5) disk accesses, giving a total cost of:

    1000 + 2*50 + 5 + (50 + 5) = 1160 disk accesses

Clearly the third option is the best in this case, by a factor of 87:1. If we increased the number of tuples in Staff to 10 000 and the number of branches to 500, the improvement would be by a factor of approximately 870:1. Intuitively, we may have expected this as the Cartesian product and Join operations are much more expensive than the Selection operation, and the third option significantly reduces the size of the relations that are being joined together. We will see shortly that one of the fundamental strategies in query processing is to perform the unary operations, Selection and Projection, as early as possible, thereby reducing the operands of any subsequent binary operations.

Figure 21.1 Phases of query processing.

Query processing can be divided into four main phases: decomposition (consisting of parsing and validation), optimization, code generation, and execution, as illustrated in Figure 21.1. In Section 21.2 we briefly examine the first phase, decomposition, before turning our attention to the second phase, query optimization. To complete this overview, we briefly discuss when optimization may be performed.

Dynamic versus static optimization

There are two choices for when the first three phases of query processing can be carried out. One option is to dynamically carry out decomposition and optimization every time the query is run. The advantage of dynamic query optimization arises from the fact that all information required to select an optimum strategy is up to date. The disadvantages are that the performance of the
query is affected because the query has to be parsed, validated, and optimized before it can be executed. Further, it may be necessary to reduce the number of execution strategies to be analyzed to achieve an acceptable overhead, which may have the effect of selecting a less than optimum strategy.

The alternative option is static query optimization, where the query is parsed, validated, and optimized once. This approach is similar to the approach taken by a compiler for a programming language. An advantage of static optimization is that the runtime overhead of parsing, validating, and optimizing is removed.
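The disk-access arithmetic of Example 21.1 can be checked with a short sketch. This is an illustration under the example's stated cost model (tuples read one at a time, every intermediate result written to disk and read back, final write ignored), not a description of how any particular DBMS estimates cost. The function name `strategy_costs` and its parameter names are hypothetical, chosen only for this sketch. The scaled case assumes one Manager per branch and keeps the number of London branches at 5, an assumption that is consistent with the quoted factor of approximately 870:1.

```python
def strategy_costs(n_staff, n_branch, n_managers, n_london):
    """Return disk accesses for strategies (1), (2), (3) of Example 21.1.

    Assumed cost model: one disk access per tuple read or written,
    intermediate results are stored on disk, the final write is ignored.
    """
    read_both = n_staff + n_branch

    # (1) Cartesian product, then selection:
    # read both relations, write the product, read the product back.
    product_size = n_staff * n_branch
    cost1 = read_both + 2 * product_size

    # (2) Join, then selection:
    # the join yields one tuple per member of staff, written then read back.
    cost2 = read_both + 2 * n_staff

    # (3) Selections first, then join:
    # read Staff and write the Managers, read Branch and write the London
    # branches, then read both reduced relations for the join.
    cost3 = (n_staff + n_managers) + (n_branch + n_london) + (n_managers + n_london)

    return cost1, cost2, cost3


# Figures from the example: 1000 staff, 50 branches, 50 Managers, 5 London branches.
base = strategy_costs(1000, 50, 50, 5)
print(base)                       # (101050, 3050, 1160)
print(round(base[0] / base[2]))   # 87

# Scaled case: 10 000 staff and 500 branches (500 Managers, still 5 London branches).
scaled = strategy_costs(10_000, 500, 500, 5)
print(round(scaled[0] / scaled[2]))  # 870
```

Writing the three costs as functions of the relation sizes makes the source of the gap visible: strategy (1) grows with the product of the two cardinalities, whereas strategies (2) and (3) grow only with their sum.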