Distributed Database Management Systems: Lecture 30. The main topics covered in this chapter include: basic concepts of query optimization; QP in centralized and distributed DBs; query processor transforms complex queries into concise and simple ones;...
In the previous lecture
• Locking-based CC
• Timestamp-ordering-based CC
• Concluded TM

In this lecture
• Basic concepts of query optimization
• QP in centralized and distributed DBs

Introduction
• SQL is one of the success factors of RDBMSs
• The query processor transforms complex queries into concise and simple ones
• Query processing is a critical performance issue
• QP is a complex problem, especially in a DDBS environment
• The main function of QP is to transform an SQL query into an equivalent relational algebra query (a lower-level language)
• The transformation must achieve both correctness and efficiency
• Correctness is straightforward, since transformation rules exist
• An SQL query can have many equivalents in relational algebra

Example
• Consider the tables
  – EMP(eNo, eName, title)
  – ASG(eNo, pNo, resp, dur)
  – PROJ(pNo, pName, budget, loc)
• Query: get the names of employees who are managing a project
  SELECT eName FROM EMP, ASG WHERE EMP.eNo = ASG.eNo AND resp = 'Manager'
• Equivalent 1: Π_eName(σ_{resp='Manager' ∧ EMP.eNo=ASG.eNo}(EMP × ASG))
• Equivalent 2: Π_eName(EMP ⋈_eNo σ_{resp='Manager'}(ASG))
• The second one clearly needs fewer computing resources, since it avoids the Cartesian product

Distributed execution of the example (EMP is fragmented into EMP1 and EMP2, ASG into ASG1 and ASG2; each fragment is at its own site)
• Strategy A
  – At the ASG sites: ASG1' = σ_{resp='Manager'}(ASG1) and ASG2' = σ_{resp='Manager'}(ASG2)
  – At the EMP sites: EMP1' = EMP1 ⋈_eNo ASG1' and EMP2' = EMP2 ⋈_eNo ASG2'
  – At the result site: result = EMP1' ∪ EMP2'
• Strategy B
  – EMP1, EMP2, ASG1 and ASG2 are all shipped to the result site
  – At the result site: result = (EMP1 ∪ EMP2) ⋈_eNo σ_{resp='Manager'}(ASG1 ∪ ASG2)

Let's assume
• size(EMP) = 400
• size(ASG) = 1000
• Tuple access cost = 1 unit
• Tuple transfer cost = 10 units
• There are 20 managers
• Data is distributed evenly across all sites

Cost of Strategy A
• Produce ASG': 20 * 1 = 20
• Transfer ASG' to the sites of EMP: 20 * 10 = 200
• Produce EMP': (10 + 10) * 1 * 2 = 40
• Transfer EMP' to the result site: 20 * 10 = 200
• Total: 460

Cost of Strategy B
• Transfer EMP to site 5: 400 * 10 = 4000
• Transfer ASG to site 5: 1000 * 10 = 10000
• Produce ASG' by selecting ASG: 1000
• Join EMP and ASG': 8000
• Total: 23000

Query Optimization
• An important aspect of QP
• Minimize resource consumption: I/O cost + CPU cost + communication cost
• Only the first two apply in a centralized DB
• Communication cost dominates in a WAN
• It is less dominant in LANs, so the total cost should be considered there
• QO can also aim to maximize throughput

Operators' Complexity
• Select, Project (without duplicate elimination): O(n)
• Project (with duplicate elimination), Group: O(n log n)
• Join, Semi-Join, Division, Set Operators: O(n log n)
• Cartesian Product: O(n²)

Characterization of Query Processors
• Types of optimization
  – Exhaustive search: compute the cost of each strategy to find the optimal one
  – May be very costly when there are many options and many fragments
  – Heuristics
• Optimization timing
  – Static: during compilation
      • Size of intermediate tables is not always known
      • Cost is justified by repeated execution
  – Dynamic: during execution
      • Sizes of intermediate tables are known
      • Re-optimization may be required
• Statistics
  – Relation/fragment: cardinality, size of a tuple, fraction of tuples participating in a join with another relation
  – Attribute: cardinality of the domain, actual number of distinct values
• Decision sites
  – Centralized: simple, but needs knowledge about the entire distributed database
  – Distributed: sites cooperate to determine the schedule; each needs only local information
  – Hybrid: one site determines the global schedule, and each site optimizes its local subqueries
• Other factors
  – Network topology
  – Replicated fragments
  – Use of semijoins

Layers of distributed query processing (slide figure)
• Input: SQL query on distributed relations
• QUERY DECOMPOSITION (uses the global schema) → algebraic query on distributed relations
• DATA LOCALIZATION (uses the fragment schema) → fragment query
• GLOBAL OPTIMIZATION (uses statistics on fragments) → optimized fragment query with communication operations
• LOCAL OPTIMIZATION (uses the local schema) → optimized local query
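The cost totals for the two distributed strategies in the EMP/ASG example can be reproduced with a few lines of arithmetic. The sketch below is only an illustration of the slide's cost model (1 unit per tuple access, 10 units per tuple transfer, 20 manager tuples). The function and constant names are made up for this example, and the breakdown of the 8000-unit join as 400 * 20 * 1 is an assumption (a nested-loop style join that probes the 20 selected ASG tuples for each EMP tuple); the slides give only the total.

```python
# Cost model from the slides (names are illustrative, not from the lecture).
TUPLE_ACCESS = 1      # cost units per tuple access
TUPLE_TRANSFER = 10   # cost units per tuple transferred between sites
SIZE_EMP = 400        # tuples in EMP
SIZE_ASG = 1000       # tuples in ASG
MANAGERS = 20         # ASG tuples with resp = 'Manager' (10 per ASG fragment)


def cost_select_then_join_at_sites():
    """Select at the ASG sites, join at the EMP sites, union at the result site."""
    produce_asg = MANAGERS * TUPLE_ACCESS            # 20 * 1
    transfer_asg = MANAGERS * TUPLE_TRANSFER         # 20 * 10
    produce_emp = (10 + 10) * TUPLE_ACCESS * 2       # (10 + 10) * 1 * 2
    transfer_emp = MANAGERS * TUPLE_TRANSFER         # 20 * 10
    return produce_asg + transfer_asg + produce_emp + transfer_emp


def cost_ship_everything_to_result_site():
    """Ship all of EMP and ASG to the result site, then select and join there."""
    transfer_emp = SIZE_EMP * TUPLE_TRANSFER         # 400 * 10
    transfer_asg = SIZE_ASG * TUPLE_TRANSFER         # 1000 * 10
    produce_asg = SIZE_ASG * TUPLE_ACCESS            # scan all of ASG
    # Assumption: each EMP tuple is compared with each of the 20 selected tuples.
    join = SIZE_EMP * MANAGERS * TUPLE_ACCESS        # 400 * 20 * 1
    return transfer_emp + transfer_asg + produce_asg + join


cost_a = cost_select_then_join_at_sites()
cost_b = cost_ship_everything_to_result_site()
print(cost_a, cost_b, cost_b / cost_a)  # 460 23000 50.0
```

Under these assumptions the second strategy is 50 times more expensive, and 14000 of its 23000 units are pure communication, which is why communication cost is the main target of distributed query optimization.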
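To see why the two relational algebra equivalents of the manager query return the same answer at very different cost, the following sketch evaluates both on a tiny in-memory data set. The sample tuples, names, and helper functions are hypothetical and exist only for illustration; the projection uses a Python set, so it behaves like projection with duplicate elimination.

```python
from itertools import product

# Hypothetical sample data: EMP(eNo, eName, title), ASG(eNo, pNo, resp, dur).
EMP = [("E1", "Ann", "Engineer"), ("E2", "Bob", "Analyst"), ("E3", "Cy", "Engineer")]
ASG = [
    ("E1", "P1", "Manager", 12),
    ("E2", "P1", "Analyst", 24),
    ("E2", "P2", "Analyst", 6),
    ("E3", "P3", "Manager", 10),
]


def via_cartesian_product(emp, asg):
    """Project eName from a selection over EMP x ASG."""
    pairs = list(product(emp, asg))  # materializes |EMP| * |ASG| pairs
    names = {e[1] for e, a in pairs if a[2] == "Manager" and e[0] == a[0]}
    return names, len(pairs)


def via_select_then_join(emp, asg):
    """Select the managers from ASG first, then join with EMP on eNo."""
    managers = [a for a in asg if a[2] == "Manager"]
    emp_by_eno = {e[0]: e for e in emp}  # eNo is assumed to be the key of EMP
    names = {emp_by_eno[a[0]][1] for a in managers if a[0] in emp_by_eno}
    return names, len(managers)


names_1, intermediate_1 = via_cartesian_product(EMP, ASG)
names_2, intermediate_2 = via_select_then_join(EMP, ASG)
print(sorted(names_1), intermediate_1)  # ['Ann', 'Cy'] 12
print(sorted(names_2), intermediate_2)  # ['Ann', 'Cy'] 2
```

Both functions return the same names, but the first builds 12 intermediate pairs while the second only carries the 2 selected ASG tuples into the join. The gap grows with table size, since the product is O(n²).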