Distributed Database Management Systems: Lecture 35. The main topics covered in this chapter include: query optimization and fragmented queries; joins replaced by semijoins; three major QO algorithms; distributed query processing algorithms;...
Distributed Database Management Systems
Lecture 35

In the previous lecture
• Query Optimization
• Centralized QO
  – Best access path
  – Join Processing
• QO in Distributed Environment

In this lecture
• Query Optimization
  – Fragmented Queries
  – Joins replaced by Semijoins
  – Three major QO algorithms

Semijoin based Algorithms
• Reduce the cost of join queries
• Semijoin is ……
• A join of two relations can be replaced by semijoins (SJ) of one or both relations
• So R ⋈_A S can be replaced by:
  – (R ⋉_A S) ⋈_A S
  – R ⋈_A (S ⋉_A R)
  – (R ⋉_A S) ⋈_A (S ⋉_A R)
• Which one?
• Need to estimate costs
• Same assumptions:
  – R at site 1, S at site 2
  – Size(R) < Size(S), so
  – π_A(S) → site 1
  – Site 1 computes R' = R ⋉_A S', where S' = π_A(S)
  – R' → site 2
  – Site 2 computes R' ⋈_A S
• Ignoring Tmsg, semijoin is better if
  – size(π_A(S)) + size(R ⋉_A S) < size(R)
• Join is better if …
• Semijoin is better if …
• SJ with more than two tables will be more complex
• Semijoin approach can be applied to each individual join; consider EMP ⋈ ASG ⋈ PROJ
• EMP ⋈ ASG ⋈ PROJ = EMP' ⋈ ASG' ⋈ PROJ, where
  – EMP' = EMP ⋉ ASG and
  – ASG' = ASG ⋉ PROJ
• Or, rather than EMP', use EMP'' = EMP ⋉ (ASG ⋉ PROJ)

1- Ship PROJ to site of ASG
2- Ship ASG to site of PROJ
3- Fetch ASG tuples as needed for each tuple of PROJ
4- Move both to a third site
Optimization involves costing each possibility
• That is it regarding the R* algorithm for distributed query optimization
• Let's review it

SDD-1 Algorithm
• System for Distributed Databases
• A non-commercial database
• Based on the Hill Climbing Algorithm
• Hill climbing uses no semijoins and assumes no replication or fragmentation
• Cost of transferring the result to the user site from the final result site is not considered
• Can minimize either total time or response time
• Inputs include
  – Query graph
  – Locations of relations
  – Relations' statistics
1- Do the initial local processing
2- Select the initial best plan (ES0)
  – Calculate the cost of moving all relations to a single site
  – Plan with the least cost is ES0
3- Split ES0 into ES1 and ES2
  – ES1: Sending one of the relations to the other site, where the relations are joined
  – ES2: Sending the result back to the site in ES0
4- Replace ES0 with ES1 and ES2 when cost(ES1) + cost(local join) + cost(ES2) < cost(ES0)
5- Recursively apply steps 3 and 4 on ES1 and ES2, until no improvement

• Example
• "Find the salaries of engineers working on CAD/CAM project"
• Involves EMP, PAY, PROJ and ASG
π_sal(PAY ⋈_title (EMP ⋈_eNo (ASG ⋈_pNo (σ_pName = 'CAD/CAM' (PROJ)))))

Relation   Size   Site
EMP        8      1
PAY        4      2
PROJ       1      3
ASG        10     4

Assume Tmsg = 0 and TTR = 1
Length of a tuple is 1, so size(R) = card(R)
• Considering only transfer costs
• Site 1
  – PAY → site 1 = 4
  – PROJ → site 1 = 1
  – ASG → site 1 = 10
  – Total = 15
• Cost for site 2 = 19
• Cost for site 3 = 22
• Cost for site 4 = 13
• So site 4 is our ES0
• Move all relations to site 4

Thanks

• Most systems use single SJs to reduce relation size

Distributed Query Processing Algorithms
• Three main representative algorithms are
  – Distributed INGRES Algorithm
  – R* Algorithm
  – SDD-1 Algorithm
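
The semijoin-versus-join comparison from the "Semijoin based Algorithms" slides can be tried on toy data. The following is only a minimal sketch: the relations R and S, their contents, and the helper names (project, semijoin, join, semijoin_beats_join) are hypothetical, and it assumes that every shipped tuple or projected value costs one unit and that Tmsg is ignored, as on the slide.

```python
# Hypothetical toy relations; each tuple (or projected value) counts as one unit of size.
R = [{"A": a, "r": f"r{a}"} for a in range(1, 11)]                  # 10 tuples, at site 1
S = [{"A": a, "s": f"s{a}-{i}"} for a in (2, 4) for i in range(6)]  # 12 tuples, at site 2


def project(rel, attr):
    """pi_attr(rel): the distinct values of the join attribute."""
    return {t[attr] for t in rel}


def semijoin(left, right, attr):
    """left semijoin right on attr: tuples of left that have a match in right."""
    keys = project(right, attr)
    return [t for t in left if t[attr] in keys]


def join(left, right, attr):
    """Nested-loop equijoin of left and right on attr."""
    return [{**l, **r} for l in left for r in right if l[attr] == r[attr]]


def semijoin_beats_join(r, s, attr):
    """Slide condition, ignoring Tmsg: size(pi_A(S)) + size(R semijoin S) < size(R)."""
    shipped_with_semijoin = len(project(s, attr)) + len(semijoin(r, s, attr))
    shipped_with_join = len(r)  # plain join strategy ships all of R to site 2
    return shipped_with_semijoin < shipped_with_join


# Site 1 reduces R using pi_A(S); site 2 then joins the reduced R' with S.
R_reduced = semijoin(R, S, "A")
result = join(R_reduced, S, "A")
```

With this data only two of R's ten tuples survive the reduction, so shipping π_A(S) and R' moves far less data than shipping all of R, and the final join is unchanged. If almost every tuple of R had a match in S, the same test would favour the plain join.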
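
Step 2 of the hill-climbing procedure (choosing ES0) and the replacement test of step 4 can be sketched in a few lines. This is an illustration under stated assumptions, not the SDD-1 implementation: the relation sizes 8, 4, 1, 10 and the site numbers 1 to 4 are the values consistent with the totals 15, 19, 22 and 13 in the example, with Tmsg = 0 and TTR = 1, and the function names are hypothetical.

```python
# Assumed example data: sizes and sites chosen to match the totals 15, 19, 22, 13.
sizes = {"EMP": 8, "PAY": 4, "PROJ": 1, "ASG": 10}
sites = {"EMP": 1, "PAY": 2, "PROJ": 3, "ASG": 4}
T_MSG, T_TR = 0, 1  # assumed: message cost ignored, unit cost per unit of data


def transfer_cost(size):
    """Cost of shipping one relation of the given size in a single message."""
    return T_MSG + T_TR * size


def gather_cost(target_site):
    """Cost of moving every relation not already at target_site to that site."""
    return sum(transfer_cost(sizes[rel])
               for rel, site in sites.items() if site != target_site)


def should_split(cost_es0, cost_es1, cost_local_join, cost_es2):
    """Step 4: replace ES0 by ES1 and ES2 only if the split plan is cheaper."""
    return cost_es1 + cost_local_join + cost_es2 < cost_es0


# Step 2: cost every candidate result site and keep the cheapest as ES0.
costs = {site: gather_cost(site) for site in sorted(set(sites.values()))}
es0_site = min(costs, key=costs.get)
```

Here costs maps each candidate site to its total transfer cost, and es0_site is the site whose plan becomes ES0. Steps 3 to 5 would then call should_split on candidate splits of that plan and recurse on ES1 and ES2 while the test keeps succeeding.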