
High-Performance Parallel Database Processing and Grid Databases - P8


Structure

  • High-Performance Parallel Database Processing and Grid Databases

    • Contents

    • Preface

    • Part I Introduction

      • 1. Introduction

        • 1.1. A Brief Overview: Parallel Databases and Grid Databases

        • 1.2. Parallel Query Processing: Motivations

        • 1.3. Parallel Query Processing: Objectives

          • 1.3.1. Speed Up

          • 1.3.2. Scale Up

          • 1.3.3. Parallel Obstacles

        • 1.4. Forms of Parallelism

          • 1.4.1. Interquery Parallelism

          • 1.4.2. Intraquery Parallelism

          • 1.4.3. Intraoperation Parallelism

          • 1.4.4. Interoperation Parallelism

          • 1.4.5. Mixed Parallelism—A More Practical Solution

        • 1.5. Parallel Database Architectures

          • 1.5.1. Shared-Memory and Shared-Disk Architectures

          • 1.5.2. Shared-Nothing Architecture

          • 1.5.3. Shared-Something Architecture

          • 1.5.4. Interconnection Networks

        • 1.6. Grid Database Architecture

        • 1.7. Structure of this Book

        • 1.8. Summary

        • 1.9. Bibliographical Notes

        • 1.10. Exercises

      • 2. Analytical Models

        • 2.1. Cost Models

        • 2.2. Cost Notations

          • 2.2.1. Data Parameters

          • 2.2.2. Systems Parameters

          • 2.2.3. Query Parameters

          • 2.2.4. Time Unit Costs

          • 2.2.5. Communication Costs

        • 2.3. Skew Model

        • 2.4. Basic Operations in Parallel Databases

          • 2.4.1. Disk Operations

          • 2.4.2. Main Memory Operations

          • 2.4.3. Data Computation and Data Distribution

        • 2.5. Summary

        • 2.6. Bibliographical Notes

        • 2.7. Exercises

    • Part II Basic Query Parallelism

      • 3. Parallel Search

        • 3.1. Search Queries

          • 3.1.1. Exact-Match Search

          • 3.1.2. Range Search Query

          • 3.1.3. Multiattribute Search Query

        • 3.2. Data Partitioning

          • 3.2.1. Basic Data Partitioning

          • 3.2.2. Complex Data Partitioning

        • 3.3. Search Algorithms

          • 3.3.1. Serial Search Algorithms

          • 3.3.2. Parallel Search Algorithms

        • 3.4. Summary

        • 3.5. Bibliographical Notes

        • 3.6. Exercises

      • 4. Parallel Sort and GroupBy

        • 4.1. Sorting, Duplicate Removal, and Aggregate Queries

          • 4.1.1. Sorting and Duplicate Removal

          • 4.1.2. Scalar Aggregate

          • 4.1.3. GroupBy

        • 4.2. Serial External Sorting Method

        • 4.3. Algorithms for Parallel External Sort

          • 4.3.1. Parallel Merge-All Sort

          • 4.3.2. Parallel Binary-Merge Sort

          • 4.3.3. Parallel Redistribution Binary-Merge Sort

          • 4.3.4. Parallel Redistribution Merge-All Sort

          • 4.3.5. Parallel Partitioned Sort

        • 4.4. Parallel Algorithms for GroupBy Queries

          • 4.4.1. Traditional Methods (Merge-All and Hierarchical Merging)

          • 4.4.2. Two-Phase Method

          • 4.4.3. Redistribution Method

        • 4.5. Cost Models for Parallel Sort

          • 4.5.1. Cost Models for Serial External Merge-Sort

          • 4.5.2. Cost Models for Parallel Merge-All Sort

          • 4.5.3. Cost Models for Parallel Binary-Merge Sort

          • 4.5.4. Cost Models for Parallel Redistribution Binary-Merge Sort

          • 4.5.5. Cost Models for Parallel Redistribution Merge-All Sort

          • 4.5.6. Cost Models for Parallel Partitioned Sort

        • 4.6. Cost Models for Parallel GroupBy

          • 4.6.1. Cost Models for Parallel Two-Phase Method

          • 4.6.2. Cost Models for Parallel Redistribution Method

        • 4.7. Summary

        • 4.8. Bibliographical Notes

        • 4.9. Exercises

      • 5. Parallel Join

        • 5.1. Join Operations

        • 5.2. Serial Join Algorithms

          • 5.2.1. Nested-Loop Join Algorithm

          • 5.2.2. Sort-Merge Join Algorithm

          • 5.2.3. Hash-Based Join Algorithm

          • 5.2.4. Comparison

        • 5.3. Parallel Join Algorithms

          • 5.3.1. Divide and Broadcast-Based Parallel Join Algorithms

          • 5.3.2. Disjoint Partitioning-Based Parallel Join Algorithms

        • 5.4. Cost Models

          • 5.4.1. Cost Models for Divide and Broadcast

          • 5.4.2. Cost Models for Disjoint Partitioning

          • 5.4.3. Cost Models for Local Join

        • 5.5. Parallel Join Optimization

          • 5.5.1. Optimizing Main Memory

          • 5.5.2. Load Balancing

        • 5.6. Summary

        • 5.7. Bibliographical Notes

        • 5.8. Exercises

    • Part III Advanced Parallel Query Processing

      • 6. Parallel GroupBy-Join

        • 6.1. Groupby-Join Queries

          • 6.1.1. Groupby Before Join

          • 6.1.2. Groupby After Join

        • 6.2. Parallel Algorithms for Groupby-Before-Join Query Processing

          • 6.2.1. Early Distribution Scheme

          • 6.2.2. Early GroupBy with Partitioning Scheme

          • 6.2.3. Early GroupBy with Replication Scheme

        • 6.3. Parallel Algorithms for Groupby-After-Join Query Processing

          • 6.3.1. Join Partitioning Scheme

          • 6.3.2. GroupBy Partitioning Scheme

        • 6.4. Cost Model Notations

        • 6.5. Cost Model for Groupby-Before-Join Query Processing

          • 6.5.1. Cost Models for the Early Distribution Scheme

          • 6.5.2. Cost Models for the Early GroupBy with Partitioning Scheme

          • 6.5.3. Cost Models for the Early GroupBy with Replication Scheme

        • 6.6. Cost Model for “Groupby-After-Join” Query Processing

          • 6.6.1. Cost Models for the Join Partitioning Scheme

          • 6.6.2. Cost Models for the GroupBy Partitioning Scheme

        • 6.7. Summary

        • 6.8. Bibliographical Notes

        • 6.9. Exercises

      • 7. Parallel Indexing

        • 7.1. Parallel Indexing – An Internal Perspective on Parallel Indexing Structures

        • 7.2. Parallel Indexing Structures

          • 7.2.1. Nonreplicated Indexing (NRI) Structures

          • 7.2.2. Partially Replicated Indexing (PRI) Structures

          • 7.2.3. Fully Replicated Indexing (FRI) Structures

        • 7.3. Index Maintenance

          • 7.3.1. Maintaining a Parallel Nonreplicated Index

          • 7.3.2. Maintaining a Parallel Partially Replicated Index

          • 7.3.3. Maintaining a Parallel Fully Replicated Index

          • 7.3.4. Complexity Degree of Index Maintenance

        • 7.4. Index Storage Analysis

          • 7.4.1. Storage Cost Models for Uniprocessors

          • 7.4.2. Storage Cost Models for Parallel Processors

        • 7.5. Parallel Processing of Search Queries using Index

          • 7.5.1. Parallel One-Index Search Query Processing

          • 7.5.2. Parallel Multi-Index Search Query Processing

        • 7.6. Parallel Index Join Algorithms

          • 7.6.1. Parallel One-Index Join

          • 7.6.2. Parallel Two-Index Join

        • 7.7. Comparative Analysis

          • 7.7.1. Comparative Analysis of Parallel Search Index

          • 7.7.2. Comparative Analysis of Parallel Index Join

        • 7.8. Summary

        • 7.9. Bibliographical Notes

        • 7.10. Exercises

      • 8. Parallel Universal Qualification—Collection Join Queries

        • 8.1. Universal Quantification and Collection Join

        • 8.2. Collection Types and Collection Join Queries

          • 8.2.1. Collection-Equi Join Queries

          • 8.2.2. Collection-Intersect Join Queries

          • 8.2.3. Subcollection Join Queries

        • 8.3. Parallel Algorithms for Collection Join Queries

        • 8.4. Parallel Collection-Equi Join Algorithms

          • 8.4.1. Disjoint Data Partitioning

          • 8.4.2. Parallel Double Sort-Merge Collection-Equi Join Algorithm

          • 8.4.3. Parallel Sort-Hash Collection-Equi Join Algorithm

          • 8.4.4. Parallel Hash Collection-Equi Join Algorithm

        • 8.5. Parallel Collection-Intersect Join Algorithms

          • 8.5.1. Non-Disjoint Data Partitioning

          • 8.5.2. Parallel Sort-Merge Nested-Loop Collection-Intersect Join Algorithm

          • 8.5.3. Parallel Sort-Hash Collection-Intersect Join Algorithm

          • 8.5.4. Parallel Hash Collection-Intersect Join Algorithm

        • 8.6. Parallel Subcollection Join Algorithms

          • 8.6.1. Data Partitioning

          • 8.6.2. Parallel Sort-Merge Nested-Loop Subcollection Join Algorithm

          • 8.6.3. Parallel Sort-Hash Subcollection Join Algorithm

          • 8.6.4. Parallel Hash Subcollection Join Algorithm

        • 8.7. Summary

        • 8.8. Bibliographical Notes

        • 8.9. Exercises

      • 9. Parallel Query Scheduling and Optimization

        • 9.1. Query Execution Plan

        • 9.2. Subqueries Execution Scheduling Strategies

          • 9.2.1. Serial Execution Among Subqueries

          • 9.2.2. Parallel Execution Among Subqueries

        • 9.3. Serial vs. Parallel Execution Scheduling

          • 9.3.1. Nonskewed Subqueries

          • 9.3.2. Skewed Subqueries

          • 9.3.3. Skewed and Nonskewed Subqueries

        • 9.4. Scheduling Rules

        • 9.5. Cluster Query Processing Model

          • 9.5.1. Overview of Dynamic Query Processing

          • 9.5.2. A Cluster Query Processing Architecture

          • 9.5.3. Load Information Exchange

        • 9.6. Dynamic Cluster Query Optimization

          • 9.6.1. Correction

          • 9.6.2. Migration

          • 9.6.3. Partition

        • 9.7. Other Approaches to Dynamic Query Optimization

        • 9.8. Summary

        • 9.9. Bibliographical Notes

        • 9.10. Exercises

    • Part IV Grid Databases

      • 10. Transactions in Distributed and Grid Databases

        • 10.1. Grid Database Challenges

        • 10.2. Distributed Database Systems and Multidatabase Systems

          • 10.2.1. Distributed Database Systems

          • 10.2.2. Multidatabase Systems

        • 10.3. Basic Definitions on Transaction Management

        • 10.4. ACID Properties of Transactions

        • 10.5. Transaction Management in Various Database Systems

          • 10.5.1. Transaction Management in Centralized and Homogeneous Distributed Database Systems

          • 10.5.2. Transaction Management in Heterogeneous Distributed Database Systems

        • 10.6. Requirements in Grid Database Systems

        • 10.7. Concurrency Control Protocols

        • 10.8. Atomic Commit Protocols

          • 10.8.1. Homogeneous Distributed Database Systems

          • 10.8.2. Heterogeneous Distributed Database Systems

        • 10.9. Replica Synchronization Protocols

          • 10.9.1. Network Partitioning

          • 10.9.2. Replica Synchronization Protocols

        • 10.10. Summary

        • 10.11. Bibliographical Notes

        • 10.12. Exercises

      • 11. Grid Concurrency Control

        • 11.1. A Grid Database Environment

        • 11.2. An Example

        • 11.3. Grid Concurrency Control

          • 11.3.1. Basic Functions Required by GCC

          • 11.3.2. Grid Serializability Theorem

          • 11.3.3. Grid Concurrency Control Protocol

          • 11.3.4. Revisiting the Earlier Example

          • 11.3.5. Comparison with Traditional Concurrency Control Protocols

        • 11.4. Correctness of GCC Protocol

        • 11.5. Features of GCC Protocol

        • 11.6. Summary

        • 11.7. Bibliographical Notes

        • 11.8. Exercises

      • 12. Grid Transaction Atomicity and Durability

        • 12.1. Motivation

        • 12.2. Grid Atomic Commit Protocol (Grid-ACP)

          • 12.2.1. State Diagram of Grid-ACP

          • 12.2.2. Grid-ACP Algorithm

          • 12.2.3. Early-Abort Grid-ACP

          • 12.2.4. Discussion

          • 12.2.5. Message and Time Complexity Comparison Analysis

          • 12.2.6. Correctness of Grid-ACP

        • 12.3. Handling Failure of Sites with Grid-ACP

          • 12.3.1. Model for Storing Log Files at the Originator and Participating Sites

          • 12.3.2. Logs Required at the Originator Site

          • 12.3.3. Logs Required at the Participant Site

          • 12.3.4. Failure Recovery Algorithm for Grid-ACP

          • 12.3.5. Comparison of Recovery Protocols

          • 12.3.6. Correctness of Recovery Algorithm

        • 12.4. Summary

        • 12.5. Bibliographical Notes

        • 12.6. Exercises

      • 13. Replica Management in Grids

        • 13.1. Motivation

        • 13.2. Replica Architecture

          • 13.2.1. High-Level Replica Management Architecture

          • 13.2.2. Some Problems

        • 13.3. Grid Replica Access Protocol (GRAP)

          • 13.3.1. Read Transaction Operation for GRAP

          • 13.3.2. Write Transaction Operation for GRAP

          • 13.3.3. Revisiting the Example Problem

          • 13.3.4. Correctness of GRAP

        • 13.4. Handling Multiple Partitioning

          • 13.4.1. Contingency GRAP

          • 13.4.2. Comparison of Replica Management Protocols

          • 13.4.3. Correctness of Contingency GRAP

        • 13.5. Summary

        • 13.6. Bibliographical Notes

        • 13.7. Exercises

      • 14. Grid Atomic Commitment in Replicated Data

        • 14.1. Motivation

          • 14.1.1. Architectural Reasons

          • 14.1.2. Motivating Example

        • 14.2. Modified Grid Atomic Commitment Protocol

          • 14.2.1. Modified Grid-ACP

          • 14.2.2. Correctness of Modified Grid-ACP

        • 14.3. Transaction Properties in Replicated Environment

        • 14.4. Summary

        • 14.5. Bibliographical Notes

        • 14.6. Exercises

    • Part V Other Data-Intensive Applications

      • 15. Parallel Online Analytic Processing (OLAP) and Business Intelligence

        • 15.1. Parallel Multidimensional Analysis

        • 15.2. Parallelization of ROLLUP Queries

          • 15.2.1. Analysis of Basic Single ROLLUP Queries

          • 15.2.2. Analysis of Multiple ROLLUP Queries

          • 15.2.3. Analysis of Partial ROLLUP Queries

          • 15.2.4. Parallelization Without Using ROLLUP

        • 15.3. Parallelization of CUBE Queries

          • 15.3.1. Analysis of Basic CUBE Queries

          • 15.3.2. Analysis of Partial CUBE Queries

          • 15.3.3. Parallelization Without Using CUBE

        • 15.4. Parallelization of Top-N and Ranking Queries

        • 15.5. Parallelization of Cume_Dist Queries

        • 15.6. Parallelization of NTILE and Histogram Queries

        • 15.7. Parallelization of Moving Average and Windowing Queries

        • 15.8. Summary

        • 15.9. Bibliographical Notes

        • 15.10. Exercises

      • 16. Parallel Data Mining—Association Rules and Sequential Patterns

        • 16.1. From Databases To Data Warehousing To Data Mining: A Journey

        • 16.2. Data Mining: A Brief Overview

          • 16.2.1. Data Mining Tasks

          • 16.2.2. Querying vs. Mining

          • 16.2.3. Parallelism in Data Mining

        • 16.3. Parallel Association Rules

          • 16.3.1. Association Rules: Concepts

          • 16.3.2. Association Rules: Processes

          • 16.3.3. Association Rules: Parallel Processing

        • 16.4. Parallel Sequential Patterns

          • 16.4.1. Sequential Patterns: Concepts

          • 16.4.2. Sequential Patterns: Processes

          • 16.4.3. Sequential Patterns: Parallel Processing

        • 16.5. Summary

        • 16.6. Bibliographical Notes

        • 16.7. Exercises

      • 17. Parallel Clustering and Classification

        • 17.1. Clustering and Classification

          • 17.1.1. Clustering

          • 17.1.2. Classification

        • 17.2. Parallel Clustering

          • 17.2.1. Clustering: Concepts

          • 17.2.2. k-Means Algorithm

          • 17.2.3. Parallel k-Means Clustering

        • 17.3. Parallel Classification

          • 17.3.1. Decision Tree Classification: Structures

          • 17.3.2. Decision Tree Classification: Processes

          • 17.3.3. Decision Tree Classification: Parallel Processing

        • 17.4. Summary

        • 17.5. Bibliographical Notes

        • 17.6. Exercises

    • Permissions

    • List of Conferences and Journals

    • Bibliography

    • Index

Content

(2) The global transactions currently executing are added to a set, which stores all active transactions. The set of active transactions is represented as Active_trans.

(3) The middleware appends a timestamp to every subtransaction of the global transaction before submitting it to the corresponding database.

(4) If there are two active global transactions that access more than one database site simultaneously, this creates a potential threat that the local databases may schedule the subtransactions in conflicting orders. The subtransactions are therefore executed strictly according to the timestamp attached to each subtransaction. Total-order is achieved by executing the conflicting subtransactions according to the timestamp.

(5) When all subtransactions of a global transaction complete execution at all sites, the transaction terminates and is removed from the Active_trans set (see details in the Termination Phase).

Note: Active_trans and Active_Trans(DB) are different. The former is the set of currently active global transactions; the latter is a function that takes a database site as an argument and returns the set of active transactions running at that database site.

Explanation of Figure 11.3. Line 1 of Figure 11.3 checks the number of subtransactions of the submitted transaction. If there is only a single subtransaction, the global transaction can start executing immediately: it is added to the active set (line 2) and submitted immediately to the database for execution (line 3). If the global transaction has more than one subtransaction, that is, the transaction accesses more than one database site, then total-order must be followed, and hence a timestamp must be appended to all subtransactions of the global transaction. The global transaction is added to the active set (line 4). Global transactions having only one subtransaction are filtered out of the active set, and a new set (Conflict_Active_trans) of the conflicting global transactions is formed (line 5). Timestamps are then appended to all subtransactions of the global transaction (lines 6 and 7). If the global transaction being submitted conflicts with other active global transactions, it must be submitted to the participant sites' queues to be executed in total-order. The conflict of a submitted global transaction (Ti) with some other active global transaction (Tj) having more than one active subtransaction is checked in line 8. If two global transactions each having more than one active subtransaction exist (i.e., a global-global conflict), then the global transaction is added to all participating sites' active transaction sets (Active_Trans(DBi)) (line 13) and the subtransactions are submitted to the participants' queues (line 14), to be executed strictly according to the total-order. If the submitted global transaction does not conflict with any other active global transaction (i.e., line 8 is true), then the global transaction is added to the active transaction set of all the participant sites (line 10), and each subtransaction is immediately submitted for scheduling (line 11). Global transactions are said to be conflicting if two global transactions have more than two active subtransactions executing at different participating sites simultaneously. This is different from the definition of a conflicting transaction in Definition 11.2; the intended sense will be clear from the context.
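To make this bookkeeping concrete before the pseudocode of Figure 11.3 below, the following Python sketch models the two structures from the note above and the global-global conflict test of line 8. All names here (active_trans, db_accessed, and the function itself) are illustrative, not from the book.

    # Hypothetical sketch of the GCC bookkeeping; names are ours, not the book's.
    active_trans = {}   # Active_Trans(DB): site -> set of active global transactions
    db_accessed = {}    # DB_accessed(T): global transaction -> set of sites it accesses

    def is_global_global_conflict(ti, conflict_active_trans):
        # Line 8 of Figure 11.3: Ti must be queued (total-order) if and only if
        # it overlaps the multi-site active transactions at more than one site.
        others = set().union(*(db_accessed[tj] for tj in conflict_active_trans))
        return len(db_accessed[ti] & others) > 1

For instance, with db_accessed = {'T1': {'DB2', 'DB3'}, 'T2': {'DB2', 'DB3'}}, the call is_global_global_conflict('T2', {'T1'}) returns True, so T2's subtransactions would be routed to the participants' queues.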
Algorithm: Grid concurrency control algorithm for the submission phase

    input T_i: transaction
    var Active_trans: set of active transactions
    var Conflict_Active_trans: set of active transactions that conflict
        with the global transaction being submitted
    var Database_accessed[T_i]: database sites being accessed by global
        transaction T_i

    Generate timestamp ts: a unique timestamp is generated
    Split_trans(T_i)
    Database_accessed[T_i] <- DB_accessed(T_i)
    1.  if Cardinality(Database_accessed[T_i]) = 1
    2.      Active_Trans(DB_i) <- Active_Trans(DB_i) ∪ {T_i}
                // T_i has only one subtransaction
    3.      submit subtransaction to DB_i
        else
    4.      Active_trans <- ∪ Active_Trans(DB_i)
    5.      Conflict_Active_trans <- {T_j | T_j ∈ Active_trans ∧
                Cardinality(DB_accessed(T_j)) > 1}
    6.      for each subtransaction of T_i
    7.          Append_TS(subtransaction)
    8.      if Cardinality(Database_accessed[T_i] ∩
                (∪_{T_j ∈ Conflict_Active_trans} DB_accessed(T_j))) <= 1
    9.          for each DB_i ∈ Database_accessed[T_i]
    10.             Active_Trans(DB_i) <- Active_Trans(DB_i) ∪ {T_i}
    11.             submit subtransaction to DB_i
                        // subtransaction executes immediately
            else
    12.         for each DB_i ∈ Database_accessed[T_i]
    13.             Active_Trans(DB_i) <- Active_Trans(DB_i) ∪ {T_i}
    14.             submit subtransaction to participant's DB queue
                        // signifies that the subtransaction must follow total-order

Figure 11.3 Grid concurrency control algorithm for the submission phase
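Read as running code, the submission phase might look like the sketch below. It reuses the structures from the earlier sketch and assumes a middleware timestamp counter plus site objects exposing submit() and per-site queues; all of these helpers are our assumptions, not the book's API.

    import itertools

    _ts = itertools.count(1)   # assumed middleware timestamp service

    def submit_global_transaction(ti, subtrans, sites, queues):
        # subtrans maps each database site to ti's subtransaction at that site.
        db_accessed[ti] = set(subtrans)
        if len(db_accessed[ti]) == 1:                         # line 1
            (db,) = db_accessed[ti]
            active_trans.setdefault(db, set()).add(ti)        # line 2
            sites[db].submit(subtrans[db])                    # line 3
            return
        all_active = set().union(*active_trans.values())      # line 4
        conflict_active = {tj for tj in all_active            # line 5: multi-site only
                           if len(db_accessed[tj]) > 1}
        ts = next(_ts)                                        # lines 6-7: stamp all
        stamped = {db: (ts, st) for db, st in subtrans.items()}
        immediate = not is_global_global_conflict(ti, conflict_active)   # line 8
        for db in db_accessed[ti]:                            # lines 9-14
            active_trans.setdefault(db, set()).add(ti)
            if immediate:
                sites[db].submit(stamped[db])     # executes immediately
            else:
                queues[db].append(stamped[db])    # must follow total-order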
Termination Phase

The global transaction is considered active until a response from all subtransactions is received. Because of the atomicity property of the transaction, the global transaction cannot reach a final decision (i.e., commit or abort) until it has received a decision from all the subtransactions. The steps of the transaction termination phase are as follows:

(1) When any subtransaction finishes execution, the originator site of the global transaction is informed.

(2) The active transactions, the conflicting active transactions, and the databases-accessed set (of the global transaction) are adjusted to reflect the recent changes due to the completion of the subtransaction.

(3) The site checks whether the completed subtransaction is the last subtransaction of the global transaction to terminate.

(3a) If the subtransaction is not the last to terminate, the subtransactions waiting in the queue cannot yet be scheduled.

(3b) If the subtransaction is the last subtransaction of the global transaction to terminate, the other conflicting subtransactions can be scheduled. The subtransactions from the queue then follow the normal submission steps discussed in Figure 11.3.

Algorithm: Grid concurrency control algorithm for the termination phase

    input ST: subtransaction of T_i at a site that completes execution
    1.  Active_trans = (Active_trans - T_i)
            // removes the global transaction from the active set of the site
    2.  Conflict_Active_trans = (Conflict_Active_trans - T_i)
    3.  Database_accessed[T_i] = (Database_accessed[T_i] - DB_k)
            // the database where the subtransaction committed is removed from
            // the set of databases being accessed by the global transaction
    4.  if (Database_accessed[T_i]) = ∅
            // the subtransaction was the last cohort of global transaction T_i
    5.      resubmit subtransactions from the queue for execution
                // following Figure 11.3
        else
    6.      resubmit subtransactions to the queue
                // same as line (14) of Figure 11.3

Figure 11.4 Grid concurrency control algorithm for the termination phase

Explanation of Figure 11.4. The originator site of the global transaction is informed after any subtransaction completes execution. The global transaction, T_i, is then removed from the active transaction set (line 1). This follows the earlier assumption that a global transaction can have only one subtransaction running at any site at any particular time. The conflicting active transaction set is adjusted accordingly (line 2). The database site where the subtransaction completed is removed from the databases-accessed set (line 3). If the completed subtransaction is the last subtransaction of the global transaction, that is, the databases-accessed set is empty (line 4), the other waiting subtransactions in the queue are submitted for execution (line 5); the normal transaction submission procedure of Figure 11.3 is followed thereafter. If the completed subtransaction is not the last subtransaction, the queue is unaffected (line 6).
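A matching sketch of Figure 11.4 under the same assumptions follows; resubmit stands for whatever re-runs the submission steps of Figure 11.3, and in this sketch the conflict set of line 2 is recomputed at each submission rather than stored.

    def on_subtransaction_complete(ti, db_k, queues, resubmit):
        # Called at the originator when ti's subtransaction at db_k finishes.
        active_trans.get(db_k, set()).discard(ti)     # line 1
        db_accessed[ti].discard(db_k)                 # line 3
        if db_accessed[ti]:                           # not the last cohort:
            return                                    # queues untouched (line 6)
        del db_accessed[ti]                           # line 4: last cohort of ti
        drained = []                                  # line 5: schedule the waiters
        for db, queue in queues.items():
            drained.extend((db, st) for st in queue)
            queue.clear()
        for db, st in drained:
            resubmit(db, st)   # normal submission steps of Figure 11.3 follow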
11.3.4 Revisiting the Earlier Example

Taking the same scenario as the earlier example, consider that global transactions T1 and T2 are submitted in quick succession. Since both transactions need to access data from more than one site, they are forwarded to the middleware to check the metadata service and to form subtransactions (eq. 11.1, 11.2, 11.3, and 11.4) (step 1 of the GCC protocol). As data from multiple sites are to be accessed, the transactions are added to the Active_trans set (step 2 of the GCC protocol). Since subtransactions (eq. 11.1 and 11.2) belong to the same global transaction, T1, the middleware appends the same timestamp to both of them, say timestamp = 1 (step 3 of the protocol). Similarly, subtransactions (eq. 11.3 and 11.4) belong to T2, and hence the same timestamp is appended to both of them, say timestamp = 2 (step 3 of the protocol).

By looking at equation 11.5, we note that the history produced at database site DB2 schedules the subtransaction of global transaction T1 before the subtransaction of T2 (the history in equation 11.5 is serial, but that does not matter as long as H2 is serializable with serialization order T1 → T2, because the timestamp attached to T1 by the middleware is less than that of T2). Execution of equation 11.6 will be prohibited by line 14 (or step 4) of the algorithm, because T1 and T2 are conflicting global transactions and the serialization order is T2 → T1, which does not follow the timestamp sequence. Hence, schedules H2 and H3 will be corrected by the GCC protocol as follows:

    H2 = r12(O1) r12(O2) w12(O1) C12 r22(O1) w22(O1) C22
         (same as eq. 11.5)
    H3 = w13(O3) C13 r23(O3) w23(O4) C23
         (corrected execution order by the GCC protocol)

Thus in both schedules, T1 → T2. It is not required that the schedules be serial; it is only required that the serialization order be the same as that of the timestamp sequence from the middleware.

11.3.5 Comparison with Traditional Concurrency Control Protocols

Homogeneous distributed concurrency control protocols may be lock-based, timestamp-based, or hybrid protocols. The following discusses the lock-based protocols only, but the arguments hold for the other protocols as well.

Homogeneous distributed concurrency control protocols can be broadly classified as (i) centralized and (ii) distributed. In a centralized protocol, the lock manager and the global lock table are situated at a central site. The flow of control (sequence diagram) for centralized concurrency control protocols in a distributed DBMS (e.g., centralized two-phase locking) is shown in Figure 11.5. All the global information is stored at a central site, which makes the central site a hotspot and prone to failure. To overcome the limitations of central management, a distributed concurrency control protocol is used in distributed DBMSs. The flow of control messages for distributed concurrency control protocols (e.g., distributed two-phase locking) is shown in Figure 11.6.

[Figure 11.5 Operations of a general centralized locking protocol (e.g., centralized two-phase locking) in a homogeneous distributed DBMS. Sequence diagram between the coordinator site (typically where the transaction is submitted), the central site managing global information (e.g., the global lock table), and all participating sites (1, 2, ..., n); message legs: lock request, lock granted, operation command, operation decision, release lock request.]

[Figure 11.6 Operations of a general distributed locking protocol (e.g., decentralized two-phase locking) in a homogeneous distributed DBMS. Sequence diagram between the coordinator site and all participating sites (1, 2, ..., n), each holding its own image of the global information; message legs: operation command embedded with lock request, operation, end of operation, release lock request.]

[Figure 11.7 Operations of a general multi-DBMS protocol. Sequence diagram between the originator site (where the transaction is submitted), the multidatabase management system (global management layer), and all participants (1, 2, ..., n); message legs: operation request embedded with global information, talk to participant depending on its local protocol, check with the multi-DBMS layer if required, MDBS reply, final decision, forward final decision to the originator.]

[Figure 11.8 Operations of the GCC protocol. Sequence diagram between the originator site (where the transaction is submitted), the Grid middleware services (metadata and timestamp services for this purpose), and all participants (1, 2, ..., n); message legs: operation request, forward operation request to participants, final decision, forward final decision to the originator.]

Figure 11.7 shows the sequence of operations for a heterogeneous distributed DBMS (e.g., a multidatabase system). Figure 11.8 shows the sequence of operations for the GCC protocol, and highlights that the middleware's function is very lightweight in a Grid environment: it acts only as a rerouting node for the global transaction (specifically from the correctness perspective), unlike all the other architectures. All the other architectures (Figs. 11.5-11.7) keep a global image of the data and have more communication with the sites. Note that the final decision in Figure 11.8 runs in a straight line from the participants to the originator via the middleware; this shows that there is no processing at the middleware, which acts only as a forwarding node. Conversely, Figure 11.7 shows a time lag after receiving the responses from the participants and before forwarding them to the originator, as the multi-DBMS layer has to map the responses into a protocol understandable to the originator.
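One rough way to see the weight difference the figures describe is to list the message legs each figure labels. The lists below are simply our reading of Figures 11.5 to 11.8, not counts stated by the book.

    # Message legs per protocol, as labelled in Figures 11.5-11.8 (our reading).
    message_legs = {
        "centralized 2PL (Fig. 11.5)": [
            "lock request", "lock granted", "operation command",
            "operation decision", "release lock request"],
        "distributed 2PL (Fig. 11.6)": [
            "operation command embedded with lock request",
            "operation", "end of operation", "release lock request"],
        "multi-DBMS (Fig. 11.7)": [
            "operation request embedded with global information",
            "talk to participant depending on its local protocol",
            "check with multi-DBMS layer if required",
            "MDBS reply", "final decision", "forward final decision"],
        "GCC (Fig. 11.8)": [
            "operation request", "forward operation request to participants",
            "final decision", "forward final decision to the originator"],
    }
    for protocol, legs in message_legs.items():
        print(protocol, "->", len(legs), "legs")

The GCC row is short not because steps were omitted but because the middleware neither stores nor processes global state; it only forwards.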
The term "coordinator" is used in Figures 11.5 and 11.6, and "originator" in Figures 11.7 and 11.8. In both cases these are the sites where the global transaction is submitted. The reason for distinguishing the two terms is that in Figures 11.5 and 11.6 the site also acts as the coordinator of the global transaction, while in Figures 11.7 and 11.8, because of site autonomy, the site acts only as the originator of the global transaction. Figure 11.7 also has far more communication with the multi-DBMS layer than Figure 11.8 has with the middleware, as that layer stores and processes all the global information.

11.4 CORRECTNESS OF GCC PROTOCOL

A Grid-serializable schedule is considered correct in the Grid environment for database systems. A concurrency control protocol conforming to Theorem 11.1 is Grid-serializable, and is thus correct. Hence, to show the correctness of the GCC protocol, it must be shown that any schedule produced by the GCC protocol has the Grid-serializability property. Proposition 11.1 states the assumption that each DBMS can correctly schedule the transactions (local transactions and global subtransactions) submitted to its site.

Proposition 11.1: All local transactions and global subtransactions submitted to any local scheduler are scheduled in serializable order.

Because of the autonomy of sites, local schedulers cannot communicate with each other, and because of architectural limitations, a global scheduler cannot be implemented in a Grid environment. Owing to this lack of communication among the local schedulers and the absence of a global scheduler, it becomes difficult to maintain the consistency of the data. Thus the execution of global subtransactions at local database sites must be handled in such a way that data consistency is maintained. The additional requirement for Grid-serializability is stated in Proposition 11.2.

Proposition 11.2: Any two global transactions having more than one subtransaction actively executing simultaneously must follow total-order.
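As a concrete illustration of what Proposition 11.2 rules out, consider a hypothetical run (our example, written in the chapter's notation) in which GT1 and GT2 each have active subtransactions at DB1 and DB2:

    H_DB1 = r11(Oa) w11(Oa) C11 r21(Oa) w21(Oa) C21    (GT1 precedes GT2 at DB1)
    H_DB2 = r22(Ob) w22(Ob) C22 r12(Ob) w12(Ob) C12    (GT2 precedes GT1 at DB2)

Each site's schedule is serial, so Proposition 11.1 is satisfied locally, yet the combined precedence relation contains the cycle GT1 → GT2 → GT1, and no equivalent global serial order exists. Total-order forbids exactly this: the transaction carrying the smaller middleware timestamp must precede the other at every site, as the proof of Theorem 11.2 below makes precise.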
Based on Propositions 11.1 and 11.2, the following theorem shows that all schedules produced by the GCC protocol are Grid-serializable.

Theorem 11.2: Every schedule produced by the GCC protocol is Grid-serializable.

Proof: The types of schedules the GCC can produce are identified first, and then it is shown that these schedules are Grid-serializable. Global transactions are broadly classified into two categories:

(a) Global transactions having only one subtransaction: A global transaction having a single subtransaction can be scheduled immediately and will always either precede or follow any conflicting transaction, because it executes at only a single site. By Proposition 11.1, the local scheduler can schedule the transaction in serializable order.

(b) Global transactions having more than one subtransaction: Such a global transaction falls under one of the following two cases:

(i) Although the global transaction has multiple subtransactions, it conflicts with other active global transactions at only a single site. This scenario is not a threat to data consistency, and thus the subtransactions can be scheduled immediately (Fig. 11.3, line 8). Local schedulers can correctly schedule transactions in this case.

(ii) The global transaction has multiple subtransactions and conflicts with other global transactions at more than one site. Local schedulers cannot schedule global transactions in this scenario. Hence, the GCC protocol submits all subtransactions to the queue, and these subtransactions are executed strictly according to the timestamp attached at the Grid middleware. This ensures that if a subtransaction of any global transaction GTi precedes a subtransaction of any other global transaction GTj at any site, then the subtransactions of GTi will precede the subtransactions of GTj at all sites.

Thus in all cases (a, b(i), and b(ii)), conflicting global transactions are scheduled in such a way that if any global transaction GTi precedes any other global transaction GTj at any site, then GTi precedes GTj at all sites.

The types of schedules produced by the GCC protocol are thus identified. Next, it is shown that these schedules are Grid-serializable. To prove this, the Grid-serializability graph must be acyclic and the global transactions must be in total-order. Conflicts of the following types may occur:

• Conflict between local and local transactions. The local scheduler is responsible for scheduling local transactions. Total-order is required only for schedules in which global subtransactions are involved. By Proposition 11.1, local schedulers can schedule these transactions in serializable order.

• Conflict between a global transaction and a local transaction. A local transaction executes at only one site, so a subtransaction of the global transaction can conflict with the local transaction only at that site. The local transaction and the subtransaction of the global transaction are therefore scheduled by the same scheduler and, by Proposition 11.1, in serializable order. Total-order is also maintained, as only one local scheduler is involved in the serialization process.

• Conflict between global and global transactions. Assume that an arc GTi → GTj exists at some site DBi. It will be shown that an arc GTj → GTi cannot exist in GCC. GTj could precede GTi either at the database site DBi or at some other database site DBn. Suppose GTj precedes and conflicts with GTi at site DBi; this contradicts Proposition 11.1, so GTj cannot precede GTi at DBi. Suppose instead that GTj precedes and conflicts with GTi at some other site DBn; then total-order is not followed, which contradicts Proposition 11.2, and line 14 of Figure 11.3 in the GCC protocol prevents the occurrence of such a scenario.

Thus schedules produced by the GCC protocol are Grid-serializable.
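The acyclicity test at the heart of this argument is mechanical. A small checker over per-site serialization orders (a sketch with invented names, not part of the GCC protocol itself) could be:

    def grid_serializable(per_site_orders):
        # per_site_orders: site -> list of global transactions in the
        # serialization order observed at that site. True iff the union
        # of all per-site precedence edges is acyclic (total-order holds).
        edges = {}
        for order in per_site_orders.values():
            for i, gt in enumerate(order):
                edges.setdefault(gt, set()).update(order[i + 1:])
        visiting, done = set(), set()

        def acyclic_from(node):
            if node in done:
                return True
            if node in visiting:
                return False              # back edge found: a cycle exists
            visiting.add(node)
            ok = all(acyclic_from(n) for n in edges.get(node, ()))
            visiting.discard(node)
            if ok:
                done.add(node)
            return ok

        return all(acyclic_from(n) for n in list(edges))

For example, grid_serializable({'DB2': ['T1', 'T2'], 'DB3': ['T2', 'T1']}) returns False (the uncorrected order of eq. 11.6), while the GCC-corrected orders {'DB2': ['T1', 'T2'], 'DB3': ['T1', 'T2']} yield True.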
11.5 FEATURES OF GCC PROTOCOL

A concurrency control protocol interleaves the operations of different transactions while maintaining the consistency of the data in the presence of multiple users. The GCC protocol has the following main features:

(a) Concurrency control in a heterogeneous environment: The GCC protocol does not need to store global information about the participating sites. In a traditional distributed DBMS, for example, a global lock table stores information on all locks being accessed by global transactions; in the Grid environment, however, the database sites might not all use the same concurrency control strategy (e.g., a locking protocol). In the GCC protocol, individual subtransactions are free to execute under the local concurrency control protocol of each participating site. The Grid middleware is used only to monitor the execution order of the conflicting transactions.

(b) Reducing the load on the originating site: Centralized scheduling schemes and decentralized consensus-based policies delegate the originating site of the transaction as the coordinator, so the coordinator site may become a bottleneck when a transaction has to access multiple sites simultaneously. The GCC protocol delegates the scheduling responsibility to the respective sites where the data reside, without compromising the correctness of the data, and thus prevents the coordinator from becoming a bottleneck.

(c) Reducing the number of messages in the internetwork: Centralized and consensus-based decentralized scheduling schemes need to communicate with the coordinator to achieve correct schedules, and this communication increases the number of messages in the system. Messages are among the most expensive items to handle in any distributed infrastructure. The GCC protocol moves fewer messages across the network to achieve concurrency control.

Since the GCC protocol imposes total-order on global transactions, the conflicting transactions always proceed in one direction, thereby avoiding the problem of distributed deadlocks. Local deadlock management remains the policy of the local database site: because of autonomy restrictions, external interference in the local policy is not possible. Other concurrency control anomalies, such as lost updates, dirty reads, and unrepeatable reads, are addressed at the local DBMS level.

The above features follow from the architectural requirements of the Grid. There is, however, a serious architectural limitation of the Grid for concurrency control protocols: because a global scheduler cannot be installed, it is difficult to monitor the execution of global subtransactions at the different database sites. As a result, some valid interleavings of transactions cannot take place, and the resultant schedules are stricter than required.

11.6 SUMMARY

Grids are evolving as a new distributed computing infrastructure. Traditional distributed databases, such as distributed database management systems and multidatabase management systems, make use of globally stored information in their concurrency control protocols, mostly employing centralized or decentralized consensus-based policies. The Grid architecture does not support the storage of global information such as global lock tables or global schedulers; thus a new concurrency control protocol for Grid databases, called GCC, is needed.

The GCC protocol has several advantages: it operates in a heterogeneous environment; the load on the originator site is reduced compared with traditional distributed databases; and the number of messages in the network is reduced. At the same time, because of the lack of global control and the autonomy restrictions of the Grid architecture, it is difficult to optimize the scheduling process. The focus of this chapter was the maintenance of data consistency during the scheduling of global transactions.

11.7 BIBLIOGRAPHICAL NOTES

Consistency and isolation, two of the ACID properties of transactions, are the focus of this chapter.
Most of the important work on concurrency control has been mentioned in the Bibliographical Notes section at the end of Chapter 10. This covers the work on parallel and Grid transaction management by Brayner (DEXA 2001), Burger et al. (BNCOD 1994), Colohan et al. (VLDB 2005), ... (1993), Machado and Collet (DASFAA 1997), Wang et al. (Parallel Computing 1997), and Weikum and Hasse (VLDB Journal).

11.8 EXERCISES

11.1. Explain how concurrency control helps to achieve the "C" and "I" of the ACID properties.

11.2. Explain why individually serializable schedules at each site of the Grid environment may not produce a serializable global schedule.

11.3. Explain the following terminologies:
  a. Total-order
  b. Grid-serial history
  c. Grid-serializable history
  d. Grid-serializability graph
  e. Grid-serializability theorem

11.4. Summarize the main features of the grid concurrency control (GCC) protocol, and explain how it solves the concurrency issues in the Grid.

11.5. Compare and contrast the difference between GCC and other concurrency control protocols (e.g., in distributed databases and multidatabase systems).

11.6. Discuss why the number of messages in the internetwork using GCC is reduced in comparison with other concurrency control protocols.
