HIGH PERFORMANCE SEQUENTIAL AND LOCK-FREE PARALLEL ACCESS PRIORITY QUEUE STRUCTURES

RICK SIOW MONG GOH
(B.Eng. (Hons.), NUS)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF ELECTRICAL & COMPUTER ENGINEERING
NATIONAL UNIVERSITY OF SINGAPORE
2006

ACKNOWLEDGEMENTS

First and foremost I want to thank my advisor, Ian Thng Li-Jin, for his support, patience, enthusiasm, direction, guidance and encouragement. I have learnt a huge amount from working with him and have thoroughly enjoyed the countless lively meetings, conversations and luncheons. The inspiring and positive environment at the Computer Communication Network Laboratory has also contributed to the success of this work. I wish to thank all the colleagues at the laboratory for contributing to a pleasant environment. My work has benefited from working with many other colleagues over the past years: Tan Kah Leong, Patrick Tan Chin Wee, Tam Pei Zuan, Wang Tsu Wei, Goolaup Sarjoosing, Tang Wai Teng, Chew Yew Meng and Tan Kok Leong. Life there is not purely about work, and I have really enjoyed the outings, barbeque sessions, dinners and overnights at chalets. Thank you all! In addition, I would also like to thank my former colleagues at Silicon Graphics Inc., especially David Tan, for the use of their servers for the performance studies of some of my algorithms. I am also very fortunate to have the love, support and encouragement of an understanding and fantastic family. I am grateful to my parents, who have influenced me not just to be myself but to continue improving myself. I wish to thank Marie Therese Robles Quieta for her love and support during the past years and for the many years to come.
In addition, my life as a candidate would have been much more difficult without the support of many good friends. Though it is not possible to list all of them here, they know who they are. Thank you! Last but not least, I thank God The Father, The Son, and The Holy Spirit, for walking with me throughout this life-changing journey.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
TABLE OF CONTENTS
SUMMARY
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
INTRODUCTION
1.1 PURPOSE AND APPLICATIONS OF PRIORITY QUEUES
1.2 TYPES OF PRIORITY QUEUES
1.2.1 Sorted-Discipline Priority Queues
1.2.1.1 Sorted-discipline Calendar Queue
1.2.1.2 Vulnerabilities of the Sorted-discipline Calendar Queue
1.2.2 Unsorted-Discipline Priority Queues
1.3 ORIGINAL CONTRIBUTIONS OF THE THESIS
1.4 ORGANIZATION OF THE THESIS
EDPQ AND THE TWOL BACK-END STRUCTURE
2.1 OUTLINE OF CHAPTER
2.2 INTRODUCTION TO EPOCH-BASED DEFERRED-SORT PRIORITY QUEUE
2.3 TWO-TIER LIST-BASED (TWOL) BACK-END STRUCTURE
2.4 TWOL ALGORITHM – DELETEMIN(S) OPERATION
2.4.1 Successive DeleteMin Operations Creates Epochs
2.5 TWOL ALGORITHM – INSERT(E,S) OPERATION
2.6 THEORETICAL ANALYSIS OF EPOCH-BASED DEFERRED-SORT PRIORITY QUEUE
2.6.1 Scenario and Conditions for Theoretical Analysis, and 1-epoch EDPQ
2.6.2 Complexity of 1-epoch EDPQ
2.6.3 Similarities and Differences of the Twol Back-end to the UCQ
2.6.4 Important Results for the 1-epoch EDPQ
2.6.5 Complexity of Conventional EDPQ
2.7 PERFORMANCE MEASUREMENT TECHNIQUES
2.7.1 Access Pattern Models
2.7.2 Priority Increment Distributions
2.7.3 Benchmark Architecture and the Effect of Cache Memory
2.8 NUMERICAL ANALYSIS
2.8.1 Performance on Intel Pentium (Cache Disabled)
2.8.2 Performance on the Intel Itanium (SGI Altix 3300), SGI MIPS R16000 (SGI Onyx4) and AMD Athlon MP
2.8.3 Cost vs Performance Consideration
2.8.4 Performance Evaluation Via Swan on Intel Pentium (Cache-Enabled)
2.9 SUMMARY
TWOL WITH IMPROVED FRONT-END
3.1 OUTLINE OF CHAPTER
3.2 RCB+TWOL PRIORITY QUEUE STRUCTURE
3.3 RCB+TWOL ALGORITHM
3.3.1 Split Operation
3.3.2 DeleteMin Operation
3.3.3 Insert Operation
3.3.3.1 Insert_List Function
3.3.3.2 Insert_Node Function
3.4 NUMERICAL ANALYSIS & SUMMARY
LADDER QUEUE
4.1 OUTLINE OF CHAPTER
4.2 SUMMARY OF ESSENTIAL PRINCIPLES
4.3 BASIC STRUCTURE OF LADDER QUEUE
4.4 LADDER QUEUE ALGORITHM
4.4.1 DeleteMin Operation
4.4.2 Successive DeleteMin Operations Creates LadderQ Epochs
4.4.3 Insert Operation
4.5 SPAWNING VS RESIZE AND VALUE OF THRES
4.6 PRACTICAL ASPECTS OF LADDERQ – INFINITE RUNG SPAWNING AND REUSING LADDER STRUCTURE
4.7 THEORETICAL ANALYSIS OF LADDER QUEUE'S O(1) AMORTIZED TIME COMPLEXITY
4.7.1 Scenario and Conditions for Theoretical Analysis and the 1-epoch LadderQ
4.7.2 Complexity of 1-epoch LadderQ
4.7.3 Similarity of 1-epoch LadderQ to the UCQ
4.7.4 Useful Lemmas Applicable to the 1-epoch LadderQ
4.7.5 Theorems for the 1-epoch LadderQ's O(1) Amortized Complexity
4.7.6 Theorems for the Conventional LadderQ's O(1) Amortized Complexity
4.8 NUMERICAL STUDIES
4.8.1 Performance on Intel Pentium (Cache Disabled)
4.8.2 Effect of Bucketwidth on the Performance of LadderQ
4.9 SUMMARY
LOCK-FREE TWOL PRIORITY QUEUE
5.1 OUTLINE OF CHAPTER
5.2 INTRODUCTION
5.2.1 CAS/MCAS for Lock-free Queues, Linearization, Process Helping and Pointer Marking
5.2.2 Current Lock-Free Queue Structures
5.2.3 A Review of Sequential Twol Back-end Structure
5.3 LOCK-FREE TWOL STRUCTURE
5.4 MORE SUBTLE FEATURES OF LOCK-FREE TWOL STRUCTURE
5.4.1 Detecting an Invalid Bucket in T1
5.4.2 Detecting a Transferred T2 and Head Changing Solution
5.4.3 Custom Pointer Marking for Lock-free Twol
5.4.4 Additional Transfer Nodes To Speed Up Transfer
5.4.5 Process Helping in Lock-free Twol
5.4.6 Summary of the Properties of Lock-free Twol
5.5 NUMERICAL ANALYSIS
5.5.1 AMD Athlon MP – Processors
5.5.2 Gateway ALR 9000 – Processors
CONCLUSIONS
6.1 FUTURE RESEARCH
BIBLIOGRAPHY
APPENDICES
A. LOCK-FREE TWOL ALGORITHM IN DETAIL
A.1. SearchT1
A.2. InsertT1
A.3. FinalDelete
A.4. TransferInit
A.5. TransferNodes
A.6. Insert
A.7. DeleteMin
B. PROOF OF CORRECTNESS OF LOCK-FREE TWOL ALGORITHM
LIST OF PUBLICATIONS

SUMMARY

This thesis presents original work in the area of high-performance sequential and lock-free parallel access priority queues. The choice of a high-speed priority queue structure plays an integral role in numerous applications, particularly in sizeable scenarios such as large-scale discrete event simulations. Conventional priority queues fall into two categories: sorted-discipline and unsorted-discipline. The limitation of a sorted-discipline priority queue is that its Insert operation is slow, since a new event has to determine a correct position to be enqueued; its DeleteMin operation, in which the highest-priority event is removed, is fast, however, since the events are already sorted. Conversely, the drawback of an unsorted-discipline priority queue is that its DeleteMin operation is time-consuming, but its Insert operation is swift, as a new event needs only to be appended to the set of unsorted events.
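The trade-off between the two disciplines can be made concrete with a minimal sketch (hypothetical illustration, not code from the thesis): a sorted singly linked list pays O(n) on Insert but O(1) on DeleteMin, while an unsorted list pays O(1) on Insert but O(n) on DeleteMin.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Node { double key; struct Node *next; } Node;

/* Sorted discipline: walk to the correct position (O(n) Insert). */
static Node *sorted_insert(Node *head, double key) {
    Node *n = malloc(sizeof *n);
    n->key = key;
    if (!head || key < head->key) { n->next = head; return n; }
    Node *p = head;
    while (p->next && p->next->key <= key) p = p->next;
    n->next = p->next;
    p->next = n;
    return head;
}

/* Sorted discipline: the minimum sits at the head (O(1) DeleteMin). */
static Node *sorted_delete_min(Node *head, double *out) {
    *out = head->key;
    Node *rest = head->next;
    free(head);
    return rest;
}

/* Unsorted discipline: prepend without searching (O(1) Insert). */
static Node *unsorted_insert(Node *head, double key) {
    Node *n = malloc(sizeof *n);
    n->key = key;
    n->next = head;
    return n;
}

/* Unsorted discipline: scan the whole list for the minimum (O(n) DeleteMin). */
static Node *unsorted_delete_min(Node *head, double *out) {
    Node *min = head, *prev_min = NULL, *p = head, *prev = NULL;
    while (p) {
        if (p->key < min->key) { min = p; prev_min = prev; }
        prev = p;
        p = p->next;
    }
    *out = min->key;
    if (prev_min) prev_min->next = min->next; else head = min->next;
    free(min);
    return head;
}
```

Both variants return the same minimum; only where the sorting work is paid differs, which is precisely the asymmetry the EDPQ paradigm described below exploits.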
We present, for the first time, a new paradigm called epoch-based deferred-sort priority queues (EDPQs), which combine sorted- and unsorted-discipline structures, bringing together the best of both categories of priority queues. The EDPQ paradigm is based on a novel back-end unsorted-discipline structure called Twol (Two-tier List-based), so named because it is made up of two tiers of linked lists. The front-end structure of the EDPQ is a sorted-discipline priority queue. The Twol back-end is essentially the deferred-sort (or unsorted) portion of the EDPQ and is primarily responsible for reducing queue access times significantly. Another novelty of the Twol back-end is its ability to adapt to changing run-time scenarios, since it is an epoch-based structure (hence the term "epoch-based" in EDPQ). Being epoch-based means that the Twol back-end is a dynamic structure that undergoes several birth-death processes, each of which we call an epoch, as run-time progresses. The natural death of the old Twol structure results in the birth of a new Twol structure that makes use of the latest optimized parameters for the currently enqueued events. This ensures that the parameters of the Twol back-end are always kept up to date with the distribution of the queued events, resulting in highly optimized queue access performance. This thesis also demonstrates that any EDPQ using the Twol back-end has a constant-µ O(1) amortized complexity, irrespective of the front-end structure. The term "constant-µ" means that the average jump parameter of the priority increment distribution is a constant. Furthermore, this thesis contributes simulation results of Twol-based EDPQs outperforming conventional priority queues, with an average speedup of 300% to 500% and as much as 1000%, on widely different hardware architectures.
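The epoch re-parameterization idea can be sketched in a few lines: at the start of a new epoch, a bucketwidth for the structure is derived from the current minimum key, maximum key and number of enqueued events, and an event then maps to a bucket by a simple division. The function names here (`bucketwidth_for_epoch`, `bucket_index`) are illustrative assumptions, not the thesis API.

```c
#include <assert.h>

/* Hypothetical sketch: derive a bucketwidth from the key range and the
 * event count, so each bucket holds roughly one event on average. */
static double bucketwidth_for_epoch(double min_key, double max_key, int num_events) {
    return (max_key - min_key) / num_events;
}

/* Map an event's key to a bucket index relative to the epoch's start time. */
static int bucket_index(double key, double start_time, double bucketwidth) {
    return (int)((key - start_time) / bucketwidth);
}
```

Because the parameters are recomputed at every epoch boundary, the bucket mapping tracks the current distribution of the queued events rather than a stale one.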
Another major contribution of the thesis is a specialized front-end structure for customized use with the Twol back-end. This front-end structure, called Reactive Cost-based (RCB), provides an additional queue-access speedup compared with the use of conventional structures for the EDPQ's front-end. The RCB front-end has a cost-based heuristic mechanism which allows it to react to different event scenarios. Another significant contribution made in this thesis is the Ladder Queue (LadderQ), which is made up of a sorted linked list as its front-end and an improved Twol back-end. The LadderQ adds another dimension of improvement because the Twol back-end enhancement results in a novel true O(1) amortized time complexity priority queue. This means that the LadderQ ensures O(1) performance even in a non-stationary µ scenario, and this thesis presents the corresponding theoretical proof. Numerical studies also demonstrate that the LadderQ outperforms all the other EDPQs presented, and they confirm the theoretical prediction that the LadderQ is indeed O(1). Because LadderQ's O(1) complexity applies even when µ varies, it is expected to handle efficiently many distributions beyond those included in this thesis, as compared with the RCB+Twol combination. Furthermore, we also demonstrate empirically that the LadderQ exhibits O(1) behavior even when the variance of the jump parameter is infinite. For the parallel access domain, a major contribution made in this thesis is the derivation of a novel lock-free Twol priority queue which is non-blocking, efficient and dynamic. This is the first multi-tier, multi-bucket structure invented for lock-free parallel access; current lock-free structures are single-tier and do not involve multiple buckets. The lock-free Twol structure has a high degree of disjoint-access parallelism and a high level of scalability.
It is also the first structure which involves the transferring of information between tiers in an efficient and entirely lock-free fashion, with a high level of process helping that allows maximum collaboration amongst the processes managing the Twol priority queue to complete the transfer in a synchronized and efficient way. The new structure is also linearizable, and the numerical analysis shows that it outperforms the most recent lock-free algorithms. Lock-free Twol is a pioneering work that paves the way for more efficient multi-tier lock-free structures in the near future.

LIST OF TABLES

Table 1-1: Arrangement of Events in SCQ Example
Table 2-1: Some Important Operating Variables Maintained in Twol Back-end
Table 2-2: Arrangement of Events in T1 of Twol
Table 2-3: Performance of Priority Queues
Table 2-4: Priority Increment Distributions
Table 2-5: Benchmarking Platforms
Table 2-6: Relative Performance for Exponential Distribution on Intel Pentium
Table 2-7: Relative Average Performance for Distributions With Constant Mean of Jump on Intel Pentium
Table 2-8: Relative Average Performance for Distributions With Varying Mean of Jump on Intel Pentium
Table 2-9: Relative Average Performance for All Distributions on Intel Pentium
Table 2-10: Speedup of Twol Algorithm on Different Hardware Architectures – Comparison by Priority Increment Distribution
Table 2-11: Speedup of Twol Algorithm on Different Hardware Architectures – Comparison by Queue Size
Table 3-1: Important Operating Variables Maintained in RCB Structure
Table 3-2: Operating Variables Maintained in the Twol Back-end Structure
Table 3-3: Attributes of Node Pointers in Figure 3-2
Table 3-4: Attributes of Node Pointers in Figure 3-3
Table 3-5: Relative Average Performance for Distributions With Constant Mean of Jump on Intel Pentium
Table 3-6: Relative Average Performance for Distributions With Varying Mean of Jump on Intel Pentium
Table
3-7: Relative Average Performance for All Distributions on Intel Pentium
Table 4-1: Some Important Operating Variables Maintained in LadderQ
Table 4-2: Relative Average Performance for All Distributions on Intel Pentium
Table 4-3: Maximum Number of Rungs Utilized in Classic Hold and Up/Down Experiments

The TransferNodes function consists of two major sub-operations: transferring one node at a time (TN17 – TN33), and completing the transfer by replacing the current head nodes in the T2 buckets once all the buckets are empty (TN34 – TN62). The transfer_mgr holds the information needed to determine the parameters of the new T1 structure. Using T2min, T2max and T2num, we derive the new bucketwidth using the expression:

    Bucketwidth of T1 = T1bw→key = ( max_i( T2max[i]→key ) − min_i( T2min[i]→key ) ) / T2num

The new T1start_time is set to T2min→key. In addition, the transfer_mgr contains an important variable called nextid, which determines the bucket from which a particular process should start transferring. For instance, if there are five concurrent processes operating a lock-free Twol, there exist five T2 buckets, T2[i] where i = 0, 1, …, 4. At the onset, nextid equals 0. Just before the first process attempts to transfer, it increments nextid atomically by one (TN9 – TN12). This methodology reduces contention because each bucket is initially transferred by only a single process. After the first process completes the transfer of T2[0], and if nextid is still less than numthreads (TN33), it starts over to attempt to transfer another bucket. If, however, nextid is greater than or equal to numthreads, it jumps to update to attempt to complete the transfer by employing the head-changing procedure to swap the old head nodes for new ones (i.e. T2head[] replaced with newT2head[]) (TN55). If a transfer2[X].next is not TAIL, it means that bucket X has not been fully transferred.
To satisfy the non-blocking condition, this particular process updates nextid to X and starts the transfer process over again (TN44), the difference being that it starts transferring from T2[X] instead of T2[0]. After all the buckets are empty, a new T1mgr, new T2mgrs and new T2heads are created to replace the old ones (TN45 – TN47). All of these are updated in one single MCAS (TN59). Note that this function exits either when T1mgr, T2mgr, transfer_mgr or a T2 head is detected to have changed (TN3, TN6, TN18, TN22, TN35), or when a T2 bucket is detected to be invalid (TN42).

// Local variables
Manager_pt transfer_mgr, newT1mgr;
Manager_ppt newT2mgr;
Node_pt T2min, T2max, thisT2head, thisT2head_next, tx2, tx2_next, tx2_nnext, head, head_next;
Node_ppt T2mgr, T2head, T2head_next, transfer2, transfer2_next, newT2head;
Bucket_pt bucket;
Integer index, transferid, T2num, ret;
Double newT1bw_time, newT1start_time;

void TransferNodes (Twol_pt l, Manager_pt T1mgr, Manager_pt T2mgr0, Node_pt T2head0) {
TN1     (T2head[0], T2mgr[0]) := (T2head0, T2mgr0);
TN2     transfer_mgr := MCASRead(&l→transfer_mgr);
start_over:
TN3     if (l→T1mgr ≠ T1mgr or l→T2mgr[0] ≠ T2mgr[0] or l→T2[0].head ≠ T2head0)
TN4         return;
TN5     (T2min, T2max, T2num) := transfer_mgr→(T2min, T2max, T2num);
TN6     if (l→transfer_mgr ≠ transfer_mgr) return;
TN7     newT1bw_time := (T2max→key – T2min→key) / T2num;
TN8     newT1start_time := T2min→key;
TN9     do {
TN10        if (transfer_mgr→nextid < numthreads) transferid := transfer_mgr→nextid;
TN11        else goto update;
TN12    } while (transferid < numthreads and CAS(&transfer_mgr→nextid, transferid, transferid+1) ≠ transferid);
TN13    if (transferid ≥ numthreads) goto update;
TN14    thisT2head := MCASRead_Return(&l→T2[transferid].head);
TN15    thisT2head_next := MCASRead(&thisT2head→next);
TN16    if ((tx2 := GET_UNMARKED(thisT2head_next)) = TAIL) goto start_over;
transfer_nodes:
TN17    while (tx2→next ≠ TAIL) {
TN18        if (l→T2[0].head ≠ T2head[0]) return;
TN19        tx2_next := MCASRead(&tx2→next);
TN20        if (tx2_next = TAIL) break;
TN21        index := (tx2_next→key – newT1start_time) / newT1bw_time;
TN22        if (l→T1mgr ≠ T1mgr or l→T2[0].head ≠ T2head[0]) return;
TN23        bucket := &l→T1[index];
TN24        head := MCASRead_Return(&bucket→head);
TN25        if (¬SearchT1(l, index, head, tx2_next→key, transfer_mgr, &prev, &next))
TN26            return;
TN27        head_next := MCASRead(&head→next);
TN28        tx2_nnext := MCASRead(&tx2_next→next);
TN29        (ptr[0], old[0], new[0]) := (&tx2→next, tx2_next, tx2_nnext);
TN30        (ptr[1], old[1], new[1]) := (&prev→next, next, tx2_next);
TN31        (ptr[2], old[2], new[2]) := (&tx2_next→next, tx2_nnext, next);
TN32        ret := MCAS (3, ptr, old, new);
        }
TN33    if (transfer_mgr→nextid < numthreads) goto start_over;    /* help next T2[] */
update:
TN34    while (ret = TRUE or l→transfer_mgr = transfer_mgr) {
TN35        if (l→T1mgr ≠ T1mgr or l→T2mgr[0] ≠ T2mgr[0] or l→T2[0].head ≠ T2head[0]) return;
TN36        (T1bw, T1start, T2cur) := (T1mgr→T1bw, T1mgr→T1start, T2mgr[0]→T2cur);
TN37        T2mgr[1,M-1] := MCASRead(&l→T2mgr[1,M-1]);
TN38        for (i := 0; i < numthreads; i++) {
TN39            T2head[i] := MCASRead_Return(&l→T2[i].head);
TN40            T2head_next[i] := MCASRead(&T2head[i]→next);
TN41            if ((transfer2[i] := GET_UNMARKED(T2head_next[i])) = NULL)
TN42                return;
TN43            if ((transfer2_next[i] := MCASRead(&transfer2[i]→next)) ≠ TAIL) {
TN44                CAS(&transfer_mgr→nextid, numthreads, i); goto start_over;
            }
        }
TN45        newT1mgr := CreateManager(newT1bw, newT1start, NULL, T2num);
TN46        newT2mgr[0,M-1] := CreateManager(newT2cur, keymax, keymin, 0);
TN47        newT2head[0,M-1] := CreateNode();
TN48        j := 0;
TN49        for (i := 0; i < numthreads; i++)
TN50            (ptr[j], old[j], new[j++]) := (&l→T2mgr[i], T2mgr[i], newT2mgr[i]);
TN51        (ptr[j], old[j], new[j++]) := (&l→T1mgr, T1mgr, newT1mgr);
TN52        for (i := 0; i < numthreads; i++)
TN53            (ptr[j], old[j], new[j++]) := (&T2head[i]→next, T2head_next[i], NULL);
TN54        for (i := 0; i < numthreads; i++)
TN55            (ptr[j], old[j], new[j++]) := (&l→T2[i].head, T2head[i], newT2head[i]);
TN56        for (i := 0; i < numthreads; i++)
TN57            (ptr[j], old[j], new[j++]) := (&transfer2[i]→next, transfer2_next[i], NULL);
TN58        (ptr[j], old[j], new[j++]) := (&l→transfer_mgr, transfer_mgr, NULL);
TN59        if (MCAS (j, ptr, old, new)) {
TN60            Free(transfer2[]); Free(T2head[]); Free(T1mgr); Free(T2mgr[]); Free(transfer_mgr); return;
TN61        } else {
                Free(newT2head[]); Free(newT1mgr); Free(newT2mgr[]);
TN62        }
    }
}

A.6. Insert

The Insert function takes a key and a value, creates a new node, and inserts it into the Twol structure. T2cur holds the threshold value (T2cur_time := T2cur→key): if the key is greater than T2cur_time, the new node is to be inserted into T2; if it is smaller or equal, it is inserted into T1. According to its process id, T2mgr, T2head and T2head_next are safely read (I3, I6, I7). If T2head_next is already marked, a transfer is ongoing, so the process helps with the transfer first before proceeding with the insertion (I9 – I13). If
Thus it should help to transfer first before continuing with the insertion of node (I38). If T1index does not equal to T1end, the bucket index is calculated (I39) and an Insert_T1 function is called (I40). If Insert_T1 succeeds, this function exits (40). Otherwise it retries to insert into T1 again. // Local variables Manager_pt T2mgr, T2mgr0, T1mgr; Node_pt newNode,T2cur,T2head,T2head_next,T2head0,T2min,T2max,T1bw,T1start; Integer ret, T1index, T1end, bucket_index; Double T2cur_time; void rt (Twol_pt l, key_t k, value_t v) { I1 newNode := CreateNode(k, v, NULL); I2 while (TRUE ) { I3 T2mgr := MCASRead(&l→T2mgr[id]); I4 T2cur := T2mgr→T2cur; 179 I5 I6 I7 I8 I9 I10 I11 I12 I13 I14 I15 I16 I17 I18 I19 I20 I21 I22 I23 I24 I25 I26 I27 I28 I29 I30 I31 I32 I33 I34 I35 I36 I37 I38 I39 I40 I41 I42 A.7. if (T2cur ≠ NULL) T2cur_time := T2cur→key; T2head := MCASRead_Cont(&l→T2[id].head); T2head_next := MCASRead(&T2head→next); if (T2head_next = NULL) continue; if (IS_MARKED(T2head_next)) { /* help transfer */ T2mgr0 := MCASRead(&l→T2mgr[0]); T2head0 := MCASRead_Cont(&l→T2[id].head); if (l→T2mgr[id] ≠ T2mgr) continue; TransferNodes (l, T2mgr0, T2head0); continue; } if (k > T2cur_time) { /* Insert into T2 */ (T2min,T2max) := T2mgr→(T2min,T2max); newNode →next := T2head_next; if (k ≥ T2min→key and k ≤ T2max→key) { (ptr[0],old[0],new[0]) := (&T2head→next, T2head_next, newNode); if (MCAS(1, ptr, old, new)) ret := TRUE; } else if (k ≤ T2min→k and k ≥ T2max→k) { (ptr[0], old[0], new[0]) := (&T2head→next, T2head_next, newNode); (ptr[1], old[1], new[1]) := (&T2mgr→T2min, T2min, newNode); (ptr[2], old[2], new[2]) := (&T2mgr→T2max, T2max, newNode); ret := MCAS (3, ptr, old, new); } else if (k ≤ T2min→k) { (ptr[0], old[0], new[0]) := (&T2head→next, T2head_next, newNode); (ptr[1], old[1], new[1]) := (&T2mgr→T2min, T2min, newNode); ret := MCAS (2, ptr, old, new); } else if (k ≥ T2max→k) { (ptr[0], old[0], new[0]) := (&T2head→next, T2head_next, newNode); (ptr[1], old[1], new[1]) := 
(&T2mgr→T2max, T2max, newNode); ret := MCAS (2, ptr, old, new); } if (ret = TRUE) { INC (&QueueSize); goto OUT; } continue; } insert_T1: /* Insert into T1 */ T1mgr := MCASRead(&l→T1mgr); (T1bw,T1start, T1end) := T1mgr→(T1bw,T1start ,T1end); T1index := MCASRead(&T1mgr→T1index); if (T1index = T1end) continue; /* help transfer first */ bucket_index := (k – T1start→key) / T1bw→key; if (ret := InsertT1 (l, T1mgr, bucket_index, newNode) = TRUE) goto OUT; else goto insert _T1; } OUT: } DeleteMin The DeleteMin function locates the node with the highest priority (minimum key) and returns its value. The function starts-off by checking if T1index equals T1end. If it is, means that all the buckets in T1 are now empty and invalid. As such a transfer is necessary to transfer the nodes from T2 to T1. It first checks if TransferInit has already been called by checking if T2head_next is marked (D8). If so, then the 180 TransferNodes function will be called and this process will help in transferring the nodes first before attempting a deletion. If T2head_next is not marked, a transfer is initiated (D10). The TransferInit function returns four possible values to indicate four scenarios: EMPTY, NONEED, HELPTRANSFER and INITIATED. These have been explained earlier in the function TransferInit. If EMPTY, means that the entire queue is empty (D14). If NONEED, means that a transfer has just been completed and it should try again to delete a node (D10,D11). If INITIATED or HELPTRANSFER, it means that a transfer has already been initiated and this process only needs to help to transfer the nodes from T2 to T1 (D12,D13). If T1index does not equal to T1end, it means that there is at least one node in T1. Thus the FinalDelete function is called to remove the highest priority node from the most current valid bucket which corresponds to T1[T1index] (D18). If the value returned from FinalDelete is valid (not NULL), it means that the node has already been deleted and the value is returned. 
If the value is invalid, however, the bucket is already empty. The onus is then on DeleteMin to increment T1index to indicate that the current bucket is empty and that insertions of nodes into the bucket are now prohibited. Achieving this requires three updates: incrementing T1index; setting the bucket head's next pointer to NULL; and swapping in a new head for the current bucket (D21 – D23). These updates are done in a single MCAS (D24), and the function tries again. Note that the rationale for D22 is to indicate that the list in the bucket is now invalid, while D23 swaps in a new head with a valid list for the bucket, ready for the next transfer from T2 to T1. This is done not only for efficiency but also for correctness. It is more efficient because, if the swap were not done, the transfer function would have to validate all the buckets in a single MCAS, which can be quite expensive due to the high contention as well as the large number of locations to be updated in one go. But simply validating the buckets could cause errors, because an outdated process that was attempting to wrongly insert a node might observe that the list is valid and insert the node into the wrong bucket. This again underlines the essential need for head changing (see Section 5.4.2).
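The MCAS operations used throughout this algorithm are built on top of the single-word compare-and-swap (CAS) primitive. As an illustrative C11 sketch (not the thesis implementation; note that the pseudocode's CAS returns the location's previous value, whereas C11's compare-exchange returns a success flag):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Single-word CAS: the write takes effect only if the location still
 * holds the expected value; a failure tells the caller that some other
 * process has made progress in the meantime. */
static bool cas(_Atomic long *addr, long expected, long desired) {
    return atomic_compare_exchange_strong(addr, &expected, desired);
}
```

A failed CAS is exactly the signal the lock-free Twol functions react to by retrying or by helping the competing operation to completion.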
// Local variables
Manager_pt T1mgr, T2mgr0;
Node_pt T2head0, T2head_next, head, head_next, newHead;
Integer T1index, T1end, ti_ret, curT1index;
value_t value;

value_t DeleteMin (Twol_pt l) {
D1      while (TRUE) {
D2          T1mgr := MCASRead(&l→T1mgr);
D3          (T1index, T1end) := T1mgr→(T1index, T1end);
D4          if (T1index = T1end) {
D5              T2head0 := MCASRead_Cont(&l→T2[0].head);
D6              T2head_next := MCASRead(&T2head0→next);
D7              T2mgr0 := MCASRead_Cont(&l→T2mgr[0]);
D8              if (IS_MARKED(T2head_next)) {               /* help transfer */
D9                  TransferNodes (l, T2mgr0, T2head0); continue; }
D10             if ((ti_ret := TransferInit (l, T1mgr, T2mgr0, T2head0)) = NONEED)
D11                 continue;
D12             else if (ti_ret = INITIATED or ti_ret = HELPTRANSFER) {
D13                 TransferNodes (l, T2mgr0, T2head0); continue; }
D14             else if (ti_ret = EMPTY) return EMPTY;
            }
D15         head := MCASRead(&l→T1[T1index].head);
D16         if ((curT1index := MCASRead(&T1mgr→T1index)) ≠ T1index) continue;
D17         if (l→T1mgr ≠ T1mgr) continue;                  /* ensures correct T2head read */
D18         if ((value := FinalDelete(l, T1index, head)) ≠ NULL) return value;
D19         if (T1mgr→T1index ≠ T1index) continue;          /* bucket already skipped */
D20         if ((head_next := MCASRead(&head→next)) ≠ TAIL) continue;
D21         (ptr[0], old[0], new[0]) := (&T1mgr→T1index, T1index, newT1index);
D22         (ptr[1], old[1], new[1]) := (&head→next, head_next, NULL);
D23         (ptr[2], old[2], new[2]) := (&l→T1[T1index].head, head, newHead);
D24         if (MCAS (3, ptr, old, new)) Free(head);
D25         else Free(newHead);
D26         continue;
    }
}

B. Proof of Correctness of Lock-free Twol Algorithm

Linearizability [Herlihy and Wing 1990] is a correctness condition for concurrent objects that exploits the semantics of an abstract data type (ADT) such as a priority queue. Formally, a concurrent implementation is considered linearizable if every operation by concurrent processes appears to take effect instantaneously at some point between its invocation and its response, implying that the meaning of operations can be given by pre- and post-conditions [Herlihy and Wing 1990].
This particular instant is defined as the linearization (or linearizability) point. For every linearizable concurrent execution, there should also exist an equivalent sequential execution which adheres to the pre- and post-conditions of the sequential semantics of the ADT. This means that the correctness of an ADT is essentially the mapping of a concurrent history to an equivalent valid sequential history. Thus a correct concurrent object must satisfy this condition for all concurrent histories. For a complex ADT structure, however, it is highly difficult to enumerate the histories for all cases. Fortunately, the mapping of concurrent histories has been shown to be always possible (i.e. the concurrent object is correct) if every operation that modifies the shared object has a linearization point [Sundell and Tsigas 2005]. Thus, to prove that a concurrent implementation is linearizable, the following methodology, devised by Sundell and Tsigas [2005], can be employed:

• Define precise sequential semantics
• Define the abstract state and its interpretation – show that each state is atomically updated
• Define linearizability points – show that operations take effect atomically at these points with respect to the sequential semantics

The lock-free Twol structure is first proved to be linearizable, followed by the proof that it is lock-free. Some definitions are first stated to help present and explain the proof. We first define the syntax which will be used in presenting the semantics, followed by the sequential semantics of our priority queue operations.

DEFINITION B.1: The syntax is S1 : O1, S2, where S1 is the conditional state before the operation O1, and S2 is the resulting state after performing the corresponding operation.

DEFINITION B.2: Qt denotes the abstract internal state of the Twol priority queue at time t. Qt is viewed as a set of pairs 〈p, v〉, where p is the priority (or key) and v is its corresponding value.
Q1t denotes the abstract internal state of Twol's first-tier structure T1, and Q2t represents the state of the second-tier T2, where Qt = Q1t ∪ Q2t. The operations that can be carried out on the priority queue are Insert (I) and DeleteMin (DM). The time just before the atomic execution of an operation is defined as t1, and the time just after as t2. We denote Prev and Next as contiguous priority elements within a linked list in a bucket.

(Insert into T1)  {p1 ≤ T2cur, Prev ≤ p1 < Next} : I(〈p1, v1〉) = true, Q1t2 = Q1t1 ∪ {〈p1, v1〉}   (1)

(Insert into T2)  {p1 > T2cur} : I(〈p1, v1〉) = true, Q2t2 = Q2t1 ∪ {〈p1, v1〉}   (2)

(DeleteMin)  {〈p1, v1〉 = 〈p, v〉 ∈ Qt1} : DM() = 〈p1, v1〉, Qt2 = Qt1 \ {〈p1, v1〉}   (3)

(DeleteMin)  {Qt1 = ∅} : DM() = ⊥   (4)

LEMMA B.1: The definition of the abstract internal state for our Twol implementation is consistent with all concurrent operations examining the state of the priority queue.

PROOF: Each bucket in T1 and T2 is made up of a linked list with a head node pointer. Since the head and next pointers are changed only using the MCAS operation, all processes see the same state of the Twol priority queue. Therefore all changes of the abstract state appear to be atomic. □

LEMMA B.2: While a transfer of priority elements from T2 to T1 is occurring, Insert and DeleteMin cannot successfully occur, i.e. the inserting and deleting of elements cannot take place.

PROOF: Before a transfer begins, the next pointers in all the buckets of both T1 and T2 are ensured to be marked. Only then does the transfer operation start to move the priority elements from T2 to T1. Since the next pointers are changed atomically, all the processes executing the Insert and DeleteMin operations will see the same state, i.e. that the next pointers are marked, denoting that a transfer is occurring.
Therefore those processes will suspend their intended operation (i.e. Insert or DeleteMin) and help in the transfer until all the elements are transferred from T2 to T1. When the transfer is complete, all the next pointers will be unmarked and those processes will continue to execute their intended operation. □

LEMMA B.3: An Insert operation where 〈p, v〉 is successfully inserted into T1, i.e. I(〈p, v〉) = true, takes effect atomically at one statement.

PROOF: The linearization point (LP) for an Insert into T1 which succeeds (I(〈p, v〉) = true) is when the MCAS sub-operation in line IT17 (in InsertT1) succeeds, after which the Insert operation finally returns true. The pre-condition directly before the passing of the LP must have been p ≤ T2cur; otherwise, the element would have been inserted into T2. The SearchT1 function determines that p is to be inserted between Prev and Next, and these are guaranteed to be consistent (from the point SearchT1 has decided until just before IT17), else the MCAS would have failed. The state of the priority queue directly after passing the LP will be Q1t2 = Q1t1 ∪ {〈p1, v1〉}, i.e. Qt2 = Qt1 ∪ {〈p1, v1〉}. □

LEMMA B.4: An Insert operation where 〈p, v〉 is successfully inserted into T2, i.e. I(〈p, v〉) = true, takes effect atomically at one statement.

PROOF: The LP for an Insert into T2 which succeeds (I(〈p, v〉) = true) is when the MCAS sub-operation in line I19, I24, I28 or I32 succeeds, after which the Insert operation finally returns true. There are four possibilities because Twol's algorithm needs to have updated T2min and T2max, which denote the minimum and maximum priority found in T2 respectively. The pre-condition directly before the passing of the LP must have been p1 > T2cur (see I14). The state of the priority queue directly after passing the LP will be Q2t2 = Q2t1 ∪ {〈p1, v1〉}, i.e. Qt2 = Qt1 ∪ {〈p1, v1〉}. □

LEMMA B.5: A DeleteMin operation which succeeds, i.e.
DM() = 〈p1, v1〉 where 〈p1, v1〉 = 〈p, v〉 ∈ Qt1, takes effect atomically at one statement.

PROOF: The LP for a DeleteMin operation which succeeds (DM() = 〈p1, v1〉) is when the MCAS sub-operation in line FD11 (in FinalDelete) succeeds. The pre-condition directly before the passing of the LP must have been 〈p1, v1〉 = 〈p, v〉 ∈ Qt1, since the element to be deleted is the first element from the head of the most current bucket. The state of the priority queue after passing the LP will be Qt2 = Qt1 \ {〈p1, v1〉}. □

LEMMA B.6: A DeleteMin operation which fails, i.e. DM() = ⊥, takes effect atomically at one statement.

PROOF: The LP for a DeleteMin operation which fails (DM() = ⊥) is when the MCAS sub-operation in line TI24 (in TransferInit) succeeds. The pre-condition directly before the passing of the LP must have been Qt1 = ∅. This is because the success of the MCAS means that all the buckets in T1 are invalidated and all the heads in T2 point to the Tail node, i.e. Q1t1 = ∅, Q2t1 = ∅. □

LEMMA B.7: One operation in our Twol implementation will always progress regardless of the actions of the other concurrent operations.

PROOF: There are several potential recurrent loops in our implementation which can delay the termination of the operations, thus possibly preventing the operations in the implementation from progressing. However, we shall demonstrate that these recurrent loops occur only if some other operations have made progress (i.e. there is still progression in the system). The invocations of these loops are found in:

FinalDelete (FD6, FD13)
• FD6 occurs only if some other operation has already deleted element D. The operation thus tries again to delete from subsequent elements; once the bucket is empty or invalid, it terminates.
• The failure of the MCAS (FD13) is possible only if some other operations have progressed by changing the next or value pointers.
TransferInit (TI26, TI41)
• TI26 occurs only if some other operations have inserted new elements during the time taken to execute TI13–TI23, thus resulting in a failed MCAS (TI24).
• TI41 occurs only if either some other operations have inserted new elements into the T2 buckets or a TransferInit has already occurred, causing the MCAS operation to fail.

TransferNodes (TN16, TN26, TN33, TN44)
• TN16 occurs due to the operation's semantics of transferring only T2 buckets which are not empty.
• TN26 occurs only if some other processes are helping to transfer into the same T1 bucket; the Prev and Next returned by SearchT1 have changed, which means that some other processes executing this particular operation have progressed.
• TN33 happens when not all the T2 buckets have been transferred, which obeys the operation's semantics.
• TN44 occurs only if some T2 buckets have yet to finish transferring their elements. The operation is therefore required to start over and help in the transfer starting from that bucket.

DeleteMin (D5, D7, D9, D11, D13, D17, D19, D20, D26)
• D5, D7 and D17 occur only if other processes have progressed by transferring elements from T2 to T1.
• D9 and D13 retry the operation because the process has just finished helping in the transfer of elements.
• D11 occurs upon the completion of the TransferInit operation.
• D19 occurs because another process has just incremented T1index.
• D20 occurs because, just before T1index is incremented by this process, another operation has progressed by inserting a new element into the current bucket.
• D26 obeys the operation semantics by attempting again to delete an element.

InsertT1 (IT9, IT10, IT12, IT18)
• IT9 tries again to find the most current bucket, as the previously determined bucket has been invalidated by another operation.
• IT10 occurs only if the bucket (where the element is to be inserted) has been invalidated by another operation.
• IT12 occurs only if SearchT1 returns false.
This can occur if either the current bucket has changed (T1index has been incremented) or a transfer has already occurred.
• The failure of the MCAS sub-operation (IT18) means that the Prev and Next elements differ from the ones obtained by SearchT1. It implies that some other operations have already progressed by inserting at least one element between Prev and Next, or that Prev has already been deleted. This operation therefore retries only if at least one other process has progressed.

Insert (I6, I8, I11–I13, I34, I38, I41)
• I6, I8, I11 and I12 occur only if some other process has progressed by changing the head pointer in the T2 bucket, i.e. the transferring of elements has been completed.
• I13 performs a TransferNodes operation to help other processes complete the transfer, and Insert should then restart since T2cur must have changed after a transfer.
• I34 occurs when the MCAS sub-operation has failed, which is possible only if some other process has inserted at least one new element into T2, or a transfer has already been initiated.
• I38 checks whether T1index = T1end, which means that T1 is now empty and a transfer is about to occur or has already been initiated.
• I41 attempts to insert the new element into the next valid bucket because another operation has already incremented T1index. □

THEOREM B.1: The Twol algorithm implements a linearizable and lock-free priority queue.

PROOF: We have shown that the operations take effect atomically at the linearization points (Lemmas B.3, B.4, B.5, B.6), which respect the sequential semantics given in Definition B.2. Together with Lemma B.7, this proves that our Twol priority queue implementation is linearizable and lock-free. □

List of Publications

GOH, R. S. M. AND THNG, L. J. I. 2003. Mlist: An efficient pending event set structure for discrete event simulation. International Journal of Simulation 4, 5-6 (Dec.), 66-77.

GOH, R. S. M. AND THNG, L. J. I. 2004. An improved dynamic Mlist for the pending event set problem.
International Journal of Simulation 5, 3-4 (Sep.), 26-36.

GOH, R. S. M., TANG, W. T., THNG, L. J. I. AND QUIETA, M. T. R. 2004. The demarcate construction: A new form of tree-based priority queues. Informatica: An International Journal of Computing and Informatics 28 (Nov.), 277-287.

GOH, R. S. M. AND THNG, L. J. I. 2004. Multiq - A multi-tier multi-list based priority queue structure for stochastic discrete event simulation. In Proceedings of the 15th IASTED International Conference on Modelling and Simulation, California, USA, 463-468.

GOH, R. S. M. AND THNG, L. J. I. 2004. Dsplay: An efficient dynamic priority queue structure for discrete event simulation. In Proceedings of the Simulation Technology Training (SimTecT) 2004 Conference, Canberra, Australia, 66-71.

GOH, R. S. M., TANG, W. T., THNG, L. J. I. AND QUIETA, M. T. R. 2004. A new form of tree-based priority queues for discrete event simulation. In Proceedings of 2004 High Performance Computing & Simulation, in conjunction with the 18th European Simulation Multi-conference, Magdeburg, Germany, 23-28.

GOH, R. S. M. AND THNG, L. J. I. 2004. Dynamic cost-based multi-tier linked list. In Proceedings of IASTED Applied Simulation and Modelling 2004, Rhodes, Greece, 287-292.

GOH, R. S. M. AND THNG, L. J. I. 2005. Twol-amalgamated priority queues. ACM Journal of Experimental Algorithmics 9, 1.6 (Apr.), 1-45.

TANG, W. T., GOH, R. S. M. AND THNG, L. J. I. 2005. Ladder queue: An O(1) priority queue structure for large-scale discrete event simulations. ACM Transactions on Modeling and Computer Simulation (TOMACS) 15, 3, 175-204.

GOH, R. S. M. AND THNG, L. J. I. An efficient, scalable and dynamic lock-free Twol structure. IEEE Transactions on Parallel and Distributed Systems (under review).