High performance sequential and lock free parallel access priority queue structures

LIST OF TABLES Table 1-1: Arrangement of Events in SCQ Example ...7 Table 2-1: Some Important Operating Variables Maintained in Twol Back-end ...22 Table 2-2: Arrangement of Events in T1

Trang 1

LOCK-FREE PARALLEL ACCESS

PRIORITY QUEUE STRUCTURES

RICK SIOW MONG GOH

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

DEPARTMENT OF ELECTRICAL & COMPUTER ENGINEERING NATIONAL UNIVERSITY OF SINGAPORE

2006

Trang 2

LOCK-FREE PARALLEL ACCESS

PRIORITY QUEUE STRUCTURES

RICK SIOW MONG GOH

(B.Eng (Hons.), NUS)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

DEPARTMENT OF ELECTRICAL & COMPUTER ENGINEERING

NATIONAL UNIVERSITY OF SINGAPORE

2006

Trang 3

I am also very fortunate to have the love, support and encouragement of an understanding and fantastic family I am grateful to my compelling parents who have influenced me to not just be myself but rather to continue improving myself

I wish to thank Marie Therese Robles Quieta for her love and support during the past years and for the many years to come In addition, my life as a candidate would be much too difficult without the support of many good friends Though it is not possible to list all of them here, they know who they are Thank you!

Last but not least, I thank God The Father, The Son, and The Holy Spirit, for walking with me throughout this life-changing journey

Trang 4

TABLE OF CONTENTS

ACKNOWLEDGEMENTS II

TABLE OF CONTENTS III

SUMMARY VI

LIST OF TABLES IX

LIST OF FIGURES XI

LIST OF ABBREVIATIONS XIV

1 INTRODUCTION 1

1.1 P URPOSE AND A PPLICATIONS OF P RIORITY Q UEUES 2

1.2 T YPES OF P RIORITY Q UEUES 4

1.2.1 Sorted-Discipline Priority Queues 4

1.2.1.1 Sorted-discipline Calendar Queue 6

1.2.1.2 Vulnerabilities of the Sorted-discipline Calendar Queue 8

1.2.2 Unsorted-Discipline Priority Queues 10

1.3 O RIGINAL C ONTRIBUTIONS OF THE T HESIS 13

1.4 O RGANIZATION OF THE T HESIS 15

2 EDPQ AND THE TWOL BACK-END STRUCTURE 16

2.1 O UTLINE OF C HAPTER 2 17

2.2 I NTRODUCTION TO E POCH - BASED D EFERRED - SORT P RIORITY Q UEUE 18

2.3 T WO -T IER L IST -B ASED (T WOL ) B ACK - END S TRUCTURE 20

2.4 T WOL A LGORITHM – D ELETE M IN (S) O PERATION 22

2.4.1 Successive DeleteMin Operations Creates Epochs 26

2.5 T WOL A LGORITHM – I NSERT ( E ,S) O PERATION 27

2.6 T HEORETICAL A NALYSIS OF E POCH - BASED D EFERRED - SORT P RIORITY Q UEUE 28

2.6.1 Scenario and Conditions for Theoretical Analysis, and 1-epoch EDPQ 28

2.6.2 Complexity of 1-epoch EDPQ 30

2.6.3 Similarities and Differences of the Twol back-end to the UCQ 33

2.6.4 Important Results for the 1-epoch EDPQ 34

2.6.5 Complexity of Conventional EDPQ 38

2.7 P ERFORMANCE M EASUREMENT T ECHNIQUES 40

2.7.1 Access Pattern Models 40

2.7.2 Priority Increment Distributions 41

2.7.3 Benchmark Architecture and the Effect of Cache Memory 42

Trang 5

2.8 N UMERICAL A NALYSIS 46

2.8.1 Performance on Intel Pentium 4 (Cache Disabled) 46

2.8.2 Performance on the Intel Itanium 2 (SGI Altix 3300), SGI MIPS R16000 (SGI Onyx4) and AMD Athlon MP 61

2.8.3 Cost vs Performance Consideration 64

2.8.4 Performance Evaluation Via Swan on Intel Pentium 4 (Cache-Enabled) 66

2.9 S UMMARY 69

3 TWOL WITH IMPROVED FRONT-END 71

3.2 RCB+T WOL P RIORITY Q UEUE S TRUCTURE 72

3.3 RCB+T WOL A LGORITHM 75

3.3.1 Split Operation 75

3.3.2 DeleteMin Operation 77

3.3.3 Insert Operation 80

3.3.3.1 Insert_List Function 80

3.3.3.2 Insert_Node Function 82

3.4 N UMERICAL A NALYSIS & S UMMARY 85

4 LADDER QUEUE 90

4.2 S UMMARY OF E SSENTIAL P RINCIPLES 91

4.3 B ASIC S TRUCTURE OF L ADDER Q UEUE 92

4.4 L ADDER Q UEUE A LGORITHM 94

4.4.1 DeleteMin Operation 95

4.4.2 Successive DeleteMin Operations Creates LadderQ Epochs 98

4.4.3 Insert Operation 99

4.5 S PAWNING VS R ESIZE AND V ALUE OF THRES 100

4.6 P RACTICAL A SPECTS OF L ADDER Q – I NFINITE R UNG S PAWNING AND R EUSING L ADDER S TRUCTURE 102

4.7 T HEORETICAL ANALYSIS OF L ADDER Q UEUE ’ S O(1)A MORTIZED T IME C OMPLEXITY 104

4.7.1 Scenario and Conditions for Theoretical Analysis and the 1-epoch LadderQ 104

4.7.2 Complexity of 1-epoch LadderQ 108

4.7.3 Similarity of 1-epoch LadderQ to the UCQ 110

4.7.4 Useful Lemmas Applicable to the 1-epoch LadderQ 110

4.7.5 Theorems for the 1-epoch LadderQ’s O(1) Amortized Complexity 112

4.7.6 Theorems for the Conventional LadderQ’s O(1) Amortized Complexity 114

4.8 N UMERICAL S TUDIES 115

Trang 6

4.8.1 Performance on Intel Pentium 4 (Cache Disabled) 116

4.8.2 Effect of Bucketwidth on the Performance of LadderQ 119

4.9 S UMMARY 120

5 LOCK-FREE TWOL PRIORITY QUEUE 121

5.2 I NTRODUCTION 122

5.2.1 CAS/MCAS for Lock-free Queues, Linearization, Process Helping and Pointer Marking 125

5.2.2 Current Lock-Free Queue Structures 130

5.2.3 A Review of Sequential Twol Back-end Structure 130

5.3 L OCK - FREE T WOL S TRUCTURE 133

5.4 M ORE S UBTLE F EATURES OF L OCK - FREE T WOL S TRUCTURE 135

5.4.1 Detecting an Invalid Bucket in T1 135

5.4.2 Detecting a Transferred T2 and Head Changing Solution 137

5.4.3 Custom Pointer Marking for Lock-free Twol 139

5.4.4 Additional Transfer Nodes To Speed Up Transfer 141

5.4.5 Process Helping in Lock-free Twol 142

5.4.6 Summary of the Properties of Lock-free Twol 143

5.5 N UMERICAL A NALYSIS 145

5.5.1 AMD Athlon MP – 2 Processors 147

5.5.2 Gateway ALR 9000 – 6 Processors 151

6 CONCLUSIONS 156

6.1 F UTURE R ESEARCH 158

BIBLIOGRAPHY 160

APPENDICES 164

A L OCK - FREE T WOL A LGORITHM IN D ETAIL 164

A.1 SearchT1 167

A.2 InsertT1 168

A.3 FinalDelete 170

A.4 TransferInit 172

A.5 TransferNodes 174

A.6 Insert 177

A.7 DeleteMin 179

B P ROOF OF C ORRECTNESS OF L OCK - FREE T WOL A LGORITHM 182

LIST OF PUBLICATIONS 190

Trang 7

SUMMARY

This thesis presents original work in the area of high performance sequential and lock-free parallel access priority queues The choice of a high speed priority queue structure plays an integral role in numerous applications and particularly for sizeable scenarios such as large-scale discrete event simulations The conventional

priority queues fall into two categories: sorted-discipline and unsorted-discipline

The limitation of a sorted-discipline priority queue is that its Insert operation is slow since a new event has to determine a correct position to be enqueued However, its DeleteMin operation, where the highest priority event is removed, is fast since the events are already sorted On the other hand, the negative aspect of an unsorted-discipline priority queue is that its DeleteMin operation is time-consuming but its Insert operation is swift as a new event needs only to be appended to the set of unsorted events

We, for the first time, present a new paradigm called epoch-based

deferred-sort priority queues (EDPQs) which have a combination of both deferred-sorted and

unsorted-discipline structures, bringing together the best of both categories of priority queues The new EDPQ paradigm is based on a new and novel back-end

unsorted-discipline structure called Twol (Two-tier List-based), since it is made up

of two tiers of linked lists The front-end structure of the EPDQ is a sorted-discipline priority queue

The new Twol back-end is essentially the deferred-sort (or unsorted) portion

of the EDPQ and is primarily responsible for reducing queue access times significantly Another novelty in the Twol back-end is its ability to adapt to changing run-time scenarios as it is an epoch-based structure (hence the term epoch-

Trang 8

based in EDPQ) By being epoch-based, this means that the Twol back-end is a dynamic structure that undergoes several birth-death processes, which we call an epoch, as run-time progresses The natural death of the old Twol structure will result

in the birth of a new Twol structure that will make use of the latest optimized parameters for the current enqueued events This ensures that the parameters of the Twol back-end are always kept up to date according to the distribution of the queued events This results in a highly optimized queue access performance This thesis will

also demonstrate that any EDPQ, using the Twol back-end, will have a constant-µ

O(1) amortized complexity, irrespective of the front-end structure The term

“constant-µ” is taken to mean that the average jump parameter of the priority

increment distribution is a constant Furthermore, this thesis contributes simulation results of Twol-based EDPQs outperforming conventional priority queues with an average speedup of 300% to 500% and as much as 1000% on widely different hardware architectures

Another major contribution of the thesis is the contribution of a specialized front-end structure for customized use with the Twol back-end The front-end

structure, called Reactive Cost-based (RCB), provides another additional

queue-access speedup compared to the use of conventional structures for the EDPQ’s end The RCB front-end has a high-quality cost-based heuristic mechanism which allows it to react in different event scenarios

front-Another significant contribution made in this thesis is the Ladder Queue

(LadderQ), which is made up of a sorted linked list as its front-end and an improved Twol back-end The LadderQ adds another new dimension of improvement because

the Twol back-end enhancement results in a novel true O(1) amortized time

complexity priority queue This means that the LadderQ will ensure O(1)

Trang 9

performance even in a non-stationary µ scenario This thesis presents a relevant

theoretical proof for this property Numerical studies also demonstrate that the LadderQ outperforms all the rest of the EDPQs presented and they confirm the

theoretical predictions that the LadderQ is indeed O(1) Because LadderQ’s O(1) complexity applies even when µ varies, it is expected to efficiently handle many

more different distributions not included in this thesis as compared to the RCB+Twol combination Furthermore, we also demonstrate empirically that the

LadderQ exhibits O(1) behavior even when the variance of the jump parameter is

infinite!

For the parallel access domain, a major and significant contribution made in

this thesis is the derivation of a novel lock-free Twol priority queue which is

non-blocking, efficient and dynamic This is the first multi-tier multi-bucket structure invented for lock-free parallel access Current lock-free structures are single-tier and

do not involve multiple buckets The lock-free Twol structure has a high degree of disjoint-access parallelism and a high level of scalability It is also the first structure which involves the transferring of information between tiers in an efficient and absolutely lock-free fashion, and with a high level of process helping allowing maximum collaboration amongst processes managing the Twol priority queue to complete the transfer in a synchronized and efficient way The new structure is also

linearizable and the numerical analysis shows that it outperforms the most recent

lock-free algorithms Lock-free Twol is indeed a pioneering work for the introduction of more efficient multi-tier lock-free structures in the near future

Trang 10

LIST OF TABLES

Table 1-1: Arrangement of Events in SCQ Example 7

Table 2-1: Some Important Operating Variables Maintained in Twol Back-end 22

Table 2-2: Arrangement of Events in T1 of Twol 25

Table 2-3: Performance of Priority Queues 39

Table 2-4: Priority Increment Distributions 42

Table 2-5: Benchmarking Platforms 44

Table 2-6: Relative Performance for Exponential Distribution on Intel Pentium 4 59

Table 2-7: Relative Average Performance for Distributions With Constant Mean of Jump on Intel Pentium 4 59

Table 2-8: Relative Average Performance for Distributions With Varying Mean of Jump on Intel Pentium 4 60

Table 2-9: Relative Average Performance for All Distributions on Intel Pentium 4 60 Table 2-10: Speedup of Twol Algorithm on Different Hardware Architecture – Comparison by priority increment distribution 62

Table 2-11: Speedup of Twol Algorithm on Different Hardware Architecture – Comparison by Queue Size 63

Table 3-1: Important Operating Variables Maintained in RCB structure 74

Table 3-2: Operating Variables Maintained in the Twol Back-end Structure 74

Table 3-3: Attributes of node pointers in Figure 3-2 77

Table 3-4: Attributes of node pointers in Figure 3-3 84

Table 3-5: Relative Average Performance for Distributions With Constant Mean of Jump on Intel Pentium 4 88

Table 3-6: Relative Average Performance for Distributions With Varying Mean of Jump on Intel Pentium 4 88

Table 3-7: Relative Average Performance for All Distributions on Intel Pentium 4 89 Table 4-1: Some Important Operating Variables Maintained in LadderQ 95

Table 4-2: Relative Average Performance for All Distributions on Intel Pentium 4 .118

Table 4-3: Maximum number of rungs utilized in Classic Hold and Up/Down experiments 118

Trang 11

Table 5-1: Benchmarking Platforms 146

Table 5-2: Relative Performance on AMD Athlon MP 151

Table 5-3: Relative Performance on Gateway ALR 9000 155

Table A-1: Operating Variables Maintained in the Lock-free Twol Structure 164

Trang 12

LIST OF FIGURES

Figure 2-1: Epoch-based deferred-sort priority queue structure .18

Figure 2-2: Mean access time for Sorted-discipline Calendar Queue under Classic Hold experiments .50

Figure 2-3: Mean access time for Sorted-discipline Calendar Queue under Up/Down experiments 50

Figure 2-4: Mean access time for Dynamic Calendar Queue under Classic Hold experiments 51

Figure 2-5: Mean access time for Dynamic Calendar Queue under Up/Down experiments 51

Figure 2-6: Mean access time for Henriksen’s queue under Classic Hold experiments 52

Figure 2-7: Mean access time for Henriksen’s queue under Up/Down experiments 52

Figure 2-8: Mean access time for Splay Tree under Classic Hold experiments .53

Figure 2-9: Mean access time for Splay Tree under Up/Down experiments .53

Figure 2-10: Mean access time for Skew Heap under Classic Hold experiments 54

Figure 2-11: Mean access time for Skew Heap under Up/Down experiments .54

Figure 2-12: Mean access time for Henriksen+Twol under Classic Hold experiments 55

Figure 2-13: Mean access time for Henriksen+Twol under Up/Down experiments 55 Figure 2-14: Mean access time for SplayTree+Twol under Classic Hold experiments 56

Figure 2-15: Mean access time for SplayTree+Twol under Up/Down experiments 56 Figure 2-16: Mean access time for SkewHeap+Twol under Classic Hold experiments 57

Figure 2-17: Mean access time for SkewHeap+Twol under Up/Down experiments 57

Figure 2-18: Mean access time for List+Twol under Classic Hold experiments .58

Figure 2-19: Mean access time for List+Twol under Up/Down experiments 58

Trang 13

Figure 2-20: Run-time performance measurements in Swan on Intel Pentium 4 for

10 to 4,000 network nodes 68

Figure 2-21: Run-time performance measurements in Swan on Intel Pentium 4 for 4,000 to 400,000 network nodes .68

Figure 3-1: Basic RCB+Twol priority queue structure 73

Figure 3-2: RCB structure with additional node pointers .76

Figure 3-3: RCB structure after two DeleteMin operations .79

Figure 3-4: RCB structure after four DeleteMin operations .79

Figure 3-5: RCB structure after additional node pointers .84

Figure 3-6: Mean access time for RCB+Twol under Classic Hold experiments .87

Figure 3-7: Mean access time for RCB+Twol under Up/Down experiments 87

Figure 4-1: Basic structure of Ladder Queue .94

Figure 4-2: Comparison of average time required per element for each operation 102 Figure 4-3: Mean access time for Ladder Queue under Classic Hold experiments.116 Figure 4-4: Mean access time for Ladder Queue under Up/Down experiments 116

Figure 4-5: Mean access time of LadderQ (with widely-varying bucketwidth) for Classic Hold experiments and Exponential distribution Maximum number of rungs (MaxRungs) used during the experiments are also shown 119

Figure 5-1: Using CAS .126

Figure 5-2: Basic structure of sequential Twol back-end structure 132

Figure 5-3: Basic structure of lock-free Twol structure 135

Figure 5-4: Detecting Transferred T2 and Head Changing .138

Figure 5-5: Pointer marking during Transfer operation 141

Figure 5-6: Pointer marking with additional transfer nodes during Transfer operation 142

Figure 5-7: Mean time per Classic Hold operation for lock-free Twol with single T2 bucket on AMD Athlon MP .149

Figure 5-8: Mean time per Classic Hold operation for lock-free Twol on AMD Athlon MP .149

Figure 5-9: Mean time per Classic Hold operation for lock-free Binary Search Tree on AMD Athlon MP 150

Trang 14

Figure 5-10: Mean time per Classic Hold operation for lock-free Skip List on AMD Athlon MP .150 Figure 5-11: Mean time per Classic Hold operation for lock-free Twol with single T2 bucket on Gateway ALR 9000 .153 Figure 5-12: Mean time per Hold operation for lock-free Twol on Gateway ALR 9000 153 Figure 5-13: Mean time per Classic Hold operation for lock-free Binary Search Tree

on Gateway ALR 9000 154 Figure 5-14: Mean time per Classic Hold operation for lock-free Skip List on

Gateway ALR 9000 154 Figure A-1: Example of InsertT1 operation .170 Figure A-2: Example of FinalDelete operation .172

Trang 15

LIST OF ABBREVIATIONS

Appears

Trang 16

Chapter 1

This thesis is concerned with the study of priority queues in both the sequential and lock-free parallel access domain The study includes the characterization of the current efficient priority queues in terms of their access time complexity, and to observe their strengths and weaknesses The current priority queues either offer stability at the expense of performance or excellent efficiency for many situations but which may falter in some other scenarios One of the main objectives of the work reported in this thesis is the development of high performance sequential priority queues that feature the positive aspects found in the current algorithms but without the vulnerabilities Upon the development of these high performance sequential priority queues, the second objective has been to advocate them into the higher echelon of lock-free parallel access domain, i.e synchronization without using mutual exclusion locks

In the first chapter, we give a brief introduction on the various current priority queues with the particular aim of differentiating the positive and negative aspects that are intrinsic to these priority queues Background material and various applications of priority queues are contained in this chapter, with numerous references given for an in depth coverage In particular, we describe in detail several conventional priority queue structures We end this chapter with a summary of the

Trang 17

contributions in this thesis and a brief outline of technical subjects covered in later chapters

1.1 Purpose and Applications of Priority Queues

A pri ori y queue is an abstract data type that efficiently finds the highest priority

element within the set of elements S contained in the priority queue The two basic operations are Insert and DeleteMin (also sometimes labeled as enqueue and

dequeue operations respectively), where Insert(e,S) inserts a new element e into the

set S so that S = e ∪ S, and DeleteMin(S) returns the element e with the highest priority (i.e e = min S) and removes it from the set S so that S = S – e

Priority queues are used in a wide variety of applications They are used frequently in many problems such as branch and bound algorithms, pattern matching, data compression techniques, operating systems, real-time systems and discrete event simulations The metric of interest in these varieties of applications is the time required to perform the most common operations on the embedded priority

queue within each application, namely, the Insert and DeleteMin operations This metric is referred to as access time [Rönngren and Ayani 1997] In real-time

systems, the worst-case access time is perhaps the most important criterion in deciding the most suitable candidate out of the numerous proposed priority queue algorithms in the literature On the other hand, for discrete event simulations, the

amortized (i.e average of total cost of executing a sequence of operations on the

queue over the number of operations) access time [Tarjan 1985] is undoubtedly the most important criterion in selecting the appropriate priority queue algorithm This

Trang 18

is because from a user’s point of view, the total run-time of the simulation job is by far more important than the time taken by each individual priority queue operation

An important use of priority queues is in the area of discrete event simulation In a discrete event simulation, a priority queue is used to hold the futu e event list1 (FEL); the elements in the priority queue are often referred to as events

The FEL is defined as the set of all events generated during a simulation that have not been simulated or evaluated yet The FEL controls the flow of simulation of events with the minimum time-stamp having the highest priority and maximum time-stamp having the least priority These events should be processed in non-decreasing time-order with multiple events of equal time-stamp being processed in the order that they are inserted into the FEL If processed in this manner, causality relations between these events would not be violated, ensuring the simulation results

to be correct and deterministic Deterministic simulation results mean that exact duplicate results will be obtained when any two or more simulation runs with identical parameter settings are specified

Comfort [1984] has revealed that up to 40% of the computational effort in a simulation may be devoted on the management of the FEL alone In the management of the FEL, the Insert and DeleteMin operations (also often referred to

as enqueue and dequeue operations respectively) account for as much as 98% of all

operations on the FEL [Comfort 1984] Understandably, as a simulation system becomes larger, the length of simulation time may take days or weeks to yield results with an acceptable level of statistical error Hence the priority queue

1

The future event list is also often referred to as the pending event set

Trang 19

employed should be efficient especially for large-scale simulations that involve large number of events during the execution of simulation models

In some cases, the problem of large number of events generated can be minimized using efficient simulation techniques that simplify the simulation model

or that which manipulate the statistical properties of the model to reduce the FEL’s size An example is the use of m c sampling if the metric of concern is the

small cell-loss ratio (e.g less than 10-8) in an ATM switch [L’Ecuyer and Champoux 2001] However such techniques are specific only to certain models of concern and cannot be generalized for most simulations In addition, most users are only concerned with the simulation models that they have created and are not conscious

of the underlying structure of the simulator Therefore a good implementation of the FEL should be able to efficiently handle a widely varying FEL size corresponding to small or large-scale discrete event simulation, without any user’s intervention on the simulation engine

1.2 Types of Priority Queues

The assortment of proposed priority queues fall mainly into two categories: discipline and unsorted-discipline

sorted-1.2.1 Sorted-Discipline Priority Queues

A simple example of a sorted-discipline priority queue is a sorted linked list – a list implemented by each element having a link to the next element The elements in the

list are kept in an order of priority where the head (first element) of the list has an

equal or higher priority than the next element An Insert operation is at most O(n)

Trang 20

complexity because a linear search is required to find the correct position in the list

of n elements before a new element is inserted A DeleteMin operation is O(1) because the highest priority element is simply removed from the head of the list

Tree-based priority queues are of sorted-discipline and one example is the well-known Implicit Binary Heap [Bentley 1985] in which it has the property of each parent element having a higher priority than any of its children (at most two) Both the Insert and DeleteMin operations have a worst-case behavior of O(log(n)) complexity Other tree-based priority queues include the Skew Heap [Sleator and Tarjan 1986] and the Splay Tree [Sleator and Tarjan 1985] which are famous for their O(log(n)) amortized access time Although these priority queues are endowed with a similar bound of O(log(n)), Rönngren and Ayani [1997] have empirically shown that the Skew Heap and the Splay Tree outperform the Implicit Binary Heap

in almost all benchmark scenarios Examples of applications of the Splay Tree include in the CelKit (formerly known as SimKit) simulator [Gomes et al 1995], data structure for fast IP lookups [Narlikar and Zane 2001] and is also used in the block sorting process of Burrows and Wheeler [Yugo et al 2002]

A popular list-based priority queue is the Henriksen’s queue [1977] which

frequently exhibits O(log(n)) behavior and is bounded by O( n) in amortized complexity [Kingston 1986] Henriksen’s queue is employed in simulators such as SLAM [Pritsker and Pegden 1979], GPSS/H [Henriksen and Crain 1996] and SLX [Henriksen 1997]

The Sorted-discipline Calendar Queue (SCQ) [Brown 1988] is a popular and widely used list-based priority queue Although the SCQ has a worst-case bound of

O(n), i.e equivalent to the bound of a linear linked list, the SCQ has been employed

in various simulation systems such as GTW [Das et al 1994], CSIM18 [Schwetman

Trang 21

1996], Network Simulator v2 [Fall and Varadhan 2002], as well as in a quality of service algorithm where it maintains real-time packet requests [Stoica et al 2000] The reason for its high regard is that it offers an e (1) amortized complexity whereby expected means that the complexity has not been proven theoretically but rather the performance of the SCQ has been empirically demonstrated to exhibit constant cost per operation frequently under many operating conditions On the other hand, a e (1) amortized complexity priority queue refers to a queue that has been theoretically proven to offer a constant amortized cost per operation under all scenarios, without imposing any restrictive conditions

An interesting note to highlight is that there has also been development of an SCQ-like structure as part of a rate controller for ATM switches [Hagai and Patt-Shamir 2001] As shall be discussed later, the new structures proposed in this thesis have some similarities and differences compared to the SCQ Therefore, we shall now briefly describe how an SCQ priority queue structure works

1.2.1.1 Sorted-discipline Calendar Queue

An SCQ has buckets numbered 0 to− , a bucketwidth δ , and an absolute start 1

time

s

of bucket[0] Each bucket consists of a sorted linked list For each event e in

the SCQ, event timestamp ( )

and each event is placed on the

day it is to occur regardless of the year The duration of a year is δ and therefore

the time span of year j is [ , ( 1) )

s s jδ j δ+ + + For practical implementation,

Trang 22

an SCQ has two important parameters, b and ; b = / 2 and

=i They represent the lower and upper threshold respectively of the number of 2events which can exist in the SCQ before a restructuring process (or z

operation) takes place to modify and δ If threshold is breached, i.e there

are at least i2 1+ events, then 2

=i , b = and = 2

i The bucketwidth δ is obtained via sampling some events in the SCQ to estimate the average inter-event time interval among all the events

Table 1-1: Arrangement of Events in SCQ Example

is bucket[1] Suppose processing of events in year[0] do not generate new events After processing event 127, since bucket[3] is empty, the next to be examined is bucket[0] and event 170 is deleted Right after event 226 is deleted, number of events in the SCQ has breached the b threshold, i.e N < / 2 where N = 1

Thus a restructuring process takes place to modify / 2 2

Trang 23

1.2.1.2 Vulnerabilities of the Sorted-discipline Calendar Queue

Rönngren and Ayani [1997] provided empirical evidence that the SCQ achieves

!(1) performance in benchmark scenarios where the number of events " in the queue and the mean of the #$% & random variable associated with the priority increment distribution, denoted as µ, remain constant However, the SCQ sometimes falters to !(') when µ varies, e.g the Camel and Change distributions, even though

" is constant Furthermore, when " varies, the performance of the SCQ becomes erratic with many peaks which occur when the queue size fluctuates by a factor of 2, revealing that the resize operations can be costly In addition, when " varies, the SCQ does not always achieve !(1) performance even though µ remains constant, e.g Triangular distribution These observations translate to the following:

(i) The SCQ’s size-based resize trigger is an ineffective mechanism for handling

skewed distributions where µ varies It results in many events being enqueued into a few buckets with long sublists and many empty buckets (i.e skewed distribution phenomena) Long sublists make Insert operations expensive since each enqueue entails a sequential search, whereas excessive traversal of empty buckets can increase the process of DeleteMin operations This problem of the SCQ has been attributed to the size-based resize triggers which are too rigid to react according to the events distribution encountered since a resize trigger occurs only if the queue size halves or doubles [Brown 1988]

(ii) Size-based resize trigger is not suitable for handling simulation scenarios

where " varies by a factor of two frequently This form of trigger is considered inflexible because even though the SCQ can be performing well

Trang 24

with its existing operating parameters, but due to this trigger, it still has to resize when ( varies by a factor of two

(iii) Sampling heuristic is inadequate to obtain good operating parameters,

namely, the number of buckets and the bucketwidth, when skewed distributions are encountered During each resize operation, the SCQ samples, at most, the first twenty-five events This is clearly too simplistic because for skewed distributions in which many events fall into some few buckets, the inter-event time-gap of the first twenty-five events, which can span several buckets, and those in the few populated buckets may vary a lot

To simply increase the events sampled is not a prudent approach unless the distribution is uniform For most distributions, particularly skewed distributions such as the Camel distribution, if we simply sample more events and then take the mean or median, it is unlikely that it will be accurate since events are spread unevenly Even if the most populated bucket is sampled, the SCQ will also perform poorly because events in that bucket will have a small average time-gap whereas other events have widely diverse time-gaps If the bucketwidth is updated to this small time-gap, there will likely be numerous empty buckets And skipping these empty buckets will lead to inferior performance Furthermore, sampling more events inevitably leads to higher overheads for each resize operation, affecting the SCQ performance on a whole

(iv) Events are immediately sorted in all the bucket sublists as they are enqueued

This enqueue process is inefficient and costly if the event distribution is heavily skewed so that bucket sublists are very long resulting in costly linear searches during Insert operations Furthermore, if resize operations are

Trang 25

triggered, the effort of sorting these events previously are all wasted since these events will have to be re-enqueued and re-sorted again Such resize operations are very costly and time-consuming in unstable large-scale scenarios where ) varies, resulting in resize triggers being activated many times

It is interesting to note that there are several SCQ variants in the literature which attempts to mitigate the problems in the SCQ by adding additional heuristics These variants are the Dynamic CQ [Oh and Ahn 1998], Reduced Operations CQ (ROCQ) [Tan 2001, Tan 2002], and an improved ROCQ [Goolaup 2003] These variants continue to employ the costly resize operations in attempts to obtain better operating parameters Extensive experiments done personally show that these heuristics are still insufficient to handle skewed distributions

1.2.2 Unsorted-Discipline Priority Queues

An example of an unsorted-discipline priority queue is the *+,-./ 0d linked list The

elements in the list are not kept in any order of priority and usually on a

first-come-first-serve basis An Insert operation is O(1) since the new element is simply

appended to the list either at the head or the tail of the list A DeleteMin operation is

however guaranteed to be Ω(n) because the entire list of n elements has to be

searched to determine the element with the highest priority It is clear that compared

to the sorted-discipline linked list mentioned in Section 1.2.1., the overall complexity of the unsorted linked list is clearly higher than the sorted-discipline linked list

The Unsorted-discipline Calendar Queue (UCQ) is also another discipline priority queue discussed by Erickson et al [2000] The UCQ is similar to

Trang 26

unsorted-the SCQ for unsorted-the most part except that each bucket in UCQ consists of an unsorted

linked list; each bucket in SCQ consists of a sorted linked list Compared to the UCQ, the SCQ is by far more practical and thus more popular because the number

of operations required for a basic Insert and DeleteMin operation is 1(2

3 4 5l63 7) and

O(1) respectively in an SCQ bucket where n sublist represents the number of events in

an SCQ bucket For the case of the UCQ, the number of operations required is

higher, i.e O(1) for enqueue and guaranteed n sublist (i.e Ω(n sublist)) operations for dequeue Hence on average, the UCQ has higher amortized cost compared to the SCQ

Though a UCQ may not perform as well as an SCQ, and has not been practically used, it allows for the derivation of an interesting theoretical property of the Calendar Queue Erickson et al employed Markov chain analysis to investigate

the performance of UCQ under a static scenario where the number of active events

N in the UCQ remains a constant Generally, the number of active events may vary

over time However, the static case is an important case because it is commonly encountered in discrete event simulations For instance, when simulating a parallel

computer, if there are N processors, then there are exactly N active events in the

priority queue [Erickson et al 2000] Another example is when we use the oriented Swan simulator [Thng and Goh 2004] to simulate computer network

process-topologies; if there are N nodes, then there are exactly N active events in the priority

queue

Erickson et al [2000] have proven theoretically that under a static scenario

where the number of events N in the UCQ remains a constant, the UCQ has constant expected processing time per event if the mean of the jump (µ) variable is a constant

The jump, which has the value t e( )−t e( )0 , is a random variable sampled from some

Trang 27

priority increment distribution that has a mean jump µ, where 8 is a newly generated event with time ( )9 8 and 80 is the current minimal time They proved that this

expected processing time depends on µ and not on the shape of its probability

density They derived the expected time per event for the infinite bucket UCQ (UCQ:) and the finite bucket UCQ (UCQ;)

The best theoretical performance occurs in the UCQ: where there are an infinite number of buckets and where the array of buckets in the UCQ: spans one year time interval The performance of the UCQ; is at best equal to that of the UCQ: The events in the UCQ: are processed bucket-by-bucket, unlike in the UCQ; where there are a finite number of buckets In a UCQ;, the buckets can contain events which are in the future years and thus will be processed only much later This results

in a poorer performing UCQ since the DeleteMin operation has to search through these future years events It is also likely that after searching an entire bucket of events, there is no event with the minimum timestamp because all the events in the bucket are in the future years

We notice that their optimized UCQ suffers from the same vulnerabilities as the SCQ, which have been mentioned earlier In simulation models where < and/or

= can vary with time, the UCQ must undergo costly resize operations to obtain a new bucketwidth δ and a new number of buckets used Otherwise, its performance

will degrade to >(<) On top of the problems associated with the SCQ, the UCQ has

an Ω(?

@ ABC @ D) cost per DeleteMin operation, where ?

@AB C:@D is the number of the events

in the bucket from which the event is deleted Nevertheless, it is a well-written theoretical article which provided insights into improving the performance of the UCQ

Trang 28

1.3 Original Contributions of the Thesis

The major contributions of this thesis are in the development of high performance sequential and parallel access priority queues Specifically, the thesis makes significant contributions and reports original work as summarized below

• We present a new concept which has a combination of both sorted and

unsorted-discipline structures, bringing together the best of both categories of priority queues In addition to having a deferred-sort methodology, an EDPQ is epoch-based in which it has the ability to adapt to changing run-time scenarios so that the latest optimized parameters used in its structure are obtained from the current enqueued elements

• We present a novel Twol back-end structure to realize practical and useful epoch-based deferred-sort priority queues (EDPQs) The Twol back-end can be combined with any conventional sorted-discipline priority queue as a front-end structure, such as a sorted linked list or a tree-based structure, to obtain high performance EDPQs

• We theoretically prove that EDPQs that maintain E active events have a

constant-µ F(1) amortized complexity

• Extensive simulation benchmarks carried out confirm the theoretical

predictions that the EDPQs have indeed near constant performance and are bounded by the front-end structure’s amortized complexity Furthermore, the EDPQs consistently outperform those conventional

Trang 29

priority queues (i.e front-end without the Twol back-end structure) with

an average speedup of 300% to 500% and as much as 1000% on widely different hardware architectures

• We present the RCB which is a specialized front-end structure for

customized use with the Twol back-end

• From numerical analysis, we recognize that List+Twol, i.e Twol

back-end with a sorted linked list as the front-back-end structure, offers the best

performance when µ remains a constant However, when µ varies, there

may be excessive elements to be managed in the front-end structure This causes the List+Twol to sometimes falter to near G(H) behavior

• RCB+Twol overcomes the weakness of List+Twol by having a reactive

front-end structure such that it offers constant-µ G(1) amortized complexity for good operating conditions and under poor conditions, the front-end reacts to keep the amortized complexity bounded to G(log(H))

• We present a priority queue which has an improved Twol back-end and a

sorted linked list as a front-end

• LadderQ is theoretically proven to have IJKL G(1) amortized complexity,

regardless of whether µ is constant or varying

• Numerical analysis demonstrates that LadderQ has the best performance

when compared with other EDPQs; i.e RCB+Twol and those with conventional priority queues as a front-end

Trang 30

• We present the PQSUVWXYY Twol priority queue which is ZQ ZV[ PQSU\Z , efficient and dynamic This is the first known structure which involves the transferring of information between multiple tiers in an efficient and absolutely lock-free fashion In addition, it also has a high degree of disjoint-access parallelism, scalability and process helping

• We show that this new structure is P ZY^X\_^[ PY, which is an important property for demonstrating the correctness for Twol to be practically useful as a priority queue structure

• Our numerical analysis demonstrates that it outperforms the most recent lock-free algorithms by more than 70%

1.4 Organization of the Thesis

This thesis consists of six chapters In Chapter 2, we introduce the new epoch-based deferred-sort paradigm and the important Twol back-end structure We provide theoretical analysis of the EDPQ and we demonstrate via numerical studies the enhanced performance of conventional priority queues when they are used as the front-end structure to the Twol back-end structure In Chapter 3, we present a new specialized front-end, the Reactive Cost-based (RCB) structure, for customized use with the Twol back-end structure In Chapter 4, we introduced the Ladder Queue (LadderQ) which has an improved Twol back-end structure Theoretical proof for the LadderQ’s true amortized `(1) property is also presented In Chapter 5, we introduce the lock-free parallel access Twol priority queue In Chapter 6, we present

a summary of all the work contributed in this thesis and draw some conclusions We also provide some suggestions for further work in this final chapter

Trang 31

of two tiers of unsorted-discipline linked lists, whereas the front-end structure is a sorted-discipline priority queue structure Another novelty in the Twol back-end is its ability to adapt to changing run-time scenarios as it is an epoch-based structure Any EDPQ, using the Twol back-end, that maintains g active events are

theoretically proven to offer a constant-µ h(1) amortized complexity, irrespective of the front-end structure The term “constant-k” is taken to mean that the average jump parameter of the priority increment distribution is a constant EDPQs where the front-end is a conventional priority queue, such as the Henrisksen’s queue, Splay Tree, or Skew Heap, are demonstrated empirically to offer stable near h(1) performance for widely varying priority increment distributions and for queue sizes ranging from 10 to 10 million Extensive numerical analysis shows that the EDPQs consistently outperform those front-end conventional priority queue structures (i.e

Trang 32

without the Twol back-end) with an average speedup of about 300% to 500%, and

as much as 1000% speedup, on widely different hardware architectures These

results provide testimony that EDPQs are suitable for implementation in sizeable application scenarios such as, but not limited to, large-scale discrete event simulation

2.1 Outline of Chapter 2

This chapter is organized as follows In Section 2.2, we give an introduction to epoch-based deferred-sort priority queue (EDPQ) and explain the methodologies employed In Section 2.3 we present the Twol back-end structure along with the definition of several important variables associated with this structure Also we highlight the significant design improvements of the Twol back-end structure over the Sorted-discipline Calendar Queue (SCQ) structure The main operations of the Twol structure – q vwv{ v| } ~ and ~v{ are described in detail in Sections 2.4 and 2.5 respectively Section 2.6 presents the theoretical analysis of EDPQ which proves

that this priority queue has a constant-µ (1) amortized complexity In addition to the proof, we have also included an extensive range of numerical studies in Section 2.7 and 2.8 that demonstrates the excellent performance Twol back-end structure offers – a speedup of three to five times Furthermore the priority queues are incorporated into our ~ discrete event simulator and a simple |/|/1 queuing system was simulated Finally, we illustrate that the performance of a front-end priority queue structure to have a Twol back-end far outweighs the cost, which is the additional memory requirement And if memory is a tight constraint, there is even a method to balance this cost and the performance desired!

Trang 33

2.2 Introduction to Epoch-based Deferred-sort Priority Queue

An epoch-based deferred-sort priority queue (EDPQ) has a combination of both

sorted and unsorted-discipline structures It brings together the best of the sorted and unsorted-discipline priority queues, namely, the Insert operation is (1) for an unsorted-discipline priority queue (e.g an unsorted linked list) and the DeleteMin operation is (1) for a sorted-discipline priority queue (e.g a sorted linked list)

Figure 2-1 shows the structure of an EDPQ It is comprised of two

fundamental structures, the front-end structure and the Twol back-end structure

The Twol back-end is essentially the deferred-sort (or unsorted) portion of the EDPQ and is primarily responsible for reducing queue access times significantly The front-end structure can be any conventional priority queue implementation In this chapter, we have implemented the Twol back-end with four front-end priority queues, namely, the Henriksen’s queue, Splay Tree, Skew Heap, and a traditional sorted linked list, for performance evaluation purpose We coin these EDPQs which

front-end is a conventional priority queue structure as Conventional+Twol This is

to differentiate these priority queues from the other EDPQ with a specialized end which will be introduced in the next chapter

front-Figure 2-1: Epoch-based deferred-sort priority queue structure

Trang 34

The two distinct methodologies associated with an EDPQ are deferred-sort and epoch-based An EDPQ defers the sorting of events until absolute necessary,

that is, when some high priority events are close to being dequeued (i.e deleted and processed) The majority of new events are simply appended into unsorted linked lists without sorting and hence only incur (1) cost

This deferred-sort approach is attributed mainly to the Twol back-end structure which is made up of two tiers The first-tier ( )1 is made up of an array of buckets where each bucket contains an unsorted linked list The second-tier (2) is simply an unsorted linked list When a new event is inserted into an EDPQ, it can either be enqueued into the front-end structure or the Twol back-end structure (either

convoluted and its methodology of postponing the sorting of events is not

epoch-based It also requires the user to know priori the value of numerous parameters

for it to perform well It also relies on the resize operation similar to the SCQ, which means having similar vulnerabilities as the SCQ, and on top of that it comes with various resize criteria We only provide a brief mention of the LazyQ because it has been already shown that the LazyQ provides similar performance compared to the SCQ for most scenarios and worse for some scenarios [Rönngren and Ayani 1997]

As such we have not included the LazyQ in our performance comparision We have

Trang 35

on the other hand empirically shown that the EDPQs significantly outperform the SCQ in almost every scenario (see Section 2.8)

Another novelty in the Twol back-end is its ability to adapt to changing time scenarios as it is an epoch-based structure (hence the term epoch-based in EDPQ) By being epoch-based, this means that the Twol structure is a dynamic structure that undergoes several birth-death processes, which we call an epoch, as run-time progresses The natural death of the old Twol structure will result in the birth of a new Twol structure that will make use of the latest optimized parameters for the current enqueued events The algorithm of EDPQ proceeds in epochs, whereby for each epoch, new parameters for the front-end structure and 1 of Twol back-end are obtained based on the current events enqueued in 2 This means that the EDPQ will always operate with a new set of operating parameters (e.g the number of buckets in 1 and the bucketwidth of each of these buckets) tailored for events in the new epoch This ensures that the parameters of the Twol back-end are always kept up to date according to the distribution of the queued events This results in a highly optimized queue access performance It is because of this unique epoch-based deferred-sort paradigm that we are able to achieve highly efficient priority queues

run-2.3 Two-Tier List-Based (Twol) Back-end Structure

The Twol back-end structure is made up of two tiers where the first-tier ( )1 is made

up of an array of buckets where each bucket contains an unsorted linked list, and the second-tier (2) is simply an unsorted linked list This structure of Twol back-end is based on an improvisation of the SCQ structure The reason in choosing a list-based

Trang 36

structure is due to the well-known fact that tree-based algorithms are bounded by

(log()) whereas a list-based structure such as the SCQ has the potential to achieve near (1) in good operating conditions Note that as mentioned in Chapter 1, the UCQ is a structure used for theoretical analysis instead of being considered as a practical priority queue As such, we shall only compare Twol back-end and the SCQ

Twol differs significantly from the SCQ structure and overcomes the SCQ’s shortcomings in four areas Firstly, in the SCQ (Brown 1988), the events in each bucket sublist are in sorted time-order The insertion of events can sometimes be an inefficient and enduring process This can happen in situations where the event distribution is heavily skewed, leading to some very long bucket sublists which can lead to near () insertion cost For the Twol back-end, the linked lists in both 1

and 2 are unsorted Each Insert ( , ) into the Twol back-end defers the sorting of this event and simply appends the event to an unsorted linked list, leading to (1) insertion cost

Secondly, the SCQ employs a resize operation that is performed according to the ratio of the number of events to the number of buckets Though this operation operates efficiently for some priority increment distributions, it falters in scenarios where there are excessive resize function calls This can occur when the ratio of events to buckets varies widely, for example in battlefield simulation scenarios [Rönngren and Ayani 1997] The Twol back-end does not require a resize function

Thirdly, the performance of the SCQ is shown to have constant amortized behavior when the events are spread evenly However, the SCQ can degrade to () complexity when a majority of the events fall in a single or in a few buckets For this

Trang 37

proposed Twol back-end, it manages to offer near constant amortized behavior whilst keeping its bound at (log()) amortized complexity upon amalgamation with tree-based front-end structures, or ( ) for the Henriksen’s queue

Lastly, instead of unreliable and somewhat costly sampling heuristics, the Twol back-end obtains good 1 operating parameters dynamically, in accordance to the current events in its 2 structure, without sampling This reduces further its management overheads on the whole

Before we present a detailed description of the Twol back-end algorithm in the following subsections, the purpose/definition of several important variables associated with the structure are given in Table 2-1

Table 2-1: Some Important Operating Variables Maintained in Twol Back-end

2.4 Twol Algorithm – DeleteMin(S) Operation

At the onset, the front-end structure and the Twol back-end (both 1 and 2) are empty All Insert ( , ) operations insert events into 2, i.e Insert( , 2), without a linear search for the correct event position Events are simply appended to the

Trang 38

unsorted linked list ¡2 On the first DeleteMin ( )¢ , all the events in ¡2 are transferred to ¡1 The bucketwidth of ¡1 is determined using the expression:

Bucketwidth of ¡1 = ¡1_£¤ = (¡2_¥ ¦§ –¡2_¥¨©) / ¡2_ª«¬, (2.1) and the bucket index in which an event is inserted in ¡1 is determined similarly to SCQ:

The rationale of using Equation (2.1) to obtain the bucketwidth is that Twol makes

an assumption that the inter-event timestamp is uniformly distributed Hence each bucket should contain one event on average if the distribution is uniform If the assumption of uniform distribution is not true, there would exist some buckets that contain many events However the performance is not affected even if some buckets contain many events This is because subsequent insertion cost of events into ¡1, or Insert( , )® ¡1 , is still °(1) since events are simply appended to the sublists without sorting After obtaining ¡1_£¤, both ¡1_¢¦¯ and ¡1_±«¯ are set to ¡2²¥¨©, and

2

¡ _¢¦¯ is set to ¡2_¥ ¦§ The bucket index of all the events inserted into ¡1 during

a transfer from ¡2 to ¡1 adhere to Equation (2.2), which is similar to an event inserted in an SCQ but without requiring modulus by the number of buckets, unlike

in an SCQ This is different from an SCQ, because in an SCQ there is no dedicated overflow structure but rather the overflow events are inserted into the sublists, forming multiple ³®¦¯ , where a year is the time interval that spans the total length

of all the bucket sublists For ¡1, there exists only one year

Thereafter, the buckets in ¡1 are processed on a bucket-by-bucket basis, starting with the first non-empty bucket of ¡1 If there is only one event in a bucket,

Trang 39

the event is simply deleted since it is the event with the minimum timestamp, thereby bypassing the front-end structure If however there are more than one event

in a bucket, the events are to be inserted into the end structure using the end priority queue’s native Insert operation And thereafter, the front-end priority queue’s native DeleteMin operation is carried out After the events in the front-end are depleted, the second non-empty bucket in ¶1 is processed and so on, until both the front-end structure and ¶1 are empty Under that circumstance, a transfer of events from ¶2 to ¶1 is performed again We shall illustrate the DeleteMin ( )· of the Twol algorithm in the following example

2

¶ will be updated as: ¶2_¾¿¹ = 10, ¶2À¾ÁÂ = 226, ¶2_ÃÄÅ = 8, · = {127, 226,

175, 25, 100, 51, 10, 36}, where event with timestamp 127 is now at the head of ¶2

On the first DeleteMin ( )· operation, because

¸

¹ and 1

º

¹ = 0, a transfer of events from ¶2 to ¶1 is necessary ¶1’s variables are initialized: ¶1_·ÆÁÇÆ and ¶1_ÈÄÇ = 10, and from Equation (2.1), ¶1_É Ê = (226 – 10) / 8 = 27 The events are then inserted

in the corresponding bucket using Equation (2.2) For instance, the event with timestamp 36 is inserted into the bucket index = (36 10) / 27 −  = 0

Trang 40

Table 2-2: Arrangement of Events in Ë1 of Twol Bucket

to be processed, bypassing the front-end structure for more efficiency That is, the greater the number of buckets that contain only one event, the more efficient is the Twol algorithm

Định dạng
Số trang	205
Dung lượng	1,68 MB