2.8.1 Performance on Intel Pentium 4 (Cache Disabled)
Figures 2-2 to 2-5 demonstrate the performance of the popular SCQ as well as its variant, the Dynamic Calendar Queue (DCQ), under the Classic Hold and Up/Down experiments for queue sizes ranging from 10 to 1 million events and for the ten different priority increment distributions. To make the performance improvement from employing the Twol algorithm more noticeable, Figures 2-6 to 2-11 present the performance of the priority queues without amalgamation for comparison. Note that we do not present the performance of a sorted linked list because it is well known to have O(n) average complexity.
Sorted-discipline Calendar Queue: For most of the distributions in the Classic Hold model, as shown in Figure 2-2, the SCQ displays nearly constant performance, as expected. In the situations where its performance remains stable, the SCQ clearly outperforms all the other priority queues, including the Conventional+Twol ones. It is precisely because of this efficiency that the SCQ is the default FEL candidate in various discrete event simulators. However, the Camel and Change distributions expose the inherent weakness of this popular priority queue: if a large number of events falls into a few buckets, there will be sublists of O(n) length. Under the Classic Hold model, the ratio of events to buckets remains constant, and thus the heuristic is unable to obtain a better bucketwidth, i.e. the resize operation is not triggered. For the Up/Down model in Figure 2-3, the ratio of events to buckets changes frequently, and this results in the resize operation being triggered often, shown graphically by the many peaks in the graphs. This access pattern model also shows that the SCQ has poor performance for distributions such as the Camel, Change and Triangular.
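For concreteness, the sketch below shows the kind of event-to-bucket ratio test a calendar queue uses to trigger a resize. The identifiers and thresholds are illustrative rather than taken from the SCQ source, but they capture why Up/Down, which changes the ratio constantly, fires the costly rebuild repeatedly, while Classic Hold never revisits a badly chosen bucketwidth.

/* Illustrative sketch of a calendar-queue resize trigger
 * (identifiers and thresholds are ours, not from the SCQ source). */
struct calqueue {
    int nevents;        /* events currently in the calendar   */
    int nbuckets;       /* buckets (days) currently allocated */
};

void cq_resize(struct calqueue *q, int newnbuckets);  /* rebuild and re-sample bucketwidth */

/* Called after every Insert/DeleteMin.  Under Classic Hold the
 * event-to-bucket ratio never drifts, so neither branch fires and a
 * poorly chosen bucketwidth is never corrected; under Up/Down the
 * ratio changes continually and the rebuild runs again and again. */
void cq_check_resize(struct calqueue *q)
{
    if (q->nevents > 2 * q->nbuckets)
        cq_resize(q, 2 * q->nbuckets);        /* too crowded: grow  */
    else if (2 * q->nevents < q->nbuckets)
        cq_resize(q, q->nbuckets / 2);        /* too sparse: shrink */
}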
Dynamic Calendar Queue: Figures 2-4 and 2-5 show a variant of the SCQ that attempts to improve the heuristics for the resize criteria. There is a performance improvement in managing events with the Change(exp(1), triangular(90000,100000), 2000) distribution under the Classic Hold model and the Triangular distribution under the Up/Down model. However, this comes at the cost of a slightly higher access time in the regions where the SCQ already remains stable at near-constant performance. These results suggest that its additional cost-based resize heuristic is generally ineffective.
Henriksen’s queue: Figure 2-6 empirically shows that the Henriksen’s queue is nearly O(log n) for queue sizes up to a few thousand events. For larger queue sizes, however, it appears to be very sensitive to the priority increment distribution. Comparison with the Henriksen+Twol shows that the Twol algorithm improves the performance of the Henriksen’s queue very effectively. Figure 2-7 shows that the Henriksen’s queue is efficient in managing widely varying queue sizes and is insensitive to the priority increment distribution, displaying an efficient O(log n) behavior.
Splay Tree: Figures 2-8 and 2-9 show the expected O(log n) behavior. The Splay Tree exhibits marginal sensitivity to the priority increment distribution, perhaps due to its self-balancing heuristics. Generally, the Splay Tree performs worse than the Henriksen’s queue.
Skew Heap: Figures 2-10 and 2-11 show that the access time grows as O(log n) and that the Skew Heap performs marginally better than the Splay Tree. The Skew Heap is only slightly sensitive to the priority increment distribution.
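As a point of reference for this behavior, the sketch below is a minimal textbook rendition of the merge operation at the heart of a skew heap (not the benchmarked implementation). Insert merges a singleton node with the root, and DeleteMin merges the two subtrees of the detached root, so both operations inherit the O(log n) amortized bound of merge.

#include <stddef.h>

/* Minimal textbook skew-heap merge (not the benchmarked code).
 * Insert(e):  root = sh_merge(root, e);
 * DeleteMin:  detach root, root = sh_merge(root->left, root->right). */
struct sh_node {
    double ts;                      /* event timestamp (the priority) */
    struct sh_node *left, *right;
};

struct sh_node *sh_merge(struct sh_node *a, struct sh_node *b)
{
    if (a == NULL) return b;
    if (b == NULL) return a;
    if (b->ts < a->ts) {            /* keep the smaller root in 'a' */
        struct sh_node *t = a; a = b; b = t;
    }
    a->right = sh_merge(a->right, b);
    /* Unconditional child swap: this self-adjusting step is what
     * yields the O(log n) amortized cost seen in Figures 2-10/2-11. */
    struct sh_node *t = a->left;
    a->left = a->right;
    a->right = t;
    return a;
}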
Figures 2-12 to 2-19 present the performance of the priority queues after amalgamation with the Twol back-end. The performance of the Conventional+Twol priority queues is as follows:
Henriksen+Twol: Figures 2-12 and 2-13 demonstrate that the Henriksen+Twol priority queue performs very stably, with near-constant access time regardless of the queue size and priority increment distribution. This near-O(1) performance corresponds very closely to the theoretical O(1) amortized performance stated in Table 2-3.
SplayTree+Twol: Figure 2-14 shows that the SplayTree+Twol is near O(1) under the Classic Hold model. Figure 2-15 reveals that under the Up/Down model, the Twol algorithm still offers near-constant performance for the majority of the distributions. However, this priority queue performs worse for some distributions in which many events fall into the front-end structure, i.e. the Splay Tree. This shows that the Twol algorithm is not adept at handling the ExponentialMix, Camel and Change distributions, where the mean of the jump varies, and relies on the front-end structure to keep the access time bounded.
SkewHeap+Twol: Figures 2-16 and 2-17 display performance similar to that of the SplayTree+Twol.
List+Twol: Figure 2-18 shows that the List+Twol (i.e. a sorted linked list as front-end) is rather erratic for distributions with a varying mean of the jump, especially for ExponentialMix, where it tends to exhibit O(n) behavior at some queue sizes. For distributions with a constant mean, however, this queue performs at O(1). Figure 2-19 shows vividly that using a linked list as the front-end structure offers better performance than using the tree structures when the mean of the jump of the distribution is constant. This is because the DeleteMin cost of a linked list is constant: an event is simply removed from the head of the list, as the sketch below illustrates. For a tree structure such as the Splay Tree, every dequeue incurs some overhead from the node rotations that keep the tree balanced.
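A minimal sketch of this point, with illustrative types: DeleteMin on a sorted linked list is a single pointer update at the head, with none of the restructuring work a self-adjusting tree performs on every dequeue.

#include <stddef.h>

/* Illustrative types: the FEL keeps events sorted by timestamp. */
struct event {
    double ts;
    struct event *next;
};

/* DeleteMin on a sorted list: the earliest event sits at the head,
 * so removal is one pointer update -- constant cost, no rotations. */
struct event *list_delete_min(struct event **head)
{
    struct event *e = *head;
    if (e != NULL)
        *head = e->next;
    return e;
}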
These numerical analyses confirm, interestingly, that for both the Classic Hold and Up/Down scenarios, Conventional+Twol priority queues exhibit near-O(1) performance whenever the mean of the jump does not vary. This gives the novel insight that Theorem 2.1 is applicable even for a varying queue size n.
Figure 2-2: Mean access time for Sorted-discipline Calendar Queue under Classic Hold experiments.
Figure 2-3: Mean access time for Sorted-discipline Calendar Queue under Up/Down experiments.
Figure 2-4: Mean access time for Dynamic Calendar Queue under Classic Hold experiments.
Figure 2-5: Mean access time for Dynamic Calendar Queue under Up/Down experiments.
Figure 2-6: Mean access time for Henriksen’s queue under Classic Hold experiments.
Figure 2-7: Mean access time for Henriksen’s queue under Up/Down experiments.
Figure 2-8: Mean access time for Splay Tree under Classic Hold experiments.
Figure 2-9: Mean access time for Splay Tree under Up/Down experiments.
Figure 2-10: Mean access time for Skew Heap under Classic Hold experiments.
Figure 2-11: Mean access time for Skew Heap under Up/Down experiments.
Figure 2-12: Mean access time for Henriksen+Twol under Classic Hold experiments.
Figure 2-13: Mean access time for Henriksen+Twol under Up/Down experiments.
Figure 2-14: Mean access time for SplayTree+Twol under Classic Hold experiments.
Figure 2-15: Mean access time for SplayTree+Twol under Up/Down experiments.
Figure 2-16: Mean access time for SkewHeap+Twol under Classic Hold experiments.
Figure 2-17: Mean access time for SkewHeap+Twol under Up/Down experiments.
Figure 2-18: Mean access time for List+Twol under Classic Hold experiments.
Figure 2-19: Mean access time for List+Twol under Up/Down experiments.
Table 2-6: Relative Performance* for Exponential Distribution on Intel Pentium 4

Model        | Queue Size | Henriksen+Twol | SplayTree+Twol | SkewHeap+Twol | List+Twol | Henriksen | Splay Tree | Skew Heap | SCQ  | DCQ
Classic Hold | 10         | 1.28 | 2.05 | 1.49 | 1.00 | 1.04 | 2.36 | 1.39 | 1.00 | 1.29
Classic Hold | 10^2       | 1.35 | 2.20 | 1.67 | 1.09 | 1.43 | 3.63 | 2.48 | 1.00 | 1.26
Classic Hold | 10^3       | 1.36 | 2.29 | 1.75 | 1.12 | 1.84 | 4.75 | 3.58 | 1.00 | 1.26
Classic Hold | 10^4       | 1.40 | 2.46 | 1.92 | 1.21 | 2.34 | 6.04 | 4.87 | 1.00 | 1.25
Classic Hold | 10^5       | 1.41 | 2.63 | 2.14 | 1.51 | 2.72 | 6.95 | 5.92 | 1.00 | 1.23
Classic Hold | 10^6       | 1.40 | 2.57 | 2.15 | 1.49 | 3.14 | 7.98 | 7.03 | 1.00 | 1.23
Classic Hold | Avg        | 1.37 | 2.37 | 1.85 | 1.24 | 2.09 | 5.29 | 4.21 | 1.00 | 1.25
Up/Down      | 10         | 1.46 | 1.87 | 1.51 | 1.26 | 1.00 | 1.70 | 1.23 | 2.30 | 2.97
Up/Down      | 10^2       | 1.23 | 1.62 | 1.30 | 1.00 | 1.04 | 2.12 | 1.75 | 2.04 | 2.40
Up/Down      | 10^3       | 1.22 | 1.71 | 1.40 | 1.00 | 1.38 | 2.86 | 2.59 | 1.58 | 1.87
Up/Down      | 10^4       | 1.22 | 1.74 | 1.41 | 1.00 | 1.71 | 3.65 | 3.48 | 1.83 | 2.18
Up/Down      | 10^5       | 1.20 | 1.74 | 1.41 | 1.00 | 2.02 | 4.42 | 4.34 | 1.57 | 1.85
Up/Down      | 10^6       | 1.20 | 1.82 | 1.75 | 1.00 | 2.35 | 5.28 | 5.15 | 1.40 | 1.66
Up/Down      | Avg        | 1.26 | 1.75 | 1.46 | 1.04 | 1.58 | 3.34 | 3.09 | 1.79 | 2.16
Total Avg    |            | 1.31 | 2.06 | 1.66 | 1.14 | 1.83 | 4.31 | 3.65 | 1.39 | 1.70
Table 2-7: Relative Average Performance* for Distributions With Constant Mean of Jump on Intel Pentium 4

Model        | Queue Size | Henriksen+Twol | SplayTree+Twol | SkewHeap+Twol | List+Twol | Henriksen | Splay Tree | Skew Heap | SCQ  | DCQ
Classic Hold | 10         | 1.40 | 1.96 | 1.54 | 1.14 | 1.00 | 2.65 | 1.55 | 1.07 | 1.34
Classic Hold | 10^2       | 1.34 | 1.99 | 1.49 | 1.11 | 1.35 | 3.50 | 2.55 | 1.00 | 1.25
Classic Hold | 10^3       | 1.48 | 2.04 | 1.56 | 1.14 | 1.96 | 4.66 | 3.79 | 1.00 | 1.25
Classic Hold | 10^4       | 1.49 | 2.10 | 1.61 | 1.17 | 2.89 | 5.84 | 5.08 | 1.00 | 1.24
Classic Hold | 10^5       | 1.46 | 2.03 | 1.63 | 1.22 | 4.46 | 6.88 | 6.26 | 1.00 | 1.23
Classic Hold | 10^6       | 1.48 | 1.98 | 1.63 | 1.20 | 7.33 | 7.85 | 7.32 | 1.00 | 1.23
Classic Hold | Avg        | 1.44 | 2.02 | 1.58 | 1.16 | 3.17 | 5.23 | 4.43 | 1.01 | 1.26
Up/Down      | 10         | 1.42 | 1.68 | 1.40 | 1.24 | 1.00 | 1.72 | 1.18 | 2.55 | 3.15
Up/Down      | 10^2       | 1.23 | 1.46 | 1.35 | 1.00 | 1.05 | 2.13 | 1.74 | 2.17 | 2.53
Up/Down      | 10^3       | 1.19 | 1.57 | 1.33 | 1.00 | 1.39 | 2.92 | 2.62 | 1.79 | 2.10
Up/Down      | 10^4       | 1.19 | 1.58 | 1.36 | 1.00 | 1.72 | 3.73 | 3.54 | 2.54 | 2.56
Up/Down      | 10^5       | 1.14 | 1.56 | 1.34 | 1.00 | 2.07 | 4.59 | 4.45 | 3.33 | 2.22
Up/Down      | 10^6       | 1.15 | 1.56 | 1.37 | 1.00 | 2.38 | 5.50 | 5.36 | 6.19 | 1.92
Up/Down      | Avg        | 1.22 | 1.57 | 1.36 | 1.04 | 1.60 | 3.43 | 3.15 | 3.10 | 2.41
Total Avg    |            | 1.33 | 1.79 | 1.47 | 1.10 | 2.38 | 4.33 | 3.79 | 2.05 | 1.84

*Normalized with respect to the fastest access time, where the higher the number, the slower it is.
Table 2-8: Relative Average Performance* for Distributions With Varying Mean of Jump on Intel Pentium 4

Model        | Queue Size | Henriksen+Twol | SplayTree+Twol | SkewHeap+Twol | List+Twol | Henriksen | Splay Tree | Skew Heap | SCQ   | DCQ
Classic Hold | 10         | 1.22 | 2.07 | 1.59 | 1.10 | 1.00 | 2.48 | 1.42 | 1.15  | 1.43
Classic Hold | 10^2       | 1.23 | 2.16 | 1.66 | 1.31 | 1.13 | 2.89 | 2.12 | 1.00  | 1.21
Classic Hold | 10^3       | 1.08 | 1.92 | 1.50 | 1.77 | 1.16 | 2.59 | 2.10 | 1.00  | 1.56
Classic Hold | 10^4       | 1.00 | 1.93 | 1.54 | 8.12 | 1.45 | 2.99 | 2.62 | 10.89 | 2.87
Classic Hold | 10^5       | 1.00 | 1.75 | 1.49 | 1.33 | 1.95 | 3.58 | 3.28 | 5.43  | 14.50
Classic Hold | 10^6       | 1.00 | 1.69 | 1.47 | 1.32 | 2.67 | 3.99 | 3.76 | 2.13  | 24.67
Classic Hold | Avg        | 1.09 | 1.92 | 1.54 | 2.49 | 1.56 | 3.09 | 2.55 | 3.60  | 7.71
Up/Down      | 10         | 1.56 | 2.03 | 1.65 | 1.30 | 1.00 | 1.80 | 1.18 | 2.75  | 3.25
Up/Down      | 10^2       | 1.26 | 1.95 | 1.72 | 1.29 | 1.00 | 2.04 | 1.67 | 2.20  | 2.56
Up/Down      | 10^3       | 1.02 | 1.73 | 1.56 | 1.23 | 1.00 | 2.12 | 1.90 | 1.42  | 1.65
Up/Down      | 10^4       | 1.00 | 1.83 | 1.72 | 1.32 | 1.07 | 2.30 | 2.20 | 23.85 | 23.18
Up/Down      | 10^5       | 1.00 | 1.81 | 1.60 | 1.30 | 1.21 | 2.67 | 2.64 | NA    | NA
Up/Down      | 10^6       | 1.00 | 1.88 | 1.67 | 1.35 | 1.46 | 3.28 | 3.28 | NA    | NA
Up/Down      | Avg        | 1.14 | 1.87 | 1.65 | 1.30 | 1.12 | 2.37 | 2.15 | NA    | NA
Total Avg    |            | 1.11 | 1.90 | 1.60 | 1.90 | 1.34 | 2.73 | 2.35 | NA    | NA
Table 2-9: Relative Average Performance* for All Distributions on Intel Pentium 4

Model        | Queue Size | Henriksen+Twol | SplayTree+Twol | SkewHeap+Twol | List+Twol | Henriksen | Splay Tree | Skew Heap | SCQ   | DCQ
Classic Hold | 10         | 1.31 | 2.01 | 1.56 | 1.12 | 1.00 | 2.56 | 1.49 | 1.11  | 1.39
Classic Hold | 10^2       | 1.28 | 2.08 | 1.58 | 1.22 | 1.23 | 3.18 | 2.32 | 1.00  | 1.23
Classic Hold | 10^3       | 1.24 | 1.97 | 1.52 | 1.52 | 1.48 | 3.41 | 2.77 | 1.00  | 1.43
Classic Hold | 10^4       | 1.00 | 1.68 | 1.32 | 4.66 | 1.68 | 3.43 | 2.99 | 6.06  | 1.91
Classic Hold | 10^5       | 1.00 | 1.58 | 1.31 | 1.10 | 2.47 | 4.11 | 3.75 | 3.19  | 8.05
Classic Hold | 10^6       | 1.00 | 1.53 | 1.30 | 1.08 | 3.74 | 4.61 | 4.32 | 1.45  | 13.54
Classic Hold | Avg        | 1.14 | 1.81 | 1.43 | 1.78 | 1.93 | 3.55 | 2.94 | 2.30  | 4.59
Up/Down      | 10         | 1.49 | 1.85 | 1.52 | 1.27 | 1.00 | 1.76 | 1.18 | 2.65  | 3.20
Up/Down      | 10^2       | 1.21 | 1.67 | 1.50 | 1.12 | 1.00 | 2.04 | 1.67 | 2.14  | 2.49
Up/Down      | 10^3       | 1.00 | 1.52 | 1.34 | 1.04 | 1.07 | 2.25 | 2.01 | 1.44  | 1.68
Up/Down      | 10^4       | 1.00 | 1.61 | 1.48 | 1.12 | 1.23 | 2.66 | 2.54 | 14.63 | 14.25
Up/Down      | 10^5       | 1.00 | 1.63 | 1.43 | 1.13 | 1.46 | 3.22 | 3.15 | NA    | NA
Up/Down      | 10^6       | 1.00 | 1.66 | 1.47 | 1.16 | 1.71 | 3.90 | 3.85 | NA    | NA
Up/Down      | Avg        | 1.12 | 1.66 | 1.46 | 1.14 | 1.25 | 2.64 | 2.40 | NA    | NA
Total Avg    |            | 1.13 | 1.73 | 1.44 | 1.46 | 1.59 | 3.09 | 2.67 | NA    | NA

*Normalized with respect to the fastest access time, where the higher the number, the slower it is.
Tables 2-6 to 2-9 show the relative performance of the priority queues on the Intel Pentium 4. The List+Twol is the fastest for the Exponential distribution, while the Henriksen+Twol is the overall leader when the priority increment distribution, queue size and access pattern model are all taken into account. Tables 2-7 and 2-8 show that the List+Twol is faster than the Henriksen+Twol for distributions with a constant mean of jump, i.e. Exponential, Uniform, Triangular and Negative Triangular, and performs worse for the other distributions, which have a varying mean of jump. This provides a strong motivation to improve the List+Twol further so that it maintains good performance for most, if not all, distributions.
2.8.2 Performance on the Intel Itanium 2 (SGI Altix 3300), SGI MIPS R16000 (SGI Onyx4) and AMD Athlon MP
Tables 2-10 and 2-11 provide a vivid overall performance summary.⁴ Note that we have not included the List+Twol in this section because the next chapter presents an improved version of this algorithm. Also, the expensive hardware used in this section was on loan from Silicon Graphics Pte Ltd for only a brief period, and as such we only had sufficient time to benchmark the three Conventional+Twol queues with tree-based front-end structures.
Table 2-10 presents in detail the actual speedup from the Twol amalgamation on all the architectures, including the Intel Pentium 4 (with cache disabled). It is conspicuous that the speedup is higher on the three cache-enabled architectures, ranging from about 3 to 5 times over the front-end structure without amalgamation. For the Intel Pentium 4 with cache disabled,
⁴ Due to the relatively large number of figures and tables obtained, they are not included in this thesis. Nevertheless, they are shown in the article by Goh and Thng [2005].
the speedup is only about 2. Even though the cache-disabled experiments on the Intel Pentium 4 were done to give a closer approximation of the complexity of the priority queues, and are not meant to reflect the speedup expected of normal applications, we have nevertheless included them in computing the overall average speedup of 3.23.
Table 2-10: Speedup# of Twol Algorithm on Different Hardware Architectures – Comparison by Priority Increment Distribution

Priority increment distribution | Henriksen+Twol (P4+ / IT2 / MIPS / AMD) | SplayTree+Twol (P4+ / IT2 / MIPS / AMD) | SkewHeap+Twol (P4+ / IT2 / MIPS / AMD) | Average (Distri.)
Exponential(1)                                | 1.46 3.12 2.57 3.31  | 2.20 4.13 3.58 4.04 | 2.32 4.79 3.24 3.43 | 3.18
Uniform(0,2)                                  | 2.28 9.83 6.10 7.03  | 2.78 3.82 3.84 4.01 | 2.76 4.46 3.42 3.41 | 4.48
Uniform(0.9,1.1)                              | 1.65 4.90 3.36 3.71  | 2.43 3.02 3.39 3.06 | 3.03 4.50 3.68 3.52 | 3.35
Bimodal                                       | 1.61 5.86 3.93 4.73  | 2.53 3.95 3.65 3.99 | 2.60 4.65 3.45 3.44 | 3.70
Triangular(0,1.5)                             | 2.08 10.65 6.57 7.42 | 2.60 3.63 3.74 3.77 | 2.90 4.47 3.55 3.45 | 4.57
NegativeTriangular(0,1000)                    | 1.84 7.20 4.70 5.55  | 2.52 3.95 3.79 4.05 | 2.74 4.56 3.43 3.40 | 3.98
ExponentialMix                                | 1.17 2.56 1.88 2.71  | 1.25 2.54 2.06 2.75 | 1.24 2.85 1.92 2.45 | 2.12
Camel(0,1000,0.001,0.999)                     | 1.10 3.03 1.85 2.37  | 1.31 2.52 1.91 2.21 | 1.34 4.01 2.26 2.89 | 2.23
Change(exp(1),triangular(90000,100000),2000)  | 1.43 4.11 2.58 2.87  | 1.54 2.20 1.98 2.03 | 1.64 3.14 2.26 2.29 | 2.34
Change(triangular(90000,100000),exp(1),10000) | 1.38 4.06 2.60 2.84  | 1.50 2.17 2.01 1.94 | 1.76 3.16 2.28 2.30 | 2.33
Average (platform)                            | 1.60 5.53 3.61 4.25  | 2.07 3.19 2.99 3.19 | 2.23 4.06 2.95 3.06 | 3.23

# Values exceeding 1 indicate a speedup. Speedup refers to the performance of the Conventional+Twol priority queue normalized over its conventional structure without the Twol back-end. + P4 is with cache disabled.
Table 2-11 gives an itemization of the average speedup as the queue size varies from 10 to 10 million events. The average speedup of the Twol-based EDPQs over their conventional priority queues without the Twol back-end is 2.80 across the different front-end structures and hardware architectures.
Note that for Tables 2-10 and 2-11, the range of queue sizes tested is from 10 to 1 million for the P4 (Intel Pentium 4), from 100 to 10 million for the IT2 (Intel Itanium 2), and from 100 to 1 million for the MIPS (SGI MIPS R16000) and AMD (AMD Athlon MP), as mentioned in Table 2-5. The IT2, MIPS and AMD start from queue size 100 because the efficient caches on these architectures let the benchmark run so fast that we could not obtain accurate timings for queue sizes below 100, given the limitation of the microsecond timer. We benchmarked up to 10 million events only on the IT2, which is fast and has ample shared RAM.
Table 2-11: Speedup of Twol Algorithm on Different Hardware Architectures – Comparison by Queue Size (values exceeding 1 indicate a speedup)

Queue Size | Henriksen+Twol (P4+ / IT2 / MIPS / AMD) | SplayTree+Twol (P4+ / IT2 / MIPS / AMD) | SkewHeap+Twol (P4+ / IT2 / MIPS / AMD) | Avg  | Cost (MB)*
10         | 0.73 -     -    -    | 1.17 -    -    -    | 0.89 -    -    -    | 0.93 | 1.76e-04
20         | 0.76 -     -    -    | 1.25 -    -    -    | 1.05 -    -    -    | 1.02 | 3.36e-04
30         | 0.78 -     -    -    | 1.25 -    -    -    | 1.13 -    -    -    | 1.05 | 4.96e-04
40         | 0.81 -     -    -    | 1.35 -    -    -    | 1.21 -    -    -    | 1.12 | 6.56e-04
50         | 0.85 -     -    -    | 1.31 -    -    -    | 1.20 -    -    -    | 1.12 | 8.16e-04
70         | 0.86 -     -    -    | 1.37 -    -    -    | 1.28 -    -    -    | 1.17 | 1.14e-03
100        | 0.91 0.88  1.03 0.87 | 1.43 1.42 1.59 1.40 | 1.34 1.19 1.31 1.18 | 1.21 | 1.62e-03
200        | 0.97 0.96  1.13 0.93 | 1.51 1.52 1.66 1.55 | 1.44 1.35 1.40 1.33 | 1.31 | 3.22e-03
300        | 1.01 1.02  1.19 1.00 | 1.56 1.58 1.70 1.57 | 1.55 1.43 1.47 1.36 | 1.37 | 4.82e-03
400        | 1.04 1.07  1.23 1.00 | 1.60 1.62 1.74 1.62 | 1.58 1.47 1.51 1.39 | 1.41 | 6.42e-03
500        | 1.07 1.10  1.27 1.01 | 1.61 1.64 1.74 1.58 | 1.64 1.50 1.52 1.39 | 1.42 | 8.02e-03
700        | 1.11 1.16  1.33 1.05 | 1.64 1.69 1.78 1.59 | 1.68 1.55 1.57 1.42 | 1.46 | 1.12e-02
1000       | 1.15 1.21  1.40 1.13 | 1.65 1.74 1.84 1.65 | 1.71 1.59 1.63 1.46 | 1.51 | 1.60e-02
2000       | 1.24 1.34  1.51 1.28 | 1.69 1.84 1.92 1.79 | 1.78 1.70 1.72 1.54 | 1.61 | 3.20e-02
3000       | 1.31 1.44  1.61 1.42 | 1.76 1.99 2.00 1.88 | 1.86 1.82 1.82 1.68 | 1.72 | 4.80e-02
4000       | 1.34 1.50  1.65 1.55 | 1.78 2.03 2.03 1.94 | 1.89 1.90 1.87 1.71 | 1.77 | 6.40e-02
5000       | 1.38 1.56  1.70 1.64 | 1.81 2.07 2.07 1.99 | 1.94 1.95 1.92 1.77 | 1.82 | 8.00e-02
7000       | 1.45 1.64  1.77 1.95 | 1.86 2.11 2.12 2.04 | 2.00 2.04 1.98 1.80 | 1.90 | 1.12e-01
10000      | 1.52 1.70  1.84 2.21 | 1.90 2.15 2.17 2.05 | 2.05 2.12 2.04 1.78 | 1.96 | 1.60e-01
20000      | 1.67 1.90  2.04 2.72 | 2.03 2.28 2.32 2.36 | 2.21 2.34 2.20 2.15 | 2.19 | 3.20e-01
30000      | 1.75 2.10  2.23 3.02 | 2.10 2.43 2.42 2.56 | 2.30 2.47 2.24 2.39 | 2.33 | 4.80e-01
40000      | 1.85 2.57  2.50 3.35 | 2.14 2.72 2.61 2.73 | 2.38 2.57 2.39 2.58 | 2.53 | 6.40e-01
50000      | 1.88 2.87  2.77 3.52 | 2.19 2.76 2.79 2.85 | 2.45 2.67 2.53 2.73 | 2.67 | 8.00e-01
70000      | 1.98 3.37  3.38 3.86 | 2.25 2.72 3.19 3.03 | 2.54 2.80 2.87 2.95 | 2.91 | 1.12
100000     | 2.09 3.63  3.58 4.24 | 2.36 2.67 3.27 3.24 | 2.60 2.99 3.09 3.17 | 3.08 | 1.60
200000     | 2.34 4.09  4.37 4.99 | 2.47 2.76 3.44 3.64 | 2.72 3.46 3.43 3.56 | 3.44 | 3.20
300000     | 2.50 4.46  4.79 5.52 | 2.58 2.87 3.56 3.87 | 2.79 3.71 3.67 3.80 | 3.68 | 4.80
400000     | 2.59 4.78  5.13 5.98 | 2.62 2.97 3.67 4.07 | 2.90 3.89 3.81 3.97 | 3.87 | 6.40
500000     | 2.67 5.01  5.37 6.33 | 2.66 3.07 3.75 4.17 | 2.94 4.06 3.93 4.12 | 4.01 | 8.00
700000     | 2.86 5.46  5.86 6.93 | 2.72 3.23 3.89 4.36 | 3.00 4.30 4.12 4.35 | 4.26 | 11.20
1000000    | 3.01 5.96  6.38 7.62 | 2.76 3.41 4.06 4.59 | 3.05 4.56 4.31 4.62 | 4.53 | 16.00
2000000    | -    7.03  -    -    | -    3.73 -    -    | -    5.07 -    -    | 5.28 | 32.00
3000000    | -    7.93  -    -    | -    3.93 -    -    | -    5.35 -    -    | 5.74 | 48.00
4000000    | -    8.49  -    -    | -    4.04 -    -    | -    5.52 -    -    | 6.02 | 64.00
5000000    | -    8.99  -    -    | -    4.14 -    -    | -    5.65 -    -    | 6.26 | 80.00
7000000    | -    9.83  -    -    | -    4.35 -    -    | -    5.89 -    -    | 6.69 | 112.00
10000000   | -    10.73 -    -    | -    4.51 -    -    | -    6.09 -    -    | 7.11 | 160.00
Average    | 1.53 3.73  2.68 3.00 | 1.88 2.64 2.53 2.56 | 1.97 3.06 2.41 2.41 | 2.80 | -

For queue sizes 10 to 70, the IT2, MIPS and AMD entries are marked "-" because of inaccurate timings due to the efficient cache and the limitation of the microsecond timer.
*Assume a bucket allocated in T1 of the Twol algorithm requires 16 bytes of shared RAM. The cost is an upper bound for individual queue sizes. MB here refers to 1,000,000 bytes.
2.9 Memory vs Performance Consideration
Having validated the performance of the Twol-based EDPQs through numerous verifications, this section discusses the additional cost, in terms of memory, required for the Twol algorithm to function. The bulk of the memory requirement of the Twol back-end lies in its first tier (T1), where the number of buckets allocated is bounded above by U+1, U being the number of events transferred from T2 (see Lemma 2.6). The remaining memory requirements are a few variables associated with T1 and T2, as mentioned in Section 2.3, which take up an insignificant amount of memory compared to the bucket allocation.
The structure of a bucket in the C programming language is given as:

    struct bucket {
        struct event *head;
        struct event *tail;
    };

which incurs a cost of 8B (8 bytes) on a 32-bit hardware platform and up to 16B on a 64-bit one.
Assuming a cost of 16B per allocated bucket, the total cost and the speedup at the corresponding queue sizes on the different platforms are given in Table 2-11. At a queue size of 1 million, the average speedup is 4.53 and the maximum memory requirement is 16MB, which is nominal by today's workstation or server configuration. Furthermore, low-cost 64-bit processors such as the AMD Opteron and the AMD Athlon64 (a PC processor likely to become a desktop commodity in the near future) have recently been introduced. Generally, a 32-bit processor allows memory addressing of only up to 4GB, whereas a 64-bit processor theoretically allows memory addressing of up to 16 exabytes (16 billion GB) and is thus limited only by the hardware implementation. Even though some operating systems such as Linux enable 64GB of physical memory addressing on 32-bit processors, individual applications and the operating system still cannot access more than 4GB each. With the advent of 64-bit computing, the minor drawback of the Twol algorithm, i.e. trading a small amount of memory for speed, seems trivial. Table 2-11 also shows that the speedup increases as the queue size increases, which makes the algorithm especially suitable for large-scale applications.
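As a quick check of the 16MB figure in Table 2-11, the worked example below applies the 16-byte-per-bucket assumption on a 64-bit platform to a queue of 1 million events; the bucket-count bound (one bucket per event, plus one) follows the U+1 bound discussed above.

#include <stdio.h>

struct event;                        /* opaque: only pointers are stored */
struct bucket {
    struct event *head;
    struct event *tail;              /* 2 x 8-byte pointers = 16B on 64-bit */
};

int main(void)
{
    long queue_size = 1000000;
    /* One bucket per event plus one, per the U+1 bound of Lemma 2.6. */
    double cost_mb = (queue_size + 1) * (double)sizeof(struct bucket) / 1e6;
    printf("upper-bound memory cost: %.2f MB\n", cost_mb);   /* ~16.00 MB */
    return 0;
}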
For a higher-performance practical implementation, if M buckets are actually required in the first epoch, we can create 2M buckets instead. If 2M buckets prove insufficient for a later epoch, a batch creation process can be initiated to double the number of buckets. From epoch to epoch, if the number of buckets in T1 is sufficient, the previously created buckets are simply reused, as the sketch below shows.
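The following is a sketch of that batch creation policy under illustrative names (reserve_buckets is not from the Twol source): allocate generously on the first epoch, double in batches when a later epoch overflows, and otherwise reuse the existing array.

#include <stdlib.h>

struct event;
struct bucket { struct event *head, *tail; };

struct tier1 {
    struct bucket *bkt;
    size_t capacity;                 /* buckets currently allocated */
};

/* Ensure at least `needed` buckets before an epoch transfer.
 * The first call allocates 2*needed; later calls double in batches;
 * if the current capacity already suffices, the buckets are reused. */
int reserve_buckets(struct tier1 *t1, size_t needed)
{
    size_t cap = (t1->capacity == 0) ? 2 * needed : t1->capacity;
    while (cap < needed)
        cap *= 2;                                /* batch doubling */
    if (cap != t1->capacity) {
        struct bucket *p = realloc(t1->bkt, cap * sizeof *p);
        if (p == NULL)
            return -1;                           /* out of memory  */
        t1->bkt = p;
        t1->capacity = cap;
    }
    return 0;
}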
Under stringent circumstances where memory is scarce, the Twol algorithm is still able to function by allocating fewer buckets than the queue size. We can assume that there are k events on average in each bucket, contrary to the one event per bucket in Equation (2.1), and thus the number of buckets that the Twol algorithm requires is now 1/k of the queue size. We then obtain:

    Bucketwidth of T1 = T1_BW = k × (T2_max − T2_min) / T2_size ,    (2.9)

which means that the bucketwidth is directly proportional to k. The time interval that spans one year in T1 remains the same at T1_BW × (number of buckets). So, for instance, if we assume two events in each bucket during a transfer of U events from T2 to T1, we would require only (U/2)+1 buckets, i.e. roughly half the memory required when assuming one event per bucket. As k → T2_size, the performance of the EDPQ tends towards that of the front-end structure. Since each bucket is expected to contain more events as the number of buckets decreases, a gradual performance degradation is expected until k = T2_size, at which point the EDPQ performs similarly to its front-end structure.
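A small numerical sketch of this memory/performance trade-off is given below; the names mirror the reconstruction of Equation (2.9) above and are otherwise illustrative.

#include <stdio.h>

/* Equation (2.9): with an average of k events per bucket, the T1
 * bucketwidth scales by k, and the bucket count falls to ~U/k + 1. */
double t1_bucketwidth(double t2_max, double t2_min, long U, long k)
{
    return k * (t2_max - t2_min) / (double)U;
}

long t1_bucket_count(long U, long k)
{
    return U / k + 1;                /* k = 1 recovers Equation (2.1) */
}

int main(void)
{
    long U = 1000000;                /* events transferred from T2 */
    for (long k = 1; k <= 8; k *= 2)
        printf("k=%ld: buckets=%ld (memory roughly %ldx smaller)\n",
               k, t1_bucket_count(U, k), k);
    return 0;
}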
2.10 Evaluation Via Swan on Intel Pentium 4 (Cache-Enabled)

To determine the performance of the priority queues in real discrete event simulation, the SCQ, Skew Heap, Splay Tree, Henriksen's queue and their Twol-based counterparts have been implemented as the FEL structures available in the Swan (Simulator Without A Name) simulator [Thng and Goh 2004]. A simple M/M/1 queuing system was created in Swan and simulated for 10,000 simulation seconds. In this network topology, each source node generates network packets with exponentially distributed inter-packet generation times. The source nodes, which vary in number from 8 to 399,998, generate packets that are multiplexed into a single server, which services the packets to a single destination node. The service times for the packets are also exponentially distributed. The results obtained are shown in Figures 2-20 and 2-21. Other, more complex real simulation comparisons can be found in the work of Tam [2003] and Wang [2003].
Each discrete event simulator operates uniquely. In Swan, an event in the FEL corresponds to a node in the network topology. At the onset, if the number of nodes is set to N, Swan initializes the FEL structure with N events, each holding a zero timestamp. Thereafter, the process interactions between the nodes determine the DeleteMin and Insert operations of subsequent events, with the number of events in the FEL structure held constant at N throughout the simulation. The mechanism of Swan thus corresponds to an initial Up access pattern (see Section 2.7) that creates the N nodes, after which the simulation follows a Classic Hold model in which the number of events in the FEL is kept constant and equal to the number of nodes in the network topology. The x-axis in Figures 2-20 and 2-21 can therefore be relabeled as "Number of nodes." A sketch of this access pattern is given below.
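The sketch renders the access pattern as code, assuming hypothetical fel_insert/fel_delete_min hooks (Swan's actual interface is not shown here): an Up phase of N zero-timestamp insertions, followed by a Classic Hold loop in which every DeleteMin is paired with an Insert at an exponentially incremented timestamp.

#include <stdlib.h>
#include <math.h>

/* Hypothetical FEL hooks; Swan's real interface differs. */
extern void   fel_insert(double ts);
extern double fel_delete_min(void);

/* Exponentially distributed increment via inverse-CDF sampling. */
static double exp_increment(double mean)
{
    double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);  /* u in (0,1) */
    return -mean * log(u);
}

void swan_like_run(long N, double end_time)
{
    /* Up phase: N nodes, each contributing one zero-timestamp event. */
    for (long i = 0; i < N; i++)
        fel_insert(0.0);

    /* Classic Hold phase: each dequeue is followed by an enqueue, so
     * the FEL population stays at N for the whole simulation. */
    double now = 0.0;
    while (now < end_time) {
        now = fel_delete_min();
        fel_insert(now + exp_increment(1.0));
    }
}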
The benchmarks in this section were carried out on the Intel Pentium 4 workstation described in Table 2-5, with both the L1 and L2 caches enabled. The operating system used was Windows XP Professional SP1. Unlike the previous benchmarks, where the caches were turned off to show the performance of the priority queues without cache effects, this set of benchmarks using the Swan simulator was carried out to demonstrate the integral role that a high-performance priority queue assumes during an actual discrete event simulation. The Intel Pentium 4 was chosen because the Swan simulator is currently available only on the Windows platform (although, with minor tweaks, it can be made to run on Linux or other platforms since it is written in ANSI C++), and the fastest compatible hardware platform was chosen to reduce the run-time. The metric measured was the overall run-time, which includes the Swan simulation engine management and control time in addition to the FEL structure management time. So while the FEL structure may have constant complexity, the overall run-time is not expected to be constant. Nevertheless, the results give a realistic view of the degree to which the performance of an FEL structure can affect the run-time. The run-time reported for each simulation with a given FEL structure is the average over five identical runs.