Advanced Computer Architecture - Lecture 19: Instruction level parallelism. This lecture will cover the following: limitations of ILP and conclusion; hardware model; effects of branch/jumps; finite registers; performance of Intel P6 Micro-Architecture-based processors; thread-level parallelism;...
CS 704 Advanced Computer Architecture
Lecture 19: Instruction Level Parallelism (Limitations of ILP and Conclusion)
Prof. Dr. M. Ashraf Chughtai

Today's Topics
- Recap
- Limitations of ILP
- Hardware model
- Effects of branches/jumps
- Finite registers
- Performance of Intel P6 micro-architecture-based processors: Pentium Pro, Pentium II, III and IV
- Thread-level parallelism
- Summary

MAC/VU-Advanced Computer Architecture Lecture 19 – Instruction Level Parallelism-Dynamic (8)

Recap: ILP - Dynamic Scheduling
In the last few lectures we have been discussing the concepts and methodologies, introduced during the last decade, for designing high-performance processors. Our focus has been the hardware methods for instruction-level parallelism that execute multiple instructions in a pipelined datapath.

These hardware techniques are referred to as dynamic scheduling techniques. They are used to avoid structural, data and control hazards and to minimize the number of stalls to achieve better performance. We have discussed dynamic scheduling in the integer pipelined datapath and in the floating-point pipelined datapath.

We discussed scoreboarding and Tomasulo's algorithm as the basic concepts for dynamic scheduling in the integer and floating-point datapaths. The structures implementing these concepts facilitate out-of-order execution to minimize the effect of data dependencies and thus avoid stalls from data hazards.

We also discussed branch-prediction techniques and the different types of branch predictors used to reduce the number of stalls due to control hazards. The concept of multiple instruction issue was discussed in detail. This concept is
used to reduce the CPI to less than one; thus, the performance of the processor is enhanced.

Last time we talked about extending Tomasulo's structure with hardware-based speculation. It allows the processor to speculate that a branch is correctly predicted, and thus to execute out of order but commit in order, once it has confirmed that the speculation is correct and no exceptions exist.

Today's Topics: ILP - Dynamic Scheduling
Today we will conclude our discussion of the dynamic scheduling techniques for instruction-level parallelism by introducing an ideal processor model to study:
- the limitations of ILP; and
- the implementation of these concepts in the Intel P6 micro-architecture.

Limitations of ILP – Ideal Processor
To understand the limitations of ILP, let us first define an ideal processor:
- An ideal processor is one that has no artificial constraints on ILP; and
- the only limits in such a processor are those imposed by the actual data flows through either registers or memory.

Assumptions for an Ideal Processor
An ideal processor is, therefore, one wherein:
a) all control dependencies; and
b) all but true data dependencies
are eliminated.
The control dependencies are eliminated by assuming that the branch and jump predictions are perfect, i.e.,

Branch performance and speculation costs
The figure shows the fraction of the branches mispredicted either because of BTB misses or because of incorrect predictions. On average, about 20% of the branches either miss or are mispredicted and use the
simple static predictor rule.

Overall Performance of P6 Pipeline
Overall performance depends on the rate at which instructions actually complete and commit.

Fig 3.56… The figure shows the fraction of the time in which zero, one, two or three uops commit. On average, one uop commits per cycle.

Here, 23% of the time, three uops commit in a cycle. This distribution demonstrates the ability of a dynamically scheduled pipeline to fall behind (on 55% of the cycles, no uops commit) and later catch up (31% of the cycles have two or three uops committing).

The Pentium III versus Pentium 4
The micro-architecture of the Pentium 4, called NetBurst, is similar to that of the Pentium III, called the P6 micro-architecture. Both fetch up to three IA-32 instructions per cycle and decode them into micro-ops, then send the uops to an out-of-order execution engine that executes up to three uops per cycle.

There are, however, many differences that allow the NetBurst micro-architecture to operate at a significantly higher clock rate than the P6 micro-architecture. These differences also help to maintain, or come close to maintaining, the peak-to-sustained execution throughput.

Differences in Pentium III versus Pentium 4
1) NetBurst has a much deeper pipeline than P6. P6 requires about 10 clock cycles for a simple add instruction,
from fetch to the availability of its results. In comparison, NetBurst takes about 20 clock cycles, including cycles reserved simply to drive results across the chip.

2) NetBurst uses register renaming (as in the MIPS R10K and the Alpha 21264) rather than the reorder buffer used in P6. Register renaming allows many more outstanding results: potentially up to 128, versus the 40 permitted in P6.

3) There are seven integer execution units in NetBurst versus five in P6. The additions are one more integer ALU and one more address computation unit. An aggressive ALU (operating at twice the clock rate) and an aggressive data cache lead to lower latencies:
- The latency for basic ALU operations is effectively one half of a clock cycle in NetBurst versus one in P6.
- The latency for data loads is effectively two cycles in NetBurst versus three cycles in P6.
These high-speed functional units are critical to lowering the potential increase in stalls from the very deep pipeline.

4) NetBurst uses a sophisticated trace cache to improve instruction fetch performance, while P6 uses a conventional prefetch buffer and instruction cache.

5) NetBurst has a branch target buffer that is eight times larger and has an improved prediction algorithm.

6) NetBurst has a … KB Level-1 data cache, compared with P6, which has a 16KB Level-1
data cache. However, NetBurst has a larger Level-2 cache (256KB) with higher bandwidth.

7) NetBurst implements the new SSE2 FP instructions, which allow two FP operations per instruction. These operations are structured as a 128-bit SIMD or short-vector structure. This gives the Pentium 4 a considerable advantage over the Pentium III on FP code.

Summary

Allah Hafiz

...
Realistic branch and jump prediction: Fig 3.38…
The effect of finite registers: Fig 3.41…
... the instruction issues per cycle
Window size and maximum issue count: Fig 3.36…
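
Worked check of the P6 commit figures
The commit distribution quoted for the P6 pipeline can be checked against the statement that, on average, about one uop commits per cycle. The sketch below is illustrative only. The lecture states three numbers (55% of cycles commit nothing, 23% commit three uops, 31% commit two or three); the one-uop and two-uop shares are not stated and are derived here by subtraction, which is an assumption. The function names are hypothetical.

```python
# Fractions of cycles quoted in the lecture for the P6 pipeline.
ZERO = 0.55          # cycles in which no uops commit
THREE = 0.23         # cycles in which three uops commit
TWO_OR_THREE = 0.31  # cycles in which two or three uops commit
PEAK_COMMIT = 3      # the distribution covers zero to three uops per cycle


def commit_distribution():
    """Return {uops committed: fraction of cycles}.

    The shares for one and two uops are NOT given in the source;
    they are derived by subtraction (assumption).
    """
    two = TWO_OR_THREE - THREE
    one = 1.0 - ZERO - TWO_OR_THREE
    return {0: ZERO, 1: one, 2: two, 3: THREE}


def average_uops_per_cycle(dist):
    """Weighted average of uops committed per cycle."""
    return sum(n * frac for n, frac in dist.items())


dist = commit_distribution()
avg = average_uops_per_cycle(dist)
print(f"average uops committed per cycle: {avg:.2f}")
print(f"fraction of peak commit rate:     {avg / PEAK_COMMIT:.2f}")
```

Under these derived shares the weighted average comes to roughly one uop per cycle, consistent with the figure quoted in the lecture and equal to about a third of the three-uop peak.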