CHAPTER 16 / INSTRUCTION-LEVEL PARALLELISM AND SUPERSCALAR PROCESSORS

16.1 Overview
     Superscalar versus Superpipelined
     Constraints
16.2 Design Issues
     Instruction-Level Parallelism and Machine Parallelism
     Instruction Issue Policy
     Register Renaming
     Machine Parallelism
     Branch Prediction
     Superscalar Execution
     Superscalar Implementation
16.3 Pentium 4
     Front End
     Out-of-Order Execution Logic
     Integer and Floating-Point Execution Units
16.4 ARM Cortex-A8
     Instruction Fetch Unit
     Instruction Decode Unit
     Integer Execute Unit
     SIMD and Floating-Point Pipeline
16.5 Recommended Reading
16.6 Key Terms, Review Questions, and Problems

LEARNING OBJECTIVES

After studying this chapter, you should be able to:

♦ Explain the difference between superscalar and superpipelined approaches.
♦ Define instruction-level parallelism.
♦ Discuss dependencies and resource conflicts as limitations to instruction-level parallelism.
♦ Present an overview of the design issues involved in instruction-level parallelism.
♦ Compare and contrast techniques of improving pipeline performance in RISC machines and superscalar machines.

A superscalar implementation of a processor architecture is one in which common instructions—integer and floating-point arithmetic, loads, stores, and conditional branches—can be initiated simultaneously and executed independently. Such implementations raise a number of complex design issues related to the instruction pipeline.

Superscalar design arrived on the scene hard on the heels of RISC architecture. Although the simplified instruction set architecture of a RISC machine lends itself readily to superscalar techniques, the superscalar approach can be used on either a RISC or CISC architecture. Whereas the gestation period for the arrival of commercial RISC machines from the beginning of true RISC research with the IBM 801 and the Berkeley RISC I was seven or eight years, the first superscalar machines became commercially available
within just a year or two of the coining of the term superscalar. The superscalar approach has now become the standard method for implementing high-performance microprocessors.

In this chapter, we begin with an overview of the superscalar approach, contrasting it with superpipelining. Next, we present the key design issues associated with superscalar implementation. Then we look at several important examples of superscalar architecture.

16.1 OVERVIEW

The term superscalar, first coined in 1987 [AGER87], refers to a machine that is designed to improve the performance of the execution of scalar instructions. In most applications, the bulk of the operations are on scalar quantities. Accordingly, the superscalar approach represents the next step in the evolution of high-performance general-purpose processors.

The essence of the superscalar approach is the ability to execute instructions independently and concurrently in different pipelines. The concept can be further exploited by allowing instructions to be executed in an order different from the program order.

Figure 16.1 compares, in general terms, the scalar and superscalar approaches. In a traditional scalar organization, there is a single pipelined functional unit for integer operations and one for floating-point operations. Parallelism is achieved by enabling multiple instructions to be at different stages of the pipeline at one time.

[Figure 16.1 Superscalar Organization Compared to Ordinary Scalar Organization: (a) a scalar organization with a pipelined integer functional unit and a pipelined floating-point functional unit; (b) a superscalar organization with multiple pipelined integer and floating-point functional units]

In the superscalar organization, there are multiple functional units, each of which is implemented as a pipeline. Each individual functional unit provides a degree of parallelism by virtue of its pipelined structure. The use of multiple functional units
enables the processor to execute streams of instructions in parallel, one stream for each pipeline. It is the responsibility of the hardware, in conjunction with the compiler, to assure that the parallel execution does not violate the intent of the program.

Many researchers have investigated superscalar-like processors, and their research indicates that some degree of performance improvement is possible. Table 16.1 presents the reported performance advantages. The differences in the results arise from differences both in the hardware of the simulated machine and in the applications being simulated.

Table 16.1 Reported Speedups of Superscalar-Like Machines

Reference    Speedup
[TJAD70]     1.8
[KUCK77]     8
[WEIS84]     1.58
[ACOS86]     2.7
[SOHI90]     1.8
[SMIT89]     2.3
[JOUP89b]    2.2
[LEE91]      7

Superscalar versus Superpipelined

An alternative approach to achieving greater performance is referred to as superpipelining, a term first coined in 1988 [JOUP88]. Superpipelining exploits the fact that many pipeline stages perform tasks that require less than half a clock cycle. Thus, a doubled internal clock speed allows the performance of two tasks in one external clock cycle. We have seen one example of this approach with the MIPS R4000.

Figure 16.2 compares the two approaches. The upper part of the diagram illustrates an ordinary pipeline, used as a base for comparison. The base pipeline issues one instruction per clock cycle and can perform one pipeline stage per clock cycle. The pipeline has four stages: instruction fetch, operation decode, operation execution, and result write back. The execution stage is crosshatched for clarity. Note that although several instructions are executing concurrently, only one instruction is in its execution stage at any one time.

[Figure 16.2 Comparison of Superscalar and Superpipeline Approaches. Key: Ifetch, Decode, Execute, Write; the horizontal axis is time in base cycles]
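The contrast that Figure 16.2 draws, and that the next paragraphs walk through, can be condensed into a toy timing model. The four-stage latency and the degree-2 parameters come from the text; the no-dependency, no-branch assumption and the exact cycle arithmetic are my own simplifications.

```python
def completion_time(i, design):
    """Base-clock cycle at which instruction i (0-based) finishes
    write-back, assuming no dependencies, branches, or conflicts."""
    if design == "base":
        return i + 4            # one issue per cycle, four 1-cycle stages
    if design == "superpipelined":
        return 0.5 * i + 4      # degree 2: issue every half cycle, same latency
    if design == "superscalar":
        return i // 2 + 4       # degree 2: two issues per cycle
    raise ValueError(design)

# Instruction 1 already shows the superpipelined design trailing the
# superscalar one (4.5 vs. 4 cycles), the "falls behind at the start of
# the program" effect; in the steady state, the two degree-2 designs
# sustain the same throughput.
```

The same lag reappears at every branch target, since each restart of the instruction stream replays this start-up transient.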
The next part of the diagram shows a superpipelined implementation that is capable of performing two pipeline stages per clock cycle. An alternative way of looking at this is that the functions performed in each stage can be split into two nonoverlapping parts, and each can execute in half a clock cycle. A superpipeline implementation that behaves in this fashion is said to be of degree 2. Finally, the lowest part of the diagram shows a superscalar implementation capable of executing two instances of each stage in parallel. Higher-degree superpipeline and superscalar implementations are of course possible.

Both the superpipeline and the superscalar implementations depicted in Figure 16.2 have the same number of instructions executing at the same time in the steady state. The superpipelined processor falls behind the superscalar processor at the start of the program and at each branch target.

Constraints

The superscalar approach depends on the ability to execute multiple instructions in parallel. The term instruction-level parallelism refers to the degree to which, on average, the instructions of a program can be executed in parallel. A combination of compiler-based optimization and hardware techniques can be used to maximize instruction-level parallelism.

[Figure 16.3 Effect of Dependencies in a superscalar machine of degree 2: no dependency; data dependency (i1 uses data computed by i0); procedural dependency (i1 is a branch); resource conflict (i0 and i1 use the same functional unit). The horizontal axis is time in base cycles]

Before examining the design techniques used in superscalar machines to increase instruction-level parallelism, we need to look at the fundamental limitations to parallelism with which the system must cope. [JOHN91] lists five limitations:

• True data dependency
• Procedural dependency
• Resource conflicts
• Output dependency
• Antidependency

We examine the first three of these limitations in the remainder of this section. A discussion of the last two must await some of the developments in
the next section.

TRUE DATA DEPENDENCY  Consider the following sequence:

ADD EAX, ECX  ;load register EAX with the contents of ECX
              ;plus the contents of EAX
MOV EBX, EAX  ;load EBX with the contents of EAX

Note: for the Intel x86 assembly language, a semicolon starts a comment field.

The second instruction can be fetched and decoded but cannot execute until the first instruction executes. The reason is that the second instruction needs data produced by the first instruction. This situation is referred to as a true data dependency (also called flow dependency or read after write [RAW] dependency).

Figure 16.3 illustrates this dependency in a superscalar machine of degree 2. With no dependency, two instructions can be fetched and executed in parallel. If there is a data dependency between the first and second instructions, then the second instruction is delayed as many clock cycles as required to remove the dependency. In general, any instruction must be delayed until all of its input values have been produced.

In a simple pipeline, such as illustrated in the upper part of Figure 16.2, the aforementioned sequence of instructions would cause no delay. However, consider the following, in which one of the loads is from memory rather than from a register:

MOV EAX, eff  ;load register EAX with the contents
              ;of effective memory address eff
MOV EBX, EAX  ;load EBX with the contents of EAX

A typical RISC processor takes two or more cycles to perform a load from memory when the load is a cache hit. It can take tens or even hundreds of cycles for a cache miss on all cache levels, because of the delay of an off-chip memory access. One way to compensate for this delay is for the compiler to reorder instructions so that one or more subsequent instructions that do not depend on the memory load can begin flowing through the pipeline. This scheme is less effective in the case of a superscalar pipeline: The
independent instructions executed during the load are likely to be executed on the first cycle of the load, leaving the processor with nothing to do until the load completes.

PROCEDURAL DEPENDENCIES  As was discussed in Chapter 14, the presence of branches in an instruction sequence complicates the pipeline operation. The instructions following a branch (taken or not taken) have a procedural dependency on the branch and cannot be executed until the branch is executed. Figure 16.3 illustrates the effect of a branch on a superscalar pipeline of degree 2.

As we have seen, this type of procedural dependency also affects a scalar pipeline. The consequence for a superscalar pipeline is more severe, because a greater magnitude of opportunity is lost with each delay.

If variable-length instructions are used, then another sort of procedural dependency arises. Because the length of any particular instruction is not known, it must be at least partially decoded before the following instruction can be fetched. This prevents the simultaneous fetching required in a superscalar pipeline. This is one of the reasons that superscalar techniques are more readily applicable to a RISC or RISC-like architecture, with its fixed instruction length.

RESOURCE CONFLICT  A resource conflict is a competition of two or more instructions for the same resource at the same time. Examples of resources include memories, caches, buses, register-file ports, and functional units (e.g., ALU adder).

In terms of the pipeline, a resource conflict exhibits similar behavior to a data dependency (Figure 16.3). There are some differences, however. For one thing, resource conflicts can be overcome by duplication of resources, whereas a true data dependency cannot be eliminated. Also, when an operation takes a long time to complete, resource conflicts can be minimized by pipelining the appropriate functional unit.

16.2 DESIGN ISSUES
Instruction-Level Parallelism and Machine Parallelism

[JOUP89a] makes an important distinction between the two related concepts of instruction-level parallelism and machine parallelism. Instruction-level parallelism exists when instructions in a sequence are independent and thus can be executed in parallel by overlapping.

As an example of the concept of instruction-level parallelism, consider the following two code fragments [JOUP89b]. The three instructions on the left are independent, and in theory all three could be executed in parallel. In contrast, the three instructions on the right cannot be executed in parallel because the second instruction uses the result of the first, and the third instruction uses the result of the second.

The degree of instruction-level parallelism is determined by the frequency of true data dependencies and procedural dependencies in the code. These factors, in turn, are dependent on the instruction set architecture and on the application. Instruction-level parallelism is also determined by what [JOUP89a] refers to as operation latency: the time until the result of an instruction is available for use as an operand in a subsequent instruction. The latency determines how much of a delay a data or procedural dependency will cause.

Machine parallelism is a measure of the ability of the processor to take advantage of instruction-level parallelism. Machine parallelism is determined by the number of instructions that can be fetched and executed at the same time (the number of parallel pipelines) and by the speed and sophistication of the mechanisms that the processor uses to find independent instructions.

Both instruction-level and machine parallelism are important factors in enhancing performance. A program may not have enough instruction-level parallelism to take full advantage of machine parallelism. The use of a fixed-length instruction set architecture, as in a RISC, enhances instruction-level parallelism. On the other hand, limited machine
parallelism will limit performance no matter what the nature of the program.

Instruction Issue Policy

As was mentioned, machine parallelism is not simply a matter of having multiple instances of each pipeline stage. The processor must also be able to identify instruction-level parallelism and orchestrate the fetching, decoding, and execution of instructions in parallel. [JOHN91] uses the term instruction issue to refer to the process of initiating instruction execution in the processor's functional units and the term instruction issue policy to refer to the protocol used to issue instructions. In general, we can say that instruction issue occurs when an instruction moves from the decode stage of the pipeline to the first execute stage of the pipeline.

In essence, the processor is trying to look ahead of the current point of execution to locate instructions that can be brought into the pipeline and executed. Three types of orderings are important in this regard:

• The order in which instructions are fetched
• The order in which instructions are executed
• The order in which instructions update the contents of register and memory locations

The more sophisticated the processor, the less it is bound by a strict relationship between these orderings. To optimize utilization of the various pipeline elements, the processor will need to alter one or more of these orderings with respect to the ordering to be found in a strict sequential execution. The one constraint on the processor is that the result must be correct. Thus, the processor must accommodate the various dependencies and conflicts discussed earlier.

In general terms, we can group superscalar instruction issue policies into the following categories:

• In-order issue with in-order completion
• In-order issue with out-of-order completion
• Out-of-order issue with out-of-order completion

IN-ORDER ISSUE WITH IN-ORDER COMPLETION  The
simplest instruction issue policy is to issue instructions in the exact order that would be achieved by sequential execution (in-order issue) and to write results in that same order (in-order completion). Not even scalar pipelines follow such a simple-minded policy. However, it is useful to consider this policy as a baseline for comparing more sophisticated approaches.

Figure 16.4a gives an example of this policy. We assume a superscalar pipeline capable of fetching and decoding two instructions at a time, having three separate functional units (e.g., two integer arithmetic and one floating-point arithmetic), and having two instances of the write-back pipeline stage. The example assumes the following constraints on a six-instruction code fragment:

• I1 requires two cycles to execute.
• I3 and I4 conflict for the same functional unit.
• I5 depends on the value produced by I4.
• I5 and I6 conflict for a functional unit.

Instructions are fetched two at a time and passed to the decode unit. Because instructions are fetched in pairs, the next two instructions must wait until the pair of decode pipeline stages has cleared. To guarantee in-order completion, when there is a conflict for a functional unit or when a functional unit requires more than one cycle to generate a result, the issuing of instructions temporarily stalls. In this example, the elapsed time from decoding the first instruction to writing the last results is eight cycles.

IN-ORDER ISSUE WITH OUT-OF-ORDER COMPLETION  Out-of-order completion is used in scalar RISC processors to improve the performance of instructions that require multiple cycles. Figure 16.4b illustrates its use on a superscalar processor. Instruction I2 is allowed to run to completion prior to I1. This allows I3 to be completed earlier, with the net result of a savings of one cycle.

With out-of-order completion, any number of instructions may be in the execution stage at any one time, up to the maximum degree of machine parallelism across all functional
units. Instruction issuing is stalled by a resource conflict, a data dependency, or a procedural dependency.

In addition to the aforementioned limitations, a new dependency, which we referred to earlier as an output dependency (also called write after write [WAW] dependency), arises. The following code fragment illustrates this dependency (op represents any operation):

I1: R3 ← R3 op R5
I2: R4 ← R3 + 1
I3: R3 ← R5 + 1
I4: R7 ← R3 op R4

[Figure 16.4 Superscalar Instruction Issue and Completion Policies: (a) in-order issue and in-order completion; (b) in-order issue and out-of-order completion; (c) out-of-order issue and out-of-order completion, with an instruction window]

Instruction I2 cannot execute before instruction I1, because it needs the result in R3 produced by I1.

References

INTE01a Intel Corp. Pentium 4 Processor Optimization Reference Manual. Document 24896604, 2001. http://developer.intel.com/design/Pentium4/documentation.htm
INTE01b Intel Corp. Desktop Performance and Optimization for Intel Pentium 4 Processor. 2001. http://developer.intel.com/design/Pentium4/documentation.htm
INTE04a Intel Corp. IA-32 Intel Architecture Software Developer's Manual (4 volumes). Documents 253665 through 253668, 2004. http://developer.intel.com/design/Pentium4/documentation.htm
JOHN08 John, E., and Rubio, J. Unique Chips and Systems. Boca Raton, FL: CRC Press, 2008.
JOUP89a Jouppi, N., and Wall, D. "Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines." Proceedings, Third International Conference on Architectural Support for Programming Languages and Operating Systems, April 1989.
KUGA91 Kuga, M.; Murakami, K.; and Tomita, S. "DSNS (Dynamically-hazard-resolved, Statically-code-scheduled, Nonuniform Superscalar): Yet Another Superscalar Processor Architecture." Computer Architecture News, June 1991.
LEE91 Lee, R.; Kwok, A.; and Briggs, F. "The Floating Point Performance of a Superscalar SPARC Processor." Proceedings, Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, April 1991.
MOSH01 Moshovos, A., and Sohi, G. "Microarchitectural Innovations: Boosting Microprocessor Performance Beyond Semiconductor Technology Scaling." Proceedings of the IEEE, November 2001.
OMON99 Omondi, A. The Microarchitecture of Pipelined and Superscalar Computers. Boston: Kluwer, 1999.
PATT01 Patt, Y. "Requirements, Bottlenecks, and Good Fortune: Agents for Microprocessor Evolution." Proceedings of the IEEE, November 2001.
POPE91 Popescu, V., et al. "The Metaflow Architecture." IEEE Micro, June 1991.
RICH07 Riches, S., et al. "A Fully Automated High Performance Implementation of ARM Cortex-A8." IQ Online, Vol. 6, No. 3, 2007. www.arm.com/iqonline
SHEN05 Shen, J., and Lipasti, M. Modern Processor Design: Fundamentals of Superscalar Processors. New York: McGraw-Hill, 2005.
SIMA97 Sima, D. "Superscalar Instruction Issue." IEEE Micro, September/October 1997.
SIMA04 Sima, D. "Decisive Aspects in the Evolution of Microprocessors." Proceedings of the IEEE, December 2004.
SMIT95 Smith, J., and Sohi, G. "The Microarchitecture of Superscalar Processors." Proceedings of the IEEE, December 1995.
WALL91 Wall, D. "Limits of Instruction-Level Parallelism." Proceedings, Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, April 1991.

16.6 KEY TERMS, REVIEW QUESTIONS, AND PROBLEMS

Key Terms

antidependency
branch prediction
commit
flow dependency
in-order completion
in-order issue
instruction issue
instruction-level parallelism
machine parallelism
register renaming
instruction window
micro-operations
micro-ops
out-of-order completion
out-of-order issue
output dependency
procedural dependency
read-write dependency
resource conflict
retire
superpipelined
superscalar
true data dependency
write-read dependency
write-write dependency

Review Questions

16.1 What is the essential characteristic of the superscalar approach to processor design?
16.2 What is the difference between the superscalar and superpipelined approaches?
16.3 What is instruction-level parallelism?
16.4 Briefly define the following terms:
• True data dependency
• Procedural dependency
• Resource conflicts
• Output dependency
• Antidependency
16.5 What is the distinction between instruction-level parallelism and machine parallelism?
16.6 List and briefly define three types of superscalar instruction issue policies.
16.7 What is the purpose of an instruction window?
16.8 What is register renaming and what is its purpose?
16.9 What are the key elements of a superscalar processor organization?
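Register renaming (Review Question 16.8) can be made concrete with a minimal sketch: every write to an architectural register allocates a fresh name, so antidependencies (WAR) and output dependencies (WAW) disappear and only true (RAW) dependencies remain. The suffix convention (subscript "a" for an initial value, then "b", "c", ... for successive writes) and the data layout are illustrative choices, not any particular processor's scheme.

```python
def rename(prog):
    """prog: list of (dest, srcs) over architectural registers.
    Returns the same program over renamed registers: each write to a
    register allocates a fresh suffixed name ("b", "c", ...), and each
    read refers to the most recent write ("a" is the initial value)."""
    suffix = {}                                  # arch register -> current suffix
    out = []
    for dest, srcs in prog:
        new_srcs = [r + suffix.get(r, "a") for r in srcs]  # read current names
        suffix[dest] = chr(ord(suffix.get(dest, "a")) + 1)  # fresh name for write
        out.append((dest + suffix[dest], new_srcs))
    return out

# Two successive writes to R1 no longer collide after renaming:
# rename([("R1", ["R2"]), ("R1", ["R3"])])
#   -> [("R1b", ["R2a"]), ("R1c", ["R3a"])]
```

Because every destination name is unique, an out-of-order machine can complete the renamed instructions in any order that respects the remaining true dependencies.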
Problems

16.1 When out-of-order completion is used in a superscalar processor, resumption of execution after interrupt processing is complicated, because the exceptional condition may have been detected on an instruction that produced its result out of order. The program cannot be restarted at the instruction following the exceptional instruction, because subsequent instructions have already completed, and doing so would cause these instructions to be executed twice. Suggest a mechanism or mechanisms for dealing with this situation.

16.2 Consider the following sequence of instructions, where the syntax consists of an opcode followed by the destination register followed by one or two source registers:

0:  ADD  R3, R1, R2
1:  LOAD R6, [R3]
2:  AND  R7, R5, 3
3:  ADD  R1, R6, R0
4:  SRL  R7, R0, 8
5:  OR   R2, R4, R7
6:  SUB  R5, R3, R4
7:  ADD  R0, R1, R10
8:  LOAD R6, [R5]
9:  SUB  R2, R1, R6
10: AND  R3, R7, 15

Assume the use of a four-stage pipeline: fetch, decode/issue, execute, write back. Assume that all pipeline stages take one clock cycle except for the execute stage. For simple integer arithmetic and logical instructions, the execute stage takes one cycle, but for a LOAD from memory, five cycles are consumed in the execute stage.

If we have a simple scalar pipeline but allow out-of-order execution, we can construct the following table for the execution of the first seven instructions:

[Table showing the fetch, decode, execute, and write-back cycles of instructions 0 through 6; not reproduced in this copy]

The entries under the four pipeline stages indicate the clock cycle at which each instruction begins each phase. In this program, the second ADD instruction (instruction 3) depends on the LOAD instruction (instruction 1) for one of its operands, R6. Because the LOAD instruction takes five clock cycles, and the issue logic encounters the dependent ADD instruction after two clocks, the issue logic must delay the ADD instruction for three clock cycles. With an out-of-order capability, the processor can stall instruction 3 at clock cycle 4, and then move on to issue the
following three independent instructions, which enter execution at clocks 6, 8, and 9. The LOAD finishes execution at clock 9, and so the dependent ADD can be launched into execution on clock 10.

a. Complete the preceding table.
b. Redo the table assuming no out-of-order capability. What is the savings using the capability?
c. Redo the table assuming a superscalar implementation that can handle two instructions at a time at each stage.

16.3 Consider the following assembly language program:

I1: Move R3, R7      /R3 ← (R7)/
I2: Load R8, (R3)    /R8 ← Memory((R3))/
I3: Add  R3, R3, 4   /R3 ← (R3) + 4/
I4: Load R9, (R3)    /R9 ← Memory((R3))/
I5: BLE  R8, R9, L3  /Branch if (R9) > (R8)/

This program includes WAW, RAW, and WAR dependencies. Show these.

16.4 a. Identify the RAW, WAR, and WAW dependencies in the following instruction sequence:

I1: R1 = 100
I2: R1 = R2 + R4
I3: R2 = R4 - 25
I4: R4 = R1 + R3
I5: R1 = R1 + 30

b. Rename the registers from part (a) to prevent dependency problems. Identify references to initial register values using the subscript "a" to the register reference.

16.5 Consider the "in-order-issue/in-order-completion" execution sequence shown in Figure 16.14.

[Figure 16.14 An In-Order Issue, In-Order-Completion Execution Sequence]

a. Identify the most likely reason why I2 could not enter the execute stage until the fourth cycle. Will "in-order issue/out-of-order completion" or "out-of-order issue/out-of-order completion" fix this? If so, which?
b. Identify the reason why I6 could not enter the write stage until the ninth cycle. Will "in-order issue/out-of-order completion" or "out-of-order issue/out-of-order completion" fix this? If so, which?
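A pairwise hazard check of the kind Problems 16.3 and 16.4 call for can be sketched as follows. The (dest, sources) encoding is mine, and the check deliberately ignores intervening redefinitions, so it over-reports RAW pairs; treat it as a checking aid rather than a full dependence analysis.

```python
def classify_hazards(prog):
    """prog: list of (dest, srcs) tuples in program order.
    Returns {(i, j): kinds} for every hazarded pair with i < j."""
    hazards = {}
    for i, (di, si) in enumerate(prog):
        for j in range(i + 1, len(prog)):
            dj, sj = prog[j]
            kinds = []
            if di in sj:
                kinds.append("RAW")   # j reads what i wrote
            if dj in si:
                kinds.append("WAR")   # j overwrites what i read
            if dj == di:
                kinds.append("WAW")   # both write the same register
            if kinds:
                hazards[(i, j)] = kinds
    return hazards

# Problem 16.4's sequence, I1..I5 encoded as indices 0..4:
seq = [("R1", []), ("R1", ["R2", "R4"]), ("R2", ["R4"]),
       ("R4", ["R1", "R3"]), ("R1", ["R1"])]
# classify_hazards(seq)[(0, 1)] -> ["WAW"]; pair (1, 3) is both RAW and WAR.
```

Applying the renaming of part (b) to `seq` would leave only the RAW entries, which is the point of that exercise.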
16.6 Figure 16.15 shows an example of a superscalar processor organization. The processor can issue two instructions per cycle if there is no resource conflict and no data dependence problem. There are essentially two pipelines, with four processing stages (fetch, decode, execute, and store). Each pipeline has its own fetch, decode, and store unit. Four functional units (multiplier, adder, logic unit, and load unit) are available for use in the execute stage and are shared by the two pipelines on a dynamic basis. The two store units can be dynamically used by the two pipelines, depending on availability at a particular cycle. There is a lookahead window with its own fetch and decoding logic. This window is used for instruction lookahead for out-of-order instruction issue.

[Figure 16.15 A Dual-Pipeline Superscalar Processor]

Consider the following program to be executed on this processor:

I1: Load R1, A   /R1 ← Memory(A)/
I2: Add  R2, R1  /R2 ← (R2) + (R1)/
I3: Add  R3, R4  /R3 ← (R3) + (R4)/
I4: Mul  R4, R5  /R4 ← (R4) × (R5)/
I5: Comp R6      /R6 ← complement of (R6)/
I6: Mul  R6, R7  /R6 ← (R6) × (R7)/

a. What dependencies exist in the program?
b. Show the pipeline activity for this program on the processor of Figure 16.15 using in-order issue with in-order completion policies, and using a presentation similar to Figure 16.2.
c. Repeat for in-order issue with out-of-order completion.
d. Repeat for out-of-order issue with out-of-order completion.

16.7 Figure 16.16 is from a paper on superscalar design. Explain the three parts of the figure, and define w, x, y, and z.

[Figure 16.16: three parts (a), (b), and (c), each showing connections from w and to x, y, and z]

16.8 Yeh's dynamic branch prediction algorithm, used on the Pentium 4, is a two-level branch prediction algorithm. The first level is the history of the last n branches. The second level is the branch behavior of the last s occurrences of that unique pattern of the last n branches. For each conditional branch instruction in a program, there is an entry in a Branch History Table (BHT). Each entry consists of n bits corresponding to the last n executions of the branch instruction, with a 1 if the branch was taken
Yeh tried five different predic- tion schemes, illustrated in Figure 16.17 Identify which three of these schemes correspond to those shown in Figures 14.19 and 14.28 Describe the remaining two schemes b With this algorithm, the prediction is not based on just the recent history of this par- ticular branch instruction Rather, it is based on the recent history of all patterns of branches that match the n-bit pattern in the BHT entry for this instruction Suggest a rationale for such a strategy This page intentionally left blan ... machines and superscalar machines A superscalar implementation of a processor architecture is one in which common instructions—integer and floating-point arithmetic, loads, stores, and conditional... 801 and the Berkeley RISC I was seven or eight years, the first superscalar machines became commer- cially available within just a year or two of the coining of the term superscalar The superscalar. .. behind the superscalar in 15 processor at the start of the program and at each branch target Constraints Resource conflict I execute The superscalar approach depends on the ability to (i0 and i1

Posted: 12/11/2019, 13:29
