Advanced Computer Architecture - Lecture 17: Instruction-level parallelism. This lecture covers the following: high-performance instruction delivery and multiple issue; high-performance processors; branch target buffer; integrated instruction fetch units; return address predictors; multiple instruction-issue processors;...
CS 704 Advanced Computer Architecture
Lecture 17: Instruction Level Parallelism (High-Performance Instruction Delivery - Multiple Issue)
Prof Dr M Ashraf Chughtai

MAC/VU-Advanced Computer Architecture Lecture 17 – Instruction Level Parallelism -Dynamic (6)

Recap

High-Performance Processors

Reducing branch penalties for High-Performance Processors
- Branch Target Buffer
- Integrated Instruction Fetch Units
- Return Address Predictors

1: Branch Target Buffer

2: Integrated Instruction Fetch Units
- Integrated Branch Prediction
- Instruction Prefetch
- Instruction memory access and buffering

Integrated Branch Prediction
The branch predictor is part of the integrated instruction fetch unit (IIFU), so its predictions drive the fetch pipeline directly.

Instruction Prefetch
An instruction prefetch queue is part of the IIFU. The queue holds multiple instructions and can deliver more than one instruction per cycle.

Instruction memory access and buffering
- Fetching multiple instructions per clock cycle may require accessing multiple cache lines, which is a complex operation.
- The IIFU manages this complexity and hides the cost of crossing cache blocks.
- The IIFU also provides instruction buffering and on-demand issue.

3: Return Address Predictors
A return-address predictor predicts indirect jumps, i.e., jumps whose target address varies at run time. High-level language programs generate such jumps for indirect procedure calls and for select or case statements. Procedure returns are also indirect jumps of this kind, which is where the predictor gets its name.

Dynamic Scheduling in Superscalar … Cont’d
Modern superscalar processors issue four or more instructions per clock cycle and often include both approaches. In addition, branch prediction is integrated into the dynamically scheduled pipeline so that execution can proceed speculatively. This is referred to as hardware-based speculation.

Example
Let us consider a general 2-issue dynamically scheduled processor and see how a simple loop, which we considered earlier for single-issue Tomasulo, executes on this processor. Recall that our example loop adds a scalar in F2 to each element of a vector in memory.

    Loop: L.D     F0,0(R1)     ; F0 = array element
          ADD.D   F4,F0,F2     ; add scalar in F2
          S.D     F4,0(R1)     ; store result
          DADDUI  R1,R1,#-8    ; decrement pointer by 8 bytes (per DW)
          BNE     R1,R2,Loop   ; branch if R1 != R2

Let us create a table showing when each instruction issues, begins execution, and writes its result to the common data bus (CDB) for the first three iterations on the 2-issue version of Tomasulo’s pipeline.

Assume that:
- Both a floating-point and an integer operation can be issued in every clock cycle, even if they are dependent.
- One integer functional unit is used for both ALU operations and effective address calculations, and there is a separate pipelined FP functional unit for each operation type.
- Issue and write result take one cycle each.
- There is dynamic branch prediction hardware and a separate functional unit to evaluate branch conditions.
- Latencies are one cycle for integer ALU operations, two cycles for loads, and three cycles for FP add.

Let us look at the clock cycles of issue, execution, and writing results for the dual-issue version of Tomasulo’s pipeline.

Thus, with five instructions per iteration, sustaining one iteration every three cycles would lead to an IPC of 5/3 = 1.67 (5 instructions in 3 clocks).

The completion rate is 15/16 = 0.94 (15 instructions execute in 16 cycles).

Resource usage table

Another example: Overcoming the single integer pipe bottleneck
Now let us consider another example with the 2-issue version of Tomasulo's pipeline, this time removing the bottleneck of the single integer unit.

In this example, we consider the execution of the same loop as in the previous example, but on a 2-issue version of Tomasulo’s pipeline with a wider common data bus (2 CDBs).

As in the previous example, the activity table shows the clock cycles of issue, execution, and writing results for the dual-issue version of Tomasulo’s pipeline. Notice that this dual-issue Tomasulo pipeline has:
- separate functional units for integer ALU operations and effective address calculation; and
- a wider CDB.

Activity Table for Example

Summary

Aslam-u-Alacum and Allah Hafiz
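As a supplementary check on the two rates quoted for the first example (an IPC of 5/3 and a completion rate of 15/16), the arithmetic can be sketched in a few lines of Python. This is only an illustration: the helper name `rate` and the constant `INSTRUCTIONS_PER_ITERATION` are hypothetical names introduced here, and the cycle counts (3 and 16) are taken from the slides as given rather than derived by simulating the pipeline.

```python
from fractions import Fraction


def rate(instructions: int, cycles: int) -> Fraction:
    """Return instructions per clock cycle as an exact fraction."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return Fraction(instructions, cycles)


# The example loop body: L.D, ADD.D, S.D, DADDUI, BNE
INSTRUCTIONS_PER_ITERATION = 5

# Issue rate if one loop iteration is sustained every three cycles
# (the 3-cycle figure is quoted from the lecture, not computed here).
issue_ipc = rate(INSTRUCTIONS_PER_ITERATION, 3)

# Completion rate: three iterations (15 instructions) executing in 16 cycles
# (the 16-cycle figure is likewise quoted from the lecture).
completion_rate = rate(3 * INSTRUCTIONS_PER_ITERATION, 16)

print(f"issue IPC       = {issue_ipc} = {float(issue_ipc):.2f}")
print(f"completion rate = {completion_rate} = {float(completion_rate):.2f}")
```

Keeping the values as exact fractions makes the gap between the two figures easy to see: the issue rate counts how fast instructions enter the pipeline, while the completion rate counts how fast they actually execute, which is the lower number in this example.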