Chapter 4.2: Pipelined Processor

COMPUTER ARCHITECTURE
BK TP.HCM
Faculty of Computer Science and Engineering
Department of Computer Engineering
Vo Tan Phuong
http://www.cse.hcmut.edu.vn/~vtphuong

Chapter 4.2
Pipelined Processor Design

Computer Architecture – Chapter 4.2 ©Fall 2017, CSE

Outline
- Pipelined versus sequential execution
- Pipelined datapath and control
- Pipeline hazards
- Data hazards and forwarding
- Load delay, hazard detection, and stalling
- Control hazards

Pipelining Example
- Laundry service with 3 steps: wash, dry, fold
- Each step takes 30 minutes
- There are 4 loads: A, B, C, D

Sequential Approach
[Figure: evening timeline in 30-minute slots ending at 12 AM; loads A, B, C, D are processed one after another]
- 6 hours are needed to complete the 4 loads
- Clearly, this approach can be improved

Applying Pipelining
[Figure: timeline in 30-minute slots; loads A, B, C, D overlap, each starting one slot after the previous one]
- 3 hours are needed for the 4 loads
- Twice as fast for 4 loads
- The time to process a single load is unchanged (90 minutes)

Pipeline Performance
- Each job needs k stages
- Let $t_i$ be the time of stage $S_i$
- Clock cycle $t = \max(t_i)$ is the time of the longest stage
- Clock frequency $f = 1/t = 1/\max(t_i)$
- Time to process n jobs = $(k + n - 1) \times t$
  - k cycles to complete the first job
  - the remaining n - 1 cycles complete the remaining n - 1 jobs
- Ideal speedup:
  $S_k = \frac{\text{cycles without pipelining}}{\text{cycles with pipelining}} = \frac{nk}{k + n - 1}$
  $S_k \to k$ when n is large

Pipelined MIPS Processor
- Consists of 5 stages, each stage taking one cycle
  - IF: Instruction Fetch
  - ID: Instruction Decode
  - EX: Execute (perform the operation)
  - MEM: Memory access (access data memory)
  - WB: Write Back (write the result to the destination register)

Single-Cycle versus Pipelined
- Assume the stages of instruction execution have the following latencies:
  - Instruction fetch (IF) = ALU operation (ALU) = data memory access (MEM) = 200 ps
  - Register read (RegR) = register write (RegW) = 150 ps
- What is the cycle time of the single-cycle processor (Ts)?
- What is the cycle time of the pipelined processor (Tp)?
- What is the speedup?
- Solution: Ts = 200 + 150 + 200 + 200 + 150 = 900 ps
[Figure: two consecutive instructions, each passing through IF, Reg, ALU, MEM, Reg in one 900 ps cycle]

Single-Cycle versus Pipelined (continued)
- Tp = max(200, 150) = 200 ps
[Figure: three overlapped instructions, each passing through IF, Reg, ALU, MEM, Reg, with every stage taking 200 ps]
- CPI of the pipelined processor = 1
  - Assuming the number of instructions is large
- Speedup of the pipelined processor = 900 ps / 200 ps = 4.5
  - IC and CPI are the same in both cases
- The speedup is less than 5 (the number of stages)
  - Because the stage times are not balanced

Load Delay
- Unfortunately, not all data hazards can be resolved by forwarding
- A load has a delay that cannot be eliminated by forwarding
- In the example shown below:
  - The LW instruction does not read the data from memory until the end of CC4
  - Forwarding that data to ADD at the end of CC3 is NOT possible
- However, a load can forward data to the 2nd next and later instructions

Program order          CC1    CC2    CC3    CC4    CC5    CC6    CC7    CC8
lw  $s2, 20($t1)       IF     Reg    ALU    DM     Reg
add $s4, $s2, $t5             IF     Reg    ALU    DM     Reg
or  $t6, $t3, $s2                    IF     Reg    ALU    DM     Reg
and $t7, $s2, $t4                           IF     Reg    ALU    DM     Reg

Detecting RAW Hazard after Load
- Detecting a RAW hazard after a load instruction:
  - The load instruction will be in the EX stage
  - The instruction that depends on the load data is in the Decode stage
- Condition for stalling the pipeline:

    if ((EX.MemRead == 1)                     // Detect load in EX stage
        and (ForwardA == 1 or ForwardB == 1))
      Stall                                   // RAW hazard

- Insert a bubble into the EX stage after a load instruction
  - A bubble is a no-op that wastes one clock cycle
  - It delays the dependent instruction after the load by one cycle, because of the RAW hazard
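The stall condition above can also be sketched in software. The Python below is a minimal sketch under stated assumptions, not the hardware logic from the slides: the `Instr` record and the names `must_stall` and `count_load_stalls` are hypothetical, and the sketch compares register names directly instead of reading the `ForwardA`/`ForwardB` signals. It is applied to the four-instruction Load Delay example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record used only for this illustration."""
    op: str                  # mnemonic, e.g. "lw" or "add"
    dest: Optional[str]      # register written, or None (e.g. for sw)
    srcs: Tuple[str, ...]    # registers read


def must_stall(in_ex: Optional[Instr], in_id: Optional[Instr]) -> bool:
    """Return True when the instruction in ID must wait one cycle.

    Assumption: "op == lw" stands in for EX.MemRead == 1, and the register
    comparison stands in for the ForwardA/ForwardB match in the slide.
    """
    if in_ex is None or in_id is None:
        return False
    if in_ex.op != "lw":
        return False
    if in_ex.dest is None or in_ex.dest in ("$zero", "$0"):
        return False
    return in_ex.dest in in_id.srcs


def count_load_stalls(program: Sequence[Instr]) -> int:
    """Count one stall for every load immediately followed by a reader."""
    return sum(1 for prev, cur in zip(program, program[1:])
               if must_stall(prev, cur))


# The Load Delay example: only the add directly after the lw has to stall.
load_delay_example = [
    Instr("lw", "$s2", ("$t1",)),
    Instr("add", "$s4", ("$s2", "$t5")),
    Instr("or", "$t6", ("$t3", "$s2")),
    Instr("and", "$t7", ("$s2", "$t4")),
]

print(count_load_stalls(load_delay_example))  # prints 1
```

Only adjacent pairs are checked because, as the Load Delay slide notes, a load can still forward its data to the 2nd next and later instructions.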
Stall the Pipeline for One Cycle
- The ADD instruction depends on LW, so it stalls at CC3
  - Allow the load instruction in the ALU stage to proceed
  - Freeze the PC and Instruction registers (NO instruction is fetched)
  - Introduce a bubble into the ALU stage (a bubble is a NO-OP)
- The load can forward data to the next instruction after delaying it

Program order          CC1    CC2    CC3    CC4     CC5     CC6     CC7    CC8
lw  $s2, 20($s1)       IM     Reg    ALU    DM      Reg
(bubble)                                    bubble  bubble  bubble
add $s4, $s2, $t5             IM     stall  Reg     ALU     DM      Reg
or  $t6, $s3, $s2                           IM      Reg     ALU     DM     Reg

Showing Stall Cycles
- Stall cycles can be shown on an instruction-time diagram
- The hazard is detected in the Decode stage
- Stall indicates that the instruction is delayed
- Instruction fetching is also delayed after a stall
- Example (in the original figure, data forwarding is shown using green arrows):

Instruction            CC1    CC2    CC3    CC4    CC5    CC6    CC7    CC8    CC9    CC10
lw  $s1, ($t5)         IF     ID     EX     MEM    WB
lw  $s2, 8($s1)               IF     Stall  ID     EX     MEM    WB
add $v0, $s2, $t3                           IF     Stall  ID     EX     MEM    WB
sub $v1, $s2, $v0                                         IF     ID     EX     MEM    WB

Hazard Detect, Forward, and Stall
[Figure: pipelined datapath (PC, Instruction register, Register File, ALU, Data Memory) with a "Hazard Detect, Forward, & Stall" unit. Labels recoverable from the figure: MemRead, RegWrite, RegDst, ForwardA, ForwardB, Stall, Disable PC, Bubble = 0, and the EX, MEM, and WB control signals produced by Main & ALU Control.]

Code Scheduling to Avoid Stalls
- Compilers reorder code in a way that avoids load stalls
- Consider the translation of the following statements:

    A = B + C; D = E – F;   // A through F are in memory

- Slow code:

    lw   $t0, 4($s0)        # &B = 4($s0)
    lw   $t1, 8($s0)        # &C = 8($s0)
    add  $t2, $t0, $t1      # stall cycle
    sw   $t2, 0($s0)        # &A = 0($s0)
    lw   $t3, 16($s0)       # &E = 16($s0)
    lw   $t4, 20($s0)       # &F = 20($s0)
    sub  $t5, $t3, $t4      # stall cycle
    sw   $t5, 12($s0)       # &D = 12($s0)

- Fast code (no stalls):

    lw   $t0, 4($s0)
    lw   $t1, 8($s0)
    lw   $t3, 16($s0)
    lw   $t4, 20($s0)
    add  $t2, $t0, $t1
    sw   $t2, 0($s0)
    sub  $t5, $t3, $t4
    sw   $t5, 12($s0)

Name Dependence: Write After Read
- Instruction J should write its result after it is read by I
- Called anti-dependence by compiler writers

    I: sub $t4, $t1, $t3    # $t1 is read
    J: add $t1, $t2, $t3    # $t1 is written

- Results from reuse of the name $t1
- NOT a data hazard in the 5-stage pipeline because:
  - Reads are always in stage 2
  - Writes are always in stage 5, and
  - Instructions are processed in order
- Anti-dependence can be eliminated by renaming
  - Use a different destination register for add (e.g., $t5)

Name Dependence: Write After Write
- The same destination register is written by two instructions
- Called output-dependence in compiler terminology

    I: sub $t1, $t4, $t3    # $t1 is written
    J: add $t1, $t2, $t3    # $t1 is written again

- Not a data hazard in the 5-stage pipeline because:
  - All writes are ordered and always take place in stage 5
- However, it can be a hazard in more complex pipelines
  - If instructions are allowed to complete out of order, and
  - Instruction J completes and writes $t1 before instruction I
- Output dependence can be eliminated by renaming $t1
- Read After Read is NOT a name dependence

Next...
- Pipelined versus sequential execution
- Pipelined datapath and control
- Pipeline hazards
- Data hazards and forwarding
- Load delay, hazard detection, and stalling
- Control hazards

Control Hazards
- Jump and Branch can cause great performance loss
- A jump instruction needs only the jump target address
- A branch instruction needs two things:
  - Branch result: Taken or Not Taken
  - Branch target address:
    - PC + 4 if the branch is NOT taken
    - PC + 4 + 4 × immediate if the branch is taken
- Jump and Branch targets are computed in the ID stage
  - At which point a new instruction is already being fetched
  - Jump instruction: 1-cycle delay
  - Branch: 2-cycle delay for the branch result (taken or not taken)

2-Cycle Branch Delay
- Control logic detects a Branch instruction in the 2nd stage
- The ALU computes the Branch outcome in the 3rd stage
- The Next1 and Next2 instructions will be fetched anyway
- Convert Next1 and Next2 into bubbles if the branch is taken

Instruction                cc1    cc2    cc3    cc4     cc5     cc6     cc7
Beq $t1,$t2,L1             IF     Reg    ALU
Next1                             IF     Reg    Bubble  Bubble  Bubble
Next2                                    IF     Bubble  Bubble  Bubble  Bubble
L1: target instruction                          IF      Reg     ALU     DM

The branch target address is available at the end of cc3, so the target instruction is fetched in cc4.

Implementing Jump and Branch
[Figure: pipelined datapath with Next PC logic. Labels recoverable from the figure: PCSrc, NPC, NPC2, Jump or Branch Target, J, Beq, Bne, zero, Bubble = 0. Notes in the figure: branch target and outcome are computed in the ALU stage; Branch Delay = 2 cycles.]

Predict Branch NOT Taken
- Branches can be predicted to be NOT taken
- If the branch outcome is NOT taken then
  - The Next1 and Next2 instructions can be executed
  - Do not convert Next1 and Next2 into bubbles
  - No wasted cycles

Instruction                cc1    cc2    cc3                cc4    cc5    cc6    cc7
Beq $t1,$t2,L1             IF     Reg    ALU (NOT taken)
Next1                             IF     Reg                ALU    DM     Reg
Next2                                    IF                 Reg    ALU    DM     Reg

Reducing the Delay of Branches
- Branch delay can be reduced from 2 cycles to just 1 cycle
- Branches can be determined earlier, in the Decode stage
  - A comparator is used in the Decode stage to determine whether the branch is taken or not
  - Because of forwarding, the delay of the second stage increases, and this also increases the clock cycle
- Only one instruction that follows the branch is fetched
- If the branch is taken then only one instruction is flushed
- We should insert a bubble after a jump or taken branch
  - This converts the next instruction into a NOP

Reducing Branch Delay to 1 Cycle
[Figure: pipelined datapath with the branch comparison moved to the Decode stage. Labels recoverable from the figure: PCSrc, Next PC, Reset, Jump or Branch Target, Zero, J, Beq, Bne. Notes in the figure: data is forwarded and then compared, giving a longer cycle; the Reset signal converts the next instruction after a jump or taken branch into a bubble.]
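The effect of the branch delay on total execution time can be estimated with a short calculation. The Python below is a minimal sketch under stated assumptions: it reuses the (k + n – 1) cycle count from the pipeline performance slide, assumes predict-not-taken so that only taken branches waste cycles, and ignores jumps, load stalls, and the longer clock cycle that the 1-cycle design may require. The function name `pipeline_cycles` and the instruction counts are hypothetical.

```python
def pipeline_cycles(n_instr: int, taken_branches: int,
                    branch_delay: int, stages: int = 5) -> int:
    """Estimate total cycles for a pipelined run with predict-not-taken.

    Ideal cycles are (stages + n_instr - 1); each taken branch adds
    branch_delay wasted cycles (the flushed instructions become bubbles).
    """
    if n_instr <= 0:
        return 0
    if taken_branches < 0 or taken_branches > n_instr:
        raise ValueError("taken_branches must be between 0 and n_instr")
    return (stages + n_instr - 1) + taken_branches * branch_delay


# Hypothetical workload: 100 instructions, 10 of them taken branches.
resolved_in_alu = pipeline_cycles(100, 10, branch_delay=2)     # outcome known in stage 3
resolved_in_decode = pipeline_cycles(100, 10, branch_delay=1)  # outcome known in stage 2

print(resolved_in_alu)     # prints 124
print(resolved_in_decode)  # prints 114
```

The cycle counts alone favor resolving branches in the Decode stage, but the comparison is only fair if the clock cycle stays the same, which the slides point out is not guaranteed.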