COMPUTER ARCHITECTURE CE2013
BK TP.HCM
Faculty of Computer Science and Engineering
Department of Computer Engineering
Vo Tan Phuong
http://www.cse.hcmut.edu.vn/~vtphuong
CuuDuongThanCong.com https://fb.com/tailieudientucntt

Chapter: Single-Cycle & Pipeline Processor
Computer Architecture – Chapter 4.2 ©2013, CE

Single-Cycle Processor Overview
[Figure: single-cycle datapath. The PC and the Next PC logic (inputs Imm26 and Imm16, +1 increment, PCSrc select, jump or branch target address) address the Instruction Memory. The instruction fields Rs, Rt, Rd and Imm16 feed the register file (RA, RB, RW, BusA, BusB, BusW), the extender, the ALU (ALU result, zero) and the Data Memory (Address, Data_in, Data_out). Main Control takes Op and generates RegDst, RegWrite, ExtOp, ALUSrc, MemRead, MemWrite, MemtoReg, ALUop and the J, Beq, Bne signals; ALU Ctrl takes func and ALUop.]

Exercise
Fill in the values of the control signals for the following instructions:
a. slt $t0, $s0, $zero
   Columns: RegDst, RegWrite, ExtOp, ALUSrc, Beq, Bne, J, MemRead, MemWrite, MemtoReg
   Partially filled row as given: 1 x 0 0 0
b. bne $t0, $zero, exit_label
   Columns: RegDst, RegWrite, ExtOp, ALUSrc, Beq, Bne, J, MemRead, MemWrite, MemtoReg

Exercise
• We wish to add the instruction jalr (jump and link register) to the single-cycle datapath. Add any necessary datapath and control signals and draw the resulting datapath. Show the values of the control signals that control the execution of the jalr instruction.
• The jump and link register instruction is described below:

Exercise
One solution: (Comment: JReg means Jump Register; RA means Return Address)

Exercise
• The main control signals for the JALR instruction are the same as for other R-type
instructions, such as ADD and SUB. These control signals are shown in the table below:
• The ALU Control signals for the JALR instruction are shown below: JReg = 1 and RA = 1; ALUCtrl is a don't care.

Exercise
We want to compare the performance of a single-cycle CPU design with a multi-cycle CPU. Suppose we add the multiply and divide instructions. The operation times are as follows:
o Instruction memory access time = 190 ps, Data memory access time = 190 ps
o Register file read access time = 150 ps, Register file write access time = 150 ps
o ALU delay for basic instructions = 190 ps, ALU delay for multiply or divide = 550 ps
Ignore the other delays in the multiplexers, control unit, sign-extension, etc.
Assume the following instruction mix: 30% ALU, 15% multiply & divide, 15% load, 15% store, 15% branch, and 10% jump.
a. What is the total delay for each instruction class and the clock cycle for the single-cycle CPU design?
b. Assume we fix the clock cycle to 200 ps for a multi-cycle CPU. What is the CPI for each instruction class, and what is the speedup over the single-cycle design with its fixed-length clock cycle?
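The timing question above can be checked numerically. The following is a minimal sketch, not the course's official solution. The component delays and the instruction mix come from the exercise statement; which components lie on each instruction class's path, and the rule that a jump spends a second cycle in decode on the multi-cycle design, are assumptions based on the usual MIPS datapath. All names (DELAY_PS, SINGLE_CYCLE_PATH, and so on) are hypothetical.

```python
import math

# Component delays in picoseconds, taken from the exercise statement.
DELAY_PS = {
    "imem": 190, "dmem": 190,
    "reg_read": 150, "reg_write": 150,
    "alu": 190, "muldiv": 550,
}

# ASSUMPTION: components on each class's path in the single-cycle datapath.
SINGLE_CYCLE_PATH = {
    "alu":    ["imem", "reg_read", "alu", "reg_write"],
    "muldiv": ["imem", "reg_read", "muldiv", "reg_write"],
    "load":   ["imem", "reg_read", "alu", "dmem", "reg_write"],
    "store":  ["imem", "reg_read", "alu", "dmem"],
    "branch": ["imem", "reg_read", "alu"],
    "jump":   ["imem"],
}

# ASSUMPTION: the multi-cycle design uses the same steps, except that a
# jump also spends one cycle in the decode/register-read step.
MULTI_CYCLE_STEPS = {**SINGLE_CYCLE_PATH, "jump": ["imem", "reg_read"]}

# Instruction mix from the exercise statement.
MIX = {"alu": 0.30, "muldiv": 0.15, "load": 0.15,
       "store": 0.15, "branch": 0.15, "jump": 0.10}


def single_cycle_delays():
    """Total delay in ps for each instruction class."""
    return {cls: sum(DELAY_PS[c] for c in path)
            for cls, path in SINGLE_CYCLE_PATH.items()}


def multi_cycle_cpi(clock_ps):
    """Cycles per class: each step takes ceil(delay / clock) cycles."""
    return {cls: sum(math.ceil(DELAY_PS[c] / clock_ps) for c in steps)
            for cls, steps in MULTI_CYCLE_STEPS.items()}


def average_cpi(cpi):
    """CPI weighted by the instruction mix."""
    return sum(MIX[cls] * cpi[cls] for cls in MIX)


def speedup(clock_ps=200):
    """Single-cycle time per instruction over multi-cycle time per instruction."""
    single_clock = max(single_cycle_delays().values())
    return single_clock / (clock_ps * average_cpi(multi_cycle_cpi(clock_ps)))


print(single_cycle_delays())
print(multi_cycle_cpi(200))
print(round(average_cpi(multi_cycle_cpi(200)), 2), round(speedup(), 2))
```

Under these assumptions the longest single-cycle path is the multiply/divide class, which sets the single-cycle clock, and the 550 ps multiply/divide step needs three 200 ps cycles on the multi-cycle design.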
Exercise
a. Total delay for each instruction:
Clock cycle = max delay = 1040 ps

Exercise
b. CPI for each instruction:
CPI for Basic ALU = 4 cycles
CPI for Multiply & Divide = 6 cycles (ALU takes 3 cycles)
CPI for Load = 5 cycles
CPI for Store = 4 cycles
CPI for Branch = 3 cycles
CPI for Jump = 2 cycles
Average CPI = 0.3 * 4 + 0.15 * 6 + 0.15 * 5 + 0.15 * 4 + 0.15 * 3 + 0.1 * 2 = 4.1
Speedup of multi-cycle over single-cycle = (1040 * 1) / (200 * 4.1) = 1.27

Exercise
• Identify all the RAW data dependencies in the following code. Which dependencies are data hazards that will be resolved by forwarding? Which dependencies are data hazards that will cause a stall? Using a graphical representation of the pipeline, show the forwarding paths and stalled cycles, if any.
add $3, $4, $2
sub $5, $3, $1
lw $6, 200($3)
add $7, $3, $6

Exercise
• RAW dependencies:
add $3, $4, $2 and sub $5, $3, $1 (forwarding)
add $3, $4, $2 and lw $6, 200($3) (forwarding)
lw $6, 200($3) and add $7, $3, $6 (stall 1 cycle, then forward)
add $3, $4, $2 and add $7, $3, $6 (from register)

Exercise
• We have a program of 10^6 instructions in the format of "lw, add, lw, add, …". The add instruction depends only on the lw instruction right before it. The lw instruction also depends only on the add instruction right before it. If this program is executed on the 5-stage MIPS pipeline:
a. Without forwarding, what would be the actual CPI?
It takes 6 cycles on average to complete one LW and one ADD:
1 cycle (to complete LW) + 2 cycles (bubbles) + 1 cycle (to complete ADD) + 2 cycles (bubbles) = 6 cycles
So, it takes 6 cycles to complete 2 instructions. Average CPI = 6/2 = 3
b. With forwarding, what would be the actual CPI?
It takes only 3 cycles on average to complete one LW and one ADD:
1 cycle (to complete LW) + 1 cycle (bubble) + 1 cycle (to complete ADD) = 3 cycles
So, it takes 3 cycles to complete 2 instructions. Average CPI = 3/2 = 1.5
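The CPI reasoning for the alternating lw/add program can also be sketched as a cycle count. This is an illustrative model only. It assumes that, without forwarding, any dependent instruction directly after its producer needs 2 bubbles (register file written in the first half of a cycle and read in the second half), and that, with forwarding, only a load followed by its consumer needs 1 bubble. The names BUBBLES and total_cycles are hypothetical.

```python
# Bubbles inserted after a producer when the very next instruction depends
# on it, keyed by (forwarding enabled?, producer opcode).
# ASSUMPTION: 5-stage MIPS pipeline; register file written in the first half
# of the cycle and read in the second half.
BUBBLES = {
    (False, "lw"): 2, (False, "add"): 2,   # no forwarding: wait for write-back
    (True, "lw"): 1,  (True, "add"): 0,    # forwarding: only load-use stalls
}


def total_cycles(n_instructions, forwarding, pattern=("lw", "add"), stages=5):
    """Count cycles for a program repeating `pattern`, where every
    instruction depends on the instruction right before it."""
    cycles = stages - 1                      # cycles needed to fill the pipeline
    for i in range(n_instructions):
        cycles += 1                          # one completion slot per instruction
        if i + 1 < n_instructions:           # the last instruction has no consumer
            producer = pattern[i % len(pattern)]
            cycles += BUBBLES[(forwarding, producer)]
    return cycles


N = 10**6
for fwd in (False, True):
    cpi = total_cycles(N, fwd) / N
    print("forwarding" if fwd else "no forwarding", round(cpi, 3))
```

For a program this long, the few cycles spent filling the pipeline are negligible, so the measured CPI matches the per-pair averages derived above to three decimal places.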