Advanced Computer Architecture - Lecture 16: Instruction level parallelism. This lecture will cover the following: correlating branch predictors; tournament predictor; high performance instruction delivery – branch target buffer; hardware intensive approaches; predictor increases misprediction rate;...
CS 704 Advanced Computer Architecture
Lecture 16 Instruction Level Parallelism (Dynamic Branch Prediction … Cont’d)
Prof Dr M Ashraf Chughtai
MAC/VU-Advanced Computer Architecture Lecture 16 – Instruction Level Parallelism -Dynamic (5)

Today's Topics
- Recap
- Correlating Branch Predictors
- Tournament Predictor
- High Performance Instruction Delivery – Branch Target Buffer
- Summary

Recap: Dynamic Scheduling and Branch Prediction
- Static: rely on the software (compiler)
- Dynamic: hardware intensive approaches

Important questions: Branch-Prediction Buffer
Q1: What is the impact of increasing the size of the branch-prediction buffer on two branches in a program?
- A single predictor predicting a single branch is generally more accurate than the same predictor serving more than one branch instruction.
- It is unlikely that two branches in a program share a single predictor entry.
- Therefore, unless the two branches already map to the same entry, increasing the size of the prediction buffer has no significant effect on them.

Question: Branch-Prediction Buffer
How does sharing a predictor affect the misprediction rate?
This is explained with the help of the following example. Consider two sequences of taken and not-taken branches sharing a 1-bit predictor, and identify the sequence that
a) reduces the misprediction rate
b) increases the misprediction rate

Example: Sequence
(P is the state of the shared 1-bit predictor before each branch.)

Sequence     P   B1  P   B2  P   B1  P   B2  P   B1  P   B2  P   B1  P   B2
Prediction   NT  -   T   -   NT  -   T   -   NT  -   T   -   NT  -   T   -
Actual       -   T   -   NT  -   T   -   NT  -   T   -   NT  -   T   -   NT
Correct?     -   No  -   No  -   No  -   No  -   No  -   No  -   No  -   No
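The effect of sharing in this example can be checked with a short simulation. This is a minimal sketch, assuming B1 is always taken, B2 is always not-taken, the two branches alternate, and the predictor starts in the not-taken state; the function names are hypothetical and chosen only for illustration.

```python
def simulate_shared_1bit(trace, initial=False):
    """Count mispredictions when all branches share one 1-bit predictor.

    trace is a list of (branch_name, taken) pairs.
    The single bit remembers the outcome of the last branch executed,
    no matter which branch that was.
    """
    state = initial              # False = predict not-taken, True = predict taken
    mispredictions = 0
    for _branch, taken in trace:
        if state != taken:
            mispredictions += 1
        state = taken            # 1-bit predictor: store the latest outcome
    return mispredictions


def simulate_private_1bit(trace, initial=False):
    """Same trace, but every branch has its own 1-bit predictor."""
    states = {}
    mispredictions = 0
    for branch, taken in trace:
        if states.get(branch, initial) != taken:
            mispredictions += 1
        states[branch] = taken
    return mispredictions


# B1 always taken, B2 always not-taken, executed alternately four times each.
trace = [("B1", True), ("B2", False)] * 4

shared_misses = simulate_shared_1bit(trace)
private_misses = simulate_private_1bit(trace)
print(shared_misses, private_misses)  # 8 1
```

With the shared predictor all 8 predictions are wrong, because each branch overwrites the bit with the outcome the other branch does not want. With one predictor per branch only the first execution of B1 is mispredicted.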
In the example table, the columns B1 and B2 show the two branches:
- B1 is always TAKEN
- B2 is always NOT-TAKEN
With a shared 1-bit predictor, each branch leaves the bit in the state the other branch does not want, so every prediction in this sequence is wrong. This is the sequence that increases the misprediction rate, case (b).

Branch-Target Buffer
Steps involved in using the Branch Target Buffer at the IF, ID and EX pipeline stages:
- IF
- ID
- EX
(Flow chart: Fig 3.20)

Branch-Target Buffer – Flow Chart Explanation - IF Stage
The PC of the instruction is compared with the contents of the buffer:
- If it is found, the instruction must be a branch instruction predicted taken.
- Else it may be a branch predicted not-taken or a normal instruction.

Branch-Target Buffer - ID Stage
i) Decode the instruction; if an entry was found in the target buffer in the IF stage (a predicted branch), begin fetching immediately from the predicted PC.
ii) Check whether the decoded instruction is a taken branch.

Branch-Target Buffer - EX Stage
The EX stage performs one of four possible functions.
i) The entry was not found in the target buffer in the IF stage:
(i-a) if the instruction is found in the ID stage to be a taken branch, enter the branch-instruction address and the next PC into the branch-target buffer;
(i-b) else proceed with normal instruction execution.
Branch-Target Buffer - EX Stage
(ii) The entry was found in the target buffer in the IF stage:
(ii-a) if the instruction is found in the ID stage to be a taken branch, it was correctly predicted, so execute normally without a stall;
(ii-b) else it was mispredicted, so kill the fetched instruction, restart fetching at the other address and delete the entry from the target buffer.

Branch-Target Buffer Conclusion
- If the branch entry is found in the buffer and the prediction is correct, there is no branch penalty.
- Else it suffers a delay of at least 2 clock cycles as misprediction penalty: one clock cycle for fetching the wrong instruction and one clock cycle to restart the fetch.

Branch-Target Buffer - Examples

Inst in Buffer   Prediction   Actual Branch   Penalty Cycles
Yes              Taken        Taken           0
Yes              Taken        Not-Taken       2
No               -            Taken           2
No               -            Not-Taken       0

Branch-Target Buffer - Solution
We can compute the penalty by looking at the probability of two events:
i) Branch is predicted taken but ends up not taken = % buffer hit rate × % incorrect prediction = 0.95 × 0.1 = 0.095
ii) Branch is taken but is not found in the buffer = % incorrect prediction = 0.1
The penalty in both cases is 2 cycles, therefore

$$\text{Branch Penalty} = (0.095 + 0.1) \times 2 = 0.195 \times 2 = 0.39$$

Example: Branch-Target Buffer
Problem: Consider a branch-target buffer implemented for conditional branches only, for a pipelined processor. Assume that:
- Misprediction penalty = 4 cycles
- Buffer miss penalty = 3 cycles
- Hit rate and accuracy each = 90%
- Branch frequency = 15%
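The IF, ID and EX steps and the four penalty cases described above can be sketched as a small software model. This is an illustration only, not the hardware from the lecture: the class name `BranchTargetBuffer`, the dictionary storage, the example addresses and the assumption that a branch always has the same target are all choices made for this sketch. The 2-cycle penalty is the factor used in the penalty computation above.

```python
class BranchTargetBuffer:
    """Toy BTB: maps the PC of a predicted-taken branch to its target PC."""

    PENALTY = 2  # cycles lost on a buffer miss of a taken branch or a misprediction

    def __init__(self):
        self.entries = {}

    def lookup(self, pc):
        """IF stage: return the predicted target PC, or None if pc is not in the buffer."""
        return self.entries.get(pc)

    def resolve(self, pc, taken, target):
        """EX stage: update the buffer and return the penalty in cycles.

        Assumes a given branch always jumps to the same target.
        """
        in_buffer = pc in self.entries
        if not in_buffer:
            if taken:
                # (i-a) taken branch not in the buffer: enter it, pay the penalty
                self.entries[pc] = target
                return self.PENALTY
            # (i-b) normal instruction or not-taken branch: no penalty
            return 0
        if taken:
            # (ii-a) predicted taken and actually taken: no stall
            return 0
        # (ii-b) predicted taken but not taken: delete the entry, pay the penalty
        del self.entries[pc]
        return self.PENALTY


btb = BranchTargetBuffer()
penalties = [
    btb.resolve(0x40, taken=True, target=0x80),   # not in buffer, taken
    btb.resolve(0x40, taken=True, target=0x80),   # in buffer, taken
    btb.resolve(0x40, taken=False, target=0x80),  # in buffer, not taken
    btb.resolve(0x40, taken=False, target=0x80),  # not in buffer, not taken
]
print(penalties)  # [2, 0, 2, 0]
```

The four calls walk through the four rows of the penalty table, in the order buffer miss and taken, hit and taken, hit and not taken, miss and not taken.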
Example: Branch-Target Buffer Solution
The speedup with a Branch Target Buffer versus no BTB is expressed as:

$$\text{Speedup} = \frac{\text{CPI}_{\text{no BTB}}}{\text{CPI}_{\text{BTB}}} = \frac{\text{CPI}_{\text{base}} + \text{Stalls}_{\text{no BTB}}}{\text{CPI}_{\text{base}} + \text{Stalls}_{\text{BTB}}}$$

The stalls are determined as:

$$\text{Stalls} = \sum_{s \in \text{Stall}} \text{Frequency}_s \times \text{Penalty}_s$$

that is, the sum over all stall cases of the product of the frequency of the stall case and its stall penalty.

i) Stalls no BTB = 0.15 × 2 = 0.30
ii) To find Stalls BTB we have to consider each outcome from the BTB. There are three possibilities:
a) Branch misses the BTB: frequency = 15% × 0.1 = 1.5% = 0.015; penalty = 3; stalls = 0.045
b) Branch hits and is correctly predicted: frequency = 15% × 0.9 (hit) × 0.9 (prediction) ≈ 12.1% = 0.121; penalty = 0; stalls = 0
c) Branch hits but is incorrectly predicted: frequency = 15% × 0.9 (hit) × 0.1 (misprediction) ≈ 1.3% = 0.013; penalty = 4; stalls = 0.052
Stalls BTB = 0.045 + 0 + 0.052 = 0.097

$$\text{Speedup} = \frac{1.0 + 0.3}{1.0 + 0.097} \approx 1.2$$

Improvement in BTB
To achieve higher instruction delivery, one possible variation of the Branch Target Buffer is to store one or more target instructions instead of, or in addition to, the predicted target address.
Advantages:
- It possibly allows a larger BTB, as it permits the buffer access to take longer than the time between successive instruction fetches.
- Buffering the actual target instructions allows branch folding, i.e., zero-cycle unconditional branching or sometimes zero-cycle conditional branching.
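The stall arithmetic in the two Branch-Target Buffer examples of this lecture can be reproduced with a few lines of Python. This is a sketch under the numbers stated in the slides (2-cycle penalty in the first example; 15% branch frequency, 90% hit rate and accuracy, 3-cycle miss penalty, 4-cycle misprediction penalty, 2-cycle penalty without a BTB and a base CPI of 1.0 in the second); the helper name `expected_stalls` is hypothetical.

```python
def expected_stalls(cases):
    """Sum frequency * penalty over a list of (frequency, penalty) stall cases."""
    return sum(freq * penalty for freq, penalty in cases)


# First example: two events, each costing 2 cycles.
first_penalty = expected_stalls([
    (0.95 * 0.1, 2),   # in buffer, predicted taken, but not taken
    (0.1, 2),          # taken, but not found in the buffer
])

# Second example: speedup of a processor with a BTB over one without.
branch_freq, hit_rate, accuracy = 0.15, 0.90, 0.90
cpi_base = 1.0

stalls_no_btb = expected_stalls([(branch_freq, 2)])
stalls_btb = expected_stalls([
    (branch_freq * (1 - hit_rate), 3),                 # a) misses the BTB
    (branch_freq * hit_rate * accuracy, 0),            # b) hit, correct prediction
    (branch_freq * hit_rate * (1 - accuracy), 4),      # c) hit, wrong prediction
])
speedup = (cpi_base + stalls_no_btb) / (cpi_base + stalls_btb)

print(round(first_penalty, 3), round(stalls_btb, 3), round(speedup, 2))
```

Without rounding the intermediate frequencies, case (c) contributes 0.0135 × 4 = 0.054 rather than 0.052, so the script gives 0.099 stall cycles with the BTB instead of 0.097. The speedup still rounds to 1.2 either way.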