Advanced Computer Architecture - Lecture 15: Instruction Level Parallelism. This lecture covers the following: dynamic branch prediction; the branch prediction buffer; examples of branch predictors; how predicated execution can reduce the number of branches and the number of mispredicted branches;...
CS 704 Advanced Computer Architecture
Lecture 15: Instruction Level Parallelism (Dynamic Branch Prediction)
Prof Dr M Ashraf Chughtai

Today's Topics
• Recap: Lecture 14
• Dynamic Branch Prediction
• Branch Prediction Buffer
• Examples of Branch Predictors
• Summary

MAC/VU-Advanced Computer Architecture, Lecture 15 – Instruction Level Parallelism - Dynamic (4)

Recap: Lecture 14
• Tomasulo's approach, developed for the IBM 360/91, achieves high performance without special compilers.
• The control and buffers are distributed with the function units (FUs).
• Registers in instructions are replaced by values or by pointers to reservation stations (RS); i.e., the registers are renamed.
• Unlike the scoreboard, Tomasulo's scheme can have multiple loads outstanding.
• These two properties allow an instruction with a name dependence to be issued; e.g., MULT is issued even though it has a name dependence on register F2.
• Tomasulo eliminates the WAR hazard: in this example, ADD.D writes its result in cycle 11 even though DIV.D does not start execution until cycle 16.
• Tomasulo issues in order and may execute out of order.
• Here, the integer instructions SUBI and BNEZ are executed out of order to evaluate the condition.
• The prediction "branch taken" is implemented by repeating the loop instructions as shown.
• The prediction "branch taken" is implemented by two iterations of the code.
• R1 has been initialized to 80.

Recap:
Lecture 14 (continued)
• L.D is issued in the 6th clock cycle, before the branch condition is evaluated (predict branch taken).
• R1 is updated in clock cycle 6 by executing SUBI.
• SUBI and BNEZ are issued in clock cycles … and … respectively.
• F0 never sees the result.
• MUL1, issued in clock cycle …, does not start execution until the write to F0 by L.D is complete, to avoid a RAW hazard.

Branch History Table: Accuracy with Respect to Size
(Fig 3.9, pp 201)

Impact of Size on Accuracy of BHT
• As we try to exploit more ILP, the accuracy of the branch predictor becomes critical.
• The figure compares predictor accuracy as the buffer size increases from a 4096-entry 2-bit BHT to an unlimited-entry 2-bit BHT.
• Simply increasing the number of bits per predictor without changing the predictor structure has little impact, so we have to look at other methods to increase the accuracy of the predictors.

Correlating Branches
• The 2-bit predictor scheme uses only the recent behavior of a single branch to predict the future behavior of that branch.
• In practice, the behavior of other branches, and not only of the branch we are trying to predict, may also influence the prediction accuracy.
• Let us consider the worst case of the SPEC92 benchmarks for the 2-bit predictor.

Correlating Branches: SPEC92 Example
Assume aa is assigned to register R1 and bb to register R2.

    if (aa==2) aa=0;
    if (bb==2) bb=0;
    if (aa!=bb) {

          DSUBUI R3, R1, #2
          BNEZ   R3, L1       ; branch b1 (aa != 2)
          DADD   R1, R0, R0   ; aa = 0 (b1 not taken)
    L1:   DSUBUI R3, R2, #2
          BNEZ   R3, L2       ; branch b2 (bb != 2)
          DADD   R2, R0, R0   ; bb = 0 (b2 not taken)
    L2:   DSUBU  R3, R1, R2   ; R3 = aa - bb
          BEQZ   R3, L3       ; branch b3 (aa == bb)

• Here, the behavior of b3 (at L2) is correlated with the behavior of b1 and b2.
• If b1 and b2 are both not taken (aa=0; bb=0), then b3 is taken.
• A predictor that uses only the behavior of a single branch to predict that branch cannot capture this behavior.
• So we need a correlating branch predictor.

Correlating Branch Predictors
• Hypothesis: recent branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch.
• In general, an (m,n) predictor records the behavior of the last m branches to select among 2^m history tables, each with n-bit counters.
• The old 2-bit BHT is then a (0,2) predictor.

Correlating Branch Predictor: Example
Let us consider an illustrative code fragment (d is assigned to R1):

    if (d==0) d=1;
    if (d==1)

          BNEZ   R1, L1       ; branch b1 (d != 0)
          DADDIU R1, R0, #1   ; d == 0, so d = 1 (b1 not taken)
    L1:   DADDIU R3, R1, #-1
          BNEZ   R3, L2       ; branch b2 (d != 1)

The working of the correlating predictor is as follows:

    Initial d   d==0?   b1   d before b2   d==1?   b2
    0           yes     NT   1             yes     NT
    1           no      T    1             yes     NT
    2           no      T    2             no      T

Here, if b1 is not taken, b2 will not be taken.

We write the pair of prediction bits as: prediction if the last branch in the program is not taken / prediction if the last branch is taken. Therefore, the possible combinations are:

    Prediction bits   Prediction if last branch not taken   Prediction if last branch taken
    NT/NT             NT                                    NT
    NT/T              NT                                    T
    T/NT              T                                     NT
    T/T               T                                     T

The action of the 1-bit predictor with 1 bit of correlation, written as (1,1), for the above example is shown in Fig 3.13 … pp 203. In this case the only misprediction is on the first iteration, when d=2, as this is not correlated with the previous prediction.

Correlating Branches: (2,2) Predictor
• A (2,2) branch prediction buffer uses a 2-bit global history to choose among the 2^2 = 4 predictors for each branch address.
• The behavior of recent branches then selects between, say, four predictions of the next branch, updating just that prediction.
[Figure: the branch address and the 2-bit global branch history select one of the 2-bit per-branch predictors, which supplies the prediction.]

Accuracy of Different Schemes
[Figure: frequency of mispredictions (0% to 18%) on the benchmarks nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott and li for three schemes: 4,096 entries with 2 bits per entry; unlimited entries with 2 bits per entry; 1,024 entries (2,2).]

Branch History Table or Branch Target Buffer
• The PC of the instruction to fetch is looked up in the branch target buffer; each entry holds a predicted PC.
• No match: the instruction is not predicted to be a branch; proceed normally.
• Match: the instruction is a branch, and the predicted PC should be used as the next PC.
• An entry may also record whether the branch is predicted taken or not taken.

Dynamic Branch Prediction Summary
• Branch History Table: prediction bits for loop accuracy
• Correlation: recently executed branches are correlated with the next branch
• Branch Target Buffer: includes the branch address and the prediction
• Predicated execution can reduce the number of branches and the number of mispredicted branches

Assalam-u-Alaikum and Allah Hafiz
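
The correlating predictor example from this lecture (branches b1 and b2 on the variable d) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the lecture's own code: the function names `run_branches` and `simulate`, the input sequence in which d alternates between 2 and 0, and the choice to initialize every prediction bit and the global history to "not taken" are all assumptions made for the illustration.

```python
def run_branches(d_values):
    """Yield (branch_name, taken) for b1 and b2, once per initial value of d."""
    for d in d_values:
        b1_taken = d != 0      # BNEZ R1, L1: taken when d != 0
        if not b1_taken:
            d = 1              # fall-through path: d = 1
        yield "b1", b1_taken
        b2_taken = d != 1      # BNEZ R3, L2: taken when d != 1
        yield "b2", b2_taken


def simulate(d_values, correlation_bits):
    """Count mispredictions of an (m,1) predictor, with m = correlation_bits.

    m = 0 gives a plain 1-bit predictor per branch; m = 1 gives the (1,1)
    correlating predictor. All predictions start as not taken (assumption).
    """
    history_mask = (1 << correlation_bits) - 1
    history = 0                # global history of the last m branch outcomes
    table = {}                 # (branch, history) -> predicted taken?
    mispredictions = 0
    for name, taken in run_branches(d_values):
        key = (name, history)
        prediction = table.get(key, False)
        if prediction != taken:
            mispredictions += 1
        table[key] = taken     # 1-bit entry: remember the latest outcome
        history = ((history << 1) | int(taken)) & history_mask
    return mispredictions


if __name__ == "__main__":
    sequence = [2, 0, 2, 0]    # assumed input: d alternates between 2 and 0
    print("1-bit predictor mispredictions:", simulate(sequence, 0))
    print("(1,1) predictor mispredictions:", simulate(sequence, 1))
```

For the assumed sequence 2, 0, 2, 0 the plain 1-bit predictor mispredicts all 8 branches, while the (1,1) predictor mispredicts only the 2 branches of the first iteration (d=2), which agrees with the observation in the example slide. Keying the table on the pair (branch, history) is one simple way to model "select one of 2^m predictors per branch"; a hardware buffer would index by low-order address bits instead.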