Advanced Computer Architecture - Lecture 20: Instruction level parallelism



Document information

Advanced Computer Architecture - Lecture 20: Instruction level parallelism. This lecture will cover the following: software approaches to exploit ILP; basic compiler techniques; loop unrolling and scheduling; static branch prediction; multiple-instruction-issues per cycle processors;...

CS 704 Advanced Computer Architecture
Lecture 20: Instruction Level Parallelism (Static Scheduling)
Prof. Dr. M. Ashraf Chughtai

Today's Topics
  • Recap: dynamic scheduling in ILP
  • Software approaches to exploit ILP
    - Basic compiler techniques
    - Loop unrolling and scheduling
    - Static branch prediction
  • Summary

Recap: Dynamic Scheduling
Our discussions in the last eight lectures focused on the hardware-based approaches to exploiting parallelism among instructions. The instructions in a basic block, i.e., a straight-line code sequence without branches, are executed in parallel by using a pipelined datapath.

Here, we noticed that the performance of the pipelined datapath is limited by its structure and by data and control dependences, as these lead to structural, data, and control hazards. These hazards are removed by introducing stalls.

The stalls degrade the performance of a pipelined datapath by increasing the CPI to more than 1 (pipeline CPI = ideal CPI + stall cycles per instruction). The number of stalls needed to overcome hazards in the pipelined datapath is reduced or eliminated by introducing additional hardware and using dynamic scheduling techniques.

The major hardware-based techniques studied so far are summarized here (technique: hazard type / stalls reduced):
  • Forwarding and bypassing: potential data hazard stalls
  • Delayed branching and branch scheduling: control hazard stalls
  • Basic dynamic scheduling (scoreboarding): data hazard stalls from true dependences
  • Dynamic scheduling with renaming (Tomasulo's approach): stalls from data hazards, from antidependences, and from output dependences
  • Dynamic branch prediction: control hazard stalls
  • Speculation: data and control hazard stalls
  • Multiple instruction issues per cycle: ideal CPI (reduced to less than 1)

Introduction to Static Scheduling in ILP
Multiple-instruction-issue per cycle processors are rated as high-performance processors. These processors exist in a variety of flavors, such as:
  • Superscalar processors
  • VLIW processors
  • Vector processors

The superscalar processors exploit ILP using static as well as dynamic scheduling approaches. The VLIW processors, on the other hand, exploit ILP using static scheduling only. Dynamic scheduling in superscalar processors has already been discussed in detail, and the basics of static scheduling for superscalar processors have been introduced.

In today's lecture, and in a few of the following lectures, our focus will be the detailed study of ILP exploitation through static scheduling. The major software scheduling techniques under discussion, used to reduce the data and control stalls, are basic compiler (pipeline) scheduling, loop unrolling and scheduling, and static branch prediction; a source-level sketch of loop unrolling is given below.
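To make the loop unrolling and scheduling idea concrete before the branch prediction example, here is a source-level sketch in C. It is not taken from the lecture (which works on MIPS floating-point code); the function names, the array and scalar arguments, and the unroll factor of 4 are assumptions chosen for illustration. It mirrors the point made in the slides: replicating the loop body gives the scheduler several independent operations to place between each load and the operation that consumes it, and the loop overhead (increment, test, branch) is paid once per unrolled iteration instead of once per element.

    #include <stddef.h>

    /* Original loop: one add per iteration, plus increment/compare/branch
     * overhead on every element. */
    void add_scalar(double *x, size_t n, double s)
    {
        for (size_t i = 0; i < n; i++)
            x[i] = x[i] + s;
    }

    /* Unrolled by 4 (hypothetical factor): four independent adds per
     * iteration let the compiler separate each load from the add that
     * consumes it, and the loop overhead is amortized over four elements.
     * A cleanup loop handles n not divisible by 4. */
    void add_scalar_unrolled(double *x, size_t n, double s)
    {
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            double t0 = x[i];        /* loads issued early ...            */
            double t1 = x[i + 1];
            double t2 = x[i + 2];
            double t3 = x[i + 3];
            x[i]     = t0 + s;       /* ... adds scheduled after the loads */
            x[i + 1] = t1 + s;
            x[i + 2] = t2 + s;
            x[i + 3] = t3 + s;
        }
        for (; i < n; i++)           /* cleanup iterations */
            x[i] = x[i] + s;
    }

Compilers can apply this transformation automatically, but, as the summary later notes, the unrolled copies need additional registers to remain independent.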
Example: Static Branch Prediction
Let us consider an example that arises from conditional selection branches:

       LD     R1, 0(R2)
       DSUBU  R1, R1, R3
       BEQZ   R1, L
       OR     R4, R5, R6
       DADDU  R10, R4, R3
    L: DADDU  R7, R8, R9

Explanation
Here, note the dependence of the DSUBU and BEQZ instructions on the LD. This shows that a stall will be needed after the LD. If it is predicted that the branch (BEQZ) is almost always taken, and that the value of R7 is not needed on the fall-through path, then the speed of the program could be improved by moving the instruction DADDU R7, R8, R9 (at label L) to the position after the LD.

On the other hand, if it is predicted that the branch (BEQZ) is rarely taken, and that the value of R4 is not needed on the taken path, then we could consider moving the OR instruction after the LD instead. Furthermore, we can use the same information to better schedule any branch delay slot, since such scheduling depends on knowing the branch behavior.

Static Branch Prediction
To perform these optimizations we need to predict the branch statically, when the program is compiled. There are several methods to statically predict branch behavior.

A: Predict taken. The simplest scheme is to predict every branch as taken. This scheme has an average misprediction rate equal to the untaken branch frequency, which for the SPEC benchmarks is 34%. However, the misprediction rate ranges from not very accurate (59%) to highly accurate (9%).

B: Predict by direction. Backward-going branches are predicted taken and forward-going branches are predicted not taken. For the SPEC programs, however, more than half of the forward-going branches are taken; hence, predicting all branches as taken remains the better approach.

C: Predict from profile information. This is a more accurate technique, in which branches are predicted on the basis of profile information collected from earlier runs. The behavior of branches is often bimodally distributed, i.e., an individual branch is often highly biased toward taken or untaken.

We can compare the prediction accuracy of the predicted-taken strategy with the measured accuracy of the profile-based scheme. The figure below shows the misprediction rate on SPEC92 for a profile-based predictor; you can see that the misprediction rate varies widely.

[Fig. 4.3: Misprediction rate on SPEC92 for a profile-based predictor]

Profile-based prediction is generally better for the FP programs than for the integer programs: for the FP benchmarks the misprediction rate varies from 4% to 9%, while for the integer programs it varies from 5% to 15%.
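As a side note on how the profile information of scheme C reaches a compiler in practice, the sketch below uses GCC/Clang's __builtin_expect builtin to mark a branch as rarely taken. This is an illustration added here, not part of the lecture; the LIKELY/UNLIKELY macro names and the count_nonzero function are hypothetical. Marking the likely direction lets the compiler lay out the frequent path as straight-line fall-through code and schedule it early, the same decision that the DADDU/OR code motion above depends on.

    #include <stddef.h>

    /* Hypothetical helper macros wrapping GCC/Clang's __builtin_expect,
     * which tells the compiler which way a branch usually goes so it can
     * place the likely path as fall-through code and schedule it first. */
    #define LIKELY(x)   __builtin_expect(!!(x), 1)
    #define UNLIKELY(x) __builtin_expect(!!(x), 0)

    /* Count non-zero elements; assume profile data says zeros are rare,
     * so the zero case is marked UNLIKELY and the common path stays
     * straight-line. */
    size_t count_nonzero(const int *v, size_t n)
    {
        size_t count = 0;
        for (size_t i = 0; i < n; i++) {
            if (UNLIKELY(v[i] == 0))
                continue;            /* rarely-taken path */
            count++;                 /* predicted fall-through path */
        }
        return count;
    }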
Summary
Today we started the discussion of the static scheduling techniques used to exploit ILP in a pipelined datapath. We discussed the basic compiler approach used to avoid hazards, especially the data and control hazards that would otherwise be resolved by inserting stalls.

The number of stalls is reduced by having the compiler schedule the instructions. In the case of loops, the loop is unrolled to enhance performance and reduce stalls: each instruction of the body is repeated once per unrolled iteration, but with additional registers so that the copies remain independent. Scheduling the unrolled loop then reduces the number of stalls further.

Finally, we discussed the impact of static branch prediction on the performance of the scheduled and unrolled loops. We observed that static (profile-based) branch prediction achieves misprediction rates ranging between 4% and 15%, and thus greatly enhances the performance of a superscalar processor.

Assalam-u-Alaikum and ALLAH Hafiz.


Table of Contents

  • CS 704 Advanced Computer Architecture

  • Introduction to Static Scheduling in ILP

  • Execution of a Simple Loop with basic scheduling

  • MIPS code without scheduling

  • Loop execution without Basic Scheduling

  • Stalls of FP ALU and Load Instruction

  • Single Loop execution without scheduling: latencies and stalls for originating vis-à-vis consuming instructions

  • Single loop execution With Compiler scheduling

  • Loop Unrolling without scheduling

  • Example: Loop Unrolling without scheduling

  • Loop Unrolling and scheduling

  • Explanation: Example loop unrolling

  • Example 3: Unrolling with scheduling

  • Loop unrolling and scheduling: Conclusion

  • Limits to the gains of Loop unrolling and scheduling

  • Growth in Code Size

  • Loop unrolling and scheduling with MULTIPLE ISSUES in Superscalar

  • Loop Unrolling in Superscalar

  • Example: Static branch prediction
