
Advanced Computer Architecture - Lecture 17: Instruction level parallelism


DOCUMENT INFORMATION

Basic information

Format
Pages: 42
File size: 1.53 MB

Content

Advanced Computer Architecture - Lecture 17: Instruction level parallelism. This lecture covers the following: high-performance instruction delivery and multiple issue; high-performance processors; branch target buffer; integrated instruction fetch units; return address predictors; multiple instruction-issue processors;...

CS 704 Advanced Computer Architecture
Lecture 17: Instruction Level Parallelism (High-performance Instruction Delivery - Multiple Issue)
Prof Dr M Ashraf Chughtai
MAC/VU-Advanced Computer Architecture, Lecture 17 - Instruction Level Parallelism - Dynamic (6)

Recap: High-Performance Processors

Reducing branch penalties for high-performance processors:
- Branch Target Buffer
- Integrated Instruction Fetch Units
- Return Address Predictors

1: Branch Target Buffer

2: Integrated Instruction Fetch Units
- Integrated Branch Prediction
- Instruction Prefetch
- Instruction memory access and buffering

Integrated Branch Prediction
The branch predictor is included in the Instruction Fetch Unit, so it predicts branches and drives the fetch pipe.

Instruction Prefetch
An instruction prefetch queue is part of the integrated instruction fetch unit (IIFU). The queue holds multiple instructions and can deliver more than one instruction per cycle.

Instruction memory access and buffering
Fetching multiple instructions per clock cycle may require accessing multiple cache lines, which is a complex operation. The IIFU handles this complexity and hides the cost of crossing cache blocks. It also provides instruction buffering and on-demand issue.

3: Return Address Predictors
The return-address predictor predicts indirect jumps, i.e., jumps whose target address varies at run time. High-level language programs generate such jumps for procedure returns, indirect procedure calls, and select or case statements.

Dynamic Scheduling in Superscalar ... cont'd
Modern superscalar processors issue four or more instructions per clock cycle and often include both approaches. In addition, branch prediction is integrated into the dynamically scheduled pipeline so that instructions can be executed speculatively. This is referred to as hardware-based speculation.

Example
Let us consider a general 2-issue dynamically scheduled processor and see how a simple loop, which we considered for single-issue Tomasulo, executes on it. Recall that our example loop adds a scalar in F2 to each element of a vector in memory.

    Loop: L.D     F0,0(R1)     ; F0 = array element
          ADD.D   F4,F0,F2     ; add scalar in F2
          S.D     F4,0(R1)     ; store result
          DADDUI  R1,R1,#-8    ; decrement pointer by 8 bytes (per DW)
          BNE     R1,R2,Loop   ; branch if R1 != R2

Let us create a table showing when each instruction issues, begins execution, and writes its result to the CDB for the first three iterations of the loop on the 2-issue version of Tomasulo's pipeline. Assume that:
- Both an FP and an integer operation can be issued on every clock cycle, even if they are dependent.
- One integer functional unit is used for both ALU operations and effective address calculations, and there is a separate pipelined FP functional unit for each operation type.
- Issue and write result take one cycle each.
- There is dynamic branch prediction hardware and a separate functional unit to evaluate branch conditions.
- Execution takes one cycle for integer ALU operations, two cycles for loads, and three cycles for FP add.

Let us look at the clock cycles of issue, execution, and writing result for the dual-issue version of Tomasulo's pipeline.

Thus, sustaining one iteration every three cycles would lead to an IPC of 5/3 = 1.67 (5 instructions in 3 clocks).

The completion rate is 15/16 = 0.94 (15 instructions execute in 16 cycles).

Resource usage table

Another example: overcoming the single integer pipe bottleneck
Now let us consider another example with the 2-issue version of Tomasulo's pipeline that overcomes the single integer unit bottleneck.

In this example, we consider the execution of the same loop as in the previous example, on a 2-issue version of Tomasulo's pipeline that has a wider CDB (2 CDBs).

As in the previous example, the activity table shows the clock cycles of issue, execution, and writing result for the dual-issue version of Tomasulo's pipeline. Notice that this dual-issue Tomasulo pipeline has:
- separate functional units for integer ALU operations and effective address calculation; and
- a wider CDB

Activity Table for Example

Summary

Aslam-u-Alacum and Allah Hafiz
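The instruction prefetch queue described under the integrated instruction fetch unit can be pictured with a small software model. The sketch below is only an illustration: the class name PrefetchQueue, its capacity, and the delivery width of 2 are assumptions chosen for the demo, not details taken from the lecture or from any real processor.

```python
from collections import deque


class PrefetchQueue:
    """Toy model of an instruction prefetch queue (hypothetical, for illustration)."""

    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self.slots = deque()

    def fill(self, fetched):
        """Accept fetched instructions until the queue is full; return how many fit."""
        taken = 0
        for insn in fetched:
            if len(self.slots) >= self.capacity:
                break
            self.slots.append(insn)
            taken += 1
        return taken

    def deliver(self, width: int = 2):
        """Hand up to `width` instructions to the issue stage, in program order."""
        count = min(width, len(self.slots))
        return [self.slots.popleft() for _ in range(count)]


# Assumed demo values: a 4-entry queue feeding a 2-issue front end.
queue = PrefetchQueue(capacity=4)
loop_body = ["L.D", "ADD.D", "S.D", "DADDUI", "BNE"]
accepted = queue.fill(loop_body)    # only four of the five instructions fit
first_cycle = queue.deliver(2)      # two instructions delivered in one cycle
second_cycle = queue.deliver(2)
third_cycle = queue.deliver(2)      # queue is empty by now
```

The point of the model is the decoupling: fetch fills the queue whenever there is room, while issue drains up to two instructions per cycle regardless of how they arrived from the cache.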
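For procedure returns, a return-address predictor is commonly organised as a small stack: a call pushes its return address and a return pops the most recent one as the predicted target. The lecture does not spell out this organisation, so the sketch below is an assumption; the class name ReturnAddressStack, the depth, and the addresses are all hypothetical.

```python
class ReturnAddressStack:
    """Minimal return-address stack sketch (assumed organisation, not from the slides)."""

    def __init__(self, depth: int = 8):
        self.depth = depth
        self.entries = []

    def on_call(self, return_pc: int) -> None:
        """Push the address of the instruction following the call."""
        if len(self.entries) == self.depth:
            self.entries.pop(0)  # overflow: the oldest return address is lost
        self.entries.append(return_pc)

    def predict_return(self):
        """Pop the predicted return target, or None when the stack is empty."""
        return self.entries.pop() if self.entries else None


# Hypothetical call chain: main calls f, and f calls g.
ras = ReturnAddressStack(depth=2)
ras.on_call(0x1004)  # return point in main
ras.on_call(0x2008)  # return point in f
predictions = [ras.predict_return(), ras.predict_return(), ras.predict_return()]

# With more nested calls than entries, the oldest address is overwritten.
small = ReturnAddressStack(depth=2)
for pc in (1, 2, 3):
    small.on_call(pc)
overflowed = [small.predict_return(), small.predict_return(), small.predict_return()]
```

Returns are predicted in last-in, first-out order, which matches nested calls. When the call depth exceeds the stack depth, the outermost return has no prediction and would have to fall back on another mechanism.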
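The two rates quoted in the first example (5/3 and 15/16) can be checked with a short calculation. The helper name ipc is an assumption for this sketch; the instruction and cycle counts are the ones stated in the lecture.

```python
from fractions import Fraction


def ipc(instructions: int, cycles: int) -> Fraction:
    """Instructions per clock as an exact fraction."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return Fraction(instructions, cycles)


LOOP_BODY = 5  # L.D, ADD.D, S.D, DADDUI, BNE

steady_state = ipc(LOOP_BODY, 3)          # one iteration every three cycles
three_iterations = ipc(3 * LOOP_BODY, 16)  # 15 instructions in 16 cycles

print(f"steady-state IPC: {float(steady_state):.2f}")      # 1.67
print(f"completion rate:  {float(three_iterations):.2f}")  # 0.94
```

The gap between the two figures reflects the fact that the 16-cycle count covers only the first three iterations, including start-up, whereas 5/3 assumes the loop has reached a steady state.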

Posted: 05/07/2022, 11:51