Advanced Computer Architecture - Lecture 37: Multiprocessors

DOCUMENT INFORMATION

Basic information

Title	Multiprocessors
Instructor	Prof. Dr. M. Ashraf Chughtai
School	MAC/VU
Field	Advanced Computer Architecture
Type	Lecture

Format
Pages	56
File size	1.63 MB

Content

Advanced Computer Architecture - Lecture 37: Multiprocessors. This lecture covers the performance of multiprocessors with symmetric shared memory and distributed shared memory; synchronization in parallel architectures; hardware-supplied synchronization instructions;...

CS 704 Advanced Computer Architecture
Lecture 37: Multiprocessors (Performance and Synchronization)
Prof. Dr. M. Ashraf Chughtai

Today's Topics
– Recap
– Performance of Multiprocessors with
  – Symmetric Shared-Memory
  – Distributed Shared Memory
– Synchronization in Parallel Architecture
– Conclusion

MAC/VU-Advanced Computer Architecture Lec 37 Multiprocessor (4)

Recap: Cache Coherence Problem
So far we have discussed the sharing of caches for multiprocessing in the:
– Symmetric shared-memory architecture
– Distributed shared-memory architecture
We have studied the cache coherence problem in symmetric and distributed shared-memory multiprocessors, and have noted that this problem is indeed performance-critical.

Recap: Multiprocessor Cache Coherence
Last time we also studied the cache coherence protocols, which use different techniques to track the sharing status of blocks and maintain coherence without degrading performance. These protocols are classified as:
– Snooping protocols
– Directory-based protocols
These protocols are implemented using an FSM (finite state machine) controller.

Recap: Snooping Protocols
Snooping protocols employ write-invalidate and write-broadcast techniques. Here, a block of memory is in one of three states, and each cached block tracks these three states. The controller responds to read/write requests for a block of memory or a cached block, both from the processor and from the bus.

Recap: Implementation Complications of Snoopy Protocols
The three states of the basic FSM are Shared, Exclusive and Invalid. However, complications such as write races, interventions and invalidations have been observed in the implementation of snoopy protocols. To overcome these complications, a number of variations in the FSM controller have been suggested: the MESI Protocol, the Berkeley Protocol and the Illinois Protocol.

Recap: Variations in Snoopy Protocols
These variations resulted in a four-state FSM controller:
– The states of the MESI Protocol are: Modified, Exclusive, Shared and Invalid
– The states of the Berkeley Protocol are: Owned-Exclusive, Owned-Shared, Shared and Invalid
– The states of the Illinois Protocol are: Private Dirty, Private Clean, Shared and Invalid

Recap: Directory-Based Protocols
Larger multiprocessor systems employ distributed shared memory, i.e., a separate memory is provided per processor. Here, cache coherency is achieved using non-cached pages or a directory containing information for every block in memory. The directory-based protocol tracks the state of every block in every cache, and finds which caches hold copies of a block and whether those copies are dirty or clean. Similar to the snoopy protocol, the directory-based protocol is implemented by an FSM having three states: Shared, Uncached and Exclusive.

Hardware Primitives: Uninterruptible Instructions
Atomic exchange: To see how we can use this primitive to build synchronization, let us assume we want to build a simple lock where 0 indicates that the lock is free and 1 indicates that the lock is unavailable. To implement synchronization, a processor tries to set the lock by exchanging 1, which is in a register, with the memory address corresponding to the lock. The value returned from the exchange instruction is 1 if some other processor had already claimed access (the lock is held and unavailable); otherwise the value returned is 0. In the latter case, where the value returned is 0, the value in memory is changed to 1, preventing any competing exchange from also retrieving 0.
Example:
– Consider two processors trying to do the exchange simultaneously
– This race is broken when one of the processors performs the exchange first and gets 0 returned; the second processor will get 1 returned when it does the exchange

Test-and-set: tests a value and sets it if the value passes the test.
Fetch-and-increment: returns the value of a memory location and atomically increments it.
– The key to these atomic operations is that each operation is indivisible

Uninterruptible Instructions (cont'd)
Implementing a single atomic instruction in hardware is complex, because it is hard to have both a read and a write in one instruction. Therefore, recent multiprocessors use a pair of instructions instead:
– Load linked (or load locked), and
– Store conditional
Here, the second instruction returns a value from which it can be deduced whether the pair executed as if it were atomic.

Uninterruptible Instructions (cont'd)
Note that:
– Load linked (LL) returns the initial value
– Store conditional (SC) returns 1 if it succeeds (no other store to the same memory location since the preceding load) and 0 otherwise
These instructions are used in sequence:
– If the contents of the memory location specified by the LL are changed before the SC to the same address occurs, then the SC fails
– The store conditional returns a value, 1 or 0, indicating whether the SC was successful or not
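The return-value rules described above (exchange returns 0 to the processor that wins the lock and 1 to the loser; SC returns 1 on success and 0 after an intervening store) can be sketched with a small single-threaded toy model. This is only an illustration under stated assumptions: the class `ToyMemory`, the addresses `LOCK` and `COUNTER`, and the processor names "A" and "B" are hypothetical and not part of the lecture, and Python is used purely as executable pseudocode. Real hardware may also fail an SC for reasons not modelled here, such as an interrupt between LL and SC.

```python
class ToyMemory:
    """Single-threaded toy model of shared memory with EXCH and LL/SC."""

    def __init__(self):
        self.cells = {}   # address -> value; unwritten addresses read as 0
        self.links = {}   # processor id -> address named by its last LL

    def store(self, addr, value):
        """Ordinary store: write the value and break every link on addr."""
        self.cells[addr] = value
        self.links = {p: a for p, a in self.links.items() if a != addr}

    def exchange(self, addr, reg_value):
        """Atomic exchange: write reg_value, return the old memory value."""
        old = self.cells.get(addr, 0)
        self.store(addr, reg_value)
        return old

    def load_linked(self, proc, addr):
        """LL: return the current value and record the link for proc."""
        self.links[proc] = addr
        return self.cells.get(addr, 0)

    def store_conditional(self, proc, addr, value):
        """SC: store and return 1 if proc's link is intact, else return 0."""
        if self.links.get(proc) != addr:
            return 0
        self.store(addr, value)   # a successful SC also consumes the link
        return 1


LOCK, COUNTER = 0x100, 0x104      # hypothetical addresses

mem = ToyMemory()

# Two processors race for the lock with atomic exchange (0 = free, 1 = held).
first = mem.exchange(LOCK, 1)     # processor A gets 0: it now holds the lock
second = mem.exchange(LOCK, 1)    # processor B gets 1: lock is unavailable
print(first, second)              # prints: 0 1

# LL/SC increment on A, disturbed once by a store from B.
mem.store(COUNTER, 5)
seen = mem.load_linked("A", COUNTER)                        # A reads 5
mem.store(COUNTER, 9)                                       # B's store breaks A's link
sc_failed = mem.store_conditional("A", COUNTER, seen + 1)   # returns 0, memory stays 9
seen = mem.load_linked("A", COUNTER)                        # retry: A reads 9
sc_ok = mem.store_conditional("A", COUNTER, seen + 1)       # returns 1, memory becomes 10
print(sc_failed, sc_ok, mem.cells[COUNTER])                 # prints: 0 1 10
```

The model keeps one link per processor, mirroring the link register that SC checks; in an actual program the failed SC would simply branch back to the LL, as in a retry loop.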
Uninterruptible Instructions (cont'd)
Let us consider an example program segment showing the implementation of an atomic exchange on the memory location specified by the contents of register R1.
Example doing atomic swap with LL & SC:
try:  mov   R3,R4       ; move exchange value
      ll    R2,0(R1)    ; load linked
      sc    R3,0(R1)    ; store conditional
      beqz  R3,try      ; branch if store fails (R3 = 0)
      mov   R4,R2       ; put loaded value in R4
At the end of this sequence, the contents of R4 and of the memory location specified by R1 have been atomically exchanged.

Uninterruptible Instructions (cont'd)
The LL-SC primitive can be used to build other primitives; e.g., atomic fetch-and-increment can be constructed as:
Example doing fetch & increment with LL & SC:
try:  ll    R2,0(R1)    ; load linked (R2 keeps the fetched value)
      addi  R3,R2,#1    ; increment (OK if reg-reg)
      sc    R3,0(R1)    ; store conditional
      beqz  R3,try      ; branch if store fails (R3 = 0)
Because the SC instruction simply checks that its address matches the one in the link register, register-register instructions can safely be placed after the LL instruction; however, the number of instructions between LL and SC must be kept small.

Summary
In this series of four lectures on multiprocessors we have studied how computer performance can be improved using parallel processing architectures. A parallel architecture is a collection of processing elements that cooperate and communicate to solve larger problems fast. We then described the four categories of parallel architecture: SISD, SIMD, MISD and MIMD.

We noted that, based on the memory organization and interconnect strategy, MIMD machines are classified as:
– Centralized Shared Memory Architecture
– Distributed Memory Architecture
We also introduced the framework that describes a parallel architecture as a two-layer representation: programming and communication models.

We talked in detail about the sharing of caches for multiprocessing in the symmetric shared-memory architecture. Here, we studied the cache coherence problem and introduced two schemes to resolve it: write invalidation and write broadcasting. We also discussed the finite state machine for the implementation of the snooping algorithm.

Today we have discussed the FSM controller that implements directory-based protocols, which involve three processors or nodes, namely the local, home and remote nodes. We discussed the state transitions and the messages generated by the FSM controller in each state to implement the directory-based protocols. We have also discussed in detail the performance of distributed and centralized shared-memory architectures.

Concluding our discussion on multiprocessors, we can say that multiprocessors are highly effective for multi-programmed workloads. More recently, multiprocessors have proved very effective for commercial workloads such as web searching. The centralized memory architecture, also known as Symmetric Multiprocessors (SMPs), maintains a single centralized memory with uniform access time.

Conclusion
In contrast, Distributed Shared-Memory multiprocessors (DSMs) have a non-uniform memory architecture and can achieve greater scalability. The advantages of these two architectures, i.e., maximizing uniform memory access while allowing greater scalability, can be partially combined in Sun Microsystems's Wildfire architecture (Fig. 6.48, p. 623).

Here, note that large SMPs (such as the E6000) are used as nodes to maximize uniform memory access, and greater scalability is achieved by using the Wildfire Interface (WFI). Each E6000 can accept up to 15 processors or I/O boards on the Gigaplane bus interconnect. WFI can connect … or … E6000 multiprocessors by replacing one I/O board with a WFI board. You may look into further details of the Sun Microsystems's Wildfire architecture in the literature and study its performance.

Thanks and Allah Hafiz

Date posted: 05/07/2022, 11:58