Advanced Computer Architecture - Lecture 34: Multiprocessors


Document information

Advanced Computer Architecture - Lecture 34: Multiprocessors. This lecture will cover the following: shared memory architectures; parallel processing; parallel processing architectures; symmetric shared memory; distributed shared memory; performance of parallel architectures;...

CS 704 Advanced Computer Architecture
Lecture 34: Multiprocessors (Shared Memory Architectures)
Prof. Dr. M. Ashraf Chughtai
MAC/VU-Advanced Computer Architecture Lec 34 Multiprocessor (1)

Today's Topics
– Recap: Parallel Processing
– Parallel Processing Architectures: Symmetric Shared Memory, Distributed Shared Memory
– Performance of Parallel Architectures
– Summary

Recap
So far our focus has been on the performance of single-instruction-stream computers, and on methodologies to enhance the performance of such machines. We studied how:
– Instruction-Level Parallelism is exploited among the instructions of a stream; and
– control, data, and memory dependencies are resolved.

Recap: ILP
These characteristics are realized through:
– Pipelining the datapath
– Superscalar architecture
– Very Long Instruction Word (VLIW) architecture
– Out-of-order execution

Parallel Processing and Parallel Architecture
However, further improvements in performance may be achieved by exploiting parallelism among multiple instruction streams, which uses:
– Multithreading, i.e., a number of instruction streams running on one CPU
– Multiprocessing, i.e., streams running on multiple CPUs, where each CPU can itself be multithreaded

Parallel Computers Performance: Amdahl's Law
While evaluating the performance enhancement due to parallel processing, two important challenges must be taken into consideration:
– the limited parallelism available in programs; and
– the high cost of communication.
These limitations make it difficult to achieve good speedup in any parallel processor. For example, if a portion of the program is sequential, it limits the speedup; this can be understood by the following example:
Example: What fraction of the original computation can be sequential if we are to achieve a speedup of 80 with 100 processors?

Answer: Amdahl's law states that:

Speedup = 1 / [(1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Here, the enhanced fraction is the fraction executed in parallel; therefore the speedup can be expressed as:

80 = 1 / [Fraction_parallel / 100 + (1 − Fraction_parallel)]

Simplifying the expression, we get:

0.8 × Fraction_parallel + 80 × (1 − Fraction_parallel) = 1
80 − 79.2 × Fraction_parallel = 1
Fraction_parallel = (80 − 1) / 79.2 = 0.9975

That is, to achieve a speedup of 80 with 100 processors, only 0.25% of the computation is allowed to be sequential!

Parallel Computers Performance
The second major challenge in parallel processing is the communication cost, which involves the latency of remote access. Let us consider another example to explain the impact of communication cost on the performance of parallel computers.

Example: Consider an application running on a 32-processor multiprocessor, with a 40 ns time to handle a remote memory reference. Assume a given instructions-per-cycle rate when all memory references hit in the cache, and a processor clock rate of 1 GHz. Find: how fast is the multiprocessor when there is no communication, versus when 0.2% of the instructions involve a remote access?
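The Amdahl's-law arithmetic in the example above can be checked with a short Python sketch. The function names are mine; the numbers (target speedup 80, 100 processors) come from the example.

```python
def amdahl_speedup(frac_parallel, n_procs):
    """Speedup predicted by Amdahl's law when frac_parallel of the
    work runs on n_procs processors and the rest is sequential."""
    return 1.0 / ((1.0 - frac_parallel) + frac_parallel / n_procs)

def required_parallel_fraction(target_speedup, n_procs):
    """Solve Amdahl's law for the parallel fraction needed to reach
    target_speedup on n_procs processors."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / n_procs)

frac = required_parallel_fraction(80, 100)
print(f"parallel fraction needed: {frac:.4f}")         # ~0.9975
print(f"sequential allowed: {(1 - frac) * 100:.2f}%")  # ~0.25%
```

Plugging the solved fraction back into `amdahl_speedup` recovers the target speedup of 80, which is a useful sanity check on the algebra.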
Solution: The effective CPI for the multiprocessor with remote references is:

CPI = Base CPI + Remote request rate × Remote access cost

4: Distributed Memory Architecture
The Cray T3E is a typical example of the NUMA architecture; it scales up to 1024 processors with 480 MB/sec links. Non-local references are accessed using communication requests generated automatically by the memory controller in the external I/O. Here, no snooping hardware coherence mechanism is employed; rather, directory-based cache-coherence protocols are used. We will discuss this in detail later.

Message Passing Architecture
So far we have been talking about the programming and communication models of the shared-memory address-space architecture and their evolution. Now let us discuss the programming and communication models of the message-passing architecture. The programming model depicted here illustrates that whole computers (CPU, memory, I/O devices) communicate through explicit I/O operations.

Message Passing Architecture: Programming Model
Note that message passing is essentially NUMA, but it is integrated at the I/O devices rather than the memory system. Here, local memory is directly accessed, i.e., a processor directly accesses its private address space (e.g., processor P directly accesses the local address X), and communication takes place via explicit message passing, i.e., via send/receive:
– A send specifies the local buffer and the receiving process on the remote computer.
– A receive specifies the sending process on the remote computer and the local buffer in which to place the data (e.g., address Y on processor Q).
– Usually a send includes a process tag, and a receive has a matching rule on the tag: match a specific tag, or match any.

A send/receive pair performs a memory-to-memory copy, where each side supplies a local address, AND performs pair-wise synchronization. The synchronization is achieved as follows: the receive waits for the send, and the transfer completes when:
– the send completes,
– the buffer is free, and
– the request is accepted.

Message Passing Architecture: Communication Model
The high-level block diagram, with a complete computer as the building block (similar to the distributed-memory shared address space), is shown here to describe the communication abstraction. Here, communication is integrated at the I/O level, not into the memory system. Such machines resemble networks of workstations (clusters), but with tighter integration, and they are easier to build than scalable shared-address-space machines. A typical example of a message-passing machine is the IBM SP, shown here.

IBM SP: Message Passing Machine
– Made out of essentially complete RS/6000 workstations
– Network interface integrated into the I/O bus
– Bandwidth limited by the I/O bus

Summary
Today we have explored how further improvement in computer performance can be accomplished using parallel processing architectures. A parallel architecture is a collection of processing elements that cooperate and communicate to solve large problems fast. We then described the four categories of parallel architecture: SISD, SIMD, MISD, and MIMD.

We noted that, based on memory organization and interconnect strategy, MIMD machines are classified as:
– Centralized Shared-Memory Architecture
– Distributed-Memory Architecture

We also introduced a framework that describes parallel architecture as a two-layer representation: the programming and communication models. These models present the sharing of address space and message passing in parallel architectures.

The advantages of the shared-memory communication model are as follows:
– Ease of programming when communication patterns are complex or vary dynamically during execution
– Lower communication overhead and better use of bandwidth for small items
– Hardware-controlled caching reduces remote communication by caching all data, both shared and private

The advantages of the message-passing communication model are as follows:
– Communication is explicit and simpler to understand, whereas in shared memory it can be hard to know when you are communicating, and how costly it is
– Easier to use sender-initiated communication, which may have some performance advantages
– Synchronization is associated with sending messages

Thanks and Allah Hafiz
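The effective-CPI example from earlier in the lecture can be worked through numerically. The base CPI is not stated in the text, so the sketch below assumes 0.5 purely for illustration; the remote-access figures (40 ns, 0.2% of instructions, 1 GHz clock) are taken from the example.

```python
CLOCK_HZ = 1e9          # 1 GHz processor clock (from the example)
REMOTE_NS = 40          # 40 ns to handle a remote memory reference
REMOTE_RATE = 0.002     # 0.2% of instructions make a remote access
BASE_CPI = 0.5          # assumed value; the lecture text omits it

# Remote access cost expressed in processor clock cycles.
remote_cost_cycles = REMOTE_NS * 1e-9 * CLOCK_HZ   # 40 cycles

# Effective CPI = Base CPI + remote request rate x remote access cost
effective_cpi = BASE_CPI + REMOTE_RATE * remote_cost_cycles

print(f"effective CPI: {effective_cpi:.2f}")
print(f"slowdown vs. no communication: {effective_cpi / BASE_CPI:.2f}x")
```

Under these assumed numbers the effective CPI rises from 0.5 to 0.58, i.e., even 0.2% remote references slow the machine down by a factor of 1.16, which is the point the lecture makes about communication cost.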

Posted: 05/07/2022, 11:57

Table of Contents

    CS 704 Advanced Computer Architecture

    Parallel Processing and Parallel Architecture

    Parallel Computers Performance Amdahl’s Law

    Introduction to Parallel Processing

    MIMD and Thread Level Parallelism

    Decentralized or Distributed Memory

    Issues of Parallel Machines

    Issue #3: Latency and Bandwidth

    Framework for Parallel processing

    Shared Address Space Architecture (for Decentralized Memory Architecture)
