PART FIVE  Parallel Organization

P.1  ISSUES FOR PART FIVE

The final part of the book looks at the increasingly important area of parallel organization. In a parallel organization, multiple processing units cooperate to execute applications. Whereas a superscalar processor exploits opportunities for parallel execution at the instruction level, a parallel processing organization looks for a grosser level of parallelism, one that enables work to be done in parallel, and cooperatively, by multiple processors. A number of issues are raised by such organizations. For example, if multiple processors, each with its own cache, share access to the same memory, hardware or software mechanisms must be employed to ensure that both processors share a valid image of main memory; this is known as the cache coherence problem. This design issue, and others, is explored in Part Five.

CHAPTER 17  PARALLEL PROCESSING

17.1 Multiple Processor Organizations
     Types of Parallel Processor Systems
     Parallel Organizations
17.2 Symmetric Multiprocessors
     Organization
     Multiprocessor Operating System Design Considerations
     A Mainframe SMP
17.3 Cache Coherence and the MESI Protocol
     Software Solutions
     Hardware Solutions
     The MESI Protocol
17.4 Multithreading and Chip Multiprocessors
     Implicit and Explicit Multithreading
     Approaches to Explicit Multithreading
     Example Systems
17.5 Clusters
     Cluster Configurations
     Operating System Design Issues
     Cluster Computer Architecture
     Blade Servers
     Clusters Compared to SMP
17.6 Nonuniform Memory Access
     Motivation
     Organization
     NUMA Pros and Cons
17.7 Vector Computation
     Approaches to Vector Computation
     IBM 3090 Vector Facility
17.8 Recommended Reading and Web Site
17.9 Key Terms, Review Questions, and Problems

KEY POINTS

◆ A traditional way to increase system performance is to use multiple processors that can execute in parallel to support a given workload. The two most common multiple-processor organizations are symmetric multiprocessors (SMPs) and clusters. More recently, nonuniform memory access (NUMA) systems have been introduced commercially.
◆ An SMP consists of multiple similar processors within the same computer, interconnected by a bus or some sort of switching arrangement. The most critical problem to address in an SMP is that of cache coherence. Each processor has its own cache, and so it is possible for a given line of data to be present in more than one cache. If such a line is altered in one cache, then both main memory and the other cache have an invalid version of that line. Cache coherence protocols are designed to cope with this problem.
◆ When more than one processor is implemented on a single chip, the configuration is referred to as chip multiprocessing. A related design scheme is to replicate some of the components of a single processor so that the processor can execute multiple threads concurrently; this is known as a multithreaded processor.
◆ A cluster is a group of interconnected, whole computers working together as a unified computing resource that can create the illusion of being one machine. The term whole computer means a system that can run on its own, apart from the cluster.
◆ A NUMA system is a shared-memory multiprocessor in which the access time from a given processor to a word in memory varies with the location of the memory word.
◆ A special-purpose type of parallel organization is the vector facility, which is tailored to the processing of vectors or arrays of data.

Traditionally, the computer has been viewed as a sequential machine.
Most computer programming languages require the programmer to specify algorithms as sequences of instructions. Processors execute programs by executing machine instructions in a sequence and one at a time. Each instruction is executed in a sequence of operations (fetch instruction, fetch operands, perform operation, store results).

This view of the computer has never been entirely true. At the micro-operation level, multiple control signals are generated at the same time. Instruction pipelining, at least to the extent of overlapping fetch and execute operations, has been around for a long time. Both of these are examples of performing functions in parallel. This approach is taken further with superscalar organization, which exploits instruction-level parallelism. With a superscalar machine, there are multiple execution units within a single processor, and these may execute multiple instructions from the same program in parallel.

As computer technology has evolved, and as the cost of computer hardware has dropped, computer designers have sought more and more opportunities for parallelism, usually to enhance performance and, in some cases, to increase availability. After an overview, this chapter looks at some of the most prominent approaches to parallel organization. First, we examine symmetric multiprocessors (SMPs), one of the earliest and still the most common example of parallel organization. In an SMP organization, multiple processors share a common memory. This organization raises the issue of cache coherence, to which a separate section is devoted. Then we describe clusters, which consist of multiple independent computers organized in a cooperative fashion. Next, the chapter examines multithreaded processors and chip multiprocessors. Clusters have become increasingly common to support workloads that are beyond the capacity of a single SMP. Another approach to the use of multiple processors that we examine is that of nonuniform memory access (NUMA) machines. The NUMA approach is relatively new and not yet proven in the marketplace, but is often considered as an alternative to the SMP or cluster approach. Finally, this chapter looks at hardware organizational approaches to vector computation. These approaches optimize the ALU for processing vectors or arrays of floating-point numbers. They are common on the class of systems known as supercomputers.

17.1 MULTIPLE PROCESSOR ORGANIZATIONS

Types of Parallel Processor Systems

A taxonomy first introduced by Flynn [FLYN72] is still the most common way of categorizing systems with parallel processing capability. Flynn proposed the following categories of computer systems:

• Single instruction, single data (SISD) stream: A single processor executes a single instruction stream to operate on data stored in a single memory. Uniprocessors fall into this category.
• Single instruction, multiple data (SIMD) stream: A single machine instruction controls the simultaneous execution of a number of processing elements on a lockstep basis. Each processing element has an associated data memory, so that each instruction is executed on a different set of data by the different processors. Vector and array processors fall into this category, and are discussed in Section 17.7.
• Multiple instruction, single data (MISD) stream: A sequence of data is transmitted to a set of processors, each of which executes a different instruction sequence. This structure is not commercially implemented.
• Multiple instruction, multiple data (MIMD) stream: A set of processors simultaneously execute different instruction sequences on different data sets. SMPs, clusters, and NUMA systems fit into this category.

With the MIMD organization, the processors are general purpose; each is able to process all of the instructions necessary to perform the appropriate data transformation. MIMDs can be further subdivided by the means in which the processors communicate (Figure 17.1).

[Figure 17.1 A Taxonomy of Parallel Processor Architectures: processor organizations divide into SISD (uniprocessor), SIMD (vector processor, array processor), MISD, and MIMD; MIMD divides into shared memory (tightly coupled: symmetric multiprocessor, nonuniform memory access) and distributed memory (loosely coupled: clusters)]

If the processors share a common memory, then each processor accesses programs and data stored in the shared memory, and processors communicate with each other via that memory. The most common form of such a system is known as a symmetric multiprocessor (SMP), which we examine in Section 17.2. In an SMP, multiple processors share a single memory or pool of memory by means of a shared bus or other interconnection mechanism; a distinguishing feature is that the memory access time to any region of memory is approximately the same for each processor. A more recent development is the nonuniform memory access (NUMA) organization, which is described in Section 17.6. As the name suggests, the memory access time to different regions of memory may differ for a NUMA processor. A collection of independent uniprocessors or SMPs may be interconnected to form a cluster. Communication among the computers is either via fixed paths or via some network facility.

Parallel Organizations

Figure 17.2 illustrates the general organization of the taxonomy of Figure 17.1. Figure 17.2a shows the structure of an SISD. There is some sort of control unit (CU) that provides an instruction stream (IS) to a processing unit (PU). The processing unit operates on a single data stream (DS) from a memory unit (MU). With an SIMD, there is still a single control unit, now feeding a single instruction stream to multiple PUs. Each PU may have its own dedicated memory (illustrated in Figure 17.2b), or there may be a shared memory. Finally, with the MIMD, there are multiple control units, each feeding a separate instruction stream to its own PU. The MIMD may be a shared-memory multiprocessor (Figure 17.2c) or a distributed-memory multicomputer (Figure 17.2d).

[Figure 17.2 Alternative Computer Organizations: (a) SISD; (b) SIMD (with distributed memory); (c) MIMD (with shared memory); (d) MIMD (with distributed memory). CU = control unit; IS = instruction stream; PU = processing unit; DS = data stream; MU = memory unit; LM = local memory]

The design issues relating to SMPs, clusters, and NUMAs are complex, involving issues relating to physical organization, interconnection structures, interprocessor communication, operating system design, and application software techniques. Our concern here is primarily with organization, although we touch briefly on operating system design issues.
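To make the shared-memory (tightly coupled) case concrete, the short sketch below shows two threads communicating purely through a variable that both can address, which is essentially the mechanism available to cooperating tasks on an SMP, where the operating system may schedule the threads on different processors. The sketch is illustrative only and is not from the text; all names are invented.

/* Two threads communicating through shared memory (tightly coupled model).
   Compile with: cc -pthread example.c */
#include <pthread.h>
#include <stdio.h>

static int shared_value;                 /* visible to every processor        */
static int ready;                        /* flag set by the producer thread   */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    shared_value = 42;                   /* communicate by writing shared memory */
    ready = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);

    pthread_mutex_lock(&lock);
    while (!ready)                       /* wait for the value written by the other thread */
        pthread_cond_wait(&cond, &lock);
    printf("received %d via shared memory\n", shared_value);
    pthread_mutex_unlock(&lock);

    pthread_join(t, NULL);
    return 0;
}

On a distributed-memory (loosely coupled) organization, the same exchange would instead be expressed as an explicit message sent over the interconnection network.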
17.2 SYMMETRIC MULTIPROCESSORS

Until fairly recently, virtually all single-user personal computers and most workstations contained a single general-purpose microprocessor. As demands for performance increase and as the cost of microprocessors continues to drop, vendors have introduced systems with an SMP organization. The term SMP refers to a computer hardware architecture and also to the operating system behavior that reflects that architecture. An SMP can be defined as a standalone computer system with the following characteristics:

1. There are two or more similar processors of comparable capability.

Intel Core Duo

The Core Duo includes an advanced power-gating capability that allows for an ultra-fine-grained logic control that turns on individual processor logic subsystems only if and when they are needed. Additionally, many buses and arrays are split so that data required in some modes of operation can be put in a low-power state when not needed.

The Core Duo chip includes a shared 2-MB L2 cache. The cache logic allows for a dynamic allocation of cache space based on current core needs, so that one core can be assigned up to 100% of the L2 cache. The L2 cache includes logic to support the MESI protocol for the attached L1 caches. The key point to consider is when a cache write is done at the L1 level. A cache line gets the M state when a processor writes to it; if the line is not in E or M state prior to writing it, the cache sends a Read For Ownership (RFO) request that ensures that the line exists in the L1 cache and is in the I state in the other L1 cache. The Intel Core Duo extends this protocol to take into account the case when there are multiple Core Duo chips organized as a symmetric multiprocessor (SMP) system. The L2 cache controller allows the system to distinguish between a situation in which data are shared by the two local cores, but not with the rest of the world, and a situation in which the data are shared by one or more caches on the die as well as by an agent on the external bus (can be another processor). When a core issues an RFO, if the line is shared only by the other cache within the local die, we can resolve the RFO internally very fast, without going to the external bus at all. Only if the line is shared with another agent on the external bus do we need to issue the RFO externally.

The bus interface connects to the external bus, known as the Front Side Bus, which connects to main memory, I/O controllers, and other processor chips.

Intel Core i7

The Intel Core i7, introduced in November of 2008, implements four x86 SMT processors, each with a dedicated L2 cache, and with a shared L3 cache (Figure 18.8d).

The general structure of the Intel Core i7 is shown in Figure 18.10. Each core has its own dedicated L2 cache and the four cores share an 8-MB L3 cache.

[Figure 18.10 Intel Core i7 Block Diagram: Core 0 through Core 3 with dedicated L2 caches and a shared L3 cache; three 8-byte DDR3 memory channels at 1.33 GT/s; four 20-bit QuickPath links at 6.4 GT/s]

One mechanism Intel uses to make its caches more effective is prefetching, in which the hardware examines memory access patterns and attempts to fill the caches speculatively with data that's likely to be requested soon.
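The write handling described above for the Core Duo's L1 caches can be summarized in a small state-transition sketch. This is only an illustration of the MESI write decision and the RFO request; the type and function names are invented and do not correspond to Intel's actual cache logic.

/* Illustrative sketch of the MESI write decision described above.
   All names are hypothetical. */
typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_state_t;

typedef struct {
    mesi_state_t state;
    /* ... tag and data fields omitted ... */
} cache_line_t;

/* Stand-in for the Read For Ownership request: in the real hardware this
   installs the line in the writing core's L1 and forces it to the Invalid
   state in the other core's L1; it is resolved on-die when possible and on
   the Front Side Bus otherwise. */
static void issue_rfo(cache_line_t *line)
{
    (void)line;
}

/* Called when a core writes to a line through its L1 data cache. */
static void l1_write(cache_line_t *line)
{
    if (line->state == MODIFIED || line->state == EXCLUSIVE) {
        line->state = MODIFIED;      /* exclusive owner: write locally    */
    } else {
        issue_rfo(line);             /* Shared or Invalid: gain ownership */
        line->state = MODIFIED;      /* then perform the write            */
    }
}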
It is interesting to compare the performance of this three-level on-chip cache organization with a comparable two-level organization from Intel. Table 18.1 shows the cache access latency, in terms of clock cycles, for two Intel multicore systems running at the same clock frequency. The Core 2 Quad has a shared L2 cache, similar to the Core Duo. The Core i7 improves on L2 cache performance with the use of the dedicated L2 caches, and provides a relatively high-speed access to the L3 cache.

Table 18.1  Cache Latency (in clock cycles)

  CPU               Core 2 Quad    Core i7
  Clock frequency   2.66 GHz       2.66 GHz
  L1 cache          3 cycles       4 cycles
  L2 cache          15 cycles      11 cycles
  L3 cache          —              39 cycles

The Core i7 chip supports two forms of external communications to other chips. The DDR3 memory controller brings the memory controller for the DDR main memory onto the chip (the DDR synchronous RAM memory is discussed in Chapter 5). The interface supports three channels that are 8 bytes wide, for a total bus width of 192 bits, for an aggregate data rate of up to 32 GB/s. With the memory controller on the chip, the Front Side Bus is eliminated.

The QuickPath Interconnect (QPI) is a cache-coherent, point-to-point link-based electrical interconnect specification for Intel processors and chipsets. It enables high-speed communications among connected processor chips. The QPI link operates at 6.4 GT/s (gigatransfers per second). At 16 bits per transfer, that adds up to 12.8 GB/s, and since QPI links involve dedicated bidirectional pairs, the total bandwidth is 25.6 GB/s.

18.5 ARM11 MPCore

The ARM11 MPCore is a multicore product based on the ARM11 processor family. The ARM11 MPCore can be configured with up to four processors, each with its own L1 instruction and data caches, per chip. Table 18.2 lists the configurable options for the system, including the default values.

Figure 18.11 presents a block diagram of the ARM11 MPCore. The key elements of the system are as follows:

• Distributed interrupt controller (DIC): Handles interrupt detection and interrupt prioritization. The DIC distributes interrupts to individual processors.
• Timer: Each CPU has its own private timer that can generate interrupts.
• Watchdog: Issues warning alerts in the event of software failures. If the watchdog is enabled, it is set to a predetermined value and counts down to 0. It is periodically reset. If the watchdog value reaches zero, an alert is issued.
• CPU interface: Handles interrupt acknowledgement, interrupt masking, and interrupt completion acknowledgement.
• CPU: A single ARM11 processor. Individual CPUs are referred to as MP11 CPUs.
• Vector floating-point (VFP) unit: A coprocessor that implements floating-point operations in hardware.
• L1 cache: Each CPU has its own dedicated L1 data cache and L1 instruction cache.
• Snoop control unit (SCU): Responsible for maintaining coherency among L1 data caches.

[Figure 18.11 ARM11 MPCore Processor Block Diagram: read/write 64-bit bus and optional second R/W 64-bit bus]

Interrupt Handling

The Distributed Interrupt Controller (DIC) collates interrupts from a large number of sources. It provides

• Masking of interrupts
• Prioritization of the interrupts
• Distribution of the interrupts to the target MP11 CPUs
• Tracking the status of interrupts
• Generation of interrupts by software

The DIC is a single functional unit that is placed in the system alongside MP11 CPUs. This enables the number of interrupts supported in the system to be independent of the MP11 CPU design.
The DIC is memory mapped; that is, control registers for the DIC are defined relative to a main memory base address. The DIC is accessed by the MP11 CPUs using a private interface through the SCU.

The DIC is designed to satisfy two functional requirements:

• Provide a means of routing an interrupt request to a single CPU or multiple CPUs, as required.
• Provide a means of interprocessor communication so that a thread on one CPU can cause activity by a thread on another CPU.

As an example that makes use of both requirements, consider a multithreaded application that has threads running on multiple processors. Suppose the application allocates some virtual memory. To maintain consistency, the operating system must update memory translation tables on all processors. The OS could update the tables on the processor where the virtual memory allocation took place, and then issue an interrupt to all the other processors running this application. The other processors could then use this interrupt's ID to determine that they need to update their memory translation tables.

The DIC can route an interrupt to one or more CPUs in the following three ways:

• An interrupt can be directed to a specific processor only.
• An interrupt can be directed to a defined group of processors. The MPCore views the first processor to accept the interrupt, typically the least loaded, as being best positioned to handle the interrupt.
• An interrupt can be directed to all processors.

From the point of view of software running on a particular CPU, the OS can generate an interrupt to all but self, to self, or to specific other CPUs. For communication between threads running on different CPUs, the interrupt mechanism is typically combined with shared memory for message passing. Thus, when a thread is interrupted by an interprocessor communication interrupt, it reads from the appropriate block of shared memory to retrieve a message from the thread that triggered the interrupt. A total of 16 interrupt IDs per CPU are available for interprocessor communication. A rough sketch of this interrupt-plus-shared-memory pattern appears after the list of interrupt sources below.

From the point of view of an MP11 CPU, an interrupt can be

• Inactive: An Inactive interrupt is one that is nonasserted, or which in a multiprocessing environment has been completely processed by that CPU but can still be either Pending or Active in some of the CPUs to which it is targeted, and so might not have been cleared at the interrupt source.
• Pending: A Pending interrupt is one that has been asserted, and for which processing has not started on that CPU.
• Active: An Active interrupt is one that has been started on that CPU, but processing is not complete. An Active interrupt can be preempted when a new interrupt of higher priority interrupts MP11 CPU interrupt processing.

Interrupts come from the following sources:

• Interprocessor interrupts (IPIs): Each CPU has private interrupts, ID0-ID15, that can only be triggered by software. The priority of an IPI depends on the receiving CPU, not the sending CPU.
• Private timer and/or watchdog interrupts: These use interrupt IDs 29 and 30.
• Legacy FIQ line: In legacy IRQ mode, the legacy FIQ pin, on a per-CPU basis, bypasses the Interrupt Distributor logic and directly drives interrupt requests into the CPU.
• Hardware interrupts: Hardware interrupts are triggered by programmable events on associated interrupt input lines. CPUs can support up to 224 interrupt input lines. Hardware interrupts start at ID32.
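The interprocessor communication scheme described above can be sketched roughly as follows: the sender deposits a message in a block of shared memory and then triggers one of the 16 software-generated interrupt IDs aimed at the target CPU, whose handler reads the message back out. The register address, message layout, and all names below are invented for illustration and are not the actual MPCore register map or programming interface.

/* Hypothetical sketch of IPI-plus-shared-memory message passing. */
#include <stdint.h>
#include <string.h>

#define IPI_ID_MESSAGE  1u                                /* one of IDs 0-15 (assumed)  */
#define DIC_SOFTINT_REG ((volatile uint32_t *)0x1F001F00) /* invented register address  */

struct mailbox {                       /* one slot per target CPU, in shared memory     */
    volatile uint32_t full;
    char text[60];
};
static struct mailbox shared_mailbox[4];

void send_message(unsigned target_cpu, const char *msg)
{
    struct mailbox *mb = &shared_mailbox[target_cpu];
    strncpy(mb->text, msg, sizeof mb->text - 1);          /* 1. deposit the message      */
    mb->text[sizeof mb->text - 1] = '\0';
    mb->full = 1;                                         /* (a real system also needs   */
                                                          /*  barriers/cache maintenance)*/
    /* 2. raise a software interrupt routed to target_cpu (encoding assumed) */
    *DIC_SOFTINT_REG = ((uint32_t)target_cpu << 16) | IPI_ID_MESSAGE;
}

/* Handler run on the receiving CPU when the IPI is taken. */
void ipi_handler(unsigned this_cpu)
{
    struct mailbox *mb = &shared_mailbox[this_cpu];
    if (mb->full) {
        /* ... act on mb->text ... */
        mb->full = 0;
    }
}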
Figure 18.12 is a block diagram of the DIC. The DIC is configurable to support between 0 and 255 hardware interrupt inputs. The DIC maintains a list of interrupts, showing their priority and status. The Interrupt Distributor transmits to each CPU Interface the highest Pending interrupt for that interface. It receives back the information that the interrupt has been acknowledged, and can then change the status of the corresponding interrupt. The CPU Interface also transmits End of Interrupt Information (EOI), which enables the Interrupt Distributor to update the status of this interrupt from Active to Inactive.

[Figure 18.12 Interrupt Distributor Block Diagram: interrupt list; priority interrupts; private bus read/write; core acknowledge and End of Interrupt (EOI) information from the CPU interfaces; IRQ request to each CPU interface]

Table 18.2  ARM11 MPCore Configurable Options

  Feature                                                  Range of Options                      Default Value
  Processors                                               1 to 4
  Instruction cache size per processor                     16 KB, 32 KB, or 64 KB                32 KB
  Data cache size per processor                            16 KB, 32 KB, or 64 KB                32 KB
  Master ports                                             1 or 2
  Width of interrupt bus                                   0 to 224 by increments of 32 pins     32 pins
  Vector floating-point (VFP) coprocessor per processor    Included or not                       Included

Cache Coherency

The MPCore's Snoop Control Unit (SCU) is designed to resolve most of the traditional bottlenecks related to access to shared data and the scalability limitation introduced by coherence traffic.

The L1 cache coherency scheme is based on the MESI protocol described in Chapter 17. The SCU monitors operations on shared data to optimize MESI state migration. The SCU introduces three types of optimization: direct data intervention, duplicated tag RAMs, and migratory lines.

Direct data intervention (DDI) enables copying clean data from one CPU L1 data cache to another CPU L1 data cache without accessing external memory. This reduces read-after-read activity from the Level 1 cache to the Level 2 cache. Thus, a local L1 cache miss is resolved in a remote L1 cache rather than from access to the shared L2 cache.

Recall that the main memory location of each line within a cache is identified by a tag for that line. The tags can be implemented as a separate block of RAM of the same length as the number of lines in the cache. In the SCU, duplicated tag RAMs are duplicated versions of the L1 tag RAMs used by the SCU to check for data availability before sending coherency commands to the relevant CPUs. Coherency commands are sent only to CPUs that must update their coherent data cache. This reduces the power consumption and performance impact from snooping into and manipulating each processor's cache on each memory update. Having tag data available locally lets the SCU limit cache manipulations to processors that have cache lines in common.

The migratory lines feature enables moving dirty data from one CPU to another without writing to L2 and reading the data back in from external memory. The operation can be described as follows. In a typical MESI protocol, when one processor has a modified line and another processor attempts to read that line, the following actions occur:

1. The line contents are transferred from the modified line to the processor that initiated the read.
2. The line contents are read back to main memory.
3. The line is put in the shared state in both caches.

The MPCore SCU handles this situation differently. The SCU monitors the system for a migratory line. If one processor has a modified line, and another processor reads then writes to it, the SCU assumes such a location will experience this same operation in the future.
As this operation starts again, the SCU will automatically move the cache line directly to an invalid state rather than expending energy moving it first into the shared state. This optimization also causes the processor to transfer the cache line directly to the other processor without intervening external memory operations.

18.6 RECOMMENDED READING AND WEB SITE

Two books that provide good coverage of the issues in this chapter are [OLUK07] and [JERR05]. [GOCH06] and [MEND06] describe the Intel Core Duo. [FOG08b] provides a detailed description of the Core Duo pipeline architecture. [ARM08b] provides thorough coverage of the ARM Cortex-A8 pipeline. [HIRA07] and [GOOD05] are good overview articles.

Recommended Web site:

• Multicore Association: Vendor organization promoting the development of and use of multicore technology.

18.7 KEY TERMS, REVIEW QUESTIONS, AND PROBLEMS

Key Terms

Amdahl's law
chip multiprocessor
multicore
simultaneous multithreading (SMT)
superscalar

Review Questions

18.1 Summarize the differences among simple instruction pipelining, superscalar, and simultaneous multithreading.
18.2 Give several reasons for the choice by designers to move to a multicore organization rather than increase parallelism within a single processor.
18.3 Why is there a trend toward giving an increasing fraction of chip area to cache memory?
18.4 List some examples of applications that benefit directly from the ability to scale throughput with the number of cores.
18.5 At a top level, what are the main design variables in a multicore organization?
18.6 List some advantages of a shared L2 cache among cores compared to separate dedicated L2 caches for each core.

Problems

18.1 Consider the following problem. A designer has available a chip and decided what fraction of the chip will be devoted to cache memory (L1, L2, L3). The remainder of the chip can be devoted to a single complex superscalar and/or SMT core or multiple somewhat simpler cores. Define the following parameters:

    n = maximum number of cores that can be contained on the chip
    k = actual number of cores implemented (1 ≤ k ≤ n, where r = n/k is an integer)
    perf(r) = sequential performance gain by using the resources equivalent to r cores to form a single processor, where perf(1) = 1
    f = fraction of software that is parallelizable across multiple cores

Thus, if we construct a chip with n cores, we expect each core to provide sequential performance of 1 and for the n cores to be able to exploit parallelism up to a degree of n parallel threads. Similarly, if the chip has k cores, then each core should exhibit a performance of perf(r) and the chip is able to exploit parallelism up to a degree of k parallel threads. We can modify Amdahl's law (Equation 18.1) to reflect this situation as follows:

    Speedup = 1 / [ (1 - f)/perf(r) + f·r/(perf(r)·n) ]

a. Justify this modification of Amdahl's law.
b. Using Pollack's rule, we set perf(r) = √r. Let n = 16. We want to plot speedup as a function of r for f = 0.5; f = 0.9; f = 0.975; f = 0.99; f = 0.999. The results are available in a document at this book's Web site (multicoreperformance.pdf). What conclusions can you draw?
c. Repeat part (b) for n = 256.
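The modified speedup expression in Problem 18.1 is straightforward to evaluate numerically, and the small sketch below shows one way to tabulate it (for example, as a starting point for the plots asked for in part b). The use of Pollack's rule perf(r) = √r follows the problem statement; everything else is illustrative.

/* Evaluate the modified Amdahl speedup for a chip whose n cores' worth of
   resources are grouped into k = n/r cores, each of performance perf(r).
   Pollack's rule: perf(r) = sqrt(r).  Compile with: cc speedup.c -lm */
#include <math.h>
#include <stdio.h>

static double speedup(double f, double r, double n)
{
    double perf = sqrt(r);                       /* Pollack's rule */
    return 1.0 / ((1.0 - f) / perf + f * r / (perf * n));
}

int main(void)
{
    const double n = 16.0;
    const double fs[] = { 0.5, 0.9, 0.975, 0.99, 0.999 };
    for (int i = 0; i < 5; i++)
        for (double r = 1.0; r <= n; r *= 2.0)   /* r = 1, 2, 4, 8, 16 */
            printf("f = %5.3f  r = %2.0f  speedup = %6.2f\n",
                   fs[i], r, speedup(fs[i], r, n));
    return 0;
}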
18.2 The technical reference manual for the ARM11 MPCore says that the Distributed Interrupt Controller is memory mapped. That is, the core processors use memory-mapped I/O to communicate with the DIC. Recall from Chapter 7 that with memory-mapped I/O, there is a single address space for memory locations and I/O devices. The processor treats the status and data registers of I/O modules as memory locations and uses the same machine instructions to access both memory and I/O devices. Based on this information, what path through the block diagram of Figure 18.11 is used for the core processors to communicate with the DIC?
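As a reminder of what memory-mapped I/O looks like from the software side, the fragment below reads and writes a device control register through an ordinary pointer dereference, so the same load and store instructions used for memory also reach the device. The address, register name, and bit assignment are invented for illustration and are not the MPCore's actual register map; the fragment is not an answer to the path question in Problem 18.2.

/* Memory-mapped I/O: a device register appears at an ordinary memory address.
   The address and bit layout below are hypothetical. */
#include <stdint.h>

#define DEVICE_CTRL_REG ((volatile uint32_t *)0x10000000u)  /* assumed address */

void enable_device(void)
{
    uint32_t v = *DEVICE_CTRL_REG;   /* read the control register (a load)   */
    v |= 1u;                         /* set the enable bit (bit 0, assumed)  */
    *DEVICE_CTRL_REG = v;            /* write it back (a store)              */
}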