Computer Architecture, Part VII: Advanced Architectures

About This Presentation
This presentation is intended to support the use of the textbook Computer Architecture: From Microprocessors to Supercomputers, Oxford University Press, 2005, ISBN 0-19-515455-X. It is updated regularly by the author as part of his teaching of the upper-division course ECE 154, Introduction to Computer Architecture, at the University of California, Santa Barbara. Instructors can use these slides freely in classroom teaching and for other educational purposes. Any other use is strictly prohibited. © Behrooz Parhami
First edition released July 2003; revised July 2004, July 2005, and March 2007.

VII  Advanced Architectures
Performance enhancement beyond what we have seen:
• What else can we do at the instruction execution level?
• Data parallelism: vector and array processing
• Control parallelism: parallel and distributed processing

Topics in This Part
Chapter 25  Road to Higher Performance
Chapter 26  Vector and Array Processing
Chapter 27  Shared-Memory Multiprocessing
Chapter 28  Distributed Multicomputing

25  Road to Higher Performance
Review of past, current, and future architectural trends:
• General-purpose and special-purpose acceleration
• Introduction to data and control parallelism

Topics in This Chapter
25.1 Past and Current Performance Trends
25.2 Performance-Driven ISA Extensions
25.3 Instruction-Level Parallelism
25.4 Speculation and Value Prediction
25.5 Special-Purpose Hardware Accelerators
25.6 Vector, Array, and Parallel Processing

25.1 Past and Current Performance Trends
Intel 4004, the first microprocessor (1971): 0.06 MIPS, 4-bit processor.
Intel Pentium 4, circa 2005: 10,000 MIPS, 32-bit processor.
[Figure: Intel processor family tree spanning the 8-bit (8008, 8080, 8084), 16-bit (8086, 8088, 80186, 80188, 80286), and 32-bit (80386, 80486, Pentium/MMX, Pentium Pro/II, Pentium III/M, Celeron) generations.]

Architectural Innovations for Improved Performance
Established methods (previously discussed) and their typical improvement factors:
1. Pipelining (and superpipelining): 3-8
2. Cache memory, 2-3 levels: 2-5
3. RISC and related ideas: 2-3
4. Multiple instruction issue (superscalar): 2-3
5. ISA extensions (e.g., for multimedia): 1-3
Newer methods (covered in Part VII) and their improvement factors:
6. Multithreading (super-, hyper-): 2-5
7. Speculation and value prediction: 2-3
8. Hardware acceleration: 2-10
9. Vector and array processing: 2-10
10. Parallel/distributed computing: 2-1000s
Available computing power circa 2000: GFLOPS on the desktop, TFLOPS in the supercomputer center, PFLOPS on the drawing board.
Computer performance grew by a factor of about 10,000 between 1980 and 2000: a factor of 100 due to faster technology and a factor of 100 due to better architecture.

Peak Performance of Supercomputers
[Figure: peak supercomputer performance rising from the GFLOPS range (Cray, Cray X-MP) through TFLOPS (TMC CM-2, TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific) toward PFLOPS (Earth Simulator) over 1980-2010, growing by roughly a factor of 10 every few years.]
Source: Dongarra, J., "Trends in High Performance Computing," Computer J., Vol. 47, No. 4, pp. 399-403, 2004 [Dong04].

Energy Consumption Is Getting Out of Hand
Figure 25.1  Trend in energy consumption for each MIPS of computational power in general-purpose processors and DSPs. (The plot tracks absolute processor performance, from kIPS to TIPS, together with GP-processor and DSP performance per watt, over calendar years 1980-2010.)

25.2 Performance-Driven ISA Extensions
Adding instructions that do more work per cycle:
• Shift-add: replace two instructions with one (e.g., multiply by 5)
• Multiply-add: replace two instructions with one (x := c + a × b)
• Multiply-accumulate: reduce round-off error (s := s + a × b)
• Conditional copy: to avoid some branches (e.g., in if-then-else)
• Subword parallelism (for multimedia applications):
  Intel MMX (multimedia extension): 64-bit registers can hold multiple integer operands
  Intel SSE (streaming SIMD extension): 128-bit registers can hold several floating-point operands
A C sketch of the shift-add and subword-parallelism ideas appears after Table 25.1 below.

Intel MMX ISA Extension
Table 25.1  MMX instructions by class; "vector" is the number of subwords operated on in parallel.
Copy instructions:
• Register copy (32 bits): integer register ↔ MMX register
• Parallel pack (vector 4, 2; saturate): convert to narrower elements
• Parallel unpack low (vector 8, 4, 2): merge lower halves of vectors
• Parallel unpack high (vector 8, 4, 2): merge upper halves of vectors
Arithmetic instructions:
• Parallel add (vector 8, 4, 2; wrap/saturate#): add; inhibit carry at boundaries
• Parallel subtract (vector 8, 4, 2; wrap/saturate#): subtract with carry inhibition
• Parallel multiply low (vector 4): multiply, keep the low halves
• Parallel multiply high (vector 4): multiply, keep the high halves
• Parallel multiply-add (vector 4): multiply, add adjacent products*
• Parallel compare equal (vector 8, 4, 2): all 1s where equal, else all 0s
• Parallel compare greater (vector 8, 4, 2): all 1s where greater, else all 0s
Shift instructions:
• Parallel left shift logical (vector 4, 2, 1): shift left, respect boundaries
• Parallel right shift logical (vector 4, 2, 1): shift right, respect boundaries
• Parallel right shift arithmetic (vector 4, 2): arithmetic shift within each (half)word
Logic instructions (bitwise):
• Parallel AND: dest ← (src1) ∧ (src2)
• Parallel ANDNOT: dest ← (src1) ∧ (src2)′
• Parallel OR: dest ← (src1) ∨ (src2)
• Parallel XOR: dest ← (src1) ⊕ (src2)
Memory access instructions:
• Parallel load MMX register (32 or 64 bits): address given in integer register
• Parallel store MMX register (32 or 64 bits): address given in integer register
Control instructions:
• Empty FP tag bits: required for compatibility$
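As a concrete illustration of two of the ISA-extension ideas above, here is a minimal C sketch: a shift-add replacement for multiply-by-5, and a saturating parallel byte add in the spirit of the "parallel add" row of Table 25.1. The sketch uses SSE2 intrinsics (the 128-bit extension mentioned in Section 25.2) rather than the older 64-bit MMX ones; the function names and test data are illustrative assumptions, not taken from the slides.

```c
/* Sketch of two performance-driven ISA-extension ideas from Section 25.2.
   Needs an x86 compiler with SSE2 enabled (e.g., gcc -msse2 isa_ext.c). */
#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>   /* SSE2: 128-bit integer subword operations */

/* Shift-add: multiply by 5 with one shift and one add instead of a multiply. */
static uint32_t times5(uint32_t x) {
    return (x << 2) + x;          /* 4x + x = 5x */
}

/* Subword parallelism: add 16 unsigned bytes at once, saturating at 255,
   analogous to MMX "parallel add" with the saturate option but on a
   128-bit SSE register instead of a 64-bit MMX register. */
static void parallel_add_sat(const uint8_t *a, const uint8_t *b, uint8_t *sum) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)sum, _mm_adds_epu8(va, vb));  /* saturating add */
}

int main(void) {
    uint8_t a[16], b[16], s[16];
    for (int i = 0; i < 16; i++) { a[i] = (uint8_t)(i * 16); b[i] = 100; }
    parallel_add_sat(a, b, s);
    printf("times5(7) = %u\n", times5(7));
    for (int i = 0; i < 16; i++)   /* sums that would exceed 255 clamp to 255 */
        printf("%u + %u -> %u\n", (unsigned)a[i], (unsigned)b[i], (unsigned)s[i]);
    return 0;
}
```

One compiled instruction (here hidden behind an intrinsic) replaces sixteen scalar adds plus the clamping branches, which is exactly the "more work per cycle" argument the slide makes.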
Figure 27.11  Structure of a ring-based distributed-memory multiprocessor (processors with caches and memories attached to a Scalable Coherent Interface (SCI) ring, with a connection to the interconnection network).

28  Distributed Multicomputing
Computer architects' dream: connect computers like toy blocks.
• Building multicomputers from loosely connected nodes
• Internode communication is done via message passing

Topics in This Chapter
28.1 Communication by Message Passing
28.2 Interconnection Networks
28.3 Message Composition and Routing
28.4 Building and Using Multicomputers
28.5 Network-Based Distributed Computing
28.6 Grid Computing and Beyond

28.1 Communication by Message Passing
Figure 28.1  Structure of a distributed multicomputer (computing nodes, each containing memories and processors, parallel input/output, and a router, numbered up to p−1 and joined by an interconnection network).

Router Design
Figure 28.2  The structure of a generic router (input and output channels with link controllers (LC) and message queues (Q), an injection channel and an ejection channel, a switch, and routing and arbitration logic).

Building Networks from Switches
Figure 28.3  Example 2 × 2 switch with point-to-point and broadcast connection capabilities (settings: straight through, crossed connection, lower broadcast, upper broadcast).
Figure 27.2  Butterfly and Beneš networks: (a) butterfly network, (b) Beneš network, each connecting processors (rows 0-15) to memories.

Interprocess Communication via Messages
Figure 28.4  Use of send and receive message-passing primitives to synchronize two processes (process A executes send x; process B is suspended at receive x and is awakened once the message arrives, after the communication latency).

28.2 Interconnection Networks
Figure 28.5  Examples of direct and indirect interconnection networks: (a) direct network, with a router attached to each node; (b) indirect network, with routers separate from the nodes.

Direct Interconnection Networks
Figure 28.6  A sampling of common direct interconnection networks: (a) 2D torus, (b) 4D hypercube, (c) chordal ring, (d) ring of rings. Only routers are shown; a computing node is implicit for each router.

Indirect Interconnection Networks
Figure 28.7  Two commonly used indirect interconnection networks: (a) hierarchical buses (level-1, level-2, and level-3 buses), (b) omega network.

28.3 Message Composition and Routing
Figure 28.8  Messages and their parts for message passing (a message is divided into packets, first to last; each packet carries a header, data or payload with possible padding, and a trailer; a transmitted packet is further divided into flow control digits, or flits).

Wormhole Switching
Figure 28.9  Concepts of wormhole switching: (a) two worms en route to their respective destinations (worm 1 moving, worm 2 blocked), (b) deadlock due to circular waiting of four blocked worms, each blocked at the point of an attempted right turn.

28.4 Building and Using Multicomputers
Figure 28.10  A task system and schedules on 1, 2, and 3 computers: (a) static task graph for tasks A-H with its inputs and outputs, (b) schedules of the tasks on 1 to 3 computers over time.
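Schedules like those in Figure 28.10 can be produced mechanically. The following is a minimal, hedged C sketch of greedy list scheduling, which repeatedly assigns any ready task (all predecessors finished) to whichever of the p machines becomes free first. The dependency list is a hypothetical stand-in only loosely patterned on the A-H graph, and unit execution times are assumed; this illustrates the general idea, not the book's algorithm.

```c
/* Greedy list scheduling of a small task graph on P identical machines.
   Sketch only: unit-time tasks, dependencies given as predecessor lists. */
#include <stdio.h>

#define N 8   /* tasks A..H */
#define P 2   /* machines   */

/* Hypothetical precedence constraints: D needs A; E needs A,B; F needs C;
   G needs D,E; H needs F,G.  Tasks A, B, C have no predecessors. */
static const int npred[N]   = {0, 0, 0, 1, 2, 1, 2, 2};
static const int pred[N][2] = {{0,0},{0,0},{0,0},{0,0},{0,1},{2,0},{3,4},{5,6}};

int main(void) {
    int finish[N]  = {0};   /* finish time of each scheduled task        */
    int done[N]    = {0};
    int free_at[P] = {0};   /* time at which each machine becomes free   */
    int scheduled  = 0;

    while (scheduled < N) {
        for (int t = 0; t < N; t++) {
            if (done[t]) continue;
            int ready_at = 0, ready = 1;
            for (int k = 0; k < npred[t]; k++) {        /* check predecessors */
                int q = pred[t][k];
                if (!done[q]) { ready = 0; break; }
                if (finish[q] > ready_at) ready_at = finish[q];
            }
            if (!ready) continue;
            int m = 0;                                  /* machine free earliest */
            for (int j = 1; j < P; j++)
                if (free_at[j] < free_at[m]) m = j;
            int start = ready_at > free_at[m] ? ready_at : free_at[m];
            finish[t]  = start + 1;                     /* unit execution time */
            free_at[m] = finish[t];
            done[t]    = 1;
            scheduled++;
            printf("task %c on machine %d: start %d, finish %d\n",
                   'A' + t, m, start, finish[t]);
        }
    }
    return 0;
}
```

Changing P from 1 to 3 reproduces the qualitative effect shown in the figure: more machines shorten the makespan until the critical path of the task graph becomes the limit.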
Building Multicomputers from Commodity Nodes
Figure 28.11  Growing clusters using modular nodes: (a) current racks of modules, where a module holds CPU(s), memory, disks, and expansion slots; (b) futuristic toy-block construction, where a module holds CPU, memory, and disks and mates with its neighbors through wireless connection surfaces.

28.5 Network-Based Distributed Computing
Figure 28.12  Network of workstations (each PC carries a fast network interface with large memory, attached as a NIC on the system or I/O bus; the network itself is built of high-speed wormhole switches).

28.6 Grid Computing and Beyond
A computational grid is analogous to the power grid: it decouples the “production” and “consumption” of computational power. Homes don't have an electricity generator; why should they have a computer?
Advantages of a computational grid:
• Near-continuous availability of computational and related resources
• Resource requirements based on the sum of averages, rather than the sum of peaks
• Paying for services based on actual usage rather than peak demand
• Distributed data storage for higher reliability, availability, and security
• Universal access to specialized and one-of-a-kind computing resources
Still to be worked out: how to charge for computation usage.
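The "sum of averages, rather than sum of peaks" advantage is statistical multiplexing. The toy C calculation below, with invented numbers (1000 users, each needing 10 units of compute 5% of the time), shows how far the shared demand a grid must cover stays below the capacity that per-user peak provisioning would require; the figures are assumptions for illustration only.

```c
/* Toy illustration of provisioning for the sum of averages vs. the sum of
   peaks. All numbers are invented for illustration only. */
#include <stdio.h>
#include <stdlib.h>

#define USERS 1000
#define STEPS 10000

int main(void) {
    /* Each user needs 10 units 5% of the time and 0 otherwise:
       peak demand 10 units, average demand 0.5 units per user. */
    double sum_of_peaks = USERS * 10.0;
    double sum_of_avgs  = USERS * 0.5;

    /* Simulate total demand on a shared grid over many time steps. */
    srand(1);
    double worst_total = 0.0;
    for (int t = 0; t < STEPS; t++) {
        double total = 0.0;
        for (int u = 0; u < USERS; u++)
            if (rand() % 100 < 5) total += 10.0;   /* user u is bursting */
        if (total > worst_total) worst_total = total;
    }

    printf("dedicated machines (sum of peaks): %.0f units\n", sum_of_peaks);
    printf("sum of averages:                   %.0f units\n", sum_of_avgs);
    printf("worst observed shared demand:      %.0f units\n", worst_total);
    return 0;
}
```

Because bursts rarely coincide, the worst observed shared demand lands close to the sum of averages plus some headroom, an order of magnitude below the sum of peaks.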
