Computer Systems lecture notes, Chapter 4 (Nguyễn Kim Khánh)

NKK-HUST
Computer Systems (CS-HEDSPI2019)
Chapter 4: PARALLEL ARCHITECTURES
Nguyễn Kim Khánh
Hanoi University of Science and Technology (Trường Đại học Bách khoa Hà Nội)

Course outline
- Chapter 1: Overview of computer systems
- Chapter 2: Computer memory
- Chapter 3: Input/output systems
- Chapter 4: Parallel architectures

Chapter contents
4.1 Classification of computer architectures
4.2 Shared-memory multiprocessors
4.3 Distributed-memory multiprocessors
4.4 General-purpose graphics processing units

4.1 Classification of computer architectures

Classification of computer architectures (Michael Flynn, 1966)
- SISD: Single Instruction Stream, Single Data Stream
- SIMD: Single Instruction Stream, Multiple Data Stream
- MISD: Multiple Instruction Stream, Single Data Stream
- MIMD: Multiple Instruction Stream, Multiple Data Stream

SISD
[Diagram: CU sends IS to PU; PU exchanges DS with MU]
CU: Control Unit. PU: Processing Unit. MU: Memory Unit. (IS: instruction stream, DS: data stream.)
- One processor
- A single instruction stream
- Data is stored in a single memory
- This is the (sequential) von Neumann architecture

SIMD
[Diagram: one CU sends IS to PU1, PU2, ..., PUn; each PUi exchanges DS with its own local memory LMi]
- A single instruction stream controls all processing units (PUs) simultaneously
- Each processing element has its own local data memory (LM)
- Each instruction is executed on a different set of data
- SIMD models:
  - Vector computer
  - Array processor

MISD
- One data stream is passed to a set of processors
- Each processor executes a different instruction sequence
- No such computer exists in practice yet
- May appear in the future

MIMD
- A set of processors
- The processors simultaneously execute different instruction sequences on different data
- MIMD models:
  - Multiprocessors (shared memory)
  - Multicomputers (distributed memory)

MIMD - Shared Memory
Shared-memory multiprocessors
[Diagram: CU1, CU2, ..., CUn each send IS to PU1, PU2, ..., PUn; all PUs exchange DS with one shared memory]

Forms of multicore processor organization
Figure 18.8 Multicore Organization Alternatives
(a) Dedicated L1 cache: each core has its own L1-D and L1-I caches; one L2 cache sits between the cores and main memory and I/O
(b) Dedicated L2 cache: each core has its own L1-D, L1-I and L2 caches
(c) Shared L2 cache: each core has its own L1-D and L1-I caches; all cores share one L2 cache
(d) Shared L3 cache: each core has its own L1-D, L1-I and L2 caches; all cores share one L3 cache
Slide note: interprocessor communication is easy to implement, via shared memory locations.

Intel - Core Duo
- 2006
- Two x86 superscalar cores, shared L2 cache
- Dedicated L1 cache per core: 32 KiB instruction and 32 KiB data
- 2 MiB shared L2 cache
Figure 18.9 Intel Core Duo Block Diagram
Per core: 32-kB L1 caches, execution resources, architectural state, thermal control, APIC. Shared: power management logic, 2 MB L2 shared cache, bus interface to the front-side bus.
Excerpt shown on the slide: "[...] density of today's chips, thermal management is a fundamental capability, especially for laptop and mobile systems. The Core Duo thermal control unit is designed to manage chip heat dissipation to maximize processor performance within thermal constraints. Thermal management also improves ergonomics with a cooler system and lower fan acoustic noise. In essence, the thermal management unit monitors digital sensors for high-accuracy die temperature measurements. Each core can be defined as an independent thermal zone."

Intel Core i7-990X
Figure 18.10 Intel Core i7-990X Block Diagram
Six cores, each with a 32 kB L1-I cache, a 32 kB L1-D cache and a 256 kB L2 cache; one 12 MB L3 cache; DDR3 memory controllers (8B @ 1.33 GT/s); QuickPath Interconnect (20B @ 6.4 GT/s).
Excerpt shown on the slide: "The general structure of the Intel Core i7-990X is shown in Figure 18.10. Each core has its own dedicated L2 cache and the six cores share a 12-MB L3 cache. One mechanism Intel uses to make its caches more effective is prefetching."

4.3 Distributed-memory multiprocessors
- Large-scale computers (Warehouse Scale Computers or Massively Parallel Processors, MPP)
- Cluster computers (clusters)
Excerpt shown on the slide: "Many different topologies, switching schemes, and routing algorithms are used. What all multicomputers have in common is that when an application program executes the send primitive, the communication processor is notified and transmits a block of user data to the destination machine (possibly after first asking for and getting permission). A generic multicomputer is shown in Fig. 8-36."
Figure 8-36 A generic multicomputer
Each node contains CPUs, memory, a local interconnect, disk and I/O, and a communication processor; the nodes are joined by a high-performance interconnection network.
Excerpt shown on the slide: "8.4.1 Interconnection Networks. In Fig. 8-36 we see that multicomputers are held together by interconnection networks. Now it is time to look more closely at these interconnection networks. Interestingly enough, multiprocessors and multicomputers are surprisingly similar in this respect because multiprocessors often have multiple memory modules that must also be interconnected with one another and with the CPUs. Thus the material in this section frequently applies to both kinds of systems."

Interconnection networks
Figure 8-37 Various topologies. The heavy dots represent switches. The CPUs and memories are not shown. (a) A star. (b) A complete interconnect. (c) A tree. (d) A ring. (e) A grid. (f) A double torus. (g) A cube. (h) A 4D hypercube.

Massively Parallel Processors
- Large-scale systems
- Expensive: many millions of USD
- Used for scientific computing and for problems with very large numbers of operations and very large data sets
- Supercomputers

IBM Blue Gene/P
Excerpt shown on the slide: "The four CPUs are connected via a high-bandwidth bus to a 3D torus network, which requires six connections: up, down, north, south, east, and west. In addition, each processor has a port to the collective network, used for broadcasting data to all processors. The barrier port is used to speed up synchronization operations, giving each processor fast access to a specialized synchronization network. At the next level up, IBM designed a custom card that holds one of the chips shown in Fig. 8-38 along with 2 GB of DDR2 DRAM. The chip and the card are shown in Fig. 8-39(a)–(b) respectively."
Figure 8-39 The BlueGene/P: (a) chip, (b) card, (c) board, (d) cabinet, (e) system
(a) Chip: 4 processors, 8-MB L3 cache
(b) Card: 1 chip, 4 CPUs, 2 GB (2-GB DDR2 DRAM)
(c) Board: 32 cards, 32 chips, 128 CPUs, 64 GB
(d) Cabinet: 32 boards, 1024 cards, 1024 chips, 4096 CPUs, 2 TB
(e) System: 72 cabinets, 73728 cards, 73728 chips, 294912 CPUs, 144 TB
Excerpt shown on the slide: "The cards are mounted on plug-in boards, with 32 cards per board for a total of 32 chips (and thus 128 CPUs) per board. Since each card contains 2 GB of DRAM, the boards contain 64 GB apiece. One board is illustrated in Fig. 8-39(c). At the next level, 32 of these boards are plugged into a cabinet, packing 4096 CPUs into a single cabinet. A cabinet is illustrated in Fig. 8-39(d). Finally, a full system, consisting of up to 72 cabinets with 294,912 CPUs, is depicted in Fig. 8-39(e)."

Cluster
- Many computers connected by a high-speed interconnection network (~ Gbps)
- Each computer can work independently (PC or SMP)
- Each computer is called a node
- The computers can be managed so that they work in parallel as a group (cluster)
- The whole system can be regarded as a single parallel computer
- High availability
- High fault tolerance

Google PC cluster
Figure 8-44 A typical Google cluster
OC-12 and OC-48 fiber links feed two 128-port Gigabit Ethernet switches; each 80-PC rack connects to the switches through two gigabit Ethernet links.
Excerpt shown on the slide: the numbers are typical values for a Google cluster; a rack does not always hold exactly 80 PCs, and switches can be larger or smaller than 128 ports. "Power density is also a key issue. A typical PC burns about 120 watts or about 10 kW per rack."

4.4 General-purpose graphics processing units
- SIMD architecture
- Derived from the graphics processor, the GPU (Graphics Processing Unit), which supports 2D and 3D graphics: data-parallel processing
- GPGPU: General-Purpose Graphics Processing Unit
- Hybrid CPU/GPGPU systems:
  - CPU is the host: sequential execution
  - GPGPU: parallel computation

Graphics processors in a computer
[Figure only]

GPGPU: NVIDIA Tesla
- Streaming multiprocessor
- Each streaming multiprocessor contains multiple streaming processors

GPGPU: NVIDIA Fermi
Excerpt shown on the slide: "A CUDA core executes a floating point or integer instruction per clock for a thread. The 512 CUDA cores are organized in 16 SMs of 32 cores each. The GPU has six 64-bit memory partitions, for a 384-bit memory interface, supporting up to a total of 6 GB of GDDR5 DRAM memory. A host interface connects the GPU to the CPU via PCI-Express. The GigaThread global scheduler distributes thread blocks to SM thread schedulers."
Figure caption: Fermi's 16 SMs are positioned around a common L2 cache. Each SM is a vertical rectangular strip that contains an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache).

NVIDIA Fermi
- 16 Streaming Multiprocessors (SMs)
- Each SM has 32 CUDA cores
- Each CUDA (Compute Unified Device Architecture) core has one FPU and one IU (integer unit)
Figure: Fermi Streaming Multiprocessor (SM)
Instruction cache; two warp schedulers, each with a dispatch unit; register file (32,768 x 32-bit); 32 CUDA cores, each with a dispatch port, operand collector, FP unit, INT unit and result queue; load/store (LD/ST) units; special function units (SFUs); interconnect network; 64 KB shared memory / L1 cache; uniform cache.
Excerpt shown on the slide: "Third Generation Streaming Multiprocessor. The third generation SM introduces several architectural innovations that make it not only the most powerful SM yet built, but also the most programmable and efficient. 512 High Performance CUDA cores. Each SM features 32 CUDA processors, a fourfold increase over prior SM designs. Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU). Prior GPUs used IEEE 754-1985 floating point arithmetic. The Fermi architecture implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction for both single and double precision arithmetic. FMA improves over a multiply-add (MAD) instruction by doing the multiplication and addition with a single final rounding step, with no loss of precision in the addition. FMA is more accurate than performing the operations separately. GT200 implemented double precision FMA. In GT200, the integer ALU was limited to 24-bit precision for multiply operations; as a result, multi-instruction emulation sequences were required for integer arithmetic."
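Flynn's classification in section 4.1 depends only on whether a machine has one or several instruction streams and one or several data streams. The sketch below encodes that rule; the function name `flynn_class` and its parameters are hypothetical and chosen only for illustration.

```python
def flynn_class(instruction_streams, data_streams):
    """Return the Flynn class name for the given numbers of streams.

    Both counts must be at least 1. One stream maps to "S" (single),
    more than one maps to "M" (multiple).
    """
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    instr = "S" if instruction_streams == 1 else "M"
    data = "S" if data_streams == 1 else "M"
    return instr + "I" + data + "D"


if __name__ == "__main__":
    # A von Neumann machine: one instruction stream, one data stream.
    print(flynn_class(1, 1))
    # One control unit driving 8 processing units with local memories.
    print(flynn_class(1, 8))
    # 4 processors, each running its own program on its own data.
    print(flynn_class(4, 4))
```

Under this rule a GPGPU streaming multiprocessor, where one instruction is applied to many data elements, falls into the SIMD class, and a multicore CPU or a cluster falls into MIMD.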
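The slide note on shared-memory multiprocessors says that interprocessor communication is easy to implement through shared memory locations. A minimal sketch of that idea follows, using Python threads as stand-ins for processors and one variable as the shared location. The name `shared_memory_sum` is hypothetical, and CPython threads do not actually run Python code in parallel, so this illustrates only the communication pattern, not a speedup.

```python
import threading


def shared_memory_sum(values, workers=4):
    """Sum `values` with several threads that communicate via one shared variable."""
    total = 0                      # the shared memory location
    lock = threading.Lock()        # serializes updates to the shared location

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)       # private work, no communication needed
        with lock:                 # publish the partial result to shared memory
            total += partial

    # Give each worker every `workers`-th element.
    chunks = [values[i::workers] for i in range(workers)]
    threads = [threading.Thread(target=worker, args=(chunk,)) for chunk in chunks]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return total


if __name__ == "__main__":
    print(shared_memory_sum(list(range(1, 101))))
```

The lock is the design point: without it, two workers could read and write `total` at the same time and lose an update, which is the usual hazard of communicating through shared memory.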
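Two of the topologies listed in Figure 8-37, the ring and the 4D hypercube, can be compared by how many hops a message needs between switches. The sketch below assumes the common numbering in which hypercube nodes are labeled with bit strings and neighbors differ in exactly one bit; that numbering and the function names are assumptions for illustration, not something the figure specifies.

```python
def hypercube_neighbors(node, dimensions):
    """Return the neighbors of `node` in a hypercube with 2**dimensions nodes."""
    if not 0 <= node < 2 ** dimensions:
        raise ValueError("node label out of range")
    # Flipping each bit in turn gives one neighbor per dimension.
    return [node ^ (1 << bit) for bit in range(dimensions)]


def hypercube_hops(src, dst):
    """Minimum hops in a hypercube: the number of differing label bits."""
    return bin(src ^ dst).count("1")


def ring_hops(src, dst, size):
    """Minimum hops in a bidirectional ring of `size` nodes."""
    distance = abs(src - dst) % size
    return min(distance, size - distance)


if __name__ == "__main__":
    print(hypercube_neighbors(0, 4))   # the four neighbors of node 0000
    print(hypercube_hops(0b0000, 0b1111))
    print(ring_hops(0, 8, 16))
```

With 16 nodes, the worst case is 4 hops in the 4D hypercube and 8 hops in the ring, at the cost of four links per node instead of two.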
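The Blue Gene/P totals quoted with Figure 8-39 follow from multiplying through the packaging hierarchy. The sketch below reproduces that arithmetic from the per-level figures in the text (4 CPUs per chip, one chip and 2 GB per card, 32 cards per board, 32 boards per cabinet, up to 72 cabinets); the constant and function names are hypothetical.

```python
CPUS_PER_CHIP = 4
CHIPS_PER_CARD = 1
GB_PER_CARD = 2
CARDS_PER_BOARD = 32
BOARDS_PER_CABINET = 32
MAX_CABINETS = 72


def bluegene_p_totals(cabinets=MAX_CABINETS):
    """Return card, chip, CPU and memory totals for the given number of cabinets."""
    cards = cabinets * BOARDS_PER_CABINET * CARDS_PER_BOARD
    chips = cards * CHIPS_PER_CARD
    cpus = chips * CPUS_PER_CHIP
    memory_gb = cards * GB_PER_CARD
    return {"cards": cards, "chips": chips, "cpus": cpus, "memory_gb": memory_gb}


if __name__ == "__main__":
    one_cabinet = bluegene_p_totals(1)
    full_system = bluegene_p_totals()
    print(one_cabinet)
    print(full_system)
    # Memory in TB, taking 1 TB = 1024 GB.
    print(full_system["memory_gb"] // 1024, "TB")
```

Taking 1 TB as 1024 GB, a cabinet holds 2 TB and the 72-cabinet system holds 144 TB, which matches the figure.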
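The Fermi excerpt states that FMA performs the multiplication and addition with a single final rounding step. The sketch below illustrates the difference on the CPU in double precision. It does not call a hardware FMA instruction: the single rounding is emulated by computing the exact result with `fractions.Fraction` and converting to `float` once, and the function names are hypothetical.

```python
from fractions import Fraction


def multiply_then_add(a, b, c):
    """Compute a*b + c with two roundings: one after the multiply, one after the add."""
    return a * b + c


def fused_multiply_add(a, b, c):
    """Emulate FMA: form a*b + c exactly, then round to double once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


if __name__ == "__main__":
    a = 1.0 + 2.0 ** -30
    b = 1.0 - 2.0 ** -30
    c = -1.0
    # Exact product is 1 - 2**-60, which rounds to 1.0 in double precision,
    # so the separate multiply loses the small term before the add.
    print(multiply_then_add(a, b, c))
    print(fused_multiply_add(a, b, c))
```

For these inputs the two-step version returns 0.0 while the fused version keeps the true result, -2**-60, which is the loss of precision the excerpt refers to.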
