Computer Architecture lecture notes (Kiến trúc máy tính): Chapter 9 - Nguyễn Kim Khánh

DOCUMENT INFORMATION

Basic information

Pages	32
Size	17.72 MB

Content

Chapter 9 - Parallel architectures. The main topics covered in this chapter are: classification of computer architectures, shared-memory multiprocessors, distributed-memory multiprocessors, and general-purpose graphics processing units.

Computer Architecture (Kiến trúc máy tính)
Chapter 9: PARALLEL ARCHITECTURES
Nguyễn Kim Khánh
Hanoi University of Science and Technology (Trường Đại học Bách khoa Hà Nội)
NKK-HUST, 2017. CuuDuongThanCong.com, https://fb.com/tailieudientucntt

Course contents
Chapter 1. General introduction
Chapter 2. Digital logic fundamentals
Chapter 3. Computer systems
Chapter 4. Computer arithmetic
Chapter 5. Instruction set architecture
Chapter 6. The processor
Chapter 7. Computer memory
Chapter 8. Input/output systems
Chapter 9. Parallel architectures

Contents of Chapter 9
9.1 Classification of computer architectures
9.2 Shared-memory multiprocessors
9.3 Distributed-memory multiprocessors
9.4 General-purpose graphics processing units

9.1 Classification of computer architectures

Classification of computer architectures (Michael Flynn, 1966):
- SISD: Single Instruction Stream, Single Data Stream
- SIMD: Single Instruction Stream, Multiple Data Stream
- MISD: Multiple Instruction Stream, Single Data Stream
- MIMD: Multiple Instruction Stream, Multiple Data Stream

SISD
[Figure: the CU sends an instruction stream (IS) to the PU; the PU exchanges a data stream (DS) with the MU.]
- CU: Control Unit; PU: Processing Unit; MU: Memory Unit
- One processor
- A single instruction stream
- Data is stored in a single memory
- This is the von Neumann (sequential) architecture

SIMD
[Figure: one CU broadcasts the instruction stream (IS) to PU1, PU2, ..., PUn; each PUi exchanges a data stream (DS) with its own local memory LMi.]

SIMD (continued)
- A single instruction stream controls all processing units (PUs) simultaneously
- Each processing unit has its own local data memory (LM)
- Each instruction is executed on a different set of data
- SIMD models:
  - Vector computer
  - Array processor
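The SIMD model described above can be illustrated with a minimal sketch. It is only a software analogy, not how real SIMD hardware is built: the function name `simd_execute` and the sample data are hypothetical, and plain Python lists stand in for the local memories (LM) of the processing units.

```python
from typing import Callable, List


def simd_execute(instruction: Callable[[int], int],
                 local_memories: List[List[int]]) -> List[List[int]]:
    """Broadcast ONE instruction to every PU.

    Each inner list plays the role of one PU's local memory (LM).
    Every PU applies the same instruction to its own data, which is
    the "single instruction stream, multiple data stream" idea.
    """
    return [[instruction(x) for x in lm] for lm in local_memories]


# Three hypothetical PUs, each holding different data in its LM.
local_memories = [[1, 2], [10, 20], [100, 200]]

# The single instruction issued by the control unit: x -> 2*x + 1.
result = simd_execute(lambda x: 2 * x + 1, local_memories)
print(result)  # [[3, 5], [21, 41], [201, 401]]
```

In a real vector computer or array processor the per-PU work happens in the same clock cycles; the list comprehension here only models the data flow, not the timing.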
MISD
- One data stream is delivered to a set of processors
- Each processor executes a different instruction sequence
- No such computer exists in practice yet
- It may appear in the future

MIMD
- A set of processors
- The processors simultaneously execute different instruction sequences on different data
- MIMD models:
  - Multiprocessors (Shared Memory)
  - Multicomputers (Distributed Memory)

MIMD - Shared Memory
Shared-memory multiprocessors
[Figure: CU1, CU2, ..., CUn each send an instruction stream (IS) to PU1, PU2, ..., PUn; every PU exchanges a data stream (DS) with the shared memory.]

Multicore processor organizations
[Figure 18.8 Multicore Organization Alternatives: (a) Dedicated L1 cache; (b) Dedicated L2 cache; (c) Shared L2 cache; (d) Shared L3 cache. Each CPU core has its own L1-D and L1-I caches; main memory and I/O sit below the last cache level.]
Interprocessor communication is easy to implement, via shared memory locations. The use of a shared L2 cache confines the cache coherency problem to the L1 ...

Intel - Core Duo
- 2006
- Two x86 superscalar cores with a shared L2 cache
- Dedicated L1 cache per core: 32 KiB instruction and 32 KiB data
- 2 MiB shared L2 cache
[Figure 18.9 Intel Core Duo Block Diagram: each core has 32-kB L1 caches, execution resources, architectural state, thermal control and an APIC; the cores share the power management logic, the 2 MB L2 cache and the bus interface to the front-side bus.]
Each core has an independent thermal control unit. With the high transistor density of today's chips, thermal management is a fundamental capability, especially for laptop and mobile systems. The Core Duo thermal control unit is designed to manage chip heat dissipation to maximize processor performance within thermal constraints. Thermal management also improves ergonomics with a cooler system and lower fan acoustic noise. In essence, the thermal management unit monitors digital sensors for high-accuracy die temperature measurements. Each core can be defined as an independent thermal zone. The maximum temperature for each ...

Intel Core i7-990X
[Figure 18.10 Intel Core i7-990X Block Diagram: six cores, each with a 32 kB L1-I cache, a 32 kB L1-D cache and a 256 kB L2 cache; a 12 MB L3 cache; DDR3 memory controllers (× 8B @ 1.33 GT/s); QuickPath Interconnect (× 20B @ 6.4 GT/s).]
The general structure of the Intel Core i7-990X is shown in Figure 18.10. Each core has its own dedicated L2 cache and the six cores share a 12-MB L3 cache. One mechanism Intel uses to make its caches more effective is prefetching, in which the hardware examines memory access patterns and attempts to fill the caches spec...

9.3 Distributed-memory multiprocessors
[Figure 8-36 A generic multicomputer: each node contains a CPU, memory, disk and I/O, a local interconnect and a communication processor; the nodes are joined by a high-performance interconnection network.]
- Large-scale computers (Warehouse Scale Computers or Massively Parallel Processors, MPP)
- Cluster computers (clusters)
... connected by a high-speed interconnection network of the types we discussed in Sec. 8.3.3. Many different topologies, switching schemes, and routing algorithms are used. What all multicomputers have in common is that when an application program executes the send primitive, the communication processor is notified and transmits a block of user data to the destination machine (possibly after first asking for and getting permission). A generic multicomputer is shown in Fig. 8-36.

8.4.1 Interconnection Networks
In Fig. 8-36 we see that multicomputers are held together by interconnection networks. Now it is time to look more closely at these interconnection networks. Interestingly enough, multiprocessors and multicomputers are surprisingly similar in this respect because multiprocessors often have multiple memory modules that must also be interconnected with one another and with the CPUs. Thus the material in this section frequently applies to both kinds of systems. The fundamental reason why multiprocessor and multicomputer intercon...

Interconnection networks
[Figure 8-37 Various topologies. The heavy dots represent switches. The CPUs and memories are not shown. (a) A star. (b) A complete interconnect. (c) A tree. (d) A ring. (e) A grid. (f) A double torus. (g) A cube. (h) A 4D hypercube.]

Massively Parallel Processors
- Large-scale systems
- Expensive: many millions of USD
- Used for scientific computing and for problems with very large numbers of operations and very large data sets
- Supercomputers

IBM Blue Gene/P
... miss on L2 that hits on L3 takes about 28 cycles. Finally, a miss on L3 that has to go to the main DRAM takes about 75 cycles. The four CPUs are connected via a high-bandwidth bus to a 3D torus network, which requires six connections: up, down, north, south, east, and west. In addition, each processor has a port to the collective network, used for broadcasting data to all processors. The barrier port is used to speed up synchronization operations, giving each processor fast access to a specialized synchronization network.
At the next level up, IBM designed a custom card that holds one of the chips shown in Fig. 8-38 along with 2 GB of DDR2 DRAM. The chip and the card are shown in Fig. 8-39(a)-(b) respectively.

Figure 8-39 The BlueGene/P: (a) chip, (b) card, (c) board, (d) cabinet, (e) system.
Level        Contains       Cards    Chips    CPUs      Memory
(a) Chip     4 processors   -        1        4         8-MB L3 cache
(b) Card     1 chip         1        1        4         2 GB DDR2 DRAM
(c) Board    32 cards       32       32       128       64 GB
(d) Cabinet  32 boards      1024     1024     4096      2 TB
(e) System   72 cabinets    73728    73728    294912    144 TB

The cards are mounted on plug-in boards, with 32 cards per board for a total of 32 chips (and thus 128 CPUs) per board. Since each card contains 2 GB of DRAM, the boards contain 64 GB apiece. One board is illustrated in Fig. 8-39(c). At the next level, 32 of these boards are plugged into a cabinet, packing 4096 CPUs into a single cabinet. A cabinet is illustrated in Fig. 8-39(d). Finally, a full system, consisting of up to 72 cabinets with 294,912 CPUs, is depicted in Fig. 8-39(e). A PowerPC 450 can issue up to ... instructions/cycle, thus ...

Cluster
- Many computers connected to one another by a high-speed interconnection network (~ Gbps)
- Each computer can work independently (PC or SMP)
- Each computer is called a node
- The computers can be managed so that they work in parallel as a group (cluster)
- The whole system can be regarded as a single parallel computer
- High availability
- High fault tolerance

Google PC Cluster
[Figure 8-44 A typical Google cluster: 80-PC racks, each attached by two gigabit Ethernet links to 128-port Gigabit Ethernet switches, with OC-12 and OC-48 fiber uplinks.]
... hold exactly 80 PCs and switches can be larger or smaller than 128 ports; these are just typical values for a Google cluster.
Power density is also a key issue. A typical PC burns about 120 watts or about 10 kW per rack. A rack needs about ... m2 so that maintenance personnel can install and remove PCs and for the air conditioning to function. These parameters ...

9.4 General-purpose graphics processing units
- SIMD architecture
- Derived from the graphics processing unit (GPU), which supports 2D and 3D graphics processing: data-parallel processing
- GPGPU: General-Purpose Graphics Processing Unit
- Hybrid CPU/GPGPU system:
  - Host CPU: executes sequentially
  - GPGPU: computes in parallel

Graphics processors in computers
[Figure only.]

GPGPU: NVIDIA Tesla
- Streaming multiprocessor
  - ... × Streaming processors
[Figure only.]

GPGPU: NVIDIA Fermi
The first Fermi based GPU, implemented with 3.0 billion transistors, features up to 512 CUDA cores. A CUDA core executes a floating point or integer instruction per clock for a thread. The 512 CUDA cores are organized in 16 SMs of 32 cores each. The GPU has six 64-bit memory partitions, for a 384-bit memory interface, supporting up to a total of ... GB of GDDR5 DRAM memory. A host interface connects the GPU to the CPU via PCI-Express. The GigaThread global scheduler distributes thread blocks to SM thread schedulers.
Fermi's 16 SMs are positioned around a common L2 cache. Each SM is a vertical rectangular strip that contains an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache).

NVIDIA Fermi
- 16 Streaming Multiprocessors (SM)
- Each SM has 32 CUDA cores
- Each CUDA (Compute Unified Device Architecture) core has one FPU and one IU
[Figure: Fermi Streaming Multiprocessor (SM): instruction cache; two warp schedulers, each with a dispatch unit; register file (32,768 x 32-bit); CUDA cores; load/store (LD/ST) units; special function units (SFU); interconnect network; 64 KB shared memory / L1 cache; uniform cache. Each CUDA core contains a dispatch port, an operand collector, an FP unit, an INT unit and a result queue.]

Third Generation Streaming Multiprocessor
The third generation SM introduces several architectural innovations that make it not only the most powerful SM yet built, but also the most programmable and efficient.

512 High Performance CUDA cores
Each SM features 32 CUDA processors, a fourfold increase over prior SM designs. Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU). Prior GPUs used IEEE 754-1985 floating point arithmetic. The Fermi architecture implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction for both single and double precision arithmetic. FMA improves over a multiply-add (MAD) instruction by doing the multiplication and addition with a single final rounding step, with no loss of precision in the addition. FMA is more accurate than performing the operations separately. GT200 implemented double precision FMA.
In GT200, the integer ALU was limited to 24-bit precision for multiply operations; as a result, multi-instruction emulation sequences were required for integer arithmetic. In Fermi, the newly designed integer ALU supports full 32-bit precision for all instructions, consistent with standard ...

End
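The difference between a fused multiply-add (FMA, one final rounding) and a separate multiply-then-add (MAD, two roundings) mentioned in the Fermi text can be made concrete with a small sketch. It assumes double precision only and emulates the fused operation with exact rational arithmetic from Python's standard `fractions` module; the function names `mad` and `fma_emulated` and the chosen operands are hypothetical, and this is not how a GPU implements FMA in hardware.

```python
from fractions import Fraction


def mad(a: float, b: float, c: float) -> float:
    """Multiply, round to double, then add and round again (two roundings)."""
    return a * b + c


def fma_emulated(a: float, b: float, c: float) -> float:
    """Compute a*b + c exactly as rationals, then round once to a double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


# Operands chosen so that the exact product is 1 - 2**-60, which is too
# close to 1.0 to survive rounding to a double on its own.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

print(mad(a, b, c))           # 0.0: the product was rounded to 1.0 first
print(fma_emulated(a, b, c))  # -2**-60: the small term is preserved
```

The separate version loses the result entirely because the intermediate product is rounded before the addition, while the single-rounding version keeps it. That is the accuracy benefit the text attributes to FMA.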

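The packaging figures quoted in this chapter for the IBM Blue Gene/P and the NVIDIA Fermi can be cross-checked with a few multiplications. The sketch below uses only per-level counts that appear in the text (4 CPUs per chip, one chip and 2 GB per card, 32 cards per board, 32 boards per cabinet, up to 72 cabinets; 16 SMs of 32 cores, six 64-bit memory partitions). The function names are hypothetical, and 1 TB is taken as 1024 GB, which is an assumption that matches the quoted totals.

```python
def blue_gene_p_totals(cabinets: int = 72) -> dict:
    """Totals for a Blue Gene/P installation with the given cabinet count."""
    cpus_per_chip = 4
    chips_per_card = 1
    gb_per_card = 2
    cards_per_board = 32
    boards_per_cabinet = 32

    cards = cards_per_board * boards_per_cabinet * cabinets
    return {
        "cards": cards,
        "cpus": cards * chips_per_card * cpus_per_chip,
        "memory_tb": cards * gb_per_card / 1024,  # assumes 1 TB = 1024 GB
    }


def fermi_totals() -> dict:
    """CUDA core count and memory interface width of the first Fermi GPU."""
    return {
        "cuda_cores": 16 * 32,            # 16 SMs x 32 cores per SM
        "memory_interface_bits": 6 * 64,  # six 64-bit memory partitions
    }


print(blue_gene_p_totals(1))  # one cabinet: 1024 cards, 4096 CPUs, 2.0 TB
print(blue_gene_p_totals())   # full system: 73728 cards, 294912 CPUs, 144.0 TB
print(fermi_totals())         # 512 CUDA cores, 384-bit memory interface
```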
Posted: 29/05/2021, 10:35