Lecture Notes: Advanced Computer Architecture

ACA2015
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY (Trường Đại học Bách khoa Hà Nội)
ADVANCED COMPUTER ARCHITECTURE
Nguyễn Kim Khánh
Department of Computer Engineering (DCE)
School of Information and Communication Technology (SoICT)
Version: ACA2015, Mar 2015

Contact information
- Address: 502-B1
- Mobile: 091-358-5533
- E-mail: khanhnk@soict.hust.edu.vn, khanh.nguyenkim@hust.edu.vn

Course objectives
- Students gain an overview of advanced computer architecture so that they can approach, exploit, and keep up with development trends in modern computer systems.
- After completing the course, students are able to:
  - Study the specific architectures of modern computers, parallel architectures in particular
  - Evaluate computer performance and improve program performance
  - Operate and administer modern computer systems effectively
  - Analyze and design modern computer systems

Course materials
- Lecture notes, Advanced Computer Architecture (ACA2015), available at: ftp://dce.hust.edu.vn/khanhnk/ACA/
- Reference books:
  [1] John L. Hennessy, David A. Patterson. Computer Architecture: A Quantitative Approach, 5th edition, 2012.
  [2] David A. Patterson, John L. Hennessy. Computer Organization and Design, 5th edition, 2014.
  [3] William Stallings. Computer Organization and Architecture, 9th edition, 2013.
  [4] Andrew S. Tanenbaum. Structured Computer Organization, 6th edition, 2013.

Course contents
1. Computers and their classification
2. Computer performance
3. Instruction set architecture
4. Instruction-level parallelism techniques
5. Memory hierarchy
6. Parallel architectures
7. Supercomputers and warehouse-scale computers
8. Research topics

Advanced Computer Architecture: Computers and their classification

Classes of computers in the PC era
- Personal computers
  - Desktop computers, laptop computers
  - General-purpose machines
- Servers
  - Used in networks to manage and provide services
  - High performance and high reliability
  - Cost: thousands to millions of USD
- Supercomputers
  - Used for high-end scientific and engineering computation
  - Cost: millions to hundreds of millions of USD
- Embedded computers
  - Hidden inside other devices
  - Designed for a dedicated purpose

Classes of computers in the post-PC era
- Personal mobile devices (PMD)
  - Smartphones, tablets
  - Connected to the Internet
- Cloud computing
  - Uses warehouse-scale computers (WSC) made up of many interconnected servers
  - Portions are rented to companies that provide software services
  - Software as a Service (SaaS): part of the software runs on the PMD, part runs in the cloud
  - Examples: Amazon, Google

The concept of computer architecture
- Computer architecture comprises:
  - Instruction set architecture (ISA): the computer as seen by the programmer
  - Computer organization, or microarchitecture: the high-level design of the computer (CPU design, memory system, bus structure, and so on)
  - Hardware: the detailed logic design and packaging technology of the computer
- One instruction set architecture can be implemented by many different products (organizations and hardware).

Layers of a computer system
- Users: application software
- Application programmers: application software
  - Written in a high-level language
- System programmers: system software
  - Compiler: translates high-level language code into machine language
  - Operating system:
    - Schedules tasks and shares resources
    - Manages memory and storage
    - Controls input/output
- Hardware
  - Processor, memory, I/O modules

Levels of program code
- High-level language (HLL)
  - Level of abstraction close to the problem being solved
  - Productive and portable
- Assembly language
  - Textual representation of instructions
- Machine language
  - Representation used by the hardware
  - Instructions and data are encoded in binary

Excerpt shown on the slide (section 1.3, "Below Your Program", p. 15):
The recognition that a program could be written to translate a more powerful language into computer instructions was one of the great breakthroughs in the early days of computing. Programmers today owe their productivity, and their sanity, to the creation of high-level programming languages and compilers that translate programs in such languages into instructions. Figure 1.4 shows the relationships among these programs and languages, which are more examples of the power of abstraction.

high-level programming language: A portable language such as C, C++, Java, or Visual Basic that is composed of words and algebraic notation that can be translated by a compiler into assembly language.

High-level language program (in C):

    swap(int v[], int k)
    {int temp;
       temp = v[k];
       v[k] = v[k+1];
       v[k+1] = temp;
    }

Compiler

Assembly language program (for MIPS):

    swap:
          multi $2, $5,4
          add   $2, $4,$2
          lw    $15, 0($2)
          lw    $16, 4($2)
          sw    $16, 0($2)
          sw    $15, 4($2)
          jr    $31

Assembler

Binary machine language program (for MIPS):

    00000000101000100000000100011000
    00000000100000100001000000100001
    10001101111000100000000000000000
    10001110000100100000000000000100
    10101110000100100000000000000000
    10101101111000100000000000000100
    00000011111000000000000000001000

FIGURE 1.4 C program compiled into assembly language and then assembled into binary machine language. Although the translation from high-level language to binary machine language is shown in two steps, some compilers cut out the middleman and produce binary machine language directly. These languages and this program are examined in more detail in Chapter 2.

Basic components of a computer
- The same for every kind of computer
- Central processing unit (CPU)
  - Controls the operation of the computer and processes data
- Main memory
  - Holds the programs being executed
- Input/output system
  - Exchanges information between the computer and the outside world
- System bus
  - Connects the components and transfers information between them
[Diagram: CPU, main memory, and I/O system connected by the system bus]

Central processing unit (CPU)
- Functions:
  - Controls the operation of the computer
  - Processes data
- Basic operating principle:
  - The CPU operates according to the program held in main memory.
- It is the fastest component in the system.

Basic components of the CPU
- Control unit (CU)
  - Controls the operation of the computer according to a predefined program
- Arithmetic and logic unit (ALU)
  - Performs arithmetic and logic operations
- Register file (RF)
  - Set of registers holding the information the CPU needs while it operates
[Diagram: control unit, ALU, and register file, connected to the system bus]

Computer memory
- Function: stores programs and data (in binary form)
- Basic memory operations:
  - Write
  - Read
- Main components:
  - Main memory
  - Cache memory
  - Storage devices
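The slides above say that the CPU works from a program held in main memory, with the control unit sequencing instructions, the ALU computing, and the register file holding operands. A minimal sketch of that loop follows. The four-instruction set (LOAD, ADD, STORE, HALT), the tuple encoding, and the function name `run` are hypothetical and chosen only for illustration; they do not correspond to MIPS or to any real machine.

```python
# Toy stored-program machine. The instruction set (LOAD/ADD/STORE/HALT) and
# its tuple encoding are hypothetical and exist only for this illustration.

def run(memory):
    """Execute the program stored in `memory` and return the final memory."""
    registers = [0] * 4      # register file: small, fast storage inside the CPU
    pc = 0                   # program counter: address of the next instruction
    while True:
        op, a, b = memory[pc]          # control unit: fetch and decode
        pc += 1
        if op == "LOAD":               # memory read: registers[a] <- memory[b]
            registers[a] = memory[b]
        elif op == "ADD":              # ALU operation: registers[a] += registers[b]
            registers[a] = registers[a] + registers[b]
        elif op == "STORE":            # memory write: memory[b] <- registers[a]
            memory[b] = registers[a]
        elif op == "HALT":
            return memory
        else:
            raise ValueError(f"unknown opcode {op!r}")

# Instructions and data share one memory, each cell reached by its address.
program = [
    ("LOAD", 0, 5),   # r0 <- mem[5]
    ("LOAD", 1, 6),   # r1 <- mem[6]
    ("ADD", 0, 1),    # r0 <- r0 + r1
    ("STORE", 0, 7),  # mem[7] <- r0
    ("HALT", 0, 0),
    20, 22, 0,        # data at addresses 5, 6 and 7
]
result = run(list(program))
print(result[7])  # prints 42
```

The point of the sketch is only that reads and writes are addressed by cell number and that the program counter, not the programmer, decides which memory cell is interpreted as the next instruction.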
[Diagram: CPU, cache memory, main memory, storage devices]

Main memory
- Present in every computer
- Holds the instructions and data of the programs being executed
- Built from semiconductor memory
- Organized as addressable memory cells (usually one address per byte)
- The contents of a cell can change, but its physical address is fixed
- To read or write a cell, the CPU must know its address

    Contents     Address
    0100 1101    00...0000
    0101 0101    00...0001
    1010 1111    00...0010
    0000 1110    00...0011
    0111 0100    00...0100
    1011 0010    00...0101
    0010 1000    00...0110
    1110 1111    00...0111
    ...
    0110 0010    11...1110
    0010 0001    11...1111

Cache memory
- Fast memory placed between the CPU and main memory to speed up the CPU's memory accesses
- Smaller capacity than main memory
- Built from fast semiconductor memory
- Usually divided into several levels (L1, L2, L3)
- Usually integrated on the processor chip
- A system may or may not have a cache

Storage devices
- Also called external memory
- Function and characteristics:
  - Store the software resources of the computer
  - Connected to the system as I/O devices
  - Large capacity
  - Slow
- Types of storage devices:
  - Magnetic: hard disk drives (HDD)
  - Semiconductor: solid-state drives (SSD), flash drives, memory cards
  - Optical: CD, DVD

Input/output system
- Function: exchanges information between the computer and the outside world
- Basic operations:
  - Input
  - Output
- Main components:
  - I/O devices
  - I/O modules
[Diagram: I/O devices attached to I/O modules, which connect to the system bus]

Data units in computing
- Byte = 8 bits
- Conventional data units:

    Decimal                          Binary
    Unit      Abbrev.  Value         Unit      Abbrev.  Value
    kilobyte  KB       10^3          kibibyte  KiB      2^10 = 1024
    megabyte  MB       10^6          mebibyte  MiB      2^20
    gigabyte  GB       10^9          gibibyte  GiB      2^30
    terabyte  TB       10^12         tebibyte  TiB      2^40
    petabyte  PB       10^15         pebibyte  PiB      2^50
    exabyte   EB       10^18         exbibyte  EiB      2^60

Vector computers
- SIMD architecture
- Basic idea:
  - Read sets of data elements into vector registers
  - Operate on those registers
  - Store the results back to memory
- Execution time depends on:
  - The length of the operand vectors
  - Structural hazards
  - Data dependences

Vector architecture
[Figure: vector architecture]

General-purpose graphics processors
- SIMD architecture
- Derived from graphics processors: the GPU (Graphics Processing Unit) supports 2D and 3D graphics, which is data-parallel processing
- GPGPU: General-Purpose Graphics Processing Unit
- Hybrid CPU/GPGPU systems:
  - The CPU is the host and runs the sequential code
  - The GPGPU performs the parallel computation

Graphics processors in a computer
[Figure: graphics processor in a computer system]

GPGPU: NVIDIA Tesla
- Streaming multiprocessor
- 8 × streaming processors

Excerpt shown on the slide (Hardware Execution):
CUDA's hierarchy of threads maps to a hierarchy of processors on the GPU; a GPU executes one or more kernel grids; a streaming multiprocessor (SM) executes one or more thread blocks; and CUDA cores and other execution units in the SM execute threads. The SM executes threads in groups of 32 threads called a warp. While programmers can generally ignore warp execution for functional correctness and think of programming one thread, they can greatly improve performance by having threads in a warp execute the same code path and access memory in nearby addresses.
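The thread hierarchy described above (grid, thread block, warp of 32 threads) can be made concrete with a little index arithmetic. The sketch below is plain Python, not CUDA, and assumes a one-dimensional launch in which threads are numbered consecutively. The function names are hypothetical, and `passes_for_warp` is a deliberately simplified model of what happens when threads of one warp take different branches.

```python
WARP_SIZE = 32  # threads per warp, as stated in the excerpt above

def locate_thread(global_id, threads_per_block):
    """Map a flat thread index to (block, warp within block, lane within warp)."""
    block, thread_in_block = divmod(global_id, threads_per_block)
    warp, lane = divmod(thread_in_block, WARP_SIZE)
    return block, warp, lane

def warps_per_block(threads_per_block):
    """Warps needed for one block; the last warp may be only partly filled."""
    return -(-threads_per_block // WARP_SIZE)  # ceiling division

def passes_for_warp(branch_taken):
    """Simplified divergence model: one pass per distinct code path in the warp.

    `branch_taken` holds one boolean per thread, the outcome of a single
    if/else. If all threads agree, the warp needs one pass; otherwise the
    two paths run one after the other.
    """
    return len(set(branch_taken))

print(locate_thread(100, 64))                      # (1, 1, 4)
print(warps_per_block(100))                        # 4
print(passes_for_warp([True] * 32))                # 1
print(passes_for_warp([True] * 16 + [False] * 16)) # 2
```

This is why the excerpt recommends keeping the threads of a warp on the same code path: in this model a divergent branch doubles the number of passes the warp has to make.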
GPGPU: NVIDIA Fermi

Excerpt shown on the slide:
The first Fermi based GPU, implemented with 3.0 billion transistors, features up to 512 CUDA cores. A CUDA core executes a floating point or integer instruction per clock for a thread. The 512 CUDA cores are organized in 16 SMs of 32 cores each. The GPU has six 64-bit memory partitions, for a 384-bit memory interface, supporting up to a total of 6 GB of GDDR5 DRAM memory. A host interface connects the GPU to the CPU via PCI-Express. The GigaThread global scheduler distributes thread blocks to SM thread schedulers.

[Figure caption: Fermi's 16 SMs are positioned around a common L2 cache. Each SM is a vertical rectangular strip that contains an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache).]

NVIDIA Fermi
- 16 Streaming Multiprocessors (SM)
- 32 CUDA cores per SM
- Each CUDA (Compute Unified Device Architecture) core has one FPU and one integer unit (IU)

[Figure: Fermi Streaming Multiprocessor (SM). Instruction cache; two warp schedulers, each with a dispatch unit; register file (32,768 × 32-bit); 32 CUDA cores; 16 LD/ST units; 4 SFUs; interconnect network; 64 KB shared memory / L1 cache; uniform cache. CUDA core detail: dispatch port, operand collector, FP unit, INT unit, result queue.]

Excerpt shown on the slide (Third Generation Streaming Multiprocessor):
The third generation SM introduces several architectural innovations that make it not only the most powerful SM yet built, but also the most programmable and efficient.

512 High Performance CUDA cores
Each SM features 32 CUDA processors, a fourfold increase over prior SM designs. Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU). Prior GPUs used IEEE 754-1985 floating point arithmetic. The Fermi architecture implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction for both single and double precision arithmetic. FMA improves over a multiply-add (MAD) instruction by doing the multiplication and addition with a single final rounding step, with no loss of precision in the addition. FMA is more accurate than performing the operations separately. GT200 implemented double precision FMA.

In GT200, the integer ALU was limited to 24-bit precision for multiply operations; as a result, multi-instruction emulation sequences were required for integer arithmetic. In Fermi, the newly designed integer ALU supports full 32-bit precision for all instructions, consistent with standard programming language requirements. The integer ALU is also optimized to efficiently support 64-bit and extended precision operations. Various instructions are supported, including Boolean, shift, move, compare, convert, bit-field extract, bit-reverse insert, and population count.

16 Load/Store Units
Each SM has 16 load/store units, allowing source and destination addresses to be calculated for sixteen threads per clock. Supporting units load and store the data at each address to cache or DRAM.

Shared-memory multiprocessors
- MIMD with shared memory
- Symmetric multiprocessor systems (SMP: Symmetric Multiprocessors)
- Non-symmetric multiprocessor systems (NUMA: Non-Uniform Memory Access)
- Multicore processors

SMP, or UMA (Uniform Memory Access)
[Figure: SMP organization]

SMP (continued)
- A single computer with n ≥ 2 identical processors
- The processors share the same memory and I/O system
- Memory access time is the same for every processor
- All processors can perform the same functions
- The system is controlled by a distributed operating system
- Performance: tasks can be executed in parallel
- Fault tolerance

NUMA (Non-Uniform Memory Access)
- There is a single address space shared by all CPUs
- Each CPU can remotely access the memory of another CPU
- Remote memory access is slower than local memory access

Excerpt shown on the slide:
... system is called CC-NUMA (at least by the hardware people). The software people often call it hardware DSM because it is basically the same as software distributed shared memory but implemented by the hardware using a small page size.

One of the first NC-NUMA machines (although the name had not yet been coined) was the Carnegie-Mellon Cm*, illustrated in simplified form in Fig. 8-32 (Swan et al., 1977). It consisted of a collection of LSI-11 CPUs, each with some memory addressed over a local bus. (The LSI-11 was a single-chip version of the DEC PDP-11, a minicomputer popular in the 1970s.) In addition, the LSI-11 systems were connected by a system bus. When a memory request came into the (specially modified) MMU, a check was made to see if the word needed was in the local memory. If so, a request was sent over the local bus to get the word. If not, the request was routed over the system bus to the system containing the word, which then responded. Of course, the latter took much longer than the former. While a program could run happily out of remote memory, it took 10 times longer to execute than the same program running out of local memory.

[Figure 8-32. A NUMA machine based on two levels of buses: each CPU has its own memory and MMU on a local bus, and the local buses are joined by a system bus. The Cm* was the first multiprocessor to use this design.]

Memory coherence is guaranteed in an NC-NUMA machine because no caching is present. Each word of memory lives in exactly one location, so there is no danger of one copy having stale data: there are no copies. Of course, it now matters a great deal which page is in which memory because the performance penalty for being in the wrong place is so high. Consequently, NC-NUMA machines use elaborate software to move pages around to maximize performance.

Typically, a daemon process called a page scanner runs every few seconds. Its job is to examine the usage statistics and move pages around in an attempt to improve performance. If a page appears to be in the wrong place, the page scanner unmaps it so that the next reference to it will cause a page fault. When the fault occurs, a decision is made about where to place the page, possibly in a different memory. To prevent thrashing, usually there is some rule saying that once a page is placed, it is frozen in place for a time ΔT. Various algorithms have been studied, but the conclusion is that no one algorithm performs best under all circumstances (LaRowe and Ellis, 1991). Best performance depends on the application.

Multicore processors
- Evolution of the processor:
  - Sequential
  - Pipelined
  - Superscalar
  - Multithreaded
  - Multicore: several CPUs on one chip

[Figure 18.1 Alternative Chip Organizations: (a) superscalar, (b) simultaneous multithreading, (c) multicore. Blocks shown: program counter(s), instruction fetch unit, issue logic, register file(s), execution units and queues, L1 instruction cache, L1 data cache, L2 cache.]

Excerpt shown on the slide:
For each of these innovations, designers have over the years attempted to increase the performance of the system by adding complexity. In the case of pipelining, simple three-stage pipelines were replaced by pipelines with five stages, and then many more stages, with some implementations having over a dozen stages. There is a practical limit to how far this trend can be taken, because with more stages, there is the need for more logic, more interconnections, and more control signals. With superscalar organization, increased performance can be achieved by increasing the number of parallel pipelines. Again, there are diminishing returns as the number of pipelines increases. More logic is required to manage hazards and to stage instruction resources. Eventually, a single thread of execution reaches the point where hazards and resource dependencies prevent the full use of the multiple ...

Multicore organization alternatives
[Figure 18.8 Multicore Organization Alternatives: (a) dedicated L1 cache, (b) dedicated L2 cache, (c) shared L2 cache, (d) shared L3 cache. Each alternative shows CPU cores 1 to n with L1-I and L1-D caches, connected through the cache hierarchy to main memory and I/O.]

Excerpt shown on the slide:
Interprocessor communication is easy to implement, via shared memory locations. The use of a shared L2 cache confines the cache coherency problem to the L1 cache level, which may provide some additional performance advantage.

A potential advantage to having only dedicated L2 caches on the chip is that each core enjoys more rapid access to its private L2 cache. This is advantageous for threads that exhibit strong locality.

As both the amount of memory available and the number of cores grow, the use of a shared L3 cache combined with either a shared L2 cache or dedicated per-core L2 caches seems likely to provide better performance than simply a massive shared L2 cache.

Another organizational design decision in a multicore system is whether the individual cores will be superscalar or implement simultaneous multithreading (SMT). For example, the Intel Core Duo uses superscalar cores, whereas the Intel Core i7 uses SMT cores. SMT has the effect of scaling up the number of hardware-level threads that the multicore system supports. Thus, a multicore system with four cores and SMT that supports four simultaneous threads in each core appears the same to the application level as a multicore system with 16 cores. As software is developed to more fully exploit parallel resources, an SMT approach appears to be more attractive than a superscalar approach.

18.4 INTEL x86 MULTICORE ORGANIZATION
Intel has introduced a number of multicore products in recent years. In this section, we look at two examples: the Intel Core Duo and the Intel Core i7-990X.

Intel Core Duo
The Intel Core Duo, introduced in 2006, implements two x86 superscalar processors with a shared L2 cache (Figure 18.8c). The general structure of the Intel Core Duo is shown in Figure 18.9. Let us consider the key elements starting from the top of the figure. As is common in multicore systems, each core has its own dedicated L1 cache. In this case, each core has a 32-kB instruction cache and a 32-kB data cache.

Each core has an independent thermal control unit. With the high transistor density of today's chips, thermal management is a fundamental capability, especially for laptop and mobile systems. The Core Duo thermal control unit is designed to manage chip heat dissipation to maximize processor performance within thermal constraints. Thermal management also improves ergonomics with a cooler system and lower fan acoustic noise. In essence, the thermal management unit monitors digital sensors for high-accuracy die temperature measurements. Each core can be defined as an independent thermal zone. The maximum temperature for each ...

Intel Core Duo
- Introduced in 2006
- Two x86 superscalar cores with a shared L2 cache
- Dedicated L1 cache per core
  - 32 KiB instruction and 32 KiB data
- 2 MiB shared L2 cache

[Figure 18.9 Intel Core Duo Block Diagram: per core, 32-kB L1 caches, execution resources, architectural state, thermal control, and APIC; shared power management logic, 2 MB L2 shared cache, bus interface, front-side bus.]

Intel Core i7-990X
[Figure 18.10 Intel Core i7-990X Block Diagram: six cores, each with a 32 kB L1-I cache, a 32 kB L1-D cache, and a 256 kB L2 cache; a 12 MB L3 cache shared by all cores; DDR3 memory controllers (3 × 8 B @ 1.33 GT/s); QuickPath Interconnect (20-bit links @ 6.4 GT/s).]

Excerpt shown on the slide:
The general structure of the Intel Core i7-990X is shown in Figure 18.10. Each core has its own dedicated L2 cache and the six cores share a 12-MB L3 cache. One mechanism Intel uses to make its caches more effective is prefetching, in which the hardware examines memory access patterns and attempts to fill the caches speculatively with data that's likely to be requested soon. It is interesting to compare the performance of this three-level on-chip cache organization with a comparable two-level organization from Intel. Table 18.1 shows the cache access latency, in terms of clock cycles, for two Intel multicore systems running at the same clock frequency. The Core 2 Quad has a shared L2 cache, similar to the Core Duo. The Core i7 improves on L2 cache performance with the use of the dedicated L2 caches, and provides a relatively high-speed access to the L3 cache.

The Core i7-990X chip supports two forms of external communications to other chips. The DDR3 memory controller brings the memory controller for the DDR main memory onto the chip.
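A back-of-the-envelope check of the DDR3 interface on this chip may help when reading the block diagram. The sketch assumes three memory channels, each 8 bytes wide and running at 1.33 GT/s (giga-transfers per second), which is how the diagram labels the memory controllers. The function name and its parameters are illustrative only, and the result is a theoretical peak, not a measured rate.

```python
def peak_bandwidth_gb_s(channels, bytes_per_transfer, giga_transfers_per_s):
    """Theoretical peak bandwidth in GB/s: channels x width x transfer rate."""
    return channels * bytes_per_transfer * giga_transfers_per_s

# Assumed values, read off the Core i7-990X block diagram.
CHANNELS = 3
BYTES_PER_TRANSFER = 8
TRANSFER_RATE_GT_S = 1.33

ddr3 = peak_bandwidth_gb_s(CHANNELS, BYTES_PER_TRANSFER, TRANSFER_RATE_GT_S)
bus_width_bits = CHANNELS * BYTES_PER_TRANSFER * 8

print(f"{ddr3:.2f} GB/s over {bus_width_bits} bits")  # 31.92 GB/s over 192 bits
```

Under these assumptions the three channels together form a 192-bit interface with a peak of roughly 32 GB/s.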
The DDR3 interface supports three channels that are 8 bytes wide for a total bus width of 192 bits, for an aggregate data rate of up to 32 GB/s. With the memory controller on the chip, the Front Side Bus is eliminated.

Table 18.1 Cache Latency (in clock cycles)

    CPU          Clock Frequency   L1 Cache   L2 Cache    L3 Cache
    Core 2 Quad  2.66 GHz          3 cycles   15 cycles   none
    Core i7      2.66 GHz          4 cycles   11 cycles   39 cycles

Distributed-memory multiprocessors

Interconnection networks
[Figure 8-37. Various topologies. The heavy dots represent switches. The CPUs and memories are not shown. (a) A star. (b) A complete interconnect. (c) A tree. (d) A ring. (e) A grid. (f) A double torus. (g) A cube. (h) A 4D hypercube.]

Excerpt shown on the slide:
Interconnection networks can be characterized by their dimensionality. For our purposes, the dimensionality is determined by the number of choices there are to get from the source to the destination. If there is never any choice (i.e., there is only one path from each source to each destination), the network is zero dimensional. If there is one dimension in which a choice can be made, for example, go ...

Advanced Computer Architecture: Supercomputers and warehouse-scale computers

Supercomputers
- Large-scale systems
- Expensive: many millions of USD
- Used for scientific computing and for problems with very large numbers of operations and very large data sets
- Reference website: www.top500.org

IBM Blue Gene/P

Excerpt shown on the slide:
... coherency between the L1 caches on the four CPUs. Thus when a shared piece of memory resides in more than one cache, accesses to that storage by one processor will be immediately visible to the other three processors. A memory reference that misses on the L1 cache but hits on the L2 cache takes about 11 clock cycles. A miss on L2 that hits on L3 takes about 28 cycles. Finally, a miss on L3 that has to go to the main DRAM takes about 75 cycles.

The four CPUs are connected via a high-bandwidth bus to a 3D torus network, which requires six connections: up, down, north, south, east, and west. In addition, each processor has a port to the collective network, used for broadcasting data to all processors. The barrier port is used to speed up synchronization operations, giving each processor fast access to a specialized synchronization network.

At the next level up, IBM designed a custom card that holds one of the chips shown in Fig. 8-38 along with 2 GB of DDR2 DRAM. The chip and the card are shown in Fig. 8-39(a)–(b) respectively.

    Level         Contents                                          CPUs      Memory
    (a) Chip      4 processors, 8-MB L3 cache                       4
    (b) Card      1 chip, 2-GB DDR2 DRAM                            4         2 GB
    (c) Board     32 cards, 32 chips                                128       64 GB
    (d) Cabinet   32 boards, 1024 cards, 1024 chips                 4096      2 TB
    (e) System    72 cabinets, 73,728 cards, 73,728 chips           294,912   144 TB

Figure 8-39. The BlueGene/P: (a) chip, (b) card, (c) board, (d) cabinet, (e) system.

The cards are mounted on plug-in boards, with 32 cards per board for a total of 32 chips (and thus 128 CPUs) per board. Since each card contains 2 GB of DRAM, the boards contain 64 GB apiece. One board is illustrated in Fig. 8-39(c). At the next level, 32 of these boards are plugged into a cabinet, packing 4096 CPUs into a single cabinet. A cabinet is illustrated in Fig. 8-39(d). Finally, a full system, consisting of up to 72 cabinets with 294,912 CPUs, is depicted in Fig. 8-39(e). A PowerPC 450 can issue multiple instructions per cycle, thus ...

Cluster
- Many computers connected by a high-speed interconnection network (on the order of Gbps)
- Each computer can work independently (a PC or an SMP)
- Each computer is called a node
- The computers can be managed so that they work in parallel as a group (a cluster)
- The whole system can be regarded as a single parallel computer
- High availability
- High fault tolerance

Google's PC cluster

Excerpt shown on the slide:
... hold exactly 80 PCs and switches can be larger or smaller than 128 ports; these are just typical values for a Google cluster.

[Figure 8-44. A typical Google cluster: OC-12 and OC-48 fiber links into two 128-port Gigabit Ethernet switches; each 80-PC rack is connected by two gigabit Ethernet links.]

Power density is also a key issue. A typical PC burns about 120 watts, or about 10 kW per rack. A rack needs about 3 m² so that maintenance personnel can install and remove PCs and for the air conditioning to function. These parameters give a power density of over 3000 watts/m². Most data centers are designed for 600–1200 watts/m², so special measures are required to cool the racks.

Google has learned three key things about running massive Web servers that bear repeating:
1. Components will fail, so plan for it.
2. Replicate everything for throughput and availability.
3. Optimize price/performance.

Warehouse-scale computers
- Warehouse-scale computers (WSC)
- Provide Internet services
  - Search, social networking, online maps, video sharing, online shopping, email, cloud computing, and so on
- Differences from HPC clusters:
  - Clusters have higher-performance processors and networks
  - Clusters focus on thread-level parallelism; WSCs focus on request-level parallelism
- Differences from datacenters:
  - Datacenters consolidate different machines and software into one location
  - Datacenters focus on virtual machines and heterogeneous hardware in order to serve different customers

Key factors in WSC design
- Cost-performance
- Energy efficiency
- Dependability through redundancy
- Network connectivity
- Large workloads, both interactive and batch
- Computational parallelism
  - Most jobs are independent
  - Request-level parallelism
- Operating costs must be counted
  - Power consumption is a primary concern and has to be treated as a design constraint
- Scale brings both opportunities and challenges

WSC architecture
- A WSC is usually connected as a hierarchical network
- Each rack holds many servers; the servers are connected to a high-speed rack switch
- Storage system:
  - HDDs inside the servers
  - Networked storage: SAN (Storage Area Network)
- Each server can access the DRAM and HDDs of other servers, in NUMA fashion

WSC infrastructure
- Location of the WSC
  - Proximity to Internet backbones, electricity price, maintenance cost, taxes, low risk of disasters (earthquakes, storms, floods, and so on), security
- Power supply: grid power, generators, transformers, UPS, and so on
- Cooling and dehumidification:
  - 18–20 °C
- Fire protection system

WSC costs
- Cost of building the WSC
- Cost of operating the WSC

Advanced Computer Architecture: Research topics

Some research topics:
1. Architecture of an Intel Core processor generation
2. Nvidia Kepler GPGPU architecture
3. Qualcomm Snapdragon architecture
4. iPhone architecture
5. Galaxy S6 architecture
6. Nokia Lumia 930 architecture

End
