COMPUTER ARCHITECTURE (Kiến trúc máy tính)
Hanoi University of Science and Technology

Lecture notes: Computer Architecture, CA-20172, NKK-HUST
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY (Trường Đại học Bách khoa Hà Nội)

COMPUTER ARCHITECTURE
Nguyễn Kim Khánh
Department of Computer Engineering (DCE), School of Information and Communication Technology (SoICT)
Version: CA-2017

Contact Information
- Address: 502-B1
- Mobile: 091-358-5533
- e-mail: khanhnk@soict.hust.edu.vn, khanh.nguyenkim@hust.edu.vn

Course objectives
- Students acquire foundational knowledge of instruction set architecture and computer organization, and of the basic principles of computer design.
- After completing the course, students are able to:
  - Study the instruction set architecture of specific processors
  - Program in assembly language
  - Evaluate computer performance and improve program performance
  - Operate and administer computer systems effectively
  - Analyze and design computers

Learning materials
- Lecture slides: ftp://dce.soict.hust.edu.vn/khanhnk/CA/
- Reference books:
  [1] William Stallings. Computer Organization and Architecture. 9th edition, 2013.
  [2] David A. Patterson, John L. Hennessy. Computer Organization and Design. Revised 4th edition, 2012.
  [3] David Money Harris, Sarah L. Harris. Digital Design and Computer Architecture. 2nd edition, 2013.
  [4] Andrew S. Tanenbaum. Structured Computer Organization. 6th edition, 2013.
- Assembly programming and simulation software for MIPS: MARS (MIPS Assembler and Runtime Simulator), download at http://courses.missouristate.edu/KenVollmar/MARS/

Course contents
Chapter 1. Introduction
Chapter 2. Basics of Digital Logic
Chapter 3. Computer Systems
Chapter 4. Computer Arithmetic
Chapter 5. Instruction Set Architecture
Chapter 6. The Processor
Chapter 7. Computer Memory
Chapter 8. Input-Output Systems
Chapter 9. Parallel Architectures
Chapter 1. INTRODUCTION

Contents of Chapter 1
1.1 Computers and their classification
1.2 The concept of computer architecture
1.3 The evolution of computer technology
1.4 Computer performance

1.1 Computers and their classification
- A computer is an electronic device that performs the following tasks:
  - Accepts input data,
  - Processes the data according to a sequence of instructions stored in its memory,
  - Outputs data (information).
- The sequence of instructions held in memory that directs the computer to carry out a specific task is called a program.
- A computer operates under the control of a program.

Simple model of a computer
[Diagram: input devices deliver input data to the central processing unit (CPU); the CPU processes data and exchanges it with main memory, which holds the program being executed; output devices deliver the output data.]

Classification of computers in the PC era
- Personal computers
  - Desktop computers, laptop computers
  - General-purpose computers
- Servers
  - Used in networks to manage and provide services
  - High performance and reliability
  - Cost from thousands to millions of USD
- Supercomputers
  - Used for high-end scientific and engineering computation
  - Cost from millions to hundreds of millions of USD
- Embedded computers
  - Hidden inside other devices
  - Designed for a dedicated purpose

Classification of computers in the post-PC era
- Personal mobile devices (PMD)
  - Smartphones, tablets
  - Connected to the Internet
- Cloud computing
  - Uses warehouse scale computers, made up of many interconnected servers
  - Portions are rented out to companies that provide software services
  - Software as a Service (SaaS): part of the software runs on the PMD, part runs in the cloud
  - Examples: Amazon, Google

1.2 The concept of computer architecture
Computer architecture comprises:
- Instruction set architecture (ISA): studies the computer as seen by the programmer.
- Computer organization, or microarchitecture: studies the high-level design of the computer (CPU design, memory system, bus structure, ...).
- Hardware: studies the detailed logic design and packaging technology of the computer.
One instruction set architecture can be realized by many different products (organizations, hardware).

Layers of a computer system
[Diagram: the user works with application software, the programmer with application and system software, the system programmer with system software and hardware.]
- Application software
  - Written in a high-level language
- System software
  - Compiler: translates high-level language code into machine language
  - Operating system: schedules tasks and shares resources, manages memory and storage, controls input-output
- Hardware
  - Processor, memory, I/O modules

Levels of program code
- High-level language (HLL)
  - Level of abstraction close to the problem being solved
  - Productive and portable
- Assembly language
  - Textual representation of instructions
- Machine language
  - Hardware representation
  - Instructions and data are encoded in binary

Excerpt (textbook section "Below Your Program"): The recognition that a program could be written to translate a more powerful language into computer instructions was one of the great breakthroughs in the early days of computing. Programmers today owe their productivity, and their sanity, to the creation of high-level programming languages and compilers that translate programs in such languages into instructions. Figure 1.4 shows the relationships among these programs and languages, which are more examples of the power of abstraction.

high-level programming language: a portable language such as C, C++, Java, or Visual Basic that is composed of words and algebraic notation that can be translated by a compiler into assembly language.

High-level language program (in C):

    void swap(int v[], int k)
    {
        int temp;
        temp = v[k];
        v[k] = v[k+1];
        v[k+1] = temp;
    }

Compiler

Assembly language program (for MIPS):

    swap:
        multi $2, $5,4
        add   $2, $4,$2
        lw    $15, 0($2)
        lw    $16, 4($2)
        sw    $16, 0($2)
        sw    $15, 4($2)
        jr    $31

Assembler

Binary machine language program (for MIPS):

    00000000101000100000000100011000
    00000000100000100001000000100001
    10001101111000100000000000000000
    10001110000100100000000000000100
    10101110000100100000000000000000
    10101101111000100000000000000100
    00000011111000000000000000001000

FIGURE 1.4 C program compiled into assembly language and then assembled into binary machine language. Although the translation from high-level language to binary machine language is shown in two steps, some compilers cut out the middleman and produce binary machine language directly. These languages and this program are examined in more detail in a later chapter.

Basic components of a computer
- The same for every kind of computer
- Central processing unit (CPU)
  - Controls the operation of the computer and processes data
- Main memory
  - Holds the programs being executed
- Input/output system
  - Exchanges information between the computer and the outside world
- System bus
  - Connects the components and carries information between them
[Diagram: CPU, main memory, and the I/O system attached to the system bus.]

1.3 The evolution of computer technology
- Vacuum-tube computers (1950s)
  - ENIAC: the first general-purpose electronic computer (1946)
  - IAS: the von Neumann computer (1952)
- Transistor computers (1960s)
- Computers built from SSI, MSI and LSI integrated circuits (1970s)
  - SSI: Small Scale Integration
  - MSI: Medium Scale Integration
  - LSI: Large Scale Integration
- Computers built from VLSI circuits (1980s)
  - VLSI: Very Large Scale Integration
- Computers built from ULSI circuits (1990s to the present)
  - ULSI: Ultra Large Scale Integration

The first computers: ENIAC and IAS
- ENIAC
  - Electronic Numerical Integrator and Computer
  - A project of the US Department of Defense
  - Designed by John Mauchly at the University of Pennsylvania
  - Weighed about 30 tons
  - Processed decimal numbers
- IAS
  - Built at the Institute for Advanced Study, Princeton
  - Designed by John von Neumann around the "stored program" idea
  - Processed binary numbers
  - Became the basic model for computers

Computers today
[Figure: massive clusters, clusters on Gigabit Ethernet, routers, sensor nets, robots, cars, refrigerators.]

Typical digital integrated circuits
- Microprocessors
  - One or several CPUs fabricated on a single chip
- Chipset
  - Circuits that implement the functions connecting the components of the computer to each other
- Semiconductor memory
  - ROM, RAM, Flash memory
- System on Chip (SoC), or microcontrollers
  - Integrate the main components of a computer on one chip
  - Used mainly in smartphones, tablets and embedded computers

Development of microprocessors
- 1971: 4-bit microprocessor Intel 4004
- 1972: 8-bit processors
- 1978: 16-bit processors
  - The IBM-PC personal computer appeared in 1981
- 1985: 32-bit processors
- 2001: 64-bit processors
- 2006: multicore processors
  - Several CPUs on one chip

1.4 Computer performance
- Definition of performance P: performance = 1 / (execution time), that is, $P = 1/t$
- "Computer A is k times faster than computer B": $P_A / P_B = t_B / t_A = k$
- Example: a program runs in 10 s on machine A and 15 s on machine B.
  - $t_B / t_A = 15\,\text{s} / 10\,\text{s} = 1.5$
  - So machine A is 1.5 times faster than machine B.

CPU clock rate
- In terms of timing, the CPU operates according to a clock with a fixed rate.
- Clock period $T_0$: the duration of one cycle
- Clock rate $f_0$ (clock frequency): the number of cycles per second
  - $f_0 = 1/T_0$
  - Example: a processor with $f_0 = 4\,\text{GHz} = 4 \times 10^9\,\text{Hz}$ has $T_0 = 1/(4 \times 10^9) = 0.25 \times 10^{-9}\,\text{s} = 0.25\,\text{ns}$

CPU execution time
- For simplicity, consider only the time the CPU spends executing the program (CPU time):
  CPU time = number of clock cycles × clock period $T_0$
  $t_{CPU} = n \times T_0 = n / f_0$, where $n$ is the number of clock cycles
- Performance is increased by:
  - Reducing the number of clock cycles $n$
  - Increasing the clock rate $f_0$

Example
- Two computers A and B run the same program.
- Computer A:
  - CPU clock rate: $f_A = 2\,\text{GHz}$
  - CPU time for the program: $t_A = 10\,\text{s}$
- Computer B:
  - CPU time for the program: $t_B = 6\,\text{s}$
  - The number of clock cycles on machine B ($n_B$) is 1.2 times the number of clock cycles on machine A ($n_A$)
- Determine the clock rate required for machine B ($f_B$).

Example (continued)
We have $t = n / f$.
Clock cycles on machine A: $n_A = t_A \times f_A = 10\,\text{s} \times 2\,\text{GHz} = 20 \times 10^9$
Clock cycles on machine B: $n_B = 1.2 \times n_A = 24 \times 10^9$
Clock rate required for machine B: $f_B = n_B / t_B = (24 \times 10^9) / 6 = 4 \times 10^9\,\text{Hz} = 4\,\text{GHz}$

Instruction count and cycles per instruction
- Number of clock cycles of a program:
  number of cycles = number of instructions in the program × number of cycles per instruction
  $n = IC \times CPI$
  - $n$: number of clock cycles
  - $IC$: number of instructions in the program (Instruction Count)
  - $CPI$: number of cycles per instruction (Cycles per Instruction)
- Hence the CPU time: $t_{CPU} = IC \times CPI \times T_0 = (IC \times CPI) / f_0$

Example
- Two computers A and B have the same instruction set architecture.
- Computer A:
  - Clock period: $T_A = 250\,\text{ps}$
  - Average cycles per instruction: $CPI_A = 2.0$
- Computer B:
  - Clock period: $T_B = 500\,\text{ps}$
  - Average cycles per instruction: $CPI_B = 1.2$
- Determine which machine is faster, and by how much.
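The CPU-time relations above lend themselves to a quick numerical check. The following Python sketch is only an illustration and not part of the original slides: the helper names `cycles` and `cpu_time` are made up for this example. It reworks the clock-rate example (machine A at 2 GHz, machine B needing 1.2 times as many cycles in 6 s) and evaluates the comparison of the two machines with clock periods of 250 ps and 500 ps.

```python
# Illustrative sketch; helper names are hypothetical, values come from the examples.

def cycles(instruction_count: float, cpi: float) -> float:
    """Number of clock cycles: n = IC * CPI."""
    return instruction_count * cpi


def cpu_time(n_cycles: float, cycle_time_s: float) -> float:
    """CPU time: t_CPU = n * T0 (equivalently n / f0)."""
    return n_cycles * cycle_time_s


# Clock-rate example: A runs the program in 10 s at 2 GHz.
f_a = 2e9                 # Hz
t_a = 10.0                # s
n_a = t_a * f_a           # cycles used on A
n_b = 1.2 * n_a           # B needs 1.2 times as many cycles
t_b = 6.0                 # s, required CPU time on B
f_b = n_b / t_b           # clock rate B must reach

# Two machines with the same ISA: IC is identical, so it is normalised to 1.
ic = 1.0
t_machine_a = cpu_time(cycles(ic, 2.0), 250e-12)   # CPI 2.0, T = 250 ps
t_machine_b = cpu_time(cycles(ic, 1.2), 500e-12)   # CPI 1.2, T = 500 ps
speedup_a_over_b = t_machine_b / t_machine_a

if __name__ == "__main__":
    print(f"required f_B = {f_b / 1e9:.2f} GHz")
    print(f"A is {speedup_a_over_b:.2f} times faster than B")
```

Normalising IC to 1 works because both machines execute the same instruction count, so IC cancels in the ratio.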
Example (continued)
We have $t_{CPU} = IC \times CPI_{avg} \times T_0$.
The two machines have the same instruction set architecture, so the number of instructions of the program is the same on both: $IC_A = IC_B = IC$.
Execution time of the program on machine A and on machine B:
$t_A = IC_A \times CPI_A \times T_A = IC \times 2.0 \times 250\,\text{ps} = IC \times 500\,\text{ps}$
$t_B = IC_B \times CPI_B \times T_B = IC \times 1.2 \times 500\,\text{ps} = IC \times 600\,\text{ps}$
From this: $t_B / t_A = (IC \times 600\,\text{ps}) / (IC \times 500\,\text{ps}) = 1.2$
Conclusion: machine A is 1.2 times faster than machine B.

Average CPI
- When different instructions have different CPI values, an average CPI must be computed.
- If the instruction classes take different numbers of cycles, the total number of cycles is:
  $n = \sum_{i=1}^{K} (CPI_i \times IC_i)$
- Average CPI:
  $CPI_{avg} = \frac{n}{IC} = \frac{1}{IC} \sum_{i=1}^{K} (CPI_i \times IC_i)$

Example
The table below describes two instruction sequences that use instructions of classes A, B and C. Compute the average CPI of each sequence.

    Instruction class      A    B    C
    CPI per class          1    2    3
    IC in sequence 1      20   10   20
    IC in sequence 2      40   10   10

- Sequence 1: number of instructions = 50
  - Number of cycles = 1×20 + 2×10 + 3×20 = 100
  - $CPI_{avg}$ = 100/50 = 2.0
- Sequence 2: number of instructions = 60
  - Number of cycles = 1×40 + 2×10 + 3×10 = 90
  - $CPI_{avg}$ = 90/60 = 1.5

Performance summary
$\text{CPU time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$
CPU time = number of instructions in the program × cycles per instruction × seconds per cycle
$t_{CPU} = IC \times CPI \times T_0 = (IC \times CPI) / f_0$
Performance depends on:
- The algorithm
- The programming language
- The compiler
- The instruction set architecture
- The hardware

MIPS as a performance measure
- MIPS: Millions of Instructions Per Second
$\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^6} = \frac{\text{Instruction count}}{\frac{\text{Instruction count} \times CPI}{\text{Clock rate}} \times 10^6} = \frac{\text{Clock rate}}{CPI \times 10^6}$
$\text{MIPS} = \frac{f_0}{CPI \times 10^6}$ and, equivalently, $CPI = \frac{f_0}{\text{MIPS} \times 10^6}$
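As a cross-check of the average-CPI and MIPS formulas, here is a small Python sketch. It is an illustration only: the function names are hypothetical, and the 2 GHz clock rate used for the MIPS figure is an assumed value chosen for the example.

```python
def average_cpi(classes):
    """Average CPI over instruction classes.

    classes: iterable of (cpi_i, ic_i) pairs, one pair per instruction class.
    Returns sum(cpi_i * ic_i) / sum(ic_i).
    """
    classes = list(classes)
    total_cycles = sum(cpi * ic for cpi, ic in classes)
    total_instructions = sum(ic for _, ic in classes)
    return total_cycles / total_instructions


def mips(clock_rate_hz, cpi):
    """MIPS = f0 / (CPI * 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


# Instruction classes A, B, C with CPI 1, 2, 3 (values from the worked example).
sequence_1 = [(1, 20), (2, 10), (3, 20)]
sequence_2 = [(1, 40), (2, 10), (3, 10)]

cpi_1 = average_cpi(sequence_1)
cpi_2 = average_cpi(sequence_2)

ASSUMED_CLOCK_HZ = 2e9   # assumption for illustration only

if __name__ == "__main__":
    print("average CPI, sequence 1:", cpi_1)
    print("average CPI, sequence 2:", cpi_2)
    print("MIPS for sequence 1 at the assumed clock:", mips(ASSUMED_CLOCK_HZ, cpi_1))
```

Sequence 2 contains more instructions than sequence 1 yet needs fewer cycles, which is why instruction count alone is not a reliable performance measure.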
Example
Compute the MIPS rating of a processor with clock rate = 2 GHz and CPI = 4.
- Clock period: $T_0 = 1/(2 \times 10^9) = 0.5\,\text{ns}$
- CPI = 4, so the time to execute one instruction = 4 × 0.5 ns = 2 ns
- Number of instructions executed in 1 s = $(10^9\,\text{ns}) / (2\,\text{ns}) = 5 \times 10^8$ instructions
- So the processor delivers 500 MIPS.

Example
Compute the CPI of a processor with clock rate = 1 GHz and a rating of 400 MIPS.
- Clock period: $T_0 = 1/10^9 = 1\,\text{ns}$
- Number of instructions executed in 1 s: 400 MIPS = $4 \times 10^8$ instructions
- Time to execute one instruction = $1/(4 \times 10^8)\,\text{s} = 2.5\,\text{ns}$
- So CPI = 2.5.

MFLOPS
- Used for large computing systems
- Millions of Floating Point Operations per Second
$\text{MFLOPS} = \frac{\text{Executed floating point operations}}{\text{Execution time} \times 10^6}$
- GFLOPS ($10^9$), TFLOPS ($10^{12}$), PFLOPS ($10^{15}$)

Great ideas in computer architecture
- Design for Moore's Law
- Use abstraction to simplify design
- Make the common case fast
- Performance via parallelism
- Performance via pipelining
- Performance via prediction
- Hierarchy of memories
- Dependability via redundancy

Interrupt-driven I/O
- General principle:
  - The CPU does not have to wait for the I/O module to become ready; it executes another program in the meantime.
  - When the I/O module is ready, it sends an interrupt signal to the CPU.
  - The CPU executes the corresponding I/O interrupt service routine to exchange the data.
  - The CPU then returns to the interrupted program and continues executing it.

Transfer of control to the interrupt service routine
[Diagram: the program being executed (instruction 1, instruction 2, ..., instruction i, instruction i+1, ...) is interrupted after instruction i; control passes to the interrupt service routine, whose last instruction is RETURN, and then comes back to instruction i+1.]
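The principle above can be illustrated with a toy simulation in Python. This is a sketch under simplifying assumptions, not a model of any real CPU: the function `run_with_interrupts`, the trace strings, and the idea that the device becomes ready after a given instruction are all invented for the example. The CPU keeps executing the main program and enters the interrupt service routine only when the I/O module signals that it is ready.

```python
def run_with_interrupts(program_length, ready_after):
    """Simulate a main program that is interrupted by an I/O module.

    program_length: number of instructions in the main program.
    ready_after: set of instruction indices after which the I/O module
                 raises an interrupt request (hypothetical timing).
    Returns the execution trace as a list of strings.
    """
    trace = []
    pc = 0
    while pc < program_length:
        trace.append(f"main: instruction {pc}")   # useful work, no busy-waiting
        interrupt_request = pc in ready_after     # I/O module signals readiness
        pc += 1
        if interrupt_request:
            saved_pc = pc                         # remember where to resume
            trace.append("isr: transfer data")    # interrupt service routine
            pc = saved_pc                         # return to the interrupted program
    return trace


trace = run_with_interrupts(4, {1})

if __name__ == "__main__":
    for line in trace:
        print(line)
```

With programmed I/O the CPU would instead spin in a loop testing the device status; here all four main-program instructions execute, and the transfer is slotted in right after the instruction during which the device became ready.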
Data input as seen from the I/O module
- The I/O module receives a read control signal from the CPU.
- The I/O module obtains the data from the I/O device while the CPU does other work.
- When the data is available, the I/O module sends an interrupt signal to the CPU.
- The CPU requests the data.
- The I/O module transfers the data to the CPU.

Data input as seen from the CPU
- Issue the read control signal.
- Do other work.
- At the end of each instruction cycle, check for an interrupt request.
- If interrupted:
  - Save the context (the contents of the relevant registers)
  - Execute the interrupt service routine to input the data
  - Restore the context of the program that was executing

Design issues
- How is the I/O module that issued the interrupt identified?
- What does the CPU do when several interrupt requests occur?

Methods of connecting interrupts
- Multiple interrupt request lines
- Software poll
- Hardware poll (daisy chain)
- Interrupt controller (PIC)

Multiple interrupt request lines
[Diagram: the CPU contains an interrupt request register with inputs INTR3, INTR2, INTR1, INTR0, each connected to one I/O module.]
- Each I/O module is connected to its own interrupt request line.
- The CPU must provide many interrupt request lines.
- The number of I/O modules is limited.
- The interrupt lines are assigned priority levels.

Software poll
[Diagram: the CPU has an interrupt flag and a single INTR line shared by all I/O modules.]
- The CPU runs software that asks each I/O module in turn.
- Slow.
- The order in which the modules are polled determines their priority.
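The software poll described above can be sketched as a simple scan in priority order. The Python snippet below is an illustration with made-up module names and a made-up "request pending" flag; in a real system the CPU would read a status register of each I/O module instead.

```python
def software_poll(modules):
    """Return the name of the first module with a pending interrupt request.

    modules: list of (name, request_pending) pairs, listed in polling order,
             which is also the priority order. Returns None if no module
             has a pending request.
    """
    for name, request_pending in modules:
        if request_pending:
            return name
    return None


# Hypothetical modules; 'keyboard' and 'printer' both request service.
modules = [("disk", False), ("keyboard", True), ("printer", True)]
winner = software_poll(modules)

if __name__ == "__main__":
    print("module served first:", winner)
```

Because the scan stops at the first pending request, a module placed early in the list always wins over a later one, and the scan time grows with the number of modules, which is the reason the method is slow.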
Hardware poll (daisy chain)
[Diagram: the CPU has an interrupt flag and an INTR input shared by all I/O modules; the INTA output is chained from one I/O module to the next; all modules are attached to the data bus.]
- The CPU sends the interrupt acknowledge signal (INTA) to the first I/O module.
- If that module did not raise the interrupt, it passes the signal on to the next module, and so on until the interrupting module is identified.
- The order in which the I/O modules are connected in the chain determines their priority.

Programmable interrupt controller
[Diagram: the PIC sits between the CPU (INTR, INTA, data bus) and the I/O modules, which are connected to its inputs INTR0, INTR1, ..., INTRn-1.]
- PIC: Programmable Interrupt Controller
- The PIC has several interrupt request inputs with assigned priority levels.
- The PIC selects the interrupt request that is not masked and has the highest priority, and forwards it to the CPU.

Characteristics of interrupt-driven I/O
- Combines hardware and software:
  - Hardware: interrupts the CPU
  - Software: exchanges data between the CPU and the I/O module
- The CPU directly controls the input-output.
- The CPU does not have to wait for the I/O module, so the CPU is used more efficiently.

DMA (Direct Memory Access)
- Programmed I/O and interrupt-driven I/O are both controlled directly by the CPU:
  - They consume CPU time
- DMA is used to overcome this:
  - A dedicated I/O control module, called the DMAC (DMA Controller), controls the data exchange between the I/O module and main memory.

Structure of the DMAC
[Diagram: data counter, data register, address register and control logic; data lines and address lines; control signals: read, write, DMA request, DMA acknowledge, interrupt, bus request, bus grant.]

Components of the DMAC
- Data register: holds the data being exchanged
- Address register: holds the address of the memory location of the data
- Data counter: holds the number of data words still to be exchanged
- Control logic: controls the operation of the DMAC

DMA operation
- The CPU "tells" the DMAC:
  - Whether the operation is input or output
  - The address of the I/O device (the corresponding I/O port)
  - The start address of the memory array holding the data, which is loaded into the address register
  - The number of data words to transfer, which is loaded into the data counter
- The CPU does other work.
- The DMAC controls the data exchange.
- After each data word is transferred:
  - The content of the address register is incremented
  - The content of the data counter is decremented
- When the data counter reaches 0, the DMAC sends an interrupt signal to the CPU to report the end of the DMA transfer.

Types of DMA
- Block-transfer DMA: the DMAC holds the bus until the whole block of data has been transferred.
- Cycle-stealing DMA: the DMAC forces the CPU to suspend for one bus cycle at a time; the DMAC takes over the bus and transfers one data word.
- Transparent DMA: the DMAC detects the cycles in which the CPU is not using the bus and takes over the bus to transfer one data word.

DMA configuration (1)
[Diagram: CPU, DMAC, I/O modules and memory all attached to the system bus.]
- Each data transfer uses the bus twice:
  - Between the I/O module and the DMAC
  - Between the DMAC and memory

DMA configuration (2)
[Diagram: CPU, DMACs and memory attached to the system bus; the I/O modules are attached directly to a DMAC.]
- A DMAC controls one or a few I/O modules.
- Each data transfer uses the bus once:
  - Between the DMAC and memory

DMA configuration (3)
[Diagram: CPU, DMAC and memory attached to the system bus; the I/O modules are attached to a separate I/O bus behind the DMAC.]
- A separate I/O bus supports all DMA-capable devices.
- Each data transfer uses the system bus once:
  - Between the DMAC and memory

Characteristics of DMA
- The CPU does not take part in the data transfer.
- The DMAC controls the data exchange between main memory and the I/O module entirely in hardware, so it is fast.
- Suited to transferring large arrays of data.

I/O processors
- Input-output is controlled by a dedicated I/O processor.
- The I/O processor runs its own program.
- The I/O processor's program may reside in main memory or in a separate memory.

8.3 Interfacing I/O devices
Types of I/O interfacing
- Parallel interfacing
- Serial interfacing

Parallel interfacing
[Diagram: system bus side, parallel I/O module, I/O device side.]
- Transfers several bits in parallel
- Fast
- Needs many data lines

Serial interfacing
[Diagram: system bus side, serial I/O module, I/O device side.]
- Transfers one bit at a time
- Requires conversion from parallel data to serial data and/or back
- Slower
- Needs few data lines

Interfacing configurations
- Point to point
  - One I/O port connects to exactly one device
- Point to multipoint
  - One I/O port allows many devices to be connected
  - Examples:
    - USB (Universal Serial Bus): 127 devices
    - IEEE 1394 (FireWire): 63 devices
    - Thunderbolt

Thunderbolt
[Figure 7.17 Example Computer Configuration with Thunderbolt: a computer with processor, memory, graphics subsystem and platform controller hub (PCH); a Thunderbolt controller (TC) fed by DisplayPort and PCIe x4; Thunderbolt connector, 20 Gbps (max); external devices with their own TCs connected in a daisy chain.]

Excerpt (Thunderbolt protocol architecture): Figure 7.18 illustrates the Thunderbolt protocol architecture. The cable and connector layer provides transmission medium access. This layer specifies the physical and electrical attributes of the connector port. The Thunderbolt protocol physical layer is responsible for link maintenance, including hot-plug detection and data encoding to provide highly efficient data transfer. The physical layer has been designed to introduce very minimal overhead and provides full-duplex 10 Gbps of usable capacity to the upper layers. The common transport layer is the key to the operation of Thunderbolt and what makes it attractive as a high-speed peripheral I/O technology.

End of chapter
Chapter 9. PARALLEL ARCHITECTURES

Contents of Chapter 9
9.1 Classification of computer architectures
9.2 Shared-memory multiprocessors
9.3 Distributed-memory multiprocessors
9.4 General-purpose graphics processors

9.1 Classification of computer architectures
Classification of computer architectures (Michael Flynn, 1966)
- SISD: Single Instruction Stream, Single Data Stream
- SIMD: Single Instruction Stream, Multiple Data Stream
- MISD: Multiple Instruction Stream, Single Data Stream
- MIMD: Multiple Instruction Stream, Multiple Data Stream

SISD
[Diagram: CU sends the instruction stream (IS) to PU; PU exchanges the data stream (DS) with MU.]
- CU: Control Unit
- PU: Processing Unit
- MU: Memory Unit
- One processor
- A single instruction stream
- Data is stored in a single memory
- This is the (sequential) von Neumann architecture

SIMD
[Diagram: one CU sends the instruction stream (IS) to PU1, PU2, ..., PUn; each PUi exchanges a data stream (DS) with its local memory LMi.]
- A single instruction stream controls several processing units (PUs) simultaneously.
- Each processing unit has its own local data memory (LM).
- Each instruction is executed on a different set of data.
- SIMD models:
  - Vector computer
  - Array processor

MISD
- One data stream is passed to a set of processors.
- Each processor executes a different instruction sequence.
- No such computer exists in practice yet.
- It may exist in the future.

MIMD
- A set of processors
- The processors simultaneously execute different instruction sequences on different data.
- MIMD models:
  - Multiprocessors (shared memory)
  - Multicomputers (distributed memory)

MIMD - Shared Memory
Shared-memory multiprocessors
[Diagram: CU1, CU2, ..., CUn each send an instruction stream (IS) to PU1, PU2, ..., PUn; every PU exchanges a data stream (DS) with the shared memory.]
- Distributed Memory Phân loại kỹ thuật song song Đa xử lý nhớ phân tán (distributed memory mutiprocessors or multicomputers) n Song song mức lệnh n n CU1 CU2 IS IS PU1 PU2 CUn DS DS IS PUn LM1 LM2 DS n n Mạng liên kết hiệu cao n n Kiến trúc máy tính Nguyễn Kim Khánh DCE-HUST 491 2017 SIMD Song song mức luồng n LMn pipeline superscalar Song song mức liệu MIMD Song song mức yêu cầu n 2017 IS CUn 2017 IS Cloud computing Kiến trúc máy tính 492 123 598 PARALLEL COMPUTER ARCHITECTURES CHAP Memory consistency is not a done deal Researchers are still proposing new models (Naeem et al., 2011, Sorin et al., 2011, and Tu et al., 2010) Bài giảng Kiến trúc máy tính CA-20172 8.3.3 UMA Symmetric Multiprocessor Architectures The simplest multiprocessors are based on a single bus, as illustrated in Fig 8-26(a) Two or more CPUs and one or more memory modules all use the same bus for communication When a CPU wants to read a memory word, it first checks to see whether the bus is busy If the bus is idle, the CPU puts the address of the word it wants on the bus, asserts a few control signals, and waits until the memory puts the desired word on the bus NKK-HUST NKK-HUST 9.2 Đa xử lý nhớ dùng chung n n n SMP hay UMA (Uniform Memory Access) Hệ thống đa xử lý đối xứng (SMPSymmetric Multiprocessors) Hệ thống đa xử lý không đối xứng (NUMA – Non-Uniform Memory Access) Bộ xử lý đa lõi (Multicore Processors) CPU CPU M Shared memory Private memory Shared memory CPU CPU M CPU CPU Cache Bus (a) SEC 8.3 (b) (c) SHARED-MEMORY MULTIPROCESSORS Figure 8-26 Three bus-based multiprocessors (a) Without caching (b) With caching (c) With caching and private memories 2017 Kiến trúc máy tính 493 NKK-HUST SMP (tiếp) n n n n n n n 2017 M 607 system is called CC-NUMA (at least by the hardware people) The software people often because basically the memory, same as software If thecall busitishardware busy whenDSM a CPU wantsittois read or write the CPUdisjust tributed shared but implemented thethe hardware small 
page size waits until the memory bus becomes idle Hereinby lies problemusing with athis design With of the firstcontention NC-NUMA (although the namewith had32 notor yet twoOne or three CPUs, formachines the bus will be manageable; 64 itbeen will coined) was the The Carnegie-Mellon illustrated in the simplified formofintheFig be unbearable system will beCm*, totally limited by bandwidth bus,8-32 and (Swan et al., 1977) It consisted of a collection of LSI-11 CPUs, each with some most of the CPUs will be idle most of the time 2017 memory addressed over a local bus Kiến(The trúc máy tính 494 LSI-11 was a single-chip version of the The solution is to add a cache to each CPU, as depicted in Fig 8-26(b) The DEC a minicomputer popular In on addition, the LSI-11 sys-or cachePDP-11, can be inside the CPU chip, next in to the the 1970s.) CPU chip, the processor board, tems connected a system bus.many When a memory cameout intoofthe some were combination of by all three Since reads can nowrequest be satisfied the (specially modified) MMU, a check to see thesystem word needed was inmore the local cache, there will be much lesswas busmade traffic, andifthe can support local If so, a isrequest the localasbus get the If not, CPUs.memory Thus caching a big was win sent here.over However, wetoshall see word in a moment, the request routed over the bus toisthe keeping the was caches consistent withsystem one another not system trivial containing the word, NKK-HUST which responded Of iscourse, the latter muchin longer than CPU the former Yetthen another possibility the design of Fig.took 8-26(c), which each has not While program happily outmemory of remote memory, it took over 10 times longer only aacache but could also arun local, private which it accesses a dedicated to executebus thanTo theuse same running out of local (private) thisprogram configuration optimally, the memory compiler should place all the program text, strings, constants and other read-only data, stacks, and local 
variables in the private memories. The shared memory is then used only for writable shared variables. In most cases, this careful placement will greatly reduce bus traffic, but it does require active cooperation from the compiler.

- A computer with n identical processors (n >= 2)
- The processors share the same memory and I/O system
- Memory access time is the same for every processor
- The processors can perform the same functions
- The system is controlled by a distributed operating system
- Performance: tasks can be executed in parallel
- Fault tolerance

NUMA (Non-Uniform Memory Access)
- There is a single address space shared by all CPUs
- Each CPU can remotely access the memory of another CPU
- Remote memory access is slower than local memory access

Figure 8-32 A NUMA machine based on two levels of buses. The Cm* was the first multiprocessor to use this design. (Labels: CPU, memory, MMU, local bus, system bus.)

Memory coherence is guaranteed in an NC-NUMA machine because no caching is present. Each word of memory lives in exactly one location, so there is no danger of one copy having stale data: there are no copies. Of course, it now matters a great deal which page is in which memory because the performance penalty for being in the wrong place is so high. Consequently, NC-NUMA machines use elaborate software to move pages around to maximize performance. Typically, a daemon process called a page scanner runs every few seconds. Its job is to examine the usage statistics and move pages around in an attempt to improve performance. If a page appears to be in the wrong place, the page scanner unmaps it so that the next reference to it will cause a page fault. When the fault occurs, a decision is made about where to place the page, possibly in a different memory. To prevent thrashing, usually there is some rule saying that once a page is placed, it is frozen in place for a time ΔT. Various algorithms have been studied, but the conclusion is that no one algorithm performs best under all circumstances
(LaRowe and Ellis, 1991). Best performance depends on the application.

Multicore processors (multicores)
Evolution of processors:
- Sequential
- Pipelined
- Superscalar
- Multithreaded
- Multicore: multiple CPUs on one chip

Figure 18.1 Alternative Chip Organizations: (a) Superscalar (b) Simultaneous multithreading (c) Multicore

For each of these innovations, designers have over the years attempted to increase the performance of the system by adding complexity. In the case of pipelining, simple three-stage pipelines were replaced by pipelines with five stages, and then many more stages, with some implementations having over a dozen stages. There is a practical limit to how far this trend can be taken, because with more stages, there is the need for more logic, more interconnections, and more control signals. With superscalar organization, increased performance can be achieved by increasing the number of parallel pipelines. Again, there are diminishing returns as the number of pipelines increases. More logic is required to manage hazards and to stage instruction resources. Eventually, a single thread of execution reaches the point where hazards and resource dependencies prevent the full use of the multiple pipelines.

Multicore organization alternatives

Figure 18.8 Multicore Organization Alternatives: (a) Dedicated L1 cache (b) Dedicated L2 cache (c) Shared L2 cache (d) Shared L3 cache

With a shared L2 cache, interprocessor communication is easy to implement, via shared memory locations. The use of a shared L2 cache confines the cache coherency problem to the L1 cache level, which may provide some additional performance advantage.

A potential advantage to having only dedicated L2 caches on the chip is that each core enjoys more rapid access to its private L2 cache. This is advantageous for threads that exhibit strong locality. As both the amount of memory available and the number of cores grow, the use of a shared L3 cache combined with either a shared L2 cache or dedicated per-core L2 caches seems likely to provide better performance than simply a massive shared L2 cache.

Another organizational design decision in a multicore system is whether the individual cores will be superscalar or will implement simultaneous multithreading (SMT). For example, the Intel Core Duo uses superscalar cores, whereas the Intel Core i7 uses SMT cores. SMT has the effect of scaling up the number of hardware-level threads that the multicore system supports. Thus, a multicore system with four cores and SMT that supports four simultaneous threads in each core appears the same to the application level as a multicore system with 16 cores. As software is developed to more fully exploit parallel resources, an SMT approach appears to be more attractive than a superscalar approach.

18.4 INTEL x86 MULTICORE ORGANIZATION

Intel has introduced a number of multicore products in recent years. In this section, we look at two examples: the Intel Core Duo and the Intel Core i7-990X.

Intel Core Duo
- Introduced in 2006
- Two x86 superscalar cores with a shared L2 cache
- Dedicated L1 cache per core: 32 KiB instruction and 32 KiB data
- 2 MiB shared L2 cache

Figure 18.9 Intel Core Duo Block Diagram (per core: architectural state, execution resources, 32-kB L1 caches, thermal control, APIC; shared: power management logic, 2-MB L2 shared cache, bus interface to the front-side bus)

The Intel Core Duo, introduced in 2006, implements two x86 superscalar processors with a shared L2 cache (Figure 18.8c). The general structure of the Intel Core Duo is shown in Figure 18.9. Let us consider the key elements starting from the top of the figure. As is common in multicore systems, each core has its own dedicated L1 cache. In this case, each core has a 32-kB instruction cache and a 32-kB data cache.

Each core has an independent thermal control unit. With the high transistor density of today's chips, thermal management is a fundamental capability, especially for laptop and mobile systems. The Core Duo thermal control unit is designed to manage chip heat dissipation to maximize processor performance within thermal constraints. Thermal management also improves ergonomics with a cooler system and lower fan acoustic noise. In essence, the thermal management unit monitors digital sensors for high-accuracy die temperature measurements. Each core can be defined as an independent thermal zone.

Intel Core i7-990X

Figure 18.10 Intel Core i7-990X Block Diagram (six cores, each with a 32-kB L1-I cache, a 32-kB L1-D cache, and a 256-kB L2 cache; a shared 12-MB L3 cache; DDR3 memory controllers, 8B @ 1.33 GT/s; QuickPath Interconnect, 20B @ 6.4 GT/s)

The general structure of the Intel Core i7-990X is shown in Figure 18.10. Each core has its own dedicated L2 cache and the six cores share a 12-MB L3 cache.
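The cache organizations above can be tallied with a few lines of Python. This is only an illustrative sketch: the class `MulticoreChip` and its field names are hypothetical, the sizes are the ones quoted for Figures 18.9 and 18.10 (with six cores as drawn in Figure 18.10), and the four-threads-per-core chip is the text's own hypothetical SMT example, not a real product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MulticoreChip:
    """Toy description of a chip's on-chip caches (sizes in KiB). Hypothetical model."""
    name: str
    cores: int
    l1i_kib: int                # per-core L1 instruction cache
    l1d_kib: int                # per-core L1 data cache
    l2_kib: int                 # size of ONE L2 cache
    l2_shared: bool             # True: one L2 for the whole chip; False: one L2 per core
    l3_kib: int = 0             # shared L3 cache, 0 if absent
    threads_per_core: int = 1   # 1 = superscalar core, >1 = SMT core

    def total_cache_kib(self) -> int:
        """Sum of all L1, L2 and L3 capacity on the chip."""
        l1 = self.cores * (self.l1i_kib + self.l1d_kib)
        l2 = self.l2_kib if self.l2_shared else self.cores * self.l2_kib
        return l1 + l2 + self.l3_kib

    def hardware_threads(self) -> int:
        """Number of threads the chip presents to the application level."""
        return self.cores * self.threads_per_core


# Sizes taken from the slide text; the SMT thread count of the i7 is not
# stated there, so only its cache total is reported below.
core_duo = MulticoreChip("Core Duo", 2, 32, 32, 2 * 1024, l2_shared=True)
core_i7_990x = MulticoreChip("Core i7-990X", 6, 32, 32, 256, l2_shared=False,
                             l3_kib=12 * 1024)
# The text's illustrative case: four cores, four simultaneous threads each.
smt_example = MulticoreChip("4-core SMT example", 4, 32, 32, 256,
                            l2_shared=False, threads_per_core=4)

print(core_duo.name, core_duo.total_cache_kib(), "KiB of on-chip cache")
print(core_i7_990x.name, core_i7_990x.total_cache_kib(), "KiB of on-chip cache")
print(smt_example.name, smt_example.hardware_threads(), "hardware threads")
```

The `l2_shared` flag is the only thing that distinguishes the shared-L2 organization from the dedicated-L2 one in this model; it says nothing about access latency or coherence traffic, which are the real trade-offs discussed in the text.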
One mechanism Intel uses to make its caches more effective is prefetching, in which the hardware examines memory access patterns and attempts to fill the caches speculatively with data that's likely to be requested soon. It is interesting to compare the performance of this three-level on-chip cache organization with a comparable two-level organization from Intel. Table 18.1 shows the cache access latency, in terms of clock cycles, for two Intel multicore systems running at the same clock frequency. The Core 2 Quad has a shared L2 cache, similar to the Core Duo. The Core i7 improves on L2 cache performance with the use of the dedicated L2 caches, and provides a relatively high-speed access to the L3 cache.

The Core i7-990X chip supports two forms of external communication, shown in Figure 18.10: the DDR3 memory controllers and the QuickPath Interconnect.

9.3 Distributed-memory multiprocessors

- Large-scale computers (Warehouse Scale Computers or Massively Parallel Processors, MPP)
- Cluster computers (clusters)

Programs on multicomputer CPUs interact using primitives like send and receive to explicitly pass messages because they cannot get at each other's memory with LOAD and STORE instructions. This difference completely changes the programming model.

Each node in a multicomputer consists of one or a few CPUs, some RAM (conceivably shared among the CPUs at that node only), a disk and/or other I/O devices, and a communication processor. The communication processors are connected by a high-speed interconnection network of the types we discussed in Sec. 8.3.3. Many different topologies, switching schemes, and routing algorithms are used. What all multicomputers have in common is that when an application program executes the send primitive, the communication processor is notified and transmits a block of user data to the destination machine (possibly after first asking for and getting permission). A generic multicomputer is shown in Fig. 8-36.

Figure 8-36 A generic multicomputer. (Each node: CPUs, memory, local interconnect, disk and I/O, communication processor; the nodes are joined by a high-performance interconnection network.)

Interconnection networks

8.4.1 Interconnection Networks

In Fig. 8-36 we see that multicomputers are held together by interconnection networks. Now it is time to look more closely at these interconnection networks. Interestingly enough, multiprocessors and multicomputers are surprisingly similar in this respect because multiprocessors often have multiple memory modules that must also be interconnected with one another and with the CPUs. Thus the material in this section frequently applies to both kinds of systems.

The fundamental reason why multiprocessor and multicomputer interconnection networks are similar is that at the very bottom both of them use message passing.

Figure 8-37 Various topologies. The heavy dots represent switches. The CPUs and memories are not shown. (a) A star (b) A complete interconnect (c) A tree (d) A ring (e) A grid (f) A double torus (g) A cube (h) A 4D hypercube

Interconnection networks can be characterized by their dimensionality. For our purposes, the dimensionality is determined by the number of choices there are to get from the source to the destination. If there is never any choice (i.e., there is only one path from each source to each destination), the network is zero dimensional. If there is one dimension in which a choice can be made, the network is one dimensional.

Massively Parallel Processors
- Large-scale systems
- Expensive: many millions of USD
- Used for scientific computing and for problems with very large numbers of operations and very large data sets
- Supercomputers

IBM Blue Gene/P

Coherency is maintained between the L1 caches on the four CPUs. Thus when a shared piece of memory resides in more than one cache, accesses to that storage by one processor will be immediately visible to the other three processors. A memory reference that misses on the L1 cache but hits on the L2 cache takes about 11 clock cycles. A miss on L2 that hits on L3 takes about 28 cycles. Finally, a miss on L3 that has to go to the main DRAM takes about 75 cycles.

The four CPUs are connected via a high-bandwidth bus to a 3D torus network, which requires six connections: up, down, north, south, east, and west. In addition, each processor has a port to the collective network, used for broadcasting data to all processors. The barrier port is used to speed up synchronization operations, giving each processor fast access to a specialized synchronization network.

At the next level up, IBM designed a custom card that holds one of the chips shown in Fig. 8-38 along with 2 GB of DDR2 DRAM. The chip and the card are shown in Fig. 8-39(a)-(b) respectively.

The cards are mounted on plug-in boards, with 32 cards per board for a total of 32 chips (and thus 128 CPUs) per board. Since each card contains 2 GB of DRAM, the boards contain 64 GB apiece. One board is illustrated in Fig. 8-39(c). At the next level, 32 of these boards are plugged into a cabinet, packing 4096 CPUs into a single cabinet. A cabinet is illustrated in Fig. 8-39(d). Finally, a full system, consisting of up to 72 cabinets with 294,912 CPUs, is depicted in Fig. 8-39(e).

Figure 8-39 The BlueGene/P: (a) chip (b) card (c) board (d) cabinet (e) system. The chip holds 4 processors and an 8-MB L3 cache.

Level     Built from     Chips    CPUs      DRAM
Card      1 chip         1        4         2 GB
Board     32 cards       32       128       64 GB
Cabinet   32 boards      1024     4096      2 TB
System    72 cabinets    73728    294912    144 TB

Cluster
- Many computers connected to one another by a high-speed interconnection network (~ Gbps)
- Each computer can work independently (a PC or an SMP)
- Each computer is called a node
- The computers can be managed so that they work in parallel as a group (cluster)
- The whole system can be regarded as a single parallel computer
- High availability
- High fault tolerance

Google's PC cluster

Racks need not hold exactly 80 PCs, and switches can be larger or smaller than 128 ports; these are just typical values for a Google cluster.

Figure 8-44 A typical Google cluster. (OC-12 and OC-48 fiber links feed two 128-port Gigabit Ethernet switches; each 80-PC rack is connected by two gigabit Ethernet links.)

Power density is also a key issue. A typical PC burns about 120 watts or about 10 kW per rack. A rack needs about 3 m2 so that maintenance personnel can install and remove PCs and for the air conditioning to function. These parameters give a power density of over 3000 watts/m2. Most data centers are designed for 600-1200 watts/m2, so special measures are required to cool the racks.

Google has learned three key things about running massive Web servers that bear repeating.
1. Components will fail so plan for it.
2. Replicate everything for throughput and availability.
3. Optimize price/performance.

9.4 General-purpose graphics processors

Graphics processors in computers
- SIMD architecture
- Derived from the graphics processor, the GPU (Graphic Processing Unit), which supports 2D and 3D graphics processing: data-parallel processing
- GPGPU: General purpose Graphic Processing Unit
- Hybrid CPU/GPGPU system:
  - The CPU is the host: it executes sequentially
  - The GPGPU performs the parallel computation

Threads are executed in groups of 32 threads called a warp. While programmers can generally ignore warp execution for functional correctness and think of programming one thread, they can greatly improve performance by having threads in a warp execute the same code path and access memory in nearby addresses.

An Overview of the Fermi Architecture

The first Fermi based GPU, implemented with 3.0 billion transistors, features up to 512 CUDA cores. A CUDA core executes a floating point or integer instruction per clock for a thread.
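The warp grouping described above can be sketched in plain Python. This is a toy model under stated assumptions, not how the hardware scheduler is implemented: the helper names `split_into_warps` and `count_divergent_warps` are hypothetical, and a "code path" is reduced to a single boolean per thread.

```python
WARP_SIZE = 32  # threads per warp, as stated in the text


def split_into_warps(num_threads: int, warp_size: int = WARP_SIZE) -> list[range]:
    """Group thread indices 0..num_threads-1 into consecutive warps."""
    return [range(start, min(start + warp_size, num_threads))
            for start in range(0, num_threads, warp_size)]


def count_divergent_warps(takes_branch: list[bool], warp_size: int = WARP_SIZE) -> int:
    """Count warps whose threads do not all follow the same code path."""
    divergent = 0
    for warp in split_into_warps(len(takes_branch), warp_size):
        paths = {takes_branch[t] for t in warp}
        if len(paths) > 1:  # both True and False occur inside this warp
            divergent += 1
    return divergent


# 100 threads need four warps; the last one is only partly filled.
warps = split_into_warps(100)
print(len(warps), [len(w) for w in warps])

# Branching on thread parity mixes both paths inside every warp ...
by_parity = [t % 2 == 0 for t in range(128)]
# ... while branching on the warp index keeps each warp on a single path.
by_warp = [(t // WARP_SIZE) % 2 == 0 for t in range(128)]
print(count_divergent_warps(by_parity), count_divergent_warps(by_warp))
```

The two branch patterns do the same amount of work per thread, but only the second keeps all threads of a warp on one code path, which is the situation the text recommends for performance.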
The 512 CUDA cores are organized in 16 SMs of 32 cores each. The GPU has six 64-bit memory partitions, for a 384-bit memory interface, supporting up to a total of 6 GB of GDDR5 DRAM memory. A host interface connects the GPU to the CPU via PCI-Express. The GigaThread global scheduler distributes thread blocks to SM thread schedulers.

GPGPU: NVIDIA Tesla
- Streaming multiprocessor (SM)
- Each SM contains multiple streaming processors

GPGPU: NVIDIA Fermi

Fermi's 16 SMs are positioned around a common L2 cache. Each SM is a vertical rectangular strip that contains an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache).

NVIDIA Fermi
- 16 Streaming Multiprocessors (SMs)
- Each SM has 32 CUDA cores
- Each CUDA (Compute Unified Device Architecture) core has one FPU and one IU

Third Generation Streaming Multiprocessor

The third generation SM introduces several architectural innovations that make it not only the most powerful SM yet built, but also the most programmable and efficient.

Fermi Streaming Multiprocessor (SM): instruction cache; two warp schedulers, each with a dispatch unit; register file (32,768 x 32-bit); 32 CUDA cores; 16 load/store (LD/ST) units; 4 special function units (SFUs); interconnect network; 64 KB shared memory / L1 cache; uniform cache. Each CUDA core contains a dispatch port, an operand collector, an FP unit, an INT unit, and a result queue.

512 High Performance CUDA cores

Each SM features 32 CUDA processors, a fourfold increase over prior SM designs. Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU). Prior GPUs used IEEE 754-1985 floating point arithmetic. The Fermi architecture implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction for both single and double precision arithmetic. FMA improves over a multiply-add (MAD) instruction by doing the multiplication and addition with a single final rounding step, with no loss of precision in the addition. FMA is more accurate than performing the operations separately. GT200 implemented double precision FMA.

In GT200, the integer ALU was limited to 24-bit precision for multiply operations; as a result, multi-instruction emulation sequences were required for integer arithmetic. In Fermi, the newly designed integer ALU supports full 32-bit precision for all instructions, consistent with standard programming language requirements. The integer ALU is also optimized to efficiently support 64-bit and extended precision operations. Various instructions are supported, including Boolean, shift, move, compare, convert, bit-field extract, bit-reverse insert, and population count.

16 Load/Store Units

Each SM has 16 load/store units, allowing source and destination addresses to be calculated for sixteen threads per clock. Supporting units load and store the data at each address to cache or DRAM.

End
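The effect of FMA's single final rounding step can be reproduced on an ordinary CPU. The sketch below is an emulation, not the GPU instruction: it assumes IEEE 754 double-precision Python floats, uses exact rational arithmetic from the standard `fractions` module to form a*b + c without intermediate rounding, and the function names and input values are chosen here purely for illustration.

```python
from fractions import Fraction


def multiply_add_separate(a: float, b: float, c: float) -> float:
    """Two roundings: the product is rounded to a double, then the sum is."""
    return a * b + c


def multiply_add_fused(a: float, b: float, c: float) -> float:
    """One rounding: a*b + c is formed exactly, then rounded once to a double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Inputs chosen so that the exact product, 1 - 2**-60, is not representable
# as a double and rounds to 1.0 when the multiplication is rounded on its own.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

print(multiply_add_separate(a, b, c))  # the tiny term is lost before the add
print(multiply_add_fused(a, b, c))     # the tiny term survives the single rounding
```

With these inputs the separate version returns exactly zero, because the rounded product is 1.0, while the fused version returns -2**-60, the exactly representable true result.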
