Multi Processor
Instructor: Mr. Tran Ngoc Thinh, Ph.D.
Group 3:
• 13070223 – Võ Thanh Biết
• 13070229 – Lưu Nguyễn Hoàng Hạnh
• 13070232 – Nguyễn Duy Hoàng
• 13070243 – Trần Duy Linh
• 13070244 – Nguyễn Thị Thúy Loan
• 13070251 – Phạm Ích Trí Nhân
• 13070258 – Nguyễn Anh Quốc
• 13070269 – Lê Thị Minh Thùy
• 12070558 – Cao Minh Vũ

Contents
• Multi processor
  • What is a multiprocessor system?
  • What category can it be in the Flynn classification?
  • Synchronization: state some techniques (spin lock, barrier) with their advantages and disadvantages
  • Synchronization for large-scale multiprocessors
  • Memory consistency: state the relaxed consistency models
  • Multithreading: how does multithreading improve the performance of a uniprocessor without superscalar? With superscalar?
• Cache coherence problem in multicore systems
  • Why keeping cache coherence on a multiprocessor is needed
  • Briefly explain the directory-based protocol. Where is it most applicable?
  • Explain the snoopy-based protocol. Where is it most applicable?
  • List some popular protocols in modern processors
  • What is the MESI protocol?
• Sample

Multiprocessing
• What is a multiprocessor system?
  • Multiprocessing is a type of processing in which two or more processors work together to process more than one program simultaneously
• Advantages of multiprocessor systems:
  • Reduced cost
  • Increased reliability
  • Increased throughput

Flynn classification
• Based on the notions of instruction and data streams
  • SISD (Single Instruction stream over a Single Data stream)
  • SIMD (Single Instruction stream over Multiple Data streams)
  • MISD (Multiple Instruction streams over a Single Data stream)
  • MIMD (Multiple Instruction streams over Multiple Data streams)

Synchronization
• Why synchronize?
  • Need to know when it is safe for different processes running on different processors to use shared data

Synchronization
    P1                      P2
    Lock(L)                 Lock(L)
    Load sharedvar          Load sharedvar
    Modify sharedvar        Modify sharedvar
    Store sharedvar         Store sharedvar
    Release(L)              Release(L)

Synchronization
• Hardware support for synchronization
  • Atomic instruction to fetch and update memory (atomic operation)
    • Atomic exchange: interchanges a value stored in a register with a value stored in a memory location representing a lock
    • Test-and-set: tests a value and sets it if the value passes the test
    • Fetch-and-increment: returns the value of a memory location and atomically increments it after the fetch is done
  • Atomic read and write for multiprocessors
    • load-linked (LL) and store-conditional (SC)

Synchronization
• Initial implementations
  • Semaphores
• Current trends
  • Spin locks
  • Condition variables
  • Read-write locks
  • Reference locks

Synchronization
• Spin locks: locks that a processor continuously tries to acquire, spinning around a loop until it succeeds

    while (!acquire(lock))
        ;  /* spin */
    /* some computation on shared data (critical section) */
    release(lock);

• Acquire is based on a read-modify-write primitive

Multi-threading
• What is multi-threading?
• What are the advantages and disadvantages of multi-threading?
• Types of multi-threading
  • Coarse-grained multi-threading
  • Fine-grained multi-threading
  • Simultaneous multi-threading

What is multi-threading?
• A thread is the smallest sequence of programmed instructions that can be managed independently by an operating system scheduler
• A multi-threading processor executes multiple threads concurrently within the context of a single process. The threads share the resources of a single core (the computing units, the CPU caches and the translation lookaside buffer (TLB)), while different processes do not share these resources
• On a single processor, multi-threading is generally implemented by time-division multiplexing
• Multi-threading aims to increase utilization of a single core by using thread-level as well as instruction-level parallelism

Advantages of multi-threading
• If the main execution thread blocks on a long-running task, the other threads can continue and use the computing resources that would otherwise sit idle, which can lead to faster overall execution
• If several threads work on the same set of data, they can share their cache, leading to better cache usage or synchronization on its values
• The multi-threading programming model can also be applied to a single process to enable parallel execution on a multiprocessing system

Disadvantages of multi-threading
• Multiple threads can interfere with each other when sharing hardware resources such as caches or translation lookaside buffers
• Program performance can degrade due to contention for shared resources
• The execution time of a single thread can degrade due to slower frequencies and/or the additional pipeline stages needed to accommodate thread-switching hardware
• Thread scheduling is also a major problem in multi-threading
• Multi-threading requires more changes to both application programs and operating systems, so it is more visible to software

Coarse-grained multi-threading
• When one thread is blocked by an event, a thread scheduler must quickly
  choose among the list of ready-to-run threads to execute next, as well as maintain the list of stalled threads
• The goal of coarse-grained multi-threading is to allow quick switching between a blocked thread and another thread that is ready to run
• Another area of research is what types of events should cause a thread switch: cache misses, inter-thread communication, etc.
• The main problem of coarse-grained multi-threading is that the system may make a context switch at an inappropriate time, causing lock convoys, priority inversion or other negative effects
• Best practice: to switch efficiently between active threads, each active thread needs its own program counter and register set
• Coarse-grained multi-threading runs one thread of execution for hundreds or thousands of cycles while all other threads wait their turn, so the main processor pipeline contains only one thread at a time

Fine-grained multi-threading
• In fine-grained multi-threading, the main processor pipeline contains multiple threads, with context switches effectively occurring between pipe stages
• The purpose of interleaved multi-threading is to remove all data dependency stalls from the execution pipeline. Since one thread is relatively independent of other threads, there is less chance of an instruction in one pipe stage needing an output from an older instruction in the pipeline
• Fine-grained multi-threading has the additional cost of each pipeline stage tracking the thread ID of the instruction it is processing
• Also, since more threads are being executed concurrently in the pipeline, shared resources such as caches and the translation lookaside buffer (TLB) need to be larger to avoid thrashing between the different threads

Fine-grained multi-threading
• For real-time applications, fine-grained multi-threading can guarantee that a "real-time" thread executes with precise timing no matter what happens to the other threads, even if some other thread locks up in an infinite loop or is
  continuously interrupted
• The ability to do useful work on the other threads while the stalled thread is waiting
• Fine-grained multi-threading never has a pipeline stall and doesn't need feed-forward circuits
• Disadvantage

Sample
• Using a parallel system to speed up solving the Dijkstra problem
  • MIMD (multi-computers)
  • Explicit parallelism

Problem
• All-pairs shortest path
• Application
• Algorithm: Dijkstra
  • G = (V, E, w): weighted, directed graph
  • P: number of computers
  • P = 1: sequential
  • P > |V|: source-parallel

Dijkstra
• Single-source shortest path
  • Sequential
  • Parallel
• All-pairs shortest path
  • Sequential: |V| runs of sequential single-source shortest path
  • Parallel, source-partitioned: |V| runs of sequential single-source shortest path, with each source vertex vi run on a different computer

Dijkstra
• All-pairs shortest path
  • Parallel, source-parallel:
    • Separate into p/n tasks for each vertex vi
    • Apply parallel single-source shortest path for vertex vi with p/n processes

Dijkstra
• Turn up
  • Read data in parallel
  • Use a butterfly topology for communication
  • Optimized code

Dijkstra
• Result

Q&A
THANKS!!!

[...]...

• E.g. at level 1, processors 0 and 1 synchronize on one barrier, processors 2 and 3 on another, etc. At the next level, pair up the pairs
• Processors 0 and 2 increment a count at level 2; processors 1 and 3 just wait for it to be released
• At level 3, processors 0 and 4 increment the counter, while 1, 2, 3, 5, 6, and 7 just spin until this level-3 barrier is released
• At the highest level all processes will spin and a few...

    ... (20 × 50)                                                        1000
    Write miss by all waiting processors, one successful lock (50),
      and invalidate all copies (19 × 50)                                1000
    Total time for 1 processor to acquire and release the lock           3050

• Each time one processor gets the lock, it drops out of the competition, so the average is 1525 cycles
• 20 × 1525 = 30,500, i.e. over 30,000 cycles for 20 processors to pass through the lock
• The problem is contention for the lock and serialization of lock access: once the lock is free, all compete to see [...]
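
The contention described above comes from every waiting processor issuing a read-modify-write as soon as the lock is released. One common way to soften it is a test-and-test-and-set spin lock, where waiters spin on an ordinary read of their cached copy and only attempt the atomic exchange when the lock looks free. The following is a minimal sketch, assuming a C11 compiler with <stdatomic.h>; the names tts_lock, tts_init, tts_try_acquire, tts_acquire, tts_release and tts_demo are hypothetical and do not come from the slides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type for illustration: 0 = free, 1 = held. */
typedef struct {
    atomic_int held;
} tts_lock;

static void tts_init(tts_lock *l) {
    atomic_init(&l->held, 0);
}

/* One atomic exchange (a read-modify-write primitive).
 * Returns true only if the lock was free and is now ours. */
static bool tts_try_acquire(tts_lock *l) {
    return atomic_exchange_explicit(&l->held, 1, memory_order_acquire) == 0;
}

/* Test-and-test-and-set: spin on plain loads while the lock is held,
 * and attempt the exchange only once it appears free. */
static void tts_acquire(tts_lock *l) {
    for (;;) {
        while (atomic_load_explicit(&l->held, memory_order_relaxed) != 0) {
            /* spin on the locally cached value */
        }
        if (tts_try_acquire(l)) {
            return;
        }
        /* another processor won the race; go back to spinning */
    }
}

static void tts_release(tts_lock *l) {
    atomic_store_explicit(&l->held, 0, memory_order_release);
}

/* Single-threaded walk-through of the lock states; returns 1 on success. */
static int tts_demo(void) {
    tts_lock l;
    tts_init(&l);
    tts_acquire(&l);                    /* lock is free, returns at once */
    bool second = tts_try_acquire(&l);  /* already held, must fail */
    tts_release(&l);
    bool third = tts_try_acquire(&l);   /* free again, must succeed */
    tts_release(&l);
    assert(!second && third);
    return !second && third;
}
```

In a real program the critical section would sit between tts_acquire and tts_release, exactly as in the slides' acquire/release loop. This variant reduces traffic while the lock is held, but all waiters still race at the moment of release, so it does not remove the serialization the slides point out.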