Parallel Computing, Thoai Nam: Parallel Processing 12, Basic Parallel Algorithms (SinhVienZone.com)

Parallel Algorithms
Thoai Nam
Khoa Công Nghệ Thông Tin, Đại Học Bách Khoa Tp.HCM (Faculty of Information Technology, Ho Chi Minh City University of Technology)

Outline
- Introduction to parallel algorithms development
- Reduction algorithms
- Broadcast algorithms
- Prefix sums algorithms

Introduction to Parallel Algorithm Development
- Parallel algorithms mostly depend on the destination parallel platforms and architectures
- MIMD algorithm classification
  - Pre-scheduled data-parallel algorithms
  - Self-scheduled data-parallel algorithms
  - Control-parallel algorithms
- M.J. Quinn (1994) gives a set of design strategies for parallel algorithms

Basic Parallel Algorithms
- Elementary problems to be considered
  - Reduction
  - Broadcast
  - Prefix sums
- Target architectures
  - Hypercube SIMD model
  - 2D-mesh SIMD model
  - UMA multiprocessor model
  - Hypercube multicomputer

Reduction Problem
- Description: given n values a0, a1, a2, …, an-1 and an associative operation ⊕, use p processors to compute the sum
    S = a0 ⊕ a1 ⊕ a2 ⊕ … ⊕ an-1
- Design strategy
  - "If a cost optimal CREW PRAM algorithm exists and the way the PRAM processors interact through shared variables maps onto the target architecture, a PRAM algorithm is a reasonable starting point"

Cost Optimal PRAM Algorithm for the Reduction Problem
- Cost optimal PRAM algorithm complexity: O(log n) (using n div 2 processors)
- Example for n = 8 (values a0 … a7) and p = 4 processors:

    Step    Active processors
    j=0     P0, P1, P2, P3
    j=1     P0, P2
    j=2     P0
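
The n = 8, p = 4 example can be traced with a short sequential simulation. The sketch below is an illustration in Python and not part of the original slides: the name `pram_reduce` and the default addition operator are assumptions, and the number of rounds is taken as ceiling(log n) so that the example runs through j = 0, 1, 2.

```python
from operator import add


def pram_reduce(values, op=add):
    """Simulate the PRAM tree reduction with p = n // 2 virtual processors."""
    a = list(values)
    n = len(a)
    if n == 0:
        raise ValueError("need at least one value")
    p = n // 2
    rounds = (n - 1).bit_length()  # equals ceiling(log2(n)) for n >= 1
    for j in range(rounds):
        stride = 1 << j  # 2^j
        # On a PRAM the active processors execute this step simultaneously.
        # Visiting them one by one gives the same result, because no cell
        # read in this round is also written in this round.
        for i in range(p):
            if i % stride == 0 and 2 * i + stride < n:
                a[2 * i] = op(a[2 * i], a[2 * i + stride])
    return a[0]


if __name__ == "__main__":
    print(pram_reduce([1, 2, 3, 4, 5, 6, 7, 8]))
```

With eight inputs the active processors per round are {P0, P1, P2, P3}, {P0, P2} and {P0}, and the result ends up in a[0].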

Cost Optimal PRAM Algorithm for the Reduction Problem (cont'd)
Using p = n div 2 processors to add n numbers:

    Global a[0..n-1], n, i, j, p;
    Begin
      spawn(P0, P1, …, Pp-1);
      for all Pi where 0 ≤ i ≤ p-1
        for j = 0 to ceiling(log n)-1
          if i mod 2^j = 0 and 2i + 2^j < n then
            a[2i] := a[2i] ⊕ a[2i + 2^j];
          endif;
        endfor j;
      endforall;
    End

Note: the processors communicate in a binomial-tree pattern.

Solving the Reduction Problem on a Hypercube SIMD Computer
[Figure: reduction on an 8-node hypercube, P0 to P7]
- Step 1: Reduce by dimension j=2
- Step 2: Reduce by dimension j=1
- Step 3: Reduce by dimension j=0
- The total sum will be at P0

Solving the Reduction Problem on a Hypercube SIMD Computer (cont'd)
Using p processors to add n numbers (p [...]
- Allocate workload for each processor

[...]

    {Stage 2: compute the total sum}
    while j > 0 begin
      if i ≥ j/2 then
        partial[i] := local_sum;
        flags[i] := 1;
        break;
      else
        while (flags[i+j/2] = 0) do;   {Each processor waits until its partner's partial sum is available}
        local_sum := local_sum ⊕ partial[i+j/2];
      endif;
      j := j/2;
    end while;
    if i = 0 then global_sum := local_sum;
    end forall;
    End

Solving the Reduction Problem on the UMA Multiprocessor Model (cont'd)
- Algorithm complexity: O(n/p + p)
- What is the advantage of this algorithm compared with one that uses a critical section to compute the total sum?
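
The flag-based combining loop can be tried out with ordinary threads. The following is a minimal Python sketch under stated assumptions, not the slides' own program: `uma_reduce` is a hypothetical name, `threading.Event` objects stand in for the `flags` array (so a thread blocks instead of spinning), the strided split of the input is an assumption because the allocation stage is missing from the text, p is assumed to be a power of two, and the operator is assumed to be commutative as well as associative.

```python
import threading
from operator import add


def uma_reduce(values, p=4, op=add):
    """Reduce `values` with p threads that combine partial results pairwise."""
    if p < 1 or p & (p - 1):
        raise ValueError("this sketch assumes p is a power of two")
    if len(values) < p:
        raise ValueError("need at least one value per thread")
    partial = [None] * p
    ready = [threading.Event() for _ in range(p)]  # plays the role of flags[i]
    result = []

    def worker(i):
        # Local stage (assumed allocation): thread i reduces every p-th value.
        block = values[i::p]
        local_sum = block[0]
        for v in block[1:]:
            local_sum = op(local_sum, v)
        # Combining stage, following the loop structure of the pseudocode.
        j = p
        while j > 0:
            half = j // 2
            if i >= half:
                partial[i] = local_sum   # publish, then signal the partner
                ready[i].set()
                break
            ready[i + half].wait()       # wait for the partner's partial sum
            local_sum = op(local_sum, partial[i + half])
            j = half
        if i == 0:
            result.append(local_sum)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]


if __name__ == "__main__":
    print(uma_reduce(list(range(1, 17)), p=4))
```

Each thread waits only on its own partner's flag, so no lock is shared by all p threads during the combining stage.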
- Design strategy 2:
  - Look for a data-parallel algorithm before considering a control-parallel algorithm
- On a MIMD computer, we should exploit both data parallelism and control parallelism (try to develop an SPMD program if possible)

Broadcast
- Description:
  - Given a message of length M stored at one processor, send this message to all other processors
- Things to be considered:
  - Length of the message
  - Message-passing overhead and data-transfer time

Broadcast Algorithm on Hypercube SIMD
- If the amount of data is small, the best algorithm takes log p communication steps on a p-node hypercube
- Example: broadcasting a number on an 8-node hypercube
  [Figure: 8-node hypercube, P0 to P7]
  - Step 1: Send the number via the 1st dimension of the hypercube
  - Step 2: Send the number via the 2nd dimension of the hypercube
  - Step 3: Send the number via the 3rd dimension of the hypercube

Broadcast Algorithm on Hypercube SIMD (cont'd)
Broadcasting a number from P0 to all other processors

    Local j,          {Loop iteration}
          partner,    {Partner processor}
          position,   {Position in broadcast tree}
          value;      {Value to be broadcast}
    Begin
      spawn(P0, P1, …, Pp-1);
      for j := 0 to log p - 1
        for all Pi where 0 ≤ i ≤ p-1
          if i < 2^j then
            partner := i + 2^j;
            [partner]value := value;
          endif;
        endforall;
      end for j;
    End

Broadcast Algorithm on Hypercube SIMD (cont'd)
- The previous algorithm
  - Uses at most p/2 out of p log p links of the hypercube
  - Requires time M log p to broadcast a message of length M, so it is not efficient when M is large
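
The doubling pattern of the simple hypercube broadcast can be checked with a small simulation. This Python sketch is only an illustration: `hypercube_broadcast` is a hypothetical name, the nodes are modelled as slots of a list, and p is assumed to be a power of two.

```python
def hypercube_broadcast(value, p):
    """Simulate broadcasting `value` from node 0 on a p-node hypercube.

    Returns the value held by every node and the list of (sender, receiver)
    pairs used in each communication step.
    """
    if p < 1 or p & (p - 1):
        raise ValueError("this sketch assumes p is a power of two")
    held = [None] * p
    held[0] = value
    schedule = []
    dims = p.bit_length() - 1  # log2(p) communication steps
    for j in range(dims):
        sends = []
        for i in range(1 << j):          # nodes that already hold the value
            partner = i + (1 << j)       # neighbour across dimension j
            held[partner] = held[i]
            sends.append((i, partner))
        schedule.append(sends)
    return held, schedule


if __name__ == "__main__":
    held, schedule = hypercube_broadcast(42, 8)
    for step, sends in enumerate(schedule, start=1):
        print(f"step {step}: {sends}")
```

For p = 8 the steps use 1, 2 and 4 links respectively, so at most p/2 links carry data at any one time.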

- For long messages, Johnsson and Ho (1989) designed an algorithm that executes log p times faster by:
  - Breaking the message into log p parts
  - Broadcasting each part to all other nodes through a different binomial spanning tree

Johnsson and Ho's Broadcast Algorithm on Hypercube SIMD
[Figure: message parts A, B and C are broadcast through different spanning trees of the hypercube]
- Time to broadcast a message of length M is M log p / log p = M
- The maximum number of links used simultaneously is p log p, much greater than that of the previous algorithm

Johnsson and Ho's Broadcast Algorithm on Hypercube SIMD (cont'd)
- Design strategy
  - As the problem size grows, use the algorithm that makes best use of the available resources

Prefix Sums Problem
- Description:
  - Given an associative operation ⊕ and an array A containing n elements, compute the n quantities
      A[0]
      A[0] ⊕ A[1]
      A[0] ⊕ A[1] ⊕ A[2]
      …
      A[0] ⊕ A[1] ⊕ A[2] ⊕ … ⊕ A[n-1]
- Cost-optimal PRAM algorithm:
  - "Parallel Computing: Theory and Practice", section 2.3.2, p. 32

Prefix Sums Problem on Multicomputers
- Finding the prefix sums of 16 values
  [Figure: panels (a) to (d) across four processors; after step (c) every processor holds the prefix sums of the local sums, 18 35 43 62]

Prefix Sums Problem on Multicomputers (cont'd)
- Step (a)
  - Each processor is allocated its share of the values
- Step (b)
  - Each processor computes the sum of its local elements
- Step (c)
  - The prefix sums of the local sums are computed and distributed to all processors
- Step (d)
  - Each processor computes the prefix sums of its own elements and adds to each result the sum of the values held in lower-numbered processors
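
Steps (a) to (d) can be sketched as a sequential simulation in which each block of the input plays the role of one processor. This is a hedged Python illustration, not the slides' implementation: `block_prefix_sums` is a hypothetical name, n is assumed to be a multiple of p, and the 16 input values are an assumption chosen so that the prefix sums of the local sums come out as 18 35 43 62, the numbers that survive from the figure.

```python
from itertools import accumulate
from operator import add


def block_prefix_sums(values, p, op=add):
    """Prefix sums of `values`, computed block by block as p processors would."""
    n = len(values)
    if p < 1 or n == 0 or n % p != 0:
        raise ValueError("this sketch assumes n is a positive multiple of p")
    size = n // p
    # Step (a): each processor receives a contiguous share of the values.
    blocks = [values[k * size:(k + 1) * size] for k in range(p)]
    # Step (b): each processor reduces its own block to a local sum.
    local_sums = [list(accumulate(b, op))[-1] for b in blocks]
    # Step (c): prefix sums of the local sums, made known to every processor.
    offsets = list(accumulate(local_sums, op))
    # Step (d): local prefix sums, shifted by the total of all lower blocks.
    result = []
    for k, b in enumerate(blocks):
        local_prefix = list(accumulate(b, op))
        if k > 0:
            local_prefix = [op(offsets[k - 1], x) for x in local_prefix]
        result.extend(local_prefix)
    return offsets, result


if __name__ == "__main__":
    data = [3, 2, 7, 6, 0, 5, 4, 8, 2, 0, 1, 5, 2, 3, 8, 6]  # assumed input
    offsets, prefix = block_prefix_sums(data, p=4)
    print(offsets)
    print(prefix)
```

Only step (c) needs communication between processors; steps (b) and (d) work purely on local data.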
