Parallel Programming: for Multicore and Cluster Systems, Part 39

5. Computation of the elimination factors: The function compute_elim_fact_loc() computes the elimination factors $l_{ik}$ for all elements $a_{ik}$ owned by the processor. The elimination factors are stored in the buffer elim_buf.

5a. Distribution of the elimination factors: A single-broadcast operation is used to send the elimination factors to all processors in the same row group $Rop(q)$; the corresponding communicator comm(Rop(q)) is used. The number (rank) of the root processor $q$ for this broadcast operation in a group $G$ is determined by the function rank(q, G).

6. Computation of the matrix elements: The computation of the matrix elements by compute_local_entries() and the backward substitution performed by backward_substitution() are similar to the pseudocode in Fig. 7.2. The main differences to the program in Fig. 7.2 are that more communication is required and that almost all collective communication operations are performed on subgroups of the set of processors rather than on the entire set of processors.

7.1.4 Analysis of the Parallel Execution Time

The analysis of the parallel execution time of the Gaussian elimination uses functions expressing the computation and communication times depending on the characteristics of the parallel machine, see also Sect. 4.4. The function describing the parallel execution time of the program in Fig. 7.6 additionally contains the parameters $p_1$, $p_2$, $b_1$, and $b_2$ of the parameterized data distribution in Formula (7.7). In the following, we model the parallel execution time of the Gaussian elimination with checkerboard distribution, neglecting pivoting and backward substitution for simplicity, see also [147]. This covers phases 4, 5, 5a, and 6 of the Gaussian elimination. For the derivation of functions reflecting the parallel execution time, these four SPMD computation phases can be considered separately, since there is a barrier synchronization after each phase.

For a communication phase, a formula describing the time of a collective communication operation is used which expresses the communication time as a function of the number of processors and the message size. For the Gaussian elimination (without pivoting), phases 4 and 5a implement communication with a single-broadcast operation. The communication time for a single-broadcast with $p$ processors and message size $m$ is denoted as $T_{sb}(p, m)$. We assume that independent communication operations on disjoint processor sets can be performed in parallel. The values for $p$ and $m$ have to be determined for the specific situation. These parameters depend on the data distribution and the corresponding sizes of row and column groups as well as on the step $k$, $k = 1, \ldots, n$, of the Gaussian elimination, since messages get smaller for increasing $k$.

Also, the modeling of the computation times of phases 5 and 6 depends on the step number $k$, since fewer elimination factors or matrix elements have to be computed for increasing $k$, and thus the number of arithmetic operations decreases with increasing $k$. The time for an arithmetic operation is denoted as $t_{op}$. Since the processors perform an SPMD program, the processor computing the most arithmetic operations determines the computation time for the specific computation phase.
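On an MPI platform, the group-oriented broadcasts of phases 4 and 5a map naturally onto communicators for the row and column groups. The following minimal sketch shows what phase 5a could look like; the names elim_buf, num_factors, root_rank, and comm_row, standing for comm(Rop(q)), are illustrative assumptions and not the book's actual implementation.

```c
/* Minimal sketch of phase 5a: broadcasting the locally computed
 * elimination factors within the row group Rop(q). The buffer name
 * elim_buf follows the text; the communicator comm_row for Rop(q),
 * the count num_factors, and root_rank are assumed for illustration. */
#include <mpi.h>

void distribute_elim_factors(double *elim_buf, int num_factors,
                             int root_rank, MPI_Comm comm_row)
{
    /* All processors of the row group Rop(q) participate; the root is
     * the processor q that computed the factors in phase 5. */
    MPI_Bcast(elim_buf, num_factors, MPI_DOUBLE, root_rank, comm_row);
}
```

The pivot-row broadcast of phase 4 would be analogous, using a communicator for the column group Cop(q) instead.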
The following modeling of communication and computation times for one step $k$ uses the index sets

$$I_q = \{(i, j) \in \{1, \ldots, n\} \times \{1, \ldots, n\} \mid P_q \text{ owns } a_{ij}\},$$

which contain the indices of the matrix elements stored locally in the memory of processor $P_q$:

• The broadcasting of the pivot row $k$ in phase 4 of step $k$ sends the elements of row $k$ with column index $\geq k$ to the processors needing the data for computations in rows $\geq k$. Since the pivot row is distributed across the processors of the row group $Ro(k)$, all the processors of $Ro(k)$ send their part of row $k$. The amount of data sent by one processor $q \in Ro(k)$ is the number of elements of row $k$ with column indices $\geq k$ (i.e., with indices $(k, k), \ldots, (k, n)$) owned by processor $q$. This is the number

$$N_q^{row \geq k} := \#\{(k, j) \in I_q \mid j \geq k\}. \qquad (7.8)$$

(The symbol $\#X$ for a set $X$ denotes the number of elements of this set $X$.) The processor $q \in Ro(k)$ sends its data to those processors owning elements in the rows with row index $\geq k$ which have the same column indices as the elements of processor $q$. These are the processors in the column group $Cop(q)$ of the processor $q$, and thus these processors are the recipients of the single-broadcast operation of processor $q$. Since all column groups of the processors $q \in Ro(k)$ are disjoint, the broadcast operations can be done in parallel, and the communication time is

$$\max_{q \in Ro(k)} T_{sb}\big(\#Cop(q),\, N_q^{row \geq k}\big).$$

• In phase 5 of step $k$, the elimination factors using the element $a_{kk}^{(k)}$ and the elements $a_{ik}^{(k)}$ for $i > k$ are computed by the processors owning these elements of column $k$, i.e., by the processors $q \in Co(k)$, according to Formula (7.2). Each of the processors computes the elimination factors of its part, which are

$$N_q^{col > k} := \#\{(i, k) \in I_q \mid i > k\} \qquad (7.9)$$

elimination factors for processor $q \in Co(k)$. Since the computations are done in parallel, this results in the computation time

$$\max_{q \in Co(k)} N_q^{col > k} \cdot t_{op}.$$

• In phase 5a the elimination factors are sent to all processors which recalculate the matrix elements with indices $(i, j)$, $i > k$, $j > k$. Since the elimination factors $l_{ik}^{(k)}$, $i = k+1, \ldots, n$, are needed within the same row $i$, a row-oriented single-broadcast operation is used to send the data to the processors owning parts of row $i$. A processor $q \in Co(k)$ sends its data to the processors in its row group $Rop(q)$. These are the data elements computed in the previous phase, i.e., $N_q^{col > k}$ data elements, and the communication time is

$$\max_{q \in Co(k)} T_{sb}\big(\#Rop(q),\, N_q^{col > k}\big).$$

• In phase 6 of step $k$, all matrix elements in the lower right rectangular area are recalculated. Each processor $q$ recalculates the entries it owns; these are the number of elements per column for rows with indices $> k$ (i.e., $N_q^{col > k}$) multiplied by the number of elements per row for columns with indices $> k$ (i.e., $N_q^{row > k}$). Since two arithmetic operations are performed for one entry according to Formula (7.4), the computation time is

$$\max_{q \in P} N_q^{col > k} \cdot N_q^{row > k} \cdot 2 t_{op}.$$

In total, the parallel execution time over all phases and all steps is

$$T(n, p) = \sum_{k=1}^{n-1} \Big[ \max_{q \in Ro(k)} T_{sb}\big(\#Cop(q),\, N_q^{row \geq k}\big) + \max_{q \in Co(k)} N_q^{col > k} \cdot t_{op} \qquad (7.10)$$
$$\qquad\qquad + \max_{q \in Co(k)} T_{sb}\big(\#Rop(q),\, N_q^{col > k}\big) + \max_{q \in P} N_q^{col > k} \cdot N_q^{row > k} \cdot 2 t_{op} \Big].$$

This parallel execution time can be expressed in terms of the parameters of the data distribution $((p_1, b_1), (p_2, b_2))$, the problem size $n$, and the step number $k$ by estimating the sizes of messages and the number of arithmetic operations.
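For the parameterized block-cyclic checkerboard distribution, the counts (7.8) and (7.9) can be computed directly. The following is a minimal counting sketch; it assumes the owner convention that the processor at grid position $(q_1, q_2)$ owns row $i$ iff $\lfloor (i-1)/b_1 \rfloor \bmod p_1 = q_1$ and column $j$ iff $\lfloor (j-1)/b_2 \rfloor \bmod p_2 = q_2$, which is our reading of Formula (7.7) (not reproduced in this excerpt), not the book's code.

```c
/* N_q^{row>=k} from Formula (7.8): elements (k, j) with j >= k owned
 * by the processor in processor column q2 (only meaningful for
 * processors q in the row group Ro(k)). 1-based matrix indices. */
int n_row_geq_k(int n, int k, int q2, int p2, int b2)
{
    int count = 0;
    for (int j = k; j <= n; j++)
        if (((j - 1) / b2) % p2 == q2)
            count++;
    return count;
}

/* N_q^{col>k} from Formula (7.9): elements (i, k) with i > k owned by
 * the processor in processor row q1 (for processors q in Co(k)). */
int n_col_gt_k(int n, int k, int q1, int p1, int b1)
{
    int count = 0;
    for (int i = k + 1; i <= n; i++)
        if (((i - 1) / b1) % p1 == q1)
            count++;
    return count;
}
```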
For the estimation, larger blocks of data, called superblocks, are considered. A superblock consists of $p_1 \times p_2$ consecutive blocks of size $b_1 \times b_2$, i.e., it has $p_1 b_1$ rows and $p_2 b_2$ columns. There are $\lceil n/(p_1 b_1) \rceil$ superblocks in the row direction and $\lceil n/(p_2 b_2) \rceil$ in the column direction. Each of the $p$ processors owns exactly one data block of size $b_1 \times b_2$ of a superblock. The two-dimensional matrix $A$ is covered by these superblocks, and from this covering it can be estimated how many elements of the smaller matrices $A^{(k)}$ are owned by a specific processor. The number of elements owned by a processor $q$ in row $k$ for column indices $\geq k$ can be estimated by

$$N_q^{row \geq k} \leq \left\lceil \frac{n-k+1}{p_2 b_2} \right\rceil b_2 \leq \left( \frac{n-k+1}{p_2 b_2} + 1 \right) b_2 = \frac{n-k+1}{p_2} + b_2, \qquad (7.11)$$

where $\lceil (n-k+1)/(p_2 b_2) \rceil$ is the number of superblocks covering row $k$ for column indices $\geq k$, which are $n-k+1$ indices, and $b_2$ is the number of column elements that each processor of $Ro(k)$ owns in a complete superblock. For the covering of one row, only the number of columns $p_2 b_2$ of a superblock is needed. Analogously, the number of elements owned by a processor $q$ in column $k$ for row indices $> k$ can be estimated by

$$N_q^{col > k} \leq \left\lceil \frac{n-k}{p_1 b_1} \right\rceil b_1 \leq \left( \frac{n-k}{p_1 b_1} + 1 \right) b_1 = \frac{n-k}{p_1} + b_1, \qquad (7.12)$$

where $\lceil (n-k)/(p_1 b_1) \rceil$ is the number of superblocks covering column $k$ for row indices $> k$, which are $n-k$ row indices, and $b_1$ is the number of row elements that each processor of $Co(k)$ owns in a complete superblock. Using these estimations, the parallel execution time in Formula (7.10) can be approximated by

$$T(n, p) \approx \sum_{k=1}^{n-1} \left[ T_{sb}\!\left(p_1, \frac{n-k+1}{p_2} + b_2\right) + \left(\frac{n-k}{p_1} + b_1\right) t_{op} + T_{sb}\!\left(p_2, \frac{n-k}{p_1} + b_1\right) + \left(\frac{n-k}{p_1} + b_1\right)\!\left(\frac{n-k}{p_2} + b_2\right) 2 t_{op} \right].$$

Suitable parameters leading to good performance can be derived from this modeling. For the communication time of a single-broadcast operation, we assume a communication time

$$T_{sb}(p, m) = \log p \cdot (\tau + m \cdot t_c)$$

with a startup time $\tau$ and a transfer time $t_c$. This formula models the communication time in many interconnection networks, such as a hypercube. Using the summation formula $\sum_{k=1}^{n-1} (n-k+1) = \sum_{k=2}^{n} k = \big(\sum_{k=1}^{n} k\big) - 1 = \frac{n(n+1)}{2} - 1$, the communication time of phase 4 results in

$$\sum_{k=1}^{n-1} T_{sb}\!\left(p_1, \frac{n-k+1}{p_2} + b_2\right) = \sum_{k=1}^{n-1} \log p_1 \left[\left(\frac{n-k+1}{p_2} + b_2\right) t_c + \tau\right]$$
$$= \log p_1 \left[\left(\frac{n(n+1)}{2} - 1\right)\frac{t_c}{p_2} + (n-1)\, b_2\, t_c + (n-1)\,\tau\right].$$

For the second and third terms, the summation formula $\sum_{k=1}^{n-1}(n-k) = \frac{n(n-1)}{2}$ is used, so that the computation time

$$\sum_{k=1}^{n-1} \left(\frac{n-k}{p_1} + b_1\right) t_{op} = \left(\frac{n(n-1)}{2 p_1} + (n-1)\, b_1\right) t_{op}$$

and the communication time

$$\sum_{k=1}^{n-1} T_{sb}\!\left(p_2, \frac{n-k}{p_1} + b_1\right) = \sum_{k=1}^{n-1} \log p_2 \left[\left(\frac{n-k}{p_1} + b_1\right) t_c + \tau\right]$$
$$= \log p_2 \left[\frac{n(n-1)}{2}\,\frac{t_c}{p_1} + (n-1)\, b_1\, t_c + (n-1)\,\tau\right]$$

result. For the last term, the summation formula $\sum_{k=1}^{n-1} \frac{n-k}{p_1} \cdot \frac{n-k}{p_2} = \frac{1}{p} \sum_{k=1}^{n-1} k^2 = \frac{1}{p}\,\frac{n(n-1)(2n-1)}{6}$ is used. The total parallel execution time is

$$T(n, p) = \log p_1 \left[\left(\frac{n(n+1)}{2} - 1\right)\frac{t_c}{p_2} + (n-1)\, b_2\, t_c + (n-1)\,\tau\right] + \left(\frac{n(n-1)}{2}\,\frac{1}{p_1} + (n-1)\, b_1\right) t_{op}$$
$$+ \log p_2 \left[\frac{n(n-1)}{2}\,\frac{t_c}{p_1} + (n-1)\, b_1\, t_c + (n-1)\,\tau\right]$$
$$+ \left[\frac{n(n-1)(2n-1)}{6p} + \frac{n(n-1)}{2}\left(\frac{b_2}{p_1} + \frac{b_1}{p_2}\right) + (n-1)\, b_1 b_2\right] 2 t_{op}.$$

The block sizes $b_i$, $1 \leq b_i \leq n/p_i$, for $i = 1, 2$ appear in the execution time only as factors of terms with positive coefficients and, thus, the minimal execution time is achieved for $b_1 = b_2 = 1$.
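Before choosing $p_1$ and $p_2$ analytically, the closed form can be checked numerically. The following sketch is our own illustration: it evaluates the runtime model above for all power-of-two factorizations $p = p_1 \cdot p_2$ with $b_1 = b_2 = 1$; the machine parameters $\tau$, $t_c$, and $t_{op}$ are invented example values, not measurements from the book.

```c
/* Numeric evaluation of the closed-form runtime model derived above.
 * tau, tc, top are invented example values for illustration only. */
#include <math.h>
#include <stdio.h>

double t_model(double n, double p1, double p2, double b1, double b2,
               double tau, double tc, double top)
{
    double p = p1 * p2;
    return log2(p1) * ((n*(n+1)/2 - 1) * tc/p2 + (n-1)*b2*tc + (n-1)*tau)
         + (n*(n-1)/(2*p1) + (n-1)*b1) * top
         + log2(p2) * (n*(n-1)/2 * tc/p1 + (n-1)*b1*tc + (n-1)*tau)
         + (n*(n-1)*(2*n-1)/(6*p)
            + n*(n-1)/2 * (b2/p1 + b1/p2)
            + (n-1)*b1*b2) * 2 * top;
}

int main(void)
{
    double n = 4096, p = 64, tau = 1e-5, tc = 1e-8, top = 1e-9;
    /* Scan the factorizations p = p1 * p2 with b1 = b2 = 1. */
    for (double p1 = 1; p1 <= p; p1 *= 2)
        printf("p1 = %2.0f  p2 = %2.0f  T = %e s\n",
               p1, p / p1, t_model(n, p1, p / p1, 1, 1, tau, tc, top));
    return 0;
}
```

For such parameter values the scan exhibits its minimum at the square grid $p_1 = p_2 = \sqrt{p}$, consistent with the analysis that follows.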
In the resulting formula, the terms $(\log p_1 + \log p_2)\,(n-1)(\tau + t_c) = \log p \cdot (n-1)(\tau + t_c)$, $(n-1) \cdot 3 t_{op}$, and $\frac{n(n-1)(2n-1)}{3p}\, t_{op}$ are independent of the specific choice of $p_1$ and $p_2$ and need not be considered. The terms $\frac{n(n-1)}{2}\,\frac{1}{p_1}\, t_{op}$ and $\frac{t_c}{p_2}\,(n-1)\log p_1$ are asymmetric in $p_1$ and $p_2$. For simplicity, we ignore these terms in the analysis, which is justified since they are small compared to the remaining terms: the first term has $t_{op}$ as a factor, which is usually small, and the second term, with factor $t_c$, is only linear in $n$. The remaining terms of the execution time are symmetric in $p_1$ and $p_2$ and have constants quadratic in $n$. Using $p_2 = p/p_1$, this time can be expressed as

$$T_S(p_1) = \frac{n(n-1)}{2}\left(\frac{p_1 \log p_1}{p} + \frac{\log p - \log p_1}{p_1}\right) t_c + \frac{n(n-1)}{2}\left(\frac{1}{p_1} + \frac{p_1}{p}\right) 2 t_{op}.$$

The first derivative is

$$T_S'(p_1) = \frac{n(n-1)}{2}\left(\frac{1}{p \ln 2} + \frac{\log p_1}{p} - \frac{\log p}{p_1^2} + \frac{\log p_1}{p_1^2} - \frac{1}{p_1^2 \ln 2}\right) t_c + \frac{n(n-1)}{2}\left(\frac{1}{p} - \frac{1}{p_1^2}\right) 2 t_{op}.$$

For $p_1 = \sqrt{p}$ we have $T_S'(p_1) = 0$, since $\frac{1}{p} - \frac{1}{p_1^2} = \frac{1}{p} - \frac{1}{p} = 0$, $\frac{1}{p \ln 2} - \frac{1}{p_1^2 \ln 2} = 0$, and $\frac{\log p_1}{p} - \frac{\log p}{p_1^2} + \frac{\log p_1}{p_1^2} = 0$. The second derivative $T_S''(p_1)$ is positive for $p_1 = \sqrt{p}$ and, thus, there is a minimum at $p_1 = p_2 = \sqrt{p}$.

In summary, the analysis of the most influential parts of the parallel execution time of the Gaussian elimination has shown that $p_1 = p_2 = \sqrt{p}$, $b_1 = b_2 = 1$ is the best choice. For an implementation, the values for $p_1$ and $p_2$ have to be rounded to integer values.

7.2 Direct Methods for Linear Systems with Banded Structure

Large linear systems with banded structure often arise when discretizing partial differential equations. The coefficient matrix of a banded system is sparse, with non-zero elements in the main diagonal of the matrix and a few further diagonals. As a motivation, we first present the discretization of a two-dimensional Poisson equation resulting in such a banded system in Sect. 7.2.1. In Sect. 7.2.2, the solution methods recursive doubling and cyclic reduction are applied to the solution of tridiagonal systems, i.e., banded systems with only three non-zero diagonals, and the parallel implementation is discussed. General banded matrices are treated with cyclic reduction in Sect. 7.2.3, and the discretized Poisson equation is used as an example in Sect. 7.2.4.

7.2.1 Discretization of the Poisson Equation

As a typical example of an elliptic partial differential equation, we consider the Poisson equation with Dirichlet boundary conditions. This equation is often called the model problem, since its structure is simple but its numerical solution is very similar to that of many other, more complicated partial differential equations, see [60, 79, 166]. The two-dimensional Poisson equation has the form

$$-\Delta u(x, y) = f(x, y) \quad \text{for all } (x, y) \in \Omega \qquad (7.13)$$

with domain $\Omega \subset \mathbb{R}^2$. The function $u : \mathbb{R}^2 \to \mathbb{R}$ is the unknown solution function, and the function $f : \mathbb{R}^2 \to \mathbb{R}$ is the right-hand side, which is continuous in $\Omega$ and on its boundary. The operator $\Delta$ is the two-dimensional Laplace operator

$$\Delta = \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2}$$

containing the second partial derivatives with respect to $x$ and $y$. ($\partial/\partial x$ and $\partial/\partial y$ denote the first partial derivatives, and $\partial^2/\partial x^2$ and $\partial^2/\partial y^2$ the second partial derivatives, with respect to $x$ and $y$, respectively.)
Using this notation, the Poisson equation (7.13) can also be written as

$$-\frac{\partial^2 u}{\partial x^2} - \frac{\partial^2 u}{\partial y^2} = f(x, y).$$

The model problem (7.13) uses the unit square $\Omega = (0, 1) \times (0, 1)$ and assumes a Dirichlet boundary condition

$$u(x, y) = \varphi(x, y) \quad \text{for all } (x, y) \in \partial\Omega, \qquad (7.14)$$

where $\varphi$ is a given function and $\partial\Omega$ is the boundary of the domain $\Omega$, which is

$$\partial\Omega = \{(x, y) \mid 0 \leq x \leq 1,\ y = 0 \text{ or } y = 1\} \cup \{(x, y) \mid 0 \leq y \leq 1,\ x = 0 \text{ or } x = 1\}.$$

The boundary condition uniquely determines the solution $u$ of the model problem. Figure 7.7 (left) illustrates the domain and the boundary of the model problem. An example of the Poisson equation from electrostatics is the equation $\Delta u = -\rho/\varepsilon_0$, where $\rho$ is the charge density, $\varepsilon_0$ is a constant, and $u$ is the unknown potential to be determined [97].

[Fig. 7.7 Left: Poisson equation with Dirichlet boundary condition on the unit square $\Omega = (0, 1) \times (0, 1)$. Right: The numerical solution discretizes the Poisson equation on a mesh with equidistant mesh points with distance $1/(N+1)$. The mesh has $N^2$ inner mesh points and additional mesh points on the boundary.]

For the numerical solution of the equation $-\Delta u(x, y) = f(x, y)$, the method of finite differences can be used, which is based on a discretization of the domain $\Omega \cup \partial\Omega$ in both directions. The discretization is given by a regular mesh with $N + 2$ mesh points in the $x$-direction and in the $y$-direction, where $N$ points are in the inner part and two points are on the boundary. The distance between points in the $x$- or $y$-direction is $h = \frac{1}{N+1}$. The mesh points are

$$(x_i, y_j) = (ih, jh) \quad \text{for } i, j = 0, 1, \ldots, N+1.$$

The points on the boundary are the points with $x_0 = 0$, $y_0 = 0$, $x_{N+1} = 1$, or $y_{N+1} = 1$. The unknown solution function $u$ is determined at the points $(x_i, y_j)$ of this mesh, which means that values $u_{ij} := u(x_i, y_j)$ for $i, j = 0, 1, \ldots, N+1$ are to be found. For the inner part of the mesh, these values are determined by solving a linear equation system with $N^2$ equations, which is based on the Poisson equation in the following way. For each mesh point $(x_i, y_j)$, $i, j = 1, \ldots, N$, a Taylor expansion is used for the $x$- or $y$-direction. The Taylor expansion in the $x$-direction is

$$u(x_i + h, y_j) = u(x_i, y_j) + h \cdot u_x(x_i, y_j) + \frac{h^2}{2}\, u_{xx}(x_i, y_j) + \frac{h^3}{6}\, u_{xxx}(x_i, y_j) + O(h^4),$$
$$u(x_i - h, y_j) = u(x_i, y_j) - h \cdot u_x(x_i, y_j) + \frac{h^2}{2}\, u_{xx}(x_i, y_j) - \frac{h^3}{6}\, u_{xxx}(x_i, y_j) + O(h^4),$$

where $u_x$ denotes the partial derivative in the $x$-direction (i.e., $u_x = \partial u/\partial x$) and $u_{xx}$ denotes the second partial derivative in the $x$-direction (i.e., $u_{xx} = \partial^2 u/\partial x^2$). Adding these two Taylor expansions results in

$$u(x_i + h, y_j) + u(x_i - h, y_j) = 2 u(x_i, y_j) + h^2 u_{xx}(x_i, y_j) + O(h^4).$$

Analogously, the Taylor expansion for the $y$-direction can be used to get

$$u(x_i, y_j + h) + u(x_i, y_j - h) = 2 u(x_i, y_j) + h^2 u_{yy}(x_i, y_j) + O(h^4).$$

From the last two equations, an approximation for the Laplace operator $\Delta u = u_{xx} + u_{yy}$ at the mesh points can be derived:

$$\Delta u(x_i, y_j) = -\frac{1}{h^2}\,(4 u_{ij} - u_{i+1,j} - u_{i-1,j} - u_{i,j+1} - u_{i,j-1}),$$

where the higher order terms $O(h^4)$ are neglected. This approximation uses the mesh point $(x_i, y_j)$ itself and its four neighbor points; see Fig. 7.8. This pattern is known as the five-point stencil.
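To make the stencil concrete, the following sketch applies the discrete negative Laplacian to a grid function u that stores the boundary values in the outer layer of an $(N+2) \times (N+2)$ array. This is our own illustration; the grid layout and names are chosen for this example and are not taken from the book.

```c
/* Sketch of the five-point stencil: r approximates -Laplace(u) at the
 * N x N inner mesh points; u includes the boundary layer, so all four
 * neighbors of an inner point are always available. */
#define N 8

void apply_five_point(double u[N + 2][N + 2], double r[N + 2][N + 2],
                      double h)
{
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            /* discrete -Laplace(u) at mesh point (i, j) */
            r[i][j] = (4.0 * u[i][j] - u[i + 1][j] - u[i - 1][j]
                       - u[i][j + 1] - u[i][j - 1]) / (h * h);
}
```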
Using the approximation of $\Delta u$ and the notation $f_{ij} := f(x_i, y_j)$ for the values of the right-hand side, the discretized Poisson equation, or five-point formula, results:

$$\frac{1}{h^2}\,(4 u_{ij} - u_{i+1,j} - u_{i-1,j} - u_{i,j+1} - u_{i,j-1}) = f_{ij} \qquad (7.15)$$

for $1 \leq i, j \leq N$.

[Fig. 7.8 Five-point stencil resulting from the discretization of the Laplace operator with a finite difference scheme. The computation at one mesh point uses values at the four neighbor mesh points.]

For the points on the boundary, the values of $u_{ij}$ result from the boundary condition (7.14) and are given by

$$u_{ij} = \varphi(x_i, y_j) \qquad (7.16)$$

for $i = 0, N+1$ and $j = 0, \ldots, N+1$, or $j = 0, N+1$ and $i = 0, \ldots, N+1$. The inner mesh points which are immediate neighbors of the boundary, i.e., the mesh points with $i = 1$, $i = N$, $j = 1$, or $j = N$, use the boundary values in their five-point stencil; the four mesh points in the corners use two boundary values, and all other such points use one boundary value. For all points with $i = 1$, $i = N$, $j = 1$, or $j = N$, the values of $u_{ij}$ in the formulas (7.15) are replaced by the values (7.16). For the mesh point $(x_1, y_1)$, for example, the equation

$$\frac{1}{h^2}\,(4 u_{11} - u_{21} - u_{12}) = f_{11} + \frac{1}{h^2}\,\varphi(0, y_1) + \frac{1}{h^2}\,\varphi(x_1, 0)$$

results. The five-point formula (7.15) including boundary values represents a linear equation system with $N^2$ equations, $N^2$ unknown values, and a coefficient matrix $A \in \mathbb{R}^{N^2 \times N^2}$. In order to write the equation system (7.15) with boundary values (7.16) in matrix form $Az = d$, the $N^2$ unknowns $u_{ij}$, $i, j = 1, \ldots, N$, are arranged in row-oriented order in a one-dimensional vector $z$ of size $n = N^2$, which has the form

$$z = (u_{11}, u_{21}, \ldots, u_{N1}, u_{12}, u_{22}, \ldots, u_{N2}, \ldots, u_{1N}, u_{2N}, \ldots, u_{NN}).$$

The mapping of values $u_{ij}$ to vector elements $z_k$ is

$$z_k := u_{ij} \quad \text{with } k = i + (j-1)N \text{ for } i, j = 1, \ldots, N.$$

Using the vector $z$, the five-point formula has the form

$$\frac{1}{h^2}\,\big(4 z_{i+(j-1)N} - z_{i+1+(j-1)N} - z_{i-1+(j-1)N} - z_{i+jN} - z_{i+(j-2)N}\big) = d_{i+(j-1)N}$$

with $d_{i+(j-1)N} = f_{ij}$ and a corresponding mapping of the values $f_{ij}$ to a one-dimensional vector $d$. Replacing the indices by $k = 1, \ldots, n$ with $k = i + (j-1)N$ results in

$$\frac{1}{h^2}\,(4 z_k - z_{k+1} - z_{k-1} - z_{k+N} - z_{k-N}) = d_k. \qquad (7.17)$$

Thus, up to the factor $1/h^2$, row $k$ of the coefficient matrix contains five non-zero entries, which are $a_{kk} = 4$ and $a_{k,k+1} = a_{k,k-1} = a_{k,k+N} = a_{k,k-N} = -1$. The building of the vector $d$ and of the coefficient matrix $A = (a_{ij})$, $i, j = 1, \ldots, N^2$, can be performed by the following algorithm, see [79]. The loops over $i$ and $j$, $i, j = 1, \ldots, N$, visit the mesh points $(i, j)$, and each visit builds one row of the matrix $A$ of size $N^2 \times N^2$. When $(i, j)$ is an inner point of the mesh, i.e., $i, j \neq 1, N$, the corresponding row of $A$ contains five non-zero elements at the positions $k$, $k+1$, $k-1$, $k+N$, $k-N$ for $k = i + (j-1)N$. When $(i, j)$ is at the boundary of the inner part, i.e., $i = 1$, $j = 1$, $i = N$, or $j = N$, the boundary values given by $\varphi$ are added to the right-hand side instead.
```c
double f(double x, double y);    /* right-hand side of the Poisson equation */
double phi(double x, double y);  /* boundary function from condition (7.14) */

/* Algorithm for building the matrix A and the vector d
   (1-based indexing; all entries of a are assumed initialized with 0) */
void build_system(int N, double h,
                  double a[N * N + 1][N * N + 1], double d[N * N + 1])
{
  const double hh = h * h;
  for (int j = 1; j <= N; j++)
    for (int i = 1; i <= N; i++) {
      /* build d_k and row k of A with k = i + (j-1)N */
      int k = i + (j - 1) * N;
      a[k][k] = 4 / hh;
      d[k] = f(i * h, j * h);                                  /* f_ij */
      if (i > 1) a[k][k - 1] = -1 / hh; else d[k] += phi(0, j * h) / hh;
      if (i < N) a[k][k + 1] = -1 / hh; else d[k] += phi(1, j * h) / hh;
      if (j > 1) a[k][k - N] = -1 / hh; else d[k] += phi(i * h, 0) / hh;
      if (j < N) a[k][k + N] = -1 / hh; else d[k] += phi(i * h, 1) / hh;
    }
}
```

The linear equation system resulting from this algorithm has the block-tridiagonal structure

$$\frac{1}{h^2} \begin{pmatrix} B & -I & & 0 \\ -I & B & \ddots & \\ & \ddots & \ddots & -I \\ 0 & & -I & B \end{pmatrix} \cdot z = d, \qquad (7.18)$$

where $B \in \mathbb{R}^{N \times N}$ is the tridiagonal matrix with diagonal entries 4 and entries $-1$ on the first sub- and superdiagonal, and $I$ denotes the $N \times N$ identity matrix.
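The block-tridiagonal structure (7.18) can be made visible with a small standalone check. This is our own illustration, not from the book; it repeats the assembly loop so the example is self-contained, omits the right-hand side, and prints the nonzero pattern of $A$ for $N = 3$.

```c
/* Print the nonzero pattern of A for N = 3: '4' marks the diagonal
 * entries 4/h^2, '-' marks the entries -1/h^2. The bands at distance
 * 1 and N show the B blocks and -I blocks of Formula (7.18). */
#include <stdio.h>

#define N 3
#define NN (N * N)

int main(void)
{
    static double a[NN + 1][NN + 1];   /* 1-based indexing, zero-initialized */
    double h = 1.0 / (N + 1);

    for (int j = 1; j <= N; j++)       /* assembly loop as in build_system */
        for (int i = 1; i <= N; i++) {
            int k = i + (j - 1) * N;
            a[k][k] = 4 / (h * h);
            if (i > 1) a[k][k - 1] = -1 / (h * h);
            if (i < N) a[k][k + 1] = -1 / (h * h);
            if (j > 1) a[k][k - N] = -1 / (h * h);
            if (j < N) a[k][k + N] = -1 / (h * h);
        }

    for (int r = 1; r <= NN; r++) {
        for (int c = 1; c <= NN; c++)
            putchar(a[r][c] == 0.0 ? '.' : (r == c ? '4' : '-'));
        putchar('\n');
    }
    return 0;
}
```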


