$$ \hat{A} \cdot \hat{x} \;=\; \begin{pmatrix} D_R & F \\ E & D_B \end{pmatrix} \cdot \begin{pmatrix} \hat{x}_R \\ \hat{x}_B \end{pmatrix} \;=\; \begin{pmatrix} \hat{b}_1 \\ \hat{b}_2 \end{pmatrix}, \qquad (7.46) $$

where $\hat{x}_R$ denotes the subvector of size $n_R$ of the first (red) unknowns and $\hat{x}_B$ denotes the subvector of size $n_B$ of the last (black) unknowns. The right-hand side $b$ of the original equation system is reordered accordingly and has subvector $\hat{b}_1$ for the first $n_R$ equations and subvector $\hat{b}_2$ for the last $n_B$ equations. The matrix $\hat{A}$ consists of four blocks $D_R \in \mathbb{R}^{n_R \times n_R}$, $D_B \in \mathbb{R}^{n_B \times n_B}$, $E \in \mathbb{R}^{n_B \times n_R}$, and $F \in \mathbb{R}^{n_R \times n_B}$. The submatrices $D_R$ and $D_B$ are diagonal matrices and the submatrices $E$ and $F$ are sparse banded matrices. The structure of the original matrix of the discretized Poisson equation in Fig. 7.9 in Sect. 7.2.1 is thus transformed into a matrix $\hat{A}$ with the structure shown in Fig. 7.17(c).

The diagonal form of the matrices $D_R$ and $D_B$ shows that a red unknown $\hat{x}_i$, $i \in \{1, \ldots, n_R\}$, does not depend on the other red unknowns and that a black unknown $\hat{x}_j$, $j \in \{n_R+1, \ldots, n_R+n_B\}$, does not depend on the other black unknowns. The matrices $E$ and $F$ specify the dependences between red and black unknowns. Row $i$ of matrix $F$ specifies the dependences of the red unknown $\hat{x}_i$ ($i \le n_R$) on the black unknowns $\hat{x}_j$, $j = n_R+1, \ldots, n_R+n_B$. Analogously, a row of matrix $E$ specifies the dependences of the corresponding black unknown on the red unknowns.

The transformation of the original linear equation system $Ax = b$ into the equivalent system $\hat{A}\hat{x} = \hat{b}$ can be expressed by a permutation $\pi: \{1, \ldots, n\} \to \{1, \ldots, n\}$. The permutation maps a node $i \in \{1, \ldots, n\}$ of the rowwise numbering onto the number $\pi(i)$ of the red–black numbering in the following way:

$$ x_i = \hat{x}_{\pi(i)}, \quad b_i = \hat{b}_{\pi(i)}, \quad i = 1, \ldots, n, $$

or $x = P\hat{x}$ and $b = P\hat{b}$ with a permutation matrix $P = (P_{ij})_{i,j=1,\ldots,n}$,

$$ P_{ij} = \begin{cases} 1 & \text{if } j = \pi(i), \\ 0 & \text{otherwise.} \end{cases} $$

For the matrices $A$ and $\hat{A}$ the equation $\hat{A} = P^T A P$ holds. Since for a permutation matrix the inverse is equal to the transposed matrix, i.e., $P^T = P^{-1}$, this leads to

$$ \hat{A}\hat{x} = P^T A P P^T x = P^T A x = P^T b = \hat{b}. $$
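To make the red–black numbering concrete, the following sketch computes $\pi(i)$ for all mesh points of an $N \times N$ mesh ($n = N \cdot N$) with rowwise numbering. The coloring convention used here (a mesh point is red when the sum of its row and column indices is even) and all identifiers are illustrative assumptions; the actual coloring is the one fixed in Fig. 7.17.

/* Sketch: computation of the red-black numbers pi(i), i = 1,...,n, for an
 * N x N mesh (n = N*N) with rowwise numbering. Red points are numbered
 * 1,...,n_R, black points n_R+1,...,n. The coloring test (row+column even)
 * is an assumed convention. */
void redblack_permutation(int N, int *pi /* array of length n+1, used 1-indexed */)
{
    int n = N * N, n_R = 0;
    /* first pass: count the red mesh points to obtain n_R */
    for (int t = 0; t < n; t++)
        if (((t / N) + (t % N)) % 2 == 0)
            n_R++;
    int next_red = 1, next_black = n_R + 1;
    /* second pass: traverse the points in rowwise order and assign numbers */
    for (int t = 0; t < n; t++) {
        int i = t + 1;                        /* rowwise number of the mesh point */
        if (((t / N) + (t % N)) % 2 == 0)
            pi[i] = next_red++;               /* red point gets the next red number  */
        else
            pi[i] = next_black++;             /* black point gets the next black one */
    }
}

With this numbering, component $x_i$ of the rowwise system corresponds to component $\hat{x}_{\pi(i)}$ of the reordered system, as stated above.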
The easiest way to exploit the red–black ordering is to use an iterative solution method as discussed earlier in this section.

7.3.5.1 Gauss–Seidel Iteration for Red–Black Systems

The solution of the linear equation system (7.46) with the Gauss–Seidel iteration is based on a splitting of the matrix $\hat{A}$ of the form

$$ \hat{A} = \hat{D} - \hat{L} - \hat{U}, \qquad \hat{D}, \hat{L}, \hat{U} \in \mathbb{R}^{n \times n}, $$
$$ \hat{D} = \begin{pmatrix} D_R & 0 \\ 0 & D_B \end{pmatrix}, \quad \hat{L} = \begin{pmatrix} 0 & 0 \\ -E & 0 \end{pmatrix}, \quad \hat{U} = \begin{pmatrix} 0 & -F \\ 0 & 0 \end{pmatrix}, $$

with a diagonal matrix $\hat{D}$, a lower triangular matrix $\hat{L}$, and an upper triangular matrix $\hat{U}$. The matrix $0$ denotes a block in which all entries are 0. With this notation, iteration step $k$ of the Gauss–Seidel method is given by

$$ \begin{pmatrix} D_R & 0 \\ E & D_B \end{pmatrix} \cdot \begin{pmatrix} x_R^{(k+1)} \\ x_B^{(k+1)} \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} - \begin{pmatrix} 0 & F \\ 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} x_R^{(k)} \\ x_B^{(k)} \end{pmatrix} \qquad (7.47) $$

for $k = 1, 2, \ldots$ According to equation system (7.46), the iteration vector is split into two subvectors $x_R^{(k+1)}$ and $x_B^{(k+1)}$ for the red and the black unknowns, respectively. (To simplify the notation, we use $x_R$ instead of $\hat{x}_R$ in the following discussion of the red–black ordering.) The linear equation system (7.47) can be written in vector notation for the vectors $x_R^{(k+1)}$ and $x_B^{(k+1)}$ in the form

$$ D_R \cdot x_R^{(k+1)} = b_1 - F \cdot x_B^{(k)} \quad \text{for } k = 1, 2, \ldots, \qquad (7.48) $$
$$ D_B \cdot x_B^{(k+1)} = b_2 - E \cdot x_R^{(k+1)} \quad \text{for } k = 1, 2, \ldots, \qquad (7.49) $$

in which the decoupling of the red subvector $x_R^{(k+1)}$ and the black subvector $x_B^{(k+1)}$ becomes obvious: In Eq. (7.48) the new red iteration vector $x_R^{(k+1)}$ depends only on the previous black iteration vector $x_B^{(k)}$, and in Eq. (7.49) the new black iteration vector $x_B^{(k+1)}$ depends only on the red iteration vector $x_R^{(k+1)}$ computed before in the same iteration step. There is no additional dependence. Thus, the potential degree of parallelism in Eq. (7.48) or (7.49) is similar to the potential parallelism in the Jacobi iteration. In each iteration step $k$, the components of $x_R^{(k+1)}$ according to Eq. (7.48) can be computed independently, since the vector $x_B^{(k)}$ is known, which leads to a potential parallelism with $p = n_R$ processors. Afterwards, the vector $x_R^{(k+1)}$ is known and the components of the vector $x_B^{(k+1)}$ can be computed independently according to Eq. (7.49), leading to a potential parallelism of $p = n_B$ processors.

For a parallel implementation, we consider the Gauss–Seidel iteration for the red–black ordering (7.48) and (7.49) written out in a component-based form:

$$ \bigl(x_R^{(k+1)}\bigr)_i = \frac{1}{\hat{a}_{ii}} \Bigl( \hat{b}_i - \sum_{j \in N(i)} \hat{a}_{ij} \cdot \bigl(x_B^{(k)}\bigr)_j \Bigr), \quad i = 1, \ldots, n_R, $$
$$ \bigl(x_B^{(k+1)}\bigr)_i = \frac{1}{\hat{a}_{i+n_R,\, i+n_R}} \Bigl( \hat{b}_{i+n_R} - \sum_{j \in N(i)} \hat{a}_{i+n_R,\, j} \cdot \bigl(x_R^{(k+1)}\bigr)_j \Bigr), \quad i = 1, \ldots, n_B. $$

The set $N(i)$ denotes the set of adjacent mesh points for mesh point $i$. According to the red–black ordering, the set $N(i)$ contains only black mesh points for a red point $i$ and vice versa. An implementation on a shared memory machine can employ at most $p = n_R$ or $p = n_B$ processors. There are no access conflicts for the parallel computation of $x_R^{(k)}$ or $x_B^{(k)}$, but a barrier synchronization is needed between the two computation phases.

The implementation on a distributed memory machine requires a distribution of computation and data. As discussed before for the parallel SOR method, it is useful to distribute the data according to the mesh structure such that the processor $P_q$ to which mesh point $i$ is assigned is responsible for the computation or update of the corresponding component of the approximation vector. In a row-oriented distribution of a square mesh with $\sqrt{n} \times \sqrt{n} = n$ mesh points to $p$ processors, $\sqrt{n}/p$ rows of the mesh are assigned to each processor $P_q$, $q \in \{1, \ldots, p\}$. In the red–black coloring this means that each processor owns $\frac{1}{2}\frac{n}{p}$ red and $\frac{1}{2}\frac{n}{p}$ black mesh points. (For simplicity we assume that $\sqrt{n}$ is a multiple of $p$.) Thus, the mesh points

$$ (q-1) \cdot \frac{n_R}{p} + 1, \;\ldots,\; q \cdot \frac{n_R}{p} \quad \text{for } q = 1, \ldots, p $$

and

$$ (q-1) \cdot \frac{n_B}{p} + 1 + n_R, \;\ldots,\; q \cdot \frac{n_B}{p} + n_R \quad \text{for } q = 1, \ldots, p $$

are assigned to processor $P_q$.

Figure 7.18 shows an SPMD program implementing the Gauss–Seidel iteration with red–black ordering. The coefficient matrix A is stored according to the pointer-based scheme introduced earlier in Fig. 7.3. After the computation of the red components xr, a function collect_elements(xr) distributes the red vector to all other processors for the next computation. Analogously, the black vector xb is distributed after its computation. The function collect_elements() can be implemented by a multi-broadcast operation.

Fig. 7.18 Program fragment for the parallel implementation of the Gauss–Seidel method with the red–black ordering. The arrays xr and xb denote the unknowns corresponding to the red or black mesh points. The processor number of the executing processor is stored in me.
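The program of Fig. 7.18 is not reproduced here. The following sketch indicates how the two computation phases and the collect_elements() calls fit together in such an SPMD fragment. The storage scheme assumed for the sparse rows (adjacency lists nbr/ar for the rows of $F$ and nbb/ab for the rows of $E$, diagonals diag_r/diag_b) and the index bounds my_first_r, ..., my_last_b are illustrative assumptions and differ from the pointer-based scheme of Fig. 7.3.

/* assumed prototype of the multi-broadcast operation mentioned in the text */
void collect_elements(double *v);

/* One Gauss-Seidel step for the red-black system in SPMD style (sketch). */
void gs_redblack_step(int my_first_r, int my_last_r, int my_first_b, int my_last_b,
                      int *deg_r, int **nbr, double **ar, double *diag_r, double *br,
                      int *deg_b, int **nbb, double **ab, double *diag_b, double *bb,
                      double *xr, double *xb)
{
    /* red phase, Eq. (7.48): the new red values depend only on old black values */
    for (int i = my_first_r; i <= my_last_r; i++) {
        double s = br[i];
        for (int l = 0; l < deg_r[i]; l++)
            s -= ar[i][l] * xb[nbr[i][l]];   /* black neighbors of red point i */
        xr[i] = s / diag_r[i];
    }
    collect_elements(xr);                    /* multi-broadcast of the new red values */

    /* black phase, Eq. (7.49): the new black values use the new red values */
    for (int i = my_first_b; i <= my_last_b; i++) {
        double s = bb[i];
        for (int l = 0; l < deg_b[i]; l++)
            s -= ab[i][l] * xr[nbb[i][l]];   /* red neighbors of black point i */
        xb[i] = s / diag_b[i];
    }
    collect_elements(xb);                    /* multi-broadcast of the new black values */
}

The barrier-like synchronization between the two phases is provided implicitly by the collective collect_elements() operation, which corresponds to the multi-broadcast mentioned above.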
7.3.5.2 SOR Method for Red–Black Systems

An SOR method for the linear equation system (7.46) with relaxation parameter $\omega$ can be derived from the Gauss–Seidel computation (7.48) and (7.49) by using the combination of the new and the old approximation vectors as introduced in Formula (7.41). One step of the SOR method then has the form

$$ \tilde{x}_R^{(k+1)} = D_R^{-1} \cdot b_1 - D_R^{-1} \cdot F \cdot x_B^{(k)}, $$
$$ \tilde{x}_B^{(k+1)} = D_B^{-1} \cdot b_2 - D_B^{-1} \cdot E \cdot x_R^{(k+1)}, $$
$$ x_R^{(k+1)} = x_R^{(k)} + \omega \bigl( \tilde{x}_R^{(k+1)} - x_R^{(k)} \bigr), \qquad (7.50) $$
$$ x_B^{(k+1)} = x_B^{(k)} + \omega \bigl( \tilde{x}_B^{(k+1)} - x_B^{(k)} \bigr), \qquad k = 1, 2, \ldots $$

The corresponding splitting of matrix $\hat{A}$ is $\hat{A} = \frac{1}{\omega}\hat{D} - \hat{L} - \hat{U} - \frac{1-\omega}{\omega}\hat{D}$ with the matrices $\hat{D}$, $\hat{L}$, $\hat{U}$ introduced above. This can be written using block matrices:

$$ \begin{pmatrix} D_R & 0 \\ \omega E & D_B \end{pmatrix} \cdot \begin{pmatrix} x_R^{(k+1)} \\ x_B^{(k+1)} \end{pmatrix} = (1-\omega) \begin{pmatrix} D_R & 0 \\ 0 & D_B \end{pmatrix} \cdot \begin{pmatrix} x_R^{(k)} \\ x_B^{(k)} \end{pmatrix} - \omega \begin{pmatrix} 0 & F \\ 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} x_R^{(k)} \\ x_B^{(k)} \end{pmatrix} + \omega \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}. \qquad (7.51) $$

For a parallel implementation the component form of this system is used. For the convergence results, on the other hand, the matrix form and the iteration matrix have to be considered. Since the iteration matrix of the SOR method for a given linear equation system $Ax = b$ with a certain order of the equations and the iteration matrix of the SOR method for the red–black system $\hat{A}\hat{x} = \hat{b}$ are different, convergence results cannot be transferred. The iteration matrix of the SOR method with red–black ordering is

$$ \hat{S}_\omega = \Bigl( \frac{1}{\omega}\hat{D} - \hat{L} \Bigr)^{-1} \Bigl( \frac{1-\omega}{\omega}\hat{D} + \hat{U} \Bigr). $$

For convergence of the method it has to be shown that $\rho(\hat{S}_\omega) < 1$ for the spectral radius of $\hat{S}_\omega$ and $\omega \in \mathbb{R}$. In general, the convergence cannot be derived from the convergence of the SOR method for the original system, since $P^T S_\omega P$ is not identical to $\hat{S}_\omega$, although $P^T A P = \hat{A}$ holds. However, for the specific case of the model problem, i.e., the discretized Poisson equation, the convergence can be shown: Using the equality $P^T A P = \hat{A}$, it follows that $\hat{A}$ is symmetric and positive definite and, thus, the method converges for the model problem, see [61].

Figure 7.19 shows a parallel SPMD implementation of the SOR method for the red–black ordered discretized Poisson equation. The elements of the coefficient matrix are coded as constants. The unknowns are stored in a two-dimensional structure corresponding to the two-dimensional mesh and not as a vector, so that the unknowns appear as x[i][j] in the program. The mesh points and the corresponding computations are distributed among the processors; the mesh points belonging to a specific processor are stored in myregion. The color red or black of a mesh point (i, j) is an additional attribute which can be retrieved by the functions is_red() and is_black(). The value f[i][j] denotes the discretized right-hand side of the Poisson equation as described earlier, see Eq. (7.15). The functions exchange_red_borders() and exchange_black_borders() exchange the data of the red or black mesh points between neighboring processors.

Fig. 7.19 Program fragment of a parallel SOR method for a red–black ordered discretized Poisson equation.
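Since Fig. 7.19 is not reproduced here, the following sketch shows what the two relaxation phases of such a fragment might look like for a row-oriented distribution of the mesh. The names x, f, is_red, is_black, exchange_red_borders, and exchange_black_borders are the ones mentioned above; their exact signatures, the coloring convention, the loop bounds my_first_row/my_last_row, and the sign of the $h^2 \cdot$ f[i][j] term (which follows from the convention of Eq. (7.15)) are assumptions for illustration.

/* assumed communication stubs: exchange boundary rows with neighboring processors */
void exchange_red_borders(double **x);
void exchange_black_borders(double **x);

/* assumed coloring convention of the mesh points */
int is_red(int i, int j)   { return (i + j) % 2 == 0; }
int is_black(int i, int j) { return (i + j) % 2 != 0; }

/* One SOR step for the red-black ordered discretized Poisson equation (sketch).
 * x and f are (N+2) x (N+2) arrays including the boundary values, h is the
 * mesh size, omega the relaxation parameter, and my_first_row/my_last_row
 * delimit the mesh rows owned by the executing processor. */
void sor_redblack_step(double **x, double **f, int N, double h, double omega,
                       int my_first_row, int my_last_row)
{
    for (int i = my_first_row; i <= my_last_row; i++)        /* red phase */
        for (int j = 1; j <= N; j++)
            if (is_red(i, j))
                x[i][j] = (1.0 - omega) * x[i][j]
                        + omega * 0.25 * (x[i-1][j] + x[i+1][j]
                                        + x[i][j-1] + x[i][j+1]
                                        + h * h * f[i][j]);
    exchange_red_borders(x);      /* neighboring processors need the new red border rows */

    for (int i = my_first_row; i <= my_last_row; i++)        /* black phase */
        for (int j = 1; j <= N; j++)
            if (is_black(i, j))
                x[i][j] = (1.0 - omega) * x[i][j]
                        + omega * 0.25 * (x[i-1][j] + x[i+1][j]
                                        + x[i][j-1] + x[i][j+1]
                                        + h * h * f[i][j]);
    exchange_black_borders(x);    /* neighboring processors need the new black border rows */
}

The exchanges are placed after each color phase so that the updated border rows are available to the neighboring processors before the phase that reads them; the black values read in the red phase were exchanged at the end of the previous iteration step.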
7.4 Conjugate Gradient Method

The conjugate gradient method or CG method is a solution method for linear equation systems $Ax = b$ with a symmetric and positive definite matrix $A \in \mathbb{R}^{n \times n}$, which has been introduced in [86]. ($A$ is symmetric if $a_{ij} = a_{ji}$ and positive definite if $x^T A x > 0$ for all $x \in \mathbb{R}^n$ with $x \neq 0$.) The CG method builds up a solution $x^* \in \mathbb{R}^n$ in at most $n$ steps in the absence of roundoff errors. In the presence of roundoff errors, more than $n$ steps may be needed to obtain a good approximation of the exact solution $x^*$. For sparse matrices, a good approximation of the solution can be achieved in fewer than $n$ steps, even with roundoff errors [150]. In practice, the CG method is often used as a preconditioned CG method, which combines the CG method with a preconditioner [154]. Parallel implementations are discussed in [72, 133, 134, 154]; [155] gives an overview. In this section, we present the basic CG method and parallel implementations according to [23, 71, 166].

7.4.1 Sequential CG Method

The CG method exploits an equivalence between the solution of a linear equation system and the minimization of a function. More precisely, the solution $x^*$ of the linear equation system $Ax = b$, $A \in \mathbb{R}^{n \times n}$, $b \in \mathbb{R}^n$, is the minimum of the function $\Phi: M \subset \mathbb{R}^n \to \mathbb{R}$ with

$$ \Phi(x) = \frac{1}{2} x^T A x - b^T x, \qquad (7.52) $$

if the matrix $A$ is symmetric and positive definite. A simple method to determine the minimum of the function $\Phi$ is the method of the steepest gradient [71], which uses the negative gradient. For a given point $x_c \in \mathbb{R}^n$ the function decreases most rapidly in the direction of the negative gradient. The method computes the following two steps:

(a) Computation of the negative gradient $d_c \in \mathbb{R}^n$ at point $x_c$:

$$ d_c = -\operatorname{grad}\,\Phi(x_c) = -\Bigl( \frac{\partial}{\partial x_1}\Phi(x_c), \ldots, \frac{\partial}{\partial x_n}\Phi(x_c) \Bigr) = b - A x_c. $$

(b) Determination of the minimum of $\Phi$ in the set $\{x_c + t d_c \mid t \ge 0\} \cap M$, which forms a line in $\mathbb{R}^n$ (line search). This is done by inserting $x_c + t d_c$ into Formula (7.52). Using $d_c = b - A x_c$ and the symmetry of matrix $A$, we get

$$ \Phi(x_c + t d_c) = \Phi(x_c) - t\, d_c^T d_c + \frac{1}{2} t^2\, d_c^T A d_c. \qquad (7.53) $$

The minimum of this function with respect to $t \in \mathbb{R}$ can be determined using its derivative with respect to $t$: setting $-d_c^T d_c + t\, d_c^T A d_c = 0$ yields the minimum

$$ t_c = \frac{d_c^T d_c}{d_c^T A d_c}. \qquad (7.54) $$

Steps (a) and (b) of the method of the steepest gradient are used to create a sequence of vectors $x_k$, $k = 0, 1, 2, \ldots$, with $x_0 \in \mathbb{R}^n$ and $x_{k+1} = x_k + t_k d_k$. The sequence $(\Phi(x_k))_{k=0,1,2,\ldots}$ is monotonically decreasing, which can be seen by inserting Formula (7.54) into Formula (7.53). The sequence converges toward the minimum, but the convergence might be slow [71].

The CG method uses a technique to determine the minimum which exploits orthogonal search directions in the sense of conjugate or A-orthogonal vectors $d_k$. For a given matrix $A$, which is symmetric and non-singular, two vectors $x, y \in \mathbb{R}^n$ are called conjugate or A-orthogonal if $x^T A y = 0$. If $A$ is positive definite, $k$ pairwise conjugate vectors $d_0, \ldots, d_{k-1}$ (with $d_i \neq 0$, $i = 0, \ldots, k-1$, and $k \le n$) are linearly independent [23]. Thus, the unknown solution vector $x^*$ of $Ax = b$ can be represented as a linear combination of the conjugate vectors $d_0, \ldots, d_{n-1}$, i.e.,

$$ x^* = \sum_{k=0}^{n-1} t_k d_k. \qquad (7.55) $$

Since the vectors are A-orthogonal, $d_k^T A x^* = \sum_{l=0}^{n-1} t_l\, d_k^T A d_l = t_k\, d_k^T A d_k$. This leads to

$$ t_k = \frac{d_k^T A x^*}{d_k^T A d_k} = \frac{d_k^T b}{d_k^T A d_k} $$

for the coefficients $t_k$. Thus, when the conjugate vectors are known, the values $t_k$, $k = 0, \ldots, n-1$, can be computed from the right-hand side $b$. The algorithm for the CG method uses a representation

$$ x^* = x_0 + \sum_{i=0}^{n-1} \alpha_i d_i \qquad (7.56) $$

of the unknown solution vector $x^*$ as the sum of a starting vector $x_0$ and a term $\sum_{i=0}^{n-1} \alpha_i d_i$ to be computed. The second term is computed recursively by

$$ x_{k+1} = x_k + \alpha_k d_k, \quad k = 1, 2, \ldots, \quad \text{with} \qquad (7.57) $$
$$ \alpha_k = \frac{-g_k^T d_k}{d_k^T A d_k} \quad \text{and} \quad g_k = A x_k - b. \qquad (7.58) $$

Formulas (7.57) and (7.58) determine $x^*$ according to Eq. (7.56) by computing $\alpha_i$ and adding $\alpha_i d_i$ in each step. Thus, the solution is computed after at most $n$ steps. If not all directions $d_k$ are needed for $x^*$, fewer than $n$ steps are required. Algorithms implementing the CG method do not choose the conjugate vectors $d_0, \ldots, d_{n-1}$ before computing the vectors $x_0, \ldots, x_{n-1}$, but compute the next conjugate vector from the given gradient $g_k$ by adding a correction term. The basic algorithm for the CG method is given in Fig. 7.20.

Fig. 7.20 Algorithm of the CG method. (1) and (2) compute the value $\alpha_k$ according to Eq. (7.58). The vector $w_k$ is used for the intermediate result $A d_k$. (3) is the computation given in Formula (7.57). (4) computes $g_{k+1}$ for the next iteration step according to Formula (7.58) in a recursive way: $g_{k+1} = A x_{k+1} - b = A(x_k + \alpha_k d_k) - b = g_k + \alpha_k A d_k$. This vector $g_{k+1}$ represents the error between the approximation $x_{k+1}$ and the exact solution. (5) and (6) compute the next vector $d_{k+1}$ of the set of conjugate gradients.
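Since Fig. 7.20 itself is not reproduced here, the following sketch realizes the computation steps (1)–(6) described in the caption for a dense matrix stored row-major. The coefficient $\alpha_k$ is computed as $\gamma_k / (d_k^T w_k)$ with $\gamma_k = g_k^T g_k$, which is equivalent to Eq. (7.58) for conjugate directions; the initialization with $d_0 = -g_0$, the stopping criterion, and the memory management are illustrative assumptions.

#include <stdlib.h>
#include <math.h>

static double dot(int n, const double *x, const double *y)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i] * y[i];
    return s;
}

/* multiplies the dense row-major n x n matrix A with v, result in w */
static void matvec(int n, const double *A, const double *v, double *w)
{
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int j = 0; j < n; j++) s += A[i * n + j] * v[j];
        w[i] = s;
    }
}

/* CG method for A x = b with symmetric positive definite A; x holds the
 * starting vector on entry and the computed approximation on exit. */
void cg(int n, const double *A, const double *b, double *x, double tol, int maxit)
{
    double *g = malloc(n * sizeof(double));   /* gradient g_k = A x_k - b */
    double *d = malloc(n * sizeof(double));   /* search direction d_k     */
    double *w = malloc(n * sizeof(double));   /* intermediate w_k = A d_k */
    matvec(n, A, x, g);
    for (int i = 0; i < n; i++) { g[i] -= b[i]; d[i] = -g[i]; }
    double gamma = dot(n, g, g);              /* gamma_k = g_k^T g_k      */
    for (int k = 0; k < maxit && sqrt(gamma) > tol; k++) {
        matvec(n, A, d, w);                                 /* (1) w_k = A d_k      */
        double alpha = gamma / dot(n, d, w);                /* (2) alpha_k          */
        for (int i = 0; i < n; i++) x[i] += alpha * d[i];   /* (3) x_{k+1}          */
        for (int i = 0; i < n; i++) g[i] += alpha * w[i];   /* (4) g_{k+1}          */
        double gamma_new = dot(n, g, g);                    /* (5) gamma_{k+1}      */
        double beta = gamma_new / gamma;
        for (int i = 0; i < n; i++) d[i] = -g[i] + beta * d[i];  /* (6) d_{k+1}     */
        gamma = gamma_new;                    /* reused in (2) of the next step      */
    }
    free(g); free(d); free(w);
}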
7.4.2 Parallel CG Method

The parallel implementation of the CG method is based on the algorithm given in Fig. 7.20. Each iteration step of this algorithm consists of the following basic vector and matrix operations.

7.4.2.1 Basic Operations of the CG Algorithm

The basic operations of the CG algorithm are

(1) a matrix–vector multiplication $A d_k$,
(2) two scalar products $g_k^T g_k$ and $d_k^T w_k$,
(3) a so-called axpy-operation $x_k + \alpha_k d_k$ (the name axpy comes from "a x plus y", describing the computation),
(4) an axpy-operation $g_k + \alpha_k w_k$,
(5) a scalar product $g_{k+1}^T g_{k+1}$, and
(6) an axpy-operation $-g_{k+1} + \beta_k d_k$.

The result of $g_k^T g_k$ is needed in two consecutive steps, and so the computation of one scalar product can be avoided by storing $g_k^T g_k$ in the scalar value $\gamma_k$. Since there are mainly one matrix–vector product and scalar products, a parallel implementation can be based on parallel versions of these operations.

Like the CG method, many algorithms from linear algebra are built up from basic operations like matrix–vector operations or axpy-operations, and efficient implementations of these basic operations lead to efficient implementations of the entire algorithms. The BLAS (Basic Linear Algebra Subprograms) library offers efficient implementations for a large set of basic operations. This includes many axpy-operations, which denote that a vector $x$ is multiplied by a scalar value $a$ and then added to another vector $y$. The prefixes s and d in saxpy and daxpy denote axpy-operations for single precision and double precision, respectively; a small calling example is given below. Introductory descriptions of the BLAS library are given in [43] or [60]. A standard way to parallelize algorithms for linear algebra is to provide efficient parallel implementations of the BLAS operations and to build up a parallel algorithm from these basic parallel operations. This technique is ideally suited for the CG method since it consists of such basic operations.

Here, we consider a parallel implementation based on the parallel implementations of the matrix–vector multiplication and the scalar product for distributed memory machines as presented in Chap. 3. These parallel implementations are based on a data distribution of the matrix and the vectors involved.
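As a small illustration of the BLAS naming scheme mentioned above, a daxpy call through the C interface (CBLAS) might look as follows; the availability of the cblas.h header and the exact linkage depend on the installed BLAS.

#include <cblas.h>

int main(void)
{
    double x[4] = {1.0, 2.0, 3.0, 4.0};
    double y[4] = {0.5, 0.5, 0.5, 0.5};
    double a = 2.0;
    /* daxpy: y = a*x + y in double precision; the arguments 1 are the
     * strides (increments) with which x and y are traversed */
    cblas_daxpy(4, a, x, 1, y, 1);
    return 0;   /* y now contains 2.5, 4.5, 6.5, 8.5 */
}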
For an efficient implementation of the CG method it is important that the data distributions of the different basic operations fit together, in order to avoid expensive data re-distributions between the operations. Figure 7.21 shows a data dependence graph in which the nodes correspond to the computation steps (1)–(6) of the CG algorithm in Fig. 7.20 and the arrows depict a data dependence between two of these computation steps. The arrows are annotated with the data structures computed in one step (outgoing arrow) and needed by another step (incoming arrow). The data dependence graph for one iteration step $k$ is a directed acyclic graph (DAG). There are also data dependences to the previous iteration step $k-1$ and the next iteration step $k+1$, which are shown as dashed arrows.

There are the following dependences in the CG method: Computation (2) needs the result $w_k$ from computation (1), but also the vector $d_k$ and the scalar value $\gamma_k$ from the previous iteration step $k-1$; $\gamma_k$ is used to store the intermediate result $\gamma_k = g_k^T g_k$. Computation (3) needs $\alpha_k$ from computation step (2) and the vectors $x_k$, $d_k$ from the previous iteration step $k-1$. Computation (4) also needs $\alpha_k$ from computation step (2) and the vector $w_k$ from computation (1). Computation (5) needs the vector $g_{k+1}$ from computation (4) and the scalar value $\gamma_k$ from the previous iteration step $k-1$; computation (6) needs the scalar value $\beta_k$ from computation (5) and the vector $d_k$ from iteration step $k-1$.

Fig. 7.21 Data dependences between the computation steps (1)–(6) of the CG method in Fig. 7.20. Nodes represent the computation steps of one iteration step $k$. Incoming arrows are annotated with the data required and outgoing arrows are annotated with the data produced. Two nodes are connected by an arrow if one of the nodes produces data which are required by the node with the incoming arrow. The data dependences to the previous iteration step $k-1$ or the next iteration step $k+1$ are given as dashed arrows. The data are named in the same way as in Fig. 7.20; additionally, the scalar $\gamma_k$ is used for the intermediate result $\gamma_k = g_k^T g_k$ computed in step (5) and required for the computations of $\alpha_k$ and $\beta_k$ in computation steps (2) and (5) of the next iteration step.

This shows that there are many data dependences between the different basic operations. But it can also be observed that computation (3) is independent of the computations (4)–(6). Thus, the computation sequence (1),(2),(3),(4),(5),(6) as well as the sequence (1),(2),(4),(5),(6),(3) can be used. The independence of computation (3) from computations (4)–(6) is also another source of parallelism: a coarse-grained parallelism of two linear algebra operations performed in parallel, in contrast to the fine-grained parallelism exploited for a single basic operation.
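As an aside, one possible way to exploit this coarse-grained parallelism on a shared memory machine is sketched below with OpenMP sections; this merely illustrates the independence of computation (3) from computations (4)–(6) and is not the implementation discussed in the following, which targets distributed memory. The new direction is written to a separate array d_new so that the concurrent read of d in computation (3) is not disturbed; all identifiers are assumptions.

/* executes computations (3) and (4)-(6) of one CG iteration step concurrently */
void cg_overlap_tail(int n, double alpha, double *x, double *g, double *d,
                     double *d_new, const double *w, double *gamma)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        {   /* (3) x_{k+1} = x_k + alpha_k d_k */
            for (int i = 0; i < n; i++) x[i] += alpha * d[i];
        }
        #pragma omp section
        {   /* (4) g_{k+1} = g_k + alpha_k w_k */
            for (int i = 0; i < n; i++) g[i] += alpha * w[i];
            /* (5) gamma_{k+1} = g_{k+1}^T g_{k+1} and beta_k = gamma_{k+1}/gamma_k */
            double gamma_new = 0.0;
            for (int i = 0; i < n; i++) gamma_new += g[i] * g[i];
            double beta = gamma_new / *gamma;
            /* (6) d_{k+1} = -g_{k+1} + beta_k d_k, written to d_new */
            for (int i = 0; i < n; i++) d_new[i] = -g[i] + beta * d[i];
            *gamma = gamma_new;
        }
    }
    /* the caller swaps d and d_new before the next iteration step */
}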
In the following, we concentrate on the fine-grained parallelism of basic linear algebra operations.

When the basic operations are implemented on a distributed memory machine, the data distribution of matrices and vectors and the data dependences between the operations might require a data re-distribution for a correct implementation. Thus, the data dependence graph in Fig. 7.21 can also be used to study the communication requirements for re-distribution in a message-passing program. Also the data dependences between two iteration steps may lead to communication for data re-distribution.

To demonstrate the communication requirements, we consider an implementation of the CG method in which the matrix $A$ has a row-blockwise distribution and the vectors $d_k$, $w_k$, $g_k$, $x_k$, and $r_k$ have a blockwise distribution. In one iteration step of a parallel implementation, the following computation and communication operations are performed.

7.4.2.2 Parallel CG Implementation with Blockwise Distribution

The parallel CG implementation has to consider the data distributions in the following way:

(0) Before starting the computation of iteration step $k$, the vector $d_k$ computed in the previous step has to be re-distributed from the blockwise distribution of step $k-1$ to the replicated distribution required for step $k$. This can be done with a multi-broadcast operation.

(1) The matrix–vector multiplication $w_k = A d_k$ is implemented with a row-blockwise distribution of $A$ as described in Sect. 3.6. Since $d_k$ is now replicated, no further communication is needed. The result vector $w_k$ is distributed in a blockwise way.

(2) The scalar product $d_k^T w_k$ is computed in parallel with the same blockwise distribution of both vectors. (The scalar product $\gamma_k = g_k^T g_k$ has been computed in the previous iteration step.) Each processor computes a local scalar product for its local vectors. The final scalar product is then computed by the root processor of a single-accumulation operation with addition as reduction operation. This processor owns the final result $\alpha_k$ and sends it to all other processors by a single-broadcast operation.

(3) Since the scalar value $\alpha_k$ is now known by each processor, the axpy-operation $x_{k+1} = x_k + \alpha_k d_k$ can be done in parallel without further communication. Each processor updates the block of components of $x_{k+1}$ assigned to it.
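A sketch of the communication pattern of steps (0)–(3) in MPI terms is given below. The equal block size local_n = n/p, the replacement of the single-accumulation plus single-broadcast of step (2) by a single MPI_Allreduce, and all identifiers are illustrative assumptions.

#include <mpi.h>

/* Steps (0)-(3) of one CG iteration step for the row-blockwise distribution
 * of A and the blockwise distribution of the vectors (sketch). A_loc holds
 * the local_n rows of A owned by this process, d_loc/w_loc/x_loc the local
 * blocks of the vectors, d_glob is a buffer of length n for the replicated
 * direction vector, and gamma = g_k^T g_k is known from step (5) of the
 * previous iteration step. */
void cg_step_start(int n, int local_n, const double *A_loc, double *d_loc,
                   double *d_glob, double *w_loc, double *x_loc, double gamma,
                   MPI_Comm comm)
{
    /* (0) multi-broadcast: re-distribute d_k from blockwise to replicated */
    MPI_Allgather(d_loc, local_n, MPI_DOUBLE, d_glob, local_n, MPI_DOUBLE, comm);

    /* (1) row-blockwise matrix-vector multiplication w_k = A d_k, local only */
    for (int i = 0; i < local_n; i++) {
        double s = 0.0;
        for (int j = 0; j < n; j++) s += A_loc[i * n + j] * d_glob[j];
        w_loc[i] = s;
    }

    /* (2) scalar product d_k^T w_k: local parts are accumulated with addition
       and the result is made known to all processes */
    double local_dot = 0.0, dw;
    for (int i = 0; i < local_n; i++) local_dot += d_loc[i] * w_loc[i];
    MPI_Allreduce(&local_dot, &dw, 1, MPI_DOUBLE, MPI_SUM, comm);
    double alpha = gamma / dw;

    /* (3) axpy-operation x_{k+1} = x_k + alpha_k d_k on the local block only */
    for (int i = 0; i < local_n; i++) x_loc[i] += alpha * d_loc[i];

    /* the remaining computation steps operate on the local blocks analogously */
}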