MPI Parallelization of Fast Algorithm Codes Developed Using SIE/VIE and P-FFT Method

MPI PARALLELIZATION OF FAST ALGORITHM CODES DEVELOPED USING SIE/VIE AND P-FFT METHOD

WANG YAOJUN
(B.Eng., Harbin Institute of Technology, China)

A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF ENGINEERING
DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING
NATIONAL UNIVERSITY OF SINGAPORE
2003

ACKNOWLEDGEMENTS

This project is financially supported by the Institute of High Performance Computing (IHPC) of the Agency for Science, Technology and Research (A*STAR). The author wishes to thank A*STAR-IHPC very much for its scholarship.

The author would like to thank Professor Li Le-Wei of the Department of Electrical & Computer Engineering (ECE) and Dr Li Er-Ping, Programme Manager, Electronics & Electromagnetics Programme of the Institute of High Performance Computing, for their guidance on this research. The author also thanks Dr Nie Xiao-Chun of Temasek Laboratories at NUS for the discussions held during this research. The author further thanks Mr Sing Cheng Hiong of the Microwave Research Lab for providing many facilities in the lab he manages.

Finally, the author is grateful to his beloved wife and daughter, who supported him in one way or another in completing the present research while they stayed in China.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
TABLE OF CONTENTS
SUMMARY
LIST OF FIGURES
LIST OF TABLES
LIST OF SYMBOLS
CHAPTER 1: INTRODUCTION
CHAPTER 2: BACKGROUND OF PARALLEL ALGORITHM FOR THE SOLUTION OF SURFACE INTEGRAL EQUATION
2.1 Basic Concept of Parallelization
2.1.1 Amdahl's Law
2.1.2 Communication Time
2.1.3 The Effective Bandwidth
2.1.4 Two Strategies on Communication
2.1.5 Three Guidelines on Parallelization
2.2 Basic Formulation of Scattering in Free Space
2.3 The Precorrected-FFT Algorithm
2.3.1 Projecting onto a Grid
2.3.2 Computing Grid Potentials
2.3.3 Interpolating Grid Potentials
2.3.4 Precorrecting
2.3.5 Computational Cost and Memory Requirement
2.4 RCS (Radar Cross Section)
2.5 MPI (Message Passing Interface)
2.6 FFT (Fast Fourier Transform)
2.6.1 DFT (Discrete Fourier Transform)
2.6.2 DIT (Decimation in Time) FFT and DIF (Decimation in Frequency) FFT
2.6.2.1 Radix-2 Decimation-in-Time (DIT) FFT
2.6.2.2 Radix-2 Decimation-in-Frequency (DIF) FFT
2.6.3 The Mixed-Radix FFTs
2.6.4 Parallel 3-D FFT Algorithm
2.6.5 Communications on Distributed-memory Multiprocessors
2.7 The Platform
CHAPTER 3: PARALLEL PRECORRECTED-FFT ALGORITHM ON PERFECTLY CONDUCTING
3.1 Goal of Parallelization
3.2 The Parallel Precorrected-FFT Algorithm
3.2.1 The First Way of Parallelization
3.2.2 The Second Way of Parallelization
3.3 The Memory Allocation
3.3.1 The Memory Requirement of the Grid Projection O(32Np^3)
3.3.2 The Memory Requirement of the FFT O(128Ng)
3.3.3 The Memory Requirement of the Interpolation O(16Np^3)
3.3.4 The Memory Requirement of the Correction Process O(8Nnear)
3.4 The Computational Cost
3.4.1 The Cost of Computing the Direct Interactions
3.4.2 Cost of Performing the FFT
CHAPTER 4: MONOSTATIC AND BISTATIC SIMULATION RESULTS OF PERFECT ELECTRIC CONDUCTOR
4.1 Parallelization of the First Way
4.2 Parallelization of the Second Way (Only Parallelizing FFT)
4.3 Parallelization of the Second Way (Only Parallelizing Correction)
4.4 Parallelization of the Second Way (Parallelizing Correction and FFT)
4.5 Bistatic RCS of a Metal Sphere
4.6 Analysis of the Simulation Results
4.7 Experiments on Communication Time
CHAPTER 5: PARALLEL ALGORITHM ON HOMOGENEOUS DIELECTRIC OBJECTS
CHAPTER 6: PARALLELIZATION OF PRECORRECTED-FFT SOLUTION OF THE VOLUME INTEGRAL EQUATIONS FOR INHOMOGENEOUS DIELECTRIC BODIES
6.1 Introduction
6.2 Formulation
6.2.1 The Formulation and Discretization of the Volume Integral Equation
6.2.2 The Precorrected-FFT Solution of the VIE
6.3 Parallel Algorithm
6.4 Numerical Simulation Results
6.4.1 The RCS of an Inhomogeneous Dielectric Sphere with 9,947 Unknowns
6.4.2 The RCS of a Periodic and Uniform Dielectric Slab with 206,200 Unknowns
CHAPTER 7: CONCLUSION ON PARALLEL PRECORRECTED-FFT ALGORITHM ON SCATTERING
REFERENCES

SUMMARY

In this thesis, the author explores the parallelization of the precorrected fast Fourier transform (P-FFT) algorithm used to compute electromagnetic fields. The Precorrected-FFT algorithm is a useful tool for characterizing electromagnetic scattering from objects. To improve the speed of this efficient algorithm, the author implements it on high performance computers, which can be a supercomputer with multiple processors or a cluster of computers. The author uses the IBM supercomputer (Model p690) to achieve this objective.

The Precorrected-FFT algorithm includes four main steps. An analysis of the four steps shows that the computation in each step can be made parallel, so the proposed parallel Precorrected-FFT algorithm also has four steps. The main idea of the parallelization is to distribute the whole computation among the available processors and to gather the final results from all of them. Because the parallel algorithm is based on the Message Passing Interface (MPI), the cost of communication among processors is an important factor affecting the efficiency of the parallel codes. Since message passing among processors is much slower than a processor's computation and its access to local memory, the parallel code keeps the amount of data transferred among processors as small as possible.

The author applies the parallel algorithm to the solution of the surface integral equation and of the volume integral equation with the Precorrected-FFT algorithm, respectively. The computation of radar cross sections of perfect electric conductors and dielectric objects is implemented. The simulation results show that the parallel algorithm is efficient.

A few papers resulted from the M.Eng. degree project. One journal paper and two conference papers have been published, and one journal paper has been submitted for publication. The list of the publications is shown at the end of Chapter

LIST OF FIGURES

Figure 2.1 Communication time
Figure 2.2 Side view of the P-FFT grid for a discretized sphere (p=3)
Figure 2.3 The four steps of the Precorrected-FFT algorithm
Figure 2.4 The Cooley-Tukey butterfly
Figure 2.5 The Gentleman-Sande butterfly
Figure 2.6 The loading flow of parallel codes
Figure 3.1 Relationship between grid spacing and execution time
Figure 3.2 Steps 1-4
Figure 3.3 Basic structures of distributed-memory computers
Figure 3.4(a) The communication between the main processor and the slave processors: Step 1
Figure 3.4(b) The communication between the main processor and the slave processors: Step 2
Figure 3.4(c) The communication between the main processor and the slave processors: Step 3
Figure 3.4(d) The communication between the main processor and the slave processors: Step 4
Figure 4.1 Parallel computing time I
Figure 4.2 Parallel computing time II
Figure 4.3 Parallel computing time III
Figure 4.4 Parallel computing time V
Figure 4.5 Parallel computing time VI
Figure 4.6 Bistatic RCS of a metal sphere
Figure 4.7 The communication time
Figure 6.1(a) Top view of a sphere
Figure 6.1(b) Outer surface of one-eighth of sphere
Figure 6.1(c) Interior subdivision of one-eighth of sphere into 27 tetrahedrons
Figure 6.2 RCS on an inhomogeneous dielectric sphere
Figure 6.3 Execution time with different processors
Figure 6.4 Bi-RCS of a periodic and uniform dielectric slab at k0h=9.0

LIST OF TABLES

Table 4.1 The communication time of different data transferred
Table 6.1 Execution time with different number of processors

LIST OF SYMBOLS

Symbol      Description
E^i         incident plane wave
E^s         scattered plane wave
n^          unit normal vector
A           magnetic vector potential
Φ           electric scalar potential
f_n(r)      Rao-Wilton-Glisson (RWG) basis functions
J           current
I_n         the unknown coefficients
G(r, r′)    Green's function
Z           impedance matrix
V           the vector
F^-1        the inverse FFT
E_in        the electric field strength of the incident plane wave at a target
E_r         the electric field strength of the receiving antenna's preferred polarization

CHAPTER 1 INTRODUCTION

In this thesis, the author mainly investigates how to apply the parallel precorrected fast Fourier transform (P-FFT) algorithm to the computation of scattered electromagnetic fields. The results show that the parallel Precorrected-FFT algorithm is an efficient algorithm for solving electromagnetic scattering problems.

The thesis includes seven chapters. The major content of each chapter (from Chapter 2 to Chapter 7) is listed below.

In Chapter 2,
some basic concepts relating to the parallel Precorrected-FFT algorithm for scattering are introduced concisely. These concepts are the Message Passing Interface (MPI), Radar Cross Sections (RCS), the Precorrected-FFT algorithm, the Fast Fourier Transform (FFT), the physical and virtual structures of high performance computers, the theory of parallelization and the communication cost.

In Chapter 3, details of the parallel Precorrected-FFT algorithm are given. Two ways of applying the algorithm are analyzed, and the pseudocode of the algorithm is written.

In Chapter 4, the experimental results of scattering by perfect electric conductors are presented and analyzed.

Figure 6.1(a) Top view of a sphere [1]
Figure 6.1(b) Outer surface of one-eighth of sphere [1]
Figure 6.1(c) Interior subdivision of one-eighth of sphere into 27 tetrahedrons [1]

6.2 Formulation

6.2.1 The Formulation and Discretization of the Volume Integral Equation [5]

First of all, consider a lossy, inhomogeneous dielectric body V which is illuminated by an incident field $\mathbf{E}^i$. Assume that the material is dielectric ($\mu = \mu_0$) and has a complex dielectric constant of $\varepsilon(\mathbf{r}) = \varepsilon_r(\mathbf{r})\varepsilon_0 - j\sigma(\mathbf{r})/\omega$, where $\varepsilon_r$ and $\sigma$ are the relative permittivity and conductivity at position $\mathbf{r}$, respectively. By invoking the equivalence principle, the dielectric body is removed and replaced by a volume polarization current $\mathbf{J}$. Because the total electric field is composed of both the incident field and the scattered field due to $\mathbf{J}$, we obtain the following volume integral equation,

\[ \mathbf{D}(\mathbf{r})/\varepsilon(\mathbf{r}) = \mathbf{E}^i(\mathbf{r}) - j\omega\mathbf{A}(\mathbf{r}) - \nabla\Phi(\mathbf{r}) \tag{6.1} \]

where $\mathbf{A}(\mathbf{r})$ and $\Phi(\mathbf{r})$ are the vector and scalar potentials produced by the volume current $\mathbf{J}$, and $\mathbf{J}$ is related to the total electric flux density by

\[ \mathbf{J}(\mathbf{r}) = j\omega\,\bigl(\varepsilon(\mathbf{r}) - \varepsilon_0\bigr)\,\mathbf{D}(\mathbf{r})/\varepsilon(\mathbf{r}) \tag{6.2} \]

Equation (6.1) can be solved on a computer only after it has been discretized. The volume V is therefore discretized into a number of tetrahedral elements, while the dielectric properties of each
tetrahedral element are approximated as constant. The following volumetric SWG basis functions are used to represent the unknown electric flux density:

\[ \mathbf{D}(\mathbf{r}) = \sum_{n=1}^{N} D_n\,\mathbf{f}_n(\mathbf{r}) \tag{6.3} \]

where $D_n$ represents the unknown expansion coefficients, and $N$ denotes the number of faces that make up the tetrahedral model of V. Replacing $\mathbf{D}(\mathbf{r})$ in Equation (6.1) with Equation (6.3) and applying Galerkin's testing procedure yields an $N \times N$ matrix equation of the form

\[ \mathbf{S}\mathbf{D} = \mathbf{E} \tag{6.4} \]

For conciseness, the detailed description of the elements of the coefficient matrix $\mathbf{S}$ and the excitation vector $\mathbf{E}$ is omitted here, since they can be easily derived from Equations (6.1)-(6.4). For an easier description of the following P-FFT approach, however, we give the expressions of the contributions to $\mathbf{A}$ and $\Phi$ from a single basis function, which are needed in the computation of the elements of $\mathbf{S}$:

\[ \mathbf{A}_n(\mathbf{r}) = \frac{\mu_0 a_n}{12\pi}\left[ \frac{\kappa_n^+}{V_n^+}\int_{T_n^+} \boldsymbol{\rho}_n^+(\mathbf{r}')\,\frac{e^{-jk_0|\mathbf{r}-\mathbf{r}'|}}{|\mathbf{r}-\mathbf{r}'|}\,dv' + \frac{\kappa_n^-}{V_n^-}\int_{T_n^-} \boldsymbol{\rho}_n^-(\mathbf{r}')\,\frac{e^{-jk_0|\mathbf{r}-\mathbf{r}'|}}{|\mathbf{r}-\mathbf{r}'|}\,dv' \right] \tag{6.5} \]

\[ \Phi_n(\mathbf{r}) = \frac{-a_n}{4\pi j\omega\varepsilon_0}\left[ \frac{\kappa_n^+}{V_n^+}\int_{T_n^+} \frac{e^{-jk_0|\mathbf{r}-\mathbf{r}'|}}{|\mathbf{r}-\mathbf{r}'|}\,dv' - \frac{\kappa_n^-}{V_n^-}\int_{T_n^-} \frac{e^{-jk_0|\mathbf{r}-\mathbf{r}'|}}{|\mathbf{r}-\mathbf{r}'|}\,dv' - \frac{\kappa_n^+ - \kappa_n^-}{a_n}\int_{a_n} \frac{e^{-jk_0|\mathbf{r}-\mathbf{r}'|}}{|\mathbf{r}-\mathbf{r}'|}\,ds' \right] \tag{6.6} \]

The definitions of $a_n$, $\boldsymbol{\rho}_n^{\pm}$, $V_n^{\pm}$ and $\kappa_n^{\pm}$ can be found in [5].

6.2.2 The Precorrected-FFT Solution of the VIE

The precorrected-FFT method is an excellent fast algorithm that has been successfully applied to solve the surface integral equations for electromagnetic scattering problems. Dr Nie Xiaochun, Professor Li Le-Wei and Dr Yuan Ning have now extended its application to the solution of the volume integral equations. The method considers near-field and far-field interactions separately when evaluating a matrix-vector multiplication. The far-field interactions are computed by projecting the sources onto a regular grid: the vector and scalar potentials of the sources are matched at some given test points, which guarantees the approximate equality of the sources' far fields. Then
a 3-D convolution can be used to evaluate the grid potentials (fields). The grid potentials can be interpolated to the elements in place of a direct computation of the fields on the scatterer. To utilize the FFT effectively during the convolution computation, the projection and interpolation operators are represented by sparse matrices. Unfortunately, the grid currents do not accurately match the fields radiated by the original sources in the near field. Therefore, near-field interactions need to be computed directly and corrected for the errors introduced by the far-field operator. The implementation of the projection step for the VIE is described in the following paragraphs. The convolution, interpolation and precorrection steps are omitted since they are similar to those introduced in previous chapters, although they are more complicated.

Before the P-FFT method is applied, the entire object is enclosed in a uniform rectangular grid. Next, the uniform rectangular grid is further subdivided into small cells, with each cell consisting of $p^3$ grid points and containing only a few tetrahedral elements. Assume that the $n$th volumetric SWG basis function $\mathbf{f}_n$ is contained in a given cell $k$. For the projection of the electric charges (corresponding to $\nabla\cdot\mathbf{f}_n$), we enforce the scalar potential produced by the electric charges at the $p^3$ grid points to match, at $N_c$ test points, that produced by the original electric charge distributions on the two tetrahedral elements and the common triangular patch (if applicable). This gives the projection operator for the divergence of the $n$th basis function,

\[ \mathbf{W}(k,n) = \left[\mathbf{P}^{gt}\right]^{+}\mathbf{P}^{pt,n} \tag{6.7} \]

where $\mathbf{P}^{pt,n}$ denotes the $n$th column of $\mathbf{P}^{pt}$ and $[\mathbf{P}^{gt}]^{+}$ indicates the generalized inverse of $\mathbf{P}^{gt}$. $\mathbf{P}^{gt}$ represents the mapping between the grid charges and the test-point potentials, and $\mathbf{P}^{pt}$ represents the mapping between the actual charge distributions and the test-point potentials. They are given by

\[ P^{gt}(q,l) = \frac{e^{-jk_0|\mathbf{r}_q^t-\mathbf{r}_l|}}{4\pi\varepsilon_0\,|\mathbf{r}_q^t-\mathbf{r}_l|} \tag{6.8} \]

\[ P^{pt}(q,n) = \frac{-a_n}{4\pi j\omega\varepsilon_0}\left[ \frac{\kappa_n^+}{V_n^+}\int_{T_n^+} \frac{e^{-jk_0|\mathbf{r}_q^t-\mathbf{r}'|}}{|\mathbf{r}_q^t-\mathbf{r}'|}\,dv' - \frac{\kappa_n^-}{V_n^-}\int_{T_n^-} \frac{e^{-jk_0|\mathbf{r}_q^t-\mathbf{r}'|}}{|\mathbf{r}_q^t-\mathbf{r}'|}\,dv' - \frac{\kappa_n^+ - \kappa_n^-}{a_n}\int_{a_n} \frac{e^{-jk_0|\mathbf{r}_q^t-\mathbf{r}'|}}{|\mathbf{r}_q^t-\mathbf{r}'|}\,ds' \right] \tag{6.9} \]

where $\mathbf{r}_q^t$ and $\mathbf{r}_l$ are the position vectors of the $q$th test point and the $l$th grid point, respectively, and $\hat{q}_l$ is the charge at the $l$th grid point. For any basis function $n$ in cell $k$, this projection operator generates a subset of the grid charges $\hat{\mathbf{q}}$. The contribution to $\hat{\mathbf{q}}$ from the charges in cell $k$ can be computed by summing over all the actual charges in this cell, i.e.

\[ \hat{\mathbf{q}} = \sum_{n} \mathbf{W}(k,n)\,D_n \tag{6.10} \]

Following the above procedure, we can project the charges $D_n\nabla\cdot\mathbf{f}_n$ onto the $p^3$ grid points surrounding cell $k$. It should be noted that the projections of the volume and surface charges are performed simultaneously in one step, which is a convenient and efficient scheme developed for the volume integral equation. Similarly, by matching the vector potential due to the $p^3$ grid currents and that due to the actual volume current distributions at the test points, we can obtain the projection operators for the electric currents.

6.3 Parallel Algorithm

As introduced in previous chapters, all four critical steps of the Precorrected-FFT algorithm can in theory be parallelized. However, considering the communication cost in the special case of the VIE, some steps may be left serial. For instance, when the uniform grid enclosing the computed object is small, such as below 20 × 20 × 20, the serial convolution computation is faster than the parallel one on the IBM p690 supercomputer, whose processors are very fast while the communication among nodes is not fast enough. In this case, the steps that need to be parallelized are projection, interpolation and precorrection. As the parallel procedure is similar to the method used for the surface integral equation, the procedure is not explained again
here. See Subsection 3.2.2 for more details.

6.4 Numerical Simulation Results

6.4.1 The RCS of an Inhomogeneous Dielectric Sphere with 9,947 Unknowns

To verify the correctness and efficiency of the parallel precorrected-FFT algorithm for the volume integral equation, scattering by an inhomogeneous dielectric sphere is computed with it. The wavelength λ of the incident plane wave is set to be meter. The sphere, whose radius is 0.5 meter, is divided into 4,802 tetrahedrons with 995 nodes, which gives 9,947 unknowns. Figure 6.2 shows the computed RCS, which is the same as the correct result. Figure 6.3 displays the execution time when different numbers of processors are used to run the parallel precorrected-FFT codes.

[Figure 6.2 RCS on an inhomogeneous dielectric sphere. Axes: RCS versus angle.]

[Figure 6.3 Execution time with different processors. Axes: CPU time (seconds) versus number of CPUs. Series: ideal CPU time (s): 2631.6, 1315.8, 657.9, 328.9, 164.5; practical CPU time (s), values shown: 2631.6, 2067.08, 2100, 2461.5, 1789.03.]

6.4.2 The RCS of a Periodic and Uniform Dielectric Slab with 206,200 Unknowns

The second example is a five-period slab. The sizes of this slab in the x-, y- and z-direction are m, m and 1.4 m, respectively. The object is divided into 100,800 tetrahedrons with 19,188 nodes. The number of unknowns is 206,200. Figure 6.4 shows the bistatic RCS of the slab. It can be seen that the results of the parallel P-FFT method and of the normal P-FFT method in [2] are in good agreement, which demonstrates the correctness of the parallel algorithm. Table 6.1 gives the execution time of the solution for different numbers of processors.

Figure 6.4 Bi-RCS of a periodic and uniform dielectric slab at k0h=9.0

Table 6.1 Execution time with different number of processors
Processors               10
Execution time (hour)    30.2   12.1   8.5

According to [6], the serial solution of the second example needs 38 hours
on a PC with 1 GB of memory. Compared with the serial P-FFT method, the parallel P-FFT method reduces the execution time greatly. The execution time is nevertheless still longer than the ideal time. Two reasons are responsible for the long running time. The first is that the author has the lowest priority on the supercomputer. The second is that there are usually more than one hundred jobs running on the supercomputer simultaneously.

It is obvious that the execution time is not reduced in proportion to the number of processors. This is because of the communication between different processors. The communication is of the blocking type, which means that the code cannot proceed until the data communication among all processors has finished. In addition, some fraction of the codes cannot be parallelized.

CHAPTER 7 CONCLUSION ON PARALLEL PRECORRECTED-FFT ALGORITHM ON SCATTERING

As an excellent fast algorithm, the Precorrected-FFT algorithm has been widely applied in electromagnetics. In this thesis, the author extends the application of the Precorrected-FFT method to the computation of scattering from large objects. Based on the platform of high performance computers, MPI-based algorithms are developed to accelerate the computation and to deal with large scattering problems. These algorithms concern the solutions of the surface integral equation and the volume integral equation.

In this thesis, the relevant background on the parallel Precorrected-FFT algorithm is introduced first. Then the parallel algorithms for the surface integral equation and the volume integral equation are given, respectively. The main ideas of these two algorithms are in fact the same. In the parallel algorithms, the cost of communication among processors is an important factor that should be balanced carefully.

The numerical simulations of the parallel Precorrected-FFT algorithm have shown that the efficiency of the algorithm is high and that the capability of processing large-scale objects is greatly
improved. Comparing the results of the parallel algorithm with those of the serial algorithm, we find that the parallel code can solve larger objects and that the execution time is shortened significantly.

Because the author's priority and authorization on the IBM p690 are the lowest, the author cannot obtain enough memory, nor guarantee that the parallel codes are not suspended in favor of codes with higher priority. The size of the objects therefore could not be made larger.

Since the code is based on MPI, it can conveniently be ported to any other MPI platform. In future, the author hopes to run the code on an MPI-based cluster of computers, which is cheaper than a supercomputer.

References

[1] J. R. Phillips and J. K. White, "A precorrected-FFT method for Capacitance Extraction of complicated 3-D structures", Int. Conf. on Computer-Aided Design, Santa Clara, California, Nov. 1994.
[2] J. R. Phillips and J. K. White, "A precorrected-FFT method for electrostatic analysis of complicated 3-D structures", IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 16, no. 10, pp. 1059-1072, Oct. 1997.
[3] Xiaochun Nie, Le-Wei Li, Ning Yuan and Yeo Tat Soon, "Precorrected-FFT Algorithm for Solving Combined Field Integral Equations in Electromagnetic Scattering", Journal of Electromagnetic Waves and Applications, vol. 16, no. 8, pp. 1171-1187, 2002.
[4] Xiaochun Nie, Le-Wei Li, Ning Yuan and Jacob K. White, "Fast Analysis of Scattering by Arbitrarily Shaped Three-Dimensional Objects Using the Precorrected-FFT Method", Microwave and Optical Technology Letters, vol. 34, no. 6, pp. 348-442, 2002.
[5] Xiaochun Nie, Le-Wei Li, Ning Yuan, Tat Soon Yeo and Yeow Beng Gan, "Precorrected-FFT Solution of the Volume Integral Equations for Inhomogeneous Dielectric Bodies", in Proc. 2003 IEEE AP-S International Symposium and USNC/URSI National Radio Science Meeting, Columbus, Ohio, USA, June 23-27, 2003, Invited Paper.
[6] Xiao-Chun Nie, Le-Wei Li, Ning Yuan, Tat-Soon Yeo, and Yeow-Beng Gan, "Precorrected-FFT Solution of the Volume Integral Equation for 3-D Inhomogeneous Dielectric Objects", IEEE Trans. Antennas Propagat., submitted.
[7] N. R. Aluru, V. B. Nadkarni and J. White, "A Parallel Precorrected FFT Based Capacitance Extraction Program for Signal Integrity Analysis", in Proc. of 33rd Design Automation Conference, DAC 96-06/96, Las Vegas, NV, USA.
[8] D. H. Schaubert, D. R. Wilton and A. W. Glisson, "A Tetrahedral Modeling Method for Electromagnetic Scattering by Arbitrarily Shaped Inhomogeneous Dielectric Bodies", IEEE Trans. Antennas Propagat., vol. AP-32, no. 1, pp. 77-85, Jan. 1984.
[9] Eleanor Chu and Alan George, Inside the FFT Black Box: Serial and Parallel Fast Fourier Transform Algorithms, Boca Raton, Fla.: CRC Press, 2000.
[10] Weng Cho Chew, Jian-Ming Jin, Eric Michielssen, and Jiming Song, Fast and Efficient Algorithms in Computational Electromagnetics, Artech House, Inc., 685 Canton Street, Norwood, MA 02062, 2001.
[11] T. K. Sarkar, E. Arvas, and S. M. Rao, "Application of FFT and the Conjugate Gradient Method for the Solution of Electromagnetic Radiation from Electrically Large and Small Conducting Bodies", IEEE Trans. Antennas Propagat., vol. AP-34, no. 5, pp. 635-640, May 1986.
[12] C. C. Lu, "Multilevel fast multipole algorithm for electromagnetic scattering from conducting objects with material coating", IEEE APS Int. Symp. Dig., vol. 3, pp. 770-773, 2001.
[13] D. H. Schaubert, D. R. Wilton and A. W. Glisson, "A Tetrahedral Modeling Method for Electromagnetic Scattering by Arbitrarily Shaped Inhomogeneous Dielectric Bodies", IEEE Trans. Antennas Propagat., vol. AP-32, no. 1, pp. 77-85, Jan. 1984.
[14] X. C. Nie, L. W. Li, N. Yuan and T. S. Yeo, "Precorrected-FFT algorithm for solving combined field integral equations in electromagnetic scattering", J. of Electromagn. Waves and Appl., vol. 16, no. 8, pp. 1171-1187, 2002.
[15] E. Topsakal, M. Carr, J. Volakis and M. Bleszynski, "Galerkin operators in adaptive integral method implementations", IEE Proc.-Microw. Antennas Propag., vol. 148, no. 2, pp. 79-84, April 2001.
[16] C. Guiffaut and K. Mahdjoubi, "A Parallel FDTD Algorithm Using the MPI Library", IEEE Antennas and Propagation Magazine, vol. AP-43, no. 2, pp. 94-103, April 2001.
[17] Jean-Marc Adamo, Multi-threaded Object-Oriented MPI-Based Message Passing Interface: The ARCH Library, Boston, Mass.: Kluwer Academic Publishers, 1998.
[18] William Gropp, Ewing Lusk, and Rajeev Thakur, Using MPI-2: Advanced Features of the Message-Passing Interface, Cambridge, Mass.: MIT Press, 1999.
[19] Henri J. Nussbaumer, Fast Fourier Transform and Convolution Algorithms, Berlin; New York: Springer-Verlag, 1981.
[20] Korada Umashankar and Allen Taflove, Computational Electromagnetics, Artech House, Inc., 685 Canton Street, Norwood, MA 02062, 1993.
[21] S. Lakshmivarahan and Sudarshan K. Dhall, Analysis and Design of Parallel Algorithms: Arithmetic and Matrix Problems, New York: McGraw-Hill, 1990.
[22] Hans Zima, Supercompilers for Parallel and Vector Computers, New York, N.Y.: ACM Press; Wokingham, England; Reading, Mass.: Addison-Wesley, 1991.
[23] Michael Metcalf and John Reid, Fortran 90/95 Explained, Oxford: Oxford University Press, 1996.
[24] Jonathan Schaeffer, High Performance Computing Systems and Applications, Boston: Kluwer Academic Publishers, 1998.
[25] David B. Skillicorn and Domenico Talia, Programming Languages for Parallel Processing, Los Alamitos, Calif.: IEEE Computer Society Press, 1995.
[26] T. L. Freeman and C. Phillips, Parallel Numerical Algorithms, New York: Prentice Hall, 1992.
[27] Rene Husler, Intelligent Communication: A New Paradigm for Data-Parallel Programming, Hartung-Gorre Verlag Konstanz, 1997.
[28] Thomas Braunl, Parallel Programming: An Introduction, Prentice Hall International (UK) Limited, 1993.
[29] William Gropp, Ewing Lusk, and Anthony Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, 2nd ed., Cambridge, Mass.: MIT Press, 1999.

[...]
setenv MP_PGMMODEL MPMD
This command sets the programming model to multiple programs with multiple data (MPMD).
Step 2. Set the MP_PROCS environment variable as follows:
setenv MP_PROCS n
This command indicates how many processors will be used to run the program.
Step 3. Invoke an MPMD program. First, start the poe command:
poe [options]
The poe command will then prompt the user to enter the program name for each processor...

computers provide a good platform on which large problems can be solved. In order to utilize high performance computers efficiently, it is necessary to explore how to parallelize the fast algorithm. Although some compilers on high performance computers can automatically compile serial codes into parallel codes and run them, the efficiency of the application codes compiled by these compilers...

procedure of application of the Precorrected-FFT algorithm [1].
Figure 2.2 Side view of the P-FFT grid for a discretized sphere (p=3) [1]
Figure 2.3 The four steps of the Precorrected-FFT algorithm [1]
A brief description of the above procedure is given below.
2.3.1 Projecting onto a Grid
Initially, a projection operator should be defined. The basic idea is that using the point current and...

cluster of computers or the processors of a multiprocessor parallel computer. Some also call this kind of structure a 'grid'. The concept of the computing grid is borrowed from the electricity grid that supplies us with electric power. The key problem in MPI-based programming is how to distribute the tasks to the processors according to the capability of each processor. There are two main types of MPI-based supercomputers:...
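The task-distribution problem described above can be illustrated with a minimal sketch. Python is used here only for illustration, and the loop over ranks merely imitates what separate MPI processes would do in parallel. The function names `block_range` and `simulate_scatter_gather` are hypothetical, and the even block split is an assumption: it treats all processors as equally capable, which the thesis text does not require.

```python
def block_range(n_items, n_procs, rank):
    """Return the half-open index range [start, stop) owned by processor `rank`.

    Items are split into contiguous blocks; the first `n_items % n_procs`
    processors receive one extra item so that the load differs by at most one.
    """
    base, extra = divmod(n_items, n_procs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def simulate_scatter_gather(values, n_procs):
    """Imitate distribute-compute-gather serially (assumed workload: squaring)."""
    gathered = []
    for rank in range(n_procs):
        start, stop = block_range(len(values), n_procs, rank)
        partial = [v * v for v in values[start:stop]]  # local work of one processor
        gathered.extend(partial)  # the main processor collects the partial results
    return gathered


if __name__ == "__main__":
    # Ten work items over four processors.
    print([block_range(10, 4, r) for r in range(4)])
    print(simulate_scatter_gather(list(range(5)), 3))
```

With processors of unequal capability, the block sizes could be weighted instead of equal; the gather step would stay the same.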
research, one journal paper and two conference papers have been published and one paper has been submitted. These papers include:
(a) Book Chapter
1. Le-Wei Li, Yao-Jun Wang, and Er-Ping Li, "MPI-based parallelized precorrected FFT algorithm for analyzing scattering by arbitrarily shaped three-dimensional objects", Progress in Electromagnetics Research, PIER 42, pp. 247-259, 2003.
(b) Journal Papers
1. Le-Wei Li,...

on parallelization of the Precorrected-FFT algorithm for computing scattered electromagnetic fields, knowledge of parallel concepts, MPI, the Precorrected-FFT algorithm, the concept of scattering by objects in EM computations, Fast Fourier Transforms (FFTs), and the structure of high performance computers (here referring to the IBM model p690) is necessary. Parallelization is a complex procedure which is...

program to processors 1-3, respectively. The details of the loading operation are given in the following.
[Figure 2.6 The loading flow of parallel codes: POE loads the main program and the slave program onto Processors 0 to 3, each with its own memory.]
Referring to the IBM p690, the practical steps of loading the main program and the slave program onto the processors are explained below.
Step 1. Set the MP_PGMMODEL...

a specific algorithm may not be readily high. The best way of improving the efficiency is for the programmer to parallelize the required algorithm manually, case by case. In this thesis, we adopt the Message Passing Interface (MPI) library on the IBM p690 as the platform that supports our parallel codes, because MPI is a standard message passing protocol supported by many vendors. Before starting our discussion on parallelization...
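How much a manual parallelization can gain is bounded by Amdahl's law, which the table of contents lists as Section 2.1.1. The following minimal sketch shows the relation. The single-processor time of 2631.6 s is the first ideal CPU time listed with Figure 6.3; the processor counts 1, 2, 4, 8, 16 are an assumption inferred from the halving of the ideal values, and the serial fraction of 0.1 is purely illustrative, not a measured property of the thesis codes.

```python
def ideal_time(t_serial, n_procs):
    """Execution time if the whole code scaled perfectly: T(p) = T(1) / p."""
    return t_serial / n_procs


def amdahl_speedup(serial_fraction, n_procs):
    """Amdahl's law: speedup when a fraction of the code cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)


if __name__ == "__main__":
    t1 = 2631.6            # seconds, single-processor value listed with Figure 6.3
    serial_fraction = 0.1  # illustrative assumption only
    for p in (1, 2, 4, 8, 16):  # assumed processor counts
        bounded = t1 / amdahl_speedup(serial_fraction, p)
        print(p, ideal_time(t1, p), bounded)
```

Communication time is not modelled here; in practice it adds to the bounded time, which is one reason the practical CPU times stay above the ideal ones.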
the world. Because MPI is a popular interface standard, MPI-based codes can be ported to other computers easily; that is, the compatibility is excellent. This is the reason that we choose MPI as the platform. The Message Passing Interface (MPI) is the definition of an interface among a cluster of computers or the processors of a multiprocessor parallel computer. It provides a platform on which users can...

is easy to write programs on the MPI-based platform. Only a few functions in the MPI library are indispensable. With these functions a vast number of useful and efficient codes can be written. The list of these functions is shown here [16, 17, 18]:
(1) MPI_Init(ierr) //Initialize MPI
(2) MPI_Comm_size(MPI_COMM_WORLD, numprocs, ierr) //Find out how many processes there are
(3) MPI_Comm_rank(MPI_COMM_WORLD, myid, ...

we apply the Precorrected-FFT algorithm, which requires less memory and is faster than the traditional Method of Moments (MoM).
2.3 The Precorrected-FFT Algorithm
The Precorrected-FFT algorithm...

the physical structure and the operating environment of this supercomputer in the latter part of this chapter. There are two aspects that parallelization algorithm should be developed to especially
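The speed advantage of the Precorrected-FFT algorithm over a direct method comes from evaluating the grid potentials as a convolution with FFTs. The following minimal sketch shows the idea in one dimension, using a recursive radix-2 decimation-in-time FFT of the kind listed in Section 2.6.2.1. It is not the thesis code: the thesis uses a 3-D convolution with the Green's function on the grid, whereas the sequences below are arbitrary illustrative numbers, and the function names are hypothetical.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 decimation-in-time FFT (length must be a power of two).

    With inverse=True the conjugate twiddle factors are used; the 1/n scaling
    is applied separately in ifft().
    """
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t            # butterfly, upper output
        out[k + n // 2] = even[k] - t   # butterfly, lower output
    return out


def ifft(x):
    """Inverse FFT with 1/n normalization."""
    n = len(x)
    return [v / n for v in fft(x, inverse=True)]


def fft_convolve(a, b):
    """Linear convolution of two real sequences via zero-padded FFTs."""
    size = len(a) + len(b) - 1
    n = 1
    while n < size:
        n *= 2
    fa = fft([complex(v) for v in a] + [0j] * (n - len(a)))
    fb = fft([complex(v) for v in b] + [0j] * (n - len(b)))
    product = [u * v for u, v in zip(fa, fb)]
    return [v.real for v in ifft(product)[:size]]


def direct_convolve(a, b):
    """Reference O(N^2) convolution used to check the FFT result."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, u in enumerate(a):
        for j, v in enumerate(b):
            out[i + j] += u * v
    return out


if __name__ == "__main__":
    sources = [1.0, 2.0, 3.0]   # stand-in for projected grid sources
    kernel = [0.0, 1.0, 0.5]    # stand-in for the grid kernel, not a Green's function
    print(fft_convolve(sources, kernel))
    print(direct_convolve(sources, kernel))
```

The zero padding turns the circular convolution computed by the FFT into the linear convolution that the grid-potential step needs.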
