Applied Computational Fluid Dynamics Techniques - Wiley, Episode 2, Part 6

Then:

L1. Initialize the pointer lists for elements, points and the receive lists.
L2. For each point ipoin:
    - get the smallest domain number idmin of the elements that surround it and store this number in lpmin(ipoin);
    - for each element that surrounds this point: if the domain number of this element is larger than idmin, add this element to domain idmin.
L3. For the points of each sub-domain idomn: if lpmin(ipoin).ne.idomn, add this information to the receive list for this sub-domain.
L4. Order the receive list of each sub-domain according to sub-domains.
L5. Given the receive lists, build the send list for each sub-domain.

Given the send and receive lists, the information transfer required for the parallel explicit flow solver is accomplished as follows (a minimal sketch of such an exchange routine is given below):

- Send the updated unknowns of all nodes stored in the send list;
- Receive the updated unknowns of all nodes stored in the receive list;
- Overwrite the unknowns for these received points.

In order to demonstrate the use of explicit flow solvers on MIMD machines, we consider the same supersonic inlet problem as described above for shared-memory parallel machines (see Figure 15.24). The solution obtained on a 6-processor MIMD machine after 800 timesteps is shown in Figure 15.28(a). The boundaries of the different domains can be clearly distinguished. Figure 15.28(b) summarizes the speedups obtained for a variety of platforms using MPI as the message-passing library, as well as the shared-memory option. Observe that an almost linear speedup is obtained. For large-scale industrial applications of domain decomposition in conjunction with advanced compressible flow solvers, see Mavriplis and Pirzadeh (1999).

Figure 15.28. Supersonic inlet: (a) MIMD results (Mach number, usual vs. 6-processor run: min = 0.825, max = 3.000, incr = 0.05); (b) speedup for different machines (ideal vs. SGI-O2K SHM, SGI-O2K MPI, IBM-SP2 MPI, HP-DAX MPI).
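To make the transfer step concrete, the following is a minimal sketch (not the code used here) of how the send and receive lists could drive a halo exchange with MPI. The argument and array names (nneig, lneig, isend, lsend, irecv, lrecv, nunkn, unkno) are assumptions for illustration only; the lists are assumed to be stored in CSR fashion, grouped per neighbouring sub-domain as produced by steps L4 and L5 above.

```fortran
! Sketch of the parallel transfer step: pack the unknowns of the points in
! the send lists, exchange them with the neighbouring sub-domains, and
! overwrite the unknowns of the points in the receive lists.
subroutine exchange_unknowns(nneig, lneig, isend, lsend, irecv, lrecv, &
                             nunkn, unkno)
  use mpi
  implicit none
  integer, intent(in)    :: nneig            ! nr. of neighbouring sub-domains
  integer, intent(in)    :: lneig(nneig)     ! their MPI ranks
  integer, intent(in)    :: isend(nneig+1)   ! CSR pointers into lsend
  integer, intent(in)    :: lsend(*)         ! send list (local point numbers)
  integer, intent(in)    :: irecv(nneig+1)   ! CSR pointers into lrecv
  integer, intent(in)    :: lrecv(*)         ! receive list (local point numbers)
  integer, intent(in)    :: nunkn            ! nr. of unknowns per point
  real(8), intent(inout) :: unkno(nunkn,*)   ! unknowns at the points
  real(8), allocatable   :: bsend(:,:), brecv(:,:)
  integer :: reqs(2*nneig), in, i, i0, i1, ierr

  allocate(bsend(nunkn,isend(nneig+1)-1), brecv(nunkn,irecv(nneig+1)-1))

  ! post the receives for the points stored in the receive lists
  do in = 1, nneig
    i0 = irecv(in) ; i1 = irecv(in+1) - 1
    call MPI_Irecv(brecv(1,i0), nunkn*(i1-i0+1), MPI_DOUBLE_PRECISION, &
                   lneig(in), 0, MPI_COMM_WORLD, reqs(in), ierr)
  enddo

  ! pack and send the updated unknowns of the points stored in the send lists
  do in = 1, nneig
    i0 = isend(in) ; i1 = isend(in+1) - 1
    do i = i0, i1
      bsend(:,i) = unkno(:,lsend(i))
    enddo
    call MPI_Isend(bsend(1,i0), nunkn*(i1-i0+1), MPI_DOUBLE_PRECISION, &
                   lneig(in), 0, MPI_COMM_WORLD, reqs(nneig+in), ierr)
  enddo

  call MPI_Waitall(2*nneig, reqs, MPI_STATUSES_IGNORE, ierr)

  ! overwrite the unknowns of the received points
  do i = 1, irecv(nneig+1) - 1
    unkno(:,lrecv(i)) = brecv(:,i)
  enddo

  deallocate(bsend, brecv)
end subroutine exchange_unknowns
```

Posting all receives before the sends keeps the exchange free of deadlock regardless of the order in which the neighbours appear in each sub-domain's lists.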
15.7. The effect of Moore's law on parallel computing

One of the most remarkable constants in a rapidly changing world has been the rate of growth of the number of transistors that are packaged onto a square inch. This rate, commonly known as Moore's law, is approximately a factor of two every 18 months, which translates into a factor of 10 every 5 years (2^(5/1.5) ≈ 10) (Moore (1965, 1999)). As one can see from Figure 15.29, this rate, which governs the increase in computing speed and memory, has held constant for more than three decades, and there is no end in sight for the foreseeable future (Moore (2003)). One may argue that the raw number of transistors does not translate into CPU performance. However, more transistors translate into more registers and more cache, both important elements in achieving higher throughput. At the same time, clock rates have increased, and pre-fetching and branch prediction have improved. Compiler development has also not stood still. Moreover, programmers have become conscious of the added cost of memory access, cache misses and dirty cache lines, employing the techniques described above to minimize their impact. The net effect, reflected in all current projections, is that CPU performance is going to continue advancing at a rate comparable to Moore's law.

Figure 15.29. Evolution of transistor density.

15.7.1. THE LIFE CYCLE OF SCIENTIFIC COMPUTING CODES

Let us consider the effects of Moore's law on the life cycle of typical large-scale scientific computing codes. The life cycle of these codes may be subdivided into the following stages:

- conception;
- demonstration/proof of concept;
- production code;
- widespread use and acceptance;
- commodity tool;
- embedding.

In the conceptual stage, the basic purpose of the code is defined, the physics to be simulated is identified, and proper algorithms are selected and coded. The many possible algorithms are compared, and the best is kept. A run during this stage may take weeks or months to complete. A few of these runs may even form the core of a PhD thesis.

The demonstration stage consists of several large-scale runs that are compared to experiments or analytical solutions. As before, a run during this stage may take weeks or months to complete. Typically, during this stage the relevant time-consuming parts of the code are optimized for speed.

Once the basic code is shown to be useful, it may be adopted for production runs. This implies extensive benchmarking for relevant applications, quality assurance, bookkeeping of versions, manuals, seminars, etc. For commercial software, this phase is also referred to as the industrialization of a code. It is typically driven by highly specialized projects that qualify the code for a particular class of simulations, e.g. air conditioning or external aerodynamics of cars.

If the code is successful and can provide a simulation capability not offered by competitors, the fourth phase, i.e. widespread use and acceptance, will follow naturally. An important shift is then observed: the 'missionary phase' (why do we need this capability?) suddenly transitions into a 'business as usual' phase (how could we ever design anything without this capability?). The code becomes an indispensable tool in industrial research, development, design and analysis. It forms part of the widely accepted body of 'best practices' and is regarded as commercial off-the-shelf (COTS) technology.

One can envision a fifth phase, where the code is embedded into a larger module, e.g. a control device that 'calculates on the fly' based on measurement input. The technology embodied by the code has then become part of the common knowledge and the source is freely available.

The time from conception to widespread use can span more than two decades. During this time, computing power will have increased by a factor of 1:10 000 (two decades of doubling every 18 months give roughly 2^(20/1.5) ≈ 10^4). Moreover, during a decade, algorithmic advances and better coding will improve performance by at least another factor of 1:10. Let us consider the role of parallel computing in light of these advances.

During the demonstration stage, runs may take weeks or months to complete on the largest machine available at the time. This places heavy emphasis on parallelization. Given that optimal performance is key, and massive parallelism seems the only possible way of solving the problem, distributed-memory parallelism on O(10^3) processors is perhaps the only possible choice. The figure of O(10^3) processors is derived from experience: even as a high-end user with sometimes highly visible projects, the author has never been able to obtain a larger number of processors with consistent availability in the last two decades. Moreover, no improvement is foreseeable in the future.
The main reason lies in the usage dynamics of large-scale computers: once online, a large audience requests time on the machine, thereby limiting the maximum number of processors available on a regular basis for production runs.

Once the code reaches production status, a shift in emphasis becomes apparent. More and more 'options' are demanded, and these have to be implemented in a timely manner. Another five years have passed, and by this time processors have become faster (and memory has increased) by a further factor of 1:10, implying that the same run that used to take O(10^3) processors can now be run on O(10^2) processors. Given this relatively small number of processors, and the time constraints for new options/variants, shared-memory parallelism becomes the most attractive option.

The widespread acceptance of a successful code will only accentuate the emphasis on quick implementation of options and user-specific demands. Widespread acceptance also implies that the code will no longer run exclusively on supercomputers, but will migrate to high-end servers and ultimately PCs. The code has now been in production for at least 5 years, implying that computing power has increased again by another factor of 1:10. The same run that used to take O(10^3) processors in the demonstration stage can now be run using O(10) processors, and soon will be within reach of O(1) processors. Given that user-specific demands dominate at this stage, and that the developers are now catering to a large user base working mostly on low-end machines, parallelization diminishes in importance, even to the point of completely disappearing as an issue. As parallelization implies extra time devoted to coding, thereby hindering fast code development, it may be removed from consideration at this stage.

One could consider a fifth phase, 20 years into the life of the code. The code has become an indispensable commodity tool in the design and analysis process, and is run thousands of times per day. Each of these runs is part of a stochastic analysis or optimization loop, and is performed on a commodity chip-based, uni-processor machine. Moore's law has effectively removed parallelism from the code. Figure 15.30 summarizes the life cycle of typical scientific computing codes.

Figure 15.30. Life cycle of scientific computing codes (number of processors and number of users plotted over the stages concept, demo, production, wide use, COTS and embedded).

15.7.2. EXAMPLES

Let us consider two examples where the life cycle of codes described above has become apparent.

15.7.2.1. External missile aerodynamics

The first example considers aerodynamic force and moment predictions for missiles. Worldwide, approximately 100 new missiles or variations thereof appear every year. In order to assess their flight characteristics, the complete force and moment data for the expected flight envelope must be obtained. Simulations of this type based on the Euler equations require approximately O(10^6-10^7) elements, special limiters for supersonic flows, semi-empirical estimation of viscous effects and numerous specific options such as transpiration boundary conditions, modelling of control surfaces, etc. The first demonstration/feasibility studies took place in the early 1980s. At that time, it took the fastest production machine of the day (Cray-XMP) a night to compute such flows.
The codes used were based on structured grids (Chakravarthy and Szema (1987)), as the available memory was small compared to the number of gridpoints. The increase of memory, together with the development of codes based on unstructured (Mavriplis (1991b), Luo et al. (1994)) or adaptive Cartesian grids (Melton et al. (1993), Aftosmis et al. (2000)), as well as faster, more robust solvers (Luo et al. (1998)), allowed for a high degree of automation. At present, external missile aerodynamics can be computed on a PC in less than an hour, and runs are carried out daily by the thousands for envelope scoping and simulator input on PC clusters (Robinson (2002)). Figure 15.31 shows an example.

Figure 15.31. External missile aerodynamics.

15.7.2.2. Blast simulations

The second example considers pressure loading predictions for blasts. Simulations of this type based on the Euler equations require approximately O(10^6-10^8) elements, special limiters for transient shocks, and numerous specific options such as links to damage-prediction post-processors. The first demonstration/feasibility studies took place in the early 1990s (Baum and Löhner (1991), Baum et al. (1993, 1995, 1996)). At that time, it took the fastest available machine (Cray-C90 with special memory) several days to compute such flows. The increase of processing power via shared-memory machines during the past decade has allowed for a considerable increase in problem size, physical realism via coupled CFD/CSD runs (Löhner and Ramamurti (1995), Baum et al. (2003)) and a high degree of automation. At present, blast predictions with O(2 x 10^6) elements can be carried out on a PC in a matter of hours (Löhner et al. (2004c)), and runs are carried out daily by the hundreds for maximum possible damage assessment on networks of PCs. Figure 15.32 shows the results of such a prediction based on genetic algorithms for a typical city environment (Togashi et al. (2005)). Each dot represents an end-to-end run (grid generation of approximately 1.5 million tetrahedra, blast simulation with an advanced CFD solver, damage evaluation), which takes approximately 4 hours on a high-end PC. The scale denotes the estimated damage produced by the blast at the given point. This particular run was done on a network of PCs and is typical of the migration of high-end applications to PCs due to Moore's law.

Figure 15.32. Maximum possible damage assessment for an inner city.

15.7.3. THE CONSEQUENCES OF MOORE'S LAW

The statement that parallel computing diminishes in importance as codes mature is predicated on two assumptions:

- the doubling of computing power every 18 months will continue;
- the total number of operations required to solve the class of problems the code was designed for has an asymptotic (finite) value.

The second assumption may seem the most difficult to accept. After all, a natural side effect of increased computing power has been the increase in problem size (grid points, material models, time of integration, etc.). However, for any class of problem there is an intrinsic limit for the problem size, given by the physical approximation employed. Beyond a certain point, the physical approximation does not yield any more information. Therefore, we may have to accept that parallel computing diminishes in importance as a code matures. This last conclusion does not in any way diminish the overall significance of parallel computing.
Parallel computing is an enabling technology of vital importance for the development of new high-end applications. Without it, innovation would seriously suffer. On the other hand, without Moore's law many new code developments would appear unjustified: if computing time did not decrease in the future, the range of applications would soon be exhausted. CFD developers worldwide have always subconsciously assumed Moore's law when developing improved CFD algorithms and techniques.

16. SPACE-MARCHING AND DEACTIVATION

For several important classes of problems, the propagation behaviour inherent in the PDEs being solved can be exploited, leading to considerable savings in CPU requirements. Examples where this propagation behaviour can lead to faster algorithms include:

- detonation: no change to the flowfield occurs ahead of the detonation wave;
- supersonic flows: a change of the flowfield can only be influenced by upstream events, but never by downstream disturbances; and
- scalar transport: a change of the transported variable can only occur in the downstream region, and only if a gradient in the transported variable or a source is present.

The present chapter shows how to combine physics and data structures to arrive at faster solutions. Heavy emphasis is placed on space-marching, where these techniques have reached considerable maturity. However, the concepts covered are generally applicable.

16.1. Space-marching

One of the most efficient ways of computing supersonic flowfields is via so-called space-marching techniques. These techniques make use of the fact that in a supersonic flowfield no information can travel upstream. Starting from the upstream boundary, the solution is obtained by marching in the downstream direction, obtaining the solution for the next downstream plane (for structured (Kutler (1973), Schiff and Steger (1979), Chakravarthy and Szema (1987), Matus and Bender (1990), Lawrence et al. (1991)) or semi-structured (McGrory et al. (1991), Soltani et al. (1993)) grids), subregion (Soltani et al. (1993), Nakahashi and Saitoh (1996), Morino and Nakahashi (1999)) or block. In the following, we will denote by a subregion a narrow band of elements, and by a block a larger region of elements (e.g. one-fifth of the mesh). The updating procedure is repeated until the whole field has been covered, yielding the desired solution.

In order to estimate the possible savings in CPU requirements, let us consider a steady-state run. Using local timesteps, it will take an explicit scheme approximately O(n_s) steps to converge, where n_s is the number of points in the streamwise direction. The total number of operations will therefore be O(n_t · n_s^2), where n_t is the average number of points in the transverse planes. Using space-marching, we have, ideally, O(1) steps per active domain, implying a total work of O(n_t · n_s). The gain in performance could therefore approach O(1:n_s) for large n_s. Such gains are seldom realized in practice, but it is not uncommon to see gains in excess of 1:10.
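Written out compactly, the operation-count comparison reads as follows (the symbols W_tm and W_sm for the total work of time-marching and space-marching are introduced here only for convenience):

```latex
W_{\mathrm{tm}} \approx \underbrace{n_t\,n_s}_{\text{points}}\cdot\underbrace{O(n_s)}_{\text{steps}}
              = O\!\left(n_t\,n_s^{2}\right), \qquad
W_{\mathrm{sm}} \approx n_t\,n_s\cdot O(1) = O\!\left(n_t\,n_s\right), \qquad
\frac{W_{\mathrm{tm}}}{W_{\mathrm{sm}}} = O(n_s).
```

For a mesh with, say, n_s ≈ 100 points in the streamwise direction the ideal gain would thus be of the order of 100:1; as noted above, the gains actually observed are closer to 10:1.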
Of the many possible variants, the space-marching procedure proposed by Nakahashi and Saitoh (1996) appears to be the most general, and is treated here in detail. The method can be used with any explicit time-marching procedure, it allows for embedded subsonic regions, and it is well suited for unstructured grids, enabling a maximum of geometrical flexibility. The method works with a subregion concept (see Figure 16.1). The flowfield is only updated in the so-called active domain. Once the residual has fallen below a preset tolerance, the active domain is shifted. Should subsonic pockets appear in the flowfield, the active domain is changed appropriately.

Figure 16.1. Masking of points (maskp = 0-6: computed field, active domain, residual-monitor region and uncomputed field along the flow direction).

In the following, we consider computational aspects of Nakahashi and Saitoh's space-marching scheme and a blocking scheme, in order to make them as robust and efficient as possible without a major change in existing codes. The techniques are considered in the following order: masking of edges and points, renumbering of points and edges, grouping to avoid memory contention, extrapolation of the solution for new active points, treatment of subsonic pockets, proper measures for convergence, the use of space-marching within implicit, time-accurate solvers for supersonic flows, and macro-blocking.

16.1.1. MASKING OF POINTS AND EDGES

As seen in the previous chapters, any timestepping scheme requires the evaluation of fluxes, residuals, etc. These operations typically fall into two categories:

(a) point loops, which are of the form

    do ipoin=1,npoin
      do work on the point level
    enddo

(b) edge loops, which are of the form

    do iedge=1,nedge
      gather point information
      do work on the edge level
      scatter-add edge results to points
    enddo

The first loop is typical of unknown updates in multistage Runge–Kutta schemes, initialization of residuals or other point sums, pressure and speed of sound evaluations, etc. The second loop is typical of flux summations, artificial viscosity contributions, gradient calculations and the evaluation of the allowable timestep. For cell-based schemes, point loops are replaced by cell loops and edge loops are replaced by face loops. However, the nature of these loops remains the same. The bulk of the computational effort of any scheme is usually carried out in loops of the second type.

In order to decide where to update the solution, points and edges need to be classified or 'masked'. Many options are possible here, and we follow the notation proposed by Nakahashi and Saitoh (1996) (see Figure 16.1):

maskp=0: point in the downstream, uncomputed field;
maskp=1: point in the downstream, uncomputed field, connected to the active domain;
maskp=2: point in the active domain;
maskp=3: point of maskp=2, with connection to points of maskp=4;
maskp=4: point in the residual-monitor subregion of the active domain;
maskp=5: point in the upstream, computed field, with connection to the active domain;
maskp=6: point in the upstream, computed field.

The edges for which work has to be carried out then comprise all those for which at least one of the endpoints satisfies 0<maskp<6. These active edges are marked as maske=1, while all others are marked as maske=0.
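As an illustration, a minimal sketch of how the edge mask could be derived from the point mask is given below. The array names are assumptions for illustration; in particular, inpoed(1:2,iedge) is taken to hold the two endpoints of an edge.

```fortran
! Mark as active every edge with at least one endpoint with 0 < maskp < 6.
subroutine mask_edges(nedge, inpoed, maskp, maske)
  implicit none
  integer, intent(in)  :: nedge, inpoed(2,nedge), maskp(*)
  integer, intent(out) :: maske(nedge)
  integer :: iedge, ip1, ip2
  do iedge = 1, nedge
    ip1 = inpoed(1,iedge)
    ip2 = inpoed(2,iedge)
    if ((maskp(ip1).gt.0 .and. maskp(ip1).lt.6) .or. &
        (maskp(ip2).gt.0 .and. maskp(ip2).lt.6)) then
      maske(iedge) = 1
    else
      maske(iedge) = 0
    endif
  enddo
end subroutine mask_edges
```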
The easiest way to convert a time-marching code into a space- or domain-marching code is by rewriting the point and edge loops as follows.

Loop 1a:

    do ipoin=1,npoin
      if(maskp(ipoin).gt.0 .and. maskp(ipoin).lt.6) then
        do work on the point level
      endif
    enddo

Loop 2a:

    do iedge=1,nedge
      if(maske(iedge).eq.1) then
        gather point information
        do work on the edge level
        scatter-add edge results to points
      endif
    enddo
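As a concrete (and deliberately simplified) instance of loop 2a, the sketch below accumulates an edge difference into a point residual; in an actual flow solver the flux evaluation would take the place of the single line marked 'work on the edge level'. The array names (inpoed, geoed, unkno, rhspo) are again assumptions for illustration.

```fortran
! Masked edge loop: gather, work on the edge, scatter-add to the points.
subroutine masked_edge_loop(nedge, inpoed, maske, geoed, unkno, rhspo)
  implicit none
  integer, intent(in)    :: nedge, inpoed(2,nedge), maske(nedge)
  real(8), intent(in)    :: geoed(nedge), unkno(*)
  real(8), intent(inout) :: rhspo(*)
  integer :: iedge, ip1, ip2
  real(8) :: flux
  do iedge = 1, nedge
    if (maske(iedge).eq.1) then
      ip1  = inpoed(1,iedge)                          ! gather point information
      ip2  = inpoed(2,iedge)
      flux = geoed(iedge)*(unkno(ip2) - unkno(ip1))   ! work on the edge level
      rhspo(ip1) = rhspo(ip1) + flux                  ! scatter-add to points
      rhspo(ip2) = rhspo(ip2) - flux
    endif
  enddo
end subroutine masked_edge_loop
```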
