Applied Computational Fluid Dynamics Techniques - Wiley, Episode 2, Part 3

ADAPTIVE MESH REFINEMENT

Figure 14.9. Continued: (c)

14.5.2. SHOCK–OBJECT INTERACTION IN TWO DIMENSIONS

Figures 14.9(a)–(c) show a case taken from Baum and Löhner (1992). They show classic h-refinement for strongly unsteady flows at its best. For this class of problems a new mesh is required every five to seven timesteps, strict conservation of mass, momentum and energy during refinement is critical, and the introduction of dissipation due to information loss during interpolation when remeshing proves disastrous for accuracy. A maximum of six levels of refinement was specified for this case, yielding meshes that on average have 300 000 triangles and 100 000 points. Figures 14.9(a) and (b) show the mesh, mesh refinement levels and pressures for different times.

Figure 14.10. Shock–object interaction in three dimensions: (a), (b)

Observe the detail in the physics that is achievable through adaptation. Notice furthermore the small extent of the regions that require refinement as compared to the overall domain. The equivalent uniform mesh run would have required more than two orders of magnitude more elements, CPU time and memory, pushing the limits of available supercomputers. The comparison to experimental results, given in Figure 14.9(c), reveals that very accurate results with a minimum of degrees of freedom are indeed achieved using adaptive grid refinement for this class of problems.

Figure 14.11. Shock–structure interaction: (a) building definition; (b) surface mesh and pressure; (c) mesh and pressure in plane

Figure 14.12. Object falling into supersonic free stream

14.5.3. SHOCK–OBJECT INTERACTION IN THREE DIMENSIONS

Figures 14.10(a)–(b) show a case taken from Baum and Löhner (1991) and Löhner and Baum (1992). The object under consideration is a common main battlefield tank.
A maximum of two layers of refinement was specified close to the tank, whereas only one level of refinement was employed farther away. The original, unrefined, but strongly graded mesh consisted of approximately 100 000 tetrahedra and 20 000 points. During the run, a mesh change (refinement and coarsening) occurred every five timesteps, and the mesh size increased to approximately 1.6 million tetrahedra and 280 000 points. This represents an increase factor of 1:16. Although seemingly high, the corresponding global h-refinement (two levels, each subdividing every tetrahedron 1:8) would have resulted in a 1:64 size increase. A second important factor is that most of the elements of the original mesh are close to the body, where most of the refinement takes place. Figures 14.10(a) and (b) show surface gridding and pressure contours at two selected times during the run. The extent of mesh refinement is clearly discernible, as well as the location and interaction of shocks.

14.5.4. SHOCK–STRUCTURE INTERACTION

Figures 14.11(a)–(c) show a typical shock–structure interaction case. The building under consideration is shown in Figure 14.11(a). One layer of refinement was specified wherever the physics required it. The pressures and grids obtained at the surface and at planes at a given time are shown in Figures 14.11(b) and (c). The mesh had approximately 60 million tetrahedra.

14.5.5. OBJECT FALLING INTO SUPERSONIC FREE STREAM IN TWO DIMENSIONS

The problem statement is as follows: an object is placed in a cavity surrounded by a free stream at M∞ = 1.5. After the steady-state solution is reached (time T = 0.0), a body motion is prescribed, and the resulting flowfield disturbance is computed. Adaptive remeshing was performed every 100 timesteps initially, while at later times the grid was modified every 50 timesteps. One level of global h-refinement was used to accelerate the grid regeneration. The maximum stretching ratio specified was S = 5.0.
Figure 14.12 shows different stages during the computation at times T = 60 and T = 160. One can clearly see how the location and strength of the shocks change due to the motion of the object. Notice how the directionality of the flow features is reflected in the mesh.

15 EFFICIENT USE OF COMPUTER HARDWARE

However clever an algorithm may be, it has to run efficiently on the available computer hardware. Each type of computer, from the PC to the fastest massively parallel machine, has its own shortcomings that must be accounted for when developing both the algorithms and the simulation code. The present section assumes that the algorithm has been selected, and identifies the main issues that must be addressed in order to achieve good performance on the most common types of computers. The main types of computer platforms currently being used are as follows.

(a) Personal computers. Although perhaps not considered a serious analysis tool even a decade ago, personal computers can already be used cost-effectively for 3-D simulations. In fact, many applications where CPU time is not a constraining factor are currently being carried out on PCs. Most CFD software companies report higher revenues from PC platforms than from all other platforms combined. High-end PCs (4 Gbytes of RAM, 120 GFLOPS graphics card) are ideal tools for simulations. We see this as one more proof of the theme that has been repeated so often in this book: a CFD run is more than just CPU time. If this were not so, vector machines would have become the dominant type of computer. Rather, a run consists of problem definition, grid generation, flow solver execution and visualization. High-end PCs combine a relatively fast CPU with good visualization hardware, which makes it possible to cut down the most expensive cost component of any run: man-hours.

(b) Vector machines.
These machines achieve higher speeds by splitting up arithmetic operations (fetch, align, add, multiply, store, etc.), performing each on different data items concurrently. The assumption made is that the same basic operation(s) have to be performed on a relatively large number of data items. These data items can be thought of as vectors, hence the name. As an example, consider the operation D=C*(A+B). While the central CPU fetches the data from memory for the ith item, it may align the data for item i + 1, add two numbers for item i + 2, multiply numbers for item i + 3 and store the results for item i + 4. This would yield a speedup of 1:4. In practice, many more operations than the ones described above are required even to add two numbers. Hence, speedups of about one order of magnitude are achievable (1:14 on the Cray-X or NEC-SX series).

(c) Single instruction multiple data (SIMD) machines. Here the assumption made is that all data items (e.g. elements, points, etc.) will be subject to the same arithmetic operations. In order to go beyond the one order of magnitude speedup of vector machines, thousands of processors are combined. Each processor performs the same task on a different piece of data. While this type of machine did not succeed when based on conventional chips, high-end graphics cards are increasingly being used in this mode (Hagen et al. (2006), LeGresley et al. (2007)).

Applied Computational Fluid Dynamics Techniques: An Introduction Based on Finite Element Methods, Second Edition. Rainald Löhner © 2008 John Wiley & Sons, Ltd. ISBN: 978-0-470-51907-3

(d) Multiple instruction multiple data (MIMD) machines. In this case different arithmetic operations may be performed on different processors. This circumvents some of the restrictions posed by the SIMD assumption that all processors are performing the same arithmetic operation.
On the other hand, the operating system software required to keep these machines functioning is much more involved and sensitive than that required for SIMD machines.

The emerging architecture for future machines is a generalization of the MIMD machine, where some processors may be based on commodity, general-purpose chips, others on reduced instruction set chips (RISC-chips), others on powerful vector-processors, and some have SIMD architecture. An example of such a machine is the Cray-T3E, which combines a Cray-T90 vector supercomputer with up to 2056 Alpha-Chip-based processors. An architecture like this, which combines scalar, vector and distributed memory parallel processing, requires the programmer to take into consideration all the individual aspects encountered in each of these architectures.

15.1. Reduction of cache-misses

Indirect addressing (i/a) is required for any unstructured field solver, as different data types (point-, element-, face-, edge-based) have to be related to one another. In loops over elements or edges, data is accessed in an unordered way. Most RISC-based machines, as well as some vector machines, will store the data required by the CPU ahead of time in a cache. This cache allows the data to be fetched by the CPU much more rapidly than data stored in memory or on disk. The data is brought into the cache in the same order in which it is stored in memory or on disk. If the data required by the CPU for subsequent arithmetic operations is not close enough to fit into the cache, this piece of information will have to be fetched from memory or disk. This is called a cache-miss. Depending on the frequency of cache-misses relative to CPU operations, a serious degradation in performance, often in excess of 1:10, can take place. The relative number of cache-misses invariably increases with problem size. The aim of the renumbering strategies considered in the present section is to minimize the frequency of cache-misses, i.e. to retard the degradation of performance with problem size. The main techniques considered are:

- array access in loops;
- renumbering of points to reduce the spread in memory of the items fetched by a single element or edge;
- reordering of the nodes in each element so that data is accessed in as uniform a way as possible within each element; and
- renumbering of elements, faces and edges so that data is accessed in as uniform a way as possible when looping over them.

15.1.1. ARRAY ACCESS IN LOOPS

Storing all the arrays required (elements, coordinates, unknowns, edges, etc.) in a way that is compatible with the way they are accessed within loops reduces cache-misses appreciably. To see why, consider the array containing the coordinates of the points: horizontal or flat storage à la coord(ndimn,npoin) would be the preferred choice for a workstation, whereas for some Crays the preferred choice would be vertical storage à la coord(npoin,ndimn).

Suppose that the difference vector (dx, dy, dz) of the two endpoints of an edge is required. This implies fetching six items and performing three arithmetic operations. For flat storage, the jumps in memory are given by

    Get x1=coord(1,ipoi1)    Jump to ipoi1
    Get y1=coord(2,ipoi1)    Jump by 1
    Get z1=coord(3,ipoi1)    Jump by 1
    Get x2=coord(1,ipoi2)    Jump to ipoi2
    Get y2=coord(2,ipoi2)    Jump by 1
    Get z2=coord(3,ipoi2)    Jump by 1

whereas for vertical storage the jumps are

    Get x1=coord(ipoi1,1)    Jump to ipoi1
    Get x2=coord(ipoi2,1)    Jump to ipoi2
    Get y1=coord(ipoi1,2)    Jump by mpoin
    Get y2=coord(ipoi2,2)    Jump to ipoi2
    Get z1=coord(ipoi1,3)    Jump by mpoin
    Get z2=coord(ipoi2,3)    Jump to ipoi2

The difference in the number of large jumps is clearly visible from this comparison: two for flat storage, six for vertical storage. For this reason, flat storage is recommended for any machine with cache. Note that, for codes written in C, the opposite index order holds, as the second index moves faster than the first one, so the same contiguous layout is obtained with coord[npoin][ndimn].

15.1.2. POINT RENUMBERING

Consider the evaluation of an edge RHS (the same basic principle applies to element-based or face-based solvers), given by the following loop.

Loop 1:

          do 1600 iedge=1,nedge
            ipoi1=lnoed(1,iedge)
            ipoi2=lnoed(2,iedge)
            redge=geoed(iedge)*(unkno(ipoi2)-unkno(ipoi1))
            rhspo(ipoi1)=rhspo(ipoi1)+redge
            rhspo(ipoi2)=rhspo(ipoi2)-redge
     1600 continue

The operations to be performed can be grouped into the following three major steps (see Chapter 8):

(a) gather point information into the edge;
(b) perform the required mathematical operations at edge level;
(c) scatter-add the edge RHS to the assembled point RHS.

The transfer of information to and from memory required in steps (a), (c) is proportional to the number of nodes in the edge (element, face) and the number of unknowns per node. If the nodes within each edge (element, face) are widely spaced in memory, cache-misses are likely to occur. If, on the other hand, all the points within an element are 'close' in memory, cache-misses are minimized. From these considerations, it becomes clear that cache-misses are directly linked to the bandwidth of the equivalent matrix system (or graph). Point renumbering to reduce bandwidths has been an important theme for many years in traditional finite element applications (Pissanetzky (1984), Zienkiewicz (1991)). The aim was to reduce the cost of the matrix inversion, which was considered to be the most expensive part of any finite element simulation.

Figure 15.1. Ordering of points for 2-D mesh: (a), (b)

The optimal renumbering of points in such a way that spatial (or near-neighbour) locality is mirrored in memory is a problem of formidable algorithmic complexity. Fortunately, most of the benefits of renumbering points are already obtained from near-optimal heuristic renumbering techniques.
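The link between the edge loop and the numbering of the points can be made concrete with a small experiment. The following is a minimal Python sketch, not the book's code: it mirrors the gather, edge-level operation and scatter-add steps of Loop 1 with 0-based indices, and measures the largest index difference between the two endpoints of any edge as a stand-in for bandwidth. The helper names `edge_rhs`, `max_index_spread` and `structured_edges`, and the choice of an nx by ny structured grid of points, are assumptions made purely for illustration.

```python
def edge_rhs(lnoed, geoed, unkno):
    """Edge-based RHS assembly: gather, edge-level operation, scatter-add."""
    rhspo = [0.0] * len(unkno)
    for iedge, (ipoi1, ipoi2) in enumerate(lnoed):
        # (a) gather point data, (b) operate at edge level
        redge = geoed[iedge] * (unkno[ipoi2] - unkno[ipoi1])
        # (c) scatter-add to the point RHS
        rhspo[ipoi1] += redge
        rhspo[ipoi2] -= redge
    return rhspo


def max_index_spread(lnoed):
    """Largest distance in point numbers between the endpoints of an edge."""
    return max(abs(ipoi2 - ipoi1) for ipoi1, ipoi2 in lnoed)


def structured_edges(nx, ny, x_fastest=True):
    """Edges of an nx-by-ny grid of points (hypothetical test mesh).

    x_fastest=True numbers the points along x first, otherwise along y first.
    """
    def number(i, j):
        return i + nx * j if x_fastest else j + ny * i

    edges = []
    for j in range(ny):
        for i in range(nx):
            if i + 1 < nx:
                edges.append((number(i, j), number(i + 1, j)))
            if j + 1 < ny:
                edges.append((number(i, j), number(i, j + 1)))
    return edges


if __name__ == "__main__":
    nx, ny = 20, 4
    print(max_index_spread(structured_edges(nx, ny, x_fastest=True)))   # 20
    print(max_index_spread(structured_edges(nx, ny, x_fastest=False)))  # 4
```

For a long, thin grid the two numberings give very different index spreads (nx against ny), although the assembled RHS is the same up to the permutation of the points. This is only an indicator: actual cache behaviour also depends on cache size and line length, which the sketch does not model.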
To see how most of these fast, near-optimal techniques work, consider the rectangular domain with a structured mesh shown in Figure 15.1. Numbering the points in the horizontal (Figure 15.1(a)) and vertical (Figure 15.1(b)) directions yields an average bandwidth of nx and ny, respectively. One should therefore aim to number the points in the direction normal to the longest graph depth. Based on this observation, several point renumbering techniques have been developed. To exemplify these techniques, the simple mesh shown in Figure 15.2 is considered.

Figure 15.2. Original mesh

15.1.2.1. Directional ordering

If the direction of maximal graph depth is known, one can simply order the points in this direction. This is perhaps the simplest (and fastest) possible renumbering, but implies that the problem class being addressed has a clear maximal graph depth direction that can easily be identified. For the mesh of Figure 15.2, renumbering in the x-direction yields the numbering shown in Figure 15.3.

Figure 15.3. Renumbering in the x-direction

15.1.2.2. Bin ordering

Given an arbitrary distribution of points, one may first place the points in bins of uniform size h. One can then identify, by ordering the number of bins in the x, y, z directions in ascending size i, j, k, the plane k that traverses space yielding the lowest bandwidth, i.e. the closest proximity in memory. Bins offer the advantage of high speed (very few operations are required, and most of these are easy to vectorize/parallelize) and simplicity. After obtaining the overall dimensions of the computational domain, bin ordering may be realized in two ways:

(1) Obtain the bin each point falls into; store the points into bins (e.g. using a linked list via lbin1(1:npoin),lbin2(1:npoin+1)); go through the bins, renumbering points;

(2) Obtain the bin each point falls into; assign a number to the point based on the bin it falls into (e.g. inumb=ibinx+nbinx*(ibiny-1)+nbinx*nbiny*(ibinz-1)); store the points in a heap list (based on the assigned number); retrieve the points from the heap list, renumbering points.

Bins are mostly used for grids with modest changes in element size. Figure 15.4 shows the bin ordering of points for the mesh from Figure 15.2.

15.1.2.3. Quad/octree ordering

For grids that exhibit large variations in element size, the bin ordering described above will yield suboptimal renumberings, as some bins will have many points that could be ordered in a better way. A way to circumvent this shortcoming is to allow for bins of variable size, i.e. allowing only a certain number of points within each bin (or regular region of [...]

...

    Edge type        CPU
    ...              6.90    1.00
    Triangle         3.65    1.89
    Tetrahedron      2.92    2.36
    2-ChainTetra     2.63    2.62
    Edge            13.4     1.00
    Triangle         7.21    1.85
    Tetrahedron      5.26    2.54
    2-ChainTetra     3.83    3.50
    Edge             4.92    1.00
    Triangle         3.13    1.57
    Tetrahedron      2.60    1.89

Table 15.9 Timings for the vector Laplacian

    Columns: npoin, Platform, Edge type...
    Platform entries: IBM-6000/530 (eight rows), Cray-YMP (three rows)

... following (see Figure 15.8):

Figure 15.8 Renumbering...

    (a) min(pt-#), edges 1 to nedge:    1 2 1 3 2 4 5 2 4 6 3 5 8 7 10 8 9 10 11 11
    (b) points 1 to npoin:              2 3 2 2 2 1 1 2 1 2 2 0
    (c) points 1 to npoin:              2 5 7 9 11 12 13 15 16 18 20 20
    (d) old edge-#, edges 1 to nedge:   1 3 2 5 8 4 11 6 9 7 12 10 14 13 16 17 15 18 19 20

... Family of 1-chains

    Type           Edges   Points   i/a/edge   i/a reduction
    Edge             1       2       6:1         1.00
    Edge-Chain       1       1       3:1         2.00
    Triangle         3       2       2:1         3.00
    DoubleTria       5       3       9:5         3.33
    Tetrahedron      6       3       3:2         4.00
    DoubleTetra      9       4       4:3         4.50
    QuintuTetra     12       5       5:4         4.80

Table 15.6 Family of 2-chains

    Type           Edges   Points   i/a/edge   i/a reduction
    Edge             1       2       6:1         1.00
    Triangle         2       1       3:2         4.00
    Tetrahedron      5       2       6:5         5.00

...
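The integer rows that survive from Figure 15.8 are consistent with a counting (bucket) sort of the edges by their smallest point number: row (b) matches the number of edges per point, row (c) the running sum of those counts, and row (d) the old edge numbers in their new order. The Python sketch below is one possible reading of that figure under this assumption, not a transcription of the book's algorithm; the function name `order_edges_by_min_point` and the constant `MIN_PT` are hypothetical, and point and edge numbers are kept 1-based to match the figure.

```python
# min(pt-#) for edges 1..20, copied from row (a) of Figure 15.8
MIN_PT = [1, 2, 1, 3, 2, 4, 5, 2, 4, 6, 3, 5, 8, 7, 10, 8, 9, 10, 11, 11]


def order_edges_by_min_point(min_pt, npoin):
    """Stable counting sort of edges by their smallest point number.

    Returns (count, running_sum, new_order), where new_order lists the old
    1-based edge numbers in the order in which they should be stored.
    """
    # Pass 1: count how many edges have each point as their minimum.
    count = [0] * npoin
    for ipoin in min_pt:
        count[ipoin - 1] += 1

    # Pass 2: running sum, i.e. where each point's group of edges ends.
    running_sum = []
    total = 0
    for c in count:
        total += c
        running_sum.append(total)

    # Pass 3: drop each edge into the next free slot of its point's group.
    next_slot = [0] + running_sum[:-1]
    new_order = [0] * len(min_pt)
    for iedge, ipoin in enumerate(min_pt, start=1):
        new_order[next_slot[ipoin - 1]] = iedge
        next_slot[ipoin - 1] += 1
    return count, running_sum, new_order


if __name__ == "__main__":
    count, running_sum, new_order = order_edges_by_min_point(MIN_PT, npoin=12)
    print(count)
    print(running_sum)
    print(new_order)
```

Run on the data of row (a) with npoin=12, the three returned lists can be compared directly with rows (b), (c) and (d). The sort needs only two passes over the edges and one over the points, so its cost grows linearly with mesh size.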
)*(upoi2-upoi1)
            redg2=edlap(iedge+1)*(upoi3-upoi2)
            redg3=edlap(iedge+2)*(upoi3-upoi1)
            redg4=edlap(iedge+3)*(upoi4-upoi1)
            redg5=edlap(iedge+4)*(upoi4-upoi2)
            redg6=edlap(iedge+5)*(upoi4-upoi3)
            rhspo(ipoi1)=rhspo(ipoi1)+redg1+redg3+redg4
            rhspo(ipoi2)=rhspo(ipoi2)-redg1+redg2+redg5
            rhspo(ipoi3)=rhspo(ipoi3)-redg2-redg3+redg6
            rhspo(ipoi4)=rhspo(ipoi4)-redg4-redg5-redg6
     1000 continue

This loop requires 12 i/a...

... the 2-D mesh shown in Figure 15.12.

Figure 15.12 Renumbering of edges for vectorization

    Group     1    2    3    4    5    6    7
    #Edges   10    9    7    9    3    3    2

The techniques discussed below consider:

- reordering the edges into groups so that point data is accessed once only in each edge group;
- balancing of groups with a switching algorithm;
- ...

... superedges

    Type           Edges   Points   i/a/edge   i/a reduction
    Edge             1       2       6:1         1.00
    V-Edges          2       3       9:2         1.33
    Triangle         3       3       3:1         2.00
    DoubleTria       5       4      12:5         2.50
    QuadruTria       9       6       2:1         3.00
    Tetrahedron      6       4       2:1         3.00
    DoubleTetra      9       5      15:9         3.60
    TripleTetra     12       6       3:2         4.00
    QuintuTetra     15       7      21:15        4.28

Figure 15.10 Superedges: Edge, V-Edges, Triangle, DoubleTria, QuadruTria, Tetra, DoubleTetra, TripleTetra, QuintuTetra

These superedges have been illustrated...

...
            redg3=edlap(iedge+2)*(upoi0-unkno(ipoi3))
            redg4=edlap(iedge+3)*(upoi0-unkno(ipoi4))
            redg5=edlap(iedge+4)*(upoi0-unkno(ipoi5))
            redg6=edlap(iedge+5)*(upoi0-unkno(ipoi6))
            rhspo(ipoi0)=redg1+redg2+redg3+redg4+redg5+redg6
            rhspo(ipoi1)=rhspo(ipoi1)-redg1
            rhspo(ipoi2)=rhspo(ipoi2)-redg2
            rhspo(ipoi3)=rhspo(ipoi3)-redg3
            rhspo(ipoi4)=rhspo(ipoi4)-redg4
            rhspo(ipoi5)=rhspo(ipoi5)-redg5
            rhspo(ipoi6)=rhspo(ipoi6)-redg6
     1000...

... 000, 10 000

    Edge type        CPU
    Edge            15.66    1.00
    Triangle        11.90    1.32
    Tetrahedron     11.55    1.36
    2-ChainTetra    11.30    1.39
    Edge            35.88    1.00
    Triangle        20.14    1.78
    Tetrahedron     17.43    2.06
    2-ChainTetra    14.13    2.54
    Edge             1.89    1.00
    Triangle         1.25    1.51
    Tetrahedron      1.16    1.63

... and Jameson (1983), Mavriplis and Jameson (1990), Mavriplis (1991b), Peraire et al. (1992a), Weatherill et al. (1993c), Luo et al. (1993)), and hourglass control operators...

... Laplacian operator was applied to a vector quantity of five variables, representative of dissipation operators for the compressible Euler equations (see Chapter 8...

Table 15.8 Timings for the scalar Laplacian

    Columns: npoin, Platform, Edge type, CPU...
    Platform entries: IBM-6000/530 (eight rows), Cray-YMP (three rows)

... However, such a specialized point renumbering may lead to new problems, such as large bandwidths and cache-misses. Therefore, the more conservative, but realistic, reduction factors...

Table 15.3 Family of stars

    Type           Edges   Points   i/a/edge   i/a reduction
    Edge             1       2       6:1         1.00
    2-star           2       3       9:2         1.33
    3-star           3       4       4:1         1.50
    4-star           4       5      15:4         1.60
    5-star           5       6      18:5         1.67
    6-star           6       7       7:2         1.71

... lnod2(4,4), lnod3(4,4) as

          data lnod2/ 0,3,4,2,
                      4,0,1,3,
                      2,4,0,1,
                      3,1,2,0/
          data lnod3/ 0,4,2,3,
                      3,0,4,1,
                      4,1,0,2,
                      2,3,1,0/

RE2. Loop over the elements, reordering the nodes:
- Get...

... achieved.

Table 15.4. Family of superedges

    Type           Edges   Points   i/a/edge   i/a reduction
    Edge             1       2       6:1         1.00
    V-Edges          2       3       9:2         1.33
    Triangle         3       3       3:1         2.00
    DoubleTria ...
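The ratios tabulated for superedges follow a simple count. The Python sketch below is a check of that arithmetic rather than anything taken from the book's code: it assumes three indirect-addressing operations per point, a value inferred here from the Edge row (2 points, 6 i/a), and the names `IA_PER_POINT`, `ia_per_edge` and `ia_reduction` are hypothetical.

```python
from fractions import Fraction

# Assumption inferred from the "Edge" row: 2 points cost 6 i/a, i.e. 3 per point.
IA_PER_POINT = 3


def ia_per_edge(nedges, npoints):
    """Indirect-addressing operations per edge for a group of edges."""
    return Fraction(IA_PER_POINT * npoints, nedges)


def ia_reduction(nedges, npoints):
    """Reduction in i/a relative to a single edge (1 edge, 2 points)."""
    return ia_per_edge(1, 2) / ia_per_edge(nedges, npoints)


if __name__ == "__main__":
    # (edges, points) pairs as listed for the superedges
    superedges = {
        "Edge": (1, 2),
        "V-Edges": (2, 3),
        "Triangle": (3, 3),
        "DoubleTria": (5, 4),
        "Tetrahedron": (6, 4),
        "QuintuTetra": (15, 7),
    }
    for name, (nedges, npoints) in superedges.items():
        ratio = ia_per_edge(nedges, npoints)
        print(f"{name:12s} {ratio.numerator}:{ratio.denominator}  "
              f"{float(ia_reduction(nedges, npoints)):.3f}")
```

Under this assumption the reduction simplifies to 2*edges/points. For the V-Edges entry (2 edges, 3 points) the count gives 9:2, the ratio consistent with the tabulated reduction of 1.33. `Fraction` prints ratios in lowest terms, so the tabulated 15:9 and 21:15 would appear as 5:3 and 7:5.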
