Figure 15.14. Renumbering for vectorizability and data locality

Figure 15.15. Point range accessed by edge groups

15.2.4. SWITCHING ALGORITHM

A typical one-processor register-to-register vector machine works at its peak efficiency if the vector lengths are multiples of the number of vector registers. For a traditional Cray, this number was 64. An easy way to achieve higher rates of performance without a large change in architecture is the use of multiple vector pipes. A Cray-C90, for example, has two vector pipes per processor, requiring vector lengths that are multiples of 128 for optimal hardware use. For a NEC-SX4, which has four vector pipes per processor, this number rises to 256, and for the NEC-SX8 to 512. The most common way to use multiple processors in these systems is through autotasking, i.e. by splitting the work at the DO-loop level. Typically, this is done by simply setting a flag in the compiler. The acceptable vector length for optimal machine utilization now has to increase again according to the number of processors. For the Cray-C90 with 16 processors, the vector lengths have to be multiples of 2048. The message is clear: for shared- and fixed-memory machines, make the vectors as long as possible, while achieving lengths that are multiples of a possibly large, machine-dependent number.

Consider a typical unstructured mesh. Suppose that the maximum number of edges surrounding a particular point is mesup. The minimum number of groups required for a vectorizable scatter-add pass over the edges is then given by mesup. On the other hand, the average number of edges surrounding a point, aesup, will in most cases be lower. Typical numbers are aesup=6 for linear triangles, aesup=22 for linear tetrahedra, aesup=4 for bilinear quads and aesup=8 for trilinear bricks. For a traditional edge renumberer like the one presented in the previous section, the result will be a grouping or colouring of edges consisting of aesup large groups of approximately nedge/aesup edges each, and some very small groups for the remaining edges. While this way of renumbering is optimal for memory-to-memory machines, better ways are possible for register-to-register machines.

Consider the following 'worst-case scenario' for a triangular mesh of nedge=391 edges with a grouping of edges according to lgrou(1:ngrou)=65,65,65,65,65,65,1. For a machine with 64 vector registers this is equivalent to the grouping lgrou(1:ngrou)=64,64,64,64,64,64,1,1,1,1,1,1,1, obtained by setting mvecl=64 in C1 above, which is clearly suboptimal: the vector registers are loaded seven times with only one entry. One option is to treat the remaining seven edges in scalar mode. Clearly, the optimal way to group the edges is lgrou(1:ngrou)=64,64,64,64,64,64,7, which is possible given the first grouping. On the other hand, a simple forward or even a backward-forward pass renumbering will in most cases not yield this desired grouping.
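For illustration, the basic grouping pass can be sketched in a few lines of Fortran: sweep repeatedly over the edges not yet assigned, add an edge to the current group only if neither of its points has already been touched by that group, and close the group once mvecl edges have been collected. The sketch below is a minimal, self-contained illustration and is not the renumbering routine C1 referred to above; the array names (lnoed, lgrou, lpoin, mvecl) follow the notation of the text, while the small example mesh and the driver program are hypothetical.

      ! Minimal sketch of a greedy vectorizable edge grouping (colouring):
      ! within each group no point is touched twice, and the group size is
      ! capped at mvecl. Illustrative only; array names follow the text.
      program edge_grouping
      implicit none
      integer, parameter :: npoin = 5, nedge = 7, mvecl = 3
      integer :: lnoed(2,nedge)     ! the two points of each edge
      integer :: lpoin(npoin)       ! points touched by the current group
      integer :: lgrou(nedge)       ! size of each group
      logical :: ltake(nedge)       ! edge already assigned to a group?
      integer :: ngrou, nleft, iedge, ipoi1, ipoi2, nvecl

      ! a small, hypothetical example mesh given as point pairs
      lnoed(:,1) = [1,2] ; lnoed(:,2) = [2,3] ; lnoed(:,3) = [3,4]
      lnoed(:,4) = [4,5] ; lnoed(:,5) = [1,3] ; lnoed(:,6) = [2,4]
      lnoed(:,7) = [3,5]

      ltake = .false. ; ngrou = 0 ; nleft = nedge
      do while (nleft .gt. 0)
        ngrou = ngrou + 1
        lpoin = 0
        nvecl = 0
        do iedge = 1, nedge
          if (ltake(iedge)) cycle
          ipoi1 = lnoed(1,iedge) ; ipoi2 = lnoed(2,iedge)
          if (lpoin(ipoi1) .eq. 0 .and. lpoin(ipoi2) .eq. 0) then
            lpoin(ipoi1) = 1 ; lpoin(ipoi2) = 1    ! mark the points as used
            ltake(iedge) = .true.
            nvecl = nvecl + 1
            nleft = nleft - 1
            if (nvecl .eq. mvecl) exit             ! cap the group length
          endif
        enddo
        lgrou(ngrou) = nvecl
      enddo
      print '(a,i0)',  'number of groups: ', ngrou
      print '(a,10i3)','group sizes:      ', lgrou(1:ngrou)
      end program edge_grouping

On a mesh where the natural colouring gives groups of 65 independent edges, capping the groups at mvecl=64 in this way leaves a tail of short groups, which is exactly the situation the switching algorithm below is designed to repair.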
One possible way of achieving this optimal grouping is the following switching algorithm, which is run after a first colouring has been obtained (see Figure 15.16).

Figure 15.16. Switching algorithm: (a) initial grouping of edges; (b) switching sequence

The idea is to interchange edges at the end of the list with some edge inside one of the vectorizable groups without incurring a multiple point access. Algorithmically, this is accomplished as follows.

S1.  Determine the last group of edges with acceptable length; this group is denoted by ngro0; the edges in the next group will be nedg0=lgrou(ngro0)+1 to nedg1=lgrou(ngro0+1), and the remaining edges will be at locations nedg1+1 to nedge;
S2.  Transcribe edges nedg1+1 to nedge into an auxiliary storage list;
S3.  Set a maximum desired vector length mvecl;
S4.  Initialize a point array: lpoin=0;
S5.  For all the points belonging to edges nedg0 to nedg1: set lpoin=1;
S6.  Set the current group vector length nvecl=nedg1-nedg0+1;
S7.  Set the remaining edge counter ierem=nedg1;
S8.  Loop over the large groups of edges igrou=1,ngro0;
S9.  Initialize a second point array: lpoi1=0;
S10. For all the points belonging to the edges in this group: set lpoi1=1;
S11. Loop over the edges iedge in this group:
       If lpoin=0 for all the points touching iedge, then:
         - Set lpoi1=0 for the points touching iedge;
         - If lpoi1=0 for all the points touching ierem, then:
             - Set lpoin=1 for the points touching iedge;
             - Set lpoi1=1 for the points touching ierem;
             - Interchange the edges;
             - Update the remaining edge counter: ierem=ierem+1;
             - Update the current vector length counter: nvecl=nvecl+1;
             - If nvecl.eq.mvecl: exit the edge loop (Goto S10);
           Else:
             - Re-set lpoi1=1 for the points touching iedge;
           Endif
       Endif
     End of the loop over the edges;
     End of the loop over the large groups of edges;
S12. Store the group counter: lgrou(ngrou)=nenew;
S13. If unmarked edges remain (nenew.ne.nedge): reset the counters and Goto S1.

All of the renumberings described run in linear time and are straightforward to code. Thus, they are ideally suited for applications requiring frequent mesh changes, e.g. adaptive h-refinement (Löhner and Baum (1992)) or remeshing (Löhner (1990)) for transient problems. Furthermore, although the ideas were exemplified on edge-based solvers, they carry over to element- or face-based solvers.

15.2.5. REDUCED I/A LOOPS

Suppose the edges are defined in such a way that the first point always has a lower point number than the second point. Furthermore, assume that the first point of each edge increases with stride one as one loops over the edges. In this case, Loop 1 (or the inner loop of Loop 2) may be rewritten as follows.

Loop 1a:
      do 1600 iedge=nedg0,nedg1
        ipoi1=kpoi0+iedge
        ipoi2=lnoed(2,iedge)
        redge=geoed(iedge)*(unkno(ipoi2)-unkno(ipoi1))
        rhspo(ipoi1)=rhspo(ipoi1)+redge
        rhspo(ipoi2)=rhspo(ipoi2)-redge
 1600 continue

Compared to Loop 1, the number of i/a fetch and store operations has been halved, while the number of FLOPS remains unchanged. Moreover, unlike stars, superedges and chains, the basic connectivity arrays remain unchanged, implying that a progressive rewrite of codes is possible. In the following, a point and edge renumbering algorithm is described that seeks to maximize the number of loops of this kind that can be obtained for tetrahedral grids.
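Before proceeding, the addressing in Loop 1a can be made concrete with the following self-contained sketch, which assembles a Laplacian-type right-hand side over a small chain of edges twice, once with full indirect addressing (as in Loop 1) and once in the reduced i/a form, and verifies that both give the same result. It assumes, as stated above, that the edges are ordered so that the first point of each edge increases with stride one; the array names mirror those of the text, but the example mesh, the data values and the driver program are purely illustrative.

      ! Sketch: reduced i/a addressing versus full indirect addressing for
      ! an edge-based RHS assembly. Assumes the edges are pre-ordered so
      ! that lnoed(1,iedge) increases with stride one within the group.
      program reduced_ia_demo
      implicit none
      integer, parameter :: npoin = 6, nedge = 5
      integer :: lnoed(2,nedge), iedge, ipoi1, ipoi2, kpoi0
      real    :: geoed(nedge), unkno(npoin)
      real    :: rhsa(npoin), rhsb(npoin), redge

      ! a 1-D chain of edges: (1,2),(2,3),...,(5,6); first points have stride one
      do iedge = 1, nedge
        lnoed(1,iedge) = iedge
        lnoed(2,iedge) = iedge + 1
        geoed(iedge)   = 1.0
      enddo
      unkno = [0.0, 1.0, 3.0, 6.0, 10.0, 15.0]

      ! Loop 1 form: full indirect addressing (two i/a fetches per edge)
      rhsa = 0.0
      do iedge = 1, nedge
        ipoi1 = lnoed(1,iedge)
        ipoi2 = lnoed(2,iedge)
        redge = geoed(iedge)*(unkno(ipoi2)-unkno(ipoi1))
        rhsa(ipoi1) = rhsa(ipoi1) + redge
        rhsa(ipoi2) = rhsa(ipoi2) - redge
      enddo

      ! Loop 1a form: the first point is obtained directly from the loop
      ! counter, halving the indirect fetches and stores
      rhsb  = 0.0
      kpoi0 = lnoed(1,1) - 1
      do iedge = 1, nedge
        ipoi1 = kpoi0 + iedge
        ipoi2 = lnoed(2,iedge)
        redge = geoed(iedge)*(unkno(ipoi2)-unkno(ipoi1))
        rhsb(ipoi1) = rhsb(ipoi1) + redge
        rhsb(ipoi2) = rhsb(ipoi2) - redge
      enddo

      print '(a,es10.3)', 'max difference: ', maxval(abs(rhsa-rhsb))
      end program reduced_ia_demo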
15.2.5.1. Point renumbering

In order to obtain as many uniformly accessed edge groups as possible, the number of edges attached to a point should decrease uniformly with the point number. In this way, the probability of obtaining an available second point to avoid memory contention in each vector group is maximized. The following algorithm achieves such a point renumbering:

Initialization:
  - From the edge-connectivity array lnoed: obtain the points that surround each point;
  - Store the number of points surrounding each point: lpsup(1:npoin);
  - Set npnew=0;
Point renumbering:
  - while(npnew.ne.npoin):
    - Obtain the point ipmax with the maximum value of lpsup(ip);
    - npnew=npnew+1              ! update new point counter
    - lpnew(npnew)=ipmax         ! store the new point
    - lpsup(ipmax)=0             ! update lpsup
    - do: for all points jpoin surrounding ipmax:
        - lpsup(jpoin)=max(0,lpsup(jpoin)-1)
    - enddo
  - endwhile

This point renumbering is illustrated in Figures 15.17(a) and (b) for a small 2-D example. The original point numbering, together with the resulting list of edges, is shown in Figure 15.17(a). Renumbering the points according to the maximal number of available neighbours results in the list of points and edges shown in Figure 15.17(b). One can see that the number of edges attached to a point decreases steadily, making it possible to achieve large vector lengths for loops of type 1a (see above).

Figure 15.17. (a) Original and (b) maximal connectivity point renumbering

15.2.5.2. Edge renumbering

Once the points have been renumbered, the edges are reordered according to point numbers as described above (section 15.1.4). Thereafter, they are grouped into vector groups to avoid memory contention (sections 15.2.1-15.2.3). In order to achieve the maximum (ordered) vector length possible, the highest point number is processed first. In this way, memory contention is delayed as much as possible. The resulting vector groups obtained in this way for the small 2-D example considered above are shown in Figure 15.18.

Figure 15.18. Reordering into vector groups

It is clear that not all of these groups will lead to a uniform-stride access of the first point. Such groups are still processed as before in Loop 1. The edge loop then takes the following final form.

Loop 2a:
      do 1400 ipass=1,npass
        nedg0=edpas(ipass)+1
        nedg1=edpas(ipass+1)
        kpoi0=lnoed(1,nedg0)-nedg0
        idiff=lnoed(1,nedg1)-nedg1-kpoi0
        if(idiff.ne.0) then
cdir$ ivdep
          do 1600 iedge=nedg0,nedg1
            ipoi1=lnoed(1,iedge)
            ipoi2=lnoed(2,iedge)
            redge=geoed(iedge)*(unkno(ipoi2)-unkno(ipoi1))
            rhspo(ipoi1)=rhspo(ipoi1)+redge
            rhspo(ipoi2)=rhspo(ipoi2)-redge
 1600     continue
        else
cdir$ ivdep
          do 1610 iedge=nedg0,nedg1
            ipoi1=kpoi0+iedge
            ipoi2=lnoed(2,iedge)
            redge=geoed(iedge)*(unkno(ipoi2)-unkno(ipoi1))
            rhspo(ipoi1)=rhspo(ipoi1)+redge
            rhspo(ipoi2)=rhspo(ipoi2)-redge
 1610     continue
        endif
 1400 continue

As an example, an F117 configuration is considered, typical of inviscid compressible flows. This case had 619,278 points, 3,509,926 elements and 4,179,771 edges. Figure 15.19 shows the surface mesh.

Figure 15.19. F117: surface discretization

Table 15.10 lists the number of edges processed in reduced i/a mode as a function of the desired vector length chosen. The table contains two sets of values: the first is obtained if one insists on the vector length chosen; the second is obtained if the usual i/a vector groups are examined further and snippets of sufficiently long (>64) reduced i/a edge groups are extracted from them.

Table 15.10. F117 configuration: nedge=4,179,771

mvecl   % reduced i/a   nvecl   % reduced i/a   nvecl
   64       97.22          63       97.22          63
  128       93.80         127       98.36         121
  256       89.43         255       97.87         223
  512       86.15         510       97.24         384
 1024       82.87        1018       96.77         608
 2048       79.30        2026       96.44         855
 4096       76.31        4019       96.05        1068
 8192       73.25        7856       95.35        1199
16384       67.16       15089       92.93        1371

Observe that, even with considerable vector lengths, more than 90% of the edges can be processed in reduced i/a mode.
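A compact sketch of the maximal-connectivity point renumbering of section 15.2.5.1 is given below. It follows the algorithm directly (pick the point with the largest remaining neighbour count, then decrement the counts of its neighbours), except that the points surrounding a point are found here by a brute-force search over lnoed and an explicit marker array keeps already renumbered points from being picked again; the example mesh is again hypothetical and the sketch is not tuned for speed.

      ! Sketch of the maximal-connectivity point renumbering of section
      ! 15.2.5.1: repeatedly pick the point with the largest remaining
      ! neighbour count, then decrement the counts of its neighbours.
      ! Illustrative only; the neighbours are found by brute force over lnoed.
      program point_renumbering
      implicit none
      integer, parameter :: npoin = 5, nedge = 7
      integer :: lnoed(2,nedge), lpsup(npoin), lpnew(npoin)
      logical :: ldone(npoin)
      integer :: iedge, ipoin, jpoin, ipmax, npnew

      lnoed(:,1) = [1,2] ; lnoed(:,2) = [2,3] ; lnoed(:,3) = [3,4]
      lnoed(:,4) = [4,5] ; lnoed(:,5) = [1,3] ; lnoed(:,6) = [2,4]
      lnoed(:,7) = [3,5]

      ! number of points surrounding each point
      lpsup = 0
      do iedge = 1, nedge
        lpsup(lnoed(1,iedge)) = lpsup(lnoed(1,iedge)) + 1
        lpsup(lnoed(2,iedge)) = lpsup(lnoed(2,iedge)) + 1
      enddo

      ldone = .false.
      npnew = 0
      do while (npnew .lt. npoin)
        ! point with the maximum remaining neighbour count
        ipmax = 0
        do ipoin = 1, npoin
          if (ldone(ipoin)) cycle
          if (ipmax .eq. 0) then
            ipmax = ipoin
          else if (lpsup(ipoin) .gt. lpsup(ipmax)) then
            ipmax = ipoin
          endif
        enddo
        npnew        = npnew + 1
        lpnew(npnew) = ipmax           ! new ordering: old point number ipmax
        ldone(ipmax) = .true.
        lpsup(ipmax) = 0
        ! decrement the counts of the points surrounding ipmax
        do iedge = 1, nedge
          if (lnoed(1,iedge) .eq. ipmax) then
            jpoin = lnoed(2,iedge)
            lpsup(jpoin) = max(0, lpsup(jpoin)-1)
          else if (lnoed(2,iedge) .eq. ipmax) then
            jpoin = lnoed(1,iedge)
            lpsup(jpoin) = max(0, lpsup(jpoin)-1)
          endif
        enddo
      enddo
      print '(a,5i3)', 'new ordering (old point numbers): ', lpnew
      end program point_renumbering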
15.2.5.3. Avoidance of cache-misses

Renumbering the points according to their maximum connectivity, as required for the reduced i/a point renumbering described above, can lead to very large jumps in the point index along an edge (or, equivalently, to a large bandwidth of the resulting matrix). One can discern that for the structured mesh shown in Figure 15.20 the maximum jump in point index for the edges is nbmax = O(Np/2), where Np is the number of points in the mesh.

Figure 15.20. Point jumps per edge for a structured grid

In general, the maximum jump will be nbmax = O((1 - 2/ncmax)*Np), where ncmax is the maximum number of neighbours of a point. For tetrahedral grids, the average number of neighbours of a point is approximately nc = 14, implying nb = O(6*Np/7). This high jump in point index per edge in turn leads to a large number of cache-misses and a consequent loss of performance on RISC-based machines. In order to counter this effect, a two-step procedure is employed. In a first pass, the points are renumbered for optimum cache performance using a bandwidth minimization technique (Cuthill-McKee, wavefront, recursive bisection, space-filling curve, bin, coordinate-based, etc.). In a second pass, the points are renumbered for optimal i/a reduction using the algorithm outlined above. However, the algorithm is applied progressively on point groups npoi0:npoi0+ngrou, until all points have been covered. The size of the group ngrou corresponds approximately to the average bandwidth. In this way, the point renumbering operates only on a few hyperplanes or local zones at a time, avoiding large point jumps per edge and the associated cache-misses.

15.2.6. ALTERNATIVE RHS FORMATION

A number of test cases (Löhner and Galle (2002)) were run on both the Cray-SV1 and the NEC-SX5 using the conventional Loop 2 and Loop 2a. The results were rather disappointing: Loop 2a was slightly more expensive than Loop 2, even for moderate (>256) vector lengths and with more than 90% of the edges processed in reduced i/a mode. Careful analysis on the NEC-SX5 revealed that the problem was not in the fetches, but rather in the stores. Removing one of the stores almost doubled CPU performance. This observation led to the unconventional formation of the RHS with two vectors.

Loop 2b:
      do 1400 ipass=1,npass
        nedg0=edpas(ipass)+1
        nedg1=edpas(ipass+1)
cdir$ ivdep                       ! Cray: Ignore Vector DEPendencies
        do 1600 iedge=nedg0,nedg1
          ipoi1=lnoed(1,iedge)
          ipoi2=lnoed(2,iedge)
          redge=geoed(iedge)*(unkno(ipoi2)-unkno(ipoi1))
          rhsp0(ipoi1)=rhsp0(ipoi1)+redge
          rhsp1(ipoi2)=rhsp1(ipoi2)-redge
 1600   continue
 1400 continue

Apparently, the compiler (and this seems to be more the norm than the exception) cannot exclude that ipoi1 and ipoi2 are identical. Therefore, the fetch of rhspo(ipoi2) in Loop 2 has to wait until the store of rhspo(ipoi1) has finished. The introduction of the dual RHS enables the compiler to schedule the load of rhsp1(ipoi2) earlier and to hide the latency behind other operations. Note that if rhsp0 and rhsp1 are physically identical, no additional initialization or summation of the two arrays is required.
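The effect of the dual RHS can be illustrated with the following sketch, in which the positive and negative edge contributions are accumulated into two physically distinct arrays and merged in a final pass; this removes the apparent dependence between the store to rhsp0(ipoi1) and the fetch of rhsp1(ipoi2). The array names follow Loop 2b, but the example data and the driver program are illustrative only.

      ! Sketch of the dual-RHS idea of Loop 2b: the positive and negative
      ! edge contributions go into two separate arrays, so the store to
      ! rhsp0(ipoi1) and the fetch of rhsp1(ipoi2) no longer appear to
      ! alias. Here the two arrays are distinct and are summed at the end.
      program dual_rhs_demo
      implicit none
      integer, parameter :: npoin = 6, nedge = 5
      integer :: lnoed(2,nedge), iedge, ipoi1, ipoi2
      real    :: geoed(nedge), unkno(npoin)
      real    :: rhsp0(npoin), rhsp1(npoin), rhspo(npoin), redge

      do iedge = 1, nedge
        lnoed(1,iedge) = iedge
        lnoed(2,iedge) = iedge + 1
        geoed(iedge)   = 1.0
      enddo
      unkno = [0.0, 1.0, 3.0, 6.0, 10.0, 15.0]

      rhsp0 = 0.0
      rhsp1 = 0.0
      do iedge = 1, nedge
        ipoi1 = lnoed(1,iedge)
        ipoi2 = lnoed(2,iedge)
        redge = geoed(iedge)*(unkno(ipoi2)-unkno(ipoi1))
        rhsp0(ipoi1) = rhsp0(ipoi1) + redge    ! positive contributions
        rhsp1(ipoi2) = rhsp1(ipoi2) - redge    ! negative contributions
      enddo

      ! one extra pass to merge the two partial right-hand sides
      rhspo = rhsp0 + rhsp1
      print '(a,6f8.2)', 'assembled rhs: ', rhspo
      end program dual_rhs_demo

If rhsp0 and rhsp1 are instead made to refer to the same storage (e.g. by passing the same array through two different dummy arguments), the final merge and the second initialization presumably disappear, as noted above; the edge grouping then has to guarantee that no point is accessed twice within a group, which is what the edge colouring described earlier provides.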
The combination of the dual RHS of Loop 2b with the reduced i/a addressing of Loop 2a is denoted as Loop 2br in what follows. Finally, the if-test in Loop 2a may be removed by reordering the edge groups in such a way that all usual i/a groups are treated first and all reduced i/a groups thereafter. This loop is denoted as Loop 2bg.

As an example of the kind of speedup that can be achieved with this type of modified, reduced i/a loop, a sphere close to a wall is considered, typical of low-Reynolds-number incompressible flows. Figure 15.21 shows the mesh in a plane cut through the sphere. This case had 49,574 points, 272,434 elements and 328,634 edges. Table 15.11 shows the number of edges processed in reduced i/a mode as a function of the desired vector length chosen.

Figure 15.21. Sphere in the wall proximity: mesh in the cut plane

Table 15.11. Sphere close to the wall: nedge=328,634

mvecl   % reduced i/a   nvecl   % reduced i/a   nvecl
  128       89.53         126       94.88         119
  256       87.23         251       94.94         218
  512       84.50         490       94.69         371
 1024       78.58         947       93.10         592
 2048       69.85        1748       90.85         797

Table 15.12 shows the relative timings recorded for a desired edge group length of 2048 on the SGI Origin2000, Cray-SV1 and NEC-SX5. One can see that gains are achieved in all cases, even though these machines vary in speed by approximately an order of magnitude, and the SGI has an L1 and L2 cache, i.e. no direct memory access. The biggest gains are achieved on the NEC-SX5 (almost 30% speedup).

Table 15.12. Laplacian RHS evaluation (relative timings)

Loop type     O2K      SV1      SX5
Loop 2      1.0000   1.0000   1.0000
Loop 2a     0.9563   1.0077   0.8362
Loop 2b     0.9943   0.8901   0.7554
Loop 2br    0.9484   0.8416   0.7331
Loop 2bg       —        —     0.7073

15.3. Parallel machines: general considerations

With the advent of massively parallel machines, i.e. machines in excess of 500 nodes, the exploitation of parallelism in solvers has become a major focus of attention. According to Amdahl's law, the speedup s obtained by parallelizing a portion α of all the operations required is given by

    s = 1 / [α·(Rs/Rp) + (1 − α)],                    (15.1)

where Rs and Rp denote the scalar and parallel processing rates (speeds), respectively. Table 15.13 shows the speedups obtained for different percentages of parallelization and different numbers of processors.

Table 15.13. Speedups obtainable (Amdahl's law)

Rp/Rs     50%     90%      99%     99.9%
   10    1.81    5.26     9.17      9.91
  100    1.98    9.90    50.25     90.99
 1000    2.00    9.91    90.99    500.25

Note that even on a traditional shared-memory, multi-processor vector machine, such as the Cray-T90 with 16 processors, the maximum achievable speedup between scalar code and parallel vector code is a staggering Rp/Rs = 240. What is important to note is that, as one migrates to higher numbers of processors, only the embarrassingly parallel codes will survive.
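Equation (15.1) is simple enough to check directly; the short program below evaluates the speedup for a range of parallel fractions and processing-rate ratios (the parameter combinations of Table 15.13). It is a direct transcription of the formula and assumes nothing beyond it; the program itself is of course only illustrative.

      ! Direct evaluation of Amdahl's law, equation (15.1):
      ! s = 1 / (alpha*(Rs/Rp) + (1 - alpha))
      program amdahl
      implicit none
      real :: alpha(4) = [0.50, 0.90, 0.99, 0.999]   ! parallelized fraction
      real :: ratio(3) = [10.0, 100.0, 1000.0]       ! Rp/Rs
      real :: s
      integer :: i, j

      do i = 1, 3
        do j = 1, 4
          s = 1.0 / (alpha(j)/ratio(i) + (1.0 - alpha(j)))
          write(*,'(a,f7.1,a,f6.3,a,f8.2)') 'Rp/Rs =', ratio(i), '  alpha =', alpha(j), '  speedup =', s
        enddo
      enddo
      end program amdahl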
Most of the applications ported successfully to parallel machines to date have followed the single program multiple data (SPMD) paradigm. For grid-based solvers, a spatial sub-domain was stored and updated in each processor. For particle solvers, groups of particles were stored and updated in each processor. For obvious reasons, load balancing (Williams (1990), Simon (1991), Mehrota et al. (1992), Vidwans et al. (1993)) has been a major focus of activity.

Despite the striking successes reported to date, only the simplest of all solvers, explicit timestepping or implicit iterative schemes, perhaps with multigrid added on, have been ported without major changes and/or problems to massively parallel machines with distributed memory. Many code options that are essential for realistic simulations are not easy to parallelize on this type of machine. Among these, we mention local remeshing (Löhner (1990)), repeated h-refinement, such as that required for transient problems (Löhner and Baum (1992)), contact detection and force evaluation (Haug et al. (1991)), some preconditioners (Martin and Löhner (1992), Ramamurti and Löhner (1993)), applications where particles, flow and chemistry interact, and other applications with rapidly varying load imbalances. Even if 99% of all the operations required by these codes can be parallelized, the maximum achievable gain will be restricted to 1:100.

If we accept as a fact that for most large-scale codes we may not be able to parallelize more than 99% of all operations, the shared-memory paradigm, discarded for a while as non-scalable, may make a comeback. It is far easier to parallelize some of the more complex algorithms, as well as cases with large load imbalance, on a shared-memory machine. Moreover, it is within present technological reach to build a 100-processor, shared-memory machine.