Hardware Acceleration of EDA Algorithms – P9

9.4 Our Approach

Now, by definition, CD(k) = (CD(i) · D(i, k) + CD(j) · D(j, k)) and CD(i) = (CD(a) · D(a, i) + CD(b) · D(b, i)). From the first property discussed for CD, CD(a) = FD(a s-a-0, a) = 1010, and by definition CD(b) = 0000. By substitution, and by similarly computing CD(i) and CD(j), we compute CD(k) = 0010.

The implementation of the computation of detectabilities and cumulative detectabilities differs between FSIM∗ and GFTABLE, since in GFTABLE all of these computations are done on the GPU, with every kernel launched with T threads. Thus a single kernel in GFTABLE computes T times more data than the corresponding computation in FSIM∗. In FSIM∗, the backtracing is performed in a topological manner from the output of the FFR to its inputs and is not scheduled for gates driving zero critical lines in the packet. We found that this pruning reduces the number of gate evaluations by 42% in FSIM∗ (based on tests run on four benchmark circuits). In GFTABLE, however, T times more patterns are evaluated at once, and as a result no reduction in the number of scheduled gate evaluations was observed for the same four benchmarks. Hence, in GFTABLE, we perform a brute-force backtracing on all gates in an FFR.

As an example, the pseudocode of the kernel which evaluates the cumulative detectability at the output k of a 2-input gate with inputs i and j is provided in Algorithm 11. The arguments to the kernel are a pointer to global memory, CD, where cumulative detectabilities are stored; a pointer to global memory, D, where detectabilities to the immediate dominator are stored; the gate_id of the gate being evaluated (k); and its two inputs (i and j). Let the thread's (unique) threadID be t_x. The data in CD and D is indexed at locations (t_x + i × T) and (t_x + j × T), and the result, computed as per CD(k) = (CD(i) · D(i, k) + CD(j) · D(j, k)), is stored in CD at location (t_x + k × T). Our implementation has a similar kernel for the 2-, 3-, and 4-input gates in our library.

Algorithm 11 Pseudocode of the Kernel to Compute CD of the Output k of a 2-Input Gate with Inputs i and j

    CPT_kernel_2(int *CD, int *D, int i, int j, int k) {
        t_x = my_thread_id
        CD[t_x + k * T] = CD[t_x + i * T] · D[t_x + i * T] + CD[t_x + j * T] · D[t_x + j * T]
    }

9.4.2.4 Fault Simulation of SR(s) (Lines 15, 16)

In the next step, the FSIM∗ algorithm checks that CD(s) ≠ (00...0) (line 15) before it schedules the simulation of SR(s) up to its immediate dominator t and the computation of D(s, t). In other words, if CD(s) = (00...0), then for the current vector the frontier of all faults upstream from s has died before reaching the stem s, and thus no fault can be detected at s. In that case, the fault simulation of SR(s) would be pointless. In the case of GFTABLE, the effective packet size is 32 × T. T is usually set to more than 1,000 (in our experiments it is ≥10K), in order to take advantage of the parallelism available on the GPU and to amortize the overhead of launching a kernel and accessing global memory. The probability of finding CD(s) = (00...0) in GFTABLE is therefore very low (∼0.001). Further, this check would require the logical OR of T 32-bit integers on the GPU, which is an expensive computation. As a result, we bypass the test of line 15 in GFTABLE and always schedule the computation of SR(s) (line 16).
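Before turning to the simulation of SR(s), it is worth making the thread indexing of Algorithm 11 concrete. The following CUDA sketch is our own illustration, not the GFTABLE source: the kernel name, the unsigned 32-bit packet type, the bounds check, and the launch configuration are assumptions, and the · and + of Algorithm 11 are realized as bitwise AND and OR.

    // Sketch: one thread per 32-bit packet; CD and D are flat arrays laid out
    // gate-major, i.e., the T packets of gate g occupy indices [g*T, (g+1)*T).
    __global__ void cpt_kernel_2(unsigned int *CD, const unsigned int *D,
                                 int i, int j, int k, int T) {
        int tx = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread id
        if (tx < T) {
            // CD(k) = CD(i)·D(i,k) + CD(j)·D(j,k), bitwise per packet
            CD[tx + k * T] = (CD[tx + i * T] & D[tx + i * T]) |
                             (CD[tx + j * T] & D[tx + j * T]);
        }
    }

    // Illustrative launch with 128-thread blocks:
    //   cpt_kernel_2<<<(T + 127) / 128, 128>>>(d_CD, d_D, i, j, k, T);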
In simulating SR(s), explicit fault simulation is performed in the forward levelized order from stem s to its immediate dominator t. The input at stem s during simulation of SR(s) is CD(s) XORed with the fault-free value at s. This is equivalent to injecting the faults which are upstream from s and observable at s. After the fault simulation of SR(s), the detectability D(s, t) is computed by XORing the simulation output at t with the true-value simulation output at t. During the forward levelized simulation, the immediate fanout of a gate g is scheduled only if the result of the logic evaluation at g is different from its fault-free value. This check is conducted for every gate on all paths from stem s to its immediate dominator t. On the GPU, this step involves XORing the current gate's T 32-bit outputs with the previously stored fault-free T 32-bit outputs. It then requires the computation of a logical reduction OR of the T 32-bit XOR results into one 32-bit result, because line 17 is computed on the CPU, which requires a 32-bit operand.

In GFTABLE, the reduction OR operation is a modified version of the highly optimized tree-based parallel reduction algorithm for the GPU described in [2]. The approach in [2] effectively avoids bank conflicts and divergent warps, minimizes global memory access latencies, and employs loop unrolling to gain further speedup. Our modified reduction algorithm has a key difference compared to [2]: the approach in [2] computes a SUM instead of a logical OR, and it is a breadth-first approach. In our case, employing a breadth-first approach is expensive, since we only need to detect whether any of the T × 32 bits is non-zero; as soon as we find a single non-zero entry we can finish our computation. Note that performing this test sequentially would be extremely slow in the worst case. We therefore divide the array of T 32-bit words into equal groups of Q words and compute the logical OR of all numbers within a group using our modified parallel reduction approach. As a result, our approach is a hybrid of a breadth-first and a depth-first approach. If the reduction result for a group is not (00...0), we return from the parallel reduction kernel and schedule the fanout of the current gate. If, on the other hand, the reduction result for a group is equal to (00...0), we compute the logical reduction OR of the next group, and so on. Each logical reduction OR is computed using our reduction kernel, which takes advantage of all the optimizations suggested in [2] (and improves on [2] by virtue of our modifications).

The optimal size of the reduction groups was experimentally determined to be Q = 256. We found that when reducing 256 words at once, there was a high probability of having at least one non-zero bit, and thus a high likelihood of returning early from the parallel reduction kernel. At the same time, using 256 words allowed for a fast reduction within a single thread block of 128 threads. Scheduling a thread block of 128 threads uses 4 warps (with a warp size of 32 threads each). The thread block can schedule the 4 warps in a time-sliced fashion, where each integer OR operation takes 4 clock cycles, thereby making optimal use of the hardware resources. Despite this optimization of the parallel reduction, the check can still be expensive, since our parallel reduction kernel is launched after every gate evaluation.
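The structure of this hybrid reduction can be sketched as follows. This is not the GFTABLE kernel itself; the choice of one kernel launch per group, the shared-memory tile, and the host-side early-exit loop are our own simplifications (in practice the group reductions would be batched and fused with the XOR step), but the sketch shows how a 128-thread block reduces a group of Q = 256 words and how the scan stops at the first non-zero result.

    #include <cuda_runtime.h>

    // Reduce one group of 256 words with a single 128-thread block: each thread
    // first ORs two words, then a shared-memory tree reduction follows, loosely
    // following the tree-based reduction of [2].
    __global__ void or_reduce_group(const unsigned int *words, int offset,
                                    unsigned int *group_result) {
        __shared__ unsigned int sdata[128];
        int tid = threadIdx.x;
        sdata[tid] = words[offset + tid] | words[offset + tid + 128];
        __syncthreads();
        for (int s = 64; s > 0; s >>= 1) {
            if (tid < s) sdata[tid] |= sdata[tid + s];
            __syncthreads();
        }
        if (tid == 0) *group_result = sdata[0];
    }

    // Host-side driver: scan the T words of XOR results group by group and stop
    // at the first non-zero OR. Assumes the word count is a multiple of 256.
    bool any_nonzero(const unsigned int *d_words, int num_words,
                     unsigned int *d_group_result) {
        for (int offset = 0; offset < num_words; offset += 256) {
            or_reduce_group<<<1, 128>>>(d_words, offset, d_group_result);
            unsigned int r = 0;
            cudaMemcpy(&r, d_group_result, sizeof(r), cudaMemcpyDeviceToHost);
            if (r != 0) return true;   // fault effect still alive: schedule the fanout
        }
        return false;
    }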
To further reduce the runtime, we launch our parallel reduction kernel only after every G gate evaluations. During the in-between evaluations, the fanout gates are always scheduled to be evaluated. Because of this, we potentially perform a few extra simulations, but this approach proved to be significantly faster than either performing a parallel reduction after every gate's simulation or scheduling every gate in SR(s) for simulation in a brute-force manner. We experimentally determined the optimal value of G to be 20.

In the next step (lines 17 and 18), the detectability D(s, t) is tested. If it is not equal to (00...0), stem s is added to the ACTIVE_STEM list. Again, this step of the algorithm is identical for FSIM∗ and GFTABLE; the difference is in the implementation. On the GPU, the parallel reduction technique explained above is used for testing whether D(s, t) = (00...0). The resulting 32-bit value is transferred back to the CPU. The if condition (line 17) is checked on the CPU and, if it is true, the ACTIVE_STEM list is augmented on the CPU.

Fig. 9.3 Fault simulation on SR(k)

For our example circuit, SR(k) is displayed in Fig. 9.3. The input at stem k is 0010 (CD(k) XORed with the fault-free value at k). The two primary inputs d and e carry the original test vectors. From the output evaluated after explicit simulation up to p, D(k, p) = 0010 ≠ 0000. Thus, k is added to the active stem list. CPT on FFR(p) can be computed in a similar manner. The resulting values are listed below: D(l, p) = 1111; D(n, p) = 1111; D(d, p) = 0000; D(m, p) = 0000; D(e, p) = 0000; D(o, p) = 0000; D(d, n) = 0000; D(l, n) = 1111; D(m, o) = 0000; D(e, o) = 1111; FD(l s-a-0, p) = 0000; FD(l s-a-1, p) = 1111; CD(d) = 0000; CD(l) = 1111; CD(m) = 0000; CD(e) = 0000; CD(n) = 1111; CD(o) = 0000; and CD(p) = 1111. Since CD(p) ≠ (0000) and D(p, p) ≠ (0000), the stem p is added to the ACTIVE_STEM list.

9.4.2.5 Generating the Fault Table (Lines 22–31)

Next, FSIM∗ computes the global detectability of faults (and stems) in the backward order, i.e., it removes the highest-level stem s from the ACTIVE_STEM list (line 23) and computes its global detectability (line 24). If this is not equal to (00...0) (line 25), the global detectability of every fault in FFR(s) is computed and stored in the [a_ij] matrix (lines 26–28).

The corresponding implementation in GFTABLE maintains the ACTIVE_STEM list on the CPU and, like FSIM∗, first computes the global detectability of the highest-level stem s from the ACTIVE_STEM list, but on the GPU. A further parallel reduction kernel is invoked for D(s, t), since the resulting data needs to be transferred to the CPU to test whether the global detectability of s is not equal to (00...0) (line 25). If true, the global detectability of every fault in FFR(s) is computed on the GPU and transferred back to the CPU, which stores the final fault table matrix. The complete algorithm of our GFTABLE approach is displayed in Algorithm 12.
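The per-fault update of lines 26–28 (FD(f_i, t) = FD(f_i, s) · D(s, t), which reappears in Algorithm 12 below) maps onto a kernel of the same shape as CPT_kernel_2. The sketch below is ours, not taken from the GFTABLE code: the names, the packed unsigned-int representation, and the assumption that the faults of FFR(s) occupy contiguous rows of the FD array are illustrative.

    // Sketch: global-detectability update for the faults of FFR(s).
    // FD is laid out fault-major (num_faults rows of T packets) and holds FD(., s)
    // on entry; Dst holds the T packets of D(s, t) for the current stem s.
    __global__ void global_detectability_kernel(unsigned int *FD,
                                                const unsigned int *Dst,
                                                int first_fault, int num_faults,
                                                int T) {
        int tx = blockIdx.x * blockDim.x + threadIdx.x;
        if (tx >= T) return;
        for (int f = 0; f < num_faults; ++f) {
            // FD(f, t) = FD(f, s) & D(s, t), bitwise per 32-bit packet
            FD[tx + (first_fault + f) * T] &= Dst[tx];
        }
    }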
9.5 Experimental Results

As discussed previously, pattern parallelism in GFTABLE includes both bit parallelism, obtained by performing logical operations on words (i.e., a packet size of 32), and thread-level parallelism, obtained by launching T GPU threads concurrently. With respect to bit parallelism, the bit width used in GFTABLE, implemented on the NVIDIA Quadro FX 5800, was 32. This was chosen to make a fair comparison with FSIM∗, which was run on a 32-bit, 3.6 GHz Intel CPU running Linux (Fedora Core 3) with 3 GB RAM. It should be noted that the Quadro FX 5800 also allows operations on 64-bit words.

With respect to thread-level parallelism, launching a kernel with a higher number of threads in the grid allows us to better take advantage of the immense parallelism available on the GPU, reduces the overhead of launching a kernel, and hides the latency of accessing global memory. However, due to the finite size of the global memory, there is an upper limit on the number of threads that can be launched simultaneously. Hence we split the fault list of a circuit into smaller fault lists. This is done by first sorting the gates of the circuit in increasing order of their level and then collecting the faults associated with every Z (= 100) gates from this list to generate the smaller fault lists. Our approach then targets a new fault list in each iteration.

Algorithm 12 Pseudocode of GFTABLE

    GFTABLE(N) {
        Set up the fault list FL.
        Find FFRs and SRs.
        STEM_LIST ← all stems
        Fault table [a_ik] initialized to the all-zero matrix.
        v = 0
        while v < N do
            v = v + T × 32
            Generate test vectors using an LFSR on the CPU and transfer them to the GPU
            Perform fault-free simulation on the GPU
            ACTIVE_STEM ← NULL
            for each stem s in STEM_LIST do
                Simulate FFR using CPT on the GPU
                    // brute-force backtracing on all gates
                Simulate SR(s) on the GPU
                    // check at every Gth gate during the forward levelized simulation
                    // whether the fault frontier is still alive; else continue the for
                    // loop with s ← next stem in STEM_LIST
                Compute D(s, t) on the GPU, where t is the immediate dominator of s
                    // computed using hybrid parallel reduction on the GPU
                if (D(s, t) ≠ (00...0)) then
                    Update on the CPU: ACTIVE_STEM ← ACTIVE_STEM + s
                end if
            end for
            while (ACTIVE_STEM ≠ NULL) do
                Remove the highest-level stem s from ACTIVE_STEM
                Compute D(s, t) on the GPU, where t is an auxiliary output which
                connects all primary outputs
                    // computed using hybrid parallel reduction on the GPU
                if (D(s, t) ≠ (00...0)) then
                    for (each fault f_i in FFR(s)) do
                        FD(f_i, t) = FD(f_i, s) · D(s, t)    // computed on the GPU
                        Store FD(f_i, t) in the ith row of [a_ik]    // stored on the CPU
                    end for
                end if
            end while
        end while
    }

We statically allocate global memory for storing the fault detectabilities of the current faults (the faults currently under consideration) for all threads launched in parallel on the GPU. If the number of faults in the current list is F and the number of threads launched simultaneously is T, then F × T × 4 B of global memory is used for storing the current fault detectabilities. As mentioned previously, we also statically allocate space for two copies of the fault-free simulation output for at most L gates. The gates of the circuit are topologically sorted from the primary outputs to the primary inputs, and the fault-free data (and its copy) of the first L gates in the sorted list is statically stored on the GPU. This uses a further L × T × 2 × 4 B of global memory. For the remaining gates, the fault-free data is transferred to and from the CPU as and when it is computed or required on the GPU. Further, the detectabilities and cumulative detectabilities of all gates in the FFRs of the current faults, and of all the dominators in the circuit, are stored on the GPU. The total on-board memory on a single NVIDIA Quadro FX 5800 is 4 GB.
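These static allocations must fit within the 4 GB budget, which is what couples the choice of T and L. The helper below simply restates the two expressions above; it is a rough accounting, and ignoring the detectability storage for FFR gates and dominators (as well as allocator overhead) is our simplification.

    #include <cstddef>

    // Rough static global-memory footprint in bytes.
    // F: faults in the current list, T: threads launched in parallel,
    // L: gates whose fault-free data (and its copy) stays resident on the GPU.
    size_t static_footprint_bytes(size_t F, size_t T, size_t L) {
        size_t fault_detectabilities = F * T * 4;      // one 32-bit packet per fault per thread
        size_t fault_free_values     = L * T * 2 * 4;  // two copies of the fault-free packets
        return fault_detectabilities + fault_free_values;
    }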
With our current implementation, we can launch T = 16K threads in parallel, while using L = 32K gates. Note that the complete fault dictionary is never stored on the GPU, and hence the number of test patterns used for generating the fault table can be arbitrarily large. Also, since GFTABLE does not store the information of the entire circuit on the GPU, it can handle arbitrarily sized circuits.

The results of our current implementation, for 10 ISCAS benchmarks and 11 ITC99 benchmarks, for 0.5M patterns, are reported in Table 9.1. All runtimes reported are in seconds.

Table 9.1 Fault table generation results with L = 32K

    Circuit   # Gates   # Faults   GFTABLE (s)   FSIM∗ (s)   Speedup   GFTABLE-8 (s)   Speedup
    c432      196       524        0.77          12.60       16.43×    0.13            93.87×
    c499      243       758        0.75          8.40        11.20×    0.13            64.00×
    c880      443       942        1.15          17.40       15.13×    0.20            86.46×
    c1355     587       1,574      2.53          23.95       9.46×     0.44            54.03×
    c1908     913       1,879      4.68          51.38       10.97×    0.82            62.70×
    c2670     1,426     2,747      1.92          56.27       29.35×    0.34            167.72×
    c3540     1,719     3,428      7.55          168.07      22.26×    1.32            127.20×
    c5315     2,485     5,350      4.50          109.05      24.23×    0.79            138.48×
    c6288     2,448     7,744      28.28         669.02      23.65×    4.95            135.17×
    c7552     3,719     7,550      10.70         204.33      19.10×    1.87            109.12×
    b14_1     7,283     12,608     70.27         831.27      11.83×    12.30           67.60×
    b14       9,382     16,207     100.87        1,502.47    14.90×    17.65           85.12×
    b15       12,587    21,453     136.78        1,659.10    12.13×    23.94           69.31×
    b20_1     17,157    31,034     193.72        3,307.08    17.07×    33.90           97.55×
    b20       20,630    35,937     319.82        4,992.73    15.61×    55.97           89.21×
    b21_1     16,623    29,119     176.75        3,138.08    17.75×    30.93           101.45×
    b21       20,842    35,968     262.75        4,857.90    18.49×    45.98           105.65×
    b17       40,122    69,111     903.22        4,921.60    5.45×     158.06          31.14×
    b18       40,122    69,111     899.32        4,914.93    5.47×     157.38          31.23×
    b22_1     25,011    44,778     369.34        4,756.53    12.88×    64.63           73.59×
    b22       29,116    51,220     399.34        6,319.47    15.82×    69.88           90.43×
    Average                                                  15.68×                    89.57×

The fault tables obtained from GFTABLE, for all benchmarks, were verified against those obtained from FSIM∗ and matched with 100% fidelity. Column 1 lists the circuit under consideration; columns 2 and 3 list the number of gates and (collapsed) faults in the circuit. The total runtimes for GFTABLE and FSIM∗ are listed in columns 4 and 5, respectively. The runtime of GFTABLE includes the total time taken on both the GPU and the CPU, as well as the time taken for all data transfers between the GPU and the CPU. In particular, the transfer time includes the time taken to transfer the following:

• the test patterns, which are generated on the CPU (CPU → GPU);
• the results from the multiple invocations of the parallel reduction kernel (GPU → CPU);
• the global fault detectabilities over all test patterns for all faults (GPU → CPU); and
• the fault-free data of any gate which is not in the set of L gates (during true-value and faulty simulations) (CPU ↔ GPU).

Column 6 reports the speedup of GFTABLE over FSIM∗. The average speedup over the 21 benchmarks is reported in the last row: on average, GFTABLE is 15.68× faster than FSIM∗.

By using the NVIDIA Tesla server, which houses up to eight GPUs [1], the available global memory increases by 8×. Hence we can potentially launch 8× more threads simultaneously and set L large enough to hold the fault-free data (and its copy) for all the gates in our benchmark circuits. This allows for a ∼8× speedup in the processing time. The first three items of the transfer times in the list above will not scale, and the last item will no longer contribute to the total runtime.
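This projection can be summarized in a simple accounting model, sketched below with hypothetical component names; the published tables report only total runtimes, so the per-component breakdown shown here is an assumption, not measured data.

    // Per-benchmark runtime components (hypothetical breakdown).
    struct RuntimeBreakdown {
        double gpu_processing;       // kernel execution time; scales ~1/8 with 8 GPUs
        double fixed_transfers;      // patterns, reduction results, global detectabilities
        double fault_free_transfers; // CPU<->GPU traffic for gates beyond the first L
    };

    // Projected GFTABLE-8 runtime: the processing time shrinks 8x, the fixed
    // transfers stay, and the fault-free traffic disappears because L can now
    // cover every gate of the circuit.
    double project_gftable8(const RuntimeBreakdown &r) {
        return r.gpu_processing / 8.0 + r.fixed_transfers;
    }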
In Table 9.1, column 7 lists the projected runtimes when using an 8-GPU system for GFTABLE (referred to as GFTABLE-8). The projected speedup of GFTABLE-8 compared to FSIM∗ is listed in column 8. The average potential speedup is 89.57×.

Tables 9.2 and 9.3 report the results with L = 8K and L = 16K, respectively. All columns in Tables 9.2 and 9.3 report entries similar to those described for Table 9.1. The speedup of GFTABLE and GFTABLE-8 over FSIM∗ with L = 8K is 12.88× and 69.73×, respectively. Similarly, the speedup of GFTABLE and GFTABLE-8 over FSIM∗ with L = 16K is 14.49× and 82.80×, respectively.

Table 9.2 Fault table generation results with L = 8K

    Circuit   # Gates   # Faults   GFTABLE (s)   FSIM∗ (s)   Speedup   GFTABLE-8 (s)   Speedup
    c432      196       524        0.73          12.60       17.19×    0.13            98.23×
    c499      243       758        0.75          8.40        11.20×    0.13            64.00×
    c880      443       942        1.13          17.40       15.36×    0.20            87.76×
    c1355     587       1,574      2.52          23.95       9.52×     0.44            54.37×
    c1908     913       1,879      4.73          51.38       10.86×    0.83            62.04×
    c2670     1,426     2,747      1.93          56.27       29.11×    0.34            166.34×
    c3540     1,719     3,428      7.57          168.07      22.21×    1.32            126.92×
    c5315     2,485     5,350      4.53          109.05      24.06×    0.79            137.47×
    c6288     2,448     7,744      28.17         669.02      23.75×    4.93            135.72×
    c7552     3,719     7,550      10.60         204.33      19.28×    1.85            110.15×
    b14_1     7,283     12,608     70.05         831.27      11.87×    12.26           67.81×
    b14       9,382     16,207     120.53        1,502.47    12.47×    21.09           71.23×
    b15       12,587    21,453     216.12        1,659.10    7.68×     37.82           43.87×
    b20_1     17,157    31,034     410.68        3,307.08    8.05×     71.87           46.02×
    b20       20,630    35,937     948.06        4,992.73    5.27×     165.91          30.09×
    b21_1     16,623    29,119     774.45        3,138.08    4.05×     135.53          23.15×
    b21       20,842    35,968     974.03        4,857.90    5.05×     170.46          28.50×
    b17       40,122    69,111     1,764.01      4,921.60    2.79×     308.70          15.94×
    b18       40,122    69,111     2,100.40      4,914.93    2.34×     367.57          13.37×
    b22_1     25,011    44,778     647.15        4,756.53    7.35×     113.25          42.00×
    b22       29,116    51,220     915.87        6,319.47    6.90×     160.28          39.43×
    Average                                                  12.88×                    69.73×

Table 9.3 Fault table generation results with L = 16K

    Circuit   # Gates   # Faults   GFTABLE (s)   FSIM∗ (s)   Speedup   GFTABLE-8 (s)   Speedup
    c432      196       524        0.73          12.60       17.33×    0.13            99.04×
    c499      243       758        0.75          8.40        11.20×    0.13            64.00×
    c880      443       942        1.03          17.40       16.89×    0.18            96.53×
    c1355     587       1,574      2.53          23.95       9.46×     0.44            54.03×
    c1908     913       1,879      4.68          51.38       10.97×    0.82            62.70×
    c2670     1,426     2,747      1.97          56.27       28.61×    0.34            163.46×
    c3540     1,719     3,428      7.92          168.07      21.22×    1.39            121.26×
    c5315     2,485     5,350      4.50          109.05      24.23×    0.79            138.48×
    c6288     2,448     7,744      28.28         669.02      23.65×    4.95            135.17×
    c7552     3,719     7,550      10.70         204.33      19.10×    1.87            109.12×
    b14_1     7,283     12,608     70.27         831.27      11.83×    12.30           67.60×
    b14       9,382     16,207     100.87        1,502.47    14.90×    17.65           85.12×
    b15       12,587    21,453     136.78        1,659.10    12.13×    23.94           69.31×
    b20_1     17,157    31,034     193.72        3,307.08    17.07×    33.90           97.55×
    b20       20,630    35,937     459.82        4,992.73    10.86×    80.47           62.05×
    b21_1     16,623    29,119     156.75        3,138.08    20.02×    27.43           114.40×
    b21       20,842    35,968     462.75        4,857.90    10.50×    80.98           59.99×
    b17       40,122    69,111     1,203.22      4,921.60    4.09×     210.56          23.37×
    b18       40,122    69,111     1,399.32      4,914.93    3.51×     244.88          20.07×
    b22_1     25,011    44,778     561.34        4,756.53    8.47×     98.23           48.42×
    b22       29,116    51,220     767.34        6,319.47    8.24×     134.28          47.06×
    Average                                                  14.49×                    82.80×

9.6 Chapter Summary

In this chapter, we have presented our implementation of fault table generation on a GPU, called GFTABLE. Fault table generation requires fault simulation without fault dropping, which can be extremely computationally expensive. Fault simulation is inherently parallelizable, and the large number of threads that can be computed in parallel on a GPU can therefore be employed to accelerate fault simulation and fault table generation.
In particular, we implemented a pattern-parallel approach which utilizes both bit parallelism and thread-level parallelism. Our implementation is a significantly re-engineered version of FSIM, a pattern-parallel fault simulation approach for single-core processors. At no time in the execution is the entire circuit (or even a part of it) required to be stored on (or transferred to) the GPU. Like FSIM, GFTABLE utilizes critical path tracing and the dominator concept to reduce explicit simulation time. Further modifications to FSIM allow us to maximally harness the GPU's computational resources and large memory bandwidth. We compared our performance to FSIM∗, which is FSIM modified to generate a fault table. Our experiments indicate that GFTABLE, implemented on a single NVIDIA Quadro FX 5800 GPU card, can generate a fault table for 0.5 million test patterns on average 15× faster than FSIM∗. With the NVIDIA Tesla server [1], our approach would potentially be 90× faster.

References

1. NVIDIA Tesla GPU Computing Processor. http://www.nvidia.com/object/IO_43499.html
2. Parallel Reduction. http://developer.download.nvidia.com/∼reduction.pdf
3. Abramovici, A., Levendel, Y., Menon, P.: A logic simulation engine. IEEE Transactions on Computer-Aided Design 2, 82–94 (1983)
4. Abramovici, M., Breuer, M.A., Friedman, A.D.: Digital Systems Testing and Testable Design. Computer Science Press, New York (1990)
5. Abramovici, M., Menon, P.R., Miller, D.T.: Critical path tracing – An alternative to fault simulation. In: DAC '83: Proceedings of the 20th Conference on Design Automation, pp. 214–220. IEEE Press, Piscataway, NJ (1983)
6. Agrawal, P., Dally, W.J., Fischer, W.C., Jagadish, H.V., Krishnakumar, A.S., Tutundjian, R.: MARS: A multiprocessor-based programmable accelerator. IEEE Design and Test 4(5), 28–36 (1987)
7. Amin, M.B., Vinnakota, B.: Workload distribution in fault simulation. Journal of Electronic Testing 10(3), 277–282 (1997)
8. Antreich, K., Schulz, M.: Accelerated fault simulation and fault grading in combinational circuits. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 6(5), 704–712 (1987)
9. Banerjee, P.: Parallel Algorithms for VLSI Computer-Aided Design. Prentice Hall, Englewood Cliffs, NJ (1994)
10. Beece, D.K., Deibert, G., Papp, G., Villante, F.: The IBM engineering verification engine. In: DAC '88: Proceedings of the 25th ACM/IEEE Conference on Design Automation, pp. 218–224. IEEE Computer Society Press, Los Alamitos, CA (1988)
11. Bossen, D.C., Hong, S.J.: Cause-effect analysis for multiple fault detection in combinational networks. IEEE Transactions on Computers 20(11), 1252–1257 (1971)
12. Gulati, K., Khatri, S.P.: Fault table generation using graphics processing units. In: IEEE International High Level Design Validation and Test Workshop (2009)
13. Harel, D., Sheng, R., Udell, J.: Efficient single fault propagation in combinational circuits. In: Proceedings of the International Conference on Computer-Aided Design (ICCAD), pp. 2–5 (1987)
14. Hong, S.J.: Fault simulation strategy for combinational logic networks. In: Proceedings of the Eighth International Symposium on Fault-Tolerant Computing, pp. 96–99 (1979)
15. Lee, H.K., Ha, D.S.: An efficient, forward fault simulation algorithm based on the parallel pattern single fault propagation. In: Proceedings of the IEEE International Test Conference, pp. 946–955. IEEE Computer Society, Washington, DC (1991)
16. Mueller-Thuns, R., Saab, D., Damiano, R., Abraham, J.: VLSI logic and fault simulation on general-purpose parallel computers. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 12, 446–460 (1993)
17. Narayanan, V., Pitchumani, V.: Fault simulation on massively parallel SIMD machines: Algorithms, implementations and results. Journal of Electronic Testing 3(1), 79–92 (1992)
18. Ozguner, F., Daoud, R.: Vectorized fault simulation on the Cray X-MP supercomputer. In: ICCAD-88: IEEE International Conference on Computer-Aided Design, Digest of Technical Papers, pp. 198–201 (1988)
19. Parkes, S., Banerjee, P., Patel, J.: A parallel algorithm for fault simulation based on PROOFS, pp. 616–621. URL citeseer.ist.psu.edu/article/parkes95parallel.html
20. Pfister, G.F.: The Yorktown simulation engine: Introduction. In: DAC '82: Proceedings of the 19th Conference on Design Automation, pp. 51–54. IEEE Press, Piscataway, NJ (1982)
21. Pomeranz, I., Reddy, S., Tangirala, R.: On achieving zero aliasing for modeled faults. In: Proceedings of the 3rd European Conference on Design Automation, pp. 291–299 (1992)
22. Pomeranz, I., Reddy, S.M.: On the generation of small dictionaries for fault location. In: ICCAD '92: 1992 IEEE/ACM International Conference on Computer-Aided Design, pp. 272–279. IEEE Computer Society Press, Los Alamitos, CA (1992)
23. Pomeranz, I., Reddy, S.M.: A same/different fault dictionary: An extended pass/fail fault dictionary with improved diagnostic resolution. In: DATE '08: Proceedings of the Conference on Design, Automation and Test in Europe, pp. 1474–1479 (2008)
24. Richman, J., Bowden, K.R.: The modern fault dictionary. In: Proceedings of the International Test Conference, pp. 696–702 (1985)
25. Tai, S., Bhattacharya, D.: Pipelined fault simulation on parallel machines using the circuit flow graph. In: Computer Design: VLSI in Computers and Processors, pp. 564–567 (1993)
26. Tulloss, R.E.: Fault dictionary compression: Recognizing when a fault may be unambiguously represented by a single failure detection. In: Proceedings of the Test Conference, pp. 368–370 (1980)
