2D-PPC: A single-correction multiple-detection method for Through-Silicon-Via Faults

Khanh N. Dang∗¶, Michael Conrad Meyer§, Akram Ben Ahmed†, Abderazek Ben Abdallah‡ and Xuan-Tu Tran¶
¶ SISLAB, University of Engineering and Technology, Vietnam National University Hanoi, Hanoi, 123106, Vietnam
§ G.S. of Information, Production and Systems, Waseda University, Kitakyushu 808-0135, Japan
† National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba, 305–8568, Japan
‡ Adaptive Systems Laboratory, The University of Aizu, Aizu-Wakamatsu, Fukushima 965-8580, Japan
Email: ∗ khanh.n.dang@ieee.org

Abstract: Through-Silicon-Via (TSV) is one of the most promising technologies to realize 3D Integrated Circuits (3D-ICs). However, reliability issues caused by low yield rates, sensitivity to thermal hotspots, and stress due to the temperature difference between layers are preventing TSV-based 3D-ICs from being widely and efficiently used. Due to defect clustering, 3D-ICs could have multiple defects in the same region, which cannot be detected by using error correction codes, while dedicated testing could take a significant number of testing cycles. This paper presents a 2D Parity Product Code (2D-PPC) with the ability to correct one fault and detect at least two faults. With the extension using Orthogonal Latin Square, 2D-PPC could detect multiple defects at a reasonable increase in area cost and latency.

Index Terms: 3D-ICs, Through Silicon Via, Fault Tolerance, Error Correction Code, Orthogonal Latin Square

I. INTRODUCTION

Through-Silicon-Vias (TSVs) serve as vertical wires between two adjacent layers in Three-Dimensional Integrated Circuits (3D-ICs). Thanks to their extremely short lengths, their latencies are low, which could lead to extremely high communication speeds [1]. Moreover, as a 3D-IC technology, TSV-based ICs can have smaller footprints despite the TSVs' overheads [2], and lower power consumption thanks to the shorter wires [1]. Despite
the aforementioned advantages, reliability has been a major concern of Through-Silicon-Vias due to their low yield rates, vulnerability to thermal effects and stress, and the crosstalk issues of parallel TSVs [3]. Defects on TSVs can occur in both random and cluster distributions [4], which raises concerns about their fault-tolerance capabilities. Because of their naturally parallel structure, TSVs also face crosstalk challenges [5]. Furthermore, the difference in thermal expansion coefficients of materials and the temperature variations between two layers, which have been reported to reach up to 10°C [6], could lead to stress issues.

To enhance the reliability of TSVs, there are three main approaches: (i) hardware fault-tolerance such as correction circuits [7], redundancies [4], and reliability mapping [8]; (ii) information redundancy such as coding techniques [9]–[11] or re-transmission requests [12]; or (iii) algorithm-based fault-tolerance [13]. Built-in-self-test (BIST) [14] and online testing [15], [16] techniques have also been proposed to help the system determine whether a TSV has a defect or not.

978-1-7281-2940-2/19/$31.00 ©2019 IEEE

Although numerous methods have been proposed to solve the reliability issues of TSVs, several problems remain a challenge for designers. First, to ensure the reliability of TSVs, testing and defect awareness must be provided. However, a testing process using BIST [14] or external testing [17] usually interrupts the system's operations and may take an enormous number of cycles. Therefore, the system must be aware of when a possible fault has occurred. Second, Error Correction Codes (ECCs) can detect TSV defects immediately; however, they are limited to a certain number of detectable faults. For instance, the detection capabilities of Hamming and SECDED (Single Error Correction Double Error Detection) codes are low (one and two faults, respectively), which may lead to silent faults if multiple TSVs fail. The exception is Orthogonal Latin Square Code (OLSC) [18] which provides a low latency and
modular design. However, OLSC does not provide extra detectability. Therefore, in this paper, we propose a coding method named Two Dimensional Parity Product Code (2D-PPC) and its matrix-switching method, which is specially designed for correcting and detecting faults in TSV-based links. The contributions of this paper are as follows:
• 2D Parity Product Code (2D-PPC) offers one-bit correction and at least two-bit detection. A Monte-Carlo simulation shows that 2D-PPC could detect more than two defects.
• By using Orthogonal Latin Square matrices as alternative row and column coding, the detectability of 2D-PPC can be significantly improved.
• A light-weight design of the proposed 2D-PPC's encoder and decoder. The design of 2D-PPC shows lower delay values than Hamming and SECDED.

The organization of this paper is as follows: Section II presents the proposed 2D-PPC. Section III provides the evaluation environment and results. Finally, Section IV concludes the paper.

II. 2D PARITY PRODUCT CODE

This section presents the proposed 2D Parity Product Code (2D-PPC). The following parts demonstrate the encoding and decoding processes with equivalent circuits. Finally, we discuss the correctability and detectability of the coding technique.

Figure 1: Switching 2D-PPC using orthogonal Latin square. (Panels: Matrix-0 to Matrix-3; legend: data bits, r0–r3, c0–c3, alternative c and r bits, u.)

APCCAS2019

A. Fault Consideration

Regarding behavior, we modeled the possible faults as stuck-at faults and inverted logic behavior. In other words, the output logic value of a TSV is stuck at '0' or '1', or is the opposite of the true value. These behaviors generally apply to soft errors such as single event upsets, but they can also model permanent defects or crosstalk, depending on the frequency of use. The distribution of
faults is defined as random and multiple faults could happen in the same TSV group.

B. Encoding

For each transmission, a TSV group sends a coded flit $F$ as follows:

\[
F_k = \begin{bmatrix}
b_{0,0} & b_{0,1} & b_{0,2} & \cdots & b_{0,N-1} & r_0 \\
b_{1,0} & b_{1,1} & b_{1,2} & \cdots & b_{1,N-1} & r_1 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
b_{M-1,0} & b_{M-1,1} & b_{M-1,2} & \cdots & b_{M-1,N-1} & r_{M-1} \\
c_0 & c_1 & c_2 & \cdots & c_{N-1} & u
\end{bmatrix}
\]

where $b_{i,j}$ is a data bit and

\[
\begin{aligned}
r_i &= b_{i,0} \oplus b_{i,1} \oplus \cdots \oplus b_{i,N-1} \\
c_j &= b_{0,j} \oplus b_{1,j} \oplus \cdots \oplus b_{M-1,j} \\
u_r &= r_0 \oplus r_1 \oplus \cdots \oplus r_{M-1} \\
u_c &= c_0 \oplus c_1 \oplus \cdots \oplus c_{N-1} \\
u &= u_r = u_c = \bigoplus_{i=0}^{M-1} \bigoplus_{j=0}^{N-1} b_{i,j}
\end{aligned}
\tag{1}
\]

Note that the symbol $\oplus$ stands for the XOR function.

C. Decoding

By using parity checking, the decoder can find the column and row indexes of the flipped bit. The parity equations are as follows:

\[
\begin{aligned}
sr_i &= b_{i,0} \oplus b_{i,1} \oplus \cdots \oplus b_{i,N-1} \oplus r_i \\
sc_j &= b_{0,j} \oplus b_{1,j} \oplus \cdots \oplus b_{M-1,j} \oplus c_j \\
sc_N &= r_0 \oplus r_1 \oplus \cdots \oplus r_{M-1} \oplus u \\
sr_M &= c_0 \oplus c_1 \oplus \cdots \oplus c_{N-1} \oplus u
\end{aligned}
\tag{2}
\]

The outputs of Eq. (2) are two arrays: the column parities ($sc$) and the row parities ($sr$). If there is one or no flipped bit, the decoder can correct it using a mask $Mask$ where

\[
Mask(i, j) = \begin{cases} 1 & \text{if } sr_i = 1 \text{ and } sc_j = 1 \\ 0 & \text{otherwise} \end{cases}
\]

For each received flit $\hat{F}_k$, the corrected flit $F_k$ is obtained by:

\[
F_k = \hat{F}_k \oplus Mask
\]

The decoder fails to correct when there are two or more faults. In this case, the decoder sends a NACK signal and a hybrid automatic retransmission request (HARQ) is used to perform correction:

\[
NACK = \left( \sum_{i=0}^{M} sr_i \geq 2 \right) \text{ OR } \left( \sum_{j=0}^{N} sc_j \geq 2 \right)
\tag{3}
\]

D. Correctability and Detectability

In general, 2D-PPC can ensure the ability to correct one and detect two flipped bits. However, if there are more than two flipped bits, 2D-PPC still has a chance to detect them.

Figure 2: 2D-PPC detection ability evaluation. (Panels (a) to (h); x-axis: number of faults; y-axis: detection rate (%).)

Although 2D-PPC can detect more than two faults, there is a weak point in its detection approach that prevents it from detecting a particular pattern of three faults. For instance, if bits with
indexes $(i, j)$, $(i, k)$ and $(l, j)$ are flipped, both $sr_i$ and $sc_j$ are '0', which makes the decoder fail to detect the faults in row $i$ and column $j$, while both $sc_k$ and $sr_l$ are '1'. This symptom makes the decoder conclude that there is one fault and wrongly correct the bit $b_{l,k}$.

E. 2D-PPC using switching Orthogonal Latin Square based matrix

To detect more faults, we could extend 2D-PPC based on Orthogonal Latin Square. Note that this limits the shape of 2D-PPC to a square ($M = N$). Here, 2D-PPC could be considered as an extended version of Orthogonal Latin Square code. This extension has two features that could break the undetectable pattern. We could observe that the design for Matrix-2 and Matrix-3 can be shared with the original matrices of 2D-PPC. Because we target 2D-PPC for TSVs, adding extra parity bits is not desirable; therefore, switching between matrices is the optimal solution. In the first cycle, 2D-PPC runs with its original matrix; it could then run the alternative matrices in the following cycles. While the original matrix is subject to the undetectable pattern, simply switching between the different matrices could break this pattern. The extra cost and latency are only $M \times N$ multiplexers and one 2:1 MUX delay, respectively.

III. EVALUATION

The 2D-PPC circuit is designed in Verilog-HDL with 45 nm process technology. The design is implemented using EDA tools by Synopsys. We first evaluate the detectability of 2D-PPC. Then, the real implementation results are presented and compared.

A. Detection performance

In order to study the detection ability of 2D-PPC, we perform a 10,000-case Monte-Carlo simulation, presented in Fig. 2. With 2D-PPC (2 × 2), the results show that it can detect a high percentage of the cases with an even number of faults. However, with odd numbers of faults, the undetectable patterns in Section II-D occur, which reduces the detection rate. This could be explained by the hidden pattern that occurs with three faults and is likely replicated with other odd numbers of faults. Although we switch the matrix, this pattern could be replicated in the alternative matrices, which reduces the detectability of the method. Even so, 2D-PPC provides excellent performance with larger data bit-widths because there is a lower chance for the worst cases of 2D-PPC to happen.

By using an additional matrix and simply switching between them periodically, we can improve the accuracy of multiple fault detection. With the addition of Matrix-2, we can observe a significant improvement where most cases reach over 90%. With Matrix-2 and Matrix-3, 2D-PPC breaks the undetectable patterns in 99+% or 100% of the cases. However, there is a special case when M = N = 2 where switching matrices cannot help in tackling the undetectable patterns. This is acceptable because having that many faulty TSVs in such a small group makes detection infeasible.

In summary, the proposed 2D-PPC provides a reasonable detection rate. By using extra matrices, it could help detect multiple faults without adding overwhelming extra area cost ($M \times N$ 2:1 multiplexers and a 2-bit counter).

Extra test results: http://dangnamkhanh.com/share/2D-PPC extra all.csv

B. Hardware Implementation

The hardware implementations of 2D-PPC are presented in Table I. Besides the works in [19] and [20], we also perform the comparison with results obtained from [12], which are implemented in 65 nm technology. Even when scaling to 45 nm, the area cost of the Hamming Product Code (HP-HARQ-II) in [12] is 8× higher than that of 2D-PPC. The BCH code [12] provides multi-bit correction; however, its complexity is 50× more than the proposed one. The HP-HARQ-II encoder's and decoder's latencies are 6.82% lower and 9.26% higher, respectively, while using older technology. However, 2D-PPC's latency is still extremely low (0.44 ns and 0.54 ns). Meanwhile, the area cost is similar to Hamming and SECDED, which are two simple coding techniques. It is important to mention that the area cost results have not taken into account the area of the TSVs. With the same 64 data bit-width, 2D-PPC uses 81 code-word bits (or TSVs) while Hamming, SECDED and BCH use 71, 72 and 85 code-word bits (or TSVs), respectively.

Table I: Hardware implementation results. "AO" and "DO" are Area Optimization and Delay Optimization, respectively. Cells with two values are given as AO / DO.

| Scheme | Tech (nm) | k (bit) | n (bit) | Area cost (µm²), Encoder | Area cost (µm²), Decoder | Latency (ns), Encoder | Latency (ns), Decoder |
|---|---|---|---|---|---|---|---|
| Hamming [9] | 45 | 64 | 71 | 193.1200 | 463.1060 | 0.69 | 1.58 |
| SECDED [10] | 45 | 64 | 72 | 234.6120 | 487.0460 | 0.75 | 1.62 |
| HP + HARQ-II [12] | 65 | 64 | 72 | 9792.5* | * | 0.41 | 0.59 |
| ARQ (CRC-5) [12] | 65 | 64 | 69 | 3605.6* | * | 0.37 | 0.41 |
| BCH [12] | 65 | 64 | 85 | 77353.2* | * | 0.42 | 0.72 |
| SEC-DAEC [19] | 45 | 64 | 72 | 678 / 812 | 3106 / 4227 | 0.61 / 0.33 | 1.75 / 0.61 |
| TAEC-64 [20] | 45 | 64 | 72 | 566 / 695 | 5279 / 7165 | 0.58 / 0.30 | 1.81 / 0.62 |
| 2D-PPC (8 × 8) | 45 | 64 | 81 | 201.8940 | 442.8900 | 0.44 | 0.54 |
| 2D-PPC (8 × 8) + Matrix-2 | 45 | 64 | 81 | 341.2780 | 628.0260 | 0.53 | 0.91 |
| 2D-PPC (8 × 8) + Matrix-2&3 | 45 | 64 | 81 | 404.5860 | 691.3340 | 0.55 | 0.97 |

* Only one area value is listed for this scheme.

To support the switching technique with different matrices, additional circuits are needed, which increases the area cost and the latency. Utilizing one or two additional matrices maintains a latency below 1 ns and increases the area cost by factors of 1.5× and 1.7×, respectively, but greatly improves the detection rate compared to using a single matrix.

IV. CONCLUSION

This paper presents the 2D Parity Product Code (2D-PPC) to enhance the reliability of TSV-based 3D-IC designs. By exploiting the inherent 2D array organization of TSVs, the proposed approach can efficiently represent the fault manifestation in TSV-based systems, allowing it to correct one and detect at least two faults in a set of TSVs. In the conducted experiments, and in contrast to conventional coding schemes that are limited to detecting a certain number of faults, the proposed 2D-PPC has demonstrated its ability to detect several defects while keeping a reasonable area cost and latency. Thanks to the matrix-switching technique, 2D-PPC significantly improves the detection rate, allowing it to detect most of the fault cases. As future work, we plan to apply 2D-PPC to a dedicated 3D-IC architecture (e.g., 3D-RAM, 3D-NoCs) to investigate the impact on the overall system. Extending the technique with adaptive coding and different underlying coding methods is another possible direction.

V. ACKNOWLEDGMENT

This research is funded by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.01-2018.312.

REFERENCES

[1] W. R. Davis et al., “Demystifying 3D ICs: The pros and cons of going vertical,” IEEE Des. Test Comput., vol. 22, no. 6, pp. 498–510, 2005.
[2] X. Dong and Y. Xie, “System-level cost analysis and design exploration for three-dimensional integrated circuits (3D ICs),” in Proc. of the 2009 Asia and South Pacific Des. Automation Conf., 2009, pp. 234–241.
[3] G. Van der Plas et al., “Design issues and considerations for low-cost 3-D TSV IC technology,” IEEE J. Solid-State Circuits, vol. 46, no. 1, pp. 293–307, 2011.
[4] L. Jiang et al., “On effective through-silicon via repair for 3-D-stacked ICs,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 32, no. 4, pp. 559–571, 2013.
[5] A. Eghbal et al., “Analytical fault tolerance assessment and metrics for TSV-based 3D network-on-chip,” IEEE Trans. Comput., vol. 64, no. 12, pp. 3591–3604, 2015.
[6] Y. J. Park et al., “Thermal analysis for 3D multi-core processors with dynamic frequency scaling,” in 2010 IEEE/ACIS 9th Int. Conf. on Comput. and Inform. Sci. (ICIS). IEEE, 2010, pp. 69–74.
[7] M. Cho et al., “Design method and test structure to characterize and repair TSV defect induced signal degradation in 3D system,” in Proc. Int. Conf. on Comput.-Aided Des., 2010, pp. 694–697.
[8] F. Ye and K. Chakrabarty, “TSV open defects in 3D integrated circuits: Characterization, test, and optimal spare allocation,” in Proc. of the 49th Annu. Des. Automation Conf. ACM, 2012, pp. 1024–1030.
[9] R. W. Hamming, “Error detecting and error correcting codes,” Bell Labs Tech. J., vol. 29, no. 2, pp. 147–160, 1950.
[10] M.-Y. Hsiao, “A class of optimal minimum odd-weight-column SECDED codes,” IBM J. Res. Dev., vol.
14, no. 4, pp. 395–401, 1970.
[11] R. Kumar and S. P. Khatri, “Crosstalk avoidance codes for 3D VLSI,” in Automation and Test in Europe. EDA Consortium, 2013, pp. 1673–1678.
[12] B. Fu and P. Ampadu, “On Hamming product codes with type-II hybrid ARQ for on-chip interconnects,” IEEE Trans. Circuits Syst. I, vol. 56, no. 9, pp. 2042–2054, 2009.
[13] K. N. Dang et al., “Scalable design methodology and online algorithm for TSV-cluster defects recovery in highly reliable 3D-NoC systems,” IEEE Trans. Emerg. Topics Comput., in press.
[14] Y. Lou et al., “Comparing through-silicon-via (TSV) void/pinhole defect self-test methods,” Journal of Electronic Testing, vol. 28, no. 1, pp. 27–38, 2012.
[15] K. N. Dang et al., “TSV-IaS: Analytic analysis and low-cost nonpreemptive on-line detection and correction method for TSV defects,” in The IEEE Symposium on VLSI (ISVLSI) 2019, 2019.
[16] Y. Zhao et al., “Online Fault Tolerance Technique for TSV-Based 3-D-IC,” IEEE Trans. VLSI Syst., vol. 23, no. 8, pp. 1567–1571, 2015.
[17] B. Noia et al., “Pre-bond probing of TSVs in 3D stacked ICs,” in 2011 IEEE Int. Test Conf. (ITC). IEEE, 2011, pp. 1–10.
[18] M. Hsiao, D. Bossen, and R. Chien, “Orthogonal Latin square codes,” IBM Journal of Research and Development, vol. 14, no. 4, pp. 390–394, 1970.
[19] A. Dutta and N. A. Touba, “Multiple bit upset tolerant memory using a selective cycle avoidance based SEC-DED-DAEC code,” in 25th IEEE VLSI Test Symp. IEEE, 2007, pp. 349–354.
[20] L.-J. Saiz-Adalid et al., “MCU tolerance in SRAMs through low-redundancy triple adjacent error correction,” IEEE Trans. VLSI Syst., vol. 23, no. 10, pp. 2332–2336, 2015.
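The encoding and decoding steps of Section II (parity bits, row and column syndromes, mask-based correction, and the NACK condition) can be illustrated with a small software model. This is only a sketch under stated assumptions: the paper's design is a Verilog-HDL circuit, whereas the Python functions below (`encode`, `syndromes`, `decode`, `flip`) are hypothetical names introduced here for illustration, and the model covers only the original matrix, not the Orthogonal Latin Square switching.

```python
from functools import reduce
from operator import xor


def encode(data):
    """Build the (M+1) x (N+1) flit from an M x N list of 0/1 data bits.

    Each row gets its parity r_i appended; the last row holds the column
    parities c_j and the corner bit u.
    """
    rows = [row + [reduce(xor, row)] for row in data]      # b_{i,*} and r_i
    last = [reduce(xor, col) for col in zip(*rows)]         # c_j and u
    return rows + [last]


def syndromes(flit):
    """Return (sr, sc): the XOR of every row and of every column of the flit."""
    sr = [reduce(xor, row) for row in flit]
    sc = [reduce(xor, col) for col in zip(*flit)]
    return sr, sc


def decode(flit):
    """Return (flit_out, nack) following the mask and NACK rules of Section II-C."""
    sr, sc = syndromes(flit)
    if sum(sr) >= 2 or sum(sc) >= 2:
        return flit, True                                   # request retransmission
    # Mask(i, j) = 1 only where a raised row syndrome meets a raised column one.
    corrected = [[bit ^ (sr[i] & sc[j]) for j, bit in enumerate(row)]
                 for i, row in enumerate(flit)]
    return corrected, False


def flip(flit, positions):
    """Hypothetical fault injector: invert the bits at the given (row, col) pairs."""
    out = [row[:] for row in flit]
    for i, j in positions:
        out[i][j] ^= 1
    return out


if __name__ == "__main__":
    data = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 0, 0], [0, 1, 1, 1]]
    flit = encode(data)
    # One fault: located by one row and one column syndrome, then corrected.
    print(decode(flip(flit, [(1, 2)])) == (flit, False))
    # Two faults: at least one syndrome array has two raised bits, so NACK.
    print(decode(flip(flit, [(0, 0), (2, 3)]))[1])
    # Three faults at (i, j), (i, k), (l, j): no NACK, bit (l, k) is mis-corrected.
    out, nack = decode(flip(flit, [(0, 0), (0, 2), (3, 0)]))
    print(nack, out == flit)
```

In this model the three-fault case of Section II-D shows up directly: the decoder raises no NACK and flips a fourth, previously healthy bit, which is the behavior the matrix-switching extension is meant to break.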