RESEARCH Open Access

A framework for ABFT techniques in the design of fault-tolerant computing systems

Hodjat Hamidi*, Abbas Vafaei and Seyed Amirhassan Monadjemi

Abstract
We present a framework for algorithm-based fault tolerance (ABFT) methods in the design of fault-tolerant computing systems. The ABFT error detection technique relies on the comparison of parity values computed in two ways. Parallel processing of the input parity values produces output parity values that are comparable with the parity values regenerated from the original processed outputs. Numerical data processing errors are detected by comparing parity values associated with a convolution code. This article proposes a new computing paradigm to provide fault tolerance for numerical algorithms. The data processing system is protected through parity values defined by a high-rate real convolution code. Parity comparisons provide error detection, while output data correction is effected by a decoding method that accounts for both round-off error and computer-induced errors. To use ABFT methods efficiently, a systematic form is desirable. A class of burst-correcting convolution codes is investigated. The purpose is to describe new protection techniques that are easily combined with data processing methods, leading to more effective fault tolerance.

Keywords: algorithm-based fault tolerance (ABFT), burst-correcting convolution codes, parity values, syndrome

1. Introduction
Algorithm-based fault tolerance (ABFT) was first introduced by Huang and Abraham [1] and was directed toward the detection of high-level errors caused by internal processing failures. ABFT techniques are most effective when employing a systematic form [2-6]. The motivational model for basic ABFT, as applied to data processing of blocks of real data, is shown in Figures 1 and 2. The ABFT philosophy leads directly to a model from which error correction can be developed. The parity values are determined according to a systematic real convolution code.
Detection relies on two sets of parity values that are computed in two different ways: one set from the input data through a simplified combined processing subsystem, and the other set directly from the processed output data, applying the parity definitions directly. These comparable sets will be very close numerically, although not identical because of round-off error differences between the two parity generation processes. The effects of internal failures and round-off error are modeled by additive error sources located at the output of the processing block and at the input of the threshold detector. This model combines the aggregate effects of errors and failures and applies them to the respective outputs.

ABFT for arithmetic and numerical processing operations is based on linear codes. Bosilca et al. [7] proposed a new ABFT method based on parity check coding for high-performance computing. ABFT based on low density parity check (LDPC) codes is analyzed in [8] and compared with classical Reed-Solomon (RS) codes with respect to different fault models. However, Roche et al. [8] did not provide a method for constructing LDPC codes algebraically and systematically, in the way RS and BCH codes are constructed, and LDPC encoding is very complex because of the lack of appropriate structure. The ABFT methodologies in [9] use parity values dictated by a real convolution code to protect linear processing systems. A class of high-rate burst-correcting convolution codes is discussed in [10]. Convolution codes provide error detection in a continuous manner, using the same computational resources as the algorithm progresses. Redinbo [11] presented a method to put wavelet codes into systematic forms for ABFT applications. This method applies high-rate, low-redundancy wavelet codes which use continuous checking attributes for detecting the onset of errors. However, this technique is suited to image processing and data compression applications. In addition, it is analytically difficult to obtain accurate measures of the detection performance of the ABFT technique using wavelet codes [11,12].

* Correspondence: hamidi@eng.ui.ac.ir
Department of Computer Science, University of Isfahan, Post Code 81746-73441, Isfahan, Iran

Hamidi et al. EURASIP Journal on Advances in Signal Processing 2011, 2011:90
http://asp.eurasipjournals.com/content/2011/1/90
© 2011 Hamidi et al; licensee Springer. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Figure 1 [13] shows the basic architecture of an ABFT system. Existing techniques use various coding schemes to provide the information redundancy needed for error detection and correction. The coding algorithm is closely related to the running process and is often defined by real number codes, generally of the block type [14]. Systematic codes are of most interest because the fault detection scheme can be superimposed on the original process box with the least changes in the algorithm and architecture. The goal is to describe new protection techniques that are easily combined with normal data processing methods, leading to more effective fault tolerance. The data processing system is protected through parity sequences specified by a high-rate real convolution code. Parity comparisons provide error detection, while output data correction is effected by a decoding method that accounts for both round-off error and computer-induced errors. The error detection structures are developed so that they not only detect subsystem errors but also correct errors introduced in the data processing system. Concurrent parity value techniques are very useful in detecting numerical errors in data processing operations, where a single error can propagate to many output errors.

Figure 1 General architecture of ABFT [13].

Figure 2 Block diagram of the ABFT technique.

This article is organized as follows. Section 2 briefly discusses convolution codes. Section 3 proposes the ABFT architecture (ABFT scheme) and the error model, and discusses the method for detecting errors using parity values. Section 4 discusses one class of convolution codes, the burst-error-correcting convolution codes. Section 5 discusses the decoding and corrector system. Section 6 presents the results, evaluations and simulations, and Section 7 presents the conclusions.

2. Convolution codes
A convolution code is an error correcting code that processes information serially, or continuously, in short block lengths [15-21]. A convolution encoder has memory, in the sense that the output symbols depend not only on the current input symbols but also on previous inputs and/or outputs. In other words, the encoder is a sequential circuit [15,17,20]. A rate R = k/n convolution encoder with memory order m can be realized as a k-input, n-output linear sequential circuit with input memory order m; that is, inputs remain in the encoder for an additional m time units after entering. Typically, n and k are small integers, k < n, the information sequence is divided into blocks of length k, and the codeword is divided into blocks of length n. In the important special case k = 1, the information sequence is not divided into blocks and is processed continuously. Unlike with block codes, large minimum distances and low error probabilities are achieved not by increasing k and n but by increasing the memory order m [16, Chapter 11].
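The role of the memory order m can be made concrete with a small sketch. The following Python function is an illustration only, not the encoder used in this article: it assumes the special case k = 1, n = 2 (rate 1/2), real-valued samples, and a hypothetical tap list `taps` of length m + 1. Each output block carries the input sample unchanged (the systematic part) followed by one parity sample that depends on the current input and the m previous inputs.

```python
def encode_systematic(u, taps):
    """Rate-1/2 systematic convolution encoder over the reals (illustrative).

    u    : list of real input samples (k = 1 sample per time unit)
    taps : hypothetical parity weights; taps[j] weights the input that
           entered j time units ago, so the memory order is m = len(taps) - 1
    Returns one (data, parity) pair per time unit (n = 2).
    """
    m = len(taps) - 1
    blocks = []
    for t, x in enumerate(u):
        # Parity is a weighted sum of the current input and up to m past inputs.
        parity = sum(taps[j] * u[t - j] for j in range(m + 1) if t - j >= 0)
        blocks.append((x, parity))
    return blocks


if __name__ == "__main__":
    # A unit impulse reproduces the taps in the parity stream.
    print(encode_systematic([1.0, 0.0, 0.0], [1.0, 2.0, 3.0]))
```

Feeding a unit impulse returns the tap values in the parity positions, which shows that a single input sample influences m + 1 consecutive output blocks.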
We consider only systematic forms of convolution codes because the normal operation of the process block is not altered and no decoding is needed to obtain the true outputs. A systematic real convolution code guarantees that faults representing errors in the processed data will result in notable non-zero values in the syndrome sequence. Systematic encoding means that the information bits always appear in the first (leftmost) k positions of a codeword. The remaining (rightmost) n - k bits in a codeword are a function of the information bits and provide redundancy that can be used for error correction and/or detection. Real number convolution codes may find applications in channel coding for communication systems and in fault-tolerant data processing systems containing error correction. Real-number codes can be constructed easily from finite-field codes by viewing the field elements as the corresponding integers in the real number field, and as such they theoretically have properties at least as good as those of the original finite-field structures [6].

3. Code usage for ABFT and ABFT scheme

3.1. Code usage for ABFT
A real convolution code in systematic form [16] is used to compute parity values associated with the processing outputs, as shown in Figure 2. Certain classes of errors occurring anywhere in the overall system, including the parity generation and regeneration subsystems, are easily detected. A convolution code with its encoding memory can sense the onset of errors before they increase beyond detection limits. For a rate k/n real convolution code with constraint parameter m, it is always possible by simple linear operations to extract the parity generating part. The (n - k) parity samples for each processed block of samples are produced in block processing fashion.
Since processing resources are in close proximity, it is easily demonstrated [9] that an efficient block processing structure can produce the (n - k) parity values directly from the inputs. When these two comparable parity values are subtracted, one from the outputs and the other directly from the inputs, only the stochastic effects remain, and the syndromes are produced as shown in Figure 2.

3.2. Modeling errors
It is generally assumed that transient errors can occur in the intermediate values at any time during the course of data processing, as shown in Figure 3. Furthermore, only one error is permitted during a sequence of operations to avoid complete overload. The proposed error model describes errors by adding a numerical error value e to the calculated output: z = y + e.

Figure 3 Modeling errors in data processing.

3.3. ABFT scheme
To achieve the fault detection and correction properties of a convolution code in a linear process with minimum computational overhead, the architecture in Figure 2 is proposed. For error correction, redundancy must be inserted in some form, and convolution parity codes are employed using ABFT. A systematic form of convolution code is especially profitable in the ABFT detection plan because no redundant transformations are needed to obtain the processed data after the detection operations. Figure 2 summarizes an ABFT technique employing a systematic convolution code to define the parity values. The data processing operations are combined with the parity generating function to provide one set of parity values. Here k is the basic block size of the input data and n is the block size of the output data; for each block of new data samples accepted, (n - k) new parity values are produced.
The upper path in Figure 2 is the processed data flow, which passes through the process block (data processing block) and then feeds the convolution encoder (parity regeneration) to produce parity values. On the other hand, the comparable parity values are generated efficiently and directly from the inputs (parity and processing combined, see Figure 2), without producing the original outputs. The difference between the two comparable parity values, which are computed in different ways, is called the syndrome; in the error-free case, the syndrome sequence is a stream of zero or near-zero values. The convolution code's structure is designed to produce distinct syndromes for a large class of errors appearing in the processing outputs. Figure 2 employs convolution code parity in detecting and correcting processing errors.

3.4. Error detection
The method for detecting errors using parity values is shown in Figure 2. Except for small round-off errors in the arithmetic computations, the two parity values $\bar{p}^{u}_{i}$ and $\bar{p}^{l}_{i}$ should be equal in the error-free case. The comparator computes the difference S between the two parity values and determines whether its magnitude is smaller than a chosen threshold τ determined by round-off error:

\[ S = \bar{p}^{l}_{i} - \bar{p}^{u}_{i}; \quad \text{if } |S| < \tau \text{, then there is no error.} \]

The difference between the parity values, together with the round-off threshold τ, can therefore be used to detect an error. This threshold τ places a bound on the effects of errors appearing at the output, modeled here as a vector e which is added to the true output y to give the observed output z = y + e (see Figure 3). A totally self-checking checker (comparator) for real number parities using a detection threshold is described in [9,11]. Its role is to indicate whether an error has occurred in the process, using the parities $\bar{p}^{l}_{i}$ and $\bar{p}^{u}_{i}$.
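The two parity paths and the threshold test can be sketched as follows. This is a simplified illustration under stated assumptions, not the article's implementation: the process is assumed to be a matrix-vector product y = Ax, a single weighted-sum parity with hypothetical weights `w` stands in for the convolution code parity, and the names `syndrome` and `error_detected` are illustrative. The lower path applies the precombined row $w^{T}A$ directly to the inputs; the upper path regenerates the parity from the observed output z = y + e.

```python
def matvec(A, x):
    """Process block: y = A x for a linear process."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def combined_row(w, A):
    """Parity and processing combined: the row vector w^T A."""
    return [sum(w[r] * A[r][c] for r in range(len(A))) for c in range(len(A[0]))]


def syndrome(A, w, x, e=None):
    """Return S = p_l - p_u for input x and optional additive output error e."""
    y = matvec(A, x)
    z = y if e is None else [yi + ei for yi, ei in zip(y, e)]  # observed output z = y + e
    p_u = sum(wi * zi for wi, zi in zip(w, z))                 # upper path: parity regenerated from outputs
    p_l = sum(ci * xi for ci, xi in zip(combined_row(w, A), x))  # lower path: parity directly from inputs
    return p_l - p_u


def error_detected(S, tau):
    """Comparator rule from the text: no error is declared when |S| < tau."""
    return abs(S) >= tau


if __name__ == "__main__":
    A = [[1.0, 2.0], [3.0, 4.0]]
    w = [1.0, 1.0]
    x = [1.0, 1.0]
    print(syndrome(A, w, x))                 # error-free case
    print(syndrome(A, w, x, e=[0.0, 0.5]))   # one corrupted output sample
```

In the error-free run both paths agree and the syndrome vanishes; with an additive error on one output, the syndrome equals the negative of the parity-weighted error, which the comparator flags once it exceeds τ.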
The comparator is constructed by producing a 1-out-of-2 codeword at the terminals (sign threshold, banded thresholds) = $(T_{SGN}, T_{\tau})$, as shown in Figure 4. Given that S truly represents $\bar{p}^{l}_{i} - \bar{p}^{u}_{i}$, if $|S| \ge \tau$, or if the sign unit or the value-characterization unit has failed while valid parity inputs are applied, the output will not be a valid 1-out-of-2 codeword. Otherwise, the comparator and its checking parts give a 1-out-of-2 codeword, indicating that no error has occurred in the data processing unit and its checking facilities. The precision required for the two parity values, the value characterizations in Figure 4, only needs to meet the separation set by the threshold value to be effective for detection.

4. Burst-error-correcting convolution codes
A burst of length d is defined as a vector whose non-zero components are confined to d consecutive digit positions, the first and last of which are non-zero [16,17]. A burst refers to a group of possibly contiguous errors, which is characteristic of the unforeseeable effects of errors in data computation. Only systematic forms of convolution codes are considered here because the normal operation of the process block is unchanged and no decoding is needed to obtain the true outputs. Moreover, convolution codes have good correcting characteristics because of the memory in their encoding structure [17].

4.1. Bounds on burst-error-correcting convolution codes
Following Costello and Lin [16], a sequence of error bits $e_{d+1}, e_{d+2}, \ldots, e_{d+a}$ is called a burst of length a relative to a guard space of length b if
1. $e_{d+1} = e_{d+a} = 1$;
2. the b bits preceding $e_{d+1}$ and the b bits following $e_{d+a}$ are zero;
3. the a bits from $e_{d+1}$ through $e_{d+a}$ contain no subsequence of b zeros.
For any convolution code of rate R that corrects all bursts of length a or less relative to a guard space of length b,

\[ \frac{b}{a} \ge \frac{1+R}{1-R}. \quad (1) \]

The bound of (1) is known as the bound on complete burst-error correction [16].
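As a worked instance of bound (1), the short sketch below evaluates the minimum guard-space-to-burst ratio for a given rate R = k/n using exact rational arithmetic. The function name `min_guard_ratio` is illustrative and not taken from the article.

```python
from fractions import Fraction


def min_guard_ratio(k, n):
    """Right-hand side of bound (1), (1 + R) / (1 - R), for rate R = k/n."""
    if not 0 < k < n:
        raise ValueError("bound (1) requires 0 < k < n, i.e. a rate below 1")
    R = Fraction(k, n)
    return (1 + R) / (1 - R)


if __name__ == "__main__":
    print(min_guard_ratio(5, 6))   # rate 5/6
    print(min_guard_ratio(1, 2))   # rate 1/2
```

For a rate (n - 1)/n code the ratio simplifies to 2n - 1; with n = 6 this gives 11, which agrees with the memory order of the (6, 5, 11) code used in Section 6.3.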
Massey [20] has also shown that if we allow a small fraction of the bursts of length a to be decoded incorrectly, the guard space requirements can be reduced significantly. In particular, for a convolution code of rate R that corrects all but a fraction ψ of bursts of length a or less relative to a guard space of length b,

\[ \frac{b}{a} \ge \frac{R + [\log_2(1-\psi)]/a}{1-R} \approx \frac{R}{1-R} \quad (2) \]

for small ψ. The bound of (2) is known as the bound on almost all burst-error correction.

Figure 4 Comparator using threshold τ [11].

Because of their convolutional structure, burst-correcting convolution codes are appropriate and efficient for detecting and correcting errors from internal computing failures. Burst-correcting convolution codes need guard bands (error-free regions) before and after bursts of errors, particularly if error correction is needed [16]. One class of burst-correcting codes is the Berlekamp-Preparata (BP) codes [16-20], which have many appropriate characteristics with regard to detecting errors caused by failures. Their design properties guarantee detection of the onset of errors caused by failures, regardless of any error-free region following the beginning of a burst of errors. Consider designing an (n, k = n - 1, m) systematic convolution encoder to correct a phased burst error confined to a single block of n bits relative to a guard space of m error-free blocks. To design such a code, we must ensure that each correctable error value $[E]_m = [e_0, e_1, \ldots, e_m]$ results in a distinct syndrome $[S]_m = [s_0, s_1, \ldots, s_m]$. This implies that each error value with $e_0 \ne 0$ and $e_d = 0$, d = 1, 2, ..., m, must yield a distinct syndrome, and that each of these syndromes must be distinct from the syndrome caused by any error value with $e_0 = 0$ and a single block $e_d \ne 0$, d = 1, 2, ..., m.
Therefore, the first error block $e_0$ can be decoded correctly if the first (m + 1) blocks of e contain at most one non-zero block, and, assuming feedback decoding, each successive error block can be decoded in the same way. An (n, k = n - 1, m) systematic code is described by the set of generator polynomials $g_1^{(n-1)}(D), g_2^{(n-1)}(D), \ldots, g_{n-1}^{(n-1)}(D)$. The generator matrix G of a systematic convolution code is a semi-infinite matrix involving m finite sub-matrices:

\[
G = \begin{bmatrix}
I\,P_0 & 0\,P_1 & 0\,P_2 & \cdots & 0\,P_m & & & \\
 & I\,P_0 & 0\,P_1 & \cdots & 0\,P_{m-1} & 0\,P_m & & \\
 & & I\,P_0 & \cdots & 0\,P_{m-2} & 0\,P_{m-1} & 0\,P_m & \\
 & & & \ddots & & & & \ddots
\end{bmatrix} \quad (3)
\]

where I and 0 are the identity and all-zero k × k matrices, respectively, and $P_i$, i = 0 to m, is a k × (n - k) matrix [18]. The parity-check matrix is constructed from a basic binary matrix, labeled $H_0$, a 2n × n binary matrix containing the skew-identity matrix in its top n rows:

\[ H_m = [H_0, H_1, \ldots, H_m] \quad (4) \]

where $H_0$ is an n × (m + 1) matrix:

\[
H_0 = \begin{bmatrix}
g_{1,0}^{(n-1)} & g_{1,1}^{(n-1)} & \cdots & g_{1,m}^{(n-1)} \\
\vdots & \vdots & & \vdots \\
g_{n-1,0}^{(n-1)} & g_{n-1,1}^{(n-1)} & \cdots & g_{n-1,m}^{(n-1)} \\
1 & 0 & \cdots & 0
\end{bmatrix} \quad (5)
\]

For 0 < d ≤ m, we obtain $H_d$ from $H_{d-1}$ by shifting $H_{d-1}$ one column to the right and deleting the last column. Mathematically, this operation can be expressed as

\[
H_d = H_{d-1} \begin{bmatrix}
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
0 & 0 & 0 & \cdots & 1 \\
0 & 0 & 0 & \cdots & 0
\end{bmatrix} = H_{d-1} T \quad (6)
\]

where T is an (m + 1) × (m + 1) shifting matrix. Another important type of parity check matrix is assembled from $H_0$ and its d successive downward-shifted versions [19]. However, all the information necessary for forming the systematic parity check matrix $H^T$ is contained in the basis matrix $H_0$. The lower triangular part of this matrix, (n - 1) rows by (n - 1) columns, holds binary values selected by a construction method to produce desirable detection and correction properties [19]. For systematic codes, the parity check submatrices $H_m$ in (4) have special forms that control how these equations are formed.
\[ H_0^T = [\,P_0 \mid I_{n-k}\,], \qquad H_i^T = [\,P_i \mid 0_{n-k}\,], \quad i = 1, 2, \ldots, L \quad (7) \]

where $I_{n-k}$ and $0_{n-k}$ are identity and all-zero k × k matrices, respectively, and $P_i$ is an (n - 1) × k matrix. However, in an alternate view, the respective rows of $H_0$ contain the parity submatrices $P_i$ needed in $H^T$, (4) and (7):

\[
H_0 = \begin{bmatrix}
P_0 \mid I \\
P_1 \mid 0 \\
P_2 \mid 0 \\
\vdots \\
P_{L-1} \mid 0 \\
P_L \mid 0
\end{bmatrix} \quad (8)
\]

The n columns of $H_0$ are designed as an n-dimensional subspace of a full (2n)-dimensional space comparable with the size of the row space. Using this notation, the syndrome is

\[
[S]_m = [E]_m H_m^T = e_0 H_0 + e_1 H_1 + \cdots + e_m H_m = e_0 H_0 + e_1 H_0 T + \cdots + e_m H_0 T^m =
\begin{bmatrix} S_i \\ S_{i+1} \\ \vdots \\ S_{i+n} \end{bmatrix} \quad (9)
\]

$[S]_m$ is a syndrome vector with (l + 1) values; in this class of codes, (n - k) equals 1. The design properties of this class of codes ensure that any contribution of errors in one observed vector, $[E]_m$, appearing in the syndrome vector $[S]_m$ is linearly independent of the syndromes caused by ensuing error vectors $[E]_{i+1}, [E]_{i+2}, \ldots, [E]_{i+l}$ in adjacent observed vectors. If, at any time, a single burst of errors is limited to the set $[E]_m$, correction is possible by separating the error effects. These errors in $[E]_m$ are identified from the top n items in $[S]_m$. With

\[
[E]_m = \begin{bmatrix} e_{i,1} \\ e_{i,2} \\ \vdots \\ e_{i,n} \end{bmatrix} \quad (10)
\]

the error values are recovered in reversed order as

\[ e_{i,n} = S_i, \quad e_{i,n-1} = S_{i+1}, \quad \ldots, \quad e_{i,1} = S_{i+n-1} \quad (11) \]

If there are non-zero error bursts in $[E]_{i+1}, [E]_{i+2}, \ldots, [E]_{i+l}$, their accumulated contribution lies in a separate subspace, never permitting the syndrome vector $[S]_m$ to be all zeros. The onset of errors can therefore be detected even if the errors overwhelm the correcting capability of the code.
This distinction between correctable and only-detectable error bursts is achieved by applying an annihilating matrix, denoted $F_0^T$, which is n × 2n and has the defining property $F_0^T H_0 = 0_n$. Hence, it is possible to check whether a syndrome vector $[S]_m$ represents correctable errors: if $F_0^T [S]_m = 0$, then $[S]_m$ corresponds to a correctable error pattern. From (1), for an optimum burst-error-correcting code, b/a = (1 + R)/(1 - R). For the preceding case with R = (n - 1)/n and b = m·n = m·a, this implies that

\[ \frac{b}{a} = m = 2n - 1 \quad (12) \]

i.e., $H_0$ is an n × 2n matrix. We must choose $H_0$ such that the conditions for burst-error correction are satisfied. If we choose the first n columns of $H_0$ to be the skewed n × n identity matrix, then (9) implies that each error sequence with $e_0 \ne 0$ and $e_d = 0$, d = 1, 2, ..., m, will yield a distinct syndrome. In this case, we obtain the estimate of $e_0$ simply by reversing the first n bits in the 2n-bit syndrome. In addition, for each $e_0 \ne 0$, the condition

\[ e_0 H_0 \ne e_d H_0 T^d, \quad d = 1, 2, \ldots, m \quad (13) \]

must be satisfied for all $e_d \ne 0$. This ensures that an error in some other block will not be confused with an error in block zero. For any $e_d \ne 0$ and d ≥ n, the first n positions in the vector $e_d H_0 T^d$ must be zero, since $T^d$ shifts $H_0$ such that $H_0 T^d$ has all zeros in its first d columns; however, for any $e_0 \ne 0$, the vector $e_0 H_0$ cannot have all zeros in its first n positions. Hence, condition (13) is automatically satisfied for n ≤ d ≤ m, m = 2n - 1, and we can replace (13) with the condition that, for each $e_0 \ne 0$,

\[ e_0 H_0 \ne e_d H_0 T^d, \quad d = 1, 2, \ldots, n - 1 \quad (14) \]

5. Decoding and corrector system
The BP codes can be decoded using a general decoding technique for burst-error-correcting convolution codes due to Massey [20]. We recall from (9) that the set of possible syndromes for a burst confined to block 0 is simply the row space of the n × 2n matrix $H_0$.
Hence, if $e_0 \ne 0$ and $e_d = 0$, d = 1, 2, ..., m, then $[S]_m$ is a codeword in the (2n, n) block code generated by $H_0$; however, if $e_0 = 0$ and a single block $e_d \ne 0$ for some d, 1 ≤ d ≤ m, condition (13) ensures that $[S]_m$ is not a codeword in the block code generated by $H_0$. Therefore, $e_0$ contains a correctable error pattern if and only if $[S]_m$ is a codeword in the block code generated by $H_0$. This requires determining whether $[S]_m \cdot H_0^T = 0$, where the matrix applied to $[S]_m$ is the n × 2n block code parity check matrix corresponding to $H_0$. If $[S]_m H_0^T = 0$, the decoder must then find the correctable error pattern that produced the syndrome $[S]_m$. Because in this case $[S]_m = e_0 H_0$, we obtain the estimate of $e_0$ simply by reversing the first n bits in $[S]_m$. For a feedback decoder, the syndrome must then be modified to remove the effect of $e_0$. But, for a correctable error pattern, $[S]_m = e_0 H_0$ depends only on $e_0$, and hence when the effect of $e_0$ is removed the syndrome is reset to all zeros.

Figure 5 gives a more detailed view of some subassemblies in Figure 2, including the error correction system. The processed data $\bar{d}_i$ can include errors $\bar{e}_i$, and the error correction system subtracts the estimates of $\bar{e}_i$, as indicated in the corrected data output of the error correction system. If one of the computed parity values, $\bar{p}^{u}_{i}$ or $\bar{p}^{l}_{i}$ in Figure 5, comes from a failed subsystem, the error correction system's inputs may be incorrect. Since the data are correct under the single failed subsystem assumption, the data contain no errors and the error correction system is operating correctly. The error correction system will observe the errors in the syndromes and properly estimate them as limited to other positions.
In addition, an excessive number of error estimates $\{\bar{e}_i\}$ could be subtracted from correct data, yielding $\{\bar{d}_i - \bar{e}_i\}$ values at the error correction system's output, from which the regeneration of parity values produces $\{\bar{p}^{u}_{i}\}$. There are several indicators that will detect errors in the error correction system's input syndromes $\{\bar{s}_i\}$.

6. Simulations and results

6.1. Design evaluation
The methods discussed in this article are programmed in MATLAB. The MATLAB code forms the basis for a simulation program that explores the role of the threshold τ: with $S = \bar{p}^{l}_{i} - \bar{p}^{u}_{i}$, if $|\bar{p}^{l}_{i} - \bar{p}^{u}_{i}| < \tau$, then there are no errors. If the threshold τ is set too low, even occasional round-off errors will exceed it, indicating failures and leading to unnecessary recomputation. It is generally permissible to accept a few small errors that are in the range of round-off levels. Nevertheless, the simulations examine how the threshold choice impacts undetected errors. Errors are detected by examining the magnitude of the respective syndromes and comparing them against thresholds of five times the standard deviation of the syndrome values when only low levels of round-off error appear. The simulation program randomly selects the line on which an error magnitude is superimposed. The magnitude of each error is chosen from a Gaussian population with zero mean and fixed variance. For small thresholds, large errors always lead to detection, whereas large thresholds increase the undetected error rate. The threshold was varied over a wide range so as to see the transition between low and high levels of missed errors. However, in a simulation, the error-detecting capabilities are interrelated with the variance of the simulated computer-induced errors. The probability of undetected errors when errors are present is evaluated as the ratio of threshold to error variance is varied over several orders of magnitude. The results are shown in Figure 6.
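The dependence of missed errors on the threshold can be illustrated with a small Monte Carlo sketch. This is not the authors' MATLAB program; it is a simplified stand-in that assumes the syndrome equals the injected zero-mean Gaussian error (round-off is ignored and the parity weight is taken as 1), and the function name `undetected_fraction` and its parameters are hypothetical.

```python
import random


def undetected_fraction(tau, sigma, trials=10000, seed=0):
    """Estimate the fraction of injected errors that escape the test |S| >= tau.

    tau    : detection threshold
    sigma  : standard deviation of the injected Gaussian errors
    trials : number of injected errors (one per trial)
    seed   : fixed seed so repeated runs use the same error samples
    """
    rng = random.Random(seed)
    missed = 0
    for _ in range(trials):
        e = rng.gauss(0.0, sigma)  # injected error, assumed to appear unchanged in the syndrome
        if abs(e) < tau:           # below threshold: the comparator declares "no error"
            missed += 1
    return missed / trials


if __name__ == "__main__":
    sigma = 1e-3 ** 0.5  # error variance of 1e-3, as an example value
    for ratio in (0.0, 0.1, 1.0, 10.0):
        print(ratio, undetected_fraction(ratio * sigma, sigma))
```

With a fixed seed the same error samples are reused for every threshold, so the estimated undetected fraction can only grow as the threshold is raised, mirroring the qualitative trend described above.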
The input data size is k = 100 samples. The error magnitude variance is taken as $10^{-3}$ so that, probabilistically, only small errors are superimposed. At very low thresholds, the experimental probability of undetected errors is zero; these values are not displayed on the smallest part of the abscissa. The curves shown in Figure 6 show no undetected errors until the threshold reaches 5, where the first undetected probability is $1.1 \times 10^{-4}$. Two longer simulations using $10^{6}$ samples are performed for two low thresholds of $2 \times 10^{-3}$ and $2 \times 10^{-5}$. The undetected error rate is $4.86 \times 10^{-7}$ when the threshold is $2 \times 10^{-5}$. For the slightly higher threshold of $2 \times 10^{-3}$, this error rate is $4.724 \times 10^{-5}$.

Figure 5 Block diagram of the ABFT technique along with error correction system.

Figure 6 Undetected error probabilities versus threshold choices.

By comparing the differences between the two parity values $\bar{p}^{u}_{i}$ and $\bar{p}^{l}_{i}$, we can show the checking system responding to errors. Figure 7 shows how the errors are reflected at the checker (comparator) output. The top panel shows a very small difference between the two parity values $\bar{p}^{u}_{i}$ and $\bar{p}^{l}_{i}$. The reason for the non-zero differences is round-off error caused by the finite precision of the computing system. In the bottom panel, the values of $|\bar{p}^{l}_{i} - \bar{p}^{u}_{i}|$ reflect the errors that occurred. If the error threshold is set low enough, then most of the errors can be detected by the comparator; however, if we set the threshold too low, the comparator may pick up the round-off errors and treat them as computer-induced errors. Thus, we need to find a good threshold, one which separates the errors caused by finite computational precision from the computer-induced errors.

Figure 7 The response to errors (computer-induced errors): (a) no errors, (b) errors and the difference between the two parity values $\bar{p}^{u}_{i}$ and $\bar{p}^{l}_{i}$.

Figure 8 gives the error detection performance versus the chosen threshold. At small thresholds, the checker picks up most of the errors that occurred. The performance degrades as the threshold increases.

Figure 8 Detection performance of the comparator versus threshold.

6.2. Mean square error performance
The correction procedures are governed by a minimum mean square error (MSE) criterion. This section examines the MSE performance through MATLAB simulations. Errors are inserted additively, in both the code symbols and the syndrome values, to model failures. Simulation runs for the (4, 3) code of rate 3/4 are performed for each standard deviation of the inserted errors, from $10^{-3}$ to $10^{-8}$. The insertion error rate is $p = 5 \times 10^{-3}$. The average MSE plots shown in Figure 9 display the values for the input errors as well as those for the corrected code. The mean-squared values for the input errors are very similar by statistical regularity, while the corrected MSEs are much lower since large errors have been eliminated. Furthermore, the code seems quite capable of correcting all errors. The difference between the input error mean-squared values and their corrected versions can be evaluated by taking the ratio of their mean-squared levels.

Figure 9 Average MSE values versus standard deviation.

6.3. Examples and simulations
A BP burst-correcting convolution code (6, 5, 11) is constructed [16] for use in a fault-tolerant processing situation. A rate 1/3 (3, 1, 10) code, which has constraint parameter m = 10, is chosen from a standard text [16]. Long simulations involving 250,000 blocks of data over a wide range of variances are performed. For the rate 1/3 code, this represents 750,000 samples, while for the rate 5/6 code it implies 1.5 million samples. Bursts and errors within each block are permitted. A burst in this context means that the standard deviations of all components in a block are raised to 10% of the maximum standard deviation. On the other hand, when a burst is not active, errors are allowed with positions within a block chosen independently at random, and those selected have their standard deviations raised to 10% of their maxima. The probability of a burst is $5 \times 10^{-3}$, while intra-block errors have probability $10^{-3}$. For the long simulations, the basic parameter $\sigma^2$ (variance of error) is varied from $10^{-9}$ up to 3.2. The mean-square error performance for the rate 1/3 example is shown in Figure 10a, while that for the processing system protected by the rate 5/6 BP code is displayed in Figure 10b. These plots show consistent improvement for the coded situations over the wide range of modeling error variances. The corrective actions for both cases are displayed in Figure 11. The input errors and correction values are displayed as labeled, but the important plots represent the absolute value of the correction differences.

7. Conclusions
This article addresses new methods for performing error correction when real convolution codes are involved. Real convolution codes can provide effective protection for data processing operations at the data-parity level. Data processing implementations are protected against [...]
et al.: A framework for ABFT techniques in the design of fault-tolerant computing systems. EURASIP Journal on Advances in Signal Processing 2011, 2011:90.

... is effected by a decoding method that includes both round-off error and computer-induced errors. The error detection structures are developed, and they not only detect subsystem errors but also correct errors introduced in the data processing system. Concurrent parity values techniques are very useful in detecting numerical errors in the data processing operations, where a single error can propagate to many output errors. Parity values are the most effective tools used to detect burst errors occurring in the code stream. The detection performance in the data processing system depends on the detection threshold, which is determined by round-off tolerances. The structures have been tested using MATLAB programs that compute the error-detecting performance of the concurrent parity values method, and simulation...

... doi:10.1142/S0218126607003708
14. J Baylis, Error-Correcting Codes: A Mathematical Introduction (Chapman and Hall Ltd, 1998)
15. VS Veeravalli, Fault tolerance for arithmetic and logic unit. IEEE Southeastcon 09, 329–334 (2009)
16. D Costello, S Lin, Error Control Coding Fundamentals and Applications, 2nd edn. (Pearson Education Inc., NJ, 2004)
17. RH Morelos-Zaragoza, The Art of Error Correcting Coding, 2nd edn. (Wiley, 2006). ISBN: 0470015586...
18. AJ Viterbi, JK Omura, Principles of Digital Communication and Coding, 2nd edn. (McGraw-Hill, 1985)
19. ER Berlekamp, A class of convolution codes. Inf Control 6, 1–13 (1962)
20. JL Massey, Implementation of burst-correcting convolution codes. IEEE Trans Inf Theory 11, 416–422 (1965). doi:10.1109/TIT.1965.1053798
21. LHC Lee, Convolutional Coding: Fundamentals and Applications (Artech House,...

Abbreviations
ABFT: algorithm-based fault tolerance; BP: Berlekamp-Preparata; LDPC: low density parity check; MSE: mean square error; RS: Reed-Solomon.

Acknowledgements
The authors are grateful for the comments from Mrs Mahbobeh Meshkinfam, which significantly improved the quality of this article.

Competing interests
The authors declare that they have no competing interests.

Received: 9 May 2011 Accepted: 22 October 2011 Published: 22 October 2011

References
1. KH Huang, JA Abraham, Algorithm-based fault tolerance for matrix operations. IEEE Trans Comput C-33, 518–528 (1984)
2. JY Jou, JA Abraham, Fault tolerant matrix arithmetic and signal processing on highly concurrent computing structures...
... Abraham, Fault-tolerant FFT networks. IEEE Trans Comput 37, 548–561 (1988). doi:10.1109/12.4606
4. P Banerjee, JT Rahmeh, CB Stunkel, VSS Nair, K Roy, JA Abraham, Algorithm-based fault tolerance on a hypercube multiprocessor. IEEE Trans Comput 39, 1132–1145 (1990). doi:10.1109/12.57055
5. J Rexford, NK Jha, Algorithm-based fault tolerance for floating-point operations in massively parallel systems. Proc Int Symp...
... codes for fault-tolerant matrix operations on processor arrays. IEEE Trans Comput 39, 426–435 (1990). doi:10.1109/12.54836
7. G Bosilca, R Delmas, J Dongarra, J Langou, Algorithm-based fault tolerance applied to high performance computing. J Parallel Distrib Comput 69(4), 410–416 (2009). doi:10.1016/j.jpdc.2008.12.002
8. T Roche, M Cunche, JL Roch, Algorithm-based fault tolerance applied to P2P computing networks... 2009 First International Conference on Advances in P2P Systems, 144–149 (2009)
9. GR Redinbo, Generalized algorithm-based fault tolerance: error correction via Kalman estimation. IEEE Trans Comput 47(6), 1864–1876 (1998)
10. GR Redinbo, Failure-detecting arithmetic convolution codes and an iterative correcting strategy. IEEE Trans Comput 52(11), 1434–1442 (2003). doi:10.1109/TC.2003.1244941
11. GR Redinbo, Wavelet ...

... error-detecting. Their design properties guarantee detection of the onset of errors caused by failures, regardless of any error-free region following the beginning of a burst of errors. Consider designing ...

... define the parity values. The data processing operations are combined with the parity generating function to provide one set of parity values. Here k is the basic block size of the input ...