Residue Number Systems
Theory and Applications

P.V. Ananda Mohan
R&D, CDAC
Bangalore, Karnataka, India

ISBN 978-3-319-41383-9
ISBN 978-3-319-41385-3 (eBook)
DOI 10.1007/978-3-319-41385-3
Library of Congress Control Number: 2016947081
Mathematics Subject Classification (2010): 68U99, 68W35

© Springer International Publishing Switzerland 2016

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper

This book is published under the trade name Birkhäuser. The registered company is Springer International Publishing AG Switzerland (www.birkhauser-science.com).

To the Goddess of learning Saraswati and Shri Mahaganapathi

Preface

The design of algorithms and hardware implementation for signal processing systems has received considerable attention over the last few decades. The primary area of application was in digital
computation and digital signal processing. These systems earlier used microprocessors and, more recently, field programmable gate arrays (FPGA), graphical processing units (GPU), and application-specific integrated circuits (ASIC). The technology is evolving continuously to meet the demands of low power, low area, and low computation time. Several number systems have been explored in the past, such as the conventional binary number system, the logarithmic number system, and the residue number system (RNS), and their relative merits have been well appreciated.

The residue number system was applied for digital computation in the early 1960s, and hardware was built using the technology available at that time. During the 1970s, active research in this area commenced with application in digital signal processing. The emphasis was on exploiting the power of RNS in applications where several multiplications and additions needed to be carried out efficiently using small word length processors. The research carried out was documented in an IEEE Press publication in 1975. During the 1980s, there was a resurgence in this area with an emphasis on hardware that did not need ROMs. Extensive research has been carried out since the 1980s, and several techniques have been developed for overcoming certain bottlenecks in sign detection, scaling, comparison, and forward and reverse conversion.

A compilation of the state of the art was attempted in 2002 in a textbook, and this was followed by another book in 2007. Since 2002, several new investigations have been carried out to increase the dynamic range using more moduli, special moduli which are close to powers of two, and designs that use only combinational logic. Several new algorithms/theorems for reverse conversion, comparison, scaling, and error correction/detection have also been investigated. The number of moduli has been increased while at the same time retaining the speed/area advantages.

It is interesting to note that in addition to application
in computer arithmetic, application in digital communication systems has gained a lot of attention. Several applications in wireless communication, frequency synthesis, and realization of transforms such as the discrete cosine transform have been explored.

The most interesting development has been the application of RNS in cryptography. Cryptographic algorithms used in authentication need large word lengths, ranging from 1024 bits to 4096 bits for the RSA (Rivest Shamir Adleman) algorithm and from 160 bits to 256 bits for elliptic curve cryptography, and some of these algorithms have been realized using residue number systems. Several applications have been in the implementation of the Montgomery algorithm and of pairing protocols, which need thousands of modulo multiplication, addition, and reduction operations. Recent research has shown that RNS can be one of the preferred solutions for these applications, and thus it is necessary to include this topic in the study of RNS-based designs.

This book brings together various topics in the design and implementation of RNS-based systems. It should be useful for the cryptographic research community, researchers, and students in the areas of computer arithmetic and digital signal processing. It can be used for self-study, and numerical examples have been provided to assist understanding. It can also be prescribed for a one-semester course in a graduate program.

The author wishes to thank Electronics Corporation of India Limited, Bangalore, where a major part of this work was carried out, and the Centre for Development of Advanced Computing, Bangalore, where some part was carried out, for providing an outstanding R&D environment. He would like to express his gratitude to Dr. Nelaturu Sarat Chandra Babu, Executive Director, CDAC Bangalore, for his encouragement. The author also acknowledges Ramakrishna, Shiva Rama Kumar, Sridevi, Srinivas, Mahathi, and his grandchildren Baby Manognyaa and Master
Abhinav for the warmth and cheer they have spread. The author wishes to thank Danielle Walker, Associate Editor, Birkhäuser Science, for arranging the reviews, for her patience in waiting for the final manuscript, and for her assistance in launching the book into production. Special thanks are also due to Agnes Felema A. and the production and graphics team at SPi-Global for most efficiently typesetting, editing, and readying the book for production.

Bangalore, India
April 2015
P.V. Ananda Mohan

Contents

1 Introduction
   References
2 Modulo Addition and Subtraction
   2.1 Adders for General Moduli
   2.2 Modulo $(2^n - 1)$ Adders
   2.3 Modulo $(2^n + 1)$ Adders
   References
3 Binary to Residue Conversion
   3.1 Binary to RNS Converters Using ROMs
   3.2 Binary to RNS Conversion Using Periodic Property of Residues of Powers of Two
   3.3 Forward Conversion Using Modular Exponentiation
   3.4 Forward Conversion for Multiple Moduli Using Shared Hardware
   3.5 Low and Chang Forward Conversion Technique for Arbitrary Moduli
   3.6 Forward Converters for Moduli of the Type $(2^n \pm k)$
   3.7 Scaled Residue Computation
   References
4 Modulo Multiplication and Modulo Squaring
   4.1 Modulo Multipliers for General Moduli
   4.2 Multipliers mod $(2^n - 1)$
   4.3 Multipliers mod $(2^n + 1)$
   4.4 Modulo Squarers
   References
5 RNS to Binary Conversion
   5.1 CRT-Based RNS to Binary Conversion
   5.2 Mixed Radix Conversion-Based RNS to Binary Conversion
   5.3 RNS to Binary Conversion Based on New CRT-I, New CRT-II, Mixed-Radix CRT and New CRT-III
   5.4 RNS to Binary Converters for Other Three Moduli Sets
   5.5 RNS to Binary Converters for Four and More Moduli Sets
   5.6 RNS to Binary Conversion Using Core Function
   5.7 RNS to Binary Conversion Using Diagonal Function
   5.8 Performance of Reverse Converters
   References
6 Scaling, Base Extension, Sign Detection and Comparison in RNS
   6.1 Scaling and Base Extension Techniques in RNS
   6.2 Magnitude Comparison
   6.3 Sign Detection
   References
7 Error Detection, Correction and Fault Tolerance in RNS-Based Designs
   7.1 Error Detection and Correction Using Redundant Moduli
   7.2 Fault Tolerance Techniques Using TMR
   References
8 Specialized Residue Number Systems
   8.1 Quadratic Residue Number Systems
   8.2 RNS Using Moduli of the Form $r^n$
   8.3 Polynomial Residue Number Systems
   8.4 Modulus Replication RNS
   8.5 Logarithmic Residue Number Systems
   References
9 Applications of RNS in Signal Processing
   9.1 FIR Filters
   9.2 RNS-Based Processors
   9.3 RNS Applications in DFT, FFT, DCT, DWT
   9.4 RNS Application in Communication Systems
   References
10 RNS in Cryptography
   10.1 Modulo Multiplication Using Barrett's Technique
   10.2 Montgomery Modular Multiplication
   10.3 RNS Montgomery Multiplication and Exponentiation
   10.4 Montgomery Inverse
   10.5 Elliptic Curve Cryptography Using RNS
   10.6 Pairing Processors Using RNS
   References
Index

Chapter 1
Introduction

Digital computation is conventionally carried out using the binary number system. Processors with word lengths up to 64 bits have been quite common. It is well known that basic operations such as addition can be carried out using a variety of adders, such as carry propagate adders, carry look ahead adders, and parallel-prefix adders, with different addition times and area requirements. Several algorithms for high-speed multiplication and division are also available and are being continuously researched with the design objectives of low power, low area, and high speed. Fixed-point as well as floating-point processors are widely available. Interestingly, operations such as sign detection, magnitude comparison, and scaling are quite easy in these systems.

In applications such as cryptography, there is a need for processors with word lengths ranging from 160 bits to 4096 bits. For such requirements, special techniques are needed to reduce the computation time. Applications in digital
signal processing also continuously look for processors for fast execution of the multiply-and-accumulate instruction. Several alternative techniques have been investigated for speeding up multiplication and division. An example is using logarithmic number systems (LNS) for digital computation. However, using LNS, addition and subtraction are difficult.

In binary and decimal number systems, the position of each digit determines the weight. The leftmost digits have higher weights. The ratio between the weights of adjacent digits can be constant or variable. The latter is called the Mixed Radix Number System (MRS) [1]. For a given integer $X$, the MRS digit can be found as

$$x_i = \left\lfloor \frac{X}{\prod_{j=0}^{i-1} M_j} \right\rfloor \bmod M_i \qquad (1.1a)$$

Figure 10.26 Algorithm for Tate pairing for MNT curves (adapted from [69])

latter needs modular multiplications in $F_p$ and overall, the second step of the final exponentiation needs 23 multiplications and 11 reductions in $F_p$. The hard part involves one Frobenius (five modular multiplications), one multiplication in $F_{p^6}$, and one exponentiation by $2l$. For each step of the exponentiation, 12 multiplications and 6 reductions or 18 multiplications and 6 reductions are required, depending on whether multiplications are not needed or needed.

For a 96-bit security, $\ell$ has a bit length of 192, implying that lines 1–6 are done 191 times and lines 7–12 around 96 times. In total, the Miller loop needs 191 × (10 + + 30) + 96 × (11 + + 8) = 12,815 multiplications and 191 × (8 + 8 + 12) + 96 × (10 + 6 + 6) = 7460 reductions. The easy part of the final exponentiation needs one inversion, 77 multiplications and 33 reductions in $F_p$. Considering that $2l$ is 96 bits long, the hard part can perform the exponentiation using a sliding window of size 3 for computing $f^{2l}$. This needs 96 squarings in $F_{p^6}$, 24 multiplications in $F_{p^6}$ and three pre-computations. Thus, the hard part needs
5 + 18 + 97 × 12 + 27 × 18 = 1673 multiplications and 5 + 6 + 97 × 6 + 27 × 6 = 755 reductions. Thus, the full Tate pairing needs 14,565 multiplications and 8248 reductions. A radix implementation, on the other hand, needs 14,565 × 6² + 8248 × (6² + 6) = 870,756 word multiplications, whereas RNS needs 1.1 × (14,565 × 2 × 6 + 8248 × (7/5 × 6² + 8/5 × 6)) = 736,626 word multiplications, indicating a gain of 15.4 %.

In the case of Ate pairing, lines 1–4 are done in $F_{p^3}$, requiring 10 multiplications and 8 reductions in $F_{p^3}$, i.e. 60 multiplications and 24 reductions in $F_p$. Similarly, lines 7–10 require 11 multiplications and 10 reductions in $F_{p^3}$, or 66 multiplications and 30 reductions in $F_p$. If the coordinates of $T$ are $(X_T, Y_T\beta, Z_T)$, lines 5 and 11 must be replaced by

$$5': \quad g = Z_{2T} Z_T^2\, y_P + A\left(X_T - Z_T^2\, x_P\right) - 2\nu Y_T^2$$
$$11': \quad g = -Z_{T+P}\, y_Q + Z_{T+P}\, y_P - F\left(x_P - x_Q\right) \qquad (10.56)$$

where $Z_{2T}$, $Z_T^2$, $A = 3X_T^2 - a_4 Z_T^4$, $Y_T^2$, $Z_{T+P}$, and $F = Y_T - y_Q Z_T^3$ are computed in $F_{p^3}$ in the previous steps. The first requires 18 multiplications and 12 reductions in $F_p$, whereas the second requires 15 multiplications and 6 reductions. Finally, since $t - 1$ has bit length 96 and Hamming weight of about 48, the Miller loop requires 95 × (60 + 18 + 30) + 47 × (66 + 15 + 18) = 14,913 multiplications and 95 × (24 + 12 + 12) + 47 × (30 + 6 + 6) = 6534 reductions. The final exponentiation is the same as in Tate pairing, and thus the full Ate pairing needs 16,663 multiplications and 7322 reductions. In radix representation, this means that 907,392 word multiplications are needed, whereas in RNS only 703,204 word multiplications are needed. The gain is thus about 22.5 %.

In the case of BN curves, the flow chart is presented in Figure 10.27. Due to the twist of order 6 for BN curves, some improvements can be made. The author considers that $F_{p^{12}}$ is built as a quadratic extension of a cubic extension of $F_{p^2}$, which is compatible with the use of the twist of order 6. Due to the twist defined by $\nu$, the second input of the Tate pairing can be written as $Q = (x_Q\gamma^2, y_Q\gamma^3)$ with $x_Q, y_Q \in F_{p^2}$.
As seen earlier in the case of MNT curves, lines 1–4 in the algorithm need 7 multiplications in $F_p$ and 6 modular reductions, whereas lines 7–10 require 11 multiplications in $F_p$ and 10 modular reductions in $F_p$. Since $x_Q$ and $y_Q$ are in $F_{p^2}$, line 5 requires 8 modular multiplications in $F_p$, and lazy reduction cannot be used. In line 11, 6 multiplications and only 5 modular reductions are needed, since lazy reduction can be used on the constant term. Line 6 involves both a squaring and a multiplication in $F_{p^{12}}$. The squaring requires 36 multiplications and 12 reductions. Furthermore, the multiplication by $g$ needs 39 multiplications and 12 reductions. Thus, the total complexity for line 6 is 75 multiplications and 24 modular reductions in $F_p$. The case of line 12 is similar, and it needs 39 modular multiplications and 12 reductions in $F_p$.

Figure 10.27 Algorithm for Tate pairing for BN curves (adapted from [69])

Line 13 computes $f^{p^6}/f$, where the computation of $f^{p^6}$ is free by conjugation. Hence, one multiplication and one inversion are needed in $F_{p^{12}}$. This inversion needs one inversion, 97 multiplications and 35 reductions in $F_p$. The first step of the exponentiation thus requires 151 multiplications, 47 modular reductions and one inversion in $F_p$. Line 14 involves one multiplication in $F_{p^{12}}$ and one powering to $p^2$. The Frobenius map and its iterations need 11 modular multiplications in $F_p$. This step thus needs 65 multiplications and 23 reductions in $F_p$.

The hard part given in line 15 involves one Frobenius (11 modular multiplications), one multiplication in $F_{p^{12}}$ (54 multiplications and 12 reductions) and one exponentiation. Since for BN curves $l$ can be chosen as sparse, a classical square and multiply can be used. Since in line 13 $f$ has been raised to the power $(p^6 - 1)$, it is a unit and can be squared with only squarings and reductions in $F_p$ (i.e. 24 multiplications and 12 reductions in $F_p$). Thus, the cost is only 24 multiplications and 12 reductions for most
steps. For steps corresponding to the non-zero bits of the exponent, 54 additional multiplications and 12 additional reductions are necessary. In line 16, four applications of the Frobenius map, multiplications and squarings in $F_{p^{12}}$ (i.e. 674 multiplications and 224 reductions in $F_p$) are needed. It also needs an exponentiation which is similar to line 15 but two times larger.

Considering a Hamming weight of $l$ as 11 and of $\ell$ as 90, we observe that lines 1–6 are done 255 times and lines 7–12 are done 89 times for a 128-bit security level. Thus, the Miller loop needs 255 × (7 + 8 + 75) + 89 × (11 + 6 + 39) = 27,934 multiplications and 255 × (6 + 8 + 24) + 89 × (10 + 5 + 12) = 12,093 reductions. The easy part of the final exponentiation requires one inversion, 216 multiplications and 70 reductions in $F_p$. The hard part involves exponentiation by $6l - 5$, which has Hamming weight of 11, and by $6l^2 + 1$, which has Hamming weight of 28. The second exponentiation can be split into two parts, $l$ and $6l$ [88], both having Hamming weight of 11. This leads to 21 multiplications. Lines 15 and 16 require 11 + 54 + 65 × 24 + 9 × 54 + 674 + 127 × 24 + 21 × 54 = 6967 multiplications and 11 + 12 + 65 × 12 + 9 × 12 + 224 + 127 × 12 + 21 × 12 = 2911 reductions. Thus, the full Tate pairing needs 35,117 multiplications but only 15,074 reductions. For a radix implementation using 32-bit words, we need 35,117 × 8² + 15,074 × (8² + 8) = 3,332,816 word multiplications, whereas RNS needs 1.1 × (35,117 × 2 × 8 + 15,074 × (7/5 × 8² + 8/5 × 8)) = 2,315,994 word multiplications. This is a gain of 30.5 %.

In the case of Ate pairing, lines 1–4 are done in $F_{p^2}$, requiring multiplications, squarings and reductions in $F_{p^2}$, i.e. 17 multiplications and 12 reductions in $F_p$. Similarly, lines 7–10 require multiplications and squarings and 10 reductions in $F_{p^2}$, or 30 multiplications and 20 reductions in $F_p$. If the coordinates of $T$ are $(X_T\gamma^2, Y_T\gamma^3, Z_T)$, lines 5 and 11 must be replaced by

$$5': \quad g = Z_{2T} Z_T^2\, y_P - A Z_T^2\, x_P\,\gamma + \left(A X_T - 2 Y_T^2\right)\gamma^3$$
$$11': \quad g = Z_{T+P}\, y_P - F x_P\,\gamma + \left(F x_Q - Z_{T+P}\, y_Q\right)\gamma^3 \qquad (10.57)$$
where $Z_{2T}$, $A = 3X_T^2$, $Y_T^2$, $Z_{T+P}$, and $F = Y_T - y_Q Z_T^3$ were computed in the previous steps. The first requires 15 multiplications and 12 reductions in $F_p$, whereas the second requires 10 multiplications and 6 reductions. Note further that the value $g$ obtained has only terms in $\gamma$, $\gamma^3$ and a constant term, so that a multiplication by $g$ requires only 39 multiplications instead of 54. Next, since $t - 1$ has bit length 128 and Hamming weight 29, the total cost of the Miller loop is 127 × (17 + 15 + 36 + 39) + 28 × (30 + 10 + 39) = 15,801 multiplications and 127 × (12 + 12 + 24) + 28 × (20 + 6 + 12) = 7160 reductions. The final exponentiation is the same as in Tate pairing, and thus the full Ate pairing needs 22,984 multiplications and 10,241 reductions. In radix representation, this means that 2,208,328 word multiplications are needed, whereas in RNS only 1,558,065 word multiplications are needed. The gain is thus 29.5 %.

In the case of R-Ate pairing, while the Miller loop is the same, an additional step is necessary at the end: the computation of $f \cdot \left(f \cdot g_{(T,Q)}(P)\right)^p \cdot g_{(\pi(T+Q),\,T)}(P)$, where $T = (6l + 2)Q$ is computed in the Miller loop and $\pi$ is the Frobenius map on the curve. The following operations are needed in this computation. One step of addition as in the Miller loop (computation of $T + Q$ and $g_{(T+Q)}(P)$) needs 40 multiplications and 26 reductions in $F_p$. As $p \equiv 1 \bmod 6$ for BN curves, one application of the Frobenius map is needed, which requires two multiplications in $F_{p^2}$ by pre-computed values. Next, one non-mixed addition step (computation of $g_{(\pi(T+Q),\,T)}(P)$) needs 60 multiplications and 40 reductions in $F_p$. Two multiplications of the results of the two previous steps each require 39 multiplications and 12 reductions in $F_p$. Next, a Frobenius needs 11 modular multiplications and, finally, one full multiplication in $F_{p^{12}}$ requires 54 multiplications and 12 reductions in $F_p$. Thus, in total this step requires 249 multiplications and 117 reductions in $F_p$. Considering that $6l + 2$ has 66 bits and Hamming
weight of 9, the cost of the Miller loop is 65 × (17 + 15 + 36 + 39) + 8 × (30 + 10 + 39) = 7587 multiplications and 65 × (12 + 12 + 24) + 8 × (20 + 6 + 12) = 3424 reductions. The final exponentiation is the same as for Tate pairing. Hence, for the complete R-Ate pairing, we need 15,019 multiplications and 6405 reductions. This means that 1,422,376 word multiplications in radix representation and 985,794 in the case of RNS will be required, thus saving 30.7 %.

Kammler et al. [100] have described an ASIP (application-specific instruction set processor) for BN curves. They consider the NIST recommended prime group order of 256 bits for $E(F_p)$ and 3072 bits for the finite field $F_{p^k}$ (256 × 12 = 3072, since $k = 12$). This ASIP is programmable for all pairings. They keep the points in Jacobian coordinates throughout the pairing computation, and thus field inversion can be avoided almost entirely. Inversion is accomplished by exponentiation with $(p - 2)$. All the values are kept in Montgomery form throughout the pairing computation.

The authors have used the scalable Montgomery modulo multiplier architecture (see Figure 10.28a) due to Nibouche et al. [101], which can be segmented and pipelined. In this technique, for computing $ABR^{-1} \bmod M$, the algorithm is split into two multiplication operations that can be performed in parallel. It uses carry-save number representation. The actual multiplication is carried out in the left half (see Figure 10.28a), and reduction is carried out in the right half simultaneously. The left half is a conventional multiplier built up of gated full-adders, and the right half is a multiplier with special cells for the LSBs. These LSB cells are built around half-adders. Due to the area constraint, subsets of the regular structure of the multiplier have been used, and computation is performed in multiple cycles. They have used multi-cycle multipliers for $W \times H$ ($W$ is the word length and $H$ is the number of words) of three different sizes: 32 × 8, 64 × and 128 × bits. For example, for a 256-bit multiplier,
$H = 8$ and $W = 32$ can be used. Thus, $A$ is taken 8 bits at a time and $B$ is taken 32 bits at a time, thus needing 256 cycles for multiplication, partial reduction, and addition (see Figure 10.28a). This approach makes the design adaptable to the desired computation performance and allows area to be traded off against the execution time of the multiplication. The structure of the multi-cycle Montgomery multiplier (MMM) is shown in Figure 10.28b. The two's complementer is included in the multiplication unit. The result is stored in the registers of temporary carry-save values CM, SM, SR, and CR.

Figure 10.28 (a) Montgomery multiplier based on the Nibouche et al. technique and (b) multi-cycle Montgomery multiplier (MMM) (adapted from [100] ©2009)

The authors have used a multi-cycle adder unit for modular addition and subtraction. In addition, an enhanced memory architecture has been employed: transparent interleaved memory segmentation. Basically, the number of ports to the memory system is extended to increase the throughput. These memory banks can be accessed in parallel. The authors mention that in 130 nm standard cell technology, an optimal Ate pairing needed 15.8 ms and the frequency was 338 MHz. The number of operations needed for different pairing applications is presented in Table 10.8 in order to illustrate the complexity of a pairing processor.

Table 10.8 Number of operations needed in various pairing computations (adapted from [100] ©IACR 2009)

Pairing      Multiplications   Additions   Inversions
Opt Ate      17,913            84,956
Ate          25,870            121,168
η            32,155            142,772
Tate         39,764            174,974
Comp η       75,568            155,234
Comp Tate    94,693            193,496

Barenghi et al. [102] described an FPGA co-processor for Tate pairing over $F_p$ which used the BKLS algorithm [62]
followed by Lucas laddering [103] for the final exponentiation $(p^k - 1)/r$:

$$f_P(D_Q)^{(p^2-1)/r} = \left((c + id)^{p-1}\right)^m = \left((c - id)^2\right)^m = (a + ib)^m$$

where $m = (p+1)/r$, $a = c^2 - d^2$, and $b = -2cd$. Note that $(a + ib)^m = \frac{V_m(2a)}{2} + i\,b\,U_m(2a)$, where $U_m$ and $V_m$ are the $m$th terms of the Lucas sequence. The prime $p$ is a 512-bit number and $k = 2$ has been used. They have designed a block which can be used for modular addition/subtraction using three 512-bit adders. The adders compute $A + B$, $A + B - M$ and $A - B + M$. Modular multiplication used the Montgomery algorithm based on the CIOS technique. The architecture comprises a microcontroller, a program ROM, an $F_p$ multiplier and adder/subtractor, a register file, and an input/output buffer. The microcontroller realizes Miller's loop by calling the corresponding subroutines. The ALU could execute multiplication and addition/subtraction in parallel. A Virtex-2 8000 (XC2V8000-5FF1152) was used, which needed 33,857 slices, ran at a frequency of 135 MHz, and took a time of 1.61 ms.

References

1. W. Stallings, Cryptography and Network Security, Principles and Practices, 6th edn. (Pearson, Upper Saddle River, 2013)
2. B. Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C (Wiley, New York, 1996)
3. P. Barrett, Implementing the Rivest-Shamir-Adleman Public Key algorithm on a standard Digital Signal Processor, in Proceedings of Annual Cryptology Conference on Advances in Cryptology (CRYPTO '86), pp. 311–323 (1986)
4. A. Menezes, P. van Oorschot, S. Vanstone, Handbook of Applied Cryptography (CRC, Boca Raton, 1996)
5. J.-F. Dhem, Modified version of the Barrett Algorithm, Technical report (1994)
6. M. Knezevic, F. Vercauteren, I. Verbauwhede, Faster interleaved modular multiplication based on Barrett and Montgomery reduction methods. IEEE Trans. Comput. 59, 1715–1721 (2010)
7. J.-J. Quisquater, Encoding system according to the so-called RSA method by means of a microcontroller and arrangement implementing the system, US Patent #5,166,978, 24 Nov 1992
8. C.D. Walter, Fast modular
multiplication by operand scanning, Advances in Cryptology, LNCS, vol. 576 (Springer, 1991), pp. 313–323
9. E.F. Brickell, A fast modular multiplication algorithm with application to two key cryptography, Advances in Cryptology Proceedings of Crypto 82 (Plenum Press, New York, 1982), pp. 51–60
10. C.K. Koc, RSA Hardware Implementation. TR 801, RSA Laboratories (April 1996)
11. C.K. Koc, T. Acar, B.S. Kaliski Jr., Analyzing and comparing Montgomery Multiplication Algorithms, in IEEE Micro, pp. 26–33 (1996)
12. M. McLoone, C. McIvor, J.V. McCanny, Coarsely integrated Operand Scanning (CIOS) architecture for high-speed Montgomery modular multiplication, in IEEE International Conference on Field Programmable Technology (ICFPT), pp. 185–192 (2004)
13. M. McLoone, C. McIvor, J.V. McCanny, Montgomery modular multiplication architecture for public key cryptosystems, in IEEE Workshop on Signal Processing Systems (SIPS), pp. 349–354 (2004)
14. C.D. Walter, Montgomery exponentiation needs no final subtractions. Electron. Lett. 35, 1831–1832 (1999)
15. H. Orup, Simplifying quotient determination in high-radix modular multiplication, in Proceedings of IEEE Symposium on Computer Arithmetic, pp. 193–199 (1995)
16. C. McIvor, M. McLoone, J.V. McCanny, Modified Montgomery modular multiplication and RSA exponentiation techniques, in Proceedings of IEE Computers and Digital Techniques, vol. 151, pp. 402–408 (2004)
17. N. Nedjah, L.M. Mourelle, Three hardware architectures for the binary modular exponentiation: sequential, parallel and systolic. IEEE Trans. Circuits Syst. I 53, 627–633 (2006)
18. M.D. Shieh, J.H. Chen, W.C. Lin, H.H. Wu, A new algorithm for high-speed modular multiplication design. IEEE Trans. Circuits Syst. I 56, 2009–2019 (2009)
19. C.C. Yang, T.S. Chang, C.W. Jen, A new RSA cryptosystem hardware design based on Montgomery's algorithm. IEEE Trans. Circuits Syst. II Analog Digit. Signal Process. 45, 908–913 (1998)
20. A. Tenca, C. Koc, A scalable architecture for modular multiplication based on Montgomery's algorithm. IEEE Trans.
Comput. 52, 1215–1221 (2003)
21. D. Harris, R. Krishnamurthy, M. Anders, S. Mathew, S. Hsu, An improved unified scalable radix-2 Montgomery multiplier, in IEEE Symposium on Computer Arithmetic, pp. 172–175 (2005)
22. K. Kelly, D. Harris, Very high radix scalable Montgomery multipliers, in Proceedings of International Workshop on System-on-Chip for Real-Time Applications, pp. 400–404 (2005)
23. N. Jiang, D. Harris, Parallelized Radix-2 scalable Montgomery multiplier, in Proceedings of IFIP International Conference on Very Large-Scale Integration (VLSI-SoC 2007), pp. 146–150 (2007)
24. N. Pinckney, D. Harris, Parallelized radix-4 scalable Montgomery multipliers. J. Integr. Circuits Syst. 3, 39–45 (2008)
25. K. Kelly, D. Harris, Parallelized very high radix scalable Montgomery multipliers, in Proceedings of Asilomar Conference on Signals, Systems and Computers, pp. 1196–1200 (2005)
26. M. Huang, K. Gaj, T. El-Ghazawi, New hardware architectures for Montgomery modular multiplication algorithm. IEEE Trans. Comput. 60, 923–936 (2011)
27. M.D. Shieh, W.C. Lin, Word-based Montgomery modular multiplication algorithm for low-latency scalable architectures. IEEE Trans. Comput. 59, 1145–1151 (2010)
28. A. Miyamoto, N. Homma, T. Aoki, A. Satoh, Systematic design of RSA processors based on high-radix Montgomery multipliers. IEEE Trans. VLSI Syst. 19, 1136–1146 (2011)
29. K.C. Posch, R. Posch, Modulo reduction in residue Number Systems. IEEE Trans. Parallel Distrib. Syst. 6, 449–454 (1995)
30. J.C. Bajard, L.S. Didier, P. Kornerup, An RNS Montgomery modular multiplication Algorithm. IEEE Trans. Comput. 47, 766–776 (1998)
31. J.C. Bajard, L. Imbert, A full RNS implementation of RSA. IEEE Trans. Comput. 53, 769–774 (2004)
32. A.P. Shenoy, R. Kumaresan, Fast base extension using a redundant modulus in RNS. IEEE Trans. Comput. 38, 293–297 (1989)
33. H. Nozaki, M. Motoyama, A. Shimbo, S. Kawamura, Implementation of RSA Algorithm Based on RNS Montgomery Multiplication, in Cryptographic Hardware and Embedded Systems - CHES, ed. by C. Paar
(Springer, Berlin, 2001), pp. 364–376
34. S. Kawamura, M. Koike, F. Sano, A. Shimbo, Cox-Rower architecture for fast parallel Montgomery multiplication, in Proceedings of International Conference on Theory and Application of Cryptographic Techniques: Advances in Cryptology (EUROCRYPT 2000), pp. 523–538 (2000)
35. F. Gandino, F. Lamberti, G. Paravati, J.C. Bajard, P. Montuschi, An algorithmic and architectural study on Montgomery exponentiation in RNS. IEEE Trans. Comput. 61, 1071–1083 (2012)
36. D. Schinianakis, T. Stouraitis, A RNS Montgomery multiplication architecture, in Proceedings of ISCAS, pp. 1167–1170 (2011)
37. Y.T. Jie, D.J. Bin, Y.X. Hui, Z.Q. Jin, An improved RNS Montgomery modular multiplier, in Proceedings of the International Conference on Computer Application and System Modeling (ICCASM 2010), pp. V10-144–147 (2010)
38. D. Schinianakis, T. Stouraitis, Multifunction residue architectures for cryptography. IEEE Trans. Circuits Syst. 61, 1156–1169 (2014)
39. H.M. Yassine, W.R. Moore, Improved mixed radix conversion for residue number system architectures, in Proceedings of IEE Part G, vol. 138, pp. 120–124 (1991)
40. M. Ciet, M. Neve, E. Peeters, J.J. Quisquater, Parallel FPGA implementation of RSA with residue number systems - can side-channel threats be avoided?, in 46th IEEE International MW Symposium on Circuits and Systems, vol. 2, pp. 806–810 (2003)
41. J.-J. Quisquater, C. Couvreur, Fast decipherment algorithm for RSA public key cryptosystem. Electron. Lett. 18, 905–907 (1982)
42. R. Szerwinski, T. Guneysu, Exploiting the power of GPUs for Asymmetric Cryptography. Lect. Notes Comput. Sci. 5154, 79–99 (2008)
43. B.S. Kaliski Jr., The Montgomery inverse and its applications. IEEE Trans. Comput. 44, 1064–1065 (1995)
44. E. Savas, C.K. Koc, The Montgomery modular inverse - revisited. IEEE Trans. Comput. 49, 763–766 (2000)
45. A.A.A. Gutub, A.F. Tenca, C.K. Koc, Scalable VLSI architecture for GF(p) Montgomery modular inverse computation, in IEEE Computer Society Annual Symposium on VLSI, pp. 53–58 (2002)
46. E. Savas, A
carry-free architecture for Montgomery inversion. IEEE Trans. Comput. 54, 1508–1518 (2005)
47. J. Bucek, R. Lorencz, Comparing subtraction free and traditional AMI, in Proceedings of IEEE Design and Diagnostics of Electronic Circuits and Systems, pp. 95–97 (2006)
48. D.M. Schinianakis, A.P. Kakarountas, T. Stouraitis, A new approach to elliptic curve cryptography: an RNS architecture, in IEEE MELECON, Benalmádena (Málaga), Spain, pp. 1241–1245, 16–19 May 2006
49. D.M. Schinianakis, A.P. Fournaris, H.E. Michail, A.P. Kakarountas, T. Stouraitis, An RNS implementation of an $F_p$ elliptic curve point multiplier. IEEE Trans. Circuits Syst. I Reg. Pap. 56, 1202–1213 (2009)
50. M. Esmaeildoust, D. Schinianakis, H. Javashi, T. Stouraitis, K. Navi, Efficient RNS implementation of Elliptic curve point multiplication over GF(p). IEEE Trans. Very Large Scale Integration (VLSI) Syst. 21, 1545–1549 (2013)
51. P.V. Ananda Mohan, RNS to binary converter for a new three moduli set $\{2^{n+1} - 1, 2^n, 2^n - 1\}$. IEEE Trans. Circuits Syst. II 54, 775–779 (2007)
52. M. Esmaeildoust, K. Navi, M. Taheri, A.S. Molahosseini, S. Khodambashi, Efficient RNS to Binary Converters for the new 4-moduli set $\{2^n, 2^{n+1} - 1, 2^n - 1, 2^{n-1} - 1\}$. IEICE Electron. Exp. 9(1), 1–7 (2012)
53. J.C. Bajard, S. Duquesne, M. Ercegovac, Combining leak resistant arithmetic for elliptic curves defined over $F_p$ and RNS representation, Cryptology ePrint Archive, Report 2010/311 (2010)
54. M. Joye, J.J. Quisquater, Hessian elliptic curves and side channel attacks. CHES, LNCS 2162, 402–410 (2001)
55. P.Y. Liardet, N. Smart, Preventing SPA/DPA in ECC systems using Jacobi form. CHES, LNCS 2162, 391–401 (2001)
56. E. Brier, M. Joye, Weierstrass elliptic curves and side channel attacks. Public Key Cryptography, LNCS 2274, 335–345 (2002)
57. P.L. Montgomery, Speeding the Pollard and elliptic curve methods of factorization. Math. Comput. 48, 243–264 (1987)
58. A. Joux, A one round protocol for tri-partite Diffie-Hellman, Algorithmic Number Theory, LNCS, pp. 385–394 (2000)
59. D. Boneh, M.K. Franklin,
Identity based encryption from the Weil Pairing, in Crypto 2001, LNCS, vol 2139, pp 213–229 (2001) 60 D Boneh, B Lynn, H Shachm, Short signatures for the Weil pairing J Cryptol 17, 297–319 (2004) 61 J Groth, A Sahai, Efficient non-interactive proof systems for bilinear groups, in 27th Annual International Conference on Advances in Cryptology, Eurocrypt 2008, pp 415–432 (2008) 62 V.S Miller, The Weil pairing and its efficient calculation J Cryptol 17, 235–261 (2004) 63 P.S.L.M Barreto, H.Y Kim, B Lynn, M Scott, Efficient algorithms for pairing based cryptosystems, in Crypto 2002, LCNS 2442, pp 354–369 (Springer, Berlin, 2002) 64 F Hess, N.P Smart, F Vercauteren, The eta paring revisited IEEE Trans Inf Theory 52, 4595–4602 (2006) 65 F Lee, H.S Lee, C.M Park, Efficient and generalized pairing computation on abelian varieties, Cryptology ePrint Archive, Report 2008/040 (2008) 66 F Vercauteren, Optimal pairings IEEE Trans Inf Theory 56, 455–461 (2010) 67 S Duquesne, N Guillermin, A FPGA pairing implementation using the residue number System, in Cryptology ePrint Archive, Report 2011/176(2011), http://eprint.iacr.org/ 68 S Duquesne, RNS arithmetic in Fpk and application to fast pairing computation, Cryptology ePrint Archive, Report 2010/55 (2010), http://eprint.iacr.org 69 P Barreto, M Naehrig, Pairing friendly elliptic curves of prime order, SAC, 2005 LNCS 3897, 319–331 (2005) 70 A Miyaji, M Nakabayashi, S Takano, New explicit conditions of elliptic curve traces for FR-reduction IEICE Trans Fundam 84, 1234–1243 (2001) 71 B Lynn, On the implementation of pairing based cryptography, Ph.D Thesis PBC Library, https://crypto.stanford.edu/~blynn/ 72 C Costello, Pairing for Beginners, www.craigcostello.com.au/pairings/PairingsFor Beginners.pdf 73 J.C Bazard, M Kaihara, T Plantard, Selected RNS bases for modular multiplication, in 19th IEEE International Symposium on Computer Arithmetic, pp 25–32 (2009) 74 A Karatsuba, The complexity of computations, in Proceedings of Staklov 
Institute of Mathematics, vol 211, pp 169–183 (1995) 75 P.L Montgomery, Five-, six- and seven term Karatsuba like formulae IEEE Trans Comput 54, 362–369 (2005) 76 J Fan, F Vercauteren, I Verbauwhede, Efficient hardware implementation of Fp-arithmetic for pairing-friendly curves IEEE Trans Comput 61, 676–685 (2012) 77 J Fan, F Vercauteren, I Verbauwhede, Faster Fp-Arithmetic for cryptographic pairings on Barreto Naehrig curves, in CHES, vol 5747, LNCS, pp 240–253 (2009) 346 10 RNS in Cryptography 78 J Fan, http://www.iacr.org/workshops/ches/ches2009/presentations/08_ Session_5/CHES 2009_fan_1.pdf 79 J Chung, M.A Hasan, Low-weight polynomial form integers for efficient modular multiplication IEEE Trans Comput 56, 44–57 (2007) 80 J Chung, M Hasan, Montgomery reduction algorithm for modular multiplication using low weight polynomial form integers, in IEEE 18th Symposium on Computer Arithmetic, pp 230–239 (2007) 81 C.C Corona, E.F Moreno, F.R Henriquez, Hardware design of a 256-bit prime field multiplier for computing bilinear pairings, in 2011 International Conference on Reconfigurable Computing and FPGAs, pp 229–234 (2011) 82 S Srinath, K Compton, Automatic generation of high-performance multipliers for FPGAs with asymmetric multiplier blocks, in Proceedings of 18th Annual ACM/Sigda International Symposium on Field Programmable Gate Arrays, FPGA ‘10, New York, pp 51–58 (2010) 83 R Brinci, W Khmiri, M Mbarek, A.B Rabaa, A Bouallegue, F Chekir, Efficient multipliers for pairing over Barreto-Naehrig curves on Virtex -6 FPGA, iacr Cryptology Eprint Archive (2013) 84 A.J Devegili, C OhEigertaigh, M Scott, R Dahab, Multiplication and squaring on pairing friendly fields, in Cryptology ePrint Archive, vol 71 (2006) 85 A.L Toom, The complexity of a scheme of functional elements realizing the multiplication of integers Sov Math 4, 714–716 (1963) 86 S.A Cook, On the minimum computation time of functions, Ph.D Thesis, Harvard University, Department of Mathematics, 1966 87 J 
Chung, M.A Hasan, Asymmetric squaring formulae, Technical Report, CACR 2006-24, University of Waterloo (2006), http://www.cacr.uwaterloo.ca/techreports/2006/cacr2006-24 pdf 88 D Hankerson, A Menezes, M Scott, Software Implementation of Pairings, in Identity Based Cryptography, Chapter 12, ed by M Joye, G Neven (IOS Press, Amsterdam, 2008), pp 188–206 89 G.X Yao, J Fan, R.C.C Cheung, I Verbauwhede, A high speed pairing Co-processor using RNS and lazy reduction, eprint.iacr.org/2011/258.pdf 90 M Scott, Implementing Cryptographic Pairings, ed by T Takagi, T Okamoto, E Okamoto, T Okamoto, Pairing Based Cryptography, Pairing 2007, LNCS, vol 4575, pp 117–196 (2007) 91 J.L Beuchat, J.E Gonzalez-Diaz, S Mitsunari, E Okamoto, F Rodriguez-Henriquez, T Terya, in High Speed Software Implementation of the Optimal Ate Pairing over BarretoNaehrig Curves, ed by M Joye, A Miyaji, A Otsuka, Pairing 2010, LNCS 6487, pp 21–39 (2010) 92 M Scott, N Benger, M Charlemagne, L.J.D Perez, E.J Kachisa, On the final exponentiation for calculating pairings on ordinary elliptic curves, Cryptology ePrint Archive, Report 2008/ 490(2008), http://eprint.iacr.org/2008/490.pdf 93 A.J Devegili, M Scott, R Dahab, Implementing cryptographic pairings over BarretoNaehrig curves, Pairing 2007, vol 4575 LCNS (Springer, Berlin, 2007), pp 197–207 94 J Olivos, On vectorial addition chains J Algorithm 2, 13–21 (1981) 95 G.X Yao, J Fn, R.C.C Cheung, I Verbauwhede, Novel RNS parameter selection for fast modular multiplication IEEE Trans Comput 63, 2099–2105 (2014) 96 C Costello, T Lange, M Naehrig, Faster pairing computations on curves with high degree twists, ed by P Nguyen, D Pointcheval, PKC 2010, LNCS, vol 6056, pp 224–242 (2010) 97 D Aranha, K Karabina, P Longa, C.H Gebotys, J Lopez, Faster explicit formulae for computing pairings over ordinary curves, Cryptology ePrint Archive, Report 2010/311 (2010), http://eprint.iacr.org/ 98 R Granger, M Scott, Faster squaring in the cyclotomic subgroups of sixth degree 