Thesis Outline Electrical and Computer Engineering
The Implementation and Analysis of the ECDSA on the Motorola StarCore SC140 DSP Primarily Targeting Portable Devices

by Eric W. Smith

A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Applied Science in Electrical and Computer Engineering

Waterloo, Ontario, Canada, 2002

© Eric W. Smith 2002

I hereby declare that I am the sole author of this thesis. I authorize the University of Waterloo to lend this thesis to other institutions or individuals for the purpose of scholarly research. I further authorize the University of Waterloo to reproduce this thesis by photocopying or by other means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly research.

The University of Waterloo requires the signatures of all persons using or photocopying this thesis. Please sign below, and state an address and date.

Abstract

The viability of the Elliptic Curve Digital Signature Algorithm (ECDSA) on portable devices is important due to the growing wireless communications industry, which has inherent insecurities. The StarCore SC140 DSP (SC140) targets portable devices, and is therefore a prime candidate for studying the viability of the ECDSA on such devices. The ECDSA was implemented on the SC140 using a Koblitz curve over GF(2^163). The τ-adic representation of polynomials involved in the elliptic curve point-multiplication is exploited to achieve superior performance. The ECDSA was implemented and optimized in C and assembly, and verified in hardware. The performance of the C and assembly implementations is analyzed and compared to previously published results. The ability of the compiler to generate efficient cryptography-related code, and of the SC140 to perform the required operations efficiently, is discussed. Numerous compiler optimization improvements that considerably enhance the performance of the generated assembly are suggested. Coding guidelines that state simple measures to improve the performance of the implementation and help to achieve efficient C and assembly are listed. Finally, security issues relating to the implementation, with a focus on side-channel attacks (SCA), are investigated, including estimated performance penalties due to adding resiliency. Two SCA countermeasures specific to the implementation are also described. In summary, the implemented ECDSA signature generation and verification processes require 4.43 ms and 8.63 ms, respectively, when the SC140 operates at 300 MHz. Methods of optimizing the implementation to further reduce execution times are also presented.

Acknowledgements

The author would like to thank his supervisor, Professor Catherine Gebotys, for her aid and direction throughout the development of the thesis, as well as the use of computing resources and the StarCore SC140 Software Development Platform (SDP). He would also like to thank friends and family for their support, without which the completion of the thesis would not have been possible. The author is extremely grateful for the financial support provided by the Natural Sciences and Engineering Research Council of Canada (NSERC), Motorola, his supervisor and the Department of Electrical and Computer Engineering at the University of Waterloo. Financial support was provided by the listed entities through various scholarships, which allowed the author to focus more thoroughly on his research and studies.
Contents

1 Introduction
1.1 DSPs and Embedded Systems Security Requirements
1.2 Thesis Objective
1.3 Thesis Overview
2 Public-Key Cryptosystems and the StarCore SC140 DSP
2.1 Public-Key Cryptosystems
2.2 ECC Background
2.2.1 Comparison to Other Cryptographic Techniques
2.3 Digital Signature Schemes
2.4 StarCore SC140 DSP Processor Description
2.5 Previous Cryptographic and DSP Research
3 The ECDSA Algorithm and Implementation Philosophy
3.1 The ECDSA
3.2 Finite Field and Large Integer Arithmetic
3.2.1 Basic Operations
3.2.2 Finite Field Multiplication
3.2.3 Finite Field Squaring
3.2.4 Finite Field Inversion
3.2.5 Large Integer Operations
3.3 Elliptic Curve Arithmetic
3.3.1 Elliptic Curve Point Addition and Subtraction
3.3.2 Elliptic Curve Point Representation
3.3.3 Elliptic Curve Point-Multiplication
3.3.3.1 Non-Adjacent Format
3.3.3.2 Reduced TNAF Representation
3.3.3.3 TNAF Point-Multiplication
3.3.3.4 Width-w TNAF Representation
3.3.3.5 TNAFw Point-Multiplication
3.3.4 Simultaneous Multiple Point-Multiplication
3.4 Implementation and Integration Philosophy
4 Implementation Analysis and Performance Results
4.1 C Data Structures
4.2 Finite Field Operations
4.2.1 Finite Field Addition (c = a ⊕ b)
4.2.2 Finite Field Reduction (c = a mod f)
4.2.3 Finite Field Multiplication (c = a ⋅ b)
4.2.4 Finite Field Squaring (c = a^2)
4.2.5 Finite Field Inversion (c = a^-1 mod f)
4.3 Large Integer Operations
4.3.1 Large Integer Addition and Subtraction (c = a + b; c = a - b)
4.3.2 Large Integer Multiplication (c = a ⋅ b)
4.3.3 Large Integer Division (c = a / b)
4.3.4 Large Integer Inversion (c = a^-1 mod f)
4.4 Elliptic Curve Operations
4.4.1 TNAF Conversion (k_2 → k_TNAF)
4.4.2 Partial Reduction - Partmod δ (k′ = k partmod δ)
4.4.3 TNAF Point-Multiplication (Q = k_TNAF ⋅ P)
4.4.4 TNAFw Conversion (k_2 → k_TNAFw)
4.4.5 TNAFw Point-Multiplication (Q = k_TNAFw ⋅ P)
4.4.6 Simultaneous Multiple Point-Multiplication (R = k ⋅ P + l ⋅ Q)
5 Implementation Comparison and Coding Guidelines
5.1 Performance Comparison with Previous Published Results
5.1.1 Low-Level Performance Comparison
5.1.2 High-Level Performance Comparison
5.2 Guidelines for Writing Efficient C Code for Cryptographic Applications
5.3 Guidelines for Writing Efficient Assembly Code for Cryptographic Applications
5.4 Hand-Written and Compiler-Generated Assembly Comparison
5.4.1 Low-Level Performance Comparison
5.4.2 High-Level Performance Comparison
5.5 Memory Requirements Comparison
6 SC140 and Compiler Analysis for Cryptographic Applications
6.1 Analysis of the SC140 for Elliptic Curve Cryptographic Applications
6.1.1 SC140 Cryptographic Pros
6.1.2 SC140 Cryptographic Cons
6.2 Compiler Optimization Improvements
6.3 Compiler Anomalies
6.3.1 Compiler Anomaly A
6.3.2 Compiler Anomaly B
7 Side-Channel Attack Security Issues
7.1 Timing Attacks
7.2 Simple Power Attacks
7.3 Differential Power Analysis
7.4 SCA Countermeasures Specific to Koblitz Curves and the SC140
7.4.1 Parallel Processing Countermeasure
7.4.2 Koblitz Curve Specific Countermeasure
8 Discussion and Conclusions
8.1 Thesis Summary
8.2 Limitations of the Research and Implementation
8.3 Conclusions
8.4 Future Work
Appendix A – Koblitz Curve Parameters
Bibliography

List of Acronyms

AAU Address Arithmetic Unit
AGU Address Generation Unit
AIA Almost Inverse Algorithm
ALU Arithmetic Logic Unit
ASL Arithmetic Shift Left (by one bit)
ASLL Arithmetic Shift Left (by multiple bits)
ASM Assembly Language Code
ASR Arithmetic Shift Right (by one bit)
ASRR Arithmetic Shift Right (by multiple bits)
BF Branch if False
BFU Bit Field Unit
BT Branch if True
CA Certificate Authority
CGA Compiler-Generated Assembly
CLB Count Leading Bits
CP Critical Path
DALU Data Arithmetic Logic Unit
DL Discrete Logarithm
DLP Discrete Logarithm Problem
DPA Differential Power Analysis
DSA Digital Signature Algorithm
DSP Digital Signal Processor
EC Elliptic Curve
ECC Elliptic Curve Cryptography
ECDLP Elliptic Curve Discrete Logarithm Problem
ECDSA Elliptic Curve Digital Signature Algorithm
EEA Extended Euclidean Algorithm
FF Finite Field
GUI Graphical User Interface
HWA Hand-Written Assembly
IDE Integrated Development Environment
IF Integer Factorization
IFA IF Always
IFF IF False
IFT IF True
JF Jump if False
JT Jump if True
LSL Logical Shift Left (by one bit)
LSLL Logical Shift Left (by multiple bits)
LSR Logical Shift Right (by one bit)
LSRR Logical Shift Right (by multiple bits)
LUT Look-Up Table
MAC Multiply and Accumulate
MIPS Million Instructions Per Second
NAF Non-Adjacent Format
NB Normal Basis
NIST National Institute of Standards and Technology
NOP No Operation
PB Polynomial Basis
PDA Personal Digital Assistant
RRK Random Rotation of Key
SCA Side Channel Attack
SC140 StarCore SC140 DSP
SPA Simple Power Attacks
SMPM Simultaneous Multiple Point-Multiplication
SRAM Static Random Access Memory
TA Timing Attack
TNAF τ-adic NAF
TNAFw Width-w TNAF
VLES Variable Length Execution Set
VLIW Very Long Instruction Word
XXX(A) XXX and XXXA instructions

List of Algorithms

Algorithm 3-1. ECDSA Signature Generation [30]
Algorithm 3-2. ECDSA Signature Verification [30]
Algorithm 3-3. Finite Field Reduction (c = a mod f) [19]
Algorithm 3-4. Finite Field Multiplication (c = a⋅b) [39]
Algorithm 3-5. Finite Field Squaring (c = a^2) [19]
Algorithm 3-6. Finite Field Inversion (b = a^-1 mod f) [20]
Algorithm 3-7. Elliptic Curve Point Addition (P_3 = P_1 + P_2) [38]
Algorithm 3-8. TNAF Conversion (k_TNAF = r_0 + r_1⋅τ) [61]
Algorithm 3-9. Partmod δ Reduction (r_0 + r_1⋅τ := k_2 partmod δ) [61]
Algorithm 3-10. TNAF Point-Multiplication (Q = k_TNAF⋅P) [19]
Algorithm 3-11. TNAFw Conversion (k_TNAFw = r_0 + r_1⋅τ) [61]
Algorithm 3-12. TNAFw Point-Multiplication (Q = k_TNAFw⋅P) [61]
Algorithm 3-13. Simultaneous Multiple Point-Multiplication (R = k⋅P + l⋅Q) [19]
Algorithm 4-1. Improved Finite Field Squaring (c = a^2)
Algorithm 4-2. Improved Finite Field Inversion (c = a^-1 mod f)
Algorithm 4-3. Integer Coefficient to Binary Representation Conversion
Algorithm 7-1. TA Resistant TNAF Point-Multiplication (Q[0] = k_TNAF⋅P) [22]
Algorithm 7-2. Proposed DPA Resistant τ-adic Point-Multiplication

[...]

... target markets include wireless Internet and multimedia, network and data communications, 3rd generation wireless handset systems with wideband data services, wireless and wireline base stations and the corresponding infrastructure [46]. The high performance SC140 is designed to have a large data throughput of 4.8 GBytes/sec. The processor uses a 32-bit unified program and data address space, which is byte...

total function and descendant cycle count, average function cycle count, and average function and descendant cycle count to quickly determine the functions that consume the most execution time, and therefore most likely require optimizations. The data recorded by the profiler is best viewed and analyzed with the IDE and profiler, but can be exported to other formats including HTML, XML and a tab delimited...

1.2 Thesis Objective

The objective of the thesis is to study the performance of ECC, and more precisely the ECDSA, on a DSP targeting portable devices. The ECDSA is implemented on the StarCore SC140 DSP. The implementation is examined thoroughly, and optimized to improve its performance with respect to execution time and code size. The compiler and associated optimizer are examined...
brief description of public-key cryptosystems, focusing on ECC, and the StarCore SC140 DSP is presented. The ECDSA and the algorithms utilized to implement the required finite field and elliptic curve operations are outlined in chapter 3, along with the implementation philosophy. The implementation and performance analysis of the finite field and elliptic curve operations are presented in chapter 4. Chapter...

2000 bits, and the original size of the encrypted message is 100 bits. Larger keys, parameters, signatures, and encrypted messages require more memory for storage and more bandwidth to transmit, both of which are scarce resources when dealing with portable devices. Moreover, even with non-portable devices, there is no reason to unnecessarily squander...

the demand on network security, the wireless communications industry is rapidly expanding. The current trend in the communications industry is increasing wireless services as 3rd generation cellular systems become a reality. The services that cell phones, personal digital assistants (PDAs) and other portable handheld devices provide are ever increasing. The new services require more bandwidth and greater...

DALU register file. Each ALU contains a Multiply and Accumulate (MAC) unit and a Bit-Field Unit (BFU). The MAC is capable of a multiplication of two 16-bit values and an accumulate every clock cycle. The BFU contains a 40-bit bi-directional barrel shift register. It is capable of single and multiple, arithmetic and logical shifts, as well as logical, bit-masking and bit-extraction operations. The Address Generation...
high-speed backbones, and will likely be deployed in future handheld devices. Handheld devices are often part of extensive wireless networks that are naturally insecure, and are extremely susceptible to security risks such as impersonation attacks. Digital signatures allow tasks such as data integrity, data origin authentication and nonrepudiation to be performed. The importance of integrity and origin of data...

defines the binary finite field used for the scope of the thesis, is presented in the paper and below as Algorithm 3-3. The assumption that 32-bit registers are used to store finite field elements, c and a, is made. Therefore, c[i] refers to the ith 32-bit register. Furthermore, ⊕ represents the exclusive-or operation, and >> and ...
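
The excerpt on finite field reduction describes field elements held in 32-bit words c[i] and combined with exclusive-or and shifts. The C sketch below illustrates that style of arithmetic for GF(2^163). It is not taken from the thesis: the function and macro names are hypothetical, and the reduction polynomial f(x) = x^163 + x^7 + x^6 + x^3 + 1 is assumed here because it is the polynomial NIST specifies for its 163-bit binary curves. The reduction follows the widely published word-at-a-time method for that polynomial and may differ in detail from Algorithm 3-3.

```c
#include <stdint.h>

#define GF163_WORDS 6           /* 163 bits fit in six 32-bit words        */
#define GF163_PRODUCT_WORDS 11  /* an unreduced product has up to 325 bits */

/* Field addition in GF(2^163): word-by-word exclusive-or, no carries. */
void gf163_add(uint32_t c[GF163_WORDS],
               const uint32_t a[GF163_WORDS],
               const uint32_t b[GF163_WORDS])
{
    for (int i = 0; i < GF163_WORDS; i++) {
        c[i] = a[i] ^ b[i];
    }
}

/* Reduce an 11-word polynomial in place modulo the ASSUMED polynomial
 * f(x) = x^163 + x^7 + x^6 + x^3 + 1.
 * Bit j of c[i] is the coefficient of x^(32*i + j).
 * On return c[0..5] hold the reduced element and c[6..10] are zero. */
void gf163_reduce(uint32_t c[GF163_PRODUCT_WORDS])
{
    /* Fold each high word down using x^163 = x^7 + x^6 + x^3 + 1.
     * A bit at x^(32*i + j) moves to offsets 29, 32, 35 and 36
     * relative to word i - 6. */
    for (int i = GF163_PRODUCT_WORDS - 1; i >= GF163_WORDS; i--) {
        uint32_t t = c[i];
        c[i - 6] ^= t << 29;
        c[i - 5] ^= (t << 4) ^ (t << 3) ^ t ^ (t >> 3);
        c[i - 4] ^= (t >> 28) ^ (t >> 29);
        c[i] = 0;
    }

    /* Bits 3..31 of c[5] still represent x^163 and above; fold them too. */
    uint32_t t = c[5] >> 3;
    c[0] ^= (t << 7) ^ (t << 6) ^ (t << 3) ^ t;
    c[1] ^= (t >> 25) ^ (t >> 26);
    c[5] &= 0x7u;
}
```

As a quick check under the same assumption, reducing the single term x^163 (bit 3 of word 5) leaves 0xC9 in the lowest word, which is the bit pattern of x^7 + x^6 + x^3 + 1.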