A 45nm High-Throughput and Low Latency AES Encryption for Real-Time Applications

2019 19th International Symposium on Communications and Information Technologies (ISCIT)

Pham-Khoi Dong, Hung K. Nguyen, Xuan-Tu Tran*
SISLAB, VNU University of Engineering and Technology, 144 Xuan Thuy road, Cau Giay, Hanoi, Vietnam
*) Corresponding author's email: tutx@vnu.edu.vn

Abstract: In this paper, we propose a high-throughput and low-latency AES architecture for wideband and real-time applications such as surveillance cameras, video conferencing, motion detection, IoT gateways, and data store encryption. Our design uses an outer round pipeline technique to achieve high throughput. The design has been modelled in RTL VHDL and then synthesized with a 45nm CMOS technology using Synopsys Design Compiler. The implementation results show that the proposed architecture achieves a throughput of 111.3Gbps and a latency of 12.6ns at the maximum operating frequency of 870MHz. With the same 45nm CMOS technology, our design has higher area efficiency (2.4 times) and higher energy efficiency (4.7 times) than other related works.

Keywords: AES, cryptography, high throughput, low latency, real-time applications

I. INTRODUCTION

The Advanced Encryption Standard (AES) was published by the National Institute of Standards and Technology (NIST) in 2001. It is a symmetric block cipher that is intended to replace DES as the approved standard for a wide range of applications [1]. In AES, the number of cipher rounds depends on the size of the key: it is equal to 10, 12, or 14 for 128-, 192-, or 256-bit keys, respectively. Each AES encryption round applies four primary operations in sequence: SubBytes, ShiftRows, MixColumns, and AddRoundKey.

AES implementations can be broadly classified into software and hardware implementations. Compared to a software implementation, a hardware implementation of AES, by nature, provides more physical security and higher speed. In general, the hardware implementation can be performed in either
FPGA or ASIC technologies [2]. Hardware AES designs follow two main trends: low-power cores with limited performance, and high-performance cores.

D.-H. Bui et al. [3] present hardware optimization strategies for high-speed ultralow-power AES architectures. First, the authors used a 32-bit AES datapath to optimize area cost. Next, they utilized eight S-Boxes to improve throughput. Finally, they applied a clock gating strategy to the data storage registers to reduce power consumption. The test chip was verified on ST FDSOI 28nm technology. It achieved a power consumption of less than 20µW for all key configurations, with an energy consumption of less than 1pJ/b and a throughput of 28Mb/s at a 10MHz operating frequency.

The work in [4] presents an ultra-low-power AES encryption core that combines optimized architectures with clock gating. This AES encryption core has been implemented on silicon on thin buried oxide (SOTB) 65nm technology. The implementation results show that, by using two S-Boxes, the AES encryption core requires the smallest number of clock cycles and achieves the lowest power consumption of 0.4µW/MHz. Moreover, the proposed one-S-Box AES encryption core occupies very low hardware resources of 2.4 kilo gate equivalents (kGEs).

Zhao et al. [5] present the architectural exploration of lightweight AES accelerators with the goal of minimizing energy consumption. The number of cycles per encryption in lightweight AES designs is estimated as a function of the number of available S-Boxes. This AES architecture was implemented in a 65nm test chip and achieved 0.83pJ/bit energy at 0.32 V with a throughput of 376kbps.

Works [3], [4], and [5] propose AES designs with extremely low area and low power consumption, but because they use a loop architecture and a low operating frequency, their throughput is limited and their latency is long. Therefore, these architectures are not suitable for high throughput applications and
low latency requirements.

In the second trend, pipelining and sub-pipelining techniques can be applied to increase the operating frequency and throughput of the AES. Hodjat et al. [6] propose AES-128 core architectures with throughputs of 30 to 70Gbps, corresponding to area costs between 180 and 275kGEs, implemented on CMOS 180nm technology. At 30 Gbps, the architecture uses outer round pipelining (one pipeline stage per round) and takes 11 cycles to encrypt a 128-bit block; the corresponding latency is 47 ns. At 70 Gbps, the authors used a 4-stage pipeline in each round, which takes 41 cycles for each 128-bit block, corresponding to a latency of 74.9 ns.

The design in [7] proposes a reconfigurable AES-128/192/256 encryption engine targeted at the on-die acceleration of real-time encryption/decryption of media content on high-performance microprocessor platforms. It was fabricated in CMOS 45nm technology. The design achieves a high throughput of 53Gbps and a maximum operating frequency of 2100MHz. It spends 55 clock cycles per encryption, so its latency is 26.2 ns.

The AES core in [8], running at 1000MHz, achieves a throughput of 128Gbps. This architecture has 20 pipeline stages, so it needs 20 clock cycles to encrypt one block of data. Therefore, the latency of this AES core is 20ns. Despite achieving high throughput, the designs in [8], [9], and [10] have a large latency, and they are inefficient in terms of hardware resources and power consumption due to excessive use of pipelining.

In real-time applications, latency is an important factor. Delays in encryption and decryption, added to other types of delay, can affect the quality of service. Latency in AES encryption is defined by the number of cycles that each data sample takes to go through the encryption data-path before the encrypted output is generated. The inner round pipelining of the AES
algorithm reduces the area while the same throughput is maintained, but the cost is an increase in latency. When there is only outer round pipelining (one pipeline stage per round), the latency is 11 cycles. In designs with two pipeline stages per round, the latency is 21 cycles. For the fully inner and outer round pipelined designs with three or four pipeline stages per round, the latencies are 31 and 41 cycles, respectively [6]. Our target is to design a high-throughput and low-latency AES architecture; therefore, we focus on outer round pipelining architectures.

II. HARDWARE ARCHITECTURE PROPOSAL

A. Proposed AES architecture

The proposed AES core architecture is presented in Fig. 1. In this architecture, the Cipher Round blocks and the Final Round are combinational logic blocks of the AES-128 encryption. To achieve high throughput, we insert registers between the cipher rounds. The pipeline architecture ensures that, once the pipeline stages are fully filled with data, one 128-bit block is encrypted every clock cycle.

In our design, each encryption round is combinational logic. To minimize the latency, the proposed architecture does not use pipelining inside a round. Instead, we apply parallel techniques to reduce the number of clock cycles per encryption.

B. Proposed CipherRound architecture

The micro architecture of each cipher round is presented in Fig. 2. There are four main operations in each cipher round: SubMatrix, ShiftMatrix, MixMatrix, and AddRoundKey. To provide sub-keys for the ten Cipher Round transformations, we design a KeyExpansion architecture that includes ten RoundKey transformations. Between RoundKey transformations, we insert registers to store the sub-key for each cipher round. The details of these operations are presented in the following subsections.
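The pipelining behaviour described in Section II-A can be illustrated with a toy software model. The sketch below is not the authors' RTL: it is a minimal Python model, with hypothetical names (`completion_cycles`, `latency_cycles`), of an 11-stage register pipeline that accepts one block per clock cycle. The `10 * stages_per_round + 1` formula is an assumption that merely reproduces the 11, 21, 31, and 41 cycle latencies quoted from [6].

```python
from collections import deque


def latency_cycles(stages_per_round, rounds=10):
    # Assumed pattern: it reproduces the 11/21/31/41 cycle figures quoted from [6].
    return rounds * stages_per_round + 1


def completion_cycles(num_blocks, stages=11):
    """Return (block_index, clock_cycle) pairs for blocks leaving the pipeline.

    One new block is offered on every clock edge, and every pipeline
    register passes its content to the next stage on that edge.
    """
    regs = deque([None] * stages)
    done = []
    cycle = 0
    while len(done) < num_blocks:
        cycle += 1
        incoming = cycle - 1 if cycle - 1 < num_blocks else None
        regs.appendleft(incoming)  # new block enters the first register
        regs.pop()                 # value shifted out of the last register
        if regs[-1] is not None:   # a block has reached the output register
            done.append((regs[-1], cycle))
    return done


if __name__ == "__main__":
    print([latency_cycles(k) for k in (1, 2, 3, 4)])
    print(completion_cycles(3))
```

In this model the first block appears after 11 clock cycles and each following block appears one cycle later, which is the fill-then-stream behaviour that gives an outer round pipeline its throughput of one block per cycle.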
Fig. 2. The micro architecture of CipherRound.

C. Proposed SubMatrix transformation

To speed up the encryption process, we parallelize each SubMatrix transformation. The input of each round is 128 bits (16 bytes) arranged as a 4 × 4 byte matrix, so the SubMatrix transformation uses 16 S-Boxes in every round; each S-Box is a 16 × 16-byte look-up table. The micro architecture of SubMatrix is shown in Fig. 3.

Fig. 3. The micro architecture of Parallel S-Boxes in SubMatrix transformation.

D. Proposed ShiftMatrix transformation

The ShiftMatrix transformation is implemented through simple signal wiring.

E. Proposed MixMatrix transformation

The MixColumns transformation multiplies each column of the input matrix by the matrix M. Multiplications are carried out in the Galois field $GF(2^8)$ with the irreducible polynomial $m(x) = x^8 + x^4 + x^3 + x + 1$:

\[
\begin{bmatrix} S'_{0,c} \\ S'_{1,c} \\ S'_{2,c} \\ S'_{3,c} \end{bmatrix}
=
\begin{bmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{bmatrix}
\begin{bmatrix} S_{0,c} \\ S_{1,c} \\ S_{2,c} \\ S_{3,c} \end{bmatrix}
\tag{1}
\]

Fig. 1. Architecture of outer round pipelining and parallel AES-128.
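Equation (1) can be checked against a small software model. The following Python sketch is an illustration only, not the authors' VHDL; the function names (`xtime`, `mul3`, `mix_column`, `mix_matrix`) are hypothetical. It multiplies by {02} with the standard AES shift and conditional XOR with 0x1B, obtains {03} as {02} XOR {01}, and applies the matrix of (1) to each column independently, mirroring the use of four parallel MixColumn blocks.

```python
def xtime(byte):
    """Multiply a byte by {02} in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    shifted = (byte << 1) & 0xFF
    # If the top bit was set, reduce by XORing with 0x1B (x^4 + x^3 + x + 1).
    return shifted ^ 0x1B if byte & 0x80 else shifted


def mul3(byte):
    """Multiply a byte by {03}, using {03} = {02} xor {01}."""
    return xtime(byte) ^ byte


def mix_column(col):
    """Apply the matrix of equation (1) to one 4-byte column."""
    s0, s1, s2, s3 = col
    return [
        xtime(s0) ^ mul3(s1) ^ s2 ^ s3,
        s0 ^ xtime(s1) ^ mul3(s2) ^ s3,
        s0 ^ s1 ^ xtime(s2) ^ mul3(s3),
        mul3(s0) ^ s1 ^ s2 ^ xtime(s3),
    ]


def mix_matrix(columns):
    """Transform the four columns of the state independently."""
    return [mix_column(c) for c in columns]


if __name__ == "__main__":
    column = [0xDB, 0x13, 0x53, 0x45]
    print([hex(b) for b in mix_column(column)])
```

Because each output byte depends only on the four bytes of its own column, the four column transforms have no data dependency on one another, which is what makes a fully parallel hardware mapping possible.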
Therefore:

\[
\begin{aligned}
S'_{0,c} &= \{02\} \cdot S_{0,c} \oplus \{03\} \cdot S_{1,c} \oplus S_{2,c} \oplus S_{3,c} \\
S'_{1,c} &= S_{0,c} \oplus \{02\} \cdot S_{1,c} \oplus \{03\} \cdot S_{2,c} \oplus S_{3,c} \\
S'_{2,c} &= S_{0,c} \oplus S_{1,c} \oplus \{02\} \cdot S_{2,c} \oplus \{03\} \cdot S_{3,c} \\
S'_{3,c} &= \{03\} \cdot S_{0,c} \oplus S_{1,c} \oplus S_{2,c} \oplus \{02\} \cdot S_{3,c}
\end{aligned}
\]

Firstly, we calculate $\{02\} \cdot S_{i,c}$. Let

\[
S_{i,c} = f(x) = b_7 x^7 + b_6 x^6 + b_5 x^5 + b_4 x^4 + b_3 x^3 + b_2 x^2 + b_1 x + b_0
\]

where the $b_i$ are bits (each takes the value '0' or '1'), and $\{02\} = x$. So $\{02\} \cdot S_{i,c} = x f(x)$.

If $b_7 = 0$, then:

\[
x f(x) = b_6 x^7 + b_5 x^6 + b_4 x^5 + b_3 x^4 + b_2 x^3 + b_1 x^2 + b_0 x \tag{2}
\]

If $b_7 = 1$, then:

\[
\begin{aligned}
x f(x) &= (x^8 + b_6 x^7 + b_5 x^6 + b_4 x^5 + b_3 x^4 + b_2 x^3 + b_1 x^2 + b_0 x) \bmod m(x) \\
&= [(x^8 + x^4 + x^3 + x + 1) + (b_6 x^7 + b_5 x^6 + b_4 x^5 + b_3 x^4 + b_2 x^3 + b_1 x^2 + b_0 x) + (x^4 + x^3 + x + 1)] \bmod m(x) \\
&= [m(x) + (b_6 x^7 + b_5 x^6 + b_4 x^5 + b_3 x^4 + b_2 x^3 + b_1 x^2 + b_0 x) + (x^4 + x^3 + x + 1)] \bmod m(x) \\
&= (b_6 x^7 + b_5 x^6 + b_4 x^5 + b_3 x^4 + b_2 x^3 + b_1 x^2 + b_0 x) + (x^4 + x^3 + x + 1)
\end{aligned}
\tag{3}
\]

From (2) and (3), we have:

\[
b_7 b_6 b_5 b_4 b_3 b_2 b_1 b_0 \times 00000010 =
\begin{cases}
b_6 b_5 b_4 b_3 b_2 b_1 b_0 0 & \text{if } b_7 = 0 \\
b_6 b_5 b_4 b_3 b_2 b_1 b_0 0 \oplus 00011011 & \text{if } b_7 = 1
\end{cases}
\tag{4}
\]

So, we can write:

{02} · Byte = ((b6 b5 b4 b3 b2 b1 b0 & '0') xor x"1B") when b7 = '1' else (b6 b5 b4 b3 b2 b1 b0 & '0');

Secondly, we calculate $\{03\} \cdot S_{i,c}$. Because $\mathrm{0x03} = \mathrm{0x02} \oplus \mathrm{0x01}$, we have:

\[
b_7 b_6 b_5 b_4 b_3 b_2 b_1 b_0 \times 00000011 = (b_7 b_6 b_5 b_4 b_3 b_2 b_1 b_0 \times 00000010) \oplus b_7 b_6 b_5 b_4 b_3 b_2 b_1 b_0 \tag{5}
\]

In other words, {03} × {Byte} = {02} × {Byte} ⊕ {Byte}.

So, we propose the micro architecture of the MixColumn process as in Fig. 4. In this figure, we use XOR gates and a MUX to obtain {02} × {Byte_In} and {03} × {Byte_In}, and then calculate Byte_Out using XOR gates. Because the architecture is parallel, four MixColumn blocks are used together to create the MixMatrix.

F. Proposed AddRoundKey transformation

The AddRoundKey transformation performs
an operation on the State with one of the sub-keys. The operation is a simple XOR function between each byte of the State and the corresponding byte of the sub-key.

G. Proposed Key Expansion transformation

We propose a micro architecture to create the sub-keys for each AES round, as described in Fig. 5. In this architecture, the 128-bit key of the previous round is divided into four 32-bit words and brought to the input of the Key Expansion round, which applies the RotWord and SubWord transformations and an XOR with the RCon register to generate the 128-bit key for the next round.

Fig. 5. The architecture of Key Expansion round.

III. RESULTS AND DISCUSSION

The proposed high-throughput outer round pipelined AES core has been modelled in RTL VHDL and implemented using Synopsys tools with a CMOS 45nm technology from NANGATE. The experimental results are shown in TABLE I. The proposed design achieves a throughput of 111.3Gbps at an operating frequency of 870 MHz. The power consumption is only 56.3mW. The area cost is 0.13mm2 (equivalent to 164.5kGEs). Therefore, the design has a high area efficiency of 856 Gbps/mm2.

According to the experimental results, our design can run at a maximum operating frequency of 870MHz. When the pipeline stages of our design are fully filled with data, 128 output bits are produced every clock cycle. Thus, with $f$ expressed in GHz, the maximum throughput that our design can achieve is:

\[
\text{throughput} = \frac{f \times 128}{1\ \text{cycle}} = \frac{0.87 \times 128}{1\ \text{cycle}} \approx 111.3\ \text{Gbps}
\]

In this design, we used 11 pipeline stages, so the latency is 11 clock cycles (equivalent to 12.6ns):

\[
\text{latency} = 11 \times T = \frac{11}{f} = \frac{11}{0.87} \approx 12.6\ \text{ns}
\]

Fig. 4. The architecture of MixColumn transformation.

TABLE I. AES ENCRYPTION IMPLEMENTATION

| Parameter | Value |
|---|---|
| Technology (nm) | 45 |
| Clk (MHz) | 870 |
| Area (mm2) | 0.13 |
| Latency (ns) | 12.6 |
| Area (kGate) | 164.5 |
| Number of clocks per encryption | 11 |
| Throughput (Gbps) | 111.3 |
| Power Consumption (mW) | 56.3 |
| Area Efficiency (Mbps/kGate) | 676.6 |
| Area Efficiency (Gbps/mm2) | 856.1 |
| Energy Efficiency (Gbps/W) | 1977 |

TABLE I summarizes our design implementation results on NANGATE CMOS 45nm technology. The comparison of our architecture with the related works is shown in TABLE II. Compared to the works in [7] and [8], which were implemented at the same technology node, our design achieves a throughput similar to that of [8] and about twice that of [7], while having a lower latency and occupying less area. Compared to the other works, our design has the lowest latency and the highest efficiency in using hardware resources (area efficiency in terms of Gbps/mm2). It has 2.4x better area efficiency and 2x lower latency than the design in [7], and its power consumption is also 2.2x lower. Although the throughput of [8] is higher, its area efficiency is about 42 times lower (20.3 versus 856.1 Gbps/mm2) and its power consumption is 109 times higher than that of our design. From the area point of view, our AES architecture is 48 times smaller than the design in [8]. Therefore, our proposed design is suitable for real-time applications with a low-cost hardware implementation. As shown in Fig. 6, the energy efficiency and area efficiency of our design are much higher than those of the other related works.

Fig. 6. Comparisons with the related works.

TABLE II. COMPARISON OF THE PROPOSED DESIGN AND DIFFERENT AES ARCHITECTURES

| Design | CLK (MHz) | Cycles per encryption | Tech (nm) | Area (mm2) | Area (kGate) | Power (mW) | Throughput (Gbps) | Latency (ns) | Energy Efficiency (Gbps/W) | Area Efficiency (Gbps/mm2) |
|---|---|---|---|---|---|---|---|---|---|---|
| Mathew et al. [7] | 2100 | 55 | 45 | 0.15 | - | 125 | 53 | 26.2 | 424 | 353 |
| Sayilar and D. Chiou [8] | 1000 | 20 | 45 | 6.32 | - | 6179 | 128 | 20 | 20.7 | 20.3 |
| Ali et al. [9] | 1015 | 21 | 180 | - | - | - | 130 | 20.7 | - | - |
| Liu et al. [11] | 255 | - | 90 | 0.104 | - | 7.1 | 2.97 | - | 418 | 28.6 |
| Rahimunnisa et al. [12] | 200 | 55 | 130 | 0.1 | - | 40 | 25.6 | 275 | 640 | 256 |
| Erbagci et al. [10] | 2200 | 44 | 65 | 0.75 | - | 523 | 275.2 | 20 | 526 | 367 |
| Hodjat et al. [6] ver.1 | 234 | 11 | 180 | - | 180 | - | 30 | 47 | - | - |
| Hodjat et al. [6] ver.2 | 547 | 41 | 180 | - | 275 | - | 70 | 74.9 | - | - |
| Our Work | 870 | 11 | 45 | 0.13 | 164.5 | 56.3 | 111.3 | 12.6 | 1977 | 856.1 |

IV. CONCLUSION

Broadband communications (5G networks, IoT gateways, optical transmission systems, surveillance cameras, video conferencing) require increasing quality of service and data security. Therefore, designing security cores with high throughput and low latency is always a challenge, especially under hardware cost and power consumption constraints. In this paper, we proposed an AES core architecture for high-throughput and real-time applications. The outer round pipeline and fully parallel architecture allow us to increase the operating frequency and reduce the latency. Our design operates at 870MHz on NANGATE CMOS 45nm technology and achieves a high throughput of 111.3Gbps and a low latency of 12.6 ns, while having high area efficiency (856 Gbps/mm2) and high energy efficiency (1977Gbps/W).

V. ACKNOWLEDGMENT

This research is partly funded by the Ministry of Science and Technology (MOST) of Vietnam under grant number KC.01.21/16-20.

VI. REFERENCES

[1] “Cryptography and Network Security: Principles and Practice, 7th Edition.” [Online]. Available: https://www.pearson.com/us/highereducation/program/Stallings-Cryptography-andNetwork-Security-Principles-and-Practice-7thEdition/PGM334401.html. [Accessed: 21-Sep-2018].
[2] A. Soltani and S. Sharifian, “An ultra-high throughput and fully pipelined implementation of AES algorithm on FPGA,” Microprocessors and Microsystems, vol. 39, no. 7, pp. 480–493, Oct. 2015.
[3] D.-H. Bui, D. Puschini, S. Bacles-Min, E. Beigne, and X.-T. Tran, “AES Datapath Optimization Strategies for Low-Power Low-Energy Multisecurity-Level Internet-of-Things Applications,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 12, pp. 3281–3290, Dec. 2017.
[4] V.- Hoang, V.- Dao, and C.- Pham, “Design of ultralow power AES encryption cores
with silicon demonstration in SOTB CMOS process,” Electronics Letters, vol. 53, no. 23, pp. 1512–1514, 2017.
[5] W. Zhao, Y. Ha, and M. Alioto, “AES architectures for minimum-energy operation and silicon demonstration in 65nm with lowest energy per encryption,” in 2015 IEEE International Symposium on Circuits and Systems (ISCAS), 2015, pp. 2349–2352.
[6] A. Hodjat and I. Verbauwhede, “Area-throughput trade-offs for fully pipelined 30 to 70 Gbits/s AES processors,” IEEE Transactions on Computers, vol. 55, no. 4, pp. 366–372, Apr. 2006.
[7] S. K. Mathew et al., “53 Gbps Native GF(2^4)^2 Composite-Field AES-Encrypt/Decrypt Accelerator for Content-Protection in 45 nm High-Performance Microprocessors,” IEEE Journal of Solid-State Circuits, vol. 46, no. 4, pp. 767–776, Apr. 2011.
[8] G. Sayilar and D. Chiou, “Cryptoraptor: High throughput reconfigurable cryptographic processor,” in 2014 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2014, pp. 155–161.
[9] L. Ali, I. Aris, F. S. Hossain, and N. Roy, “Design of an ultra high speed AES processor for next generation IT security,” Comput. Electr. Eng., vol. 37, no. 6, pp. 1160–1170, Nov. 2011.
[10] B. Erbagci, N. E. C. Akkaya, C. Teegarden, and K. Mai, “A 275 Gbps AES encryption accelerator using ROM-based S-boxes in 65nm,” in 2015 IEEE Custom Integrated Circuits Conference (CICC), 2015, pp. 1–4.
[11] P. Liu, J. Hsiao, H. Chang, and C. Lee, “A 2.97 Gb/s DPA-resistant AES engine with self-generated random sequence,” in 2011 Proceedings of the ESSCIRC (ESSCIRC), 2011, pp. 71–74.
[12] K. Rahimunnisa, P. Karthigaikumar, N. Christy, S. Kumar, and J. Jayakumar, “PSP: Parallel sub-pipelined architecture for high throughput AES on FPGA and ASIC,” Open Comput. Sci., vol. 3, no. 4, pp. 173–186, 2013.
