HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

MASTER'S THESIS
Research on FPGA for keyword spotting using deep learning techniques

DUONG VAN HAI
Major: Control Engineering and Automation
Supervisor: Nguyen Quoc Cuong
School: Electrical Engineering

HANOI, 2021

SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness

CONFIRMATION OF MASTER'S THESIS REVISIONS

Full name of the thesis author: Duong Van Hai
Thesis topic: Research on FPGA for keyword spotting using deep learning techniques
Major: Control Engineering and Automation
Student ID: CBC19009

The author, the scientific supervisor, and the thesis examination committee confirm that the author has revised and supplemented the thesis according to the minutes of the committee meeting held on the 22nd of ... 2021, with the following contents:

• Added titles to the algorithms and added a List of Algorithms
• Revised the caption of Figure 4.1 and added Figure 5.2
• Revised Equation 5.1 and added Equations 5.2 and 5.3
• Revised the content of the FPGA implementation of the GRU block
• Moved the DCN and GRU programming parts from the chapter to the Appendix

May 17, 2021
Supervisor                    Thesis author                    Committee chair

THESIS TOPIC
Research on FPGA for keyword spotting using deep learning techniques

Supervisor
(Signature and full name)

Lời cảm ơn (Acknowledgements)

This thesis could not have been completed without the professional advice and moral support that I received from my supervisor, Nguyen Quoc Cuong. Although he supervises many students, he still devoted a large part of his valuable time to helping me improve my research. The thesis also distills countless hours spent with my friends at the C1-311 laboratory, who always encouraged me to persist on a research path full of hardships, now and in the future. Thanks to my family and my girlfriend for all of their love and support. Special thanks to Thanh, who has accompanied me throughout my time at Hanoi University of Science and Technology.

Tóm tắt nội dung luận văn (Thesis abstract)

In recent years, with the development of speech recognition, keyword spotting has become a common way to begin a voice interaction (for example, "OK Google", "Alexa", or "Hey Siri"). There are several methods to address this problem. In the fixed-keyword approach, many utterances of a specific keyword are collected and a neural network is then trained to classify keyword versus non-keyword. Alternatively, "query-by-example" methods take a few examples of the keyword as templates and compare audio segments against those templates to decide whether the keyword has occurred. In this thesis, the fixed-keyword approach is used; moreover, a deep neural network model is proposed that achieves high accuracy with a small model size. In addition, deep learning models are always costly in energy and resources. To improve the efficiency of neural network models, efficient network architectures, model compression, and model quantization methods have been proposed. Even so, hardware platforms such as GPUs are not flexible enough to support all kinds of optimization. Hardware-software co-design is therefore needed to accelerate the computation as much as possible, and the flexibility of the hardware platform becomes a key characteristic. FPGAs thus become a suitable choice, offering good energy efficiency and the flexibility to configure the hardware. Therefore, to improve the efficiency and speed of the proposed recognition model, this thesis presents a neural network design architecture on an FPGA platform with unchanged accuracy.

Student
(Signature and full name)

Acknowledgement

The thesis would not have been completed without the academic advisement and emotional support that I received from my advisor, Professor Nguyen Quoc Cuong. Though having many students under his supervision, he still spends a huge chunk of his precious time helping me do better research. He also creates a wonderful environment for students to compete and improve their research prospects. I would like to express my gratitude to Professor Cuong and wish him success in research and leadership as well as happiness in life.
The thesis is a summary of countless hours with the people at the C1-311 lab, who always encourage me to be persistent on a present and future research pathway full of struggles. Thanks to a special friend who supported me at the worst times, when I felt alienated. Thanks to the other juniors with whom I worked in depth and enjoyed coffee breaks. Thank you to my family and my girlfriend for all of their love and support. Special thanks to Thanh for accompanying me throughout the whole time here at HUST.

Hanoi, April 15, 2021
Duong Van Hai

Abstract

In recent years, with the development of voice assistants, keyword spotting has become a common way to begin an interaction with the voice interface (e.g. "OK Google", "Alexa", or "Hey Siri"). Despite all advances in understanding spoken commands, it is difficult for machines to decide whether an utterance is actually intended as a command or whether it is just conversational speech. Various approaches were developed to solve this problem, and an approach that retains these advantages is to use a pre-defined word or short phrase to wake up the machine and signal that the following speech will be a command; this is called Keyword Spotting (KWS).

There are several existing methods to tackle the KWS problem. For example, numerous utterances of a specific keyword are collected and neural networks are then trained to classify keyword versus non-keyword; such methods have been promising in the field. Besides, Query-by-Example (QbE) methods usually take several examples of the keywords as templates and compare the test audio segment against the templates to make detection decisions. In this thesis, the focus is on the method for a single keyword. Moreover, a new model architecture for KWS is proposed, achieving high accuracy and a small footprint.

Furthermore, the process of developing a new deep learning model is always energy-intensive. To improve the efficiency of KWS, efficient neural network architectures, model compression, or model quantization methods were proposed. Despite the efficient algorithms and methods, hardware platforms, such as GPUs, might not be flexible enough to support all sorts of optimizations. Hence, a hardware-software co-design is required to accelerate the computation at the edge, and the flexibility of hardware platforms becomes a key feature. The Field Programmable Gate Array (FPGA) thus becomes a good candidate, providing good energy efficiency and flexibility to configure the hardware. Thus, the proposed system was built on an FPGA to improve the speed of the model without sacrificing accuracy.

The main contributions of this work are: 1) a new model architecture for the KWS system and 2) an FPGA implementation of the proposed method. The thesis is organized as follows:

• Chapter 1 introduces the overview of the KWS problem and its challenges
• Chapter 2 provides a brief review of some essential concepts of the KWS system and FPGA design
• Chapter 3 presents the most recent methods relating to the KWS problem
• Chapter 4 proposes the new network architecture for KWS and presents experimental results
• Chapter 5 derives the FPGA implementation and provides an analysis of the proposed approach's performance

Glossary

DCN    Deformable Convolutional Network
GRU    Gated Recurrent Unit
CNN    Convolutional Neural Networks
RNN    Recurrent Neural Networks
KWS    Keyword Spotting
CRNN   Convolutional Recurrent Neural Networks
QbE    Query-by-Example
LSTM   Long Short-Term Memory
FPGA   Field Programmable Gate Array
DNN    Deep Neural Network
QoR    Quality of Results
IoT    Internet of Things
LVCSR  Large-Vocabulary Continuous Speech Recognition
HMM    Hidden Markov Model
DTW    Dynamic Time Warping
MFCC   Mel Frequency Cepstral Coefficients
RTF    Real Time Factor
HLS    High-Level Synthesis

List of Tables

4.1  Dataset statistics
4.2  Performances of the proposed model and the CRNN-A model: number of parameters and FRR (%) in clean and noisy conditions (SNR = 5 dB) at 0.5 FA per hour
4.3  Performances of the proposed model and the WaveNet model: number of parameters and FRR (%) in clean and noisy conditions (SNR = 5 dB) at 0.5 FA per hour
5.1  Detailed performance of Design 1 (100 MHz)
5.2  Detailed performance of Design 2 (100 MHz)
5.3  Detailed performance of Design 3 (200 MHz)
5.4  Processing time for 200 frames of 40-dimensional MFCC

List of Figures

2.1  KWS approaches
2.2  FPGA HLS design flow
3.1  Framework of the Deep KWS system [23]
3.2  Attention-based end-to-end model for KWS [20]
3.3  WaveNet-based KWS architecture [29]
4.1  End-to-end proposed KWS system
4.2  The deformable convolutional network
4.3  Illustration of bilinear interpolation
4.4  DET curves for different parameters of the proposed model
4.5  The concentration of learned offset values of the DCN layer
4.6  DET curves for the proposed architecture (blue) compared to the CRNN-A (dashed orange) baselines in clean (a) and noisy (b) conditions
5.1  Block diagram of the proposed KWS system
5.2  Time consumption (%) on CPU and GPU of each component in the proposed system
5.3  Vivado HLS report of a naive implementation
5.4  Operation of the line buffer and window for 2D filtering
5.5  Vivado HLS report of a line buffer and window implementation
5.6  FPGA design performance compared with CPU and GPU
5.7  A simple block design of the system on Vivado

List of Algorithms

4.1  Deformable Convolution at channel c, timestep t, frequency f
4.2  Summary of the operations of the proposed model for each input of T frames
5.1  Pseudocode of a convolution layer

Figure 5.6: FPGA design performance compared with CPU and GPU (processing time in ms and power in W for the FPGA ZCU104, the CPU i9-9900K, and the GPU RTX 3090).

Platform            FPGA      CPU       GPU
Processing Time     3.6 ms    12.5 ms   2.8 ms
Power Consumption   14 W      95 W      135 W

Table 5.4: Processing time for 200 frames of 40-dimensional MFCC

... the critical path has been split. Compared with the same configuration run on a CPU (Intel Core i9-9900K) and a GPU (NVIDIA RTX 3090), the design yields the results shown in Table 5.4, and Figure 5.6 visualizes the differences between the mentioned hardware. In particular, the FPGA processing time is neither the fastest nor the slowest. It ran approximately 3.5 times faster than the CPU. On the other hand, it ran marginally slower than the GPU but consumes significantly less power. The power consumption of the proposed system is at least 6 times smaller than the CPU and more than 9 times smaller than the GPU. With this power consumption, it is very cost-effective to deploy the system for keyword spotting tasks on the client side.
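To make the comparison concrete, the numbers in Table 5.4 can be combined into energy per inference (power multiplied by the processing time of one 200-frame input); the figures below follow directly from the table:

    E_FPGA = 14 W  × 3.6 ms  ≈ 50 mJ
    E_CPU  = 95 W  × 12.5 ms ≈ 1188 mJ
    E_GPU  = 135 W × 2.8 ms  = 378 mJ

Per inference, the FPGA therefore uses roughly 24 times less energy than the CPU and about 7.5 times less than the GPU, even though its raw latency sits between the two.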
To verify the validity of the design, a block design using the Zynq UltraScale+ MPSoC Processing System with DMA is built in Vivado, as shown in Figure 5.7. The block design was synthesized and implemented, and a bitstream was then generated and written to the FPGA board. A simple application on Vivado SDK is programmed to communicate with the proposed IP core. The details are described in Listing 5.2.

    int main() {
        u32 inp[IN_SIZE], out[OUT_SIZE];
        init_platform();
        // Initialize the DMA engine
        init_dma(&axiDma);
        // Prepare the input data
        prepareInput(inp);

        Xil_DCacheFlushRange((u32)inp, IN_SIZE * sizeof(u32));
        Xil_DCacheFlushRange((u32)out, OUT_SIZE * sizeof(u32));

        // Transfer data to the IP core
        XAxiDma_SimpleTransfer(&axiDma, (u32)inp, IN_SIZE * sizeof(u32),
                               XAXIDMA_DMA_TO_DEVICE);
        // Receive data from the IP core
        XAxiDma_SimpleTransfer(&axiDma, (u32)out, OUT_SIZE * sizeof(u32),
                               XAXIDMA_DEVICE_TO_DMA);
        // Wait until the IP core has processed the data
        while (XAxiDma_Busy(&axiDma, XAXIDMA_DEVICE_TO_DMA));

        // Process the output from the IP core
        processingOutput(out);
        cleanup_platform();
        return 0;
    }

Listing 5.2: Application to communicate with the IP core
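Listing 5.2 calls an init_dma helper whose definition is not shown here. A minimal sketch of such a helper, assuming the standard Xilinx XAxiDma driver in simple (non scatter-gather) polled mode, might look as follows; the device-ID macro name is an assumption and depends on the generated hardware platform:

    #include "xaxidma.h"
    #include "xparameters.h"

    // Assumed device ID macro; the exact name comes from the
    // xparameters.h generated for the actual block design.
    #define DMA_DEV_ID XPAR_AXIDMA_0_DEVICE_ID

    int init_dma(XAxiDma *dma) {
        // Look up the configuration of the AXI DMA instance
        XAxiDma_Config *cfg = XAxiDma_LookupConfig(DMA_DEV_ID);
        if (cfg == NULL)
            return XST_FAILURE;

        // Initialize the driver with that configuration
        if (XAxiDma_CfgInitialize(dma, cfg) != XST_SUCCESS)
            return XST_FAILURE;

        // This sketch assumes simple transfers, not scatter-gather mode
        if (XAxiDma_HasSg(dma))
            return XST_FAILURE;

        // Polled mode: disable interrupts on both channels
        XAxiDma_IntrDisable(dma, XAXIDMA_IRQ_ALL_MASK, XAXIDMA_DEVICE_TO_DMA);
        XAxiDma_IntrDisable(dma, XAXIDMA_IRQ_ALL_MASK, XAXIDMA_DMA_TO_DEVICE);

        return XST_SUCCESS;
    }

With such a helper, the two XAxiDma_SimpleTransfer calls in Listing 5.2 operate on the initialized axiDma instance in simple polled mode.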
5.3 Summary

In this chapter, the FPGA implementation of each module in the proposed system has been presented. The proposed implementation on the FPGA runs significantly faster than the CPU and consumes much less power than both the CPU and the GPU.

Figure 5.7: A simple block design of the system on Vivado

Conclusion

In this thesis, a new deep neural network model has been proposed to improve the accuracy of the KWS system while the computational complexity remains unchanged. In particular, it is better than recently proposed models in terms of FAR/FRR by at least 95% on clean data and 66% in noisy environments. Besides, the implementation of the KWS system on an FPGA has been presented and analyzed. From the performed experiments, the FPGA can achieve nearly the same latency as the GPU implementation. In general, the GPU provides better execution time at the cost of additional power consumption. The power consumption of the computed operations on the FPGA is about 6 times better than the CPU and 9 times better than the GPU. FPGA solutions are flexible and can be relatively small (hardware), and thus could be implemented as an end device. Some improvements have been made to obtain better results and a better design flow compared to the GPU solution. Future work on this thesis includes improving the deep neural network module implementations and investigating RTL scripts that could further decrease the processing time on the FPGA.

Publication

[A1] Duong Van Hai, Nguyen Thi Diem Cai, Tran Nguyen Duc Tho, Pham Hoang, Nguyen Huu Binh, Tran Thi Anh Xuan, and Nguyen Quoc Cuong. "Hệ thống nhận dạng từ khoá tiếng nói trên nền tảng hệ thống nhúng". In: Hội nghị Khoa học Kỹ thuật Đo lường Toàn quốc lần thứ VII. 2020.

[S1] Huu Binh Nguyen, Van Hai Duong, Anh Xuan Tran Thi, and Quoc Cuong Nguyen. "Efficient Keyword Spotting System using Deformable Convolutional Network". In: IETE Journal of Research (2020). Submitted on December 9, 2020. (ISI/Scopus)

[S2] Binh Nguyen Huu, Hai Duong Van, Thanh Le Tien, Xuan Tran Thi Anh, and Cuong Nguyen Quoc. "EQ-MVDR: Scalable multi-channel speech enhancement using QRNN and ensemble learning". In: Proc. Interspeech 2021. 2021. Submitted on March 27, 2021.

Bibliography

[1] G. Chen, C. Parada, and G. Heigold. "Small-footprint keyword spotting using deep neural networks". In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2014, pp. 4087–4091.

[2] Carolina Parada, Abhinav Sethy, and Bhuvana Ramabhadran. "Query-by-example Spoken Term Detection For OOV terms". In: 2009 IEEE Workshop on Automatic Speech Recognition & Understanding, ASRU 2009, Merano/Meran, Italy, December 13-17, 2009. IEEE, 2009, pp. 404–409.

[3] Guoguo Chen, Carolina Parada, and Tara N. Sainath. "Query-by-example keyword spotting using long short-term memory networks". In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015. IEEE, 2015, pp. 5236–5240.

[4] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, et al. "Deformable Convolutional Networks". In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). Oct. 2017.

[5] Emma Strubell, Ananya Ganesh, and Andrew McCallum. "Energy and Policy Considerations for Deep Learning in NLP". In: CoRR abs/1906.02243 (2019). arXiv: 1906.02243.

[6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, et al. "Attention Is All You Need". In: CoRR abs/1706.03762 (2017). arXiv: 1706.03762.

[7] Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size". ...

    ...(hls::stream<DTYPE>& in,
        DTYPE w_offset[N_FILTERS][KERNEL_0][KERNEL_1],
        DTYPE b_offset[N_FILTERS],
        hls::stream<DTYPE>& offsets) {
    #pragma HLS ARRAY_PARTITION variable=w_offset cyclic factor=8 dim=1
    #pragma HLS ARRAY_PARTITION variable=b_offset cyclic factor=8 dim=1
        DTYPE val1, val2;
        DTYPE sum, placeholder;
        Buffer<BUFFER_SIZE> buffer;
    #pragma HLS ARRAY_PARTITION variable=buffer complete dim=0

        // Initialize the buffer
        init_buffer: for (int i = 0; i < BUFFER_SIZE; i++) {
            if (in.empty() == 0) {
                placeholder = in.read();
                buffer.insert_back(placeholder);
            }
        }

        // Loop over the output size
        loop_out0: for (int i = 0; i < OUT_0; i += STRIDE_0) {
            loop_out1: for (int j = 0; j < OUT_1; j += STRIDE_1) {
    #pragma HLS PIPELINE II=32
                // Loop over the number of filters
                loop_f: for (int filter = 0; filter < N_FILTERS; filter++) {
    #pragma HLS UNROLL
                    sum = 0;
                    // Loop over each filter size
                    loop_k0: for (int row = 0; row < KERNEL_0; row++) {
    #pragma HLS UNROLL
                        loop_k1: for (int col = 0; col < KERNEL_1; col++) {
    #pragma HLS UNROLL
                            val1 = buffer.getval(row * IN_1 + col);
                            val2 = w_offset[filter][row][col];
                            sum += val1 * val2;
                        }
                    }
                    // Complete one window and output
                    offsets << sum + b_offset[filter];
                }
                // Refill the line buffer with the next inputs
                if (!(j + STRIDE_1 >= OUT_1)) {
                    fill_buffer: for (int p = 0; p < KERNEL_1; p++) {
                        if (in.empty() == 0) {
                            placeholder = in.read();
                            buffer.insert_back(placeholder);
                        }
                    }
                }
            }
        }
    }

Listing A.1: C++/HLS implementation of Convolution using Line Buffer

A.2 Bilinear Interpolation

The detailed implementation of the interpolation module can be seen in Listing A.2. This module runs each time an output of the preceding convolution is written to the stream (from Line 2 to Line 3). The deformed feature is computed and saved to the "deformed" variable; the deformed feature is also the output of the whole DCN block. First, the program checks whether the offsets are out of the boundary or not. If they are not within the boundary, the deformed feature is set to zero. Otherwise, if the offsets are valid, the program computes the deformed feature normally. To compute the deformed feature, the program first gets the indices of the surrounding pixels (from Line 11 to Line 14). Then, it gets the weights of those pixels (from Line 17 to Line 20). Finally, the weights of the surrounding pixels are multiplied with the input in an element-wise manner to get the deformed values (from Line 23 to Line 37).
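In equation form, if (y, x) is the fractional sampling position obtained by adding the learned offsets to the current time-frequency index, y_l = floor(y), y_h = y_l + 1, x_l = floor(x), x_h = x_l + 1 are the surrounding indices, and the weights are the usual bilinear ones, dy_l = y_h − y, dy_h = y − y_l, dx_l = x_h − x, dx_h = x − x_l (the naming is assumed to match the listing), then the deformed value is the weighted sum of the four neighbours:

    deformed = dy_l · dx_l · in[y_l, x_l] + dy_l · dx_h · in[y_l, x_h]
             + dy_h · dx_l · in[y_h, x_l] + dy_h · dx_h · in[y_h, x_h]

This is exactly the sum p1 + p2 + p3 + p4 formed at the end of the listing, with each term forced to zero when its neighbour falls outside the input boundary.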
    // output of the previous convolution module
    offsets >> offset_y;
    offsets >> offset_x;

    if ((offset_y ...
        // ... boundary check on the offsets, computation of the surrounding
        // indices (y_l, y_h, x_l, x_h) and of the bilinear weights
        // (dy_l, dy_h, dx_l, dx_h) ...
            p1 = dy_l * dx_l * in[y_l * IN_1 + x_l];
            p2 = dy_l * dx_h * in[y_l * IN_1 + x_h];
        ...
        if (... >= 0) {
            p3 = dy_h * dx_l * in[y_h * IN_1 + x_l];
        }
        if (y_h < IN_0 && x_h < IN_1) {
            p4 = dy_h * dx_h * in[y_h * IN_1 + x_h];
        }
        // deformed output
        deformed = p1 + p2 + p3 + p4;
    }

Listing A.2: C++/HLS implementation of Bilinear Interpolation

Appendix B
FPGA Implementation of GRU

B.1 Approach 1

In the first approach, the detailed implementation of the GRU using HLS (called "Design 1") is shown in Listing B.1. The weight matrices are already concatenated when exported from the Python step. The matrix multiplications are then computed with x_t (from Line 12 to Line 17) and h_(t-1) (from Line 20 to Line 27). After those operations are done, each part of the equations in Equation 5.1 is extracted (from Line 29 to Line 37). Finally, the remaining operations in Equation 5.1 are executed (from Line 38 to Line 45).

    void gru(hls::stream<DTYPE>& in,
             DTYPE w_ih[GRU_W_IH_SIZE], DTYPE b_ih[GRU_B_IH_SIZE],
             DTYPE w_hh[GRU_W_HH_SIZE], DTYPE b_hh[GRU_B_HH_SIZE],
             hls::stream<DTYPE>& out) {

        loop_seq_len: for (int t = 0; t < CONV_OUT_0; t++) {
            loop_in: for (int j = 0; j < GRU_IN_SIZE; j++) {
    #pragma HLS PIPELINE II=16
                // read input
                in >> data;
                // matrix multiplication with the input x_t
                loop_mul1: for (int i = 0; i < GRU_G_SIZE; i++) {
                    if (j == 0) { mul1[i] = 0; }
                    mul1[i] += w_ih[j * GRU_G_SIZE + i] * data;
                }
            }

            loop_hidden: for (int j = 0; j < GRU_HIDDEN; j++) {
    #pragma HLS PIPELINE II=16
                // matrix multiplication with the hidden state h_(t-1)
                loop_mul2: for (int i = 0; i < GRU_G_SIZE; i++) {
                    if (j == 0) { mul2[i] = 0; }
                    mul2[i] += w_hh[j * GRU_G_SIZE + i] * hidden[j];
                }
            }

            loop_chunk: for (int i = 0; i < GRU_HIDDEN; i++) {
    #pragma HLS UNROLL
                i_r[i] = mul1[i] + b_ih[i];
                h_r[i] = mul2[i] + b_hh[i];
                i_i[i] = mul1[GRU_HIDDEN + i] + b_ih[GRU_HIDDEN + i];
                h_i[i] = mul2[GRU_HIDDEN + i] + b_hh[GRU_HIDDEN + i];
                i_n[i] = mul1[2 * GRU_HIDDEN + i] + b_ih[2 * GRU_HIDDEN + i];
                h_n[i] = mul2[2 * GRU_HIDDEN + i] + b_hh[2 * GRU_HIDDEN + i];
            }

            loop_gru_layer: for (int i = 0; i < GRU_HIDDEN; i++) {
    #pragma HLS PIPELINE II=16
                resetgate[i] = sigmoid(i_r[i] + h_r[i]);
                inputgate[i] = sigmoid(i_i[i] + h_i[i]);
                newgate[i]   = (resetgate[i] * h_n[i] + i_n[i]);
                hidden[i]    = newgate[i] + (hidden[i] - newgate[i]) * inputgate[i];
                out << hidden[i];
            }
        }
    }

Listing B.1: Design 1: Implementation of GRU

...

            in >> data0;
            in >> data1;
            in >> data2;
            in >> data3;

            loop_mul1: for (int i = 0; i < GRU_G_SIZE; i += 2) {
                if (j == 0) { mul1[i + 0] = 0; mul1[i + 1] = 0; }
                mul1[i + 0] += gru_core(w_ih[(j + 0) * GRU_G_SIZE + (i + 0)],
                                        w_ih[(j + 1) * GRU_G_SIZE + (i + 0)],
                                        w_ih[(j + 2) * GRU_G_SIZE + (i + 0)],
                                        w_ih[(j + 3) * GRU_G_SIZE + (i + 0)],
                                        data0, data1, data2, data3);
                mul1[i + 1] += gru_core(w_ih[(j + 0) * GRU_G_SIZE + (i + 1)],
                                        w_ih[(j + 1) * GRU_G_SIZE + (i + 1)],
                                        w_ih[(j + 2) * GRU_G_SIZE + (i + 1)],
                                        w_ih[(j + 3) * GRU_G_SIZE + (i + 1)],
                                        data0, data1, data2, data3);
            }
        }

Listing B.2: Design 2: Implementation of GRU using loop tiling and loop unrolling
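For reference, the gate computations in Listing B.1 (lines 38 to 45), which Design 3 below splits into separate loops, follow the standard GRU recurrence. Using the same grouping as the code, where i_r, i_i, i_n come from the input product W_ih·x_t + b_ih and h_r, h_i, h_n from the hidden product W_hh·h_(t-1) + b_hh, the update can be written as follows (this restatement assumes the common GRU convention; σ is the logistic sigmoid, ⊙ is the element-wise product, and the candidate activation may be replaced by a hardware-friendly approximation in the implementation):

    r_t = σ(i_r + h_r)                    (reset gate)
    z_t = σ(i_i + h_i)                    (update gate, "inputgate" in the code)
    n_t = tanh(i_n + r_t ⊙ h_n)           (candidate state, "newgate")
    h_t = n_t + z_t ⊙ (h_(t-1) − n_t)     (new hidden state)

The last line is exactly the form hidden[i] = newgate[i] + (hidden[i] - newgate[i]) * inputgate[i] used in the listings.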
B.2 Approach 2

In the second approach, instead of reducing the execution cycles of each block, the latency can be reduced by increasing the frequency of the system clock. In particular, from Line 38 to Line 45 of Listing B.1, each line must execute multiple operations together. This can create a long critical path in the module, which decreases the operating frequency of the whole system. To resolve this problem, these operations are split into smaller parts and then executed. The mentioned operations are split as shown in Listing B.3; this updated version is called "Design 3".

    // resetgate
    for (int i = 0; i < GRU_HIDDEN_SIZE; i++) {
    #pragma HLS UNROLL
        tmp_r[i] = i_r[i] + h_r[i];
    }
    for (int i = 0; i < GRU_HIDDEN_SIZE; i++) {
    #pragma HLS PIPELINE
        resetgate[i] = sigmoid_hls(tmp_r[i]);
    }
    // inputgate
    for (int i = 0; i < GRU_HIDDEN_SIZE; i++) {
    #pragma HLS UNROLL
        tmp_i[i] = i_i[i] + h_i[i];
    }
    for (int i = 0; i < GRU_HIDDEN_SIZE; i++) {
    #pragma HLS PIPELINE
        inputgate[i] = sigmoid_hls(tmp_i[i]);
    }
    // newgate
    for (int i = 0; i < GRU_HIDDEN_SIZE; i++) {
    #pragma HLS PIPELINE
        tmp_n[i] = resetgate[i] * h_n[i] + i_n[i];
    }
    for (int i = 0; i < GRU_HIDDEN_SIZE; i++) {
    #pragma HLS PIPELINE
        newgate[i] = hls::tanh(tmp_n[i]);
    }
    // hidden
    for (int i = 0; i < GRU_HIDDEN_SIZE; i++) {
    #pragma HLS PIPELINE
        hidden[i] = newgate[i] + (hidden[i] - newgate[i]) * inputgate[i];
    }
    // output
    for (int i = 0; i < GRU_HIDDEN_SIZE; i++) {
    #pragma HLS UNROLL
        out