Graduation thesis in Computer Science: Implementing RISC-V on FPGA with an integrated security block

CISC architecture: CISC (Complex Instruction Set Computer) is a processor architecture designed to simplify programming and compilation processes. RISC architecture: RISC Reduced Instruction

Visual simulation of write and read channel

After the reset is finished, the Instruction Memory block in the Vivado Block Design does not contain any data. To populate it, a request needs to be sent via the AXI4 bus to read the value from the Slave. Once the value is retrieved, the CPU

Figure 3.5: Simulation of the processor's AXI4 read channels.

Figure 3.5 shows the read channel simulation. The CPU asserts Arvalid, drives Araddr, and waits for the Slave to assert Arready; the cycle in which both Arvalid and Arready are asserted is called a "handshake", and it transfers the address to the Slave. The CPU then asserts Rready and waits for the Slave to assert Rvalid; this second handshake transfers the read data from the Slave to the CPU. The Slave asserts Rlast on the final data beat and returns Rresp = 0 to announce that the data was delivered successfully.
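The handshake rule described above can be sketched in a few lines of Python. This is only an illustrative model: the per-cycle sample lists are hypothetical values, not read from Figure 3.5, and `first_handshake` is a helper name introduced here.

```python
def first_handshake(valid, ready):
    """Return the first cycle index where valid and ready are both high, else None."""
    for cycle, (v, r) in enumerate(zip(valid, ready)):
        if v and r:
            return cycle
    return None


# Hypothetical per-cycle samples (1 = asserted); not taken from the waveform.
arvalid = [0, 1, 1, 1, 0, 0, 0]
arready = [0, 0, 0, 1, 0, 0, 0]
rvalid = [0, 0, 0, 0, 0, 1, 1]
rready = [0, 0, 0, 0, 1, 1, 1]

ar_cycle = first_handshake(arvalid, arready)  # address accepted by the Slave
r_cycle = first_handshake(rvalid, rready)     # first data beat accepted by the CPU
print(ar_cycle, r_cycle)
```

In this model a transfer happens only in a cycle where both signals of a channel are high, which is why the address handshake must come before the first data handshake.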

Figure 3.6: Simulation of the processor's AXI4 write channels.

Figure 3.6 shows the write channel simulation, which covers the Write Address, Write Data, and Write Response channels.

After the RISC-V processor (Master) sends a write address with its accompanying data, or a read address, the Slave returns a response signal indicating whether the write or read operation succeeded. AXI4 defines four response types for this signal:

* 2'b00: OKAY, indicating a successful write operation.

* 2'b11: DECERR, indicating an unsuccessful write operation.
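For reference, the four response encodings can be written as a small Python enumeration. The two intermediate codes (EXOKAY and SLVERR) are taken from the AXI4 specification rather than from the list above, and `AxiResp` and `is_error` are names introduced only for this sketch.

```python
from enum import IntEnum


class AxiResp(IntEnum):
    """Two-bit AXI4 response encodings (RRESP / BRESP)."""
    OKAY = 0b00    # normal access succeeded
    EXOKAY = 0b01  # exclusive access succeeded
    SLVERR = 0b10  # slave reached, but it reported an error
    DECERR = 0b11  # decode error: no slave at the requested address


def is_error(resp):
    """True when the response code reports a failed transaction."""
    return AxiResp(resp) in (AxiResp.SLVERR, AxiResp.DECERR)


print(AxiResp(0b11).name, is_error(0b00))
```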

Implementation of the advanced encryption standard (AES)

An Overview of the Hardware Structure of AES-128

Figure 3.7: Overview of the AES-128 hardware structure (Key Expansion and Encrypt blocks; signals include round_in, start_in, key_in, resetn, plaintext, and ciphertext).

Figure 3.7 presents an overview of the AES-128 algorithm's hardware structure. It consists of the following components:

* Key Expansion: Performs the key expansion process.
* Encrypt: Executes the encryption operation.
* Decrypt: Used for decrypting the encrypted data.

Input:

* en_or_de: Control signal for selecting encryption or decryption.
* start_in: Signal to enable or disable the system operation.
* key_in: Key data with a size of 128 bits in hexadecimal format.
* reset_in: Signal used to reset the system.
* plaintext: Data to be encrypted with a size of 128 bits in decimal format.

Output:

* ciphertext or plaintext: Signal that outputs the value of the encrypted or

Packaging of AES-128 with AXI4 protocol

Figure 3.8 depicts AES-128 after it has been combined with the AXI4 protocol. The design includes an AES core, used in either encryption or decryption mode, and a buffer that temporarily holds data before sending it to the AES core.

In the AXI4 protocol, data is transferred in 32-bit chunks. To optimize data processing, a buffer block can be included to temporarily store the data. When the buffer is full, a 128-bit output is generated and fed into the AES core.
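The buffering step can be modelled as follows. This is a minimal sketch, not the thesis RTL: the class name `WordBuffer` is hypothetical, and the word order (first word received becomes the most significant 32 bits) is an assumption, since the text does not state it.

```python
class WordBuffer:
    """Collects 32-bit words and emits one 128-bit block every four words."""

    WORDS_PER_BLOCK = 4

    def __init__(self):
        self.words = []

    def push(self, word):
        """Store one 32-bit word; return the 128-bit block once the buffer is full."""
        self.words.append(word & 0xFFFFFFFF)
        if len(self.words) < self.WORDS_PER_BLOCK:
            return None  # buffer not full yet, nothing goes to the AES core
        block = 0
        for w in self.words:
            # Assumed ordering: the earliest word ends up in the top bits.
            block = (block << 32) | w
        self.words = []
        return block


buf = WordBuffer()
outputs = [buf.push(w) for w in (0x00112233, 0x44556677, 0x8899AABB, 0xCCDDEEFF)]
print(hex(outputs[-1]))
```

Only the fourth call returns a value, mirroring the statement that the 128-bit output is produced when the buffer is full.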

Input:

* Clk: clock signal.
* Reset_n: reset signal, active low.
* Key_in: key data with a size of 128 bits in hexadecimal format.

Figure 3.9 shows the system architecture that we built. At the top left is the RISC-V processor, connected to the I-cache and D-cache; the rest of the structure consists of Xilinx IPs such as DMA, BRAM, MIG, and UartLite.

During the thesis project, I used Vivado 2021.2 together with the Xilinx Virtex-7 VC707 development kit for the FPGA implementation. The design has the following characteristics:

* It includes two types of memory: BRAM (Block RAM) and SDRAM (Synchronous Dynamic RAM).
* The CDMA IP (Central Direct Memory Access) is used to facilitate data transfer.

The assembly instructions are translated into hex code and stored in either the BRAM or the SDRAM. After the reset is completed, the processor starts reading from the Instruction Memory. If no instructions are available, the Control block mentioned in section 3.3.1 sends a read request to the AXI4 bus and simultaneously signals the processor to halt its operation. The data returned by the AXI4 bus Slave is then handled in the CPU.

Figure 3.10: System design on Vivado.

Figure 3.10 depicts the configuration of the Block Design in Vivado. In this setup, the RISC-V processor operates as the Master, while the IPs function as Slaves. The processor starts reading from address 0xC000_0000. Moreover, the Clocking Wizard IP acts as the PLL, generating the primary clock signal for the design, and the Processor System Reset IP aids in synchronizing the IPs.

3.5 Xilinx IPs used in Block design

3.5.1 Memory Interface Generator IP (MIG 7 Series)


Figure 3.11: Memory Interface Generator IP.

Figure 3.11 shows the Memory Interface Generator (MIG) IP, which is available in Vivado. Vivado uses the MIG IP to instantiate the memory controller interface, which communicates with the DDR3 SDRAM and is accessed using the AXI4 protocol. Once the init_calib_complete signal is asserted, handshakes with this IP can begin.

Figure 3.12: AXI Central Direct Memory Access IP.

Figure 3.12 shows the Central Direct Memory Access IP, which is available in Vivado.

The Central Direct Memory Access (CDMA) IP is a Xilinx IP that provides direct memory access between a memory-mapped source address and a memory-mapped destination address using the AXI4 protocol.

This IP lets the master receive data from the slave without interruption. Additionally, the CDMA IP helps avoid bottlenecks in scenarios where the master processor is simultaneously requesting reads from and writes to the slave.


Figure 3.13: BRAM Controller and Block Memory Generator IP.

Figure 3.13 shows the BRAM Controller and Block Memory Generator IPs, which are available in Vivado.

The Block Memory Generator IP only has simple ports, as shown in Figure 5.5. To enable communication on the AXI4 bus, an additional IP called Bram

Figure 3.14 shows the Uartlite IP, which is available in Vivado.

The UARTLite IP is used for serial communication and commonly interfaces with external devices such as serial consoles, terminals, or other UART-based peripherals. It provides a simple and standardized way to transmit and receive data in a serial format. The UARTLite IP core handles the conversion of parallel data to serial data and vice versa, as well as the timing and synchronization aspects of the communication.

3.5.5 System Cache IP

Figure 3.15 shows the System Cache IP, which is available in Vivado.

When the CPU needs to access data or instructions that are already stored in the cache, it can read them quickly from the cache without having to access the main memory. This reduces access time and improves the system's performance.

Before the System Cache core is available for general data access, the entire memory is cleared (all previous content is discarded). The time it takes to clear the cache depends on the configuration; the approximate time is 2*C_CACHESIZE/(4*C_CACHE_LINE_LENGTH) clock cycles. After that time, the Initializing signal goes low and the IP is ready to operate.
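As a quick worked example of the formula above, the sketch below evaluates it for one configuration. The parameter values (a cache size of 32768 and a line length of 16) are hypothetical and were chosen only to show the arithmetic; they are not the values used in this design.

```python
def cache_clear_cycles(cache_size, cache_line_length):
    """Approximate clear time in clock cycles: 2*size / (4*line_length)."""
    return 2 * cache_size // (4 * cache_line_length)


# Hypothetical configuration, for illustration only.
cycles = cache_clear_cycles(cache_size=32768, cache_line_length=16)
print(cycles)
```

With these example values the estimate is 2 * 32768 / (4 * 16) = 1024 clock cycles before the Initializing signal can be expected to go low.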


Figure 4.1: Testing scenario, with Sub1 (loop: finding the greatest common divisor of 2 numbers) and Sub2 (floating point: double precision floating-point numbers).

The testing scenario depicted in Figure 4.1 comprises four programs: "Initial," "Main," "Sub1," and "Sub2." The "Initial" program initializes values by executing logical and arithmetic computations or by transferring data between the Register File and Data Memory.

The thesis implements 11 distinct branching and jumping instructions. These instructions are used to call specific code segments from the main program, after which control flow returns to the main program using the "jump and link register" instruction (jalr $1, $1, 0).

Following the completion of the "Initial" program, execution proceeds to the "Main" program. This program encompasses two subroutines that are invoked using different branching and jumping instructions.

Subsequently, the "Sub1" program is executed to address the problem of "finding the greatest common divisor of two numbers". Upon completion, control flow returns to the "Main" program, followed by a jump to the "Sub2" program to solve the problem of "performing arithmetic operations (+, -, *, /) on two double precision floating-point numbers". Finally, control flow returns to the "Main" program, where it concludes.

The testing scenario involves a lengthy script and utilizes a "while" loop for iterative purposes.

JAL $1, SUB1          ADDI $8, $0, 17
xor $12 $5 $6         FCVT.D.W $1, $5
or $13 $5 $6          FCVT.D.W $2, $6
xori x14 x57          FCVT.D.W $3, $7

SUB1:                 FSUB.D $8, $5, $6
addi $5, $0, 108      FMUL.D $9, $5, $6
sw $5, 0($3)          FDIV.D $10, $5, $6
addi $6, $0, 56       FSD $7, 12($3)
sw $6, 4($3)          FSD $8, 16($3)

Table 4.1 lists the assembly code tested in this simulation.

The "Initial" program includes a series of instructions that encompass logical and arithmetic computations, as well as instructions for transferring data between the Register File and Data Memory. These instructions initialize values necessary

Figure 4.2: Request from CPU through Read Address Channel.

Figure 4.2 shows the CPU's request through the Read Address Channel to fetch data from address C000_0000.

Once the rst signal is deasserted, the axi_arvalid signal goes high and the address to be read is simultaneously driven onto the AXI4 bus. During the data read, the processor suspends its operation.

Figure 4.3: Receive data through Read Channel.

Figure 4.3 shows the data being received through the Read Channel for the address sent above.

Data is returned via the axi_rdata bus in bursts of four beats spread over four cycles. Each beat within the burst is 32 bits long and carries a single 32-bit instruction. In Figure 6.4, the burst returns four instructions from address C000_0000 to C000_000C. The axi_rlast signal is asserted high to mark the final beat of the burst.
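The address range quoted above follows directly from the burst geometry, as this short sketch shows. The helper name `burst_addresses` is introduced here for illustration; the start address, beat count, and beat width come from the text.

```python
def burst_addresses(start, beats=4, bytes_per_beat=4):
    """Byte address of every beat in an incrementing burst."""
    return [start + i * bytes_per_beat for i in range(beats)]


addrs = burst_addresses(0xC000_0000)
# rlast is high only on the final beat of the burst.
rlast = [i == len(addrs) - 1 for i in range(len(addrs))]

for addr, last in zip(addrs, rlast):
    print(f"{addr:08X} rlast={int(last)}")
```

Four beats of 4 bytes each cover 16 bytes, so the last instruction of the burst sits at the start address plus 0xC.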

Figure 4.4 shows the instructions read from the Slave being stored into the CPU over the AXI4 bus.

Upon receiving the initial burst, the instructions are fed into the CPU. The CPU receives and executes instructions simultaneously. Upon receiving the last beat, the CPU completes any instructions remaining in its pipeline stages and asserts axi_arvalid to indicate that it is ready to receive a new burst of data.

Figure 4.5: Executing process in CPU (1).

Figure 4.5 shows the execution in the CPU of the 8 instructions read from the Slave.

During the Instruction Fetch (IF) phase, the processor starts reading from address C000_0000, followed by C000_0004 in the subsequent cycle, and so forth until it reaches address C000_000C, where it pauses and requests a read from the AXI4 bus. The Instruction Decode (ID) stage follows, in which the processor decodes instructions, checks for potential conflicts, and resolves preceding data dependencies. The calculated results are displayed as signed decimal numbers.

Instruction       Instruction Code (Hex)   Result (Decimal)
ADDI $5, $0, 8    0x00800293               $5 = 8

Table 4.2 shows the results of the 8 instructions read from the Slave.
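The hex code listed for ADDI $5, $0, 8 can be reproduced from the standard RV32I I-type layout (imm[11:0], rs1, funct3, rd, opcode). The function below is a minimal sketch written for this check; `encode_addi` is a name introduced here, not part of the thesis toolchain.

```python
def encode_addi(rd, rs1, imm):
    """Encode ADDI rd, rs1, imm as a 32-bit RV32I I-type instruction word."""
    if not -2048 <= imm <= 2047:
        raise ValueError("immediate must fit in 12 signed bits")
    opcode = 0b0010011  # OP-IMM
    funct3 = 0b000      # ADDI
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


word = encode_addi(rd=5, rs1=0, imm=8)
print(f"0x{word:08X}")
```

The fields combine as 0x008 in bits 31:20, register 5 in bits 11:7, and opcode 0x13, which gives the same word as the table entry.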

Among the instructions above, 2 write values into Data Memory: the processor calculates the write address, and the value is then written and stored in Data Memory. The available store instructions are SB (Store Byte), SH (Store Half Word), and SW (Store Word); unlike loads, stores have no separate signed and unsigned variants.

Figure 4.6: Write data on AXI4 BUS.

Figure 4.6 shows that data is written to SDRAM right after it is written to the Data Memory in the CPU, through simultaneous handshakes on the Write Address and Write Data channels.

The CPU proceeds with the execution of instructions from C000_0020 to C000_0030. This instruction sequence includes three Load instructions and a Jump instruction. In the RISC-V architecture, Load instructions come in the same three widths as Store instructions (byte, half word, and word), with the additional distinction between signed and unsigned loads. As a result, there are a total of five Load instructions in this CPU. Counting the FLD (floating-point load double word) instruction from the double precision floating-point extension, the total number of Load instructions is six. Because of the Jump instruction, the instructions that follow it are skipped, and the CPU jumps directly to address C000_0050 (the SUB1 address) to execute the subsequent instruction. Furthermore, the CPU stores the address of the instruction immediately following the Jump instruction in a register.
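The difference between the signed and unsigned load variants can be illustrated with a small Python model. The mnemonics are the standard RV32I ones (LB, LH, LW, LBU, LHU); the memory contents and the `load` helper are hypothetical and serve only to show sign extension versus zero extension into a 32-bit register.

```python
# (size in bytes, sign-extend?) for the five integer loads of RV32I.
LOADS = {
    "LB": (1, True),
    "LH": (2, True),
    "LW": (4, True),
    "LBU": (1, False),
    "LHU": (2, False),
}


def load(memory, addr, op):
    """Read little-endian bytes and return the resulting 32-bit register value."""
    size, signed = LOADS[op]
    value = int.from_bytes(memory[addr:addr + size], "little", signed=signed)
    return value & 0xFFFFFFFF  # view the result as a 32-bit register


# Hypothetical memory image: the byte at address 0 has its top bit set.
memory = bytes([0x80, 0xFF, 0x00, 0x00])
for op in LOADS:
    print(op, f"{load(memory, 0, op):08X}")
```

LB and LH copy the top bit of the loaded value into the upper register bits, while LBU and LHU fill those bits with zeros, which is why the byte and half-word widths each need two instructions and the word width needs only one.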

Instruction Instruction Code (Hex) Result (Decimal)

Table 4.3 shows the results of the 8 instructions read from the Slave.

Table 4.4: Finding the GCD of 2 numbers in Assembly.

SUB1:
addi $5, $0, 108
sw $5, 0($3)
addi $6, $0, 56
sw $6, 4($3)
lw $18, 0($3)
lw $19, 4($3)
mul $20, $18, $19
beq $20, $0, 24
blt $19, $18, 12
mod $19, $19, $18
jal $2, -16
mod $18, $18, $19
jal $2, -24
add $21, $18, $19
sw $21, 8($3)
JALR $1, $1, MAIN

Table 4.4 lists the Assembly code for finding the greatest common divisor of 2 numbers.

We will now determine the greatest common divisor (GCD) of 108 and 56.

To do this, we use a while loop that terminates only when either 'a' or 'b' equals 0. During each iteration, we divide the larger number by the other number and keep the remainder. This process is repeated until we obtain the desired result. In terms of algorithmic logic, the Assembly code and the C++ code follow the same principles.
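The loop described above can be mirrored in Python to check the expected result. This is a sketch that follows the structure of the Assembly in Table 4.4 (product test, comparison, remainder, final sum); the function name `gcd_mod_loop` and the iteration counter are additions made here for illustration.

```python
def gcd_mod_loop(a, b):
    """GCD by repeated remainder; returns (gcd, number of loop iterations)."""
    iterations = 0
    while a * b != 0:      # mul $20, $18, $19 / beq $20, $0, ...
        if b < a:          # blt $19, $18, ...
            a = a % b      # mod $18, $18, $19
        else:
            b = b % a      # mod $19, $19, $18
        iterations += 1
    # One operand is now 0, so the sum equals the remaining non-zero value.
    return a + b, iterations   # add $21, $18, $19


result, count = gcd_mod_loop(108, 56)
print(result, count)
```

For 108 and 56 the operands evolve as (108, 56), (52, 56), (52, 4), (0, 4), so the loop body runs three times and the sum of the two registers is the GCD.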

After jumping to SUB1, we are preparing the data to compute the greatest common divisor (GCD) of 108 and 56.

Instruction         Instruction Code (Hex)   Result (Decimal)
addi $5, $0, 108    00719423                 $5 = 108
sw $5, 0($3)        00018583                 Mem (0) = 108
addi $6, $0, 56     0001C583                 $6 = 56
sw $6, 4($3)        0001D583                 Mem (4) = 56
lw $18, 0($3)       0001A903                 $18 = 108

Table 4.5 shows the results of the 8 instructions read from the Slave.

Here, a Branch instruction is taken because 'a' is greater than 'b'. Consequently, the CPU jumps to the address containing the instruction for computing 'a mod b'. Once this instruction is executed, the CPU proceeds to the instruction for calculating 'a * b'. The loop condition is then evaluated; if the product is not zero, the loop continues.

Instruction          Instruction Code (Hex)   Result (Decimal)
beq $20, $0, 24      020A0863
blt $19, $18, 12     0129CC63
mod $18, $18, $19    11390933                 $18 = 52
jal $2, -24          FD1FF16F

Table 4.6 shows the results of the 8 instructions read from the Slave.

In this iteration, the scenario differs from the previous one: 'a' is now 52 and 'b' is 56, so 'a' is smaller than 'b'. Consequently, the CPU does not take the branch and proceeds directly to the subsequent instruction, which computes 'b mod a'. The CPU then jumps back to the instruction for computing 'a * b' and evaluates the condition again.

Instruction          Instruction Code (Hex)   Result (Decimal)
beq $20, $0, 24      020A0863
blt $19, $18, 12     0129CC63
mod $19, $19, $18    112989B3                 $19 = 4
jal $2, -16          FE1FF16F

Table 4.7 shows the results of the 8 instructions read from the Slave.

Figure 4.11: Integer Regfile after finishing the loop in SUB1.

Figure 4.11 shows the Integer Register File after the greatest common divisor of the 2 numbers has been found.

Figure 4.12 shows the Data Memory after the greatest common divisor of the 2 numbers has been found.

After three iterations, we obtain the correct result. The final result has been stored in the Data Memory and has also been propagated through the AXI4 bus.

After completing SUB1, the CPU will return to MAIN and perform some logic and arithmetic instructions to prepare for SUB2 and then jump to SUB2.

Table 4.8: Performing (+-*/) floating point numbers in Assembly.

ADDI $5, $0, 57
ADDI $6, $0, 21
ADDI $7, $0, 75
ADDI $8, $0, 17
FCVT.D.W $1, $5
FCVT.D.W $2, $6
FCVT.D.W $3, $7
FCVT.D.W $4, $8
FDIV.D $5, $1, $2
FDIV.D $6, $3, $4
FADD.D $7, $5, $6
FSUB.D $8, $5, $6
FMUL.D $9, $5, $6
FDIV.D $10, $5, $6
FSD $7, 12($3)

Table 4.8 lists the Assembly code that performs (+, -, *, /) on double precision floating-point numbers, with 1st number = 57/21 and 2nd number = 75/17.

Our task involves executing the arithmetic operations of addition, subtraction, multiplication, and division on two double precision floating-point numbers: 57/21 (~2.71429) and 75/17 (~4.41176). While the assembly code for these operations is not overly complex, the challenge lies in constructing a double precision floating-point extension for the 32-bit RISC-V architecture.
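As a software cross-check of the expected values, the same sequence can be evaluated with Python floats, which are IEEE 754 double precision on common platforms. This is only a reference sketch that mirrors the operations in Table 4.8; the helper `to_double` and the `results` dictionary are names introduced here, and the output is not taken from the processor.

```python
def to_double(n):
    """Integer to double conversion, the role FCVT.D.W plays in the listing."""
    return float(n)


a = to_double(57) / to_double(21)   # first FDIV.D: 1st operand
b = to_double(75) / to_double(17)   # second FDIV.D: 2nd operand

results = {
    "FADD.D": a + b,
    "FSUB.D": a - b,
    "FMUL.D": a * b,
    "FDIV.D": a / b,
}

for name, value in results.items():
    print(f"{name}: {value:.15g}")
```

Printing about 15 significant digits makes it easy to compare these reference values against the contents of the floating-point register file.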

Figure 4.13: Floating point Regfile after finishing SUB2.

Figure 4.13 shows the Floating Point Register File after the (+, -, *, /) operations on double precision floating-point numbers have completed.

Figure 4.14: Data memory after finishing SUB2.

Figure 4.14 shows the Data Memory after the (+, -, *, /) operations on double precision floating-point numbers have completed.

Table 4.9: Comparing division results between the IEEE754 calculator and the RISC-V core

Sample    IEEE754 Calculator    RISC-V core

Table 4.9 compares 10 samples computed on our self-designed processor against the IEEE754 calculator developed in JavaScript by Prof. Dr. Edmund Weitz, in order to assess accuracy. The rounding mode we used is round to nearest, ties to even, and every calculation completes after 52 cycles. The processor's results differ from the calculator's only at the 15th to 16th digit, counting the digits before the decimal point.

After receiving data from the CPU through the AXI4 bus, the buffer outputs a complete 128-bit block to plain_text_in so that the AES core can perform the encryption.

Figure 4.16: The data of AES core

In Figure 4.16, after finishing all rounds, the AES core outputs the cipher_text and saves it in the AES register file.

Title	Implement RISC-V and integrate with Security Block on FPGA
Authors	Huynh Le Anh Bao, Tran Huu Chau
Supervisor	Nguyen Minh Son, PhD
University	University of Information Technology
Major	Computer Engineering
Type	Graduation Thesis
Year of publication	2023
City	Ho Chi Minh

Format
Number of pages	98
File size	25.76 MB