Computer Engineering graduation thesis: Research and design of a RISC-V processor with a multithreaded microarchitecture

VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
FACULTY OF COMPUTER ENGINEERING
LE NGUYEN HOANG THIEN
RESEARCH DESIGN RISC-V WITH MULTITHREADING MICR

Instruction Fetch (IF)

Figure 2.5 shows the IF stage with its main blocks and the connections between them.

Figure 2.5: System design of Instruction Fetch stage

The primary purpose of this stage is to fetch instructions from I-mem and send them to the next stage. To do so, the stage needs the I/O signals described in Table 2.3.

This stage supports dynamic branch prediction using a Branch Target Buffer (BTB). Briefly, dynamic branch prediction is a technique in which the processor predicts the instruction that follows a branch based on the outcomes of previous executions. In addition, the JAL instruction is executed in this stage to increase the throughput of IF. The stage consists of several main blocks that work together:

- Branch Target Buffer (BTB): The hardware that supports dynamic branch prediction.

- Instruction Memory (I-mem): Stores the instructions (more details in 3.2.3).

- IF Controller: Controls the MUX that selects the next address to store into the PC register.

- JAL Address Calculation: Uses the PC to calculate the next address for a jal instruction.

- PC Register: A 32-bit register.

Table 2.3: I/O description of the Instruction Fetch stage

Name | Length | I/O | Description
branch_result | 1 | Input | From the EX stage. 1: the prediction was correct and the processor does not have to recover. 0: the prediction failed and the processor needs to recover the missed instructions.
branch_done | - | Input | From the EX stage. 1: a branch instruction has completed in the EX stage. 0: no branch instruction has completed in the EX stage.
branch_address | 32 | Input | From the EX stage. The address of the instruction that follows the branch instruction.
branch_tag | - | Input | From the EX stage. The tag of the branch instruction, corresponding to its location in the BTB.
stall_en | - | Input | From the ID stage. 1: the processor is stalled; the IF stage does not fetch new instructions or update the PC register. 0: the processor is not stalled.
start_addr | 32 | Input | Start address of the thread.
length | 10 | Input | Length of the thread.
new_thread_en | - | Input | The OS drives this signal to 1 to load the start location of the next thread into the PC.
fetch_address | 32 | Output | The address used for fetching the instruction from I-mem.
instruction [31:0] | 32 | Output | The instruction fetched at fetch_address.
predict_branch_tag | n | Output | The tag of the branch instruction, associated with its location in the BTB.
predict_branch_result | - | Output | 1: the branch is predicted TAKEN, so the next address comes from the BTB. 0: the branch is predicted NOT TAKEN, so the next address is PC + 4. This signal is sent to the Execute stage to determine whether the processor must recover.
predict_branch_enable | - | Output | 1: a branch or jump instruction is fetched. 0: no branch or jump instruction is fetched.
new_thread_ready | 1 | Output | Goes to 1 when the processor has finished the current thread and is ready to execute a new thread.

More specifically, the first task of this stage is to determine the address of the next instruction. The processor handles four cases and stores the resulting value into the PC register:

- Regular instruction: The IF stage fetches an instruction that cannot update the PC register, for example an R-type or I-type instruction. In this case, the IF Controller generates a select signal equal to 2'd1, meaning the next address equals PC + 32'd4.

- Branch instruction: When the IF stage fetches a branch instruction, the next instruction depends on the outcome of the branch. The IF Controller takes predict_branch_result from the BTB to determine the next instruction:
  - When predict_branch_result equals 1, the controller generates a select signal of 2'd3 to choose predict_branch_address as the next address.
  - When predict_branch_result equals 0, the controller generates a select signal of 2'd1 to choose PC + 32'd4 as the next address.

- jal instruction: When the IF stage fetches a jal instruction, the address of the following instruction is calculated by the JAL Address Calculation block. The IF Controller generates a select signal equal to 2'd0.

- Stall: When the processor is stalled, that is, when stall_en equals 1, the IF Controller generates a select signal equal to 2'd2, which means the PC register is not updated.
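The four cases above can be sketched as a small selection function. This is an illustrative Python model, not the RTL of the design: the function name `select_next_pc`, its argument list, and the priority order (stall first, then jal, then predicted-taken branch) are assumptions; only the select encodings are taken from the text.

```python
def select_next_pc(pc, stall_en, is_jal, is_branch,
                   predict_branch_result, predict_branch_address, jal_address):
    """Return (select, next_pc) for the PC multiplexer (illustrative model)."""
    mask = 0xFFFFFFFF  # the PC register is 32 bits wide
    if stall_en:
        return 2, pc & mask                      # 2'd2: hold the PC while stalled
    if is_jal:
        return 0, jal_address & mask             # 2'd0: target from JAL Address Calculation
    if is_branch and predict_branch_result:
        return 3, predict_branch_address & mask  # 2'd3: predicted-taken target from the BTB
    return 1, (pc + 4) & mask                    # 2'd1: regular or predicted not taken


# Usage: a regular instruction at 0x100 advances the PC to 0x104.
print(select_next_pc(0x100, False, False, False, False, 0, 0))
```

Modeling the stall case first reflects the description that a stalled IF stage neither fetches nor updates the PC; the actual priority in the hardware may differ.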

The second task of the stage is to handle a branch instruction that completes in the EX stage. When a branch instruction completes in the EX stage, the branch_done signal is set to 1. In this case, the address sent to I-mem is branch_address. If no branch instruction has completed, the address sent to I-mem is the value of the PC register.

In addition, this stage supports switching between threads. A thread is done when the current address passes the end of the thread, which is given by start_addr + length. The new_thread_ready signal then goes to 1 to notify the operating system (OS) that this thread has finished and that the processor can execute another thread. The OS needs to read the data in the thread control block (TCB) and load suitable values into the start_addr and length signals.
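The thread-switching handshake can be illustrated with a short Python model. It is a sketch under stated assumptions: the thread is assumed to occupy the address range [start_addr, start_addr + length), length is assumed to use the same unit as the addresses, and the `ThreadSlot` class with its hypothetical TCB list of (start_addr, length) pairs is invented for illustration.

```python
from collections import deque


def thread_done(start_addr, length, current_address):
    # Assumption: the thread occupies [start_addr, start_addr + length).
    return current_address >= start_addr + length


class ThreadSlot:
    """Illustrative model of one hardware thread slot fed by the OS."""

    def __init__(self, tcb_entries):
        self.tcb = deque(tcb_entries)  # hypothetical TCB: (start_addr, length) pairs
        self.start_addr, self.length = self.tcb.popleft()
        self.pc = self.start_addr

    def step(self):
        """Fetch one instruction; return True when new_thread_ready would rise."""
        self.pc += 4
        return thread_done(self.start_addr, self.length, self.pc)

    def load_next(self):
        """Model the OS driving new_thread_en with the next TCB entry."""
        if not self.tcb:
            return False
        self.start_addr, self.length = self.tcb.popleft()
        self.pc = self.start_addr
        return True


slot = ThreadSlot([(0x00, 8), (0x40, 4)])
while not slot.step():
    pass                      # run the first thread to completion
print(slot.load_next(), hex(slot.pc))
```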

Instruction Decode (ID)

The previous section discussed the system design of the IF stage. After receiving the instruction from the previous stage, the ID stage converts the instruction into operands. It also performs several tasks that support Out-of-Order execution. All of these are discussed in this section. Figure 2.6 shows the system design of the ID stage, and each of its inputs and outputs is described in Table 2.4.

Figure 2.6: System design of Instruction Decode stage

The ID stage contains the following blocks:

- Extended: Expands the immediate field of the instruction into a full 32-bit immediate.

- ID Controller: Controls the write enable signals of the ARF and RRF. This block also generates the dispatch signals for the Reservation Buffers.

- Architecture Register File (ARF): The main register file of the processor. This block is modified for renaming (more details in 3.2.4).

- Renaming Register File (RRF): This block also supports the renaming mechanism.

- Forwarding: When the destination register of an instruction being written back matches a decoded value, the decoded value is replaced with the value being written back.
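As an illustration of the Extended block, the following Python sketch sign-extends the immediate of I-type, S-type and B-type instructions to 32 bits. The bit positions follow the standard RV32I encodings; the function names are hypothetical, and only a subset of formats is covered, so this is a sketch rather than the block's actual implementation.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def extend_immediate(instruction):
    """Return the 32-bit extended immediate for I-, S- and B-type encodings."""
    opcode = instruction & 0x7F
    if opcode in (0x13, 0x03, 0x67):      # OP-IMM, LOAD, JALR: I-type, imm[11:0]
        imm = sign_extend(instruction >> 20, 12)
    elif opcode == 0x23:                  # STORE: S-type, imm[11:5] | imm[4:0]
        raw = ((instruction >> 25) << 5) | ((instruction >> 7) & 0x1F)
        imm = sign_extend(raw, 12)
    elif opcode == 0x63:                  # BRANCH: B-type, imm[12|11|10:5|4:1], bit 0 is zero
        raw = ((((instruction >> 31) & 0x1) << 12)
               | (((instruction >> 7) & 0x1) << 11)
               | (((instruction >> 25) & 0x3F) << 5)
               | (((instruction >> 8) & 0xF) << 1))
        imm = sign_extend(raw, 13)
    else:
        imm = 0                           # other formats are not covered by this sketch
    return imm & 0xFFFFFFFF               # present the result as a 32-bit value


# Usage: addi x31, x0, 9 is encoded as 0x00900F93.
print(hex(extend_immediate(0x00900F93)))
```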

Table 2.4: I/O description of the Instruction Decode stage

Name | Length | I/O | Description
branch_done | 1 | Input | From the EX stage. 1: the prediction was correct and the processor does not have to recover. 0: the prediction failed and the processor needs to recover the missed instructions.
i_predict_branch_enable | 1 | Input | From the IF stage. Used to generate o_predict_branch_enable, which stalls the processor when a branch instruction is decoded.
i_predict_branch_result | 1 | Input | From the IF stage. The signal passes through the ID stage and is sent to the Issue stage for execution.
i_predict_branch_tag | - | Input | From the IF stage. The signal passes through the ID stage and is written back to the BTB after the branch has executed.
in_address | 32 | Input | The address of the current instruction.
instruction | 32 | Input | The current instruction to be decoded in this stage.
wb_enable | - | Input | From the EX stage. 1: the prediction was correct and the processor does not have to recover. 0: the prediction failed and the processor needs to recover the missed instructions.
wb_tag | - | Input | From the EX stage. The tag of the destination register. This signal is sent to the Architecture Register File (ARF) and the Renaming Register File (RRF) to release and write back.
wb_data | 32 | Input | From the EX stage. The value of the destination register associated with wb_tag in the ARF.
o_predict_branch_enable | - | Output | Used for stalling the processor. 1: the processor is stalled. 0: the processor is not stalled.
o_predict_branch_result | - | Output | Sent to the Issue stage and used for further execution.
o_predict_branch_tag | - | Output | The tag of the branch instruction in the BTB. The signal is sent to the Issue stage to update the BTB when the branch instruction is done.
o_address | 32 | Output | The address of the current instruction, passed on to the Execute (EX) stage.
immediate_32-bit | 32 | Output | The extended immediate of the instruction.
operand_1 | - | Output | The first operand of the instruction.
operand_2 | 32 | Output | The second operand of the instruction.
tag_1 | 32 | Output | The tag of the first operand of the instruction.
tag_2 | 32 | Output | The tag of the second operand of the instruction.
checker_1 | - | Output | The checker signal of the first operand. 1: the operand is ready for execution. 0: the operand is not ready for execution.
checker_2 | - | Output | The checker signal of the second operand. 1: the operand is ready for execution. 0: the operand is not ready for execution.
des_tag | - | Output | The name of the destination register after renaming.
opcode | - | Output | The opcode field of the instruction.
func3 | - | Output | The func3 field of the instruction.
func7 | 7 | Output | The func7 field of the instruction.
w_alu_en | 1 | Output | The write enable of the RB for ALU.
w_lsq_en | 1 | Output | The write enable of the Load Store Queue.
w_branch_en | 1 | Output | The write enable of the RB for Branch.

The stage has four main tasks that run in parallel:

- Decode: The ARF takes the rs1, rs2 and rd fields from the instruction port. The rd field is used to set the busy bit in the ARF. The read_data_1, read_data_2, read_tag_1, read_tag_2, read_checker_1 and read_checker_2 signals carry the values in the ARF associated with the rs1 and rs2 fields. These values pass through the Forwarding block so that they reflect the newest values written back from the EX stage.

- Renaming: In this stage, the destination register is renamed for Out-of-Order execution. The new name is sent to the Issue stage through the des_tag port.

- Releasing and updating the ARF and RRF: When an instruction is done in the EX stage, its result is sent to the ID stage through the wb_tag, wb_en and wb_data ports.
  - The ARF updates the data register and releases the busy register associated with wb_tag.
  - The RRF releases the busy register corresponding to wb_tag.

- Dispatch: Besides controlling the write enable signals of the ARF and RRF, the ID Controller also generates the dispatch signals w_alu_en, w_branch_en and w_lsq_en. These signals determine which Reservation Buffer (RB) the decoded instruction is sent to.
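The dispatch task can be sketched as a mapping from the opcode field to the three write-enable signals. The opcode values below are the standard RV32 major opcodes; the grouping follows the instruction lists given for the LSQ and the two Reservation Buffers, while the function name `dispatch` and the dictionary return type are assumptions made for illustration.

```python
# Standard RISC-V major opcodes (instruction bits [6:0]).
OPCODE_LOAD, OPCODE_STORE, OPCODE_AMO = 0x03, 0x23, 0x2F      # lw/lh/lb, sw/sh/sb, lr.w/sc.w
OPCODE_BRANCH, OPCODE_JAL, OPCODE_JALR = 0x63, 0x6F, 0x67
OPCODE_OP, OPCODE_OP_IMM = 0x33, 0x13                          # R-type and I-type ALU


def dispatch(instruction):
    """Return the write-enable signals for the LSQ and the two RBs (illustrative)."""
    opcode = instruction & 0x7F
    return {
        "w_lsq_en": int(opcode in (OPCODE_LOAD, OPCODE_STORE, OPCODE_AMO)),
        "w_branch_en": int(opcode in (OPCODE_BRANCH, OPCODE_JAL, OPCODE_JALR)),
        "w_alu_en": int(opcode in (OPCODE_OP, OPCODE_OP_IMM)),
    }


# Usage: addi x31, x0, 9 (0x00900F93) is routed to the ALU reservation buffer.
print(dispatch(0x00900F93))
```

At most one enable is set per instruction, so each decoded instruction is written to exactly one buffer or to none.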

Figure 2.7 shows the system design of the Issue stage, and Table 2.5 shows the I/O connections of this stage. The Issue stage stores instructions that are not ready, that is, instructions whose operands have not yet been written back. This stage also determines which instructions are ready and sends them to the EX stage.

Figure 2.7: System design of Issue stage

- Load Store Queue (LSQ): A queue that stores memory-access instructions:
  - Load instructions: lw, lh, lb.
  - Store instructions: sw, sh, sb.
  - Atomic instructions: lr.w, sc.w.

- Reservation Buffer for Branch (RBB): Stores the branch and jump instructions:
  - Branch instructions: beq, bne, blt, bge, bltu, bgeu.
  - Jump instructions: jal, jalr.

- Reservation Buffer for Arithmetic and Logic (RBAL): Stores the arithmetic and logic instructions:
  - Arithmetic instructions: add, addi, sub, slt, slti, sltu, sltiu, sll, slli, srl, sra, srai, srli.
  - Logic instructions: and, andi, or, ori, xor, xori.

- Issue Write controller: Because the arithmetic and logic instructions belong to two types, R-type and I-type, this block determines which fields are stored.
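The way a reservation buffer holds unready instructions and wakes them up on write-back can be modeled in a few lines of Python. This is a sketch, not the design's RTL: the `ReservationBuffer` class, the (value, tag, checker) operand layout and the oldest-ready-first issue policy are assumptions chosen to match the signals named in this section.

```python
class ReservationBuffer:
    """Illustrative reservation buffer: entries wait until all operands are ready."""

    def __init__(self, depth):
        self.depth = depth
        self.entries = []          # oldest entry first

    @property
    def full(self):
        return len(self.entries) >= self.depth

    def write(self, des_tag, operands):
        """Store an instruction; operands is a list of (value, tag, checker)."""
        if self.full:
            return False
        self.entries.append({"des_tag": des_tag,
                             "ops": [list(op) for op in operands]})
        return True

    def write_back(self, wb_tag, wb_data):
        """Broadcast a result: waiting operands with a matching tag become ready."""
        for entry in self.entries:
            for op in entry["ops"]:
                if op[2] == 0 and op[1] == wb_tag:
                    op[0], op[2] = wb_data, 1

    def issue(self):
        """Remove and return the oldest entry whose checkers are all 1, if any."""
        for index, entry in enumerate(self.entries):
            if all(op[2] == 1 for op in entry["ops"]):
                return self.entries.pop(index)
        return None


rb = ReservationBuffer(depth=2)
rb.write(3, [(0, 7, 0), (5, 0, 1)])   # first operand still waits for tag 7
rb.write(4, [(1, 0, 1), (2, 0, 1)])   # both operands ready
print(rb.issue()["des_tag"])          # the younger, ready instruction issues first
```

The usage lines show the out-of-order effect: the second instruction issues while the first waits for the result tagged 7.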

Table 2.5: I/O description of the Issue stage

Name | Length | I/O | Description
w_lsq_en | 1 | Input | The write enable of the Load Store Queue.
des_tag | 5 | Input | The name of the destination register after renaming.
operand_1 | 32 | Input | The first operand of the instruction.
operand_2 | 32 | Input | The second operand of the instruction.
tag_1 | - | Input | The tag of the first operand of the instruction.
tag_2 | - | Input | The tag of the second operand of the instruction.
checker_1 | - | Input | The checker signal of the first operand. 1: the operand is ready for execution. 0: the operand is not ready for execution.
checker_2 | - | Input | The checker signal of the second operand. 1: the operand is ready for execution. 0: the operand is not ready for execution.
opcode | - | Input | The opcode field of the instruction.
func3 | - | Input | The func3 field of the instruction.
func7 | - | Input | The func7 field of the instruction.
immediate | 32 | Input | The extended immediate of the instruction.
wb_data | 32 | Input | From the EX stage. The value of the destination register associated with wb_tag in the ARF.
wb_en | - | Input | From the EX stage. 1: the prediction was correct and the processor does not have to recover. 0: the prediction failed and the processor needs to recover the missed instructions.
wb_tag | - | Input | From the EX stage. The tag of the destination register. This signal is sent to the Architecture Register File (ARF) and the Renaming Register File (RRF) to release and write back.
predict_branch_result | - | Input | Sent to the Issue stage and used for further execution.
predict_branch_tag | - | Input | The tag of the branch instruction in the BTB. The signal is sent to the Issue stage to update the BTB when the branch instruction is done.
w_branch_en | - | Input | The write enable of the RB for Branch.
address | 32 | Input | The address of the current instruction, passed on to the Execute (EX) stage.
w_alu_en | - | Input | The write enable of the RB for ALU.
- | - | - | The operand of the memory-access instruction.
- | - | - | The immediate of the memory-access instruction.
lsq_address_en | - | Output | 1: an instruction is ready for the address of the memory access to be calculated. 0: no instruction is ready for address calculation.
- | - | - | The func3 field of the memory-access instruction.
lsq_func7 | 7 | Output | The func7 field of the memory-access instruction.
lsq_opcode | 7 | Output | The opcode field of the memory-access instruction.
lsq_type | 1 | Output | The type of memory-access instruction. 1: store instruction (sb, sh, sw, sc.w). 0: load instruction (lb, lh, lw, lr.w).
lsq_store_data | 32 | Output | The data to be stored in D-mem.
lsq_wb_tag | 5 | Output | The tag used for writing back.
lsq_address | 32 | Output | The address of the memory-access instruction.
lsq_full | 1 | Output | 1: the LSQ is full. 0: the LSQ is not full.
lsq_pop_en | 1 | Output | 1: a memory-access instruction is ready to be calculated. 0: no memory-access instruction is ready to be calculated.
branch_operand_1 | 32 | Output | The first operand of the branch instruction.
branch_operand_2 | 32 | Output | The second operand of the branch instruction.
branch_address | 32 | Output | The current address of the branch instruction.
branch_immediate | 32 | Output | The 32-bit immediate of the branch instruction.
branch_func3 | 3 | Output | The func3 field of the branch instruction.
branch_predict_result | 1 | Output | The taken or not-taken value from the BTB, stored in the RB.
branch_predict_tag | 5 | Output | The tag of the branch instruction in the BTB, stored in the RB for writing back.
branch_en | 1 | Output | 1: a branch instruction is ready to be calculated. 0: no branch instruction is ready to be calculated.
branch_wb_tag | 5 | Output | The tag used for writing back.
branch_full | 1 | Output | 1: the RB is full. 0: the RB is not full.
alu_operand_1 | 32 | Output | The first operand of the arithmetic or logic instruction.
alu_operand_2 | 32 | Output | The second operand of the arithmetic or logic instruction.
alu_opcode/func7 | 7 | Output | The opcode or func7 field of the arithmetic or logic instruction.
alu_func3 | 3 | Output | The func3 field of the arithmetic or logic instruction.
alu_full | 1 | Output | 1: the RB is full. 0: the RB is not full.
alu_wb_tag | 5 | Output | The tag used for writing back.
alu_en | 1 | Output | 1: an arithmetic or logic instruction is ready to be calculated. 0: no arithmetic or logic instruction is ready to be calculated.


This section presents our verification plan, which is shown in Figure 4.1. It contains:

- Verify with small tests: Run pre-simulation of the processor with some small tests that cover basic features, including:

  + Verifying Out-of-Order Execution.

  + Verifying the lr.w and sc.w instructions.

- Verify a myriad of test cases: Run pre- and post-simulation for many test cases automatically with software and check the results. This verifies the design more deeply. There are three sources for verifying:

- Installing into the DE2 kit for testing.


Standard Synchronization Problem

Two common cases of the synchronization problem are used for testing:

- Two threads increase one memory register.

Figure 4.37: The assembly code for producer - Thread 1

Figure 4.38: The assembly code for consumer - Thread 2

For this problem, we wrote one assembly file per thread: one increases and the other decreases a shared memory register. Figure 4.39 shows the waveform when the processor runs these files.

Figure 4.39: The waveform of the Producer and Consumer problem.

In Figure 4.39, the "clk" signal is the processor's clock, and the "D_MEM_1002" signal is the data of the register controlled by the two threads. The register increases from zero to five and then decreases from five to zero.
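The behaviour seen in the waveform can be reproduced with a small software model of lr.w and sc.w. This is a hedged sketch, not the assembly used in the thesis: the `LrScMemory` class, the per-thread reservation table and the address constant 1002 (borrowed from the signal name) are assumptions; the return convention of sc.w (0 on success, non-zero on failure) follows the RISC-V specification.

```python
class LrScMemory:
    """Illustrative model of load-reserved / store-conditional on shared memory."""

    def __init__(self):
        self.data = {}
        self.reservation = {}      # thread id -> reserved address

    def lr_w(self, thread, addr):
        self.reservation[thread] = addr
        return self.data.get(addr, 0)

    def sc_w(self, thread, addr, value):
        """Return 0 on success and 1 on failure, as sc.w does."""
        if self.reservation.get(thread) != addr:
            self.reservation.pop(thread, None)
            return 1
        self.data[addr] = value
        # A successful store invalidates every reservation on this address.
        self.reservation = {t: a for t, a in self.reservation.items() if a != addr}
        return 0


def atomic_add(mem, thread, addr, delta):
    """Retry the lr.w / sc.w pair until the store succeeds."""
    while True:
        value = mem.lr_w(thread, addr)
        if mem.sc_w(thread, addr, value + delta) == 0:
            return value + delta


ADDR = 1002                        # hypothetical address of the shared register
mem = LrScMemory()
trace = [atomic_add(mem, 1, ADDR, +1) for _ in range(5)]    # producer, thread 1
trace += [atomic_add(mem, 2, ADDR, -1) for _ in range(5)]   # consumer, thread 2
print(trace)
```

If both threads load the same value before either stores, only the first sc.w succeeds and the other thread retries, which is what keeps the shared counter consistent.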

4.1.6.2 Increasing scenario

Figure 4.40: The assembly code for increasing - Thread 1

Figure 4.41: The assembly code for increasing - Thread 2

The Increasing problem is the case in which two threads increase one register of the processor. Figure 4.42 shows the waveform when two threads run the model code for this problem.

Figure 4.42: The waveform of the Increasing problem.

In Figure 4.42, the "clk" signal is the processor's clock, and the "D_MEM_1002" signal is the data of the register controlled by the two threads. The register increases from zero to 232 and goes back to zero.

For testing on the FPGA, the design is connected to a UART module that sends the data of the register file or the data memory to our computer for further observation.

Figure 4.43: The schematic of the design installed into DE2 Board

To run the design on the FPGA board, follow these steps:

- Step 1: Press KEY[0] to reset the system.

- Step 2: Wait a few seconds while the processor runs.

- Step 3: Switch SW[2] to stop the processor.

- Step 4: Switch SW[0] to start the reading mode.

- Step 5: Switch SW[1] to select the ARF or D-MEM (1 => RF, 0 => D-MEM).

- Step 6: Run the Python code shown in Table 4.5 to capture the data on the computer.

- Step 7: Press KEY[1] to send the data.

- Step 8: The result is shown in the terminal.

Table 4.5: The Python code to capture data

Python code: the code to capture data from the FPGA board

ser = serial.Serial(port="COM§", baudrate00)

print(x, "\t", int.from_bytes(read_data[0:-3], byteorder="little", signed=True))
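The decoding step of the capture script can be tried without a board. The sketch below assumes, for illustration only, that each register arrives as one 4-byte little-endian signed word; the helper name `decode_words` and the sample byte string are hypothetical, and the serial port itself is left out so that the example runs anywhere.

```python
def decode_words(raw, word_size=4):
    """Split raw bytes into signed little-endian integers of word_size bytes."""
    usable = len(raw) - len(raw) % word_size      # ignore a trailing partial word
    return [int.from_bytes(raw[i:i + word_size], byteorder="little", signed=True)
            for i in range(0, usable, word_size)]


# Stand-in for bytes that would be read from the UART.
sample = (9).to_bytes(4, "little", signed=True) + (-1).to_bytes(4, "little", signed=True)
for index, value in enumerate(decode_words(sample)):
    print(index, "\t", value)
```

With a real board, the byte string would come from a read on the open serial port instead of `sample`.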

You can see the demo in this link: Demo - Google Drive

Hardware Utilization and Max Frequency Report from Quartus

The design is installed on the DE2 Education and Development board for testing. Table 4.6 shows the hardware utilization and max frequency of the design.

Table 4.6: Hardware Utilization and Max Frequency

Board | DE2 Education and Development board
Total registers | 8885
Max frequency | 81.74 MHz on DE2 board

Table 4.7 compares the current processor with some single-threaded processors.

Table 4.7: The comparison to the pipelined processors

ISA extension | RV32I, lr.w, sc.w | RV32IMFD | RV32IMAF
Max frequency | 81.74 MHz on DE2 board | 190 MHz on Nexys 4 DDR board | 40 MHz on Virtex-7 board
Pipeline type | 5-stage pipeline | 5-stage pipeline | 5-stage pipeline
Highlight features | Processor with Out-of-Order execution and dynamic branch prediction | Pipeline processor optimization | Architecture with dynamic branch prediction

Table 4.8 illustrates the comparison between the current processor and some multithreaded processors.

Table 4.8: The comparison to the multithreaded processors

Criteria | My design | IoT end-nodes | multithread
ISA extension | RV32I, lr.w, sc.w | RV32IM | Not mentioned
Max frequency | 81.74 MHz on DE2 board | 135.14 MHz | Not mentioned
Pipeline type | 5-stage pipeline | 4-stage pipeline | Not mentioned

Highlight features:

Two-slot multithreaded processor with Out-of-Order execution and dynamic branch prediction.

Interleaved multithreading architecture with a mathematical accelerator.

Table 4.9 compares the current processor with some multicore processors, because they all achieve task-level parallelism.

Table 4.9: The comparison to the multicore processors

Criteria | - | - | Many-core
ISA | RISC-V | Not mentioned | RISC-V
ISA extension | RV32I, lr.w, sc.w | Not mentioned | RV32IMC
Max frequency | 81.74 MHz on DE2 board | 176 MHz on Spartan board | 120 MHz on Virtex Ultra board
Pipeline type | 5-stage pipeline | 4-stage pipeline | 4-stage pipeline
Highlight features | Out-of-Order execution and dynamic branch prediction | Centralized shared memory | Distributed shared memory

Based on the comparisons above, the processor of this project has the advantages and disadvantages listed in Table 4.10.

Table 4.10: Pros and cons of the multithreaded processor

Pros:

- The multithreaded processor can execute two programs independently when they have no violations.

- Stall cycles are reduced by Out-of-Order and dynamic execution.

- The pipeline is optimized for R-type and I-type instructions.

Cons:

- Execution can fail when the programs of the two threads have violations.

Chapter 5 Conclusion and Future Work

In conclusion, the design performs well with RV32I and two hardware-level synchronization instructions, lr.w and sc.w. The processor also supports Out-of-Order execution and dynamic branch prediction and can communicate with the TCB, which are the highlights of this project. The processor functions correctly at 55 MHz in both pre-simulation and post-simulation, which is relatively high compared with some other RISC-V processors.

In addition, this work shows that RISC-V is an ISA suitable for developing modern processors, especially multithreaded processors, and that advanced techniques for reducing stall cycles can be applied to it. In the future, a RISC-V processor that integrates the most advanced techniques could therefore be released on the commercial market. This project also shows that students and academics can freely research and develop their own processors with their own ideas for study purposes.

However, the processor has some limitations. The design frequency is still low compared with other published multithreaded and RISC-V designs around the world. In addition, the design does not support bus systems, SRAM, DMA, and so on, which makes it hard to increase the number of threads.

- It does not implement bus systems.

- Instruction memory and data memory are synthesized as register arrays rather than external SDRAM.

- There is no automatic process for verifying the design on the FPGA board, which is why the number of test cases verified on the FPGA is very small.

Future work:

- Optimizing the critical path to increase the operating frequency.

- Researching and implementing bus systems such as APB, AHB, AXI or CHI.

- Using a RAM to store instructions and data.

- Updating the dynamic branch prediction.

- Developing a process to run test cases on the FPGA board automatically.

An instruction set architecture (ISA) is an abstract model of a computer. It is also referred to as architecture or computer architecture [14].

Before discussing the RISC-V ISA in more detail, we consider other ISAs, such as MIPS, ARMv7, ARMv8 and x86, and compare them with each other.

MIPS (Microprocessor without Interlocked Pipeline Stages) is a reduced instruction set computer (RISC) instruction set architecture (ISA) developed by MIPS Computer Systems, now MIPS Technologies, based in the United States [15].

There are multiple versions of MIPS, including MIPS I, II, III, IV and V, as well as five releases of MIPS32/64 (for 32-bit and 64-bit implementations, respectively) [15].

These versions are described in more detail below [15].

- MIPS I: is a load/store architecture (also known as a register-register architecture); except for the load/store instructions used to access memory, all instructions operate on the registers.

- MIPS II: is an upgrade of MIPS I that removed the load delay slot and added several sets of instructions.

- MIPS III: is a backwards-compatible extension of MIPS II that added 64-bit memory addressing and integer operations.

- MIPS IV: is the fourth version of the architecture. It is a superset of MIPS III and is compatible with all existing versions of MIPS. MIPS IV was designed mainly to improve floating-point (FP) performance.

- MIPS V: added a new data type, the Paired Single (PS), which consisted of two single-precision (32-bit) floating-point numbers stored in the existing 64-bit floating-point registers.

- MIPS32/MIPS64: The first release of MIPS32, based on MIPS II, added conditional moves, prefetch instructions, and other features from the R4000 and R5000 families of 64-bit processors. The first release of MIPS64 adds a MIPS32 mode to run 32-bit code. The MUL and MADD (multiply-add) instructions, previously available in some implementations, were added to the MIPS32 and MIPS64 specifications, as were cache control instructions.

MIPS processors are used in [15]:

- Embedded systems such as residential gateways and routers.

- Personal, workstation and server computers.

- Video game consoles such as the Nintendo 64, Sony PlayStation, Sony PlayStation 2 and Sony PlayStation Portable.

These are some advantages of MIPS ISA [7]:

- Load-store instructions copy data to or from the registers. For example, to store data in memory, the data must already be located in the register file. In the same way, loading data means copying that data into the registers.

— Arithmetic and logic instructions only operate on the register.

— The MIPS user-level integer instruction set comprised just 58 instructions.

This kind of instruction design reduces the complexity of both the instruction set and the hardware, facilitating inexpensive pipelined implementations. However, the MIPS ISA is not attractive for high-performance implementations [7].

- Over 30 years, it has evolved into 400 instructions.

- The ISA is over-optimized for a specific microarchitecture pattern, the five-stage, single-issue, in-order pipeline. Branches and jumps are delayed by one instruction, complicating superscalar and super-pipelined implementations.

- Poor support for Position-Independent Code (PIC) and dynamic linking.

- Multiplication and division use special architectural registers, increasing context size, instruction count, code size, and microarchitectural complexity.

- The floating-point unit is a separate coprocessor.

- Misaligned loads and stores are handled with special instructions.

ARMv7 is a popular 32-bit RISC-inspired ISA and the most widely implemented architecture in the world. However, this ISA cannot be adopted because it is a closed standard [7].

Several technical deficiencies in ARMv7 strongly disinclined us to use it [7]:

- At the time, there was no support for 64-bit addresses, and the ISA lacked hardware support for the IEEE 754-2008 standard (ARMv8 rectified these deficiencies, as discussed in the next section).

- ARMv7 comes with a compressed ISA with fixed-width 16-bit instructions, called Thumb, whose successor is Thumb-2. However, the 32-bit instructions in Thumb-2 are encoded differently from the 32-bit instructions in the base ISA, and the 16-bit instructions in Thumb-2 are encoded differently from the 16-bit instructions in the original Thumb ISA. The instruction decoders effectively need to understand three ISAs, adding to energy, latency, and design cost.

- ARMv7 has many features that complicate implementations.

In summary, ARMv7 is vast and complicated. Between ARM and Thumb, there are over 600 instructions in the integer ISA alone. The integer SIMD and floating-point extensions add hundreds more. Even if it had been legally feasible for us to implement ARMv7, it would have been quite challenging technically.

ARMv8, the new architecture, removed several features of ARMv7 that complicated implementations. For example [7]:

— The PC register is no longer part of the integer register set.

- Instructions are no longer predicated.

- The load-multiple and store-multiple instructions were removed.

- The instruction encoding was regularized.

However, some warts remain, including:

- The use of condition codes and not-quite-general-purpose registers (the link register is implicit and, depending on the context, x31 is either the stack pointer or is hard-wired to zero).

More blemishes were added still, including:

— A massive subword-SIMD architecture that is effectively mandatory.

Overall, the ISA is complex and unwieldy:

- The complete architecture takes 5778 pages to document.

Given that, it is perhaps surprising that important features were left out: for example, the ISA lacks a fused compare-and-branch instruction.

x86 is the most popular instruction set in the laptop, desktop, and server markets. The x86 of 2015 is extremely complex and comprises 1300 instructions. One wonders how much thought went into the design [7]:

- The ISA is not classically virtualizable, since some privileged instructions silently fail in user mode rather than trapping. VMware’s engineers famously worked around this deficiency with complex dynamic binary translation software.

- The ISA has an anaemic register set. The 32-bit architecture, IA-32, has just eight integer registers. Recognizing this deficiency, AMD’s 64-bit extension, x86-64, doubled the number of integer registers to 16.

- Most x86 instructions have only a destructive form that overwrites one of the source operands with the result. Frequently, this necessitates extra moves to preserve values that remain live across destructive instructions.
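The cost of destructive instructions can be illustrated with a small Python sketch. It is a toy model under stated assumptions: the register names and the helpers mov, add2, and add3 are hypothetical and do not correspond to real x86 or RISC-V assembler syntax. The goal is to compute c = a + b while both a and b stay live.

```python
# Toy comparison of two-operand (destructive) and three-operand adds.
# All names here are hypothetical, chosen for illustration only.

def mov(regs, dst, src, log):
    """Copy src into dst."""
    regs[dst] = regs[src]
    log.append(f"mov {dst}, {src}")

def add2(regs, dst, src, log):
    """Destructive form: dst = dst + src, so the old value of dst is lost."""
    regs[dst] = regs[dst] + regs[src]
    log.append(f"add {dst}, {src}")

def add3(regs, rd, rs1, rs2, log):
    """Non-destructive form: rd = rs1 + rs2, both sources are preserved."""
    regs[rd] = regs[rs1] + regs[rs2]
    log.append(f"add {rd}, {rs1}, {rs2}")

# Destructive style: an extra move is needed to keep a intact.
two_regs, two_log = {"a": 3, "b": 4, "c": 0}, []
mov(two_regs, "c", "a", two_log)
add2(two_regs, "c", "b", two_log)

# Three-operand style: one instruction is enough.
three_regs, three_log = {"a": 3, "b": 4, "c": 0}, []
add3(three_regs, "c", "a", "b", three_log)
```

Both sequences leave the same register contents, but the destructive version records two instructions where the three-operand version records one.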

RISC-V (pronounced “risk five”) is a new instruction-set architecture (ISA) that was initially designed to support computer architecture research and education. This ISA has several advantages [8]:

- A completely open ISA that is freely available to academia and industry.

- A real ISA suitable for direct native hardware implementation, not just simulation or binary translation.

- An ISA that avoids “over-architecting” for a particular microarchitecture style (e.g., microcoded, in-order, decoupled, out-of-order) or implementation technology (e.g., full-custom, ASIC, FPGA), but which allows efficient implementation in any of these.

- An ISA separated into a small base integer ISA, usable as a base for customized accelerators or educational purposes, and optional standard extensions to support general-purpose software development.

— Support for the revised 2008 IEEE-754 floating-point standard.

- Support for extensive ISA extensions and specialized variants.

— Both 32-bit and 64-bit address space variants for applications, operating system kernels, and hardware implementations.

- An ISA with support for highly parallel multicore or manycore implementations, including heterogeneous multiprocessors.

— Optional variable-length instructions to expand available instruction encoding space and support an optional dense instruction encoding for improved performance, static code size, and energy efficiency.

— A fully virtualizable ISA to ease hypervisor development.

- An ISA that simplifies experiments with new privileged architecture designs.
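The optional variable-length encoding mentioned in the list above relies on the lowest bits of the first 16-bit parcel to mark the instruction length. The Python sketch below follows the length rule of the RISC-V user-level specification [8] for 16-bit and 32-bit instructions only; the function name is an assumption made for this example, and longer encodings are deliberately left unhandled.

```python
def instruction_length(first_parcel: int) -> int:
    """Return the instruction length in bytes, given its first 16-bit parcel.

    Sketch only: covers the 16-bit (compressed) and 32-bit cases.
    """
    if first_parcel & 0b11 != 0b11:
        return 2   # bits [1:0] != 11 -> 16-bit compressed instruction
    if first_parcel & 0b11100 != 0b11100:
        return 4   # bits [1:0] == 11 and bits [4:2] != 111 -> 32-bit instruction
    raise ValueError("encodings longer than 32 bits are not handled in this sketch")

# 0x0013 is the low half of addi x0, x0, 0; 0x0001 is the compressed nop.
length_of_nop = instruction_length(0x0013)
length_of_c_nop = instruction_length(0x0001)
```

Because the length is known from the first parcel, a fetch unit can find the next instruction boundary without decoding the whole instruction.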

To achieve these advantages, RISC-V is organized as base ISAs plus extensions. Table 5.1 describes these bases and extensions in more detail.

Table 5.1: ISA base and extension [8].

Name     Description                                       Version  Frozen?
RV32I    Base Integer Instruction Set, 32-bit              2.0      Y
RV32E    Base Integer Instruction Set                      1.9      N
RV64I    Base Integer Instruction Set, 64-bit              2.0      Y
RV128I   Base Integer Instruction Set, 128-bit             1.7      N
L        Standard Extension for Decimal Floating-Point
P        Standard Extension for Packed-SIMD Instructions
V        Standard Extension for Vector Operations
N        Standard Extension for User-Level Interrupts

In addition to the base ISAs and extensions, RISC-V uses a small number of simple instruction formats. The RISC-V instruction formats are as follows, with fields listed from bit 31 (left) to bit 0 (right).

Register/Register (R)   funct7        rs2   rs1   funct3   rd            opcode
Immediate (I)           imm[11:0]           rs1   funct3   rd            opcode
Store (S)               imm[11:5]     rs2   rs1   funct3   imm[4:0]      opcode
Branch (SB)             imm[12|10:5]  rs2   rs1   funct3   imm[4:1|11]   opcode
Upper Immediate (U)     imm[31:12]                         rd            opcode
Jump (UJ)               imm[20|10:1|11|19:12]              rd            opcode
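To make the field layouts concrete, the following Python sketch extracts the register fields and reassembles the immediates of a 32-bit instruction word, following the bit positions of the RISC-V user-level specification [8] (imm[31:12] for the U format and imm[20|10:1|11|19:12] for the UJ format). The helper names bits, sign_extend, decode_fields, and imm_* are assumptions made for this example; they do not come from the thesis design.

```python
def bits(word, hi, lo):
    """Return word[hi:lo] (inclusive) as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

def sign_extend(value, width):
    """Interpret a width-bit value as a two's complement number."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign

def decode_fields(word):
    """Fixed-position fields shared by the formats (not all are used by each)."""
    return {
        "opcode": bits(word, 6, 0),
        "rd": bits(word, 11, 7),
        "funct3": bits(word, 14, 12),
        "rs1": bits(word, 19, 15),
        "rs2": bits(word, 24, 20),
        "funct7": bits(word, 31, 25),
    }

def imm_i(word):
    return sign_extend(bits(word, 31, 20), 12)

def imm_s(word):
    return sign_extend(bits(word, 31, 25) << 5 | bits(word, 11, 7), 12)

def imm_sb(word):
    imm = (bits(word, 31, 31) << 12 | bits(word, 7, 7) << 11
           | bits(word, 30, 25) << 5 | bits(word, 11, 8) << 1)
    return sign_extend(imm, 13)

def imm_u(word):
    return sign_extend(word & 0xFFFFF000, 32)

def imm_uj(word):
    imm = (bits(word, 31, 31) << 20 | bits(word, 19, 12) << 12
           | bits(word, 20, 20) << 11 | bits(word, 30, 21) << 1)
    return sign_extend(imm, 21)

# Usage: add x3, x1, x2 is encoded as 0x002081B3.
add_fields = decode_fields(0x002081B3)
```

The source and destination register fields sit at the same bit positions in every format, so a decoder can read the register file before it knows which format applies; only the immediate has to be reassembled per format.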

Title	Research Design RISC-V with Multithreading Micro-Architecture
Author	Le Nguyen Hoang Thien
Supervisor	M.Eng Ho Ngoc Diem
University	University of Information Technology
Major	Computer Engineering
Type	Capstone Project
Year of publication	2021
City	Ho Chi Minh

Pages	192