Graduation thesis in Computer Engineering: Research and design of a multi-core processor based on the RISC-V instruction set architecture


VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY

UNIVERSITY OF INFORMATION TECHNOLOGY

FACULTY OF COMPUTER ENGINEERING

NGUYEN PHAN HOANG PHUC

CAPSTONE PROJECT RESEARCH DESIGN A MULTI-CORE PROCESSOR

BASED ON RISC-V INSTRUCTION SET

ARCHITECTURE

NGHIÊN CỨU THIẾT KẾ VI XỬ LÝ ĐA NHÂN DỰA TRÊN

KIẾN TRÚC TẬP LỆNH RISC-V


NGUYEN PHAN HOANG PHUC - 17520909


BACHELOR OF ENGINEERING

IN COMPUTER ENGINEERING

SUPERVISOR


LIST OF DEFENSE COMMITTEE

The defense committee of the capstone project, established under Decision No. 462/QD-DHCNTT dated July 23, 2021 of the Rector of the University of Information Technology:

Chairman

Secretary

Commissioner

The basis of this research initially stemmed from my passion for developing a better architecture for modern processors. As the world moves further into the digital age, more capable and more robust hardware, especially processors, will be needed to run increasingly complicated programs. Low-performance processors will become outdated and be replaced by newer ones. How can we improve processor performance? It is my passion to find out, and to develop a new method that pushes past the limitations of today's processors.

In truth, I could not have achieved my current level of success without a strong support group. First of all, I thank all the lecturers and staff at the University of Information Technology, Vietnam National University - Ho Chi Minh City, who have taught me a solid foundation of knowledge, especially those in the department of computer engineering, each of whom gave me the opportunity to absorb their expertise and experience. Secondly, I am grateful to my parents, who supported me with love and understanding. Finally, I am deeply grateful to Hồ Ngọc Diễm, who provided patient advice and clear guidance throughout the research process.

Once again, thank you all for your unwavering support.

VIETNAM NATIONAL UNIVERSITY HCMC
UNIVERSITY OF INFORMATION TECHNOLOGY

THE SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness

OUTLINE DETAILS

VIETNAMESE NAME: NGHIÊN CỨU THIẾT KẾ BỘ VI XỬ LÝ ĐA LÕI DỰA TRÊN KIẾN TRÚC TẬP LỆNH RISC-V

ENGLISH NAME: RESEARCH DESIGN A MULTI-CORE PROCESSOR BASED ON RISC-V INSTRUCTION SET ARCHITECTURE

Supervisor: M.Eng Hồ Ngọc Diễm

Start day - End day: From March 08, 2021 to June 25, 2021

Student: Nguyễn Phan Hoàng Phúc - 17520909

Content

1 Overview

Single-core processors are rapidly reaching their limits in complexity and clock speed, and a multi-core processor is the simplest way to raise performance further. Multi-core design has also been the trend in processor development in recent years, because it makes full use of current bleeding-edge semiconductor technology. In addition, the RISC-V instruction set architecture (ISA) is provided without any license fee, which makes it much easier to design your own processor. The RISC-V community is well established, and its processors have been used in numerous SoCs and boards, some of which are commercially available.

Related work is being carried out around the world:

• Single-core RISC-V processors include RVCoreP and the Shakti processor. These are multi-cycle processors that can execute the base instructions, multiplication/division, and floating-point operations; the Shakti processor can also execute atomic instructions. Because it implements the Atomic standard extension, Shakti's operating frequency is dramatically lower than that of RVCoreP, which focuses purely on ALU optimization.

• Multi-core processors include the quad-core processor designed by Manoj Kumar Gouda and D. Yugandhar. The ISA of this processor is not stated, and its memory is organized as centralized shared memory. Another is the many-core processor designed by Ahmed Kamaleldin's team, which has 36 cores and a distributed shared-memory implementation. It executes RV32I and two extensions (RV32M and RV32C). Neither processor implements atomic operations.

• Freely available cores similar to this work include Pulpino RISCY, PicoRV32, and Freedom. These three processors are well known in the RISC-V community and are often used as the core in other projects. They can execute basic arithmetic and logic operations as well as compressed instructions. Their drawback is that they include a lot of auxiliary hardware, which other applications may not need and which reduces performance.

In Viet Nam, there is also research related to the RISC-V ISA:

• The most recent work is a RISC-V processor with supervisor mode, which executes the RV32I base instruction set and other system instructions. This work was carried out by An Xuan Tuan from the Faculty of Computer Engineering. Although this processor can handle exceptions and supports supervisor mode, it is implemented only as a single-core processor, and its operating frequency is limited to 12.94 MHz.

The first motivation for this research is to apply atomic instructions, a straightforward way to maintain synchronization in a multi-core processor that none of the multi-core processors above implements, and to measure the resulting processor performance. Furthermore, no research on designing a multi-core processor based on the RISC-V ISA has been recorded in Viet Nam; this work will help Viet Nam catch up with worldwide progress in integrated circuit design in general and in RISC-V processors in particular.

2 Project objective

Design a 32-bit multi-cycle processor capable of executing the basic operations of the RV32I instruction set and the RV32A atomic extension.

Design an arbitrator that handles memory access violations that may occur during execution.

Verify the design using industrial methods.

The operating frequency of the processor is at least 100 MHz.

Simulation method: run simulations in ModelSim with test cases generated randomly by the RTG (random test generator) unit.

Practical method: implement and verify the design on an FPGA board (DE2 kit).

4 Main content

Design the processor core, including the PC, register file, ALU, hazard detection unit, branching unit, and forwarding units.

Design the data memory and instruction memory.

Design the arbitrator unit.

Simulate the design in ModelSim.

Test on the FPGA board.

5 Implementation plan

Research RISC-V architecture: March 08, 2021 to March 14, 2021

Hồ Ngọc Diễm

Nguyễn Phan Hoàng Phúc

TABLE OF CONTENTS

Chapter 1 Introduction
1.1 Introduction to instruction set architecture
1.1.1 RISC-V architecture overview
1.1.2 Comparison of instruction set architectures
1.2 Details in RISC-V instruction set architecture
1.3 Introduction to microarchitecture
1.3.1 Multi-core processor overview
1.3.2 Microarchitecture comparison
1.4 Memory coherence & consistency problem
1.5 CPU system design
Chapter 2 System design
2.1 Software design
2.2 Hardware design
Chapter 3 Detail Design
3.1 Software design
3.1.1 Data initialization
3.1.2 Creating label hash table
3.1.3 Reformat & splitting instruction
3.1.4 Verify and convert instruction
3.1.5 Export the result to output files
3.2 Hardware design
3.2.2 Data memory design
3.2.3 Arbiter
3.2.4 Processor core
Chapter 4 Verification & Evaluation
4.1 Verification plan
4.2 Directed test verification
4.2.1 Test case generation
4.2.2 Test case manipulation
4.2.3 Automation verification method
4.3 Random test verification
4.4 Riscv-tests verification
Appendix A Introduction other instruction set architecture
A.1 X86 architecture overview
A.2 ARM architecture overview
A.3 MIPS architecture overview
Appendix B Introduction other microarchitecture
B.1 Very long instruction word processor
B.2 Multithreaded processor
B.3 Out-of-order processor
References

LIST OF FIGURES

Figure 1.1: RV32I instruction formats
Figure 1.2: Example of CPU system design
Figure 2.1: The general system design
Figure 2.2: General algorithm of assembler
Figure 2.3: General block diagram
Figure 3.1: Flow chart for creating label hash table
Figure 3.2: Flow chart of splitting
Figure 3.3: Flow chart of convert and verify instruction
Figure 3.4: The design of instruction memory
Figure 3.5: Data memory schematic
Figure 3.6: Data memory partition
Figure 3.7: Schematic of sub_data_mem
Figure 3.8: Schematic of arbiter
Figure 3.9: sub_arbitrator schematic
Figure 3.10: The pipeline structure
Figure 3.11: The IF stage schematic
Figure 3.12: Schematic of the ID stage
Figure 3.13: Schematic of the controller
Figure 3.14: Schematic of the B stage
Figure 3.15: Schematic of the EX stage
Figure 3.16: Schematic of the MEM stage
Figure 3.17: Schematic of the WB stage
Figure 4.1: Verification process
Figure 4.2: Test case generation flow
Figure 4.3: Data generator algorithm
Figure 4.4: File generator algorithm
Figure 4.5: Automation verification process
Figure 4.6: Automation verification process (Cont.)
Figure 4.7: The LUI instruction execution
Figure 4.8: The BNE instruction execution
Figure 4.9: The LR/SC sequence execution
Figure 4.10: A part of post-synthesize simulation waveform
Figure 4.11: Contents of data memory
Figure 4.12: Schematic of top level for FPGA implementation

LIST OF TABLES

Table 1.1: Status of ISA base and extensions
Table 1.2: Pros and cons of each ISA [3]
Table 1.3: RISC-V register conventions
Table 1.4: RISC-V assembly language [2]
Table 1.5: Advantages and drawbacks of mentioned microarchitectures
Table 1.6: Relaxed consistency models
Table 2.1: Bus system of the design
Table 3.1: I/O interface of instruction memory design
Table 3.2: I/O interface of data memory design
Table 3.3: I/O interface of sub_data_memory design
Table 3.4: I/O interface of sub_arbitrator
Table 3.5: Writable boundaries corresponding to each core
Table 3.6: I/O interface of arbitrator_controller
Table 3.7: Memory operation rules
Table 3.8: Instruction decoding table
Table 3.9: I/O interface of the IF stage
Table 3.10: I/O interface of the ID stage
Table 3.11: Hazard scenarios
Table 3.12: I/O interface of the controller
Table 3.13: control_field signal encoding
Table 3.14: Instructions are decoded by group
Table 3.15: I/O interface of the B stage
Table 3.16: I/O interface of the EX stage
Table 3.17: Operation decoding
Table 3.18: alu_control_out signal definition
Table 3.19: I/O interface of the MEM stage
Table 3.20: I/O interface of the WB stage
Table 4.1: Verification plan
Table 4.2: RV32A violation simulation result
Table 4.3: RTG verification result
Table 4.4: Verification result
Table 4.5: Synthesis parameter configuration
Table 4.6: Fmax of the processor
Table 4.7: Utilization summary
Table 4.8: The comparison to pipeline processors
Table 4.9: The comparison to other multi-core processors
Table 4.10: The comparison to popular RISC-V cores
Table 4.11: Pros and cons of the dual-core processor

List of Acronyms

ISA      Instruction set architecture
RISC     Reduced instruction set computer
RV32I    RISC-V 32-bit Integer
RV32A    RISC-V 32-bit Atomic
SMP      Symmetric multiprocessor
DSM      Distributed shared memory
RMWs     Read-Modify-Write instructions
AMOs     Atomic memory operations
FPGA     Field-programmable gate array
PC       Program counter
UART     Universal asynchronous receiver-transmitter
RTG      Random test generator

The world is always looking for new processor architectures to speed up computers, and many technology companies have invested heavily in methods to accelerate their processors. Multi-core design is one of the most effective of these techniques and is essential for any modern system: it is a practical and straightforward way to raise processor performance at a time when developing an entirely new hardware architecture is increasingly challenging and costly.

RISC-V is an open-source instruction set architecture that allows researchers to develop their own processors without any license fee. The RISC-V ecosystem is now large and offers enough information for a new developer, and RISC-V is becoming a potential competitor to ARM and Intel.

Hence, this project implements a dual-core processor based on the RISC-V architecture, with RV32I instructions for integer operations and RV32A instructions for synchronization. One problem to be dealt with is keeping the data memory valid, because it is shared by the two cores. Furthermore, each core is built as a 6-stage pipeline to maximize performance, so handling hazards is a critical task. In addition, this project develops some supporting tools for the processor and follows industrial processes in order to become familiar with them.

The results are promising: the dual-core processor runs stably as expected and executes every assembly program correctly in the simulation environment, and no flaw was found in memory violation handling. The project can also be extended into an out-of-order or superscalar processor.

Problem statement

The microprocessor industry has played a vital role in technological advancement since microprocessors first appeared in the 1970s. Growing market demand for faster performance drove the industry to produce more powerful and smarter devices. The classic technique for achieving this is to run the chip at a higher frequency, which lets the processor execute more work in the same period; this trend was in full bloom from 1983 to 2002. Researchers have since found additional ways to improve performance by exploiting parallel execution, including parallel processing, data-level parallelism, and instruction-level parallelism. These methods have all proved very useful, and the multi-core processor is another technique that improves performance substantially. In fact, the multi-core microarchitecture has existed for the past decade; it has gained importance today because single-core processors face technical limitations in delivering high throughput and long battery life with high energy efficiency [1].

Driven by a performance-hungry market, microprocessor designers have always kept performance and cost in mind. Gordon Moore, a co-founder of Intel Corporation, predicted that the number of transistors on a chip would double about every two years (often quoted as every 18 months), a prediction popularly known in the semiconductor industry as Moore's Law. With bleeding-edge chip fabrication technology, integrated circuit processing makes it possible to integrate one billion transistors on a chip, enhancing performance by increasing integration density.

However, bleeding-edge chip fabrication techniques regularly come with major bottlenecks and power dissipation issues. The performance gain from microarchitecture follows Pollack's rule: it is roughly proportional to the square root of the increase in complexity. This means that doubling the logic on a processor improves performance by only about 40%. Studies have also shown that the more the chip feature size shrinks, the more leakage increases, which raises static power dissipation considerably. The other means of improving performance mentioned above is raising the operating frequency, but the frequency is currently limited to about 4 GHz (power dissipation increases again if the frequency goes beyond that level) [1].
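The 40% figure follows directly from the square-root relationship. The short Python sketch below works through the arithmetic; the function names are illustrative only, and the multi-core figure is an idealized upper bound that assumes a perfectly parallel workload with no synchronization or memory contention.

```python
import math


def pollack_speedup(complexity_ratio: float) -> float:
    """Single-core speedup predicted by Pollack's rule.

    Performance is taken to grow with the square root of the increase
    in logic complexity (complexity_ratio = new complexity / old).
    """
    if complexity_ratio <= 0:
        raise ValueError("complexity_ratio must be positive")
    return math.sqrt(complexity_ratio)


def ideal_multicore_speedup(cores: int) -> float:
    """Idealized upper bound for a perfectly parallel workload.

    Assumption: no synchronization cost and no memory contention.
    """
    return float(cores)


if __name__ == "__main__":
    bigger_core = pollack_speedup(2.0)      # one core with twice the logic
    two_cores = ideal_multicore_speedup(2)  # the same logic spent on a second core
    print(f"one core with 2x logic: {bigger_core:.2f}x")
    print(f"two cores, ideal case:  {two_cores:.2f}x")
```

Doubling the logic of one core gives a factor of about 1.41, which the text rounds to a 40% improvement, whereas spending the same logic on a second core can approach a factor of 2 when the workload parallelizes well.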

Performance is still a major design objective in semiconductor manufacturing, but other essential considerations include chip fabrication cost, fault tolerance, power efficiency, and heat dissipation. This has led to the development of multi-core processors as an effective way to address these challenges [1].

Thus, research into and design of higher-performance multi-core processors are necessary to innovate modern chips. This project is based on the RISC-V instruction set architecture (an open-source ISA developed at the University of California, Berkeley, and by RISC-V volunteers) to build a basic multi-core processor with two cores. The processor is then synthesized with Quartus and simulated in ModelSim to verify the correctness of the design. The verification process is run automatically by tools I developed myself, including an instruction generator, which generates random test cases, and a simulation supporter, which shortens simulation time and returns results faster.

Chapter 1 Introduction

The first chapter introduces the three main components of computer architecture (instruction set architecture (ISA), microarchitecture, and system design) and their actual implementations, and uses that information to explain why this project implements a multi-core processor based on RISC-V principles.

1.1 Introduction to instruction set architecture

An instruction set architecture, also simply called the architecture, is an abstract interface between the hardware and the lowest-level software that encompasses all the information necessary to write a machine language program that will run correctly, including instructions, registers, memory access, I/O, and so on [2].

Both hardware and software consist of hierarchical layers using abstraction, with each lower layer hiding details from the level above. One key interface between the levels of abstraction is the instruction set architecture: the interface between the hardware and low-level software. This abstract interface enables many implementations of varying cost and performance to run identical software [2].

Many popular ISAs are implemented on chips today, such as x86, ARMv7, ARMv8, and MIPS, but they still have flaws and need improvement. That is why a new ISA like RISC-V was developed: to remove its predecessors' drawbacks. The next subsections introduce each of the ISAs mentioned above in more detail and compare their advantages and disadvantages.

1.1.1 RISC-V architecture overview

RISC-V is an open ISA based on RISC principles like MIPS, except that the RISC-V ISA is offered under open-source licenses that do not require any fee to use. Many companies, such as Nvidia and Western Digital, are producing or have introduced RISC-V hardware, and open-source operating systems with RISC-V support are available.

It is structured as a small base ISA with various optional extensions. The base ISA is very simple, making RISC-V suitable for research and education, but complete enough to be a suitable ISA for inexpensive, low-power embedded devices. The optional extensions form a more powerful ISA for general-purpose and high-performance computing:

• The standard integer multiplication and division extension is named “M” and adds instructions to multiply and divide values held in the integer registers.

• The standard atomic instruction extension, denoted by “A”, adds instructions that atomically read, modify, and write memory for inter-processor synchronization.

• The standard single-precision floating-point extension, denoted by “F”, adds floating-point registers, single-precision computational instructions, and single-precision loads and stores.

• The standard double-precision floating-point extension, denoted by “D”, expands the floating-point registers and adds double-precision computational instructions, loads, and stores.

• The standard “C” compressed instruction extension provides narrower 16-bit forms of common instructions.

• Other extensions (L, B, J, T, ...) have been introduced, but they are only draft versions and need time to develop (see other extensions in detail in Table 1.1).
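The base-plus-extensions structure is reflected in how RISC-V configurations are usually named: the register width, the base ISA letter, then one letter per extension. The following Python sketch splits such a name into its parts. It is a simplified illustration, not part of the thesis tooling: `parse_isa_name` is a hypothetical helper, and it handles only single-letter extensions (multi-letter names such as Zifencei are out of scope).

```python
import re

# RV + register width + base ISA letter + zero or more single-letter extensions.
_ISA_NAME = re.compile(r"RV(32|64|128)([IE])([A-Z]*)")


def parse_isa_name(name: str):
    """Split a name such as 'RV32IA' into (xlen, base, extensions).

    Simplifying assumption: only single-letter extensions are accepted.
    """
    match = _ISA_NAME.fullmatch(name.strip().upper())
    if match is None:
        raise ValueError(f"unsupported ISA name: {name!r}")
    xlen = int(match.group(1))
    base = match.group(2)
    extensions = list(match.group(3))
    return xlen, base, extensions


if __name__ == "__main__":
    # The combination targeted by this project: 32-bit base integer ISA plus atomics.
    print(parse_isa_name("RV32IA"))
```

For the instruction set targeted by this project, `parse_isa_name("RV32IA")` yields a 32-bit width, the base `I`, and the single extension `A`.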

Table 1.1: Status of ISA base and extensions

Base | Version | Status
RVWMO | 2.0 | Ratified
RV32I | 2.1 | Ratified
RV64I | 2.1 | Ratified
RV32E | 1.9 | Draft
...
Zifencei | 2.0 | Ratified
Zam | 0.1 | Draft
Ztso | 0.1 | Frozen

1.1.2 Comparison of instruction set architectures

x86 is a CISC architecture, which sets it apart from ARM, MIPS, and RISC-V; these are all varieties of RISC designs. In more detail, the ARM, MIPS, and RISC-V architectures are based on a common type of architecture, the “load-store architecture” or “register-register architecture”, meaning that data-processing operations operate on register contents and only load and store instructions can access memory.

Each architecture still has its pros and cons. Table 1.2 lists some strengths and drawbacks of each ISA and of the processors built on it.

Table 1.2: Pros and cons of each ISA [3]

x86
Pros:
• All popular software has been ported to it or was developed for it.
• It is the most popular instruction set in the laptop, desktop, and server markets.

ARMv7
Pros:
• ARMv7 defines 3 classes (A, R, M), each for a specific application.
• A great quantity of software has been ported to the ISA, and it is ubiquitous in embedded and mobile devices.
• ARMv7 is by far the most widely implemented architecture in the world.
Cons:
• No support for 64-bit addresses.
• ARMv7 is vast and complicated to implement.
• License requirement.
• ARMv7 is not classically virtualizable.

ARMv8
Pros:
• ARMv8 was extended to 64-bit; the instruction set was designed from scratch, fixing many old issues.
• Backward compatibility.
Cons:
• The ISA is still complex and unwieldy.
• No support for a compressed instruction encoding.
• License requirement.

MIPS
Pros:
• All opcodes are 4 bytes, which simplifies the instruction decoder.
• The MIPS user-level integer instruction set comprises just 58 instructions.
• The complexity of the instruction set and hardware is reduced, facilitating inexpensive pipelined implementations.
Cons:
• MIPS has poor encoding.
• The ISA is over-optimized for a specific microarchitectural pattern, the five-stage, single-issue, in-order pipeline.
• License requirement.

RISC-V
Pros:
• Good instruction encoding; the same decoder can be used for multiple instruction formats.
• Wide range of communities and support tools.
• RISC-V is open source.
• Highly scalable.
Cons:
• Many extensions are not yet standard versions and need time to be optimized and developed.

The x86, ARM, and MIPS architectures lack several technical features and require license fees to work with their ISAs. That is why I chose RISC-V for my project: a free and open ISA that avoids the technical drawbacks of the older ISAs and is straightforward to implement in many microarchitectural styles. The next section introduces the RISC-V architecture in more detail.

1.2 Details in RISC-V instruction set architecture

RISC-V was defined to support research in data-parallel architectures; the Roman numeral ‘V’ also conveniently served as an acronymic pun for “Vector”. The aim of RISC-V was to make an ISA suitable for nearly any computing device, and its designers set out the following specific technical goals when they started to develop the ISA [3]:

• Separate the ISA into a small base ISA and optional extensions

• Support both 32-bit and 64-bit address spaces

• Facilitate custom ISA extensions

• Support variable-length instruction set extensions

• Provide efficient hardware support for modern standards

• Orthogonalize the user ISA and privileged architecture

This project implements RV32I, the base 32-bit integer ISA. It is a simple instruction set that keeps the number of mandatory user-level hardware instructions to 40. Like other RISC instruction sets, these instructions are divided into three categories (computation, control flow, and memory access). RISC-V is a load-store architecture: arithmetic instructions operate only on registers, and only loads and stores transfer data to and from memory [3].

The addressing modes of the RISC-V instructions are the following [2]:

• Immediate addressing, where the operand is a constant within the instruction itself

• Register addressing, where the operand is a register

• Base or displacement addressing, where the operand is at the memory location whose address is the sum of a register and a constant in the instruction

• PC-relative addressing, where the branch address is the sum of the PC and a constant in the instruction
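The four addressing modes listed above can be sketched as small Python functions. This is an illustrative software model only, under the assumption of a 32-bit address space with wrap-around; the function names are hypothetical and do not correspond to modules of the processor.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit values, arithmetic wraps around


def immediate_operand(imm: int) -> int:
    """Immediate addressing: the operand is the constant in the instruction."""
    return imm & MASK32


def register_operand(regs: list, rs: int) -> int:
    """Register addressing: the operand is the content of register rs."""
    return regs[rs]


def base_displacement_address(regs: list, rs1: int, offset: int) -> int:
    """Base/displacement addressing: address = register + constant."""
    return (regs[rs1] + offset) & MASK32


def pc_relative_target(pc: int, offset: int) -> int:
    """PC-relative addressing: branch target = PC + constant."""
    return (pc + offset) & MASK32


if __name__ == "__main__":
    regs = [0] * 32
    regs[6] = 0x1000
    # Address used by "lw x5, 40(x6)" when x6 holds 0x1000.
    print(hex(base_displacement_address(regs, 6, 40)))
    # Target of a branch with offset 100 located at PC = 0x200.
    print(hex(pc_relative_target(0x200, 100)))
```

For example, a load of the form `lw x5, 40(x6)` uses base addressing (x6 plus 40), while a branch such as `beq x5, x6, 100` uses PC-relative addressing (PC plus 100).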

Table 1.3 shows the 32 registers, each 32 bits wide and each assigned a specific use. Register x0 is hardwired with all bits equal to 0. General-purpose registers x1 to x31 hold values that various instructions interpret as a collection of Boolean values, or as two's complement signed binary integers or unsigned binary integers. The only additional register is the program counter, pc, which holds the address of the current instruction [4].

Table 1.3: RISC-V register conventions

Name | Register number | Usage | Preserved on call?
x0 | 0 | The constant value 0 | n.a.
x1 (ra) | 1 | Return address (link register) | Yes
x2 (sp) | 2 | Stack pointer | Yes
x3 (gp) | 3 | Global pointer | Yes
x4 (tp) | 4 | Thread pointer | Yes
x5-x7 | 5-7 | Temporaries | No
x8-x9 | 8-9 | Saved | Yes
x10-x17 | 10-17 | Arguments/results | No
x18-x27 | 18-27 | Saved | Yes
x28-x31 | 28-31 | Temporaries | No
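Two properties of the register set described above are easy to get wrong in a model or testbench: writes to x0 must be ignored, and the same 32-bit pattern can be read as either a signed or an unsigned integer. The Python sketch below illustrates both. It is a behavioral model written for illustration (the class and helper names are hypothetical), not the register file RTL of this design.

```python
MASK32 = 0xFFFFFFFF


class RegisterFile:
    """Behavioral model of the 32 x 32-bit RV32I integer registers."""

    def __init__(self):
        self._regs = [0] * 32

    def read(self, index: int) -> int:
        return self._regs[index]

    def write(self, index: int, value: int) -> None:
        # x0 is hardwired to zero, so writes to it are silently dropped.
        if index != 0:
            self._regs[index] = value & MASK32


def as_signed(value: int) -> int:
    """Interpret a 32-bit pattern as a two's complement signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


if __name__ == "__main__":
    rf = RegisterFile()
    rf.write(0, 123)   # ignored
    rf.write(5, -1)    # stored as the pattern 0xFFFFFFFF
    print(rf.read(0), hex(rf.read(5)), as_signed(rf.read(5)))
```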

The forty instructions are encoded in six instruction formats: R, I, S, B, U, and J (shown in Figure 1.1). In these formats, instructions source up to two register operands, identified by rs1 and rs2, and produce up to one result, identified by rd. A significant feature of this encoding is that these register specifiers, when present, always occupy the same position in the instruction. This property allows register fetch to proceed in parallel with instruction decoding.

R-type: funct7 | rs2 | rs1 | funct3 | rd | opcode
I-type: imm[11:0] | rs1 | funct3 | rd | opcode
S-type: imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode
B-type: imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode
U-type: imm[31:12] | rd | opcode
J-type: imm[20] | imm[10:1] | imm[11] | imm[19:12] | rd | opcode

Figure 1.1: RV32I instruction formats
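Because the register specifiers sit at fixed bit positions, every field can be sliced out of the 32-bit instruction word without first knowing the format. The Python sketch below shows this with the standard RV32I bit positions (opcode in bits 6:0, rd in 11:7, funct3 in 14:12, rs1 in 19:15, rs2 in 24:20, funct7 in 31:25). The function name is illustrative, and the sketch is not the decoder of this design.

```python
def decode_fields(instr: int) -> dict:
    """Extract the fixed-position fields of a 32-bit RV32I instruction word.

    Which fields are meaningful depends on the format; for example,
    rs2 and funct7 carry immediate bits in an I-type instruction.
    """
    return {
        "opcode": instr & 0x7F,          # bits 6:0
        "rd": (instr >> 7) & 0x1F,       # bits 11:7
        "funct3": (instr >> 12) & 0x7,   # bits 14:12
        "rs1": (instr >> 15) & 0x1F,     # bits 19:15
        "rs2": (instr >> 20) & 0x1F,     # bits 24:20
        "funct7": (instr >> 25) & 0x7F,  # bits 31:25
    }


if __name__ == "__main__":
    # 0x007302B3 is the R-type encoding of "add x5, x6, x7".
    fields = decode_fields(0x007302B3)
    for name, value in fields.items():
        print(f"{name:7s} = {value}")
```

Since rs1 and rs2 never move, a register file read can start from these bit slices while the opcode is still being decoded.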

Another feature of this encoding scheme is that generating the immediate operand from the instruction word is inexpensive. Of the 32 bits in the immediate operand, seven always come from the same position in the instruction, including the sign bit, which, due to its high fan-out, is the most critical. 24 more bits come from one of two positions, and the final immediate bit has three sources. The SB and UJ formats (the B and J formats), which have their immediates scaled by a factor of two, rotate the bits in the immediate rather than using hardware MUXes to do so, as was the case in MIPS, SPARC, and Alpha. This design reduces hardware cost for low-end implementations that reuse the ALU data path to compute branch targets [3].
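The immediate layouts can be made concrete with a small Python sketch that rebuilds the sign-extended immediate for each format from the standard RV32I bit positions. The helper names are illustrative; this is a software reference model of the bit shuffling, not the hardware immediate generator of this design.

```python
def sign_extend(value: int, bits: int) -> int:
    """Sign-extend the low `bits` bits of value to a Python int."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def imm_i(instr: int) -> int:
    return sign_extend(instr >> 20, 12)                      # imm[11:0]


def imm_s(instr: int) -> int:
    imm = ((instr >> 25) << 5) | ((instr >> 7) & 0x1F)       # imm[11:5], imm[4:0]
    return sign_extend(imm, 12)


def imm_b(instr: int) -> int:
    imm = (((instr >> 31) & 0x1) << 12)                      # imm[12]
    imm |= ((instr >> 7) & 0x1) << 11                        # imm[11]
    imm |= ((instr >> 25) & 0x3F) << 5                       # imm[10:5]
    imm |= ((instr >> 8) & 0xF) << 1                         # imm[4:1]; bit 0 is 0
    return sign_extend(imm, 13)


def imm_u(instr: int) -> int:
    return sign_extend(instr & 0xFFFFF000, 32)               # imm[31:12] << 12


def imm_j(instr: int) -> int:
    imm = (((instr >> 31) & 0x1) << 20)                      # imm[20]
    imm |= ((instr >> 12) & 0xFF) << 12                      # imm[19:12]
    imm |= ((instr >> 20) & 0x1) << 11                       # imm[11]
    imm |= ((instr >> 21) & 0x3FF) << 1                      # imm[10:1]; bit 0 is 0
    return sign_extend(imm, 21)


if __name__ == "__main__":
    print(imm_i(0xFFF30293))       # addi x5, x6, -1
    print(imm_b(0x06628263))       # beq x5, x6, 100
    print(hex(imm_u(0x123452B7)))  # lui x5, 0x12345
```

In every format the sign bit comes from instruction bit 31, which is why sign extension never needs a multiplexer.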


In highly parallel systems such as this project, where a memory word can be contended by two different cores, atomic memory instructions are needed to update a memory word in data memory safely. For this task, RISC-V provides the ‘A’ standard extension, which includes LR/SC (load-reserved/store-conditional) and several atomic memory operations. These operations perform an arithmetic or logic operation on a memory word and return the old value (signed and unsigned minimum and maximum; bitwise AND, OR, and XOR; addition; and swap). The RV32I and RV32A instruction sets are described more precisely in Table 1.4.
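The behavior of LR/SC and of an atomic memory operation can be illustrated with a small software model. The Python sketch below is a simplified assumption-laden model, not the arbiter of this design: it keeps one reservation per core, lets any write to an address cancel all reservations on it, and returns 0 from a successful store-conditional and 1 from a failed one. All class and method names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


class SharedMemory:
    """Word memory shared by several cores, with one LR/SC reservation per core."""

    def __init__(self):
        self.words = {}          # address -> 32-bit value
        self.reservations = {}   # core id -> reserved address

    def load(self, addr: int) -> int:
        return self.words.get(addr, 0)

    def store(self, addr: int, value: int) -> None:
        """Write a word and cancel every reservation on that address."""
        self.words[addr] = value & MASK32
        self.reservations = {
            core: a for core, a in self.reservations.items() if a != addr
        }

    def load_reserved(self, core: int, addr: int) -> int:
        """lr.w: load the word and register a reservation for this core."""
        self.reservations[core] = addr
        return self.load(addr)

    def store_conditional(self, core: int, addr: int, value: int) -> int:
        """sc.w: write only if the reservation is still valid (0 = success)."""
        if self.reservations.get(core) != addr:
            self.reservations.pop(core, None)
            return 1
        self.store(addr, value)
        return 0

    def amoadd(self, addr: int, value: int) -> int:
        """amoadd.w: add to the memory word in one step, return the old value."""
        old = self.load(addr)
        self.store(addr, old + value)
        return old


def atomic_increment(mem: SharedMemory, core: int, addr: int) -> int:
    """Retry loop built from LR/SC, as software would write it."""
    while True:
        old = mem.load_reserved(core, addr)
        if mem.store_conditional(core, addr, old + 1) == 0:
            return old
```

If two cores both load-reserve the same word, only the first store-conditional succeeds; the second core sees a failure code and must retry, which is what the loop in `atomic_increment` does.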

Table 1.4: RISC-V assembly language [2]

Instruction | Example | Meaning | Comments
Load word | lw x5, 40(x6) | x5 = Memory[x6 + 40] | Word from memory to register
Store word | sw x5, 40(x6) | Memory[x6 + 40] = x5 | Word from register to memory
Load halfword | lh x5, 40(x6) | x5 = Memory[x6 + 40] | Halfword from memory to register
Load halfword, unsigned | lhu x5, 40(x6) | x5 = Memory[x6 + 40] | Unsigned halfword from memory to register
Store halfword | sh x5, 40(x6) | Memory[x6 + 40] = x5 | Halfword from register to memory
Load byte | lb x5, 40(x6) | x5 = Memory[x6 + 40] | Byte from memory to register
Load byte, unsigned | lbu x5, 40(x6) | x5 = Memory[x6 + 40] | Unsigned byte from memory to register
Store byte | sb x5, 40(x6) | Memory[x6 + 40] = x5 | Byte from register to memory
Load reserved | lr.w x5, (x6) | x5 = Memory[x6] | Load; 1st half of an atomic swap sequence
Store conditional | sc.w x7, x5, (x6) | Memory[x6] = x5; x7 = 0/1 | Store; 2nd half of an atomic swap sequence
Load upper immediate | lui x5, 0x12345 | x5 = 0x12345000 | Loads 20-bit constant shifted left 12 bits
...
Inclusive or immediate | ori x5, x6, 20 | x5 = (x6 | 20) | Bit-by-bit OR reg with constant
Exclusive or immediate | ... | ... | Bit-by-bit XOR reg with constant
...
Branch if equal | beq x5, x6, 100 | if (x5 == x6) go to PC+100 | PC-relative branch if registers equal
Branch if not equal | bne x5, x6, 100 | if (x5 != x6) go to PC+100 | PC-relative branch if registers not equal
Branch if less than | blt x5, x6, 100 | if (x5 < x6) go to PC+100 | PC-relative branch if registers less
Branch if greater or equal | bge x5, x6, 100 | if (x5 >= x6) go to PC+100 | PC-relative branch if registers greater or equal
Branch if less, unsigned | bltu x5, x6, 100 | if (x5 < x6) go to PC+100 | PC-relative branch if registers less
Branch if greater/eq, unsigned | bgeu x5, x6, 100 | if (x5 >= x6) go to PC+100 | PC-relative branch if registers greater or equal
Jump and link | jal x1, 100 | x1 = PC+4; go to PC+100 | PC-relative procedure call
Jump and link register | jalr x1, 100(x5) | x1 = PC+4; go to x5+100 | Procedure return; indirect call
AMO swap | amoswap.w x1, x2, (x3) | x1 = Memory[x3]; Memory[x3] = x2 | Load word from memory to register, then swap word from register to memory
AMO add | amoadd.w x1, x2, (x3) | x1 = Memory[x3]; Memory[x3] = x2 + Memory[x3] | Load word from memory to register, then store the addition result to memory
AMO xor | amoxor.w x1, x2, (x3) | x1 = Memory[x3]; Memory[x3] = x2 ^ Memory[x3] | Load word from memory to register, then store the XOR result to memory
AMO min | amomin.w x1, x2, (x3) | x1 = Memory[x3]; if (Memory[x3] > x2) Memory[x3] = x2 | Load word from memory to register, then store the smaller value to memory
AMO max | amomax.w x1, x2, (x3) | x1 = Memory[x3]; if (Memory[x3] < x2) Memory[x3] = x2 | Load word from memory to register, then store the greater value to memory

Trang 33

Load word from

l l xl = Memory[x3]; if | memory to register,AMO min amominu.w

l (Memory[x3] >x2) | then store the

unsigned x1, x2, (x3)

Memory[x3] = x2 unsigned smaller

value to memory

Load word from

xl = Memory[x3]; if | memory to register,

AMO max amomaxu.w

l (Memory[x3]< x2) | then store the

unsigned x1, x2, (x3)

Memory[x3] = x2 unsigned greater value

to memory
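The read-modify-write behavior of the AMO rows in Table 1.4 can be summarized in a few lines of Python. This is a behavioral sketch under simplifying assumptions, not the hardware designed in this project: memory is modeled as a dictionary of 32-bit words, each call is treated as indivisible, and the names `AMO_OPS`, `amo`, and `to_signed` are hypothetical. `amoand.w` and `amoor.w` are included because the ‘A’ extension defines bitwise AND and OR alongside XOR.

```python
MASK32 = 0xFFFFFFFF


def to_signed(word: int) -> int:
    """Reinterpret a 32-bit word as a signed two's-complement value."""
    word &= MASK32
    return word - (1 << 32) if word & 0x80000000 else word


# Each entry computes the new memory value from (old memory word, rs2 value).
AMO_OPS = {
    "amoswap.w": lambda old, src: src,
    "amoadd.w": lambda old, src: old + src,
    "amoxor.w": lambda old, src: old ^ src,
    "amoand.w": lambda old, src: old & src,
    "amoor.w": lambda old, src: old | src,
    "amomin.w": lambda old, src: old if to_signed(old) <= to_signed(src) else src,
    "amomax.w": lambda old, src: old if to_signed(old) >= to_signed(src) else src,
    "amominu.w": lambda old, src: min(old, src),
    "amomaxu.w": lambda old, src: max(old, src),
}


def amo(memory: dict, op: str, addr: int, src: int) -> int:
    """Apply one AMO to memory[addr] and return the old word (the rd result)."""
    old = memory.get(addr, 0) & MASK32
    memory[addr] = AMO_OPS[op](old, src & MASK32) & MASK32
    return old
```

With `mem = {0x10: 5}`, the call `amo(mem, "amoadd.w", 0x10, 3)` returns the old value 5 and leaves 8 in `mem[0x10]`. The signed and unsigned variants differ only in how the two words are compared, which is why `0xFFFFFFFF` counts as the minimum for `amomin.w` but as the maximum for `amomaxu.w`.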

1.3 Introduction to microarchitecture

Microarchitecture (also known as computer organization) defines the processor's organization, including the major functional units, their interconnection, and control [2], as well as how a given instruction set architecture is implemented in a processor. The following subsections introduce some common architectures that are often implemented in modern processors, including very-long-instruction-word (VLIW) processors, multithreaded processors, out-of-order processors, and multi-core processors.

1.3.1 Multi-core processor overview

A multi-core processor is an architecture in which multiple processor cores are integrated on a single die. A processor core is a processing component of the processor that can independently fetch and execute instructions from at least one instruction stream. A core typically contains logic components such as an instruction fetch unit, a program counter (PC), an instruction scheduler, functional units, and a register file. Multi-core architecture is a relatively recent design, having emerged only in the last decade of four decades of microprocessor history [5].

The emergence of multi-core architecture marked a significant turning point in the evolution of parallel computer architectures. They play a vital role both in large, powerful, and expensive computer systems and in the mainstream architecture of servers, desktops, and even mobile devices such as cell phones. Before 2001, parallel computers were mainly used in servers and supercomputers. Client machines (desktops, laptops, and mobile devices) were single-core systems. For various reasons, 2001 marked a turning point, when the first multi-core chip, the IBM Power4, was shipped. The Power4 was the first non-embedded microprocessor that combined two cores on a single die, with the cores tightly integrated to support parallel computation. The three decades before 2001 saw a design approach in which a single core became more complex and faster, whereas the decade since 2001 has seen a design approach in which multiple processor cores are implemented in a single chip [5].

One of the major trends that drove the advancement of multi-core architecture is the shrinking of transistors, which allows more and more transistors to be placed on a single die. The pace of transistor integration has been tremendous [5].

1.3.2 Microarchitecture comparison

The four computer architectures introduced above are the most classical approaches to achieving higher processor performance; however, each of them has disadvantages that architects must trade off against the performance gained (more detail in Table 1.5). Moreover, modern processors can use many other techniques (superscalar processors, vector processors, and so on) to enhance their performance.

Table 1.5: Advantages and drawbacks of mentioned microarchitectures

| Microarchitecture | Advantages | Disadvantages |
|---|---|---|
| VLIW | Reduced hardware complexity [6]; low power consumption [6]; suitable for handling compression/decompression of image and speech data [6] | Higher compiler complexity [6]; challenging to manage backward compatibility |
| Multithreaded | Improved throughput; optimized system resource usage | Deadlock, livelock, and starvation issues; synchronization of shared resources |
| Out of order | Significant performance improvement; higher utilization of functional units; less processor stalling | Complex structure; increased replay traps [7]; increased cache misses [7] |
| Multi-core | Easy to implement | Cannot work at twice the speed of a regular processor, gaining only 60% - 80% more speed; interconnect issues; thermal issues |

Because multi-core processors are easy to implement and deliver the benefits above, this project builds a multi-core processor to boost performance. However, the designers must deal with the synchronization problem, that is, maintaining the accuracy of shared resources (specifically, the shared resource in this project is the data memory).

1.4 Memory coherence & consistency problem

Because the multi-core processor contains a shared memory, a read or write by any core in the system can affect memory coherence and consistency (synchronization problems), so both have to be considered. Coherence and consistency are complementary: coherence defines the behavior of reads and writes to the same memory location, while consistency defines the behavior of reads and writes with respect to accesses to other memory locations [8].

The first of these, coherence, is maintained in this project by adding an arbitrator that handles any memory violations between the two cores and by implementing the “A” standard extension instructions of the RISC-V ISA.
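To illustrate how LR/SC from the “A” extension lets two cores update the same word safely, the following Python sketch models a reservation per hart. It is a behavioral model under simplifying assumptions and does not describe the arbitrator hardware of this project; the class `ReservationMemory` and the helper `atomic_increment` are hypothetical names. As in Table 1.4, `sc` returns 0 on success and 1 on failure.

```python
class ReservationMemory:
    """Word-addressed memory with one LR/SC reservation per hart."""

    def __init__(self):
        self.words = {}
        self.reservation = {}  # hart id -> reserved address

    def lr(self, hart: int, addr: int) -> int:
        # Load the word and remember that this hart reserved the address.
        self.reservation[hart] = addr
        return self.words.get(addr, 0)

    def store(self, hart: int, addr: int, value: int) -> None:
        # A write to addr cancels every other hart's reservation on it.
        self.words[addr] = value
        for other, reserved in list(self.reservation.items()):
            if other != hart and reserved == addr:
                del self.reservation[other]

    def sc(self, hart: int, addr: int, value: int) -> int:
        # The store happens only if this hart's reservation is still valid.
        ok = self.reservation.pop(hart, None) == addr
        if ok:
            self.store(hart, addr, value)
        return 0 if ok else 1


def atomic_increment(mem: ReservationMemory, hart: int, addr: int) -> int:
    """Retry LR/SC until the increment succeeds; return the old value."""
    while True:
        old = mem.lr(hart, addr)
        if mem.sc(hart, addr, old + 1) == 0:
            return old
```

If hart 0 and hart 1 both execute `lr` on the same address and hart 1 completes its `sc` first, the `sc` from hart 0 fails and hart 0 has to retry, so neither increment is lost.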

There are several models for maintaining memory consistency. The first is sequential consistency, which requires that the result of any execution be the same as though the memory accesses executed by each processor were kept in order and the accesses among different processors were arbitrarily interleaved [8]. Sequential consistency is implemented in this project because it is straightforward to understand; the trade-off is reduced processor performance.
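The definition of sequential consistency can be checked on a small example by enumerating every interleaving that keeps each core's program order. The Python sketch below is illustrative only; the two-core "message passing" program and the names `interleavings`, `run`, `core0`, and `core1` are assumptions made for this example. Core 0 writes a data word and then sets a flag, and core 1 reads the flag and then the data.

```python
from itertools import combinations


def interleavings(a, b):
    """Yield every merge of a and b that keeps each sequence's own order."""
    total = len(a) + len(b)
    for slots in combinations(range(total), len(a)):
        chosen = set(slots)
        next_a, next_b = iter(a), iter(b)
        yield [next(next_a) if i in chosen else next(next_b) for i in range(total)]


def run(schedule):
    """Execute one global order of accesses and return (r_flag, r_data)."""
    memory = {"data": 0, "flag": 0}
    regs = {}
    for kind, target, operand in schedule:
        if kind == "store":
            memory[target] = operand        # target is a memory variable
        else:
            regs[target] = memory[operand]  # target is a register name
    return regs["r_flag"], regs["r_data"]


core0 = [("store", "data", 42), ("store", "flag", 1)]
core1 = [("load", "r_flag", "flag"), ("load", "r_data", "data")]

sc_outcomes = {run(schedule) for schedule in interleavings(core0, core1)}
```

Across the six possible interleavings, `sc_outcomes` contains `(0, 0)`, `(0, 42)`, and `(1, 42)` but never `(1, 0)`: under sequential consistency, a core that observes the flag set also observes the data written before it.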

Architectural optimizations that are suitable for uniprocessors usually violate sequential consistency, which has led to new memory models for multiprocessors. One of them is relaxed consistency, which allows some loads and stores to be performed out of order. There are many variations of relaxed consistency models; some examples are listed in Table 1.6.

Table 1.6: Relaxed consistency models

| Type of consistency model | Example |
|---|---|
| Loads may be reordered after loads | PA-RISC, Power, Alpha |
| Loads may be reordered after stores | PA-RISC, Power, Alpha |
| Stores may be reordered after stores | PA-RISC, Power, Alpha, PSO |
| Stores may be reordered after loads | PA-RISC, Power, Alpha, PSO, TSO, x86 |
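The effect of the last row of Table 1.6 can be seen by treating a memory model as the set of program-order pairs that must be preserved. The Python sketch below is a simplified illustration, not a model of any real processor; the operation names `S0`, `L0`, `S1`, `L1` and the function `outcomes` are assumptions for this example. Each core stores 1 to its own variable and then loads the other core's variable (core 0: store x, load y into r1; core 1: store y, load x into r2).

```python
from itertools import permutations

# name -> ("store", variable) or ("load", variable, destination register)
OPS = {
    "S0": ("store", "x"),
    "L0": ("load", "y", "r1"),
    "S1": ("store", "y"),
    "L1": ("load", "x", "r2"),
}

# Program order within each core: the store precedes the load.
PROGRAM_ORDER = [("S0", "L0"), ("S1", "L1")]


def outcomes(preserved):
    """Return every (r1, r2) reachable when only `preserved` pairs stay ordered."""
    results = set()
    for order in permutations(OPS):
        if any(order.index(first) > order.index(second) for first, second in preserved):
            continue  # this global order breaks a pair the model preserves
        memory, regs = {"x": 0, "y": 0}, {}
        for name in order:
            op = OPS[name]
            if op[0] == "store":
                memory[op[1]] = 1
            else:
                regs[op[2]] = memory[op[1]]
        results.add((regs["r1"], regs["r2"]))
    return results


sequential = outcomes(PROGRAM_ORDER)  # store -> load order is preserved
relaxed = outcomes([])                # a store may be performed after a later load
```

Under sequential consistency at least one core sees the other's store, so `(0, 0)` is impossible; once a store may be reordered after a later load, `relaxed` additionally contains `(0, 0)`.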

RISC-V specifies its own memory consistency model, called RVWMO (RISC-V Weak Memory Ordering), a set of rules specifying the values that loads of memory can return. Under RVWMO, code running on a single hart (hardware thread) appears to execute in order from the perspective of other memory instructions in the same hart, but memory instructions from another hart may observe the memory instructions from the first hart being executed in a different order [4]. Relaxed consistency may be required for out-of-order processors, but it may not suit this project.

1.5 CPU system design

System design covers all of the other hardware components within a computing system, such as data processing outside the processor (e.g., direct memory access), virtualization, and multiprocessing. Figure 1.2 is an example of a CPU system.

Figure 1.2: Example of CPU system design

The system design includes:

• The processor
• DMA (direct memory access) controller
• MMU (memory management unit)
• Input and output interface
• Main memory
• Interconnect

This project focuses on implementing a multi-core processor and the interaction between that processor and the memory.


Chapter 2 System design

This chapter introduces the general system design. As shown in Figure 2.1, the system includes six components:

• The assembly program is the system's input; it contains the code segments to be converted to binary strings.
• The assembler is a tool that converts the assembly program to binary string format.
• The binary instruction file contains the assembly program in binary string format.
• The instruction memory stores the binary program.
• The processor fetches instructions from the instruction memory, executes them, and reads from or writes to the data memory and the register file.
• The data memory stores data and can be accessed by the processor.

[Figure 2.1: Assembly program (asm) → Assembler → Binary instruction format file → Instruction memory → Processor]

From the specification above, the system is split into software design and hardware design. In the software design, the task is to create an assembler that converts assembly language based on the RISC-V instruction set into binary instruction format. In the hardware design, the task is to build a processor with two five-stage pipelined cores based on the RV32I instruction set and the RV32A extension, a dual-port data memory with different regions accessed by the two cores of the processor, and an instruction memory that stores the binary instructions output by the assembler. The following subsections describe the two tasks in more detail.

2.1 Software design

An assembler is a program that converts computer instructions into binary strings based on their ISA. The processor can use that set of binary strings to perform basic operations. Assembler development is vital to the verification process, where test cases are instruction sequences written in assembly language; having the assembler convert these test cases automatically helps to reduce verification time. The assembler program provides some basic features: syntax verification, conversion of certain pseudo-instructions, and conversion of labels to specific addresses in jump/branch instructions. Figure 2.2 shows a general flow chart for an assembler program.

The steps of the assembler flow chart are as follows:

• Create hash tables for opcode, funct3, funct7, and registers.
• Read the assembly program from the asm file into an array.
• Scan the labels in the entire assembly program, then add them to a hash table.
• Search the array for comments and blank lines, then remove them.
• Split each instruction into specific fields (imm, rs0, rs1, rd) depending on its instruction format.
• Verify the instruction fields; if all of them exist in the hash tables, they are converted to a binary string; otherwise, an error message appears.
• Export the result to output files: a bin file, which contains only binary strings, and a display file (.dis), which is used for debugging purposes.
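The steps above can be sketched as a minimal assembler in Python. This is an illustrative sketch, not the assembler developed in this project: it assumes comments start with `#` and labels end with `:`, supports only a handful of RV32I instructions, removes comments before scanning labels for simplicity, and omits pseudo-instructions and the .dis output. All function and table names are hypothetical.

```python
REGISTERS = {f"x{i}": i for i in range(32)}

# Hash table: mnemonic -> (format, opcode, funct3, funct7)
INSTRUCTIONS = {
    "add": ("R", 0b0110011, 0b000, 0b0000000),
    "sub": ("R", 0b0110011, 0b000, 0b0100000),
    "addi": ("I", 0b0010011, 0b000, None),
    "beq": ("SB", 0b1100011, 0b000, None),
    "bne": ("SB", 0b1100011, 0b001, None),
}


def clean(lines):
    """Drop comments (assumed to start with '#') and blank lines."""
    stripped = (line.split("#", 1)[0].strip() for line in lines)
    return [line for line in stripped if line]


def scan_labels(lines):
    """Map each 'label:' to the byte address of the instruction that follows."""
    labels, program = {}, []
    for line in lines:
        while ":" in line:
            name, line = line.split(":", 1)
            labels[name.strip()] = 4 * len(program)
            line = line.strip()
        if line:
            program.append(line)
    return labels, program


def reg(name):
    if name not in REGISTERS:
        raise ValueError(f"unknown register: {name}")
    return REGISTERS[name]


def encode(line, pc, labels):
    """Split one instruction into fields and return its 32-bit binary string."""
    mnemonic, *operands = line.replace(",", " ").split()
    if mnemonic not in INSTRUCTIONS:
        raise ValueError(f"unknown instruction: {mnemonic}")
    fmt, opcode, funct3, funct7 = INSTRUCTIONS[mnemonic]
    if fmt == "R":
        rd, rs1, rs2 = (reg(name) for name in operands)
        word = funct7 << 25 | rs2 << 20 | rs1 << 15 | funct3 << 12 | rd << 7 | opcode
    elif fmt == "I":
        rd, rs1 = reg(operands[0]), reg(operands[1])
        imm = int(operands[2], 0) & 0xFFF
        word = imm << 20 | rs1 << 15 | funct3 << 12 | rd << 7 | opcode
    else:  # SB: the branch target is a label or a numeric byte offset
        rs1, rs2, target = reg(operands[0]), reg(operands[1]), operands[2]
        offset = labels[target] - pc if target in labels else int(target, 0)
        offset &= 0x1FFF
        word = (
            ((offset >> 12) & 0x1) << 31 | ((offset >> 5) & 0x3F) << 25
            | rs2 << 20 | rs1 << 15 | funct3 << 12
            | ((offset >> 1) & 0xF) << 8 | ((offset >> 11) & 0x1) << 7 | opcode
        )
    return format(word, "032b")


def assemble(source):
    """Return one 32-character binary string per instruction in source."""
    labels, program = scan_labels(clean(source.splitlines()))
    return [encode(line, 4 * index, labels) for index, line in enumerate(program)]
```

Calling `assemble("loop: addi x5, x5, -1\nbne x5, x0, loop")` returns two binary strings, with the label `loop` resolved to a branch offset of -4 bytes. Unknown mnemonics or registers raise a `ValueError`, which plays the role of the error message in the verification step.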

[Figure 2.2: General flow chart of the assembler program]

Title	Research Design A Multi-Core Processor Based On RISC-V Instruction Set Architecture
Author	Nguyen Phan Hoang Phuc
Supervisor	M.Eng. Ho Ngoc Diem
University	University of Information Technology
Major	Computer Engineering
Document type	Capstone Project
Year of publication	2021
City	Ho Chi Minh City
