VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
FACULTY OF COMPUTER ENGINEERING
NGUYEN PHAN HOANG PHUC
CAPSTONE PROJECT RESEARCH DESIGN A MULTI-CORE PROCESSOR
BASED ON RISC-V INSTRUCTION SET
ARCHITECTURE
NGHIÊN CỨU THIẾT KẾ VI XỬ LÝ ĐA NHÂN DỰA TRÊN
KIẾN TRÚC TẬP LỆNH RISC-V
VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
FACULTY OF COMPUTER ENGINEERING
NGUYEN PHAN HOANG PHUC - 17520909
CAPSTONE PROJECT
RESEARCH DESIGN A MULTI-CORE PROCESSOR
BASED ON RISC-V INSTRUCTION SET
ARCHITECTURE
NGHIÊN CỨU THIẾT KẾ VI XỬ LÝ ĐA NHÂN DỰA TRÊN
KIẾN TRÚC TẬP LỆNH RISC-V
BACHELOR OF ENGINEERING
IN COMPUTER ENGINEERING
SUPERVISOR
LIST OF DEFENSE COMMITTEE
Defense committee of the capstone project, established under Decision No. 462/QD-DHCNTT dated July 23, 2021, of the Rector of the University of Information Technology.
... - Chairman
... - Secretary
... - Commissioner
The basis of this research initially stemmed from my passion for developing a better architecture for modern processors. As the world moves further into the digital age, it will need more capable and more robust hardware, especially processors, to handle increasingly complicated programs. Low-performance processors will become outdated and will be replaced by new ones. How can we improve processor performance? It is my passion to find and develop new methods that overcome the limitations of today's processors.
In truth, I could not have achieved my current level of success without a strong support group. First of all, I thank all lecturers and staff at the University of Information Technology, Vietnam National University - Ho Chi Minh City, who have given me a solid foundation of knowledge, especially those in the department of computer engineering, each of whom has given me opportunities to absorb expertise and experience effectively. Secondly, I appreciate my parents, who supported me with love and understanding. Finally, I am grateful to Hồ Ngọc Diễm, who provided patient advice and clear guidance throughout the research process.
Once again, thank you all for your unwavering support.
VIETNAM NATIONAL UNIVERSITY HCMC
UNIVERSITY OF INFORMATION TECHNOLOGY
THE SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
OUTLINE DETAILS
VIETNAMESE NAME: NGHIÊN CỨU THIẾT KẾ BỘ VI XỬ LÝ ĐA LÕI DỰA TRÊN KIẾN TRÚC TẬP LỆNH RISC-V
ENGLISH NAME: RESEARCH DESIGN A MULTI-CORE PROCESSOR BASED ON RISC-V INSTRUCTION SET ARCHITECTURE
Supervisor: M.Eng. Hồ Ngọc Diễm
Start day - End day: From March 08, 2021 to June 25, 2021
Student: Nguyễn Phan Hoàng Phúc - 17520909
Content
1 Overview
Single-core processors are rapidly reaching their limits in complexity and speed, and a multi-core processor is the simplest way to enhance processor performance. This approach has also been the trend in processor development in recent years, because it fully exploits current bleeding-edge semiconductor technology. Besides, the RISC-V instruction set architecture (ISA) is provided without any license fee, which makes it easier to design your own processor. The RISC-V community is well known, and its processors have been used in numerous SoCs and boards; some of them are available on the commercial market.
Related work around the world:
• Single-core RISC-V processor examples include RVCoreP and the Shakti processor. These are multi-cycle processors that can perform base instructions, multiplication/division, and floating-point operations (the Shakti processor in particular is capable of executing atomic instructions). Because it implements the atomic standard extension, Shakti's frequency is dramatically lower than that of RVCoreP, which focuses purely on ALU optimization.
• Among multi-core processors, one example is the quad-core processor designed by Manoj Kumar Gouda and D. Yugandhar. The ISA of this processor is not mentioned, and its memory is organized as centralized shared memory. Another is the many-core processor designed by Ahmed Kamaleldin's team, which includes 36 cores and a distributed shared-memory implementation. This processor can execute RV32I and two other extensions (RV32M and RV32C). Atomic operations are not implemented in either processor.
• Freely available cores similar to this work include Pulpino RISCY, PicoRV32, and Freedom. These three processors are well known in the RISC-V community and are often used as the core processor in other work. They can execute basic arithmetic and logic operations as well as compressed instructions. Their drawback is that they include much auxiliary hardware, which may not be needed in other applications and leads to reduced performance.
In Viet Nam, there is also research related to the RISC-V ISA:
• The most recent work is the design of a RISC-V processor with supervisor mode, which can execute the RV32I base instruction set and other system instructions. This work was carried out by An Xuan Tuan from the Faculty of Computer Engineering. Although this processor can handle exceptions and supports supervisor mode, it is implemented only as a single-core processor. That is the reason why the operating frequency is limited to 12.94 MHz.
The first reason for this research is to apply atomic instructions, a straightforward method of maintaining synchronization in a multi-core processor that was not applied in the multi-core processors above, and to measure processor performance with each technique. Besides, no research on designing a multi-core processor based on the RISC-V ISA has been recorded in Viet Nam; this work will help Viet Nam catch up with worldwide advances in integrated circuit design in general and in RISC-V processors in particular.
2 Project objective
Design a 32-bit multi-cycle processor capable of executing the basic operations of the RV32I instruction set and the atomic extension RV32A.
Design an arbitrator that handles memory access violations that may occur during execution.
Verify the design with industrial methods.
The operating frequency of the processor is at least 100 MHz.
Simulation method: run simulations on ModelSim with test cases generated randomly by the RTG unit (random test generator).
Practical method: embed and verify on an FPGA board (using the DE2 kit).
4 Main content
Design the processor core, including the PC, register file, ALU, hazard detection unit, branching unit, and forwarding units.
Design the data memory and instruction memory.
Design the arbitrator unit.
Run the design simulation on ModelSim.
Test on the FPGA board.
5 Implementation plan
Research RISC-V architecture: March 08, 2021 - March 14, 2021
Hồ Ngọc Diễm Nguyễn Phan Hoàng Phúc
TABLE OF CONTENTS
Chapter 1 Introduction
1.1 Introduction to instruction set architecture
1.1.1 RISC-V architecture overview
1.1.2 Comparison of instruction set architectures
1.2 Details in RISC-V instruction set architecture
1.3 Introduction to microarchitecture
1.3.1 Multi-core processor overview
1.3.2 Microarchitecture comparison
1.4 Memory coherence & consistency problem
1.5 CPU system design
Chapter 2 System design
2.1 Software design
2.2 Hardware design
Chapter 3 Detail Design
3.1 Software design
3.1.1 Data initialization
3.1.2 Creating label hash table
3.1.3 Reformat & splitting instruction
3.1.4 Verify and convert instruction
3.1.5 Export the result to output files
3.2 Hardware design
3.2.2 Data memory design
3.2.3 Arbiter
3.2.4 Processor core
Chapter 4 Verification & Evaluation
4.1 Verification plan
4.2 Directed test verification
4.2.1 Test case generation
4.2.2 Test case manipulation
4.2.3 Automation verification method
4.3 Random test verification
4.4 Riscv-tests verification
Appendix A Introduction other instruction set architecture
A.1 X86 architecture overview
A.2 ARM architecture overview
A.3 MIPS architecture overview
Appendix B Introduction other microarchitecture
B.1 Very long instruction word processor
B.2 Multithreaded processor
B.3 Out-of-order processor
References
LIST OF FIGURES
Figure 1.1: RV32I instruction formats
Figure 1.2: Example of CPU system design
Figure 2.1: The general system design
Figure 2.2: General algorithm of assembler
Figure 2.3: General block diagram
Figure 3.1: Flow chart for creating label hash table
Figure 3.2: Flow chart of splitting
Figure 3.3: Flow chart of convert and verify instruction
Figure 3.4: The design of instruction memory
Figure 3.5: Data memory schematic
Figure 3.6: Data memory partition
Figure 3.7: Schematic of sub_data_mem
Figure 3.8: Schematic of arbiter
Figure 3.9: sub_arbitrator schematic
Figure 3.10: The pipeline structure
Figure 3.11: The IF stage schematic
Figure 3.12: Schematic of the ID stage
Figure 3.13: Schematic of the controller
Figure 3.14: Schematic of the B stage
Figure 3.15: Schematic of the EX stage
Figure 3.16: Schematic of the MEM stage
Figure 3.17: Schematic of the WB stage
Figure 4.1: Verification process
Figure 4.2: Test case generation flow
Figure 4.3: Data generator algorithm
Figure 4.4: File generator algorithm
Figure 4.5: Automation verification process
Figure 4.6: Automation verification process (Cont.)
Figure 4.7: The LUI instruction execution
Figure 4.8: The BNE instruction execution
Figure 4.9: The LR/SC sequence execution
Figure 4.10: A part of post-synthesis simulation waveform
Figure 4.11: Contents of data memory
Figure 4.12: Schematic of top level for FPGA implementation
LIST OF TABLES
Table 1.1: Status of ISA base and extensions
Table 1.2: Pros and cons of each ISA [3]
Table 1.3: RISC-V register conventions
Table 1.4: RISC-V assembly language [2]
Table 1.5: Advantages and drawbacks of mentioned microarchitectures
Table 1.6: Relaxed consistency models
Table 2.1: Bus system of the design
Table 3.1: I/O interface of instruction memory design
Table 3.2: I/O interface of data memory design
Table 3.3: I/O interface of sub_data_memory design
Table 3.4: I/O interface of sub_arbitrator
Table 3.5: Writable boundaries corresponding to each core
Table 3.6: I/O interface of arbitrator_controller
Table 3.7: Memory operation rules
Table 3.8: Instruction decoding table
Table 3.9: I/O interface of the IF stage
Table 3.10: I/O interface of the ID stage
Table 3.11: Hazard scenarios
Table 3.12: I/O interface of the controller
Table 3.13: control_field signal encoding
Table 3.14: Instructions are decoded by group
Table 3.15: I/O interface of the B stage
Table 3.16: I/O interface of the EX stage
Table 3.17: Operation decoding
Table 3.18: alu_control_out signal definition
Table 3.19: I/O interface of the MEM stage
Table 3.20: I/O interface of the WB stage
Table 4.1: Verification plan
Table 4.2: RV32A violation simulation result
Table 4.3: RTG verification result
Table 4.4: Verification result
Table 4.5: Synthesis parameter configuration
Table 4.6: Fmax of the processor
Table 4.7: Utilization summary
Table 4.8: The comparison to pipeline processors
Table 4.9: The comparison to other multi-core processors
Table 4.10: The comparison to popular RISC-V cores
Table 4.11: Pros and cons of the dual-core processor
LIST OF ACRONYMS
ISA Instruction set architecture
RISC Reduced instruction set computer
RV32I RISC-V 32-bit Integer
RV32A RISC-V 32-bit Atomic
SMP Symmetric multiprocessor
DSM Distributed shared memory
RMWs Read-modify-write instructions
AMOs Atomic instructions
FPGA Field-programmable gate array
PC Program counter
UART Universal asynchronous receiver-transmitter
RTG Random test generator
The world is always seeking new processor architectures to speed up computer performance, and many technology companies have invested in finding new methods to accelerate their processors. Multi-core processor development is among the most effective techniques and is essential for any modern system, because it is a practical and straightforward approach to enhancing processor performance at a time when developing a new hardware architecture is increasingly challenging and costly.
RISC-V is an open-source instruction set architecture that allows researchers to study and develop their own processors without any license fee. The RISC-V ecosystem is now large and offers enough information for a new developer, and RISC-V is becoming a potential competitor to ARM and Intel.
Hence, this project implements a dual-core processor based on the RISC-V architecture, including the RV32I instructions for integer operations and RV32A for synchronization. One problem that has to be dealt with is keeping the data memory valid, because it is shared between the two cores. Furthermore, each core inside the processor is developed as a 6-stage pipeline to maximize performance, so handling hazards is a critical task. Besides that, this project develops some additional tools to support the processor and to become familiar with industrial processes.
The project's result is promising: the dual-core processor runs stably as expected and correctly executes every assembly program in a simulation environment, and no flaw was found in memory violation handling. This project can also be extended into an out-of-order or superscalar processor.
Problem statement
The microprocessor industry has played a vital role in technological advancement since microprocessors first appeared in the 1970s. Growing market demand for faster performance drove the industry to produce more powerful and smarter devices. The most classic technique for achieving that goal is to operate the chip at a higher frequency, allowing the processor to execute more work in the same period; this trend was in full bloom from 1983 to 2002. Researchers have also discovered additional techniques that improve performance by exploiting parallel execution, including parallel processing, data-level parallelism, and instruction-level parallelism. These methods have all proved very useful. Another technique that substantially improves performance is the multi-core processor. In fact, the multi-core microarchitecture has existed for the past decade; however, it has gained more importance today because single-core processors face technical limitations in delivering high throughput and long battery life with high energy efficiency [1].
Driven by a performance-hungry market, microprocessor designers have always kept performance and cost in mind. Gordon Moore, a co-founder of Intel Corporation, predicted that the number of transistors on a chip would double once every 18 months to meet this ever-growing demand, a prediction popularly known in the semiconductor industry as Moore's Law. With bleeding-edge chip fabrication technology, integrated circuit processing makes it possible to integrate one billion transistors on a chip and to enhance performance by increasing integration density.
However, bleeding-edge chip fabrication techniques regularly come with major bottlenecks and power dissipation issues, because the performance increase of a microarchitecture obeys Pollack's rule: it is roughly proportional to the square root of the increase in complexity. That means doubling the logic on a processor would improve performance by only about 40%. Studies have revealed that the more chip size shrinks, the more leakage increases, which raises static power dissipation considerably. Although the classic means of improving performance is to increase the operating frequency, the frequency is currently limited to about 4 GHz (power dissipation increases again if the frequency goes beyond that level) [1].
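As a quick check of the 40% figure, Pollack's rule can be evaluated directly. The sketch below is only illustrative: the function name pollack_speedup is hypothetical, and the rule itself is an empirical approximation rather than an exact law.

```python
import math


def pollack_speedup(complexity_ratio: float) -> float:
    """Relative performance predicted by Pollack's rule.

    Performance is assumed to scale with the square root of the
    increase in logic complexity (an empirical approximation).
    """
    if complexity_ratio <= 0:
        raise ValueError("complexity_ratio must be positive")
    return math.sqrt(complexity_ratio)


# Doubling the logic of a single core: sqrt(2) is about 1.41,
# that is, a gain of roughly 40%.
gain = pollack_speedup(2.0) - 1.0
print(f"Predicted gain from doubling the logic: {gain:.0%}")
```

Under this approximation a core needs four times the logic to double its performance, which is the motivation for spending the extra transistors on additional cores instead.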
Performance is still a major design objective in semiconductor manufacturing, but other essential considerations include chip fabrication cost, fault tolerance, power efficiency, and heat dissipation. This has led to the development of multi-core processors as an effective way to address these challenges [1].
Thus, research into and design of higher-performance multi-core processors is necessary for innovating modern chips. This project is based on the RISC-V instruction set architecture (an open-source ISA developed at the University of California, Berkeley, and by RISC-V volunteers) to build a basic multi-core processor with two cores. The processor is then synthesized with Quartus and simulated on ModelSim to verify the correctness of the design. The verification process is run automatically by tools I developed myself, including an instruction generator, which generates random test cases for verification, and a simulation supporter, which shortens simulation time and delivers results faster.
Chapter 1 Introduction
The first chapter introduces the three main components of computer architecture (instruction set architecture (ISA), microarchitecture, and system design) and their actual implementations, and uses that information to clarify why this project implements a multi-core processor based on RISC-V principles.
1.1 Introduction to instruction set architecture
The instruction set architecture, also simply called the architecture, is an abstract interface between the hardware and the lowest-level software that encompasses all the information necessary to write a machine language program that will run correctly, including instructions, registers, memory access, I/O, and so on [2].
Both hardware and software consist of hierarchical layers using abstraction, with each lower layer hiding details from the level above. One key interface between the levels of abstraction is the instruction set architecture, the interface between the hardware and low-level software. This abstract interface enables many implementations of varying cost and performance to run identical software [2].
Many popular ISAs are implemented on chips today, such as x86, ARMv7, ARMv8, and MIPS, but they still have flaws and need improvement. That is why a new ISA like RISC-V was developed to remove the drawbacks of its predecessors. The next subsections introduce each ISA mentioned above in more detail and compare their advantages and disadvantages.
1.1.1 RISC-V architecture overview
RISC-V is an open ISA based on RISC principles, like MIPS, except that the RISC-V ISA is offered under open-source licenses that do not require any fee to use. Many companies, such as Nvidia and Western Digital, are producing or have introduced RISC-V hardware or open-source operating systems with RISC-V support.
It is structured as a small base ISA with various optional extensions. The base ISA is very simple, making RISC-V suitable for research and education, but complete enough to be a suitable ISA for inexpensive, low-power embedded devices. The optional extensions form a more powerful ISA for general-purpose and high-performance computing:
• The standard integer multiplication and division extension, named “M”, adds instructions to multiply and divide values held in the integer registers.
• The standard atomic instruction extension, denoted by “A”, adds instructions that atomically read, modify, and write memory for inter-processor synchronization.
• The standard single-precision floating-point extension, denoted by “F”, adds floating-point registers, single-precision computational instructions, and single-precision loads and stores.
• The standard double-precision floating-point extension, denoted by “D”, expands the floating-point registers and adds double-precision computational instructions, loads, and stores.
• The standard “C” compressed instruction extension provides narrower 16-bit forms of common instructions.
• Other extensions (L, B, J, T, ...) have been introduced, but they are only draft versions and need time to develop (see Table 1.1 for details of the other extensions).
Table 1.1: Status of ISA base and extensions
Base | Version | Status
RVWMO | 2.0 | Ratified
RV32I | 2.1 | Ratified
RV64I | 2.1 | Ratified
RV32E | 1.9 | Draft
Extension | Version | Status
Zifencei | 2.0 | Ratified
Zam | 0.1 | Draft
Ztso | 0.1 | Frozen
1.1.2 Comparison of instruction set architectures
x86 is a CISC architecture, which is a significant difference from ARM, MIPS, and RISC-V, all of which are varieties of RISC designs. In more detail, the ARM, MIPS, and RISC-V architectures are based on a common type of architecture, the “load-store architecture” or “register-register architecture”, meaning that data-processing operations operate on register contents and only load and store instructions can access memory.
Each architecture still has its pros and cons. Table 1.2 lists some strengths and drawbacks of each ISA and of the processors built from it.
Table 1.2: Pros and cons of each ISA [3]
ISA | Pros | Cons
x86 | All popular software has been ported to it or was developed for it; the most popular instruction set in the laptop, desktop, and server markets |
ARMv7 | Defines 3 classes (A, R, M) for specific applications; a great quantity of software has been ported to the ISA, and it is ubiquitous in embedded and mobile devices; by far the most widely implemented architecture in the world | No support for 64-bit addresses; vast and complicated to implement; license requirement; not classically virtualizable
ARMv8 | Extended to 64-bit, with the instruction set designed from scratch, fixing many old issues; backward compatibility | The ISA is still complex and unwieldy; no support for a compressed instruction encoding; license requirement
MIPS | All opcodes are 4 bytes, which simplifies the instruction decoder; the user-level integer instruction set comprises just 58 instructions; the complexity of the instruction set and hardware is reduced, facilitating inexpensive pipelined implementations | Poor encoding; the ISA is over-optimized for a specific microarchitectural pattern, the five-stage, single-issue, in-order pipeline; license requirement
RISC-V | Good instruction encoding, so the same decoder can be used for multiple instruction formats; wide range of communities and support tools; open source; highly scalable | Many extensions are not yet standard versions and need time to optimize and develop
The x86, ARM, and MIPS architectures lack several technical features and require license fees to work with their ISAs. That is why I chose RISC-V for my project: it is a free and open ISA that avoids the technical drawbacks of the older ISAs and is straightforward to implement in many microarchitectural styles. The next section introduces the RISC-V architecture in more detail.
1.2 Details in RISC-V instruction set architecture
RISC-V was defined to support research in data-parallel architectures, so the Roman numeral ‘V’ also conveniently served as an acronymic pun for “Vector”. The aim of RISC-V was to make an ISA suitable for nearly any computing device, and its designers pursued the following specific technical goals when they started to develop the ISA [3]:
• Separate the ISA into a small base ISA and optional extensions.
• Support both 32-bit and 64-bit address spaces.
• Facilitate custom ISA extensions.
• Support variable-length instruction set extensions.
• Provide efficient hardware support for modern standards.
• Orthogonalize the user ISA and privileged architecture.
This project implements RV32I, the base 32-bit integer ISA. It is a simple instruction set that limits the number of mandatory user-level hardware instructions to 40. Like other RISC instruction sets, these instructions are divided into three categories (computation, control flow, and memory access). Because RISC-V is a load-store architecture, arithmetic instructions operate only on registers, and only loads and stores transfer data to and from memory [3].
The addressing modes of the RISC-V instructions are the following [2]:
• Immediate addressing, where the operand is a constant within the instruction itself.
• Register addressing, where the operand is a register.
• Base or displacement addressing, where the operand is at the memory location whose address is the sum of a register and a constant in the instruction.
• PC-relative addressing, where the branch address is the sum of the PC and a constant in the instruction.
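The four addressing modes can be illustrated with a small behavioral sketch in Python. It is not part of the thesis design: the helper names, the 32-entry register list, and the 12-bit immediate width used for the example are assumptions chosen to match RV32I load/store and ALU-immediate instructions.

```python
MASK32 = 0xFFFFFFFF  # RV32: addresses and registers are 32 bits wide


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def immediate_operand(imm12: int) -> int:
    """Immediate addressing: the operand is the constant in the instruction."""
    return sign_extend(imm12, 12)


def register_operand(regs: list, index: int) -> int:
    """Register addressing: the operand is a register (x0 always reads 0)."""
    return 0 if index == 0 else regs[index] & MASK32


def base_displacement_address(regs: list, base: int, imm12: int) -> int:
    """Base/displacement addressing: register plus constant, e.g. lw x5, 40(x6)."""
    return (register_operand(regs, base) + sign_extend(imm12, 12)) & MASK32


def pc_relative_address(pc: int, offset: int) -> int:
    """PC-relative addressing: branch target is PC plus a constant."""
    return (pc + offset) & MASK32


regs = [0] * 32
regs[6] = 0x1000
print(hex(base_displacement_address(regs, 6, 40)))  # address used by lw x5, 40(x6)
print(hex(pc_relative_address(0x100, 100)))         # target of beq x5, x6, 100
```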
Table 1.3 shows the 32 registers, each 32 bits wide, and the specific use assigned to each. Register x0 is hardwired with all bits equal to 0. General-purpose registers x1 to x31 hold values that various instructions interpret as a collection of Boolean values, or as two's complement signed binary integers or unsigned binary integers. The only additional register is the program counter, pc, which holds the address of the current instruction [4].
Table 1.3: RISC-V register conventions
Name | Register number | Usage | Preserved on call?
x0 | 0 | The constant value 0 | n.a.
x1 (ra) | 1 | Return address (link register) | Yes
x2 (sp) | 2 | Stack pointer | Yes
x3 (gp) | 3 | Global pointer | Yes
x4 (tp) | 4 | Thread pointer | Yes
x5-x7 | 5-7 | Temporaries | No
x8-x9 | 8-9 | Saved | Yes
x10-x17 | 10-17 | Arguments/results | No
x18-x27 | 18-27 | Saved | Yes
x28-x31 | 28-31 | Temporaries | No
The forty instructions are encoded in six instruction formats: R, I, S, B, U, and J (shown in Figure 1.1). In these formats, instructions source up to two register operands, identified by rs1 and rs2, and produce up to one result, identified by rd. A significant feature of this encoding is that these register specifiers, when present, always occupy the same position in the instruction. This property allows register fetch to proceed in parallel with instruction decoding.
R-type: funct7 | rs2 | rs1 | funct3 | rd | opcode
I-type: imm[11:0] | rs1 | funct3 | rd | opcode
S-type: imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode
B-type: imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode
U-type: imm[31:12] | rd | opcode
J-type: imm[20] | imm[10:1] | imm[11] | imm[19:12] | rd | opcode
Figure 1.1: RV32I instruction formats
Another feature of this encoding scheme is that generating the immediate operand from the instruction word is inexpensive. Of the 32 bits in the immediate operand, seven always come from the same position in the instruction, including the sign bit, which, due to its high fan-out, is the most critical. Twenty-four more bits come from one of two positions, and the final immediate bit has three sources. The B and J formats (called SB and UJ in earlier versions of the specification), whose immediates are scaled by a factor of two, rotate the bits within the immediate rather than using hardware MUXes to do the scaling, as was the case in MIPS, SPARC, and Alpha. This design reduces hardware cost for low-end implementations that reuse the ALU data path to compute branch targets [3].
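To make the immediate encoding concrete, the following Python sketch rebuilds the immediate of each format from a 32-bit instruction word, using the bit positions of the RV32I formats. It is a software illustration only, not the decoder of this project; the names bits, sign_extend, and decode_immediate are hypothetical.

```python
def bits(word: int, hi: int, lo: int) -> int:
    """Extract bits hi..lo (inclusive) of a 32-bit instruction word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def sign_extend(value: int, width: int) -> int:
    """Interpret a `width`-bit value as a two's complement number."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def decode_immediate(word: int, fmt: str) -> int:
    """Rebuild the immediate of an I, S, B, U, or J format instruction.

    In every format the sign bit of the immediate is instruction bit 31.
    """
    if fmt == "I":
        return sign_extend(bits(word, 31, 20), 12)
    if fmt == "S":
        return sign_extend(bits(word, 31, 25) << 5 | bits(word, 11, 7), 12)
    if fmt == "B":  # imm[12|10:5] in bits 31:25, imm[4:1|11] in bits 11:7
        imm = (bits(word, 31, 31) << 12 | bits(word, 7, 7) << 11
               | bits(word, 30, 25) << 5 | bits(word, 11, 8) << 1)
        return sign_extend(imm, 13)
    if fmt == "U":
        return sign_extend(bits(word, 31, 12) << 12, 32)
    if fmt == "J":  # imm[20|10:1|11|19:12] in bits 31:12
        imm = (bits(word, 31, 31) << 20 | bits(word, 19, 12) << 12
               | bits(word, 20, 20) << 11 | bits(word, 30, 21) << 1)
        return sign_extend(imm, 21)
    raise ValueError(f"format {fmt!r} has no immediate")


# sw x5, 40(x6) encodes as 0x02532423; its S-type immediate is the offset 40.
print(decode_immediate(0x02532423, "S"))
```

Note that the B and J branches never shift the assembled value by one at the end: the low bit is simply fixed at zero, which is the "rotation instead of MUX" choice described above.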
In highly parallel systems such as this project, where a memory word may be contended for by two different cores, atomic memory instructions are needed to update the memory word in data memory. To handle this task, RISC-V provides the ‘A’ standard extension, which includes LR/SC (load-reserved/store-conditional) and several other atomic memory operations that perform an arithmetic or logic operation on a memory word and then return the old value (signed and unsigned minimum and maximum; bitwise AND, OR, and XOR; addition; and swap). The RV32I and RV32A instruction sets are described more precisely in Table 1.4.
Table 1.4: RISC-V assembly language [2]
Instruction | Example | Meaning | Comments
Load word | lw x5, 40(x6) | x5 = Memory[x6 + 40] | Word from memory to register
Store word | sw x5, 40(x6) | Memory[x6 + 40] = x5 | Word from register to memory
Load halfword | lh x5, 40(x6) | x5 = Memory[x6 + 40] | Halfword from memory to register
Load halfword, unsigned | lhu x5, 40(x6) | x5 = Memory[x6 + 40] | Unsigned halfword from memory to register
Store halfword | sh x5, 40(x6) | Memory[x6 + 40] = x5 | Halfword from register to memory
Load byte | lb x5, 40(x6) | x5 = Memory[x6 + 40] | Byte from memory to register
Load byte, unsigned | lbu x5, 40(x6) | x5 = Memory[x6 + 40] | Unsigned byte from memory to register
Store byte | sb x5, 40(x6) | Memory[x6 + 40] = x5 | Byte from register to memory
Load reserved | lr.w x5, (x6) | x5 = Memory[x6] | Load; 1st half of an atomic swap sequence
Store conditional | sc.w x7, x5, (x6) | Memory[x6] = x5; x7 = 0/1 | Store; 2nd half of an atomic swap sequence
Load upper immediate | lui x5, 0x12345 | x5 = 0x12345000 | Loads 20-bit constant shifted left 12 bits
[...] immediate | [...] | [...] | [...] with constant
Inclusive or immediate | ori x5, x6, 20 | x5 = (x6 | 20) | Bit-by-bit OR reg with constant
Exclusive or [...] | [...] | [...] | Bit-by-bit XOR reg [...]
Branch if equal | beq x5, x6, 100 | if (x5 == x6) go to PC+100 | PC-relative branch if registers equal
Branch if not equal | bne x5, x6, 100 | if (x5 != x6) go to PC+100 | PC-relative branch if registers not equal
Branch if less than | blt x5, x6, 100 | if (x5 < x6) go to PC+100 | PC-relative branch if registers less
Branch if greater or equal | bge x5, x6, 100 | if (x5 >= x6) go to PC+100 | PC-relative branch if registers greater or equal
Branch if less, unsigned | bltu x5, x6, 100 | if (x5 < x6) go to PC+100 | PC-relative branch if registers less, unsigned
Branch if greater/equal, unsigned | bgeu x5, x6, 100 | if (x5 >= x6) go to PC+100 | PC-relative branch if registers greater or equal, unsigned
Jump and link | jal x1, 100 | x1 = PC+4; go to PC+100 | PC-relative procedure call
Jump and link register | jalr x1, 100(x5) | x1 = PC+4; go to x5+100 | Procedure return; indirect call
AMO swap | amoswap.w x1, x2, (x3) | x1 = Memory[x3]; Memory[x3] = x2 | Load word from memory to register, then swap word from register to memory
AMO add | amoadd.w x1, x2, (x3) | x1 = Memory[x3]; Memory[x3] = x2 + Memory[x3] | Load word from memory to register, then store the addition result to memory
AMO xor | amoxor.w x1, x2, (x3) | x1 = Memory[x3]; Memory[x3] = x2 ^ Memory[x3] | Load word from memory to register, then store the XOR result to memory
AMO min | amomin.w x1, x2, (x3) | x1 = Memory[x3]; if (Memory[x3] > x2) Memory[x3] = x2 | Load word from memory to register, then store the smaller value to memory
AMO max | amomax.w x1, x2, (x3) | x1 = Memory[x3]; if (Memory[x3] < x2) Memory[x3] = x2 | Load word from memory to register, then store the greater value to memory
AMO min unsigned | amominu.w x1, x2, (x3) | x1 = Memory[x3]; if (Memory[x3] > x2) Memory[x3] = x2 | Load word from memory to register, then store the unsigned smaller value to memory
AMO max unsigned | amomaxu.w x1, x2, (x3) | x1 = Memory[x3]; if (Memory[x3] < x2) Memory[x3] = x2 | Load word from memory to register, then store the unsigned greater value to memory
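The atomic rows of Table 1.4 can be made concrete with a small behavioral model. The sketch below is only an illustration in Python, not the project's hardware: the class `WordMemory`, the table `AMO_OPS`, and the per-hart reservation dictionary are hypothetical names, and the model assumes each method call completes indivisibly.

```python
MASK32 = 0xFFFFFFFF


def to_signed(value):
    """Interpret a 32-bit pattern as a signed two's-complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


# New memory word for each AMO, given the old word m and the register value r.
AMO_OPS = {
    "swap": lambda m, r: r,
    "add": lambda m, r: m + r,
    "xor": lambda m, r: m ^ r,
    "min": lambda m, r: r if to_signed(m) > to_signed(r) else m,
    "max": lambda m, r: r if to_signed(m) < to_signed(r) else m,
    "minu": lambda m, r: r if m > r else m,
    "maxu": lambda m, r: r if m < r else m,
}


class WordMemory:
    """Toy word memory with AMO and LR/SC behavior (illustrative assumption)."""

    def __init__(self):
        self.words = {}
        self.reservations = {}  # hart id -> reserved address

    def load(self, addr):
        return self.words.get(addr, 0)

    def store(self, addr, value):
        self.words[addr] = value & MASK32
        # A write to a reserved word invalidates every reservation on it.
        self.reservations = {
            h: a for h, a in self.reservations.items() if a != addr
        }

    def amo(self, op, addr, rs2):
        """Read-modify-write as one step; returns the old word (rd)."""
        old = self.load(addr)
        self.store(addr, AMO_OPS[op](old, rs2 & MASK32))
        return old

    def lr(self, hart, addr):
        """Load reserved: read the word and remember the reservation."""
        self.reservations[hart] = addr
        return self.load(addr)

    def sc(self, hart, addr, value):
        """Store conditional: returns 0 on success, 1 on failure."""
        ok = self.reservations.pop(hart, None) == addr
        if ok:
            self.store(addr, value)
        return 0 if ok else 1
```

In this model, an `sc` that follows another core's write to the reserved word returns 1, so the software would loop back to `lr` and retry, which is the usual way an LR/SC pair builds an atomic update.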
1.3 Introduction to microarchitecture
Microarchitecture (also known as computer organization) defines the processor's organization, including the major functional units, their interconnection, and control [2], as well as how a given instruction set architecture is implemented in a processor. The following subsections introduce some common architectures that are usually implemented in modern processors, including the very-long-instruction-word (VLIW) processor, the multithreaded processor, the out-of-order processor, and the multi-core processor.
1.3.1 Multi-core processor overview
A multi-core processor is an architecture in which multiple processor cores are integrated on a single die. A processor core is a processing component that can independently fetch and execute instructions from at least one instruction stream. A core typically contains logic components such as an instruction fetch unit, program counter (PC), instruction scheduler, functional units, and register file. Multi-core architecture is a relatively recent design, having emerged only in the last of the four decades of microprocessor history [5].
The emergence of multi-core architecture marked a significant turning point in the evolution of parallel computer architectures. Multi-core chips play a vital role in large, powerful, and expensive computer systems, and they have become the mainstream architecture of servers, desktops, and even mobile devices such as cell phones. Before 2001, parallel computers were mainly used in servers and supercomputers, while client machines (desktops, laptops, and mobile devices) were single-core systems. For various reasons, 2001 marked a turning point: the first multi-core chip, the IBM Power4, was shipped. The Power4 was the first non-embedded microprocessor that combined two cores on a single die, with the cores tightly integrated to support parallel computation. The three decades before 2001 saw a design approach in which a single core became more complex and faster; the decade since 2001 has seen a design approach in which multiple processor cores are implemented on a single chip [5].
One of the major trends that drove the advancement of multi-core architecture is the shrinkage of transistors, which allows more and more transistors to be placed on a single die. The pace of transistor integration has been tremendous [5].
1.3.2 Microarchitecture comparison
The four computer architectures introduced above are the most classical approaches to achieving higher processor performance; however, each of them has disadvantages that architects must trade off against the performance gain (see Table 1.5 for details). Moreover, modern processors can apply many other techniques (superscalar processors, vector processors, and so on) to enhance their power.
Table 1.5: Advantages and drawbacks of mentioned microarchitectures
Microarchitecture | Advantages | Disadvantages
VLIW | Reduce hardware complexity [6]; low power consumption [6]; suitable to handle compression/decompression of image and speech data [6] | The higher complexity of the compiler [6]; challenging to manage backward compatibility
Multithreaded | Improved throughput; optimizing system resource usage | Deadlock, livelock, starvation issue; synchronization of shared resources
Out of order | Significant performance improvement; higher utilization of functional units; less processor stalling | Complex structure; increased replay traps [7]; increased cache misses [7]
Multi-core | Easy to implement | It cannot work at twice the speed of a regular processor, they get only 60% - 80% more speed; interconnect issues; thermal issues
Because a multi-core design is comparatively easy to implement and offers clear benefits, this project builds a multi-core processor to boost performance. However, the designers must deal with the synchronization problem, that is, maintaining the correctness of shared resources (in this project, the shared resource is the data memory).
1.4 Memory coherence & consistency problem
Because the multi-core processor contains a shared memory, a read or write by any core in the system can affect memory coherence and consistency (synchronization problems), so both have to be considered. Coherence and consistency are complementary: coherence defines the behavior of reads and writes to the same memory location, while consistency defines the behavior of reads and writes with respect to accesses to other memory locations [8].
Coherence is maintained in this project by adding an arbitrator that handles any memory violations between the two cores and by implementing the 'A' standard extension instructions of the RISC-V ISA.
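The role of the arbitrator can be sketched in a few lines. The source does not specify the arbitration policy, so the example below assumes a fixed priority (core 0 wins) and treats two same-cycle accesses to one address as a conflict when at least one of them is a write; `Request`, `arbitrate`, and `run` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    core: int
    addr: int
    write: bool
    data: int = 0


def arbitrate(req0: Optional[Request], req1: Optional[Request]):
    """Return (granted, stalled) request lists for one cycle.

    Assumed policy: core 0 has fixed priority on a conflicting address.
    """
    reqs = [r for r in (req0, req1) if r is not None]
    conflict = (
        len(reqs) == 2
        and reqs[0].addr == reqs[1].addr
        and (reqs[0].write or reqs[1].write)
    )
    if conflict:
        return [reqs[0]], [reqs[1]]
    return reqs, []


def run(queue0, queue1, memory):
    """Drain two per-core request queues; return the number of cycles used."""
    q0, q1 = list(queue0), list(queue1)
    cycles = 0
    while q0 or q1:
        granted, _stalled = arbitrate(
            q0[0] if q0 else None, q1[0] if q1 else None
        )
        for req in granted:
            if req.write:
                memory[req.addr] = req.data
            (q0 if req.core == 0 else q1).pop(0)
        cycles += 1  # a stalled request stays queued and retries next cycle
    return cycles
```

With this policy, two writes to the same word take two cycles and the stalled core's value is written last, whereas accesses to different addresses complete in the same cycle.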
There are several models for maintaining memory consistency. The first is sequential consistency, which requires that the result of any execution be the same as though the memory accesses executed by each processor were kept in order and the accesses among different processors were arbitrarily interleaved [8]. Sequential consistency is implemented in this project because it is straightforward to understand; the trade-off is reduced processor performance. Architectural optimizations that suit uniprocessors usually violate sequential consistency, which led to new memory models for multiprocessors. One of them is relaxed consistency, which allows some loads and stores to execute out of order. There are many variations of relaxed consistency models; some examples are listed in Table 1.6.
Table 1.6: Relaxed consistency models
Type of consistency model | Example
Loads may be reordered after loads | PA-RISC, Power, Alpha
Loads may be reordered after stores | PA-RISC, Power, Alpha
Stores may be reordered after stores | PA-RISC, Power, Alpha, PSO
Stores may be reordered after loads | PA-RISC, Power, Alpha, PSO, TSO, x86
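The difference between sequential and relaxed consistency can be illustrated with the classic two-thread "store buffering" test, which is not taken from the source. In the sketch below, thread 0 executes x = 1 and then r0 = y, and thread 1 executes y = 1 and then r1 = x. The code enumerates every interleaving that keeps each thread's program order (sequential consistency) and then, as a simplified stand-in for a relaxed model, also lets each thread's load move ahead of its own store. All names are illustrative.

```python
def interleavings(a, b):
    """Yield every merge of a and b that keeps each sequence's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def execute(schedule):
    """Run one global order of operations on a shared memory; return (r0, r1)."""
    memory = {"x": 0, "y": 0}
    regs = {}
    for op, dst, src in schedule:
        if op == "store":
            memory[dst] = src          # dst is the variable, src the value
        else:
            regs[dst] = memory[src]    # dst is the register, src the variable
    return regs["r0"], regs["r1"]


T0 = [("store", "x", 1), ("load", "r0", "y")]
T1 = [("store", "y", 1), ("load", "r1", "x")]


def outcomes(t0_orders, t1_orders):
    """Collect every (r0, r1) result over the allowed per-thread orders."""
    results = set()
    for t0 in t0_orders:
        for t1 in t1_orders:
            for schedule in interleavings(t0, t1):
                results.add(execute(schedule))
    return results


# Sequential consistency: each thread's program order is preserved.
sc_results = outcomes([T0], [T1])
# Relaxed sketch: a thread's load may also execute before its earlier store.
relaxed_results = outcomes([T0, T0[::-1]], [T1, T1[::-1]])
```

Under sequential consistency the result r0 = 0, r1 = 0 never appears, because at least one store must precede both loads; it does appear once loads are allowed to bypass earlier stores, which is the reordering that the last row of Table 1.6 attributes to TSO and x86.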
RISC-V specifies its own memory consistency model, RVWMO (RISC-V Weak Memory Ordering), a set of rules specifying the values that loads of memory can return. Under RVWMO, code running on a single hart (hardware thread) appears to execute in order from the perspective of other memory instructions in the same hart, but memory instructions from another hart may observe the memory instructions from the first hart being executed in a different order [4]. Relaxed consistency may be required for out-of-order processors, so it may not suit this project.
1.5 CPU system design
System design covers all of the other hardware components within a computing system, such as data processing outside the processor (e.g., direct memory access), virtualization, and multiprocessing. Figure 1.2 is an example of a CPU system.
[Block diagram: the processor, DMA controller, MMU, interconnect, I/O interface, main memory]
Figure 1.2: Example of CPU system design
The system design includes:
• The processor
• DMA (direct memory access) controller
• MMU (memory management unit)
• Input and output interface
• Main memory
• Interconnect
This project focuses on implementing a multi-core processor and the interaction
between that processor and the memory
Chapter 2 System design
This chapter introduces the general system design. As shown in Figure 2.1, the system includes six components:
• The assembly program is the system's input; it contains the code segments to be converted to binary strings.
• The assembler is a tool that converts the assembly program to binary string format.
• The binary instruction file contains the assembly program in binary-digit string format.
• The instruction memory stores the binary program.
• The processor fetches instructions from the instruction memory, executes them, and reads from or writes to the data memory and register file.
• The data memory stores data and can be accessed by the processor.
[Figure 2.1 block diagram: assembly program (asm) -> assembler -> binary instruction format file -> instruction memory -> processor]
Based on the specification above, the system is split into software design and hardware design. On the software side, the task is to create an assembler that converts assembly language based on the RISC-V instruction set into binary instruction format. On the hardware side, the task is to build a processor with two cores, each a 5-stage pipeline based on the RV32I instruction set and the RV32A extension; a dual-port data memory with different regions accessed by the two cores; and an instruction memory that stores the binary instructions produced by the assembler. The following subsections describe the two tasks in more detail.
2.1 Software design
An assembler is a program that converts computer instructions into binary strings based on their ISA. The processor can use that set of binary strings to perform basic operations. Assembler development is vital to the verification process, where test cases are instruction sequences written in assembly language; having the assembler convert these test cases automatically helps reduce verification time. The assembler provides some basic features: syntax verification, conversion of certain pseudo-instructions, and conversion of labels to specific addresses in jump/branch instructions. Figure 2.2 shows a general flow chart for the assembler program.
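Two of these features, pseudo-instruction conversion and label conversion, can be sketched as follows. This is a minimal illustration rather than the project's assembler: the three pseudo-instructions shown (`nop`, `mv`, `j`) are standard RISC-V ones chosen as examples, the `#` comment character is an assumption, every instruction is assumed to occupy 4 bytes, and the function names are hypothetical.

```python
def expand_pseudo(line):
    """Rewrite a few standard RISC-V pseudo-instructions as base instructions."""
    parts = line.replace(",", " ").split()
    if parts[0] == "nop":
        return "addi x0, x0, 0"
    if parts[0] == "mv":
        return f"addi {parts[1]}, {parts[2]}, 0"
    if parts[0] == "j":
        return f"jal x0, {parts[1]}"
    return line


def first_pass(source):
    """Drop comments and blank lines, record label addresses, expand pseudos."""
    labels, instructions = {}, []
    for raw in source.splitlines():
        line = raw.split("#", 1)[0].strip()  # '#' comments are an assumption
        while ":" in line:
            name, line = line.split(":", 1)
            labels[name.strip()] = 4 * len(instructions)  # byte address
            line = line.strip()
        if line:
            instructions.append(expand_pseudo(line))
    return labels, instructions


def resolve_labels(labels, instructions):
    """Replace a trailing label operand with a PC-relative byte offset."""
    resolved = []
    for index, text in enumerate(instructions):
        mnemonic, _, operands = text.partition(" ")
        ops = [o.strip() for o in operands.split(",")]
        if ops[-1] in labels:
            ops[-1] = str(labels[ops[-1]] - 4 * index)
        resolved.append(f"{mnemonic} {', '.join(ops)}".strip())
    return resolved


PROGRAM = """
loop:                     # label on its own line
    addi x5, x5, -1
    bne x5, x0, loop      # branch back to the label
    mv x6, x5
    j loop
"""

labels, instructions = first_pass(PROGRAM)
resolved = resolve_labels(labels, instructions)
```

Scanning all labels before resolving any operand is what allows a branch to refer to a label defined later in the file.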
The assembler flow chart consists of the following steps:
• Create hash tables for opcode, funct3, funct7, and registers.
• Read the assembly program from the asm file into an array.
• Scan the entire assembly program for labels, then add them to a hash table.
• Search the array for comments and blank lines, then remove them.
• Split each instruction into specific fields (imm, rs0, rs1, rd) depending on its instruction format.
• Verify the instruction fields; if all of them exist in the hash tables, they are converted to a binary string; otherwise, an error message appears.
• Export the result to output files: a bin file, which contains only binary strings, and a display file (.dis), which is used for debugging.
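The table-lookup and encoding steps above can be sketched as follows. This is a hedged illustration, not the project's assembler: it covers only a handful of R-type and I-type instructions, uses the standard RV32I field layout and encodings, names the source registers rs1/rs2 as the RISC-V specification does, and all identifiers (`OPCODE`, `FUNCT3`, `FUNCT7`, `REGISTERS`, `encode`) are hypothetical.

```python
# Hash tables (dictionaries) for the fixed instruction fields.
REGISTERS = {f"x{i}": i for i in range(32)}
OPCODE = {"add": 0b0110011, "sub": 0b0110011, "xor": 0b0110011,
          "addi": 0b0010011, "ori": 0b0010011}
FUNCT3 = {"add": 0b000, "sub": 0b000, "xor": 0b100,
          "addi": 0b000, "ori": 0b110}
FUNCT7 = {"add": 0b0000000, "sub": 0b0100000, "xor": 0b0000000}  # R-type only


def reg(name):
    """Look a register up in the hash table or report an error."""
    if name not in REGISTERS:
        raise ValueError(f"unknown register: {name!r}")
    return REGISTERS[name]


def encode(line):
    """Encode one R-type or I-type instruction as a 32-character bit string."""
    mnemonic, _, rest = line.strip().partition(" ")
    fields = [f.strip() for f in rest.split(",")]
    if mnemonic not in OPCODE or len(fields) != 3:
        raise ValueError(f"unsupported or malformed instruction: {line!r}")
    rd, rs1 = reg(fields[0]), reg(fields[1])
    common = rs1 << 15 | FUNCT3[mnemonic] << 12 | rd << 7 | OPCODE[mnemonic]
    if mnemonic in FUNCT7:
        # R-type: funct7 | rs2 | rs1 | funct3 | rd | opcode
        word = FUNCT7[mnemonic] << 25 | reg(fields[2]) << 20 | common
    else:
        # I-type: imm[11:0] | rs1 | funct3 | rd | opcode
        imm = int(fields[2], 0)
        if not -2048 <= imm <= 2047:
            raise ValueError(f"immediate out of 12-bit range: {imm}")
        word = (imm & 0xFFF) << 20 | common
    return format(word, "032b")
```

For instance, `encode("add x5, x6, x7")` yields the bit string for the word 0x007302B3; an unknown mnemonic or register raises an error, mirroring the error-message branch of the verification step.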
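[Flow chart: create hash tables for opcode, funct3, funct7, and registers -> read the asm file -> scan labels -> remove comments and blank lines -> split instructions -> verify fields -> export output files -> End]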