Graduation thesis: Implementation of MIPS 32-bit Superscalar on FPGA


HO CHI MINH NATIONAL UNIVERSITY

UNIVERSITY OF INFORMATION TECHNOLOGY

COMPUTER ENGINEERING FACULTY

VO HOANG NGUYEN TIN - 19522352

NGUYEN PHAN NHAT QUANG - 19522095


First and foremost, we would like to express our sincere gratitude to our advisor, Dr. Nguyen Minh Son, for his dedicated guidance throughout the thesis process. He provided valuable assistance and equipment, and invested considerable time in reviewing our work and providing feedback, enabling us to complete the thesis successfully.

In addition, we would like to thank the faculty members, colleagues, and friends of the Computer Engineering Faculty, as well as the University of Information Technology - Vietnam National University Ho Chi Minh City, for their dedicated teaching, for imparting invaluable knowledge, and for supporting us throughout our academic journey. The knowledge and experience gained from them will serve as a solid foundation and give us confidence as we embark on our future endeavors.

Lastly, we would like to express our deep appreciation to our parents and families for their unwavering support and encouragement, which gave us strong emotional support throughout our education.

While conducting this thesis, we encountered inevitable difficulties and made mistakes owing to the limits of our specialized knowledge. We therefore sincerely hope that the professors and colleagues will understand and provide constructive feedback so that this thesis can be further developed and refined in the future.

Once again, we would like to express our heartfelt gratitude.

Ho Chi Minh City, June 29, 2023

The students

Vo Hoang Nguyen Tin

Nguyen Phan Nhat Quang


TABLE OF CONTENTS

Chapter 1 OVERVIEW
1.1 Architecture
1.1.1 Processor overview
1.1.2 CISC architecture
1.1.3 RISC architecture
1.1.4 Compare CISC and RISC
1.2 MIPS architecture
1.2.1 Development history
1.2.2 Instruction set architecture
1.2.3 Executable instructions in the thesis

Chapter 2 DESIGNING MIPS32 PROCESSOR
2.1 Overview in processor design
2.2 Function blocks in processor
2.2.1 IP System cache
2.2.2 Program Counter block
2.2.3 Control Unit block
2.2.4 Register File block
2.2.5 Register High/Low block
2.2.6 Comparator block
2.2.7 Sign Extend block
2.2.8 ALU block
2.2.10 MemLoad block
2.2.11 Forwarding Unit block
2.2.12 Hazard Unit block
2.3 Resolving pipeline conflicts
2.3.1 Processor pipeline architecture
2.3.3 Load Data hazard
2.3.4 Branch hazard

Chapter 3 DESIGNING MIPS32 PROCESSOR WITH AXI4 PROTOCOL
3.1 Package MIPS core with AXI4 protocol
3.2 AXI4 protocol
3.2.1 Protocol overview
3.2.2 Specific signals of the AXI4 protocol in the thesis
3.2.3 Overview Simulation of five communication channels AXI4
3.3 Additional function block
3.3.1 CDMA Control block

Chapter 4 BLOCK DESIGN ON VIVADO
4.1 Overview of Block Design architecture
4.1.1 Architecture design overview
4.1.2 Design on [illegible]
4.2 Xilinx IPs used in Block Design
4.2.1 IP Memory Interface Generator (MIG 7 Series)
4.2.2 IP Central Direct Memory Access (IP CDMA)
4.2.3 IP AXI Bram Controller

Chapter 5 SIMULATION AND EVALUATION
5.1 Instruction simulation
5.1.1 [illegible]
5.1.2 Simulation result

Chapter 6 CONCLUSION AND DEVELOPMENT
6.1 [illegible]
6.1.1 Accumulated experiences
6.1.2 Backlog issues
6.1.3 [illegible]

[illegible]

APPENDIX I KIT FPGA XILINX VIRTEX-7 VC707
APPENDIX II GENERATE BITSTREAM STEPS

LIST OF FIGURES

Nintendo 64 (MIPS R4300i CPU)
Tesla Model S (MIPS I-class CPU)
Processing stages of processor
System Cache
Program Counter Block
Control Unit Block
Register File Block
Register High/Low Block
Comparator Block
Sign Extend Block
ALU Block
Mult/Div Block
MemLoad Processing Block
Forwarding Unit Block
Hazard Unit Block
Processor pipeline architecture
Forwarding data in pipeline
Forwarding for memory write instruction
Hazard with data load instructions
Figure 2.18: Instruction to load data and calculate data
Figure 2.19: Branch instruction occurs in pipeline
Figure 2.20: Branch instruction is at position B in the instruction pair
Figure 2.21: The branch instruction needs data from the previous instruction
Figure 3.1: MIPS architecture after it is packaged into an IP
Figure 3.2: General architecture of the AXI4 protocol
Figure 3.3: A read transaction of the AXI4 protocol [8]
Figure 3.4: A write transaction of the AXI4 protocol [8]
Figure 3.5: Simulate read channel AXI4 of processor
Figure 3.6: Simulate write channel AXI4 of processor
Figure 3.7: Simulate write response channel AXI4 of processor
Figure 3.9: CDMA Controller Block
Figure 4.1: Architecture design overview of Block Design
Figure 4.2: Block Design on Vivado Software
Figure 4.3: IP Memory Interface Generator
Figure 4.4: IP Central Direct Memory Access (CDMA)
Figure 4.5: IP Bram Controller and IP Block Memory Generator
Figure 5.1: Test model
Figure 5.2: Instruction in Initial
Figure 5.3: Read request to I-Cache
Figure 5.4: I-CACHE requests a read transaction to load instruction in
Figure 5.5: Write value from AXI bus into I-CACHE
Figure 5.6: The execution process in processor [1]
Figure 5.7: Mult/Div block execute
Figure 5.8: The execution process in processor [2]
Figure 5.9: Write value into Data Memory
Figure 5.10: Write value on AXI4
Figure 5.11: The execution process in processor [3]
Figure 5.12: The execution process in processor [4]
Figure 5.13: The execution process in processor [5]
Figure 5.14: Register File and High/Low Register results [1]
Figure 5.16: Comparing the simulation result with MARS [1]
Figure 5.17: Instructions from Main 1
Figure 5.18: Instructions in Sub1-1
Figure 5.19: Calculation result of the Sub1-1
Figure 5.20: Calculation result of Sub1-1 on MARS
Figure 5.21: Instructions in Sub1-2
Figure 5.22: Saved values in Data Memory on MARS
Figure 5.23: Saved values in Data Memory on Vivado [1]
Figure 5.24: Instructions in Sub1-3
Figure 5.25: Calculation result of the instruction in Sub1-3
Figure 5.26: Calculation result of the instructions in Sub1-3 on MARS
Figure 5.27: Instructions in Sub1-4
Figure 5.28: Calculation result of the instructions in Sub1-4
Figure 5.29: Calculation result of the instructions in Sub1-4 on MARS
Figure 5.30: Instructions in Main 2
Figure 5.31: Instructions in Sub2-1
Figure 5.32: Calculation result of the instructions in Sub2-1
Figure 5.33: Calculation result of the instructions on MARS
Figure 5.34: Instructions in Sub2-2
Figure 5.35: Calculation result of the instructions in Sub2-2
Figure 5.36: Calculation result of the instructions on MARS
Figure 5.37: Results saved in register file after finishing the testing program
Figure 5.38: The result on MARS
Figure 5.39: The values in Data Memory on Vivado
Figure 5.40: The values in Data Memory on MARS
Figure 5.41: Resource utilization
Figure 5.42: Resource utilization of each logic block in the processor
Figure 5.44: The clock speed in the block design
Figure 5.45: Timing result
Figure 5.46: Power consumption
Figure 5.47: Power summary from the old design
Figure 5.48: Utilization summary from the old design
Figure PI.1: KIT FPGA VC707 diagram [6]
Figure PI.2: Structure of KIT FPGA VC707 [6]
Figure PII.1: Constraint file on Vivado
Figure PII.2: Modeling RTL code for Mux 2-to-1 64-bit on Vivado

LIST OF TABLES

Table 1.1: Register set in MIPS
Table 1.2: Other registers
Table 1.3: Instruction format structure R
Table 1.4: Instruction format structure I
Table 1.5: Instruction format structure J
Table 1.6: Groups of processing instructions
Table 2.1: Signals of block
Table 2.2: Signals of Program Counter Block
Table 2.3: Signals of Control Unit Block
Table 2.4: The bit encoding convention for ALUControl
Table 2.5: Branc Ins bit encoding convention
Table 2.6: Signals of Register File Block
Table 2.7: Signals of Register High/Low Block
Table 2.8: Signals of Comparator Block
Table 2.9: Signals of Sign Extend Block
Table 2.10: Signal of ALU Block
Table 2.11: Signals of Mult/Div block
Table 2.12: Signals of MemLoad processing block
Table 2.13: Signals of Forwarding Unit
Table 2.14: Signals of Hazard Unit Block
Table 2.15: Overview of conflict handling in pipeline
Table 3.1: Signals of AXI4 bus
Table 3.3: Signal of CDMA Controller Block
Table 5.1: Instruction computing result [1]
Table 5.2: Instruction computing result [2]
Table 5.3: Instruction computing result [3]
Table 5.4: Instruction computing result [4]
Table 5.5: Instruction computing result [5]
Table PI.1: Describe the location of components on the FPGA VC707 KIT

ABBREVIATION GLOSSARY

Abbreviation | Expanded form | Description
IP | Intellectual Property | Xilinx's intellectual property cores used in the thesis
IoT | Internet of Things | Connection of many electrical things together to build a system useful for life
AXI | Advanced eXtensible Interface | Advanced eXtensible Interface protocol
CISC | Complex Instruction Set Computer | Computer architecture with a complex instruction set
RISC | Reduced Instruction Set Computer | Computer architecture with a simplified instruction set
ARM | Advanced RISC Machines | Machine architecture developed from RISC
MIPS | Microprocessor without Interlocked Pipeline Stages | RISC instruction set architecture developed by MIPS Technologies
RAM | Random Access Memory | Random access memory used in the thesis
CPU | Central Processing Unit | Central processing unit using the MIPS superscalar architecture
FPGA | Field Programmable Gate Array | Large-scale integrated circuits using user-programmable logic element array structures
ALU | Arithmetic Logic Unit | Arithmetic logic unit used in the thesis
PLL | Phase Locked Loop | Closed-loop frequency control system
RTL | Register Transfer Level | Register transfer level used to develop the MIPS core
CDMA | Central Direct Memory Access | Xilinx intellectual property that supports data transfer
MMCM | Mixed-Mode Clock Manager | Hybrid mode clock management controller


THESIS ABSTRACT

The thesis comprises two main parts: designing a local memory for the MIPS processor inherited from a previous thesis, and packaging the processor as an IP that follows the AXI4 communication standard.

The first part studies the design of a Superscalar MIPS processor. The processor consists of two ALU blocks that handle integer instructions (signed and unsigned), along with an additional Multiply/Divide block that performs integer multiplication and division. In addition, the System Cache IP provided by Xilinx is used as the Instruction Memory (I-Cache) and Data Memory (D-Cache).

The second part concerns packaging the MIPS processor. Most IPs in the Vivado software communicate with each other through the AXI4 or AXI3 bus. In this thesis, the MIPS processor is packaged following the AXI4 communication standard. The processor is then connected to other IPs in the Vivado Block Design, which ties the two parts together. In the Block Design, the MIPS processor acts as a Master: it requests data from Slaves through the AXI4 bus, performs computations, and stores the values in the Data Memory. It can also send the computed values back to the Slaves or output them over the UART.

Through these two parts, we hope to help spread awareness of the benefits of MIPS processors, to propose an approach for designing MIPS processors in particular and other processors in general, and to contribute to the development of integrated circuits in Vietnam.


Entering the 21st century, wireless communication technology is considered a leading trend of the IoT era and a driving force behind the development of numerous useful IoT applications. In addition, rapid urbanization and the large number of IoT device users have created tremendous demand for powerful processors. However, the high cost of fabricating complex processors, coupled with material and labor shortages caused by the COVID-19 pandemic, further underlines the importance of processors in IoT devices.

This situation shows the need to optimize processors. One proposed solution is to design processors tailored specifically to IoT devices. A processor with a simplified instruction set can reduce the number of logic gates in the design and therefore lower the processor's power consumption. Manufacturers can then adjust their cost structure more reasonably, which reduces costs for end users.

As a result, processors not only become more affordable but also offer faster processing, allowing IoT devices to handle tasks more efficiently and flexibly, in step with the pace of modern growth.

Based on this analysis, our group proposes a thesis titled "Implementation of MIPS 32-bit Superscalar on FPGA." The thesis consists of two main components. The first designs a MIPS processor using the Superscalar architecture, which increases the number of instructions that can be processed in the processor pipeline. The second refines the MIPS processor into a Master that can communicate with Slaves such as MIG (DDR3), BRAM, and peripherals through the AXI4 bus. The objective of this component is to approach the design of modern processors and contribute to the development of circuit design in Vietnam.

Chapter 1 OVERVIEW

1.1 Microprocessor architecture

1.1.1 Overview of microprocessors

Driven by remarkable technological advances, processors have emerged and developed rapidly. Well-known chip makers such as Intel, Apple, AMD, Qualcomm, and MediaTek have introduced their own branded processors, which have been widely commercialized.

A processor, also known as a central processing unit (CPU), is an electronic component of a computer built from tiny transistors integrated onto a single chip. Although the CPU is the best-known processor, other components in a computer also contain processors of their own; graphics cards are one example. Before the advent of microprocessors, CPUs were built from separate small-scale integrated circuits, each containing only about a dozen transistors.

In the early 1970s, the first processing chips appeared and were used for electronic calculations or algorithms involving binary-coded decimal (BCD) numbers. Subsequently, 4-bit and 8-bit processing systems were introduced. The most significant 32-bit design was the MC68000 chip (68K), introduced in 1979. It featured a large memory space, high speed, and reasonable cost, making it the most famous CPU design. The world's first processor with full 32-bit data paths, a 32-bit bus structure, and 32-bit addresses was the BELLMAC-32A chip by AT&T Bell.

The ARM processor first appeared in 1985. It is a 32-bit processor with a RISC architecture. ARM has excelled in embedded systems, offering high performance and a wide range of development tools. During this period, other RISC processors also achieved great success, such as the MIPS R2000 and MIPS R3000.

The first 32-bit microprocessor chip in Vietnam was the VN1632, designed by ICDREC and announced in January 2010. It was designed using 120 nm technology and had a maximum operating frequency of 100 MHz. It incorporated most of the features of a typical 32-bit microprocessor, but the design team was inexperienced and faced many difficulties during the design process. As a result, the VN1632 could not compete with the 32-bit processors available worldwide at that time.

The rapid progress of microprocessors partly reflects Moore's Law: transistor counts have grown steadily over the years, and performance has grown with them. Furthermore, the world is in the midst of the fourth industrial revolution, in which the semiconductor industry is considered paramount. Microprocessors are therefore becoming increasingly complex and sophisticated. Understanding the fundamentals of microprocessor architecture gives us essential knowledge and a foundation for developing more advanced microprocessors in the future. It also contributes to the overall advancement of processor chips globally.

1.1.2 CISC architecture

CISC (Complex Instruction Set Computer) is a type of microprocessor designed to make programming in high-level languages easier and more straightforward. CISC chips are designed to be easy to program and to use memory efficiently. A single CISC instruction can replace a long sequence of simpler machine instructions. For example, instead of requiring a compiler to emit a lengthy sequence of machine instructions to calculate a square root, a CISC processor provides a built-in capability for this task.

Many early computers were programmed in assembly language, at a time when memory was slow and expensive. CISC architectures were commonly implemented in large computers such as the PDP-11 and other DEC systems. Some characteristics of CISC processors are:

• A large number of instructions, resulting in complex decoding logic.
• Special instructions that are used infrequently.
• Some instructions are longer than 32 bits.
• Fewer general-purpose registers, since operations can be performed directly in memory.
• Some CISC designs provide two special stack-pointer registers for managing interrupts.

CISC processors offer the following advantages:

• New instructions can easily be added to the chip without changing the structure of the instruction set.
• The architecture allows efficient use of main memory.
• Compiler complexity is relatively low, because instructions can be designed to match the structure of high-level languages.

However, CISC comes with the following disadvantages:

• Each new version of a processor line mostly contains the previous generations as a subset. As a result, the instruction set and the chip hardware become more complex with each generation.
• Performance suffers because different instructions take different numbers of clock cycles.
• The chips are larger because they require more logic gates.

1.1.3 RISC architecture

RISC (Reduced Instruction Set Computer) is a processor design approach that simplifies the instruction set so that all instructions take the same time to execute. Common RISC processors include ARM, SuperH, MIPS, SPARC, DEC Alpha, PA-RISC, PIC, and PowerPC.

RISC processors are designed to execute a limited set of instructions, which suits smaller computers and allows higher operating speeds. RISC instruction sets typically contain fewer than 100 instructions and use fixed-length instruction formats. The approach uses a small number of simple addressing modes built on register-based instructions. In this load/store design, LOAD and STORE are the only instructions that access memory.

RISC processors have the following characteristics:

• One-cycle execution time: RISC processors typically achieve a CPI (clock cycles per instruction) of one. This is achieved by optimizing each instruction on the CPU and by using a technique called pipelining.
• Pipelining: a technique that executes several instructions concurrently, each in a different stage, to increase execution efficiency.
• Large number of registers: RISC designs often incorporate a large number of registers to minimize interaction with memory.
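The benefit of pipelining described above can be sketched with a simple cycle count. This is a minimal illustration, assuming an ideal pipeline in which every stage takes one clock cycle and no stalls occur; the function names and the workload of 100 instructions on 5 stages are hypothetical and do not come from the thesis design.

```python
def cycles_without_pipeline(num_instructions: int, num_stages: int) -> int:
    """Every instruction passes through all stages before the next one starts."""
    return num_instructions * num_stages


def cycles_with_pipeline(num_instructions: int, num_stages: int) -> int:
    """The first instruction needs num_stages cycles to fill the pipeline;
    after that, one instruction completes per cycle (no stalls assumed)."""
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


if __name__ == "__main__":
    n, k = 100, 5  # hypothetical: 100 instructions on a 5-stage pipeline
    print(cycles_without_pipeline(n, k))  # 500
    print(cycles_with_pipeline(n, k))     # 104
```

For long instruction streams the pipelined count approaches one cycle per instruction, which is where the CPI of one mentioned above comes from. Real pipelines fall short of this ideal whenever hazards force stalls.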

RISC processors offer the following advantages:

• The control unit occupies less of the processor area, down from 60% (for CISC processors) to as low as 10% (for RISC processors). This leaves room for more cache memory or logic gates within the processor.
• High computational speed due to simplified instruction decoding.
• A large number of general-purpose registers, which reduces the need for frequent memory access.
• Uniform execution time for each instruction stage, which allows processing to be accelerated through pipelining.
• Wide addressing formats for efficient memory management.

However, RISC also has some disadvantages:

• Memory access is restricted to the memory read and write instructions; no other instruction can access memory.
• Limited instructions to support high-level languages.
• The RISC architecture requires the hardware on the chip to be redesigned for each version.
• The performance of RISC processors depends on the skill of the programmer or the compiler. The compiler plays an important role in translating CISC code into RISC code.

1.1.4 Comparing CISC and RISC

CISC and RISC are the two predominant computer architectures in use today. The main difference between them lies in the number of clock cycles each instruction takes to complete: a CISC instruction may require more cycles to execute than a RISC instruction.

This difference in cycle count stems from the complexity and purpose of the instructions in each architecture. In RISC, each instruction accomplishes a very small task, so a complex task requires stringing together multiple instructions. In CISC, each instruction is more like a high-level language statement: only a few instructions are needed to achieve what the program wants, because each instruction performs multiple steps.

In terms of code, RISC programs have longer instruction sequences than CISC programs, because each small step may require a separate instruction, whereas a single CISC instruction encompasses multiple steps. Although CISC may be easier for programmers, it has drawbacks. CISC can be less efficient than RISC, because complex instructions may perform work that is not needed every time they are used, which wastes cycles. With RISC, programmers can eliminate unnecessary code and avoid wasted cycles.
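The trade-off above between instruction count and cycles per instruction can be made concrete with the standard relation: execution time = instruction count × cycles per instruction × clock period. The sketch below is only an illustration. All numbers are hypothetical and chosen for the example; they are not measurements of any real CISC or RISC processor.

```python
def execution_time_ns(instruction_count: int,
                      cycles_per_instruction: int,
                      clock_period_ns: int) -> int:
    """Execution time = instruction count x CPI x clock period."""
    return instruction_count * cycles_per_instruction * clock_period_ns


if __name__ == "__main__":
    # Hypothetical figures for the same task (not measurements):
    # a CISC-style program with few, slow instructions ...
    cisc_time = execution_time_ns(4, 6, 10)
    # ... and a RISC-style program with more, single-cycle instructions.
    risc_time = execution_time_ns(12, 1, 10)
    print(cisc_time)  # 240
    print(risc_time)  # 120
```

Under these assumed numbers the longer RISC sequence still finishes sooner. With different assumptions the outcome can be reversed, which is why both factors have to be considered together.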

These differences may matter to technology enthusiasts, but they mean little to most users. CISC has dominated desktop computing through Intel's x86 architecture, which underlies most modern personal computers. RISC, in contrast, has found its way into mobile devices such as smartphones, tablets, and GPS receivers. ARM is one notable RISC architecture used in these devices. The higher efficiency of the RISC architecture makes it attractive in these applications, where cycles and power are often constrained.

1.2 MIPS architecture

1.2.1 Development history

MIPS (Microprocessor without Interlocked Pipeline Stages) is an instruction set architecture (ISA) in the RISC (Reduced Instruction Set Computer) family, developed by MIPS Technologies. MIPS initially had a 32-bit architecture and was later followed by a 64-bit version. Over time it has undergone several revisions, resulting in versions such as MIPS I, MIPS II, MIPS III, MIPS IV, MIPS V, MIPS32, and MIPS64. The current versions are MIPS32 and MIPS64. There are also several optional extensions, including MIPS-3D with a SIMD instruction set, MIPS16e for instruction compression to reduce program size, and MIPS MT for multithreading support.

In 1981, researchers led by John L. Hennessy at Stanford University began the initial research on the MIPS microprocessor. The fundamental concept was to improve performance through instruction pipelining, a well-known technique that was challenging to implement. Hennessy later left Stanford University to found his own company, MIPS Computer Systems, in 1984. The company's first design was the R2000, introduced in 1985, followed by the R3000 in 1988. In 1991, MIPS Computer Systems released the first R4000 microprocessor. SGI, one of MIPS' customers, acquired the company and renamed it MIPS Technologies.

Today, MIPS continues to be widely used and actively developed in devices such as IoT devices, Arduino boards, and automotive applications. Modern MIPS processors emphasize performance and power efficiency. Along with functional advances, the architectural complexity of MIPS has increased significantly compared with the original R2000 introduced in 1985.

Here are some devices that utilize MIPS processors:

Figure 1.1: Nintendo 64 (MIPS R4300i CPU)

Released in June 1996, the Nintendo 64 was the first game console to use a 64-bit processor. Its CPU is the NEC VR4300, based on the MIPS R4300 processor, running at a clock speed of 93.75 MHz and delivering a raw performance of 125 million instructions per second.

Tesla Model S cars manufactured after September 2014 are equipped with a range of devices designed for autonomous operation, including a windshield-mounted camera, undercarriage radar, and front and rear ultrasonic sensors. In addition, these cars integrate the Mobileye EyeQ3 computer vision chip, based on the MIPS I-class architecture, which provides high-performance capabilities for detecting road signs, lane markings, obstacles, and other vehicles. The combination of sensors and computer vision hardware enables features such as autonomous driving and parking, including the Autopilot feature that gives drivers a hands-free experience on highways.

MIPS is a simple and efficient RISC architecture that is highly scalable and available for licensing. Over time, the architecture has evolved, embracing new technologies and developing a strong ecosystem and comprehensive industry support. The fundamental characteristics of MIPS, such as a large number of registers, the number and size of instructions, and visible pipeline delays, enable the MIPS architecture to deliver high performance for IP cores as well as reasonable power utilization for modern SoC designs.

1.2.2 Instruction set architecture

In the Superscalar architecture, the MIPS32 design adheres to the following instruction set design principles:

• Simple and regular instruction structure: the instruction size is fixed at 32 bits, with a 6-bit opcode field.
• Smaller structures for faster processing: a limited instruction set, a limited number of registers, and a limited number of addressing modes.
• Speed up common cases: operands are taken from registers, and instructions can contain immediate operands.
• Instructions follow three common formats.
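Because every instruction is 32 bits wide with the opcode in a fixed 6-bit field, decoding reduces to shifts and masks. The sketch below illustrates this in software. The function names are hypothetical, and the R-format field positions follow the standard MIPS32 layout, which is assumed here rather than taken from this section.

```python
def decode_opcode(word: int) -> int:
    """Return the 6-bit opcode held in bits 31..26 of a 32-bit instruction."""
    return (word >> 26) & 0x3F


def decode_r_format(word: int) -> dict:
    """Split a word into R-format fields (standard MIPS32 layout, assumed)."""
    return {
        "opcode": (word >> 26) & 0x3F,  # bits 31..26
        "rs": (word >> 21) & 0x1F,      # bits 25..21, first source register
        "rt": (word >> 16) & 0x1F,      # bits 20..16, second source register
        "rd": (word >> 11) & 0x1F,      # bits 15..11, destination register
        "shamt": (word >> 6) & 0x1F,    # bits 10..6, shift amount
        "funct": word & 0x3F,           # bits 5..0, function code
    }


if __name__ == "__main__":
    # 0x02324020 encodes "add $t0, $s1, $s2" in standard MIPS32.
    print(decode_r_format(0x02324020))
    # 0x8E680020 encodes "lw $t0, 32($s3)"; its opcode is 35.
    print(decode_opcode(0x8E680020))
```

The fixed field positions are what keep hardware decoding simple: the opcode can be read before the instruction format is known.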

Regarding the data storage principles in memory:

• Alignment Restriction principle: Objects stored in memory must start at an address that is a multiple of the object's size (typically a multiple of 4).
• Big Endian principle: The high byte is stored in the memory location with the lower address, while the low byte is stored in the memory location with the higher address.
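As a rough illustration of these two rules, the following Python sketch checks an alignment condition and lays out a 32-bit word in big-endian order. It is not part of the thesis design; the helper names `is_aligned` and `store_word_big_endian` are hypothetical.

```python
import struct

def is_aligned(address: int, size: int) -> bool:
    """Return True if an object of `size` bytes may start at `address`."""
    return address % size == 0

def store_word_big_endian(memory: bytearray, address: int, value: int) -> None:
    """Write a 32-bit word so that the high byte lands at the lowest address."""
    if not is_aligned(address, 4):
        raise ValueError("word access must be 4-byte aligned")
    memory[address:address + 4] = struct.pack(">I", value & 0xFFFFFFFF)

memory = bytearray(8)
store_word_big_endian(memory, 4, 0x12345678)
print(memory.hex())  # bytes 4..7 hold 12 34 56 78, high byte first
```

Storing at address 6 would raise an error in this sketch, because 6 is not a multiple of the word size.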

MIPS follows a register-to-register, load/store philosophy, which means that instructions operate on registers. When memory is used, separate load/store instructions transfer data between memory and registers.

Each register in MIPS stores a 32-bit value. Unlike in high-level programming languages, registers in assembly language do not have a data type. The way a register is used determines its data type.

MIPS processors include 32 general-purpose registers, each with a size of 32 bits:

Table 1.1: Register set in MIPS

Register name   Register index (Decimal)   Function of register
$zero (*)       0                          Contains the constant value 0
$at             1                          Reserved for the assembler
$v0 - $v1       2 - 3                      Contain the return value of a function or procedure
$a0 - $a3       4 - 7                      Store the input parameters of a function
$ra (*)         31                         Return address

In this thesis, the MIPS processor also uses additional registers for the following purposes:

Register name          Function of register
High Register (*)      Stores the upper 32 bits
Low Register (*)       Stores the lower 32 bits
Program Counter (*)    Contains the current address

(Note: * marks the registers used in this thesis)

Next, we present the structure of assembly instructions when translated into machine language. Each instruction in MIPS is 32 bits long. Each instruction can be seen as a function in a programming language. Therefore, we need the instruction name, the parameters passed to the instruction, and the type of each parameter, which in this case is the size of each parameter (as there is no concept of data type in assembly language).

MIPS focuses on the simplicity of the instruction set, so there are three main instruction formats: R-format, I-format, and J-format. Each machine instruction is divided into groups of bits, called fields, with each field serving a specific role in the instruction. Depending on the structure of the instruction, each field has different specifications. The R-format instruction has the following six fields:

Table 1.3: Instruction format structure R

Opcode   Rs      Rt      Rd      Shamt   Funct
6-bit    5-bit   5-bit   5-bit   5-bit   6-bit

Opcode: Specifies the operation to be performed. All R-format instructions share the same opcode, 0.
Rs (Register Source): Specifies the first source register.
Rt (Register Target): Specifies the second source register.
Rd (Register Destination): Specifies the destination register.
Shamt (shift amount): Specifies the amount of shifting or rotating to be performed.
Funct: Specifies the specific function within the opcode.
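The field layout can be made concrete with a short Python sketch that splits a 32-bit machine word into the six R-format fields. The function name `decode_r_format` is a hypothetical helper, not something defined in the thesis.

```python
def decode_r_format(word: int) -> dict:
    """Split a 32-bit R-format instruction word into its six fields."""
    return {
        "opcode": (word >> 26) & 0x3F,  # bits 31..26
        "rs":     (word >> 21) & 0x1F,  # bits 25..21
        "rt":     (word >> 16) & 0x1F,  # bits 20..16
        "rd":     (word >> 11) & 0x1F,  # bits 15..11
        "shamt":  (word >> 6) & 0x1F,   # bits 10..6
        "funct":  word & 0x3F,          # bits 5..0
    }

# 0x012A4020 is the standard encoding of "add $t0, $t1, $t2".
fields = decode_r_format(0x012A4020)
print(fields)
```

For this word the opcode is 0, as for every R-format instruction, and the funct field (0x20) selects the ADD operation.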


Next is the I-format instruction, which is used for operations between registers and an immediate value contained in the instruction:

Opcode   Rs      Rt      Immediate
6-bit    5-bit   5-bit   16-bit

Opcode: Specifies the operation to be performed. Since the I-format has no funct field, I-format instructions do not share a single opcode as the R-format instructions do.
Rs (Register Source): Specifies the source register.
Rt (Register Target): Specifies the destination or source register, depending on the instruction.
Immediate: Specifies the immediate value or constant used in the instruction.

The J-format is used for jump instructions, similar to the 'goto' statement in C. It is used by the j and jal instructions and has the following structure:

Table 1.5: Instruction format structure J

Opcode   Target address
6-bit    26-bit

Opcode: Specifies the opcode of the jump instruction.
Target address: The shortened address of the jump target, obtained by reducing the original 32-bit address to 26 bits as follows:

• Discard the two least significant bits of the address. Since the addresses of MIPS instructions are always multiples of 4, the two least significant bits are always 0.
• Drop the four most significant bits, which are considered to be the same as the four most significant bits of the address of the current instruction.
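The two steps can be sketched in Python as an encode/decode pair. The function names are hypothetical, and the decode step follows the description in the text by taking the upper four bits from the address of the current instruction.

```python
def encode_jump_target(target_address: int) -> int:
    """Keep bits 27..2 of a word-aligned address as the 26-bit target field."""
    if target_address % 4 != 0:
        raise ValueError("instruction addresses must be multiples of 4")
    return (target_address >> 2) & 0x03FFFFFF

def decode_jump_target(pc: int, field: int) -> int:
    """Rebuild the 32-bit target from the field and the upper 4 bits of pc."""
    return (pc & 0xF0000000) | ((field & 0x03FFFFFF) << 2)

field = encode_jump_target(0x00400024)
target = decode_jump_target(0x00400000, field)
print(hex(field), hex(target))
```

The round trip only recovers the original address when the jump target and the current instruction share the same upper four address bits.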

1.2.3 Executable instructions in the thesis

Next, we would like to present the instruction groups that are used for studying

the design of the MIPS processor in the scope of this thesis:


Table 1.6: Groups of processing instructions

Instruction   Full name                      Handling action

Arithmetic and logic instructions
ADD           Add                            rd = rs + rt
ADDI          Add Immediate                  rt = rs + imm
ADDIU         Add Immediate Unsigned         rt = rs + imm
...
SLT           Set On Less Than               rd = rs < rt
SLTI          Set On Less Than Immediate     rt = rs < imm
SLTU          Set On Less Than Unsigned      rd = rs < rt

Instructions that transfer values between general and special registers
MFHI          Move From HI                   rd = HI
MFLO          Move From LO                   rd = LO


SLLV          Shift Left Logical Variable
BEQ           Branch On Equal                if(rs == rt) pc += offset * 4
BGEZ          Branch On >= 0                 if(rs >= 0) pc += offset * 4
BGEZAL        Branch On >= 0 And Link        r31 = pc; if(rs >= 0) pc += offset * 4
BGTZ          Branch On > 0                  if(rs > 0) pc += offset * 4
BLEZ          Branch On <= 0                 if(rs <= 0) pc += offset * 4
BLTZ          Branch On < 0                  if(rs < 0) pc += offset * 4
BLTZAL        Branch On < 0 And Link         r31 = pc; if(rs < 0) pc += offset * 4
BNE           Branch On Not Equal            if(rs != rt) pc += offset * 4
J             Jump                           pc = pc_upper | (target << 2)
JAL           Jump And Link                  r31 = pc; pc = target << 2
JALR          Jump And Link Register         rd = pc; pc = rs
JR            Jump Register                  pc = rs

Load instructions from memory
LB            Load Byte                      rt = *(char*)(offset + rs)
LBU           Load Byte Unsigned             rt = *(Uchar*)(offset + rs)
LH            Load Halfword                  rt = *(short*)(offset + rs)
LHU           Load Halfword Unsigned         rt = *(Ushort*)(offset + rs)


SB            Store Byte                     *(char*)(offset + rs) = rt
SH            Store Halfword                 *(short*)(offset + rs) = rt
SW            Store Word                     *(int*)(offset + rs) = rt
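The difference between the signed and unsigned load variants in the table can be illustrated with a small Python model of LB and LBU. This is a sketch only: `load_byte` is a hypothetical helper, and memory is modelled as a plain byte string.

```python
def load_byte(memory: bytes, rs: int, offset: int, signed: bool) -> int:
    """Model LB (signed=True) or LBU (signed=False); returns the 32-bit register value."""
    value = memory[rs + offset]
    if signed and value & 0x80:
        value |= 0xFFFFFF00  # replicate bit 7 into bits 31..8
    return value

memory = bytes([0x7F, 0x80])
lb_result = load_byte(memory, 0, 1, signed=True)
lbu_result = load_byte(memory, 0, 1, signed=False)
print(hex(lb_result), hex(lbu_result))  # 0xffffff80 0x80
```

The same byte in memory therefore produces two different register values, depending on whether the instruction sign-extends or zero-extends it.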

Chapter 2 DESIGNING MIPS32 PROCESSOR

2.1 Overview of processor design

This chapter presents the first part of the thesis, which focuses on the research and design of a Superscalar MIPS processor. The processor consists of two ALU blocks for handling integer instructions (both signed and unsigned). Additionally, a Mult/Div block is added to handle integer multiplication and division operations. The design also incorporates two Xilinx System Cache IP cores (one as the I-Cache and one as the D-Cache), serving as the Instruction Memory and Data Memory.


In this thesis, we propose a Superscalar design that is pipelined into six stages as follows:

• IF1 stage: The Program Counter outputs the instruction address, and the Instruction Memory fetches the corresponding instruction.
• IF2 stage: The Instruction Memory completes the instruction fetch.
• ID stage: Reads the values of registers from the Register File. In this stage, the processor handles issues such as hazards, data forwarding, and branch/jump instructions.
• EX stage: Executes arithmetic and logic instructions or reads data from the Data Memory.
• MEM stage: Performs write operations to the Data Memory, or completes the data read operation if the instruction is a load instruction.
• WB stage: Selects the value from the Data Memory (for load instructions) or from the ALU block and stores it back into the Register File.

Figure 2.2: System Cache

This is Xilinx's System Cache IP, operating in 2-way set-associative mode. In this thesis, the MIPS core uses two of these caches: one as the Instruction Cache (IMEM) and one as the Data Cache (DMEM). The inputs and outputs of both caches are as shown in the figure above.


We control the cache exactly as we would control an AXI transaction, which is explained in the next chapter.

Port definitions:

Table 2.1: Signals of block

Signal name    Description
S0_AXI_GEN     AXI slave port 0
S1_AXI_GEN     AXI slave port 1

Function of Program Counter block:

• The Program Counter is the control unit in the processor that tracks the address of the current or next instruction.
• The Instruction Memory operates in Dual Port mode, which means the Program Counter needs to output two instruction addresses in order to read two instructions through the two ports of the Instruction Memory.
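A dual-issue program counter of this kind can be modelled in a few lines of Python. This is a minimal sketch under stated assumptions, not the thesis RTL: the class name `ProgramCounterModel` is hypothetical, the reset address is assumed to be 0, instruction B is assumed to sit 4 bytes after instruction A, and a hazard is assumed to hold both addresses for one cycle.

```python
from typing import Tuple

class ProgramCounterModel:
    """Behavioural model of a PC that issues two fetch addresses per cycle."""

    def __init__(self, reset_address: int = 0):
        self.a_address = reset_address  # assumed reset value

    @property
    def b_address(self) -> int:
        # Assumption: instruction B directly follows instruction A.
        return (self.a_address + 4) & 0xFFFFFFFF

    def tick(self, hazard: bool = False) -> Tuple[int, int]:
        """Return the address pair for this cycle, then advance unless stalled."""
        pair = (self.a_address, self.b_address)
        if not hazard:
            # Two instructions were fetched, so skip ahead by 8 bytes.
            self.a_address = (self.a_address + 8) & 0xFFFFFFFF
        return pair

pc = ProgramCounterModel()
print(pc.tick())             # first pair
print(pc.tick(hazard=True))  # second pair, held for the next cycle
print(pc.tick())             # second pair again
```

The actual design may respond to a hazard differently, for example by issuing only instruction A; the stall policy here is purely illustrative.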

The input/output ports and their meanings are as follows:

Table 2.2: Signals of Program Counter Block

Signal name        Description
clk                Clock pulse
aresetn            Active-low reset signal
HazardSignal       Signals that an instruction conflict has occurred
PC                 Address of instruction B of the previous instruction pair
A_Address          Address of the current instruction A
B_Address          Address of the current instruction B
ReadInstruction    This is the Enable signal of the Program

2.2.3 Control Unit Block

Figure 2.4: Control Unit Block

Function of Control Unit block:

• The Control Unit decodes the Opcode and Funct fields to generate control signals for the Datapath.
• After receiving the Opcode and Funct fields from the instruction, the Control Unit generates the corresponding control signals for each stage.
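The decoding step can be sketched as a lookup from (Opcode, Funct) to a set of control signals. The signal names below are taken from the ControlSignal list of this block, but the values assigned to each instruction follow the usual textbook convention and are an assumption, not the thesis's actual encoding. Only three instructions are covered, and `decode_control` is a hypothetical name.

```python
# Standard MIPS32 encodings for the three instructions used in this sketch.
OPCODE_RTYPE, OPCODE_LW, OPCODE_SW = 0x00, 0x23, 0x2B
FUNCT_ADD = 0x20

def decode_control(opcode: int, funct: int) -> dict:
    """Return assumed control-signal values for ADD, LW and SW."""
    signals = {"RegWrite": 0, "RegDst": 0, "ALUSrc": 0,
               "MemRead": 0, "MemWrite": 0, "MemtoReg": 0}
    if opcode == OPCODE_RTYPE and funct == FUNCT_ADD:
        signals.update(RegWrite=1, RegDst=1)       # write rd with the ALU result
    elif opcode == OPCODE_LW:
        signals.update(RegWrite=1, ALUSrc=1,       # address = rs + immediate
                       MemRead=1, MemtoReg=1)      # write rt with the loaded data
    elif opcode == OPCODE_SW:
        signals.update(ALUSrc=1, MemWrite=1)       # store rt, no register write
    return signals

print(decode_control(OPCODE_LW, 0))
```

Any instruction outside these three leaves every signal at 0 in this sketch, so it behaves like a no-operation.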

Table 2.3: Signals of Control Unit block

Signal Name      Description
Opcode           Combines with Funct to define what the instruction should do
                 Needed to distinguish conditional branch instructions
ControlSignal    Includes the following signals: Sign, Jump, Branch, HItoReg, LOtoReg, ALUSrc, RegDst, MemWrite, MemRead, RegWrite, MemtoReg, RegtoHI, RegtoLO
ALUControl       Tells the ALU which operation to perform
SignalWrite      Indicates how the instruction writes to Data Memory: sw, sh, sb
                 Indicates how the instruction reads from Data Memory: lw, lh, lhu, lb, lbu

SUBU
SUB
Zal/MFHI/MFLO
OR
NOR
MULTU
MULT
DIVU
DIV
SLTU
MTHI
SLT
10111    MTLO

(Zal stands for branch/jump-and-link instructions)

Finally, we have the bit encoding convention for BranchIns. Since the processor has 12 different branch and jump instructions, it is essential to differentiate between them.

Table 2.5: BranchIns bit encoding convention

BranchIns[11:0]    Description
BranchIns[0]       JALR
BranchIns[1]       JR


Figure 2.5: Register File Block

Function of Register File Block:

• Manages the reading and writing of the processor's computation results.
• Receives register addresses and reads or writes the corresponding values; it can read and write simultaneously in the same cycle.
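A behavioural Python model can show what simultaneous read and write for two instructions per cycle might look like. The class `RegisterFileModel` and its `cycle` method are hypothetical. The sketch assumes that writes take effect before reads in the same cycle, that register 0 is hard-wired to zero, and that instruction B wins if both instructions write the same register. None of these policies are confirmed by the thesis.

```python
class RegisterFileModel:
    """32 x 32-bit register file with read and write ports for slots A and B."""

    def __init__(self):
        self.regs = [0] * 32

    def cycle(self, reads, writes):
        """reads: list of register indices.
        writes: list of (enable, index, value), in the order A then B."""
        for enable, index, value in writes:
            if enable and index != 0:          # $zero always stays 0
                self.regs[index] = value & 0xFFFFFFFF
        # Assumed write-first behaviour: reads observe this cycle's writes.
        return [self.regs[index] for index in reads]

rf = RegisterFileModel()
values = rf.cycle(reads=[8, 0, 9, 8],
                  writes=[(True, 8, 5), (True, 0, 7)])
print(values)
```

In this call, slot A writes 5 to register 8 while slot B attempts to write register 0, and all four read ports are sampled in the same cycle.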

Table 2.6: Signals of Register File Block

Signal Name        Description
clk                Clock pulse
aresetn            Active-low reset signal
A_ReadRegister1    Address of register 1 of instruction A
A_ReadRegister2    Address of register 2 of instruction A
A_WriteData        Value to be written by instruction A
A_WriteRegister    Address of the destination register
A_Regwrite         Request signal to save the value of instruction A
B_ReadRegister1    Address of register 1 of instruction B
B_ReadRegister2    Address of register 2 of instruction B
B_WriteData        Value to be written by instruction B
B_WriteRegister    Address of the destination register
                   Value read from register 1 of instruction B
                   Value read from register 2 of instruction B

2.2.5 Register High/Low Block

Figure 2.6: Register High/Low Block

Function of Register High/Low block:

• The High register and Low register are two registers used to store the results of multiplication and division, since these results are 64 bits wide.
• The Register High/Low block is located in the ID stage, at the same position as the Register File.
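To see why two registers are needed, the following Python sketch splits the 64-bit product of two 32-bit operands into a High and a Low half, as MULT and MULTU do. The helper names `to_signed32` and `mult_hi_lo` are hypothetical.

```python
def to_signed32(value: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value

def mult_hi_lo(rs: int, rt: int, signed: bool = True):
    """Return (HI, LO) for MULT (signed=True) or MULTU (signed=False)."""
    a = to_signed32(rs) if signed else rs & 0xFFFFFFFF
    b = to_signed32(rt) if signed else rt & 0xFFFFFFFF
    product = (a * b) & 0xFFFFFFFFFFFFFFFF   # keep 64 bits, two's complement
    return (product >> 32) & 0xFFFFFFFF, product & 0xFFFFFFFF

print([hex(x) for x in mult_hi_lo(0xFFFFFFFF, 2, signed=True)])
print([hex(x) for x in mult_hi_lo(0xFFFFFFFF, 2, signed=False)])
```

The same bit patterns give different High halves for the signed and unsigned cases, which is why the two instruction variants exist.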


Table 2.7: Signals of Register High/Low Block

Signal Name       Description
clk               Clock pulse
A_HighData        High 32-bit value to be written by instruction A
A_LowData         Low 32-bit value to be written by instruction A
A_HiLoRegWrite    Write enable signal for the High or Low register of instruction A
A_Read            Selects reading the Low register (by default the High register is read)
B_HighData        High 32-bit value to be written by instruction B
B_LowData         Low 32-bit value to be written by instruction B
B_HiLoRegWrite    Write enable signal for the High or Low register of instruction B
B_Read            Selects reading the Low register (by default the High register is read)
A_ReadData        Output read value of instruction A
B_ReadData        Output read value of instruction B

2.2.6 Comparator Block

Figure 2.7: Comparator Block (inputs ReadData1[31:0], ReadData2[31:0], BranchIns[11:0]; outputs IsThatBranch, IsThatBranchZal, BranchTaken, BranchZal)

Function of Comparator Block:

• The block compares the two ReadData input values and, combined with the BranchIns input signal, determines the type of branch/jump instruction and whether it is taken.
• In this processor, there are 8 types of branch instructions and 4 types of jump instructions, and the Comparator block operates based on the BranchIns signal.
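The comparison logic can be sketched in Python as follows. Because most of the BranchIns bit positions are not listed here, the sketch selects the instruction by mnemonic string instead of by one-hot bit, which is an assumption made for readability. The function `branch_taken` is a hypothetical name.

```python
def to_signed32(value: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value

def branch_taken(kind: str, read_data1: int, read_data2: int) -> bool:
    """Decide whether a branch or jump of the given kind is taken."""
    a = to_signed32(read_data1)
    equal = (read_data1 & 0xFFFFFFFF) == (read_data2 & 0xFFFFFFFF)
    conditions = {
        "BEQ": equal, "BNE": not equal,
        "BGEZ": a >= 0, "BGEZAL": a >= 0,
        "BGTZ": a > 0, "BLEZ": a <= 0,
        "BLTZ": a < 0, "BLTZAL": a < 0,
        # Jumps are unconditional.
        "J": True, "JAL": True, "JR": True, "JALR": True,
    }
    return conditions[kind]

print(branch_taken("BLTZ", 0xFFFFFFFF, 0))  # 0xFFFFFFFF is -1 when signed
```

The dictionary covers the 8 branch types and 4 jump types mentioned above; comparisons against zero only use the first operand.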


Table 2.8: Signals of Comparator Block

Signal Name        Description
ReadData1          Input value 1
ReadData2          Input value 2
BranchIns          Specifies the type of branch/jump instruction
IsThatBranch       Indicates whether this is a conditional branch instruction or a jump instruction
IsThatBranchZal    Indicates whether this is a jump-and-link instruction
BranchTaken        Indicates whether the branch/jump instruction is taken
BranchZal          Indicates whether the jump-and-link instruction is taken

2.2.7 Sign Extend Block

Figure 2.8: Sign Extend Block (inputs Immediate[15:0] and Sign; output SignEx[31:0])

Function of Sign Extend Block:

• The Sign Extend block takes a 16-bit input and extends it to a 32-bit value, with or without sign extension, depending on the Sign signal.
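A minimal Python sketch of this behaviour is shown below. The function name `extend_immediate` is hypothetical; its `sign` argument plays the role of the Sign signal.

```python
def extend_immediate(immediate: int, sign: bool) -> int:
    """Extend a 16-bit immediate to 32 bits, sign- or zero-extended."""
    immediate &= 0xFFFF
    if sign and immediate & 0x8000:
        return immediate | 0xFFFF0000  # copy bit 15 into bits 31..16
    return immediate                   # upper 16 bits stay 0

print(hex(extend_immediate(0x8000, sign=True)))   # 0xffff8000
print(hex(extend_immediate(0x8000, sign=False)))  # 0x8000
```

The two results differ only when bit 15 of the immediate is set; for values below 0x8000 both modes return the same 32-bit value.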

Table 2.9: Signals of Sign Extend Block

Signal Name    Description
Immediate      Last 16 bits of the instruction
Sign           Signal from the Control Unit block, indicating whether this is a signed or unsigned extension

Figure 2.9: ALU Block

Function of ALU Block:

• The ALU (Arithmetic Logic Unit) is responsible for performing arithmetic and logic operations in the processor (excluding multiplication/division and the MFHI and MFLO instructions).
• It receives operands from two input ports, performs calculations based on the ALUControl signal, and outputs the resulting value through the output port.
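As an illustration of one ALU operation and its overflow output, the following Python sketch models a 32-bit signed addition. Only ADD is shown, because the ALUControl encoding of the other operations is not reproduced here; `alu_add` is a hypothetical name.

```python
MASK32 = 0xFFFFFFFF

def alu_add(value_a: int, value_b: int):
    """Return (Result, overflow) for a 32-bit signed addition."""
    a, b = value_a & MASK32, value_b & MASK32
    result = (a + b) & MASK32
    # Signed overflow: both operands have the same sign bit
    # and the result's sign bit differs from it.
    overflow = ((a ^ result) & (b ^ result) & 0x80000000) != 0
    return result, overflow

print(alu_add(0x7FFFFFFF, 1))  # largest positive value plus one overflows
print(alu_add(0xFFFFFFFF, 1))  # -1 + 1 wraps to 0 without signed overflow
```

The second call shows that a carry out of bit 31 is not the same as signed overflow, which is why the check looks at sign bits rather than at the carry.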

Table 2.10: Signals of ALU Block

Signal Name    Description
ValueA         Operand value 1
ValueB         Operand value 2
ALUControl     The type of operation to be performed
shamt          Number of bits to shift, for shift operations
Result         Calculation result
overflow       Indicates overflow for signed calculations


• The Mult/Div block is responsible for performing integer multiplication and integer division (both signed and unsigned), and for transferring values between the Register File and the Register High/Low.
• The Mult/Div block uses the hardware algorithms discussed in the "Computer Architecture" course. It has a latency of 32 clock cycles.
• The reason for separating the Mult/Div block from the ALU is that the MIPS processor does not store multiplication/division results directly in the Register File. Instead, the results are first stored in the Register High/Low and then transferred to the Register File. This allows the processor to continue executing other instructions unrelated to multiplication/division without waiting for the 32-cycle computation of the Mult/Div block.
• The Mult/Div block handles the following instructions: DIV, DIVU, MULT, MULTU, MFHI, and MFLO.
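The 32-cycle latency is consistent with an iterative algorithm that handles one multiplier bit per clock cycle. The Python sketch below shows the textbook shift-and-add scheme for unsigned multiplication. It is an assumption that the thesis uses this exact variant, and `shift_add_multiply` is a hypothetical name.

```python
def shift_add_multiply(multiplicand: int, multiplier: int):
    """Unsigned 32 x 32 -> 64-bit multiply, one loop iteration per clock cycle.
    Returns (HI, LO)."""
    multiplicand &= 0xFFFFFFFF
    multiplier &= 0xFFFFFFFF
    product = 0
    for cycle in range(32):
        if multiplier & 1:
            # Add the multiplicand, shifted to the weight of the current bit.
            product += multiplicand << cycle
        multiplier >>= 1  # move on to the next multiplier bit
    return (product >> 32) & 0xFFFFFFFF, product & 0xFFFFFFFF

print([hex(x) for x in shift_add_multiply(0xFFFFFFFF, 0xFFFFFFFF)])
```

Since the loop always runs 32 times regardless of the operand values, the latency is fixed, which matches the constant 32-cycle figure stated above.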

Table 2.11: Signals of Mult/Div block

clk Clock pulse


Title	Implementation of MIPS 32-bit Superscalar Processor on FPGA
Authors	Vo Hoang Nguyen Tin, Nguyen Phan Nhat Quang
Advisor	PhD. Nguyen Hoainhan
University	University of Information Technology
Major	Computer Engineering
Type	Graduation Thesis
Year of publication	2023
City	Ho Chi Minh City

Format
Pages	100
File size	47.58 MB