HO CHI MINH NATIONAL UNIVERSITY
UNIVERSITY OF INFORMATION TECHNOLOGY
COMPUTER ENGINEERING FACULTY
VO HOANG NGUYEN TIN - 19522352
NGUYEN PHAN NHAT QUANG - 19522095
First and foremost, we would like to express our sincere gratitude to our advisor, Dr. Nguyen Minh Son, for his dedicated guidance throughout the thesis process. He provided valuable assistance and equipment, and invested considerable time in reviewing our work and giving feedback, which enabled us to complete the thesis successfully.
In addition, we would like to extend our thanks to the faculty members, colleagues, and friends of the Computer Engineering Department, as well as the University of Information Technology - Vietnam National University Ho Chi Minh City, for their dedicated teaching, for imparting invaluable knowledge, and for supporting us throughout our academic journey at the University of Information Technology. The knowledge and experience gained from them will serve as a solid foundation and give us confidence as we embark on our future endeavors.
Lastly, we would like to express our deep appreciation to our parents and families for their unwavering support and encouragement, and for the strong emotional support they gave us throughout our educational journey.
During the course of this thesis, we encountered inevitable difficulties and made mistakes due to the limits of our knowledge in this specialized field. Therefore, we sincerely hope that the esteemed professors and colleagues will understand and provide constructive feedback so that we can further develop and refine this thesis in the future.
Once again, we would like to express our heartfelt gratitude.
Ho Chi Minh City, June 29, 2023
The students
Vo Hoang Nguyen Tin
Nguyen Phan Nhat Quang
TABLE OF CONTENTS
Chapter 1 OVERVIEW
1.1 Architecture
1.1.1 Processor overview
1.1.2 CISC architecture
1.1.3 RISC architecture
1.1.4 Compare CISC and RISC
1.2 MIPS architecture
1.2.1 Development history
1.2.2 Instruction set architecture
1.2.3 Executable instructions in the thesis
Chapter 2 DESIGNING MIPS32 PROCESSOR
2.1 Overview in processor design
2.2 Function blocks in processor
2.2.1 IP System cache
2.2.2 Program Counter block
2.2.3 Control Unit block
2.2.4 Register File block
2.2.5 Register High/Low block
2.2.6 Comparator block
2.2.7 Sign Extend block
2.2.8 ALU block
2.2.10 MemLoad block
2.2.11 Forwarding Unit block
2.2.12 Hazard Unit block
2.3 Resolving pipeline conflicts
2.3.1 Processor pipeline architecture
2.3.3 Load Data hazard
2.3.4 Branch hazard
Chapter 3 DESIGNING MIPS32 PROCESSOR WITH AXI4 PROTOCOL
3.1 Package MIPS core with AXI4 protocol
3.2 AXI4 protocol
3.2.1 Protocol overview
3.2.2 Specific signals of the AXI4 protocol in the thesis
3.2.3 Overview Simulation of five communication channels AXI4
3.3 Additional function block
3.3.1 CDMA Control block
Chapter 4 BLOCK DESIGN ON VIVADO
4.1 Overview of Block Design architecture
4.1.1 Architecture design overview
4.1.2 Design on Vivado
4.2 Xilinx IPs used in Block Design
4.2.1 IP Memory Interface Generator (MIG 7 Series)
4.2.2 IP Central Direct Memory Access (IP CDMA)
4.2.3 IP AXI Bram Controller
Chapter 5 SIMULATION AND EVALUATION
5.1 Instruction simulation
5.1.1
5.1.2 Simulation result
Chapter 6 CONCLUSION AND DEVELOPMENT
6.1
6.1.1 Accumulated experiences
6.1.2 Backlog issues
6.1.3
APPENDIX I KIT FPGA XILINX VIRTEX-7 VC707
APPENDIX II GENERATE BITSTREAM STEPS
LIST OF FIGURES
Figure 1.1: Nintendo 64 (MIPS R4300i CPU)
Tesla Model S (MIPS I-class CPU)
Processing stages of processor
System Cache
Program Counter Block
Control Unit Block
Register File Block
Register High/Low Block
Comparator Block
Sign Extend Block
ALU Block
Mult/Div Block
MemLoad Processing Block
Forwarding Unit Block
Hazard Unit Block
Processor pipeline architecture
Forwarding data in pipeline
Forwarding for memory write instruction
Hazard with data load instructions
Figure 2.18: Instruction to load data and calculate data
Figure 2.19: Branch instruction occurs in pipeline
Figure 2.20: Branch instruction is at position B in the instruction pair
Figure 2.21: The branch instruction needs data from the previous instruction
Figure 3.1: MIPS architecture after it's packaged into an IP
Figure 3.2: General architecture of the AXI4 protocol
Figure 3.3: A read transaction of AXI4 [8] protocol
Figure 3.4: A write transaction of AXI4 [8] protocol
Figure 3.5: Simulate read channel AXI4 of processor
Figure 3.6: Simulate write channel AXI4 of processor
Figure 3.7: Simulate write response channel AXI4 of processor
Figure 3.9: CDMA Controller Block
Figure 4.1: Architecture design overview of Block Design
Figure 4.2: Block Design on Vivado Software
Figure 4.3: IP Memory Interface Generator
Figure 4.4: IP Central Direct Memory Access (CDMA)
Figure 4.5: IP Bram Controller and IP Block Memory Generator
Figure 5.1: Test model
Figure 5.2: Instruction in Initial
Figure 5.3: Read request to I-Cache
Figure 5.4: I-CACHE requests a read transaction to load instruction in
Figure 5.5: Write value from AXI bus into I-CACHE
Figure 5.6: The execution process in processor [1]
Figure 5.7: Mult/Div block execute
Figure 5.8: The execution process in processor [2]
Figure 5.9: Write value into Data Memory
Figure 5.10: Write value on AXI4
Figure 5.11: The execution process in processor [3]
Figure 5.12: The execution process in processor [4]
Figure 5.13: The execution process in processor [5]
Figure 5.14: Register File and High/Low Register results [1]
Figure 5.16: Comparing the simulation result with MARS [1]
Figure 5.17: Instructions from Main 1
Figure 5.18: Instructions in Sub1-1
Figure 5.19: Calculation result of the Sub1-1
Figure 5.20: Calculation result of Sub1-1 on MARS
Figure 5.21: Instructions in Sub1-2
Figure 5.22: Saved values in Data Memory on MARS
Figure 5.23: Saved values in Data Memory on Vivado [1]
Figure 5.24: Instructions in Sub1-3
Figure 5.25: Calculation result of the instruction in Sub1-3
Figure 5.26: Calculation result of the instructions in Sub1-3 on MARS
Figure 5.27: Instructions in Sub1-4
Figure 5.28: Calculation result of the instructions in Sub1-4
Figure 5.29: Calculation result of the instructions in Sub1-4 on MARS
Figure 5.30: Instructions in Main 2
Figure 5.31: Instructions in Sub2-1
Figure 5.32: Calculation result of the instructions in Sub2-1
Figure 5.33: Calculation result of the instructions on MARS
Figure 5.34: Instructions in Sub2-2
Figure 5.35: Calculation result of the instructions in Sub2-2
Figure 5.36: Calculation result of the instructions on MARS
Figure 5.37: Results saved in register file after finishing the testing program
Figure 5.38: The result on MARS
Figure 5.39: The values in Data Memory on Vivado
Figure 5.40: The values in Data Memory on MARS
Figure 5.41: Resource utilization
Figure 5.42: Resource utilization of each logic block in the processor
Figure 5.44: The clock speed in the block design
Figure 5.45: Timing result
Figure 5.46: Power consumption
Figure 5.47: Power summary from the old design
Figure 5.48: Utilization summary from the old design
Figure PI.1: KIT FPGA VC707 [6] diagram
Figure PI.2: Structure of KIT FPGA VC707 [6]
Figure PII.1: Constraint file on Vivado
Figure PII.2: Modeling RTL code for Mux 2-to-1 64-bit on Vivado
LIST OF TABLES
Table 1.1: Register set in MIPS
Table 1.2: Other registers
Table 1.3: Instruction format structure R
Table 1.4: Instruction format structure I
Table 1.5: Instruction format structure J
Table 1.6: Groups of processing instructions
Table 2.1: Signals of block
Table 2.2: Signals of Program Counter Block
Table 2.3: Signals of Control Unit Block
Table 2.4: The bit encoding convention for ALUControl
Table 2.5: Branc Ins bit encoding convention
Table 2.6: Signals of Register File Block
Table 2.7: Signals of Register High/Low Block
Table 2.8: Signals of Comparator Block
Table 2.9: Signals of Sign Extend Block
Table 2.10: Signal of ALU Block
Table 2.11: Signals of Mult/Div block
Table 2.12: Signals of MemLoad processing block
Table 2.13: Signals of Forwarding Unit
Table 2.14: Signals of Hazard Unit Block
Table 2.15: Overview of conflict handling in pipeline
Table 3.1: Signals of AXI4 bus
Table 3.3: Signal of CDMA Controller Block
Table 5.1: Instruction computing result [1]
Table 5.2: Instruction computing result [2]
Table 5.3: Instruction computing result [3]
Table 5.4: Instruction computing result [4]
Table 5.5: Instruction computing result [5]
Table PI.1: Describe the location of components on the FPGA VC707 KIT
ABBREVIATION GLOSSARY
Abbreviation | Expanded form | Description
IP | Intellectual Property | Xilinx's intellectual property core used in the thesis
IoT | Internet of Things | Connection of many electrical things together to build a system useful for life
AXI | Advanced eXtensible Interface | Advanced eXtensible Interface protocol
CISC | Complex Instruction Set Computer | Computer architecture with a complex instruction set
RISC | Reduced Instruction Set Computer | Computer architecture with a simplified instruction set
ARM | Advanced RISC Machines | Machine developed from RISC
MIPS | Microprocessor without Interlocked Pipeline Stages | RISC instruction set architecture developed by MIPS Technologies
RAM | Random Access Memory | Random access memory used in the thesis
CPU | Central Processing Unit | Central processing unit using the MIPS superscalar architecture
FPGA | Field Programmable Gate Array | Large-scale integrated circuit using a user-programmable array of logic elements
ALU | Arithmetic Logic Unit | Arithmetic logic unit used in the thesis
PLL | Phase Locked Loop | Closed-loop frequency control system
RTL | Register Transfer Level | Register transfer level used to develop the MIPS core
CDMA | Central Direct Memory Access | Xilinx intellectual property core that supports data transfer
MMCM | Mixed-Mode Clock Manager | Mixed-mode clock management controller
THESIS ABSTRACT
The thesis covers two main parts: designing a local memory for the MIPS processor inherited from a previous thesis, and packaging the processor as an IP that follows the AXI4 communication standard.
The first part studies the design of a superscalar MIPS processor. The processor consists of two ALU blocks that handle integer instructions (signed and unsigned), along with an additional Multiply/Divide block that performs integer multiplication and division. In addition, the IP System Cache provided by Xilinx is used as the Instruction Memory (I-Cache) and Data Memory (D-Cache).
The second part of the thesis is about packaging the MIPS processor. Most IPs in the Vivado software communicate with each other through the AXI4 or AXI3 bus. In this thesis, the MIPS processor is packaged following the AXI4 communication standard. The packaged processor is then connected to other IPs in the Vivado Block Design, which links the two parts of the thesis. In the Block Design, the MIPS processor acts as a Master: it requests data reads from Slaves through the AXI4 bus, performs computations, and stores the values in the Data Memory. It can also send the computed values back to the Slaves or output them to the UART.
Through these two parts, we hope to help spread the benefits of MIPS processors, to propose an approach for designing a MIPS processor in particular and other processors in general, and to contribute to the development of integrated circuits in Vietnam.
Entering the modern era of the 21st century, wireless communication technology is considered a leading trend in the IoT era and a driving force behind the development of numerous useful IoT applications. In addition, rapid urbanization and the large number of IoT device users in modern life have created a tremendous demand for powerful processors. However, the high cost of fabricating complex processors, coupled with the material and labor shortages caused by the COVID-19 pandemic, further emphasizes the importance of processors in IoT devices.
This situation shows the necessity of optimizing processors. One proposed solution is to design processors tailored specifically to IoT devices. A processor with a simplified instruction set can reduce the number of logic gates in the design, which lowers the power consumption of the processor. Manufacturers can then adjust the cost structure more reasonably, leading to reduced costs for end-users.
Such processors are not only more affordable but can also offer faster processing, allowing IoT devices to handle tasks more efficiently and flexibly, in line with the pace of growth in the modern era.
Based on this analysis, our group proposes a thesis titled "Implementation of MIPS 32-bit Superscalar on FPGA." The thesis consists of two main components. The first component aims to design a MIPS processor using the superscalar architecture, increasing the number of instructions that can be processed in the processor pipeline. The second component refines the MIPS processor into a Master that can communicate with Slaves such as MIG (DDR3), BRAM, and peripherals through the AXI4 bus. The objective of this component is to approach the design of modern processors and contribute to the development of circuit design in Vietnam.
Chapter 1 OVERVIEW
1.1 Microprocessor architecture
1.1.1 Overview of microprocessors
With the remarkable technological advancements of humanity, processors have emerged and developed rapidly over time. Well-known chip manufacturers such as Intel, Apple, AMD, Qualcomm, and MediaTek have introduced their own branded processors, which have been widely commercialized.
A microprocessor is an electronic component fabricated from tiny transistors integrated onto a single chip. While the central processing unit (CPU) is the best-known processor, other components in a computer also have their own processors; graphics cards are one example. Before the advent of microprocessors, CPUs were built from separate small-scale integrated circuits, each containing only about a dozen transistors.
In the early 1970s, the first processing chips appeared and were used for electronic calculation and for algorithms involving binary-coded decimal (BCD) numbers. Subsequently, 4-bit and 8-bit processing systems were introduced. The most significant 32-bit design was the MC68000 chip (68K), introduced in 1979. It featured a large memory space, high speed, and reasonable cost, which made it one of the best-known CPU designs. The world's first 32-bit processor with full 32-bit data paths, 32-bit buses, and 32-bit addresses was the BELLMAC-32A chip by AT&T Bell.
The ARM processor made its first appearance in 1985. It is a 32-bit processor with a RISC architecture. ARM has excelled in the field of embedded systems, offering high performance and a wide range of development tools. During this time, other RISC processors also achieved great success, such as the MIPS R2000 and MIPS R3000.
The first 32-bit microprocessor chip in Vietnam was the VN1632, designed by ICDREC and announced in January 2010. The VN1632 microprocessor was designed using 120 nm technology and had a maximum operating frequency of 100 MHz. It incorporated most of the features of a typical 32-bit microprocessor, but the design team was inexperienced and faced many difficulties during the design process. As a result, the VN1632 chip could not match the 32-bit processors available worldwide at that time.
The rapid progress of microprocessors is partly explained by Moore's Law, under which performance has increased consistently over the years. Furthermore, the world is in the midst of the fourth industrial revolution, in which the semiconductor industry is considered paramount. Microprocessors are therefore becoming increasingly complex and sophisticated. At the same time, understanding the fundamentals of microprocessor architecture provides essential knowledge and a foundation for developing more advanced microprocessors in the future. This also contributes to the overall advancement of processor chips globally.
1.1.2 CISC architecture
CISC (Complex Instruction Set Computer) is a type of microprocessor design intended to make programming with high-level languages easier and more straightforward. CISC chips are designed to be easily programmable and to use memory efficiently. A CISC instruction set reduces the need for long sequences of machine instructions. For example, instead of requiring the compiler to generate a lengthy sequence of machine instructions to calculate a square root, a CISC processor provides a built-in capability for the task.
Many early computers were programmed in assembly language, at a time when memory was slow and expensive. CISC architectures were commonly implemented in large computers such as the PDP-11 and other DEC systems. Here are some characteristics of CISC processors:
• A large number of instructions, resulting in complex decoding logic.
• Special instructions that are used infrequently.
• Some instructions are longer than 32 bits.
• Fewer general-purpose registers, because operations are performed in memory.
• Various CISC designs use two special registers as stack pointers to manage interrupts.
CISC processors offer the following advantages:
• It is easy to add new instructions to the chip without changing the instruction set structure.
• The architecture allows efficient use of main memory.
• Compiler complexity is relatively low, because instructions can be written to match the structure of high-level languages.
However, CISC comes with the following disadvantages:
• Earlier generations of a processor line are largely contained as a subset in every new version. As a result, the instruction set and the chip hardware become more complex with each computer generation.
• Machine performance suffers because different instructions take different numbers of clock cycles to execute.
• The chips are larger because they require more logic gates.
1.1.3 RISC architecture
RISC (Reduced Instruction Set Computer) is a design approach for processors that simplifies the instruction set so that instructions take a uniform time to execute. Common RISC processors include ARM, SuperH, MIPS, SPARC, DEC Alpha, PA-RISC, PIC, and PowerPC.
RISC processors are designed to perform a limited set of instructions, which allows for higher operating speeds. RISC instruction sets typically contain fewer than 100 instructions and use fixed-length instruction formats. This approach uses a small number of simple addressing modes and register-based instructions. In this design, LOAD and STORE are the only instructions that access memory.
RISC processors have the following characteristics:
• One-cycle execution time: RISC processors typically have a CPI (cycles per instruction) of one. This is achieved by optimizing each instruction on the CPU and by using a technique called pipelining.
• Pipelining: a technique that executes instructions concurrently in stages to increase execution efficiency.
• Large number of registers: RISC designs often incorporate a large number of registers to minimize interaction with memory.
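The effect of pipelining on the cycle count can be sketched with a small calculation. This is a minimal model under stated assumptions, not the thesis design: every stage takes exactly one clock cycle, there are no hazards or stalls, and the five-stage depth used in the usage note is an assumed value. The function names are hypothetical.

```python
def cycles_unpipelined(num_instructions: int, num_stages: int) -> int:
    """Without pipelining, an instruction finishes all stages before the next starts."""
    return num_instructions * num_stages


def cycles_pipelined(num_instructions: int, num_stages: int) -> int:
    """Ideal pipeline (assumption: one cycle per stage, no stalls).

    The first instruction needs num_stages cycles to travel through the
    pipeline; after that, one instruction completes every cycle.
    """
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


def effective_cpi(num_instructions: int, num_stages: int) -> float:
    """Average cycles per instruction of the ideal pipeline."""
    return cycles_pipelined(num_instructions, num_stages) / num_instructions
```

For example, with an assumed five stages and 100 instructions, the unpipelined model needs 500 cycles while the ideal pipeline needs 104, so the effective CPI approaches one as the instruction count grows. Real pipelines fall short of this figure whenever hazards force stalls.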
RISC processors offer the following advantages:
• Reduced processor area for the control unit, from 60% (for CISC processors) to as low as 10% (for RISC processors). The freed area can be used for more cache memory or logic gates within the processor.
• High computational speed due to simplified instruction decoding.
• A large number of general-purpose registers, reducing the need for frequent memory access.
• Uniform execution time for each instruction stage, allowing for accelerated processing through pipelining.
• Wide addressing formats for efficient memory management.
However, RISC also has some disadvantages:
• Memory access is restricted for all instructions except the memory read and write instructions.
• There are few instructions that directly support high-level languages.
• The RISC architecture requires hardware on the chip to be redesigned for each version.
• The performance of RISC processors depends on the skill of the programmer or compiler. The compiler plays an important role in translating high-level code into RISC instructions.
1.1.4 Comparing CISC and RISC
The CISC and RISC architectures are the two predominant computer architectures in use today. The main difference between them lies in the number of clock cycles each instruction takes to complete: a CISC instruction may require more cycles to execute than a RISC instruction.
The difference in cycle count comes from the complexity and purpose of the instructions in each architecture. In RISC, each instruction accomplishes a very small task, so a complex task requires stringing together multiple instructions. In CISC, each instruction is more like a high-level language statement: only a few instructions are needed to achieve what the program wants, because each instruction performs multiple steps.
As a result, RISC programs have longer instruction sequences than CISC programs, because each small step may require a separate instruction, whereas a single CISC instruction encompasses multiple steps. While CISC may be easier for programmers, it also has its drawbacks. CISC code may not be as efficient as RISC code, because complex instructions that are used repeatedly can carry out unnecessary steps, leading to wasted cycles. RISC allows programmers to eliminate unnecessary code and prevent wasted cycles.
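The trade-off described above (fewer, slower instructions against more, faster ones) can be made concrete with the standard CPU performance equation, execution time = instruction count × cycles per instruction × clock period. The sketch below is illustrative only: the function name is hypothetical, and the numbers in the usage note are made up to show the arithmetic, not measurements of any real processor.

```python
def cpu_time_ns(instruction_count: int,
                cycles_per_instruction: float,
                clock_period_ns: float) -> float:
    """Execution time from the standard CPU performance equation."""
    return instruction_count * cycles_per_instruction * clock_period_ns


# Hypothetical program: the CISC-style version needs fewer instructions,
# but each one takes more cycles on average.
cisc_time = cpu_time_ns(instruction_count=40, cycles_per_instruction=4.0,
                        clock_period_ns=1.0)
# The RISC-style version needs more instructions at a lower average CPI.
risc_time = cpu_time_ns(instruction_count=100, cycles_per_instruction=1.5,
                        clock_period_ns=1.0)
```

With these assumed values the CISC-style version takes 160 ns and the RISC-style version 150 ns. Which side wins depends entirely on the actual instruction counts, CPI, and clock period, which is why neither architecture is faster by definition.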
These differences matter to technology enthusiasts, but they mean little to most users. CISC has long dominated desktop and server computing through Intel's x86 architecture. Conversely, RISC has found its way into mobile devices such as smartphones, tablets, GPS receivers, and similar devices. ARM is one notable RISC architecture used in these devices. The higher efficiency of the RISC architecture makes it desirable in applications where cycles and power are often constrained.
1.2 MIPS architecture
1.2.1 Development history
MIPS (Microprocessor without Interlocked Pipeline Stages) is an instruction set architecture (ISA) that belongs to the RISC (Reduced Instruction Set Computing) family and was developed by MIPS Technologies. Initially, MIPS had a 32-bit architecture, which was later followed by a 64-bit version. Over time, MIPS has undergone several revisions, resulting in versions such as MIPS I, MIPS II, MIPS III, MIPS IV, MIPS V, MIPS32, and MIPS64. The current versions are MIPS32 and MIPS64. There are also several optional extensions, including MIPS-3D with a SIMD instruction set, MIPS16e for instruction compression to reduce program size, and MIPS MT for multithreading support.
In 1981, researchers led by John L. Hennessy at Stanford University began the initial research on the MIPS microprocessor. The fundamental concept was to improve performance by using instruction pipelining, a technique that was well known but challenging to implement. Hennessy later left Stanford University to establish his own company, MIPS Computer Systems, in 1984. The company's first design was the R2000, introduced in 1985, followed by the R3000 in 1988. In 1991, MIPS Computer Systems released the first R4000 microprocessor. SGI, one of MIPS' customers, acquired the company and renamed it MIPS Technologies.
Currently, MIPS continues to be used and developed in various devices such as IoT devices, Arduino boards, and automotive applications. Modern MIPS processors emphasize performance and power efficiency. Along with functional advancements, the architectural complexity of MIPS has increased significantly compared to the original R2000 introduced in 1985.
Here are some devices that utilize MIPS processors:
Figure 1.1: Nintendo 64 (MIPS R4300i CPU)
Released in June 1996, the Nintendo 64 was the first game console to use a 64-bit processor. The CPU of the Nintendo 64 is the NEC VR4300, based on the MIPS R4300i processor. It runs at a clock speed of 93.75 MHz and delivers a raw performance of 125 million instructions per second.
Tesla Model S cars manufactured after September 2014 are equipped with a range of devices designed for autonomous operation, including a windshield-mounted camera, undercarriage radar, and front and rear ultrasonic sensors. In addition, these cars integrate the Mobileye EyeQ3 computer vision chip, based on the MIPS I-class architecture, which provides high-performance capabilities for detecting road signs, lane markings, obstacles, and other vehicles. The combination of sensors and computer vision hardware enables features such as autonomous driving and parking, including the Autopilot feature that allows drivers to have a hands-free experience on highways.
MIPS is a simple and efficient RISC architecture that is highly scalable and available for licensing. Over time, the architecture has evolved, embracing new technologies and developing a strong ecosystem and comprehensive support for the industry. The fundamental characteristics of MIPS, such as a large number of registers, the number and size of instructions, and visible pipeline delays, enable the MIPS architecture to deliver high performance for IP cores, as well as reasonable power utilization for modern SoC designs.
1.2.2 Instruction set architecture
For its instruction set, the MIPS32 design in this superscalar architecture adheres to the following principles:
• Simple and regular instruction structure: the instruction size is fixed at 32 bits, with a 6-bit opcode field.
• Smaller and faster: a limited instruction set, a limited number of registers, and a limited number of addressing modes.
• Speed up common cases: operands are taken from registers, and instructions can contain immediate operands.
• Instructions follow three common formats.
Regarding the data storage principles in memory:
• Alignment Restriction principle: objects stored in memory must start at an address that is a multiple of the object's size (typically a multiple of 4).
• Big Endian principle: the high byte is stored in the memory location with the lower address, while the low byte is stored in the memory location with the higher address.
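The two storage principles above can be illustrated with a short sketch that models memory as a byte array. The helper names are hypothetical and the code is only a software illustration of the rules, not the memory interface designed in the thesis.

```python
import struct


def is_aligned(address: int, size: int) -> bool:
    """Alignment restriction: the address must be a multiple of the object size."""
    return address % size == 0


def store_word_big_endian(memory: bytearray, address: int, value: int) -> None:
    """Store a 32-bit word with the high byte at the lowest address."""
    if not is_aligned(address, 4):
        raise ValueError(f"unaligned word address: {address:#x}")
    # ">I" means big-endian, unsigned 32-bit integer.
    memory[address:address + 4] = struct.pack(">I", value & 0xFFFFFFFF)


def load_word_big_endian(memory: bytearray, address: int) -> int:
    """Load a 32-bit word stored in big-endian byte order."""
    if not is_aligned(address, 4):
        raise ValueError(f"unaligned word address: {address:#x}")
    return struct.unpack_from(">I", memory, address)[0]


memory = bytearray(8)
store_word_big_endian(memory, 4, 0x12345678)
```

After the store, `memory[4]` holds the high byte 0x12 and `memory[7]` holds the low byte 0x78, while a store to address 6 is rejected because 6 is not a multiple of 4.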
MIPS follows a register-to-register and load/store philosophy, which means that instructions operate on registers. When memory is used, separate load/store instructions transfer data between memory and registers.
Each register in MIPS stores a 32-bit value. Unlike variables in high-level programming languages, registers in assembly language do not have a data type. The way a register is used determines the data type.
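A small sketch shows how the same 32-bit register contents can mean different things depending on the instruction that reads them. The helpers below are hypothetical Python models of the standard MIPS comparison instructions `slt` (signed) and `sltu` (unsigned), operating directly on raw bit patterns rather than on register numbers.

```python
MASK32 = 0xFFFFFFFF


def as_unsigned(bits: int) -> int:
    """Interpret the 32-bit pattern as an unsigned integer."""
    return bits & MASK32


def as_signed(bits: int) -> int:
    """Interpret the 32-bit pattern as a two's-complement signed integer."""
    bits &= MASK32
    return bits - 0x1_0000_0000 if bits & 0x8000_0000 else bits


def slt(rs_bits: int, rt_bits: int) -> int:
    """Set on less than, treating both operands as signed."""
    return 1 if as_signed(rs_bits) < as_signed(rt_bits) else 0


def sltu(rs_bits: int, rt_bits: int) -> int:
    """Set on less than, treating both operands as unsigned."""
    return 1 if as_unsigned(rs_bits) < as_unsigned(rt_bits) else 0
```

The pattern 0xFFFFFFFF is -1 when read as signed and 4294967295 when read as unsigned, so `slt(0xFFFFFFFF, 1)` yields 1 while `sltu(0xFFFFFFFF, 1)` yields 0. The register itself carries no type; the instruction chooses the interpretation.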
MIPS processors include 32 general-purpose registers, each with a size of 32 bits:
Table 1.1: Register set in MIPS
Register name | Register index (Decimal) | Function of register
$zero (*) | 0 | Contains the constant value 0
$at | 1 | Reserved for the assembler
$v0 - $v1 | 2 - 3 | Contains the return value of a function or procedure
$a0 - $a3 | 4 - 7 | Stores the function's input parameters
$ra (*) | 31 | Return address
In this thesis, the MIPS processor also uses additional registers for the
following purposes:
High Register (*) | Stores the upper 32 bits
Low Register (*) | Stores the lower 32 bits
Program Counter (*) | Contains the current address
(Note: * marks the registers used in this thesis)
Next, we will present the structure of assembly instructions when translated
into machine language. Each instruction in MIPS is 32 bits long. Each instruction can
be seen as a function in a programming language. Therefore, we need the
instruction name, the parameters passed to the instruction, and the type of each
parameter, in this case, the size of each parameter (as there is no concept of data type in
assembly language).
MIPS focuses on the simplicity of the instruction set, so there are three main
instruction formats: R-format, I-format, and J-format. Each machine instruction is
32 bits long and is divided into different groups of bits, called fields, with each field
serving a specific role in the instruction. Depending on the structure of the instruction,
each field has different specifications. The R-format instruction has the following six
fields:
Table 1.3: Instruction format structure R
Opcode: Specifies the operation to be performed. All R-format instructions
share the same opcode, 0.
Rs (Register Source): Specifies the first source register.
Rt (Register Target): Specifies the second source register.
Rd (Register Destination): Specifies the destination register.
Shamt (Shift Amount): Specifies the amount of shifting or rotating to be
performed.
Funct: Specifies the specific function within the opcode.
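As an illustration of how the six R-format fields pack into one 32-bit word, the following is a minimal sketch. It assumes the standard MIPS32 field widths of 6, 5, 5, 5, 5, and 6 bits; the function name `encode_r_format` is hypothetical.

```python
def encode_r_format(rs: int, rt: int, rd: int, shamt: int, funct: int,
                    opcode: int = 0) -> int:
    """Pack the R-format fields into a 32-bit machine word."""
    return (((opcode & 0x3F) << 26)
            | ((rs & 0x1F) << 21)
            | ((rt & 0x1F) << 16)
            | ((rd & 0x1F) << 11)
            | ((shamt & 0x1F) << 6)
            | (funct & 0x3F))


# add $t0, $t1, $t2  ->  rd = 8, rs = 9, rt = 10, funct = 0x20
word = encode_r_format(rs=9, rt=10, rd=8, shamt=0, funct=0x20)
print(f"{word:#010x}")  # prints 0x012a4020
```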
Next is the I-format instruction, which is used for operations between registers
and an immediate value available in the instruction:
Opcode | Rs | Rt | Immediate
6-bit | 5-bit | 5-bit | 16-bit
Opcode: Specifies the operation to be performed. Since the I-format does not
have a funct field, I-format instructions do not share a single opcode as the
R-format instructions do.
Rs (Register Source): Specifies the source register.
Rt (Register Target): Specifies the destination/source register.
Immediate: Specifies the immediate value or constant used in the instruction.
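The field split of the I-format can be sketched as a small decoder. This is an illustrative example only; `decode_i_format` is a hypothetical name, and the sample word is the standard MIPS32 encoding of `addi $t0, $t1, -1`.

```python
def decode_i_format(word: int) -> dict:
    """Split a 32-bit I-format word into its four fields."""
    return {
        "opcode": (word >> 26) & 0x3F,   # bits 31..26
        "rs": (word >> 21) & 0x1F,       # bits 25..21
        "rt": (word >> 16) & 0x1F,       # bits 20..16
        "immediate": word & 0xFFFF,      # bits 15..0, not yet sign-extended
    }


fields = decode_i_format(0x2128FFFF)  # addi $t0, $t1, -1
print(fields)
```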
The J-format is used for jump instructions, similar to the 'goto'
statement in C. The j and jal instructions use this format, which has the following
structure:
Table 1.5: Instruction format structure J
Opcode | Target address
6-bit | 26-bit
Opcode: Specifies the opcode for the jump instruction.
Target address: The shortened address of the jump target is
obtained by reducing the original 32-bit address to 26 bits as follows:
• Discard the two least significant bits of the address. Since the addresses
of MIPS instructions are always multiples of 4, the two least significant
bits are always 0.
• Consider the four most significant bits to be the same as the four most
significant bits of the current instruction.
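The two steps above can be sketched as a pair of helper functions. The names `encode_jump_target` and `decode_jump_target` are hypothetical, and the sketch takes the upper four bits from the address of the current instruction, as stated in the text.

```python
def encode_jump_target(address: int) -> int:
    """Drop the two low bits and the four high bits, leaving 26 bits."""
    return (address >> 2) & 0x03FFFFFF


def decode_jump_target(pc: int, target_field: int) -> int:
    """Rebuild the 32-bit address from the 26-bit field and the current PC."""
    return (pc & 0xF0000000) | ((target_field & 0x03FFFFFF) << 2)


field = encode_jump_target(0x00400020)
print(f"{field:#x}")                                  # prints 0x100008
print(f"{decode_jump_target(0x00400000, field):#x}")  # prints 0x400020
```

A consequence of this scheme is that a jump can only reach targets whose four most significant address bits match those of the jump instruction itself.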
1.2.3 Executable instructions in the thesis
Next, we would like to present the instruction groups that are used for studying
the design of the MIPS processor in the scope of this thesis:
Table 1.6: Groups of processing instructions
Instruction | Full name | Handling action
Arithmetic logic calculation instruction
ADD | Add | rd = rs + rt
ADDI | Add Immediate | rt = rs + imm
ADDIU | Add Immediate Unsigned | rt = rs + imm
SLT | Set On Less Than | rd = rs < rt
SLTI | Set On Less Than Immediate | rt = rs < imm
SLTU | Set On Less Than Unsigned | rd = rs < rt
Instruction to transfer values between general and special registers
MFHI | Move From HI | rd = HI
MFLO | Move From LO | rd = LO
SLLV | Shift Left Logical Variable |
BEQ | Branch On Equal | if(rs == rt) pc += offset * 4
BGEZ | Branch On >= 0 | if(rs >= 0) pc += offset * 4
BGEZAL | Branch On >= 0 And Link | r31 = pc; if(rs >= 0) pc += offset * 4
BGTZ | Branch On > 0 | if(rs > 0) pc += offset * 4
BLEZ | Branch On <= 0 | if(rs <= 0) pc += offset * 4
BLTZ | Branch On < 0 | if(rs < 0) pc += offset * 4
BLTZAL | Branch On < 0 And Link | r31 = pc; if(rs < 0) pc += offset * 4
BNE | Branch On Not Equal | if(rs != rt) pc += offset * 4
J | Jump | pc = pc_upper | (target << 2)
JAL | Jump And Link | r31 = pc; pc = target << 2
JALR | Jump And Link Register | rd = pc; pc = rs
JR | Jump Register | pc = rs
Load instruction in memory
LB | Load Byte | rt = *(char*)(offset + rs)
LBU | Load Byte Unsigned | rt = *(Uchar*)(offset + rs)
LH | Load Halfword | rt = *(short*)(offset + rs)
LHU | Load Halfword Unsigned | rt = *(Ushort*)(offset + rs)
SB | Store Byte | *(char*)(offset + rs) = rt
SH | Store Halfword | *(short*)(offset + rs) = rt
SW | Store Word | *(int*)(offset + rs) = rt
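The difference between the signed and unsigned byte loads in the table (the `char*` and `Uchar*` casts) can be sketched as follows. The function `load_byte` is a hypothetical software model, not the thesis hardware; it returns the 32-bit pattern that would be written to rt.

```python
def load_byte(memory: bytes, address: int, signed: bool) -> int:
    """Model LB (signed=True) and LBU (signed=False) as 32-bit results."""
    value = memory[address]
    if signed and value & 0x80:
        value -= 0x100           # reinterpret the byte as a negative number
    return value & 0xFFFFFFFF    # register contents as a 32-bit pattern


data = bytes([0x7F, 0x80])
print(f"{load_byte(data, 1, signed=True):#010x}")   # prints 0xffffff80
print(f"{load_byte(data, 1, signed=False):#010x}")  # prints 0x00000080
```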
Chapter 2 DESIGNING MIPS32 PROCESSOR
2.1 Overview of processor design
The first part of this thesis covers the research and design of a Superscalar
MIPS processor. The processor consists of two
ALU blocks for handling integer instructions (both signed and unsigned).
Additionally, a Mult/Div block is added to handle integer multiplication and division
operations. The design also incorporates two System Cache IP cores from Xilinx (one for
the I-Cache and one for the D-Cache), serving as the Instruction Memory and Data Memory.
In this thesis, we propose the design of a Superscalar processor that is pipelined
into six stages as follows:
• IF1 stage: The Program Counter outputs the instruction address, and the
Instruction Memory fetches the corresponding instruction.
• IF2 stage: The Instruction Memory completes the instruction fetch.
• ID stage: Reads the values of registers from the Register File. In this
stage, the processor also handles hazards, data forwarding, and
branch/jump instructions.
• EX stage: Executes arithmetic and logic instructions or reads data from
the Data Memory.
• MEM stage: Performs write operations to the Data Memory or
completes data read operations if the instruction is a load instruction
from the Data Memory.
• WB stage: Selects the value from the Data Memory for load
instructions or from the ALU block to store it back into the Register
File.
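To visualize how instruction pairs overlap across the six stages, the following minimal sketch prints an ideal pipeline diagram. It assumes one pair is fetched per cycle and that no stalls or flushes occur; `stage_at` is a hypothetical helper, not part of the design.

```python
STAGES = ["IF1", "IF2", "ID", "EX", "MEM", "WB"]


def stage_at(issue_cycle: int, cycle: int):
    """Return the stage occupied at `cycle` by a pair fetched at `issue_cycle`."""
    index = cycle - issue_cycle
    if 0 <= index < len(STAGES):
        return STAGES[index]
    return None  # the pair has not entered the pipeline yet or has left it


# Print a diagram for three consecutive instruction pairs over eight cycles.
for pair in range(3):
    row = [stage_at(pair, cycle) or "-" for cycle in range(8)]
    print(f"pair {pair}: " + " ".join(f"{s:>3}" for s in row))
```

Under these assumptions, a pair fetched in cycle 0 writes back in cycle 5, and each following pair completes one cycle later.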
Figure 2.2: System Cache
This is Xilinx's System Cache IP working in 2-way set-associative mode. In
this thesis, the MIPS core uses two of these caches, one as the Instruction Cache (IMEM) and one
as the Data Cache (DMEM). The inputs and outputs of both caches are as shown in the
figure above.
We control the cache exactly as we would control an AXI transaction, which will
be explained in the next chapter.
Port definitions:
Table 2.1: Signals of block
S0_AXI_GEN | AXI slave port 0
S1_AXI_GEN | AXI slave port 1
Function of Program Counter block:
• The Program Counter is the control unit in the processor and is used to
track the address of the current or next instruction.
• The Instruction Memory operates in Dual Port mode, which means the
Program Counter needs to output two instruction addresses to read two
instructions from the two ports of the Instruction Memory.
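The dual-address behaviour can be sketched as below. This is only an assumption-laden model: it takes the PC output to be the address of instruction B of the previous pair (as the signal table for this block describes it), assumes sequential flow with 4-byte instructions, and ignores hazards, branches, and reset. The name `next_pair_addresses` is hypothetical.

```python
def next_pair_addresses(pc: int) -> tuple:
    """Return (A_Address, B_Address) for the pair following address `pc`."""
    a_address = (pc + 4) & 0xFFFFFFFF         # instruction A follows the previous B
    b_address = (a_address + 4) & 0xFFFFFFFF  # instruction B follows A
    return a_address, b_address


a_addr, b_addr = next_pair_addresses(0x00000004)
print(hex(a_addr), hex(b_addr))  # prints 0x8 0xc
```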
The input/output ports and their meanings are as follows:
Table 2.2: Signals of Program Counter Block
Signal Name | Description
clk | Clock pulse
aresetn | Low level active reset signal
HazardSignal | Signals that an instruction conflict has occurred
PC | Instruction address B of the previous instruction pair
A_Address | Current A instruction address
B_Address | Current B instruction address
ReadInstruction | This is the Enable signal of the Program
2.2.3 Control Unit Block
Figure 2.4: Control Unit Block
Function of Control Unit block:
• The decoding of the Opcode and Funct fields is performed to generate
control signals for the Datapath.
• After receiving the Opcode and Funct fields from the instruction, the
Control Unit decodes them to generate the corresponding control
signals for each stage.
The input/output ports and their meanings are as follows:
Table 2.3: Signals of Control Unit block
Signal Name | Description
Opcode | Combines with Funct to clearly define what the instruction should do
 | Needed to distinguish conditional branch instructions
ControlSignal | Includes the following signals: Sign, Jump, Branch, HItoReg, LOtoReg, ALUSrc, RegDst, MemWrite, MemRead, RegWrite, MemtoReg, RegtoHI, RegtoLO
ALUControl | Tells the ALU which operation to perform
SignalWrite | Indicates how the instruction should be written to Data Memory: sw, sh, sb
 | Indicates how the instruction should read from Data Memory: lw, lh, lhu, lb, lbu
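As a rough illustration of opcode decoding, the sketch below maps a few standard MIPS32 opcodes to a subset of the control signals named above. The signal values follow the usual textbook single-issue datapath and are an assumption; the thesis's actual encoding is not shown in this section, and `decode` is a hypothetical name.

```python
# Standard MIPS32 opcode values for a handful of instructions.
R_TYPE, BEQ, ADDI, LW, SW = 0x00, 0x04, 0x08, 0x23, 0x2B

DEFAULT = dict(RegDst=0, ALUSrc=0, MemtoReg=0, RegWrite=0,
               MemRead=0, MemWrite=0, Branch=0)

# Signals asserted for each opcode (textbook convention, assumed here).
ACTIVE = {
    R_TYPE: ("RegDst", "RegWrite"),
    ADDI: ("ALUSrc", "RegWrite"),
    LW: ("ALUSrc", "MemtoReg", "RegWrite", "MemRead"),
    SW: ("ALUSrc", "MemWrite"),
    BEQ: ("Branch",),
}


def decode(opcode: int) -> dict:
    """Return the control signals for one of the opcodes listed in ACTIVE."""
    signals = dict(DEFAULT)
    for name in ACTIVE[opcode]:
        signals[name] = 1
    return signals


print(decode(LW))
```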
SUBU
SUB
Zal/MFHI/MFLO
OR
NOR MULTU
MULT
DIVU
DIV SLTUN
MTHI
SLT
10111 MTLO
(Zal stands for the branch/jump-and-link instructions.)
Finally, we have the bit encoding convention for BranchIns. Since the
processor has 12 different branch and jump instructions, it is essential to differentiate
between them.
Table 2.5: BranchIns bit encoding convention
BranchIns[11:0] | Description
BranchIns[0] | JALR
BranchIns[1] | JR
Figure 2.5: Register File Block
Function of Register File Block:
• Manage the reading or writing of computation results by the processor.
• Receive register addresses to read or write corresponding values,
capable of simultaneous read and write in the same cycle.
The input/output ports and their meanings are as follows:
Table 2.6: Signals of Register File Block
Signal Name | Description
clk | Clock pulse
aresetn | Low level active reset signal
A_ReadRegister1 | Address of register 1 of instruction A
A_ReadRegister2 | Address of register 2 of instruction A
A_WriteData | Value to write of instruction A
A_WriteRegister | The address of the destination register
A_Regwrite | Request signal to save the value of instruction A
B_ReadRegister1 | Address of register 1 of instruction B
B_ReadRegister2 | Address of register 2 of instruction B
B_WriteData | Value to write of instruction B
B_WriteRegister | The address of the destination register
 | Read value of register 1 of instruction B
 | Read value of register 2 of instruction B
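A behavioural model of a register file shared by two instructions might look like the sketch below. Several points are assumptions rather than facts from the thesis: the class and method names are hypothetical, register 0 is kept at zero as in Table 1.1, and when both instructions write the same register, instruction B (later in program order) is assumed to win.

```python
class RegisterFile:
    """32 x 32-bit registers with two write ports (instructions A and B)."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, index: int) -> int:
        """Read one register; each instruction uses two such reads."""
        return self.regs[index & 0x1F]

    def write_pair(self, a_write, b_write) -> None:
        """Each argument is (register index, data, write enable)."""
        for register, data, enable in (a_write, b_write):
            if enable and register != 0:   # $zero always stays 0
                self.regs[register] = data & 0xFFFFFFFF


rf = RegisterFile()
rf.write_pair((8, 5, True), (0, 7, True))
print(rf.read(8), rf.read(0))  # prints 5 0
```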
2.2.5 Register High/Low Block
Figure 2.6: Register High/Low Block
Function of Register High/Low block:
• The High register and Low register are two registers used to store the
results of multiplication/division, as the multiplication/division results
are 64-bit.
• The Register High/Low block is located in the ID stage, at the same
position as the Register File.
The input/output ports and their meanings are as follows:
Table 2.7: Signals of Register High/Low Block
Signal Name | Description
clk | Clock pulse
aresetn | Low level active reset signal
A_HighData | High 32-bit write-in value of instruction A
A_LowData | Low 32-bit write-in value of instruction A
A_HiLoRegWrite | Write enable signal to the High or Low register for instruction A
A_Read | Read select signal: reads the Low register (reads the High register by default)
B_HighData | High 32-bit write-in value of instruction B
B_LowData | Low 32-bit write-in value of instruction B
B_HiLoRegWrite | Write enable signal to the High or Low register for instruction B
B_Read | Read select signal: reads the Low register (reads the High register by default)
A_ReadData | Output read value of instruction A
B_ReadData | Output read value of instruction B
Comparator Block
Figure 2.7: Comparator Block (inputs: ReadData1[31:0], ReadData2[31:0], BranchIns[11:0]; outputs: IsThatBranch, IsThatBranchZal, BranchTaken, BranchZal)
Function of Comparator Block:
• The block compares the two values on its ReadData inputs and
combines the result with the BranchIns input signal to determine
whether the branch/jump instruction is taken.
• In this processor, there are 8 types of branch instructions and 4 types of
jump instructions, and the Comparator block operates based on the
BranchIns signal.
The input/output ports and their meanings are as follows:
Table 2.8: Signals of Comparator Block
Signal Name | Description
ReadData1 | Input value 1
ReadData2 | Input value 2
BranchIns | Specifies the type of branch/jump instruction
IsThatBranch | Indicates whether this is a conditional branch instruction or a jump instruction
IsThatBranchZal | Indicates whether this is a jump and link instruction
BranchTaken | Indicates whether the branch/jump instruction is taken or not
BranchZal | Indicates whether the jump and link instruction is taken or not
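The comparison performed for the conditional branches can be sketched in software as follows. The branch type is passed as a mnemonic string for readability; this is an assumption, since the hardware uses the BranchIns bit vector whose full encoding is not reproduced here. The names `to_signed` and `branch_taken` are hypothetical.

```python
def to_signed(value: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def branch_taken(kind: str, read_data1: int, read_data2: int = 0) -> bool:
    """Evaluate the branch condition, following the actions in Table 1.6."""
    a, b = to_signed(read_data1), to_signed(read_data2)
    conditions = {
        "BEQ": a == b,
        "BNE": a != b,
        "BGEZ": a >= 0,
        "BGTZ": a > 0,
        "BLEZ": a <= 0,
        "BLTZ": a < 0,
    }
    return conditions[kind]


print(branch_taken("BLTZ", 0xFFFFFFFF))  # prints True, since the value is -1
```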
2.2.7 Sign Extend Block
Figure 2.8: Sign Extend Block (inputs: Immediate[15:0], Sign; output: SignEx[31:0])
Function of Sign Extend Block:
• The Sign Extend block takes a 16-bit input and extends it to a 32-bit
value with or without sign, depending on the Sign signal.
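The behaviour described above can be modelled in a few lines. This is a software sketch of the extension logic, with `sign_extend` as a hypothetical name; it assumes Sign = 1 selects sign extension and Sign = 0 selects zero extension.

```python
def sign_extend(immediate: int, sign: bool) -> int:
    """Extend a 16-bit immediate to 32 bits, signed or unsigned."""
    immediate &= 0xFFFF
    if sign and immediate & 0x8000:
        return immediate | 0xFFFF0000  # copy bit 15 into bits 31..16
    return immediate                   # upper 16 bits stay zero


print(f"{sign_extend(0x8001, True):#010x}")   # prints 0xffff8001
print(f"{sign_extend(0x8001, False):#010x}")  # prints 0x00008001
```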
The input/output ports and their meanings are as follows:
Table 2.9: Signals of Sign Extend Block
Signal Name | Description
Immediate | Last 16 bits of instruction
Sign | Signal from the Control Unit block, indicating whether this is a signed or
Figure 2.9: ALU Block
Function of ALU Block:
• The ALU (Arithmetic Logic Unit) is responsible for performing
arithmetic operations in the processor (excluding multiplication/division and
the MFHI and MFLO instructions).
• It receives operands from two input ports, performs calculations based
on the ALU Control signals, and outputs the resulting value through the
output port.
The input/output ports and their meanings are as follows:
Table 2.10: Signal of ALU Block
Signal Name | Description
ValueA | Operand value 1
ValueB | Operand value 2
ALUControl | The type of operation to be performed
shamt | Number of bits to shift, for shift instructions
Result | Calculation result
overflow | Indicates overflow for signed arithmetic instructions
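The overflow output for a signed addition can be sketched as follows. This uses the common rule that overflow occurs when both operands have the same sign and the result has the opposite sign; whether the thesis ALU uses exactly this logic is an assumption, and `add_signed` is a hypothetical name.

```python
MASK32 = 0xFFFFFFFF


def add_signed(value_a: int, value_b: int):
    """Return (32-bit result, overflow flag) for a signed addition."""
    value_a &= MASK32
    value_b &= MASK32
    result = (value_a + value_b) & MASK32
    sign_a = (value_a >> 31) & 1
    sign_b = (value_b >> 31) & 1
    sign_r = (result >> 31) & 1
    overflow = sign_a == sign_b and sign_r != sign_a
    return result, overflow


print(add_signed(0x7FFFFFFF, 1))  # prints (2147483648, True)
```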
• The Mult/Div block is responsible for performing integer
multiplication, integer division (both signed and unsigned), and
transferring values between the Register File and the Register High/Low.
• The Mult/Div block utilizes hardware algorithms discussed in the
"Computer Architecture" course. It has a latency of 32 clock cycles.
• The reason for separating the Mult/Div block from the ALU is that the
MIPS processor doesn't directly store the multiplication/division results
in the Register File. Instead, the results need to be stored in the Register
High/Low first, and then transferred to the Register File. This allows
the processor to continue executing other instructions unrelated to
multiplication/division without waiting for the 32-cycle computation of
the Mult/Div block.
• The Mult/Div block handles the following instructions: DIV, DIVU, MULT,
MULTU, MFHI, and MFLO.
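To show why an iterative multiplier naturally takes 32 steps, the following sketch implements the classic shift-and-add algorithm for unsigned operands, one iteration per multiplier bit. The thesis does not spell out its exact algorithm here, so this is an assumed stand-in; `multu_shift_add` is a hypothetical name.

```python
MASK32 = 0xFFFFFFFF


def multu_shift_add(multiplicand: int, multiplier: int):
    """Unsigned 32 x 32 multiply; returns the (HI, LO) halves of the product."""
    # The low half starts as the multiplier; the high half accumulates sums.
    product = multiplier & MASK32
    for _ in range(32):                      # one step per multiplier bit
        if product & 1:
            product += (multiplicand & MASK32) << 32
        product >>= 1
    return (product >> 32) & MASK32, product & MASK32


hi, lo = multu_shift_add(0xFFFFFFFF, 0xFFFFFFFF)
print(f"{hi:#010x} {lo:#010x}")  # prints 0xfffffffe 0x00000001
```

The two returned halves correspond to the values that would be held in the High and Low registers before MFHI or MFLO moves them to the Register File.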
The input/output ports and their meanings are as follows:
Table 2.11: Signals of Mult/Div block
Signal Name Description
clk Clock pulse
aresetn Low level active reset signal