RDCIM: RISC-V Supported Full-Digital Computing-in-Memory Processor With High Energy Efficiency and Low Area Overhead
Wente Yi, Graduate Student Member, IEEE, Kefan Mo, Wenjia Wang, Yitong Zhou, Yejun Zeng, Zihan Yuan, Bojun Cheng, Member, IEEE, and Biao Pan, Member, IEEE
Abstract— Digital computing-in-memory (DCIM), which merges computing logic into memory, has been proven to be an efficient architecture for accelerating multiply-and-accumulates (MACs). However, low energy efficiency and high area overhead pose a primary restriction for integrating DCIM into the re-configurable processors required for multi-functional workloads. To alleviate this dilemma, a novel RISC-V supported full-digital computing-in-memory processor (RDCIM) is designed and fabricated in 55nm CMOS technology. In RDCIM, an adding-on-memory-boundary (AOMB) scheme is adopted to improve the energy efficiency of DCIM. Meanwhile, a multi-precision adaptive accumulator (MPAA) and a serial-parallel conversion supported SRAM buffer (SPBUF) are employed to reduce the area overhead caused by the peripheral circuits and the intermediate buffer for multi-precision support. The results show that the energy efficiency of our design is 16.6 TOPS/W (8-bit) and 66.3 TOPS/W (4-bit). Compared to related works, the proposed RDCIM macro shows a maximum energy efficiency improvement of 1.22× in a continuous computing scenario, an area saving of 1.22× in the accumulator, and an area saving of 3.12× in the input buffer. Moreover, in RDCIM, 5 fine-grained RISC-V extended instructions are designed to dynamically adjust the state of DCIM, reaching a 1.2× computation efficiency improvement.
Index Terms— Computing-in-memory, RISC-V, extended instructions, re-configurable precision.

Manuscript received 14 September 2023; revised 2 November 2023 and 13 December 2023; accepted 3 January 2024. This work was supported in part by the National Key Research and Development Program of China under Grant 2021YFB3601304 and Grant 2021YFB3601300, in part by the National Natural Science Foundation of China under Grant 62001019, in part by the Laboratory Open Fund of Beijing Smart-Chip Microelectronics Technology Company Ltd., in part by the Fundamental Research Funds for the Central Universities, in part by the Key Research and Development Program of Anhui Province under Grant 2022a05020018, and in part by the Joint Laboratory Fund of Beihang University and SynSense. This article was recommended by Associate Editor W. Liu. (Corresponding authors: Biao Pan; Bojun Cheng.)

Wente Yi, Kefan Mo, Wenjia Wang, Yitong Zhou, Yejun Zeng, Zihan Yuan, and Biao Pan are with the School of Integrated Circuit Science and Engineering, Beihang University, Beijing 100191, China (e-mail: panbiao@buaa.edu.cn).

Bojun Cheng is with the Microelectronics Thrust, Function Hub, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou 510000, China (e-mail: bocheng@ust.hk).

Digital Object Identifier 10.1109/TCSI.2024.3350664

I. INTRODUCTION
IN RECENT years, deep neural networks (DNNs) have been widely applied in various fields, including image classification, voice detection, language processing, etc. [1], [2], [3]. However, with the fast growth of network scale, computer systems are confronted with emerging issues relating to data-centric computation, as massive multiply-and-accumulate (MAC) operations exist in DNNs. For example, as one of the representative DNN models, ResNet-50 requires 3.9G MACs for the inference of a single image [4].
In order to meet the high-parallelism requirements of MAC operations, a variety of AI accelerators have been proposed [5], [6], [7], [8] to coordinate with conventional processing units (CPUs or GPUs). Despite the optimization of data flow in these accelerators, the memory-wall bottleneck caused by the separation of computing and memory units still exists, resulting in huge power consumption and extra area overhead [9]. In addition, to meet the demands of diverse application scenarios, the requirement of multi-precision computation with quick configuration must be satisfied within a single DNN accelerator.
In the past ten years, computing-in-memory (CIM), which processes and stores data at the same location, has been proven to be a promising approach to reduce the data movement needed for high-throughput MAC operations [10], [11], [12], [13], [14]. Instead of executing computation tasks merely in arithmetic logic units, CIM allocates partial tasks to memory units, which lowers the demand for data movement between arithmetic logic units and memory units. According to the paradigm of data encoding and processing, CIM is mainly divided into analog computing-in-memory (ACIM), which makes use of Kirchhoff's laws to execute computation, and digital computing-in-memory (DCIM), which makes use of fully digital circuits to execute computation. To date, most CIM research has focused on ACIM for its high energy efficiency and throughput at the cost of limited accuracy [11], [12], [13]. On the contrary, DCIM can avoid the inaccuracies caused by process variations, data conversion overhead, and the poor scaling of analog circuits. However, these benefits also come with new challenges for DCIM in the power and area aspects, raising requirements for optimization of DCIM both internally and externally:
A. Challenge-1 (Power)

According to our survey, the maximum power consumption of DCIM comes from the adder tree. Fig. 1(a) shows the dynamic power breakdown of typical DCIMs in previous works [14], [15], [16] under 55nm CMOS technology (1.2V, 100MHz, 300K). In the 4-bit situation, the adder tree accounts for 79% of the power, making it the main power consumer of DCIM computation. Moreover, as the bit width increases to 8 bits, this ratio rises to 82%, leading to a further imbalance of the power distribution.

Fig. 1. Challenges of DCIM design and the solutions of the RDCIM macro. (a) Power efficiency. (b), (c) Area overhead.
B. Challenge-2a (Area)

Facing the increasing demand of multi-task application scenarios, supporting multi-precision computation has become a necessity for DCIM [22], [24]. As shown in Fig. 1(b), in order to support multi-precision computation, the DCIMs reported in [14], [15], and [16] tend to split the high-precision data into different SRAM arrays and then sum up the computation results of each array in additional circuits (shifters, adders, etc.) outside the DCIM macro. These additional multi-precision supported circuits increase the area overhead by 24% of the accumulator area.
C. Challenge-2b (Area)

Outside the DCIM macro, the input buffer is used to change the format of the input data to support bit-serial computation. As shown in Fig. 1(c), in previous works the input buffer is composed of registers to support the data format conversion. Even though registers are easy to control and can accomplish various functions, their area overhead is huge compared to SRAM cells: a single register is 6× larger than a 6T SRAM cell.
To tackle these challenges, we propose a RISC-V supported full-digital computing-in-memory processor (RDCIM) with high energy efficiency and low area overhead, supporting flexible precision re-configuration. The main contributions of this work can be summarized as:
1) For challenge-1, RDCIM uses the adding-on-memory-boundary (AOMB) scheme to reduce the power consumption of the adder tree. Unlike prior works that simplify multiplications to NOR operations before the adder tree, RDCIM inserts 26T mirror adders into the SRAM arrays to pre-process the weight data and uses multiplexers (MUXs) to implement the multiplications. The AOMB scheme, together with the MUX-embedded adder tree (MAT), significantly reduces the dynamic power.
2) For challenge-2a, the multi-precision adaptive accumulator (MPAA) is designed to realize multi-precision computation internally. The MPAA eliminates the need for additional multi-precision supported circuits, which reduces the area overhead. Moreover, by setting the configuration signals, more computation precisions (4/8/12/16 bits) can be flexibly adjusted.
3) For challenge-2b, a serial-parallel conversion supported SRAM buffer (SPBUF) is designed in RDCIM. Instead of using registers to store the input data, we use a novel 8T SRAM cell to build the input buffer. Compared to the conventional register-based buffer, the SPBUF is 3.12× smaller. Therefore, by adopting the SPBUF, the storage density of the buffer can be increased to store more input data.
4) To support the above-mentioned circuit-level optimizations, extended instructions are employed in RDCIM by integrating the DCIM macro with a RISC-V CPU. The extended instructions include two extra parameters, which enable adjusting the DCIM state dynamically, making them suitable for our proposed DCIM and achieving a 1.2× improvement in computation efficiency.
The remainder of the paper is organized as follows. Section II introduces the DCIM fundamentals and recent related works. Section III presents the architecture of RDCIM, including AOMB+MAT, MPAA, SPBUF, and the RISC-V extended instructions. Section IV presents the experimental results. Finally, Section V concludes the paper.
II. BACKGROUND AND MOTIVATION
A. Overall Architecture of DCIM

Fig. 2(a) shows the overall architecture of DCIM, which contains two parts: the memory array and the digital computation circuit. The memory array is typically composed of SRAM cells and is utilized to perform weight data storage and MAC operations between the input and weight data. As shown in Fig. 2(b), to reduce the area overhead, NOR gates are placed on the boundary of the SRAM array [16], which results in multiple 8T SRAM cells in the same row sharing one NOR gate. Fig. 2(c) illustrates the components of the digital computation circuit: the adder tree and the accumulator. The adder tree is placed on the boundary of the SRAM array, summing up all the outputs of each column in parallel. The accumulator is placed on the boundary of the adder tree to accumulate the outputs of the adder tree in different clock cycles. Besides, an input buffer is needed to temporarily store the input data and output serial bits.
Fig. 2. The overall architecture of a typical DCIM macro and its peripheral circuits. (a) DCIM macro of SRAM arrays and digital computation circuit. (b) SRAM array with 10T cells. (c) Digital computation circuit of adder tree and accumulator.
Fig. 3. The workflow of DCIM. (a) Mathematical expression of the computation. (b) The summing-up process. (c) The accumulating process.
The basic function of DCIM is to implement matrix-vector multiplication (MVM), where MAC operations are used to compute the dot product of two vectors. Mathematically, MVM can be resolved into multiple vector-vector multiplications (VVMs). The computation steps of a VVM are illustrated in Fig. 3(a). In each clock cycle, one specific bit of each input data is sent to the SRAM array to implement the multiplication with the weight data. After that, the outputs of the SRAM array are summed up in the adder tree as shown in Fig. 3(b), and the accumulator shifts the data stored in its register and adds it to the output of the adder tree as shown in Fig. 3(c). These operations are repeated until the computations of the last bit of each input data are finished, and then the accumulator returns the result of the VVM [26].
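To make this bit-serial workflow concrete, the following Python sketch is a behavioral model of one VVM (our own illustration with unsigned data for simplicity, not the authors' implementation): each cycle consumes one bit plane of the inputs, reduces it against the weights as the adder tree would, and shift-accumulates the partial sums.

```python
import numpy as np

def bit_serial_vvm(inputs, weights, in_bits=4):
    """Behavioral model of a DCIM vector-vector multiplication.

    inputs, weights: 1-D unsigned integer arrays of equal length.
    One bit plane of the inputs is processed per clock cycle, MSB first.
    """
    acc = 0
    for k in reversed(range(in_bits)):               # MSB first
        bit_plane = (inputs >> k) & 1                # one bit of each input
        partial = int(np.sum(bit_plane * weights))   # the adder tree's job
        acc = (acc << 1) + partial                   # the accumulator's job
    return acc

x = np.array([3, 1, 2, 0])
w = np.array([5, 7, 2, 4])
assert bit_serial_vvm(x, w) == int(np.dot(x, w))     # 26
```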
B. Related Works
In the past five years, as shown in Fig. 4(a), (b), and (c), researchers have done a lot of work at the architecture and circuit levels around DCIM-based DNN accelerators.
Fig. 4. Related works of DCIM. (a) DCIM-based DNN processors. (b) The innovations in architecture. (c) The innovations in circuits.

As shown in Fig. 4(c), several prior works have optimized the circuit of DCIM to improve its throughput and functionality. Fujiwara et al. designed a DCIM macro in 5nm CMOS technology and integrated the adder tree into the SRAM array, achieving 254 TOPS/W in 4-bit and 63 TOPS/W in 8-bit computation [14]. Yue et al. combined DCIM with floating-point MAC rules, enabling DCIM to support both integer and floating-point computation [20]. H. Zhu et al. proposed a novel DCIM structure, computing-on-memory-boundary (COMB), which allows multiple columns of an SRAM array to share one digital computation circuit on the boundary of the array, achieving a 1152Kb array size and 0.25 TOPS peak performance (65nm CMOS technology) while balancing the computation and memory resources [16].
As shown in Fig. 4(b), several prior works have incorporated DCIM into DNN accelerators to leverage its high MVM computation efficiency and throughput. F. Tu et al. designed a pipeline/parallel reconfigurable accelerator with DCIM for transformer models [19]. The accelerator overcame the challenge of weight update in DCIM and achieved 12.08× to 36.82× lower energy than previous CIM-based accelerators. Moreover, the connection between different DCIM chips can be achieved by chiplet technology [21], which expands the capacity and improves the performance of the accelerator.
C. Motivation

Although a lot of work has been done to improve the performance of CIM accelerators, most of it optimizes a single aspect, either circuits [14], [15], [16], [18], [20], [24], [28], [30], [33], [34] or architecture [19], [21], [25], [27], [29], [31], [32], and the two aspects are rarely combined. Motivated by the previous works, we address the three challenges above by proposing a novel DCIM solution that integrates circuit optimization and architecture design at both levels.
III. RDCIM ARCHITECTURE
The overall architecture of RDCIM is shown in Fig. 5 with three features highlighted in different colors. As depicted in Fig. 5, facing the three challenges, we introduce three key features (1, 2a, and 2b), one corresponding to each challenge, to improve the energy efficiency and reduce the area overhead at both the circuit and architecture levels synergistically. Meanwhile, by introducing a RISC-V CPU, the architectural computation efficiency is expected to be further improved.

Fig. 5. The overall architecture of RDCIM with three features: AOMB+MAT, MPAA, and SPBUF.
A. Adding-on-Memory-Boundary (AOMB)
The key component of MVM computation is the multiplication between input and weight data. In the typical DCIM scheme, the multiplications are performed by NOR gates inside the SRAM array and then summed up by the adder tree. Basically, the total power consumption during MVM computation can be expressed by Eq. (1):

$$P_{sum1} = P_{read} + P_{NOR} + \sum_{i=1}^{n} P_{adder\_tree}[i] \quad (1)$$
Here, $P_{sum1}$ represents the total power consumption, $P_{read}$ the power consumed while reading the weight data, $P_{NOR}$ the power consumed while executing the NOR operations, and $P_{adder\_tree}[i]$ the power consumed by each stage of the adder tree ($n$ stages in total). The strategy for tackling challenge-1 is to analyze Eq. (1) and optimize $P_{NOR} + \sum_{i=1}^{n} P_{adder\_tree}[i]$.
Given the structure of DCIM, the weight data in the SRAM array should be relatively static to achieve better energy efficiency. To utilize this characteristic, we propose the AOMB scheme to pre-process the weight data and further reduce the burden on the digital computation circuits. To cooperate with the AOMB scheme, we design a MAT to accomplish the computation.
Fig. 6 illustrates the structure of the AOMB scheme. In RDCIM, the SRAM array adopts a novel 8T SRAM cell, which is similar to the structure in Fig. 2(b) but comes in two different types, described in detail later. The AOMB scheme decouples the SRAM cells from the NOR gates and places adders beside them. The weight data stored in the SRAM array are split into $a_x$ and $b_x$ ($x$ represents the index of the data), and $s_x$ is the sum of $a_x$ and $b_x$. The outputs of the AOMB are the five bits $a_x[4]\ldots a_x[0]$, $b_x[4]\ldots b_x[0]$, and $s_x[4]\ldots s_x[0]$, each in the alternant inverted-uninverted form described below.

As shown in Fig. 6(a), adders are placed on the boundary of the SRAM array, and each adder is placed between two SRAM rows. The two rows of SRAM cells, together with the adder, form a basic computing unit. Each computing unit passes its carry to the next one and outputs $a_x[k]$, $b_x[k]$, and $s_x[k]$, either in true or complemented form. Two 4-bit weight data are stored in an interleaved manner to meet the demand of the adder chain in Fig. 6(d).
In the AOMB scheme, we adopt the 26T mirror adder instead of the 28T adder, as shown in Fig. 6(c). The advantage of the 26T mirror adder is that it reduces the delay and the transistor count by removing the output inverters of the carry and alternating positive and negative logic. When the 26T mirror adder receives uninverted inputs ($a_x[k]$, $b_x[k]$, $c_x[k-1]$), it outputs an inverted carry ($c_x[k]$) and an uninverted sum ($s_x[k]$). As shown in Fig. 6(d), cascading these adders forms an adder chain. This adder chain differs from a regular one in that its sums and carries alternate between inverted and uninverted forms: for example, if the inputs of the adder chain are $a_x[3]\ldots a_x[0]$ and $b_x[3]\ldots b_x[0]$, the output is $s_x[3]\ldots s_x[0]$ with alternating bit polarities. It is therefore necessary to supply the input bits in the alternant inverted-uninverted form.
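The reason the alternating scheme works is the self-duality of the full adder: complementing all three inputs complements both outputs, so an inverted carry can be consumed directly by the next stage as long as that stage's operands are inverted too. A minimal Python check of this property (our own illustration):

```python
from itertools import product

def full_adder(a, b, c):
    s = a ^ b ^ c
    cout = (a & b) | (a & c) | (b & c)   # majority function
    return s, cout

# Self-duality: inverting all inputs inverts both outputs. This is why the
# 26T mirror adder can omit its output inverters and let inverted and
# uninverted logic alternate along the chain.
for a, b, c in product((0, 1), repeat=3):
    s, cout = full_adder(a, b, c)
    assert full_adder(1 - a, 1 - b, 1 - c) == (1 - s, 1 - cout)
```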
To obtain this input form, we make a specific optimization of the SRAM cell structure. Fig. 6(b) shows that the internal connection of the SRAM cells changes according to the bit position of the weight data. For odd bit positions, point C (PC) connects to point B (PB), and the read bit line (RBL[m]) outputs an inverted bit after the read word line (RWL[0]) is turned on and RBL[m] is pre-charged. For even bit positions, PC connects to point A (PA), and the cell outputs an uninverted bit.
Furthermore, the sign control signal of the weight (w_sgn) is introduced to indicate whether the weight data are signed or unsigned, as shown in Fig. 6(a). To add signed and unsigned weight data, the most significant bit (MSB) of the data should be extended according to w_sgn. In the AOMB scheme, a NOR gate is utilized to extend the MSB of the weight data: if the weight data are signed, w_sgn is set to 1 and $a_x[4]$ equals $a_x[3]$; if the weight data are unsigned, w_sgn is set to 0 and $a_x[4]$ equals 0.
Fig. 6. The design of AOMB and its related circuits. (a) The boundary of one SRAM array with an adder chain placed beside it. (b) The design of one computation unit in AOMB. (c) The mirror adder circuit with 26 transistors. (d) The adder chain with 26T full adders.

Fig. 7. The structure of MAT and its related circuits. (a) The components of MAT. (b) The MUX circuit.

However, the AOMB scheme cannot complete the MAC operations independently, so we design the MAT to implement multiplication and accumulation. As shown in Fig. 7(a), the MAT is placed next to the SRAM array, receiving its outputs and summing up the selected data. Different from a regular adder tree, the first stage of the MAT is composed of MUXs that perform the multiplication. Specifically, transmission-gate logic is adopted to build the MUXs, which reduces the number of transistors from 12 to 6 and lowers the parasitic capacitance. Two bits of input data ($i[l+1]i[l]$) are grouped to select the output of the SRAM array. Fig. 7(b) illustrates the first MUX, where $i[1]i[0]$ select the output of the first computing unit in the SRAM array: '00' means the output of the MUX is 0, '01' means the output is $a_0[k]$, '10' means the output is $b_0[k]$, and '11' means the output is $s_0[k]$. The logical expression of the MUX can be written as $output = a_0[k] \times i[0] + b_0[k] \times i[1]$, which performs the multiplication indirectly. In the subsequent stages of the MAT, the 26T mirror adders are used, and the alternant inverted-uninverted data format is passed on until the final stage of the MAT. The sign-bit processing is similar to the structure in Fig. 6(a). To transform the output to the normal form, the odd bit positions are inverted.
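Because the AOMB adder chain pre-computes $s_0 = a_0 + b_0$, this four-way select is, viewed at word level, exactly the sum of two bit-serial partial products, which is what lets the MUX stage absorb the first level of the adder tree. A short Python check (our own illustration, not from the paper):

```python
from itertools import product

def mat_mux(i1, i0, a, b):
    """First MAT stage, word-level view: i[1]i[0] selects 0, a, b, or the
    pre-computed sum s = a + b supplied by the AOMB adder chain."""
    return [0, a, b, a + b][(i1 << 1) | i0]

# The MUX output equals a*i[0] + b*i[1]: two bit-serial partial products
# are produced at once, absorbing the first adder-tree stage.
a, b = 5, 3
for i1, i0 in product((0, 1), repeat=2):
    assert mat_mux(i1, i0, a, b) == a * i0 + b * i1
```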
By pre-processing the weight data, the power consumption during computation can be expressed as the sum of three components, as shown in Eq. (2):

$$P_{sum2} = P_{read} + P_{adder} \times freq_w + P_{MAT} \quad (2a)$$
$$P_{MAT} = P_{MUX} + \sum_{i=2}^{n} P_{adder\_tree}[i] \quad (2b)$$
Here, $P_{sum2}$ represents the total power consumption, $P_{read}$ the power consumed while reading the weight data, $P_{adder}$ the power consumed by the adders next to the SRAM array, $freq_w$ the frequency of weight updating, and $P_{MAT}$ the power consumed by the MAT. Inside the MAT, the power consists of two parts: the MUXs ($P_{MUX}$) and the remaining adder-tree stages ($\sum_{i=2}^{n} P_{adder\_tree}[i]$). The total power saving can be expressed as $P_{sum1} - P_{sum2}$, as shown in Eq. (3):
$$P_{sum1} - P_{sum2} = P_{NOR} + P_{adder\_tree}[1] - P_{adder} \times freq_w - P_{MUX} \quad (3)$$

Since the computation of DCIM is data-centric, DNN mapping typically focuses on weight reuse, which reduces $freq_w$. As described in Eq. (3), the smaller the $freq_w$, the greater the power saving. In this case, the AOMB scheme reduces the dynamic power of the first stage of the conventional adder tree and collaborates with the MAT to lower the overall power consumption. Fig. 8(a) compares the AOMB scheme ($P_{adder} \times freq_w + P_{MUX}$) with the non-AOMB scheme ($P_{NOR} + P_{adder\_tree}[1]$) at different input and weight toggle rates. We evaluated them in the post-simulation environment (55nm, 100MHz, 300K). The baseline is the conventional scheme at a 0.1 toggle rate, with weight data updated every 8 cycles ($freq_w = 1/8$). The AOMB scheme achieves up to 1.48× power saving at a 0.1 toggle rate. As the toggle rate increases, the AOMB scheme saves more power, finally reaching 1.79× at a 0.5 toggle rate. We attribute this increase to $freq_w$, which attenuates the effect of the toggle rate on power consumption. Fig. 8(b) shows the power consumption comparison between the AOMB+MAT scheme ($P_{sum2}$) and the non-AOMB & non-MAT scheme ($P_{sum1}$) at different input and weight toggle rates in the same evaluation environment as Fig. 8(a). The baseline is the conventional scheme at a 0.1 toggle rate, and the AOMB+MAT scheme achieves up to 1.22× power saving at a 0.5 toggle rate.
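A quick numerical reading of Eq. (3) shows why infrequent weight updates favor AOMB; the component powers below are invented placeholders for illustration only, not measured values:

```python
# Eq. (3): saving = P_NOR + P_adder_tree[1] - P_adder * freq_w - P_MUX.
# All values are made-up placeholders in arbitrary power units.
def power_saving(p_nor, p_stage1, p_adder, p_mux, freq_w):
    return p_nor + p_stage1 - p_adder * freq_w - p_mux

for freq_w in (1 / 2, 1 / 8, 1 / 64):   # weight updated every 2 / 8 / 64 cycles
    saving = power_saving(p_nor=2.0, p_stage1=4.0, p_adder=3.0,
                          p_mux=1.5, freq_w=freq_w)
    print(f"freq_w = {freq_w:.4f} -> saving = {saving:.2f}")
```

The rarer the weight updates, the smaller the $P_{adder} \times freq_w$ term, so the saving approaches its upper bound $P_{NOR} + P_{adder\_tree}[1] - P_{MUX}$.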
B. Multi-Precision Adaptive Accumulator (MPAA)
Fig. 8. The comparison of power consumption between the conventional scheme and our scheme in the post-simulation environment (55nm, 100MHz, 300K). (a) The comparison between the AOMB scheme and the non-AOMB scheme. (b) The comparison between the AOMB+MAT scheme and the non-AOMB & non-MAT scheme.

Fig. 9. The comparison between the conventional multi-precision supported scheme and our MPAA scheme.

To support multi-precision computation, various methods have been proposed in prior works [14]. As shown in Fig. 9(a), if the basic bank is 4 bits, 8-bit weight data should be split into two different SRAM arrays, one for the high 4 bits and the other for the low 4 bits. Both SRAM arrays perform the VVM computation in parallel, and their results are summed up by multi-precision computation-supported circuits (MPCSC) outside DCIM. In the MPCSC, the DCIM result of the high 4 bits is left-shifted by 4 bits and then added to the result of the low 4 bits. In this way, the result of 8-bit data can be obtained in one computation cycle. Similarly, 16-bit weight data can be computed by dividing it into four SRAM arrays and summing up their outputs within the MPCSC: the 16-bit weight data is first split into two 8-bit data sets, each 8-bit set repeats the computation step mentioned before in stage 1 of the MPCSC, and then the MPCSC left-shifts the high 8-bit result by 8 bits and adds it to the low 8-bit result in stage 2. However, this method supports multi-precision computation at the cost of additional circuits, which occupy a quarter of the area of the accumulator, as described in challenge-2a. Moreover, more precision levels need more shifters and adders in the MPCSC, forming a tree structure that increases the area overhead. Meanwhile, selecting the results from different precision levels is challenging because they come from different stages of the MPCSC.
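As a sanity check on this split-and-recombine arithmetic, a short sketch (ours, with arbitrary example values) confirms that left-shifting the high-nibble product by 4 and adding the low-nibble product reproduces the full 8-bit product:

```python
x = 11                        # an input value
w = 0b10110101                # an 8-bit unsigned weight (181)
hi, lo = w >> 4, w & 0xF      # the two 4-bit banks in separate SRAM arrays
assert ((hi * x) << 4) + (lo * x) == w * x   # MPCSC stage-1 recombination
```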
To address challenge-2a, we propose the MPAA, which supports multi-precision computation inside DCIM. As shown in Fig. 9(b), to cooperate with the MPAA, the SRAM array contains multiple columns, and a selector is introduced to select one column to perform the computation. The multiple columns of the SRAM array share the same adder tree and MPAA.
Different from the traditional scheme, if the weight data is 8 bits and the basic bank is 4 bits, the high and low 4 bits are placed in the first and second columns of the SRAM array, respectively. They share a common digital computation circuit and execute the VVM in series. Firstly, the DCIM macro finishes the computation of the high 4 bits and stores the result in the register of the accumulator. Without clearing the previous result, it then finishes the computation of the low 4 bits and stores the final result in the same register.
Fig. 10(a) shows the structure and the computation steps of the MPAA. To support both signed and unsigned computation, the MPAA can be configured by setting the MSB indicator signal (x_MSB) and the sign indicator signals of the input data (x_sgn) and weight data (w_sgn). For unsigned computation, x_MSB, x_sgn, and w_sgn are set to zero, which means the input of the accumulator (data_acc) does not need any processing before being sent to the register. For signed computation, x_MSB, x_sgn, and w_sgn need to cooperate. After receiving data_acc, the MPAA first extends the sign bit based on w_sgn. Then, it judges whether the input data is the MSB based on x_MSB: if so, the sign-extended data (D_sgn) undergoes the operation ∼D_sgn + 1'b1; otherwise, D_sgn remains unchanged. Since the data in the register may need to be right-shifted by 3 bits, a left shift of 3 bits is performed before the adder to avoid accuracy loss. After the register, two shifters are optionally applied based on the cycle control signal (w_cycle) to solve the problem of continuous computation between different columns of banks.
It is noteworthy that the shifting inside the MPAA depends on the bit widths of both the input and the weight data. Assume that the input data is M bits, the basic bank of the SRAM array is N bits, and the weight data is 2N bits. After computing the high N bits of the weight data, the data stored in the accumulator register must effectively be right-shifted by M − N bits before computing the low N bits; otherwise, the result of the high N bits would be left-shifted by M bits during the computation of the low N bits, whereas it actually needs to be left-shifted by only N bits. Because this shift takes the place of the usual 1-bit left shift of the accumulate cycle, the register data is right-shifted by M − N − 1 bits during the first computation cycle of the low N bits of the weight. To support higher precision levels of the weight data (e.g., 3N, 4N, and so on), the data stored in the same register only needs to be right-shifted by M − N − 1 bits each time during the first computation cycle of the next lower N bits of the weight.
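To make this bookkeeping concrete for M = 8 and N = 4 (our own worked derivation from the description above, with H and L denoting the high and low weight nibbles and the register carrying the 3-bit guard shift):

$$reg \;\xrightarrow{\text{high column}}\; (H \cdot x)\,2^{3} \;\xrightarrow{\text{switch}+\text{low column}}\; (H \cdot x)\,2^{7} + (L \cdot x)\,2^{3} \;\xrightarrow{\;\gg 3\;}\; (H \cdot 2^{4} + L)\,x = w \cdot x$$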
Fig. 10. The structure and the computation steps of MPAA. (a) The structure of MPAA with the sign-bit processing circuit, the MSB processing circuit, and the reconfigurable shifter. (b)-(f) The computation steps of MPAA with 8-bit signed input data and weight data.

Fig. 11. Computation steps for 4/8/12/16-bit input data and weight data. Different gray levels represent different steps. In the computation of each precision, the steps are executed from left to right.
As shown in Fig. 10(b) to (f), based on the structure mentioned before, the computation steps for 8-bit signed weight and input data are as follows:
1) Step 1: The input data is the MSB, and the high 4 bits of the weight data are computed first. In this step, x_MSB, x_sgn, and w_sgn are set to 1 and w_cycle is set to 0, so the input data undergoes the sign-bit and MSB pre-processing before being sent to the register.
2) Step 2: The input data is not the MSB, so x_MSB is set to 0 and the remaining signals are unchanged. Step 2 lasts 7 cycles; in every cycle, the register data is left-shifted by 1 bit and summed with the sign-extended input data.
3) Step 3: To compute the low 4 bits of the weight data, w_sgn needs to be set to 0. The input data is the MSB, so x_MSB is set to 1. Since computing the low 4 bits requires switching to the second bank column, w_cycle is set to 1 and the data in the register is right-shifted by 3 bits.
4) Step 4: This step is similar to step 2, with w_sgn set to 0.
5) Step 5: w_cycle is set to 1, and the output is the register data right-shifted by 3 bits, which is 32 bits wide.
Furthermore, as illustrated in Fig. 11, the MPAA can support more precision levels by combining these steps. To support 4-bit input data, steps 2 and 4 are configured into two different types: one lasting 7 cycles (2a and 4a) and the other lasting 3 cycles (2b and 4b).
When the weight data is 4 bits and the input data is 8 bits, only steps 1, 2a, and 5 are executed. When the weight data is 12 or 16 bits, steps 3 and 4a are repeated several times after step 2a, and then step 5 returns the output. When the input data is 4 bits, step 3 is replaced by step 1, steps 2a (4a) are replaced by 2b (4b), and the other steps are unchanged. When the input data is 12 or 16 bits and the input buffer only supports 8-bit input data, the input data is loaded twice: the high 8 bits, which are computed first, follow computation steps 1, 2a, 3, 4a, and 5, while the low 4 or 8 bits follow the same computation steps with x_sgn set to 0. To compute unsigned input and weight data, x_sgn and w_sgn are set to 0, and the computation steps are otherwise the same. Since the register inside the accumulator is 32 bits, any combination in which the sum of the input and weight bit widths is less than 32 is acceptable.

Fig. 12. The area overhead and features comparison between the conventional scheme and our MPAA in 55nm CMOS technology.
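To tie these steps together, the following Python sketch is our own behavioral reconstruction of steps 1-5 for signed 8-bit inputs and weights (not the authors' RTL); the comments map each branch to the control signals above:

```python
def mpaa_vvm(xs, ws, M=8, N=4):
    """Behavioral sketch of the MPAA: signed M-bit inputs, signed 2N-bit
    weights split across two bank columns (high nibble, then low nibble)."""
    G = M - N - 1                            # guard bits: the 3-bit pre-shift
    his = [w >> N for w in ws]               # signed high nibbles (w_sgn = 1)
    los = [w & ((1 << N) - 1) for w in ws]   # unsigned low nibbles (w_sgn = 0)
    reg = 0
    for col in (his, los):                   # steps 1-2, then steps 3-4
        for k in reversed(range(M)):         # one input bit per cycle, MSB first
            part = sum(((x >> k) & 1) * w for x, w in zip(xs, col))
            if k == M - 1:                   # x_MSB = 1 with x_sgn = 1:
                part = -part                 # the MSB has negative weight
            if k == M - 1 and col is los:    # w_cycle = 1 on the column switch:
                reg = (reg >> G) + (part << G)   # >>3 replaces the usual <<1
            else:
                reg = (reg << 1) + (part << G)
    return reg >> G                          # step 5: final 3-bit right shift

xs = [13, -7, 100, -128]                     # signed 8-bit inputs
ws = [-42, 77, -128, 127]                    # signed 8-bit weights
assert mpaa_vvm(xs, ws) == sum(x * w for x, w in zip(xs, ws))
```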
Assuming a weight matrix size of 64 × 64 with 4-bit banks and 8-bit weight data support as our goal, we compare the MPAA with the MPCSC scheme. The MPCSC scheme can compute the 8-bit weight data with a matrix size of 32 × 64 in a single computation cycle; for a larger matrix size, such as 64 × 64, it requires an additional computation cycle. By adopting the MPAA, our RDCIM computes the 8-bit weight data with a matrix size of 64 × 64 in two computation cycles. Hence, when the matrix size exceeds 32 × 64, the MPAA scheme achieves the same throughput as the MPCSC scheme. Moreover, our scheme demonstrates greater bit-width flexibility than the MPCSC scheme. In particular, the MPCSC scheme limits the weight bit width by the quantity of SRAM arrays: to support a 16-bit width with the same parallelism, 4× SRAM arrays are required, and the input bit width is also limited by the MPCSC. By contrast, our scheme can support more weight bits simply by extending the columns of the SRAM array, and the bit width is only limited by the register in the accumulator: the weight and input bit widths can be 4/8/12/16 bits as long as their sum is less than 32 bits. Furthermore, by eliminating the MPCSC, the area overhead is reduced by 1.22× as shown in Fig. 12.

Fig. 13. The structure of SPBUF and its related circuits. (a) The circuit of the register with 36 transistors. (b) The processing steps of SPBUF. (c) The overall structure of SPBUF, including the SRAM cells and their peripheral circuits. (d) The design of the read-out circuit (ROC). (e) The circuit of the 8T SRAM cell and its line-crossing relation.
C. Serial-Parallel Conversion Supported SRAM Buffer (SPBUF)
The input buffer is a crucial component of DCIM: it receives input data and outputs serial bits. To support this function, the input buffer is typically made up of registers, which are easy to control and output steady bit-serial data. However, registers suffer from significant area overhead, as mentioned in challenge-2b. As shown in Fig. 13(a), a register is 4.5× larger than the 8T SRAM cell. Due to this area limitation, it is challenging to increase the storage capacity of the input buffer, resulting in a large hardware overhead in high-bandwidth application scenarios.
To address challenge-2b, we propose the SPBUF, as shown in Fig. 13(c). The SRAM buffer is composed of 64 columns, and each column stores 8 bits of input data. A write selector is placed at the bottom of the SRAM buffer to select a column to store the input data coming from the right side, and a bit selector is placed on the left side to select a specific bit of each input data. Read-out circuits are placed on top of the SRAM buffer to provide steady output data. The processing steps are shown in Fig. 13(b). To start with, the input data needs to be loaded into the SRAM buffer. Similar to a regular SRAM, the write address (write_addr) is given to the write selector, which selects a column to store the 8-bit input data. Then, the bit position (sel_bit) is given to the bit selector to select a specific bit to be computed, and the SPBUF outputs the selected bit of each input data.
To achieve the structure in Fig. 13(c), the 8T SRAM cell is customized as shown in Fig. 13(e). In our design, there are two groups of perpendicular control lines (RWL & WWL, RBL & WBL), so the write path and the read path are perpendicular. Hence, when the input data are stored in different columns, each row represents the same bit position. In the write stage, the WWL activates the SRAM cell, and the WBL passes the data into it. In the read stage, the RWL activates the read path after the RBL is pre-charged: if the data inside the cell is 0, the RBL remains high; otherwise, it discharges to 0. The output of the SRAM cell is thus the inverse of the data inside it. The 8T SRAM cell introduces a problem in that the RBLs need to be pre-charged, which results in unstable output.

To overcome this problem, a read-out circuit is designed as shown in Fig. 13(d). During the low level of the clock signal, the RBL is pre-charged; if the RBL is at a high level while the clock is at a low level, the signal t is locked. During the high level of the clock signal, the RBL obtains the output of the SRAM cell, which is the inverse of the data stored in the cell, and at the same time signal t becomes the inverse of the RBL. The inverter placed between the output and the signal t stabilizes the output data. From the point of view of the control signals, the bit position should be selected at the falling edge of the clock, and the output is obtained at the rising edge of the clock. The output stays stable for one clock cycle.
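Abstracting away the pre-charge timing, the serial-parallel conversion itself is easy to model. The sketch below (our own abstraction with hypothetical names) writes whole 8-bit words column by column and reads back one bit plane at a time, which is exactly the access pattern the bit-serial DCIM consumes:

```python
class SPBuf:
    """Behavioral model of the SPBUF: 64 columns, one 8-bit word each."""
    def __init__(self, cols=64, bits=8):
        self.bits = bits
        self.mem = [0] * cols

    def write(self, write_addr, word):       # parallel write of one input
        self.mem[write_addr] = word & ((1 << self.bits) - 1)

    def read_bit_plane(self, sel_bit):       # serial read: one bit per input
        return [(w >> sel_bit) & 1 for w in self.mem]

buf = SPBuf()
for col, word in enumerate((0b1011, 0b0110, 0b1111)):
    buf.write(col, word)
print(buf.read_bit_plane(sel_bit=0)[:3])     # LSBs of the three inputs: [1, 0, 1]
```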
We compared the area overhead of the register-based input buffer with our SPBUF in 55nm CMOS technology with a capacity of 64 × 8 bits. As shown in Fig. 14(a), the area overhead is reduced by 3.12× when adopting the SPBUF with the same capacity. To verify the correctness of the SPBUF, we simulated the circuit at 55nm and 100MHz; a partial waveform is shown in Fig. 14(b). We choose the high 2 bits of the first column for illustration and call the two SRAM cells cell A and cell B, respectively. In the first clock cycle, cell A and cell B receive the input data. At the falling edge of the second clock cycle, cell A is chosen with RWL[7] set to 1. At the rising edge of the third clock cycle, the read-out circuit outputs the data (Q) in cell A. Since the data in cell A is 0, the RBL does not need to discharge, so the output remains high. At the falling edge of the third clock cycle, cell B is chosen with RWL[6] set to 1. At the rising edge of the fourth clock cycle, the read-out circuit outputs the data (Q) in cell B. Since the data in cell B is 1, its RBL discharges, and the output changes to 0. At the falling edge of the fourth clock cycle, signal t is locked, so the output remains unchanged. The output of each data bit lasts one clock cycle, which matches our previous discussion.

Fig. 14. (a) The area overhead comparison between the register-based input buffer and our SPBUF in 55nm CMOS technology. (b) The waveform of the read-out process.
D. RISC-V Extended Instructions
To support more application scenarios and computation types, a CPU is usually connected to the DCIM to deal with control-intensive tasks. In previous works, the DCIM is typically connected to an ARM-architecture CPU through the AXI bus. However, since the instructions of DCIM are usually coarse-grained, the bus-connection scheme suffers from weak flexibility. Moreover, since the main features incorporated into RDCIM come at the cost of increased control difficulty, a novel fine-grained coupling manner between the CPU and the DCIM is needed.
To couple the DCIM with the CPU more tightly and support the three features, a RISC-V CPU is introduced into RDCIM. The overall architecture of the RISC-V CPU is shown in Fig. 15(a). This architecture supports the RISC-V extended instructions, which provide two additional parameters to control the DCIM flexibly. Moreover, the memory inside the DCIM is directly connected to the data local memory (DLM) in the RISC-V CPU to form an AXI-stream-like connection.

To support the RISC-V extended instructions, we made some modifications to the execution unit. It incorporates a judging module that distinguishes between standard and extended instructions. Standard instructions are executed by the arithmetic logic unit inside the RISC-V CPU, whereas extended instructions are forwarded to an external decoder that further decodes them. The external decoder also obtains the two register parameters; the extended instructions thus execute like the standard ones, but they operate on different hardware components.

Based on the proposed architecture, we designed five RISC-V extended instructions, as illustrated in Fig. 15(b). These instructions resemble the standard ones, carrying two 32-bit register parameters, namely register 1 and register 2, respectively. Working together, these instructions can fully utilize the DCIM functionality.
The cim_clr instruction is used to reset the DCIM state, including setting the accumulator register to 0 and clearing the data in the latches and the read-out circuit.
The data_trans_1 and data_trans_2 instructions are used to transfer the input data and the weight data from the DLM to the SPBUF and the SRAM array, respectively. In the program, the address of the input or weight data in the DLM should be specified in register 1. Moreover, for data_trans_2, the column of the SRAM array to receive the weight data should be specified in register 2 (write_column_sel). During the execution of data_trans_1, the SRAM buffer is filled with the input data from the DLM, and during the execution of data_trans_2, the selected column of the SRAM array is filled with the weight data from the DLM.
The run_cim instruction is used to perform the DCIM computation. To ensure correct computation, register 1 should specify the w_sgn, x_sgn, w_cycle, and bit_w signals, which are related to the AOMB, MAT, MPAA, and SPBUF. The w_sgn and x_sgn signals indicate the sign of the weight and input data in the AOMB, MAT, and MPAA, where 1 means signed and 0 means unsigned. The bit_w signal indicates the bit width of the input data in the SPBUF, where 1 means 8 bits and 0 means 4 bits. The w_cycle signal indicates the need for right-shifting the register data in the MPAA, where 1 means necessary and 0 means unnecessary. Since the DCIM only supports the computation of one column of each SRAM array at a time, register 2 (read_column_sel) should specify the column to be computed.
The read_out instruction is used to transfer the DCIM results to the DLM. Register 1 should specify the address in the DLM that receives the results.
Fig. 15. (a) The hardware foundations of the RISC-V extended instructions. (b) The description of the extended instructions. (IFU: instruction fetch unit. EXU: execution unit. LSU: load-store unit. WB: write back. BIU: bus interface unit.)

There are three main categories of matrix computation using RDCIM, and the following examples illustrate each of them.

If the input and weight data are both 4-bit signed data, the VVM execution follows these steps:
1) cim_clr: Initialize the DCIM state;
2) data_trans_2: Store the weight data in the selected column of the SRAM array. Register 1 specifies the address of the weight data and register 2 specifies the column to be written;
3) data_trans_1: Store the input data into the SRAM buffer. Register 1 specifies the address of the input data;
4) run_cim: Start the computation with the selected column of the SRAM array. The w_sgn and x_sgn signals should be set to 1, since the input data and the weight data are signed, and w_cycle and bit_w should be set to 0. Register 2 specifies the column to be computed;
5) read_out: Load the results into the DLM. Register 1 specifies the address that receives the results.
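In software, this five-instruction sequence maps to a straightforward driver. The sketch below is a Python pseudo-driver of our own: the instruction names follow Fig. 15(b), while the argument packing (a dict standing in for the bit fields of register 1) and the addresses are assumptions made for illustration:

```python
def vvm_4bit_signed(dcim, weight_addr, input_addr, result_addr, column=0):
    """Issue the five extended instructions for a 4-bit signed VVM."""
    dcim.cim_clr()                                    # reset the DCIM state
    dcim.data_trans_2(reg1=weight_addr, reg2=column)  # weights -> SRAM array
    dcim.data_trans_1(reg1=input_addr)                # inputs  -> SPBUF
    dcim.run_cim(reg1=dict(w_sgn=1, x_sgn=1, w_cycle=0, bit_w=0),
                 reg2=column)                         # compute one column
    dcim.read_out(reg1=result_addr)                   # results -> DLM

class MockDCIM:
    """Stand-in for the macro; just records the issued instructions."""
    def __getattr__(self, name):
        return lambda **kw: print(name, kw)

vvm_4bit_signed(MockDCIM(), weight_addr=0x2000, input_addr=0x1000,
                result_addr=0x3000)
```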
Moreover, the instructions can be combined to support multi-precision computation. Suppose the weight data is signed 12-bit and the input data is signed 8-bit. The input data can be stored in the SRAM buffer entirely, but the weight data needs to be stored in three different columns of the SRAM array. The computation can be simplified as follows:
1) cim_clr: Initialize the DCIM state;
2) data_trans_2: Repeat three times to store the 12-bit weight data into three different columns of the SRAM array;
3) data_trans_1: Store the input data into the SRAM buffer;
4) run_cim: Start the computation with the column of the high 4 bits selected. The w_sgn, x_sgn, and bit_w signals should be set to 1, and w_cycle should be set to 0;
5) run_cim: Start the computation with the column of the middle 4 bits selected. The w_cycle, x_sgn, and bit_w signals should be set to 1, and w_sgn should be set to 0;
6) run_cim: Start the computation with the column of the low 4 bits selected. The w_cycle, x_sgn, and bit_w signals should be set to 1, and w_sgn should be set to 0;
7) read_out: Load the results into the DLM.
Suppose the input data is signed 12-bit and the weight data is signed 4-bit. The computation can be simplified to the following steps:
1) cim_clr: Initialize the DCIM state;
2) data_trans_2: Store the weight data into the selected column of the SRAM array;
3) data_trans_1: Store the high 8 bits of the input data into the SRAM buffer;
4) run_cim: Start the computation with the selected column. The w_sgn, x_sgn, and bit_w signals should be set to 1, and w_cycle should be set to 0;
5) data_trans_1: Store the low 4 bits of the input data into the SRAM buffer;
6) run_cim: Start the computation with the selected column. The w_sgn signal should be set to 1, and x_sgn, bit_w, and w_cycle should be set to 0;
7) read_out: Load the results into the DLM.
For the computation of unsigned data, x_sgn and w_sgn should be set to 0.

TABLE I
CHIP SPECIFICATION OF OUR DESIGNED RDCIM MACRO
IV. EXPERIMENTAL RESULTS
A 64KB macro in 55nm CMOS technology was fabricated to evaluate the principal innovations of RDCIM. The detailed technical specifications of the RDCIM macro are shown in Table I. The RDCIM macro operates at a nominal voltage of 1.2V and a clock frequency of 200MHz, with a total area of 9.8mm². It contains a 2.8mm² DCIM macro with a 64KB SRAM array inside. The RDCIM macro achieves a peak energy efficiency of 66.3 TOPS/W with an area efficiency of 0.288 TOPS/mm² at 4-bit precision, and 16.6 TOPS/W with an area efficiency of 0.072 TOPS/mm² at 8-bit precision. Multi-precision computation covering 4/8/12/16 bits of weight and input data can be conducted in RDCIM to support various application scenarios.

To support the innovative circuits mentioned before, a RISC-V CPU with 64KB instruction and data caches, expandable to 4MB, is incorporated. This RISC-V CPU has a 3-stage pipeline structure that can execute RISC-V instructions seamlessly.