RDCIM: RISC-V Supported Full-Digital Computing-in-Memory Processor With High Energy Efficiency and Low Area Overhead
Wente Yi, Graduate Student Member, IEEE, Kefan Mo, Wenjia Wang, Yitong Zhou, Yejun Zeng, Zihan Yuan, Bojun Cheng, Member, IEEE, and Biao Pan, Member, IEEE
Abstract— Digital computing-in-memory (DCIM), which merges computing logic into memory, has been proven to be an efficient architecture for accelerating multiply-and-accumulates (MACs). However, low energy efficiency and high area overhead pose a primary restriction for integrating DCIM into the re-configurable processors required for multi-functional workloads. To alleviate this dilemma, a novel RISC-V supported full-digital computing-in-memory processor (RDCIM) is designed and fabricated in 55nm CMOS technology. In RDCIM, an adding-on-memory-boundary (AOMB) scheme is adopted to improve the energy efficiency of DCIM. Meanwhile, a multi-precision adaptive accumulator (MPAA) and a serial-parallel conversion supported SRAM buffer (SPBUF) are employed to reduce the area overhead caused by the peripheral circuits and the intermediate buffer for multi-precision support. The results show that the energy efficiency of our design is 16.6 TOPS/W (8-bit) and 66.3 TOPS/W (4-bit). Compared to related works, the proposed RDCIM macro shows a maximum energy efficiency improvement of 1.22× in a continuous computing scenario, an area saving of 1.22× in the accumulator, and an area saving of 3.12× in the input buffer. Moreover, in RDCIM, 5 fine-grained RISC-V extended instructions are designed to dynamically adjust the state of DCIM, reaching a 1.2× computation efficiency improvement.
Index Terms— Computing-in-memory, RISC-V, extended instructions, re-configurable precision.

Manuscript received 14 September 2023; revised 2 November 2023 and 13 December 2023; accepted 3 January 2024. This work was supported in part by the National Key Research and Development Program of China under Grant 2021YFB3601304 and Grant 2021YFB3601300, in part by the National Natural Science Foundation of China under Grant 62001019, in part by the Laboratory Open Fund of Beijing Smart-Chip Microelectronics Technology Company Ltd., in part by the Fundamental Research Funds for the Central Universities, in part by the Key Research and Development Program of Anhui Province under Grant 2022a05020018, and in part by the Joint Laboratory Fund of Beihang University and SynSense. This article was recommended by Associate Editor W. Liu. (Corresponding authors: Biao Pan; Bojun Cheng.)

Wente Yi, Kefan Mo, Wenjia Wang, Yitong Zhou, Yejun Zeng, Zihan Yuan, and Biao Pan are with the School of Integrated Circuit Science and Engineering, Beihang University, Beijing 100191, China (e-mail: panbiao@buaa.edu.cn).

Bojun Cheng is with the Microelectronics Thrust, Function Hub, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou 510000, China (e-mail: bocheng@ust.hk).

Digital Object Identifier 10.1109/TCSI.2024.3350664

I. INTRODUCTION
IN RECENT years, deep neural networks (DNNs) have been widely applied in various fields, including image classification, voice detection, language processing, etc. [1], [2], [3]. However, with the fast growth of network scale, computer systems are confronted with emerging issues relating to data-centric computation, as massive multiply-and-accumulate (MAC) operations exist in DNNs. For example, as one of the representative DNN models, ResNet-50 requires 3.9G MACs for the inference of a single image [4].
In order to meet the high-parallelism requirements of MAC operations, a variety of AI accelerators have been proposed [5], [6], [7], [8] to coordinate with conventional processing units (CPUs or GPUs). Despite the optimization of data flow in these accelerators, the memory-wall bottleneck caused by the separation of computing and memory units still exists, resulting in huge power consumption and extra area overhead [9]. In addition, to meet the demands of diverse application scenarios, the requirement of multi-precision computation with quick configuration must be satisfied within a single DNN accelerator.
In the past ten years, computing-in-memory (CIM), which processes and stores data at the same location, has been proven to be a promising approach to reduce the data movement needed for high-throughput MAC operations [10], [11], [12], [13], [14]. Instead of executing computation tasks merely in arithmetic logic units, CIM allocates partial tasks to memory units, which lowers the demand for data movement between arithmetic logic units and memory units. According to the paradigm of data encoding and processing, CIM is mainly divided into analog computing-in-memory (ACIM), which makes use of Kirchhoff's laws to execute computation, and digital computing-in-memory (DCIM), which makes use of fully digital circuits to execute computation. To date, most CIM research has focused on ACIM for its high energy efficiency and throughput at the cost of limited accuracy [11], [12], [13]. On the contrary, DCIM can avoid the inaccuracies caused by process variations, data conversion overhead, and the poor scaling of analog circuits. However, these benefits also come with new challenges for DCIM in the power and area aspects, raising requirements for optimization of DCIM both internally and externally:
A. Challenge-1 (Power)

According to our survey, the maximum power consumption of DCIM comes from the adder tree. Fig. 1(a) shows the dynamic power breakdown of typical DCIMs in previous works [14], [15], [16] under 55nm CMOS technology (1.2V, 100MHz, 300K). In the 4-bit situation, the adder tree accounts for 79% of the power, making it the main power consumer of DCIM computation. Moreover, as the bit width increases to 8 bits, this ratio rises to 82%, leading to a further imbalance of the power distribution.

Fig. 1. Challenges of DCIM design and the solutions of the RDCIM macro. (a) Power efficiency. (b), (c) Area overhead.
B. Challenge-2a (Area)

Facing the increasing demand of multi-task application scenarios, supporting multi-precision computation has become a necessity for DCIM [22], [24]. As shown in Fig. 1(b), in order to support multi-precision computation, the DCIMs reported in [14], [15], and [16] tend to split the high-precision data into different SRAM arrays and then sum up the computation results of each array in additional circuits (shifters, adders, etc.) outside the DCIM macro. These additional multi-precision supported circuits increase the area overhead by 24% of the accumulator area.
C. Challenge-2b (Area)

Outside the DCIM macro, the input buffer is used to change the format of the input data to support bit-serial computation. As shown in Fig. 1(c), in previous works the input buffer is composed of registers to support the data format conversion. Even though registers are easy to control and can accomplish various functions, their area overhead is huge compared to SRAM cells: a single register is 6× larger than a 6T SRAM cell.
To tackle these challenges, we propose a RISC-V supported full-digital computing-in-memory processor (RDCIM) with high energy efficiency and low area overhead, supporting flexible precision re-configuration. The main contributions of this work can be summarized as:
1) For challenge-1, RDCIM uses the adding-on-memory-boundary (AOMB) scheme to reduce the power consumption of the adder tree. Unlike prior works that simplify multiplications to NOR operations before the adder tree, RDCIM inserts 26T mirror adders into the SRAM arrays to pre-process the weight data and uses multiplexers (MUXs) to implement the multiplications. The AOMB scheme, together with the MUX-embedded adder tree (MAT), significantly reduces the dynamic power.
2) For challenge-2a, the multi-precision adaptive accumulator (MPAA) is designed to realize multi-precision computation internally. The MPAA eliminates the need for additional multi-precision supported circuits, which reduces the area overhead. Moreover, by setting the configuration signals, more computation precisions (4/8/12/16 bits) can be flexibly adjusted.
3) For challenge-2b, a serial-parallel conversion supported SRAM buffer (SPBUF) is designed in RDCIM. Instead of using registers to store the input data, we use a novel 8T SRAM cell to build the input buffer. Compared to the conventional register-based buffer, the SPBUF is 3.12× smaller. Therefore, by adopting the SPBUF, the storage density of the buffer can be increased to store more input data.
4) To support the above-mentioned circuit-level optimizations, extended instructions are employed in RDCIM by integrating the DCIM macro with a RISC-V CPU. The extended instructions include two extra parameters, which enable adjusting the DCIM state dynamically, making them suitable for our proposed DCIM and achieving a 1.2× improvement in computation efficiency.
The remainder of the paper is organized as follows. Section II introduces the DCIM fundamentals and recent related works. Section III presents the architecture of RDCIM, including AOMB+MAT, MPAA, SPBUF, and the RISC-V extended instructions. Section IV presents the experimental results. Finally, Section V concludes the paper.
II. BACKGROUND AND MOTIVATION
A. Overall Architecture of DCIM

Fig. 2(a) shows the overall architecture of DCIM, which contains two parts: the memory array and the digital computation circuit. The memory array is typically composed of SRAM cells and is utilized to perform weight data storage and MAC operations between the input and weight data. As shown in Fig. 2(b), to reduce the area overhead, NOR gates are placed on the boundary of the SRAM array [16], which results in multiple 8T SRAM cells in the same row sharing one NOR gate. Fig. 2(c) illustrates the components of the digital computation circuit: the adder tree and the accumulator. The adder tree is placed on the boundary of the SRAM array, summing up all the outputs of each column in parallel. The accumulator is placed on the boundary of the adder tree to accumulate the outputs of the adder tree in different clock cycles. Besides, an input buffer is needed to temporarily store the input data and output serial bits.
Fig. 2. The overall architecture of a typical DCIM macro and its peripheral circuits. (a) DCIM macro of SRAM arrays and digital computation circuit. (b) SRAM array with 10T cells. (c) Digital computation circuit of adder tree and accumulator.
Fig. 3. The workflow of DCIM. (a) Mathematical expression of the computation. (b) The summing-up process. (c) The accumulating process.
The basic function of DCIM is to implement matrix-vector multiplication (MVM), where MAC operations are used to compute the dot product of two vectors. Mathematically, MVM can be resolved into multiple vector-vector multiplications (VVMs). The computation steps of a VVM are illustrated in Fig. 3(a). In each clock cycle, one specific bit of each input data is sent to the SRAM array to implement the multiplication with the weight data. After that, the outputs of the SRAM array are summed up in the adder tree as shown in Fig. 3(b), and the accumulator shifts the data stored in its register and adds it to the output of the adder tree as shown in Fig. 3(c). These operations are repeated until the computations of the last bit of each input data are finished, and then the accumulator returns the result of the VVM [26].
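To make this bit-serial workflow concrete, the following Python sketch is a behavioral model of one VVM (our own illustration with unsigned data for simplicity, not the authors' implementation): each cycle consumes one bit plane of the inputs, reduces it against the weights as the adder tree would, and shift-accumulates the partial sums.

```python
import numpy as np

def bit_serial_vvm(inputs, weights, in_bits=4):
    """Behavioral model of a DCIM vector-vector multiplication.

    inputs, weights: 1-D unsigned integer arrays of equal length.
    One bit plane of the inputs is processed per clock cycle, MSB first.
    """
    acc = 0
    for k in reversed(range(in_bits)):               # MSB first
        bit_plane = (inputs >> k) & 1                # one bit of each input
        partial = int(np.sum(bit_plane * weights))   # the adder tree's job
        acc = (acc << 1) + partial                   # the accumulator's job
    return acc

x = np.array([3, 1, 2, 0])
w = np.array([5, 7, 2, 4])
assert bit_serial_vvm(x, w) == int(np.dot(x, w))     # 26
```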
B. Related Works
In the past five years, as shown in Fig. 4(a), (b), and (c), researchers have done a lot of work at the architecture and circuit levels around DCIM-based DNN accelerators.
Fig. 4. Related works of DCIM. (a) DCIM-based DNN processors. (b) The innovations in architecture. (c) The innovations in circuits.

As shown in Fig. 4(c), several prior works have optimized the circuit of DCIM to improve its throughput and functionality. Fujiwara et al. designed a DCIM macro in 5nm CMOS technology and integrated the adder tree into the SRAM array, achieving 254 TOPS/W in 4-bit and 63 TOPS/W in 8-bit computation [14]. Yue et al. combined DCIM with floating-point MAC rules, enabling DCIM to support both integer and floating-point computation [20]. H. Zhu et al. proposed a novel DCIM structure, computing-on-memory-boundary (COMB), which allows multiple columns of an SRAM array to share one digital computation circuit on the boundary of the array, achieving a 1152Kb array size and 0.25 TOPS peak performance (65nm CMOS technology) while balancing the computation and memory resources [16].
As shown in Fig. 4(b), several prior works have incorporated DCIM into DNN accelerators to leverage its high MVM computation efficiency and throughput. F. Tu et al. designed a pipeline/parallel reconfigurable accelerator with DCIM for transformer models [19]. The accelerator overcame the challenge of weight update in DCIM and achieved 12.08× to 36.82× lower energy than previous CIM-based accelerators. Moreover, the connection between different DCIM chips can be achieved by chiplet technology [21], which expands the capacity and improves the performance of the accelerator.
C. Motivation

Although a lot of work has been done to improve the performance of CIM accelerators, most of it optimizes a single aspect, either circuits [14], [15], [16], [18], [20], [24], [28], [30], [33], [34] or architecture [19], [21], [25], [27], [29], [31], [32], and the two aspects are rarely combined. Motivated by the previous works, we address the three challenges above by proposing a novel DCIM solution that integrates circuit optimization and architecture design at both levels.
III. RDCIM ARCHITECTURE
The overall architecture of RDCIM is shown in Fig. 5 with three features highlighted in different colors. As depicted in Fig. 5, facing the three challenges, we introduce three key features (1, 2a, and 2b), one corresponding to each challenge, to improve the energy efficiency and reduce the area overhead at both the circuit and architecture levels synergistically. Meanwhile, by introducing a RISC-V CPU, the architectural computation efficiency is expected to be further improved.

Fig. 5. The overall architecture of RDCIM with three features: AOMB+MAT, MPAA, and SPBUF.
A. Adding-on-Memory-Boundary (AOMB)
The key component of MVM computation is the multiplication between input and weight data. In the typical DCIM scheme, the multiplications are performed by NOR gates inside the SRAM array and then summed up by the adder tree. Basically, the total power consumption during MVM computation can be expressed by Eq. (1):

$$P_{sum1} = P_{read} + P_{NOR} + \sum_{i=1}^{n} P_{adder\_tree}[i] \quad (1)$$
Here, $P_{sum1}$ represents the total power consumption, $P_{read}$ the power consumed while reading the weight data, $P_{NOR}$ the power consumed while executing the NOR operations, and $P_{adder\_tree}[i]$ the power consumed by each stage of the adder tree ($n$ stages in total). The strategy for tackling challenge-1 is to analyze Eq. (1) and optimize $P_{NOR} + \sum_{i=1}^{n} P_{adder\_tree}[i]$.
Given the structure of DCIM, the weight data in the SRAM array should be relatively static to achieve better energy efficiency. To utilize this characteristic, we propose the AOMB scheme to pre-process the weight data and further reduce the burden on the digital computation circuits. To cooperate with the AOMB scheme, we design a MAT to accomplish the computation.
Fig. 6 illustrates the structure of the AOMB scheme. In RDCIM, the SRAM array adopts a novel 8T SRAM cell, which is similar to the structure in Fig. 2(b) but comes in two different types, described in detail later. The AOMB scheme decouples the SRAM cells from the NOR gates and places adders beside them. The weight data stored in the SRAM array are split into $a_x$ and $b_x$ ($x$ represents the index of the data), and $s_x$ is the sum of $a_x$ and $b_x$. The outputs of the AOMB are the five bits $a_x[4]\ldots a_x[0]$, $b_x[4]\ldots b_x[0]$, and $s_x[4]\ldots s_x[0]$, each in the alternant inverted-uninverted form described below.

As shown in Fig. 6(a), adders are placed on the boundary of the SRAM array, and each adder is placed between two SRAM rows. The two rows of SRAM cells, together with the adder, form a basic computing unit. Each computing unit passes its carry to the next one and outputs $a_x[k]$, $b_x[k]$, and $s_x[k]$, either in true or complemented form. Two 4-bit weight data are stored in an interleaved manner to meet the demand of the adder chain in Fig. 6(d).
In the AOMB scheme, we adopt the 26T mirror adder instead of the 28T adder, as shown in Fig. 6(c). The advantage of the 26T mirror adder is that it reduces the delay and the transistor count by removing the output inverters of the carry and alternating positive and negative logic. When the 26T mirror adder receives uninverted inputs ($a_x[k]$, $b_x[k]$, $c_x[k-1]$), it outputs an inverted carry ($c_x[k]$) and an uninverted sum ($s_x[k]$). As shown in Fig. 6(d), cascading these adders forms an adder chain. This adder chain differs from a regular one in that its sums and carries alternate between inverted and uninverted forms: for example, if the inputs of the adder chain are $a_x[3]\ldots a_x[0]$ and $b_x[3]\ldots b_x[0]$, the output is $s_x[3]\ldots s_x[0]$ with alternating bit polarities. It is therefore necessary to supply the input bits in the alternant inverted-uninverted form.
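The reason the alternating scheme works is the self-duality of the full adder: complementing all three inputs complements both outputs, so an inverted carry can be consumed directly by the next stage as long as that stage's operands are inverted too. A minimal Python check of this property (our own illustration):

```python
from itertools import product

def full_adder(a, b, c):
    s = a ^ b ^ c
    cout = (a & b) | (a & c) | (b & c)   # majority function
    return s, cout

# Self-duality: inverting all inputs inverts both outputs. This is why the
# 26T mirror adder can omit its output inverters and let inverted and
# uninverted logic alternate along the chain.
for a, b, c in product((0, 1), repeat=3):
    s, cout = full_adder(a, b, c)
    assert full_adder(1 - a, 1 - b, 1 - c) == (1 - s, 1 - cout)
```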
To obtain this input form, we make a specific optimization of the SRAM cell structure. Fig. 6(b) shows that the internal connection of the SRAM cells changes according to the bit position of the weight data. For odd bit positions, point C (PC) connects to point B (PB), and the read bit line (RBL[m]) outputs an inverted bit after the read word line (RWL[0]) is turned on and RBL[m] is pre-charged. For even bit positions, PC connects to point A (PA), and the cell outputs an uninverted bit.
Furthermore, the sign control signal of the weight (w_sgn) is introduced to indicate whether the weight data are signed or unsigned, as shown in Fig. 6(a). To add signed and unsigned weight data, the most significant bit (MSB) of the data should be extended according to w_sgn. In the AOMB scheme, a NOR gate is utilized to extend the MSB of the weight data: if the weight data are signed, w_sgn is set to 1 and $a_x[4]$ equals $a_x[3]$; if the weight data are unsigned, w_sgn is set to 0 and $a_x[4]$ equals 0.
Fig. 6. The design of AOMB and its related circuits. (a) The boundary of one SRAM array with an adder chain placed beside it. (b) The design of one computation unit in AOMB. (c) The mirror adder circuit with 26 transistors. (d) The adder chain with 26T full adders.

Fig. 7. The structure of MAT and its related circuits. (a) The components of MAT. (b) The MUX circuit.

However, the AOMB scheme cannot complete the MAC operations independently, so we design the MAT to implement multiplication and accumulation. As shown in Fig. 7(a), the MAT is placed next to the SRAM array, receiving its outputs and summing up the selected data. Different from a regular adder tree, the first stage of the MAT is composed of MUXs that perform the multiplication. Specifically, transmission-gate logic is adopted to build the MUXs, which reduces the number of transistors from 12 to 6 and lowers the parasitic capacitance. Two bits of input data ($i[l+1]i[l]$) are grouped to select the output of the SRAM array. Fig. 7(b) illustrates the first MUX, where $i[1]i[0]$ select the output of the first computing unit in the SRAM array: '00' means the output of the MUX is 0, '01' means the output is $a_0[k]$, '10' means the output is $b_0[k]$, and '11' means the output is $s_0[k]$. The logical expression of the MUX can be written as $output = a_0[k] \times i[0] + b_0[k] \times i[1]$, which performs the multiplication indirectly. In the subsequent stages of the MAT, the 26T mirror adders are used, and the alternant inverted-uninverted data format is passed on until the final stage of the MAT. The sign-bit processing is similar to the structure in Fig. 6(a). To transform the output to the normal form, the odd bit positions are inverted.
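Because the AOMB adder chain pre-computes $s_0 = a_0 + b_0$, this four-way select is, viewed at word level, exactly the sum of two bit-serial partial products, which is what lets the MUX stage absorb the first level of the adder tree. A short Python check (our own illustration, not from the paper):

```python
from itertools import product

def mat_mux(i1, i0, a, b):
    """First MAT stage, word-level view: i[1]i[0] selects 0, a, b, or the
    pre-computed sum s = a + b supplied by the AOMB adder chain."""
    return [0, a, b, a + b][(i1 << 1) | i0]

# The MUX output equals a*i[0] + b*i[1]: two bit-serial partial products
# are produced at once, absorbing the first adder-tree stage.
a, b = 5, 3
for i1, i0 in product((0, 1), repeat=2):
    assert mat_mux(i1, i0, a, b) == a * i0 + b * i1
```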
By pre-processing the weight data, the power consumption during computation can be expressed as the sum of three components, as shown in Eq. (2):

$$P_{sum2} = P_{read} + P_{adder} \times freq_w + P_{MAT} \quad (2a)$$
$$P_{MAT} = P_{MUX} + \sum_{i=2}^{n} P_{adder\_tree}[i] \quad (2b)$$
Here, $P_{sum2}$ represents the total power consumption, $P_{read}$ the power consumed while reading the weight data, $P_{adder}$ the power consumed by the adders next to the SRAM array, $freq_w$ the frequency of weight updating, and $P_{MAT}$ the power consumed by the MAT. Inside the MAT, the power consists of two parts: the MUXs ($P_{MUX}$) and the remaining adder-tree stages ($\sum_{i=2}^{n} P_{adder\_tree}[i]$). The total power saving can be expressed as $P_{sum1} - P_{sum2}$, as shown in Eq. (3):
$$P_{sum1} - P_{sum2} = P_{NOR} + P_{adder\_tree}[1] - P_{adder} \times freq_w - P_{MUX} \quad (3)$$

Since the computation of DCIM is data-centric, DNN mapping typically focuses on weight reuse, which reduces $freq_w$. As described in Eq. (3), the smaller the $freq_w$, the greater the power saving. In this case, the AOMB scheme reduces the dynamic power of the first stage of the conventional adder tree and collaborates with the MAT to lower the overall power consumption. Fig. 8(a) compares the AOMB scheme ($P_{adder} \times freq_w + P_{MUX}$) with the non-AOMB scheme ($P_{NOR} + P_{adder\_tree}[1]$) at different input and weight toggle rates. We evaluated them in the post-simulation environment (55nm, 100MHz, 300K). The baseline is the conventional scheme at a 0.1 toggle rate, with weight data updated every 8 cycles ($freq_w = 1/8$). The AOMB scheme achieves up to 1.48× power saving at a 0.1 toggle rate. As the toggle rate increases, the AOMB scheme saves more power, finally reaching 1.79× at a 0.5 toggle rate. We attribute this increase to $freq_w$, which attenuates the effect of the toggle rate on power consumption. Fig. 8(b) shows the power consumption comparison between the AOMB+MAT scheme ($P_{sum2}$) and the non-AOMB & non-MAT scheme ($P_{sum1}$) at different input and weight toggle rates in the same evaluation environment as Fig. 8(a). The baseline is the conventional scheme at a 0.1 toggle rate, and the AOMB+MAT scheme achieves up to 1.22× power saving at a 0.5 toggle rate.
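A quick numerical reading of Eq. (3) shows why infrequent weight updates favor AOMB; the component powers below are invented placeholders for illustration only, not measured values:

```python
# Eq. (3): saving = P_NOR + P_adder_tree[1] - P_adder * freq_w - P_MUX.
# All values are made-up placeholders in arbitrary power units.
def power_saving(p_nor, p_stage1, p_adder, p_mux, freq_w):
    return p_nor + p_stage1 - p_adder * freq_w - p_mux

for freq_w in (1 / 2, 1 / 8, 1 / 64):   # weight updated every 2 / 8 / 64 cycles
    saving = power_saving(p_nor=2.0, p_stage1=4.0, p_adder=3.0,
                          p_mux=1.5, freq_w=freq_w)
    print(f"freq_w = {freq_w:.4f} -> saving = {saving:.2f}")
```

The rarer the weight updates, the smaller the $P_{adder} \times freq_w$ term, so the saving approaches its upper bound $P_{NOR} + P_{adder\_tree}[1] - P_{MUX}$.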
B. Multi-Precision Adaptive Accumulator (MPAA)
Fig. 8. The comparison of power consumption between the conventional scheme and our scheme in the post-simulation environment (55nm, 100MHz, 300K). (a) The comparison between the AOMB scheme and the non-AOMB scheme. (b) The comparison between the AOMB+MAT scheme and the non-AOMB & non-MAT scheme.

Fig. 9. The comparison between the conventional multi-precision supported scheme and our MPAA scheme.

To support multi-precision computation, various methods have been proposed in prior works [14]. As shown in Fig. 9(a), if the basic bank is 4 bits, 8-bit weight data should be split into two different SRAM arrays, one for the high 4 bits and the other for the low 4 bits. Both SRAM arrays perform the VVM computation in parallel, and their results are summed up by multi-precision computation-supported circuits (MPCSC) outside DCIM. In the MPCSC, the DCIM result of the high 4 bits is left-shifted by 4 bits and then added to the result of the low 4 bits. In this way, the result of 8-bit data can be obtained in one computation cycle. Similarly, 16-bit weight data can be computed by dividing it into four SRAM arrays and summing up their outputs within the MPCSC: the 16-bit weight data is first split into two 8-bit data sets, each 8-bit set repeats the computation step mentioned before in stage 1 of the MPCSC, and then the MPCSC left-shifts the high 8-bit result by 8 bits and adds it to the low 8-bit result in stage 2. However, this method supports multi-precision computation at the cost of additional circuits, which occupy a quarter of the area of the accumulator, as described in challenge-2a. Moreover, more precision levels need more shifters and adders in the MPCSC, forming a tree structure that increases the area overhead. Meanwhile, selecting the results from different precision levels is challenging because they come from different stages of the MPCSC.
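As a sanity check on this split-and-recombine arithmetic, a short sketch (ours, with arbitrary example values) confirms that left-shifting the high-nibble product by 4 and adding the low-nibble product reproduces the full 8-bit product:

```python
x = 11                        # an input value
w = 0b10110101                # an 8-bit unsigned weight (181)
hi, lo = w >> 4, w & 0xF      # the two 4-bit banks in separate SRAM arrays
assert ((hi * x) << 4) + (lo * x) == w * x   # MPCSC stage-1 recombination
```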
To address challenge-2a, we propose the MPAA, which supports multi-precision computation inside DCIM. As shown in Fig. 9(b), to cooperate with the MPAA, the SRAM array contains multiple columns, and a selector is introduced to select one column to perform the computation. The multiple columns of the SRAM array share the same adder tree and MPAA.
Different from the traditional scheme, if the weight data is 8 bits and the basic bank is 4 bits, the high and low 4 bits are placed in the first and second columns of the SRAM array, respectively. They share a common digital computation circuit and execute the VVM in series. Firstly, the DCIM macro finishes the computation of the high 4 bits and stores the result in the register of the accumulator. Without clearing the previous result, it then finishes the computation of the low 4 bits and stores the final result in the same register.
Fig. 10(a) shows the structure and the computation steps of the MPAA. To support both signed and unsigned computation, the MPAA can be configured by setting the MSB indicator signal (x_MSB) and the sign indicator signals of the input data (x_sgn) and weight data (w_sgn). For unsigned computation, x_MSB, x_sgn, and w_sgn are set to zero, which means the input of the accumulator (data_acc) does not need any processing before being sent to the register. For signed computation, x_MSB, x_sgn, and w_sgn need to cooperate. After receiving data_acc, the MPAA first extends the sign bit based on w_sgn. Then, it judges whether the input data is the MSB based on x_MSB: if so, the sign-extended data (D_sgn) undergoes the operation ∼D_sgn + 1'b1; otherwise, D_sgn remains unchanged. Since the data in the register may need to be right-shifted by 3 bits, a left shift of 3 bits is performed before the adder to avoid accuracy loss. After the register, two shifters are optionally applied based on the cycle control signal (w_cycle) to solve the problem of continuous computation between different columns of banks.
It is noteworthy that the shifting inside the MPAA depends on the bit widths of both the input and the weight data. Assume that the input data is M bits, the basic bank of the SRAM array is N bits, and the weight data is 2N bits. After computing the high N bits of the weight data, the data stored in the accumulator register must effectively be right-shifted by M − N bits before computing the low N bits; otherwise, the result of the high N bits would be left-shifted by M bits during the computation of the low N bits, whereas it actually needs to be left-shifted by only N bits. Because this shift takes the place of the usual 1-bit left shift of the accumulate cycle, the register data is right-shifted by M − N − 1 bits during the first computation cycle of the low N bits of the weight. To support higher precision levels of the weight data (e.g., 3N, 4N, and so on), the data stored in the same register only needs to be right-shifted by M − N − 1 bits each time during the first computation cycle of the next lower N bits of the weight.
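To make this bookkeeping concrete for M = 8 and N = 4 (our own worked derivation from the description above, with H and L denoting the high and low weight nibbles and the register carrying the 3-bit guard shift):

$$reg \;\xrightarrow{\text{high column}}\; (H \cdot x)\,2^{3} \;\xrightarrow{\text{switch}+\text{low column}}\; (H \cdot x)\,2^{7} + (L \cdot x)\,2^{3} \;\xrightarrow{\;\gg 3\;}\; (H \cdot 2^{4} + L)\,x = w \cdot x$$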
Fig. 10. The structure and the computation steps of MPAA. (a) The structure of MPAA with the sign-bit processing circuit, the MSB processing circuit, and the reconfigurable shifter. (b)-(f) The computation steps of MPAA with 8-bit signed input data and weight data.

Fig. 11. Computation steps for 4/8/12/16-bit input data and weight data. Different gray levels represent different steps. In the computation of each precision, the steps are executed from left to right.
As shown in Fig. 10(b) to (f), based on the structure mentioned before, the computation steps for 8-bit signed weight and input data are as follows:
1) Step 1: The input data is the MSB, and the high 4 bits of the weight data are computed first. In this step, x_MSB, x_sgn, and w_sgn are set to 1 and w_cycle is set to 0, so the input data undergoes the sign-bit and MSB pre-processing before being sent to the register.
2) Step 2: The input data is not the MSB, so x_MSB is set to 0 and the remaining signals are unchanged. Step 2 lasts 7 cycles; in every cycle, the register data is left-shifted by 1 bit and summed with the sign-extended input data.
3) Step 3: To compute the low 4 bits of the weight data, w_sgn needs to be set to 0. The input data is the MSB, so x_MSB is set to 1. Since computing the low 4 bits requires switching to the second bank column, w_cycle is set to 1 and the data in the register is right-shifted by 3 bits.
4) Step 4: This step is similar to step 2, with w_sgn set to 0.
5) Step 5: w_cycle is set to 1, and the output is the register data right-shifted by 3 bits, which is 32 bits wide.
Furthermore, as illustrated in Fig. 11, the MPAA can support more precision levels by combining these steps. To support 4-bit input data, steps 2 and 4 are configured into two different types: one lasting 7 cycles (2a and 4a) and the other lasting 3 cycles (2b and 4b).
When the weight data is 4 bits and the input data is 8 bits, only steps 1, 2a, and 5 are executed. When the weight data is 12 or 16 bits, steps 3 and 4a are repeated several times after step 2a, and then step 5 returns the output. When the input data is 4 bits, step 3 is replaced by step 1, steps 2a (4a) are replaced by 2b (4b), and the other steps are unchanged. When the input data is 12 or 16 bits and the input buffer only supports 8-bit input data, the input data is loaded twice: the high 8 bits, which are computed first, follow computation steps 1, 2a, 3, 4a, and 5, while the low 4 or 8 bits follow the same computation steps with x_sgn set to 0. To compute unsigned input and weight data, x_sgn and w_sgn are set to 0, and the computation steps are otherwise the same. Since the register inside the accumulator is 32 bits, any combination in which the sum of the input and weight bit widths is less than 32 is acceptable.

Fig. 12. The area overhead and features comparison between the conventional scheme and our MPAA in 55nm CMOS technology.
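To tie these steps together, the following Python sketch is our own behavioral reconstruction of steps 1-5 for signed 8-bit inputs and weights (not the authors' RTL); the comments map each branch to the control signals above:

```python
def mpaa_vvm(xs, ws, M=8, N=4):
    """Behavioral sketch of the MPAA: signed M-bit inputs, signed 2N-bit
    weights split across two bank columns (high nibble, then low nibble)."""
    G = M - N - 1                            # guard bits: the 3-bit pre-shift
    his = [w >> N for w in ws]               # signed high nibbles (w_sgn = 1)
    los = [w & ((1 << N) - 1) for w in ws]   # unsigned low nibbles (w_sgn = 0)
    reg = 0
    for col in (his, los):                   # steps 1-2, then steps 3-4
        for k in reversed(range(M)):         # one input bit per cycle, MSB first
            part = sum(((x >> k) & 1) * w for x, w in zip(xs, col))
            if k == M - 1:                   # x_MSB = 1 with x_sgn = 1:
                part = -part                 # the MSB has negative weight
            if k == M - 1 and col is los:    # w_cycle = 1 on the column switch:
                reg = (reg >> G) + (part << G)   # >>3 replaces the usual <<1
            else:
                reg = (reg << 1) + (part << G)
    return reg >> G                          # step 5: final 3-bit right shift

xs = [13, -7, 100, -128]                     # signed 8-bit inputs
ws = [-42, 77, -128, 127]                    # signed 8-bit weights
assert mpaa_vvm(xs, ws) == sum(x * w for x, w in zip(xs, ws))
```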
Assuming a weight matrix size of 64 × 64 with 4-bit banks and 8-bit weight data support as our goal, we compare the MPAA with the MPCSC scheme. The MPCSC scheme can compute the 8-bit weight data with a matrix size of 32 × 64 in a single computation cycle; for a larger matrix size, such as 64 × 64, it requires an additional computation cycle. By adopting the MPAA, our RDCIM computes the 8-bit weight data with a matrix size of 64 × 64 in two computation cycles. Hence, when the matrix size exceeds 32 × 64, the MPAA scheme achieves the same throughput as the MPCSC scheme. Moreover, our scheme demonstrates greater bit-width flexibility than the MPCSC scheme. In particular, the MPCSC scheme limits the weight bit width by the quantity of SRAM arrays: to support a 16-bit width with the same parallelism, 4× SRAM arrays are required, and the input bit width is also limited by the MPCSC. By contrast, our scheme can support more weight bits simply by extending the columns of the SRAM array, and the bit width is only limited by the register in the accumulator: the weight and input bit widths can be 4/8/12/16 bits as long as their sum is less than 32 bits. Furthermore, by eliminating the MPCSC, the area overhead is reduced by 1.22× as shown in Fig. 12.

Fig. 13. The structure of SPBUF and its related circuits. (a) The circuit of the register with 36 transistors. (b) The processing steps of SPBUF. (c) The overall structure of SPBUF, including the SRAM cells and their peripheral circuits. (d) The design of the read-out circuit (ROC). (e) The circuit of the 8T SRAM cell and its line-crossing relation.
C. Serial-Parallel Conversion Supported SRAM Buffer (SPBUF)
The input buffer is a crucial component of DCIM: it receives input data and outputs serial bits. To support this function, the input buffer is typically made up of registers, which are easy to control and output steady bit-serial data. However, registers suffer from significant area overhead, as mentioned in challenge-2b. As shown in Fig. 13(a), a register is 4.5× larger than the 8T SRAM cell. Due to this area limitation, it is challenging to increase the storage capacity of the input buffer, resulting in a large hardware overhead in high-bandwidth application scenarios.
To address challenge-2b, we propose the SPBUF, as shown in Fig. 13(c). The SRAM buffer is composed of 64 columns, and each column stores 8 bits of input data. A write selector is placed at the bottom of the SRAM buffer to select a column to store the input data coming from the right side, and a bit selector is placed on the left side to select a specific bit of each input data. Read-out circuits are placed on top of the SRAM buffer to provide steady output data. The processing steps are shown in Fig. 13(b). To start with, the input data needs to be loaded into the SRAM buffer. Similar to a regular SRAM, the write address (write_addr) is given to the write selector, which selects a column to store the 8-bit input data. Then, the bit position (sel_bit) is given to the bit selector to select a specific bit to be computed, and the SPBUF outputs the selected bit of each input data.
To achieve the structure in Fig. 13(c), the 8T SRAM cell is customized as shown in Fig. 13(e). In our design, there are two groups of perpendicular control lines (RWL & WWL, RBL & WBL), so the write path and the read path are perpendicular. Hence, when the input data are stored in different columns, each row represents the same bit position. In the write stage, the WWL activates the SRAM cell, and the WBL passes the data into it. In the read stage, the RWL activates the read path after the RBL is pre-charged: if the data inside the cell is 0, the RBL remains high; otherwise, it discharges to 0. The output of the SRAM cell is thus the inverse of the data inside it. The 8T SRAM cell introduces a problem in that the RBLs need to be pre-charged, which results in unstable output.

To overcome this problem, a read-out circuit is designed as shown in Fig. 13(d). During the low level of the clock signal, the RBL is pre-charged; if the RBL is at a high level while the clock is at a low level, the signal t is locked. During the high level of the clock signal, the RBL obtains the output of the SRAM cell, which is the inverse of the data stored in the cell, and at the same time signal t becomes the inverse of the RBL. The inverter placed between the output and the signal t stabilizes the output data. From the point of view of the control signals, the bit position should be selected at the falling edge of the clock, and the output is obtained at the rising edge of the clock. The output stays stable for one clock cycle.
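Abstracting away the pre-charge timing, the serial-parallel conversion itself is easy to model. The sketch below (our own abstraction with hypothetical names) writes whole 8-bit words column by column and reads back one bit plane at a time, which is exactly the access pattern the bit-serial DCIM consumes:

```python
class SPBuf:
    """Behavioral model of the SPBUF: 64 columns, one 8-bit word each."""
    def __init__(self, cols=64, bits=8):
        self.bits = bits
        self.mem = [0] * cols

    def write(self, write_addr, word):       # parallel write of one input
        self.mem[write_addr] = word & ((1 << self.bits) - 1)

    def read_bit_plane(self, sel_bit):       # serial read: one bit per input
        return [(w >> sel_bit) & 1 for w in self.mem]

buf = SPBuf()
for col, word in enumerate((0b1011, 0b0110, 0b1111)):
    buf.write(col, word)
print(buf.read_bit_plane(sel_bit=0)[:3])     # LSBs of the three inputs: [1, 0, 1]
```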
We compared the area overhead of the register-based input buffer with our SPBUF in 55nm CMOS technology with a capacity of 64 × 8 bits. As shown in Fig. 14(a), the area overhead is reduced by 3.12× when adopting the SPBUF with the same capacity. To verify the correctness of the SPBUF, we simulated the circuit at 55nm and 100MHz; a partial waveform is shown in Fig. 14(b). We choose the high 2 bits of the first column for illustration and call the two SRAM cells cell A and cell B, respectively. In the first clock cycle, cell A and cell B receive the input data. At the falling edge of the second clock cycle, cell A is chosen with RWL[7] set to 1. At the rising edge of the third clock cycle, the read-out circuit outputs the data (Q) in cell A. Since the data in cell A is 0, the RBL does not need to discharge, so the output remains high. At the falling edge of the third clock cycle, cell B is chosen with RWL[6] set to 1. At the rising edge of the fourth clock cycle, the read-out circuit outputs the data (Q) in cell B. Since the data in cell B is 1, its RBL discharges, and the output changes to 0. At the falling edge of the fourth clock cycle, signal t is locked, so the output remains unchanged. The output of each data bit lasts one clock cycle, which matches our previous discussion.

Fig. 14. (a) The area overhead comparison between the register-based input buffer and our SPBUF in 55nm CMOS technology. (b) The waveform of the read-out process.
D. RISC-V Extended Instructions
To support more application scenarios and computation types, a CPU is usually connected to the DCIM to deal with control-intensive tasks. In previous works, the DCIM is typically connected to an ARM-architecture CPU through the AXI bus. However, since the instructions of DCIM are usually coarse-grained, the bus-connection scheme suffers from weak flexibility. Moreover, since the main features incorporated into RDCIM come at the cost of increased control difficulty, a novel fine-grained coupling manner between the CPU and the DCIM is needed.
To couple the DCIM with the CPU more tightly and support the three features, a RISC-V CPU is introduced into RDCIM. The overall architecture of the RISC-V CPU is shown in Fig. 15(a). This architecture supports the RISC-V extended instructions, which provide two additional parameters to control the DCIM flexibly. Moreover, the memory inside the DCIM is directly connected to the data local memory (DLM) in the RISC-V CPU to form an AXI-stream-like connection.

To support the RISC-V extended instructions, we made some modifications to the execution unit. It incorporates a judging module that distinguishes between standard and extended instructions. Standard instructions are executed by the arithmetic logic unit inside the RISC-V CPU, whereas extended instructions are forwarded to an external decoder that further decodes them. The external decoder also obtains the two register parameters; the extended instructions thus execute like the standard ones, but they operate on different hardware components.

Based on the proposed architecture, we designed five RISC-V extended instructions, as illustrated in Fig. 15(b). These instructions resemble the standard ones, carrying two 32-bit register parameters, namely register 1 and register 2, respectively. Working together, these instructions can fully utilize the DCIM functionality.
The cim_clr instruction is used to reset the DCIM state, including setting the accumulator register to 0 and clearing the data in the latches and the read-out circuit.
The data_trans_1 and data_trans_2 instructions are used to transfer the input data and the weight data from the DLM to the SPBUF and the SRAM array, respectively. In the program, the address of the input or weight data in the DLM should be specified in register 1. Moreover, for data_trans_2, the column of the SRAM array to receive the weight data should be specified in register 2 (write_column_sel). During the execution of data_trans_1, the SRAM buffer is filled with the input data from the DLM, and during the execution of data_trans_2, the selected column of the SRAM array is filled with the weight data from the DLM.
The run_cim instruction is used to perform the DCIM computation. To ensure correct computation, register 1 should specify the w_sgn, x_sgn, w_cycle, and bit_w signals, which are related to the AOMB, MAT, MPAA, and SPBUF. The w_sgn and x_sgn signals indicate the sign of the weight and input data in the AOMB, MAT, and MPAA, where 1 means signed and 0 means unsigned. The bit_w signal indicates the bit width of the input data in the SPBUF, where 1 means 8 bits and 0 means 4 bits. The w_cycle signal indicates the need for right-shifting the register data in the MPAA, where 1 means necessary and 0 means unnecessary. Since the DCIM only supports the computation of one column of each SRAM array at a time, register 2 (read_column_sel) should specify the column to be computed.
The read_out instruction is used to transfer the DCIM results to the DLM. Register 1 should specify the address in the DLM that receives the results.
Fig. 15. (a) The hardware foundations of the RISC-V extended instructions. (b) The description of the extended instructions. (IFU: instruction fetch unit. EXU: execution unit. LSU: load-store unit. WB: write back. BIU: bus interface unit.)

There are three main categories of matrix computation using RDCIM, and the following examples illustrate each of them.

If the input and weight data are both 4-bit signed data, the VVM execution follows these steps:
1) cim_clr: Initialize the DCIM state;
2) data_trans_2: Store the weight data in the selected column of the SRAM array. Register 1 specifies the address of the weight data and register 2 specifies the column to be written;
3) data_trans_1: Store the input data into the SRAM buffer. Register 1 specifies the address of the input data;
4) run_cim: Start the computation with the selected column of the SRAM array. The w_sgn and x_sgn signals should be set to 1, since the input data and the weight data are signed, and w_cycle and bit_w should be set to 0. Register 2 specifies the column to be computed;
5) read_out: Load the results into the DLM. Register 1 specifies the address that receives the results.
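In software, this five-instruction sequence maps to a straightforward driver. The sketch below is a Python pseudo-driver of our own: the instruction names follow Fig. 15(b), while the argument packing (a dict standing in for the bit fields of register 1) and the addresses are assumptions made for illustration:

```python
def vvm_4bit_signed(dcim, weight_addr, input_addr, result_addr, column=0):
    """Issue the five extended instructions for a 4-bit signed VVM."""
    dcim.cim_clr()                                    # reset the DCIM state
    dcim.data_trans_2(reg1=weight_addr, reg2=column)  # weights -> SRAM array
    dcim.data_trans_1(reg1=input_addr)                # inputs  -> SPBUF
    dcim.run_cim(reg1=dict(w_sgn=1, x_sgn=1, w_cycle=0, bit_w=0),
                 reg2=column)                         # compute one column
    dcim.read_out(reg1=result_addr)                   # results -> DLM

class MockDCIM:
    """Stand-in for the macro; just records the issued instructions."""
    def __getattr__(self, name):
        return lambda **kw: print(name, kw)

vvm_4bit_signed(MockDCIM(), weight_addr=0x2000, input_addr=0x1000,
                result_addr=0x3000)
```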
Moreover, the instructions can be combined to support multi-precision computation. Suppose the weight data is signed 12-bit and the input data is signed 8-bit. The input data can be stored in the SRAM buffer entirely, but the weight data needs to be stored in three different columns of the SRAM array. The computation can be simplified as follows:
1) cim_clr: Initialize the DCIM state;
2) data_trans_2: Repeat three times to store the 12-bit weight data into three different columns of the SRAM array;
3) data_trans_1: Store the input data into the SRAM buffer;
4) run_cim: Start the computation with the column of the high 4 bits selected. The w_sgn, x_sgn, and bit_w signals should be set to 1, and w_cycle should be set to 0;
5) run_cim: Start the computation with the column of the middle 4 bits selected. The w_cycle, x_sgn, and bit_w signals should be set to 1, and w_sgn should be set to 0;
6) run_cim: Start the computation with the column of the low 4 bits selected. The w_cycle, x_sgn, and bit_w signals should be set to 1, and w_sgn should be set to 0;
7) read_out: Load the results into the DLM.
Suppose the input data is signed 12-bit and the weight data is signed 4-bit. The computation can be simplified to the following steps:
1) cim_clr: Initialize the DCIM state;
2) data_trans_2: Store the weight data into the selected column of the SRAM array;
3) data_trans_1: Store the high 8 bits of the input data into the SRAM buffer;
4) run_cim: Start the computation with the selected column. The w_sgn, x_sgn, and bit_w signals should be set to 1, and w_cycle should be set to 0;
5) data_trans_1: Store the low 4 bits of the input data into the SRAM buffer;
6) run_cim: Start the computation with the selected column. The w_sgn signal should be set to 1, and x_sgn, bit_w, and w_cycle should be set to 0;
7) read_out: Load the results into the DLM.
For the computation of unsigned data, x_sgn and w_sgn should be set to 0.

TABLE I
CHIP SPECIFICATION OF OUR DESIGNED RDCIM MACRO
IV. EXPERIMENTAL RESULTS
A 64KB macro in 55nm CMOS technology was fabricated to evaluate the principal innovations of RDCIM. The detailed technical specifications of the RDCIM macro are shown in Table I. The RDCIM macro operates at a nominal voltage of 1.2V and a clock frequency of 200MHz, with a total area of 9.8mm². It contains a 2.8mm² DCIM macro with a 64KB SRAM array inside. The RDCIM macro achieves a peak energy efficiency of 66.3 TOPS/W with an area efficiency of 0.288 TOPS/mm² at 4-bit precision, and 16.6 TOPS/W with an area efficiency of 0.072 TOPS/mm² at 8-bit precision. Multi-precision computation covering 4/8/12/16 bits of weight and input data can be conducted in RDCIM to support various application scenarios.

To support the innovative circuits mentioned before, a RISC-V CPU with 64KB instruction and data caches, expandable to 4MB, is incorporated. This RISC-V CPU has a 3-stage pipeline structure that can execute RISC-V instructions seamlessly.