
RDCIM: RISC-V Supported Full-Digital Computing-in-Memory Processor With High Energy Efficiency and Low Area Overhead

Wente Yi, Graduate Student Member, IEEE, Kefan Mo, Wenjia Wang, Yitong Zhou, Yejun Zeng, Zihan Yuan, Bojun Cheng, Member, IEEE, and Biao Pan, Member, IEEE

Manuscript received 14 September 2023; revised in November 2023 and on 13 December 2023; accepted in January 2024. This work was supported in part by the National Key Research and Development Program of China under Grant 2021YFB3601304 and Grant 2021YFB3601300, in part by the National Natural Science Foundation of China under Grant 62001019, in part by the Laboratory Open Fund of Beijing Smart-Chip Microelectronics Technology Company Ltd., in part by the Fundamental Research Funds for the Central Universities, in part by the Key Research and Development Program of Anhui Province under Grant 2022a05020018, and in part by the Joint Laboratory Fund of Beihang University and SynSense. This article was recommended by Associate Editor W. Liu. (Corresponding authors: Biao Pan; Bojun Cheng.)

Wente Yi, Kefan Mo, Wenjia Wang, Yitong Zhou, Yejun Zeng, Zihan Yuan, and Biao Pan are with the School of Integrated Circuit Science and Engineering, Beihang University, Beijing 100191, China (e-mail: panbiao@buaa.edu.cn). Bojun Cheng is with the Microelectronics Thrust, Function Hub, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou 510000, China (e-mail: bocheng@ust.hk). Digital Object Identifier 10.1109/TCSI.2024.3350664.

Abstract: Digital computing-in-memory (DCIM), which merges computing logic into memory, has been proven to be an efficient architecture for accelerating multiply-and-accumulates (MACs). However, low energy efficiency and high area overhead pose a primary restriction on integrating DCIM into the re-configurable processors required for multi-functional workloads. To alleviate this dilemma, a novel RISC-V supported full-digital computing-in-memory processor (RDCIM) is designed and fabricated in 55nm CMOS technology. In RDCIM, an adding-on-memory-boundary (AOMB) scheme is adopted to improve the energy efficiency of DCIM. Meanwhile, a multi-precision adaptive accumulator (MPAA) and a serial-parallel conversion supported SRAM buffer (SPBUF) are employed to reduce the area overhead caused by the peripheral circuits and the intermediate buffer for multi-precision support. The results show that the energy efficiency of our design is 16.6 TOPS/W (8-bit) and 66.3 TOPS/W (4-bit). Compared to related works, the proposed RDCIM macro shows a maximum energy efficiency improvement of 1.22× in a continuous computing scenario, an area saving of 1.22× in the accumulator, and an area saving of 3.12× in the input buffer. Moreover, fine-grained RISC-V extended instructions are designed in RDCIM to dynamically adjust the state of DCIM, reaching 1.2× computation efficiency.

Index Terms: Computing-in-memory, RISC-V, extended instructions, re-configurable precision.

I. INTRODUCTION

In recent years, deep neural networks (DNNs) have been widely applied in various fields, including image classification, voice detection, language processing, etc. [1], [2], [3]. However, with the fast growth of network scale, computer systems are confronted with emerging issues relating to data-centric computation, as massive multiply-and-accumulate (MAC) operations exist in DNNs. For example, ResNet-50, one of the representative DNN models, requires 3.9G MACs for the inference of a single image [4]. To meet the high-parallelism requirements of MAC operations, a variety of AI accelerators have been proposed [5], [6], [7], [8] to coordinate with conventional processing units (CPUs or GPUs). Despite the optimized data flow in these accelerators, the memory-wall bottleneck caused by the separation of computing and memory units still exists, resulting in huge power consumption and extra area overhead [9]. In addition, to serve diverse application scenarios, multi-precision computation with quick configuration must be supported within a single DNN accelerator.

In the past ten years, computing-in-memory (CIM), which processes and stores data at the same location, has been proven to be a promising approach to reduce the data movement needed for high-throughput MAC operations [10], [11], [12], [13], [14]. Instead of executing the computation tasks merely in arithmetic logic units, CIM allocates partial tasks to memory units, which lowers the demand for data movement between arithmetic logic units and memory units. According to the paradigm of data encoding and processing, CIM is mainly divided into analog computing-in-memory (ACIM), which exploits Kirchhoff's law to execute computation, and digital computing-in-memory (DCIM), which uses fully digital circuits. To date, most CIM research has focused on ACIM for its high energy efficiency and throughput, at the cost of limited accuracy [11], [12], [13]. On the contrary, DCIM avoids the inaccuracies caused by process variations, data conversion overhead, and the poor scaling of analog circuits. However, these benefits come with new challenges for DCIM in the power and area aspects, raising requirements for optimizing DCIM both internally and externally:

Fig. 1. Challenges of DCIM design and the solutions of the RDCIM macro. (a) Power efficiency. (b), (c) Area overhead breakdown of typical DCIMs in previous works [14], [15], [16] under 55nm CMOS technology (1.2V, 100MHz, 300K).

A. Challenge-1 (Power)

According to our survey, the largest share of power consumption comes from the adder tree. Fig. 1(a) shows the dynamic power of typical DCIMs in previous works [14], [15], [16] under 55nm CMOS technology (1.2V, 100MHz, 300K). In the 4-bit situation, the adder tree accounts for 79% of the power, making it the main power source of DCIM computation. Moreover, as the bit width increases to 8 bits, this ratio grows to 82%, further unbalancing the power distribution.

B. Challenge-2a (Area)

Facing the increasing demand of multi-task application scenarios, supporting multi-precision computation has become a necessity for DCIM [22], [24]. As shown in Fig. 1(b), to support multi-precision computation, the DCIMs reported in [14], [15], and [16] tend to split high-precision data across different SRAM arrays and then sum the computation results of each array in additional circuits (shifters, adders, etc.) outside the DCIM macro. These additional multi-precision supported circuits increase the area overhead by 24% of the accumulator area.

C. Challenge-2b (Area)

Outside the DCIM macro, the input buffer changes the format of the input data to support bit-serial computation. As shown in Fig. 1(c), in previous works the input buffer is composed of registers to support data format conversion. Although registers are easy to control and can accomplish various functions, their area overhead is huge compared to SRAM cells: a single register is 6× larger than a 6T SRAM cell.

To tackle these challenges, we propose a RISC-V supported full-digital computing-in-memory processor (RDCIM) with high energy efficiency and low area overhead, supporting flexible precision re-configuration. The main contributions of this work can be summarized as:

1) For challenge-1, RDCIM uses the adding-on-memory-boundary (AOMB) scheme to reduce the power consumption of the adder tree. Unlike prior works that simplify multiplications to NOR operations before the adder tree, RDCIM inserts 26T mirror adders into the SRAM arrays to pre-process the weight data and uses multiplexers (MUXs) to implement the multiplications. The AOMB scheme, together with the MUX-embedded adder tree (MAT), significantly reduces the dynamic power.

2) For challenge-2a, the multi-precision adaptive accumulator (MPAA) is designed to realize multi-precision computation internally. The MPAA eliminates the need for additional multi-precision supported circuits, reducing the area overhead. Moreover, by setting the configuration signals, the computation precision (4/8/12/16 bits) can be flexibly adjusted.

3) For challenge-2b, a serial-parallel conversion supported SRAM buffer (SPBUF) is designed in RDCIM. Instead of using registers to store the input data, we use a novel 8T SRAM cell to build the input buffer. Compared to the conventional register-based buffer, the SPBUF is 3.12× smaller. Therefore, by adopting the SPBUF, the storage density of the buffer can be increased to store more input data.

4) To support the circuit-level optimizations above, extended instructions are employed in RDCIM by integrating the DCIM macro with a RISC-V CPU. The extended instructions include two extra parameters that enable adjusting the DCIM state dynamically, making them suitable for our proposed DCIM and achieving a 1.2× improvement in computation efficiency.

The remainder of the paper is organized as follows. Section II introduces DCIM fundamentals and recent related works. Section III presents the architecture of RDCIM, including AOMB+MAT, MPAA, SPBUF, and the RISC-V extended instructions. Section IV presents the experimental results. Finally, Section V concludes the paper.
II. BACKGROUND AND MOTIVATION

A. Overall Architecture of DCIM

Fig. 2(a) shows the overall architecture of DCIM, which contains two parts: the memory array and the digital computation circuit. The memory array is typically composed of SRAM cells and performs weight storage and MAC operations between input and weight data. As shown in Fig. 2(b), to reduce the area overhead, NOR gates are placed on the boundary of the SRAM array [16], so that multiple 8T SRAM cells in the same row share one NOR gate. Fig. 2(c) illustrates the components of the digital computation circuit: the adder tree and the accumulator. The adder tree is placed on the boundary of the SRAM array and sums all column outputs in parallel. The accumulator is placed on the boundary of the adder tree and accumulates the adder-tree outputs across clock cycles. Besides, an input buffer is needed to temporarily store the input data and output serial bits.

Fig. 2. The overall architecture of a typical DCIM macro and its peripheral circuits. (a) DCIM macro of SRAM arrays and digital computation circuit. (b) SRAM array with 10T cells. (c) Digital computation circuit of adder tree and accumulator.

Fig. 3. The workflow of DCIM. (a) Mathematical expression of the computation. (b) The summing-up process. (c) The accumulating process.

The basic function of DCIM is matrix-vector multiplication (MVM), where MAC operations compute the dot product of two vectors. Mathematically, MVM can be resolved into multiple vector-vector multiplications (VVMs). The computation steps of a VVM are illustrated in Fig. 3(a). In each clock cycle, one specific bit of each input datum is sent to the SRAM array to be multiplied with the weight data. The outputs of the SRAM array are then summed in the adder tree, as shown in Fig. 3(b), and the accumulator shifts the data stored in its register and adds the adder-tree output, as shown in Fig. 3(c). These operations repeat until the last bit of each input datum has been processed, after which the accumulator returns the VVM result [26].
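To make this bit-serial dataflow concrete, the following Python sketch models one VVM pass over unsigned inputs. It is a behavioral illustration of the shift-and-add loop in Fig. 3, not the hardware implementation; all names are ours.

```python
def bit_serial_vvm(inputs, weights, in_bits=8):
    """Bit-serial VVM: one input bit per cycle (MSB first); the adder
    tree's partial sum is merged by shift-and-add, as in Fig. 3."""
    acc = 0
    for k in range(in_bits - 1, -1, -1):
        # "SRAM array + adder tree": every weight is gated by one bit of
        # its input, and all columns are summed in parallel.
        partial = sum(w * ((x >> k) & 1) for x, w in zip(inputs, weights))
        # "Accumulator": shift the register left by 1 and add.
        acc = (acc << 1) + partial
    return acc

# Matches the plain dot product for unsigned data.
xs, ws = [3, 5, 7], [2, 4, 6]
assert bit_serial_vvm(xs, ws) == sum(x * w for x, w in zip(xs, ws))
```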
B. Related Works

In the past five years, as shown in Fig. 4(a), (b), and (c), researchers have done a great deal of work at the architecture and circuit levels around DCIM-based DNN accelerators.

Fig. 4. Related works of DCIM. (a) DCIM-based DNN processors. (b) The innovations in architecture. (c) The innovations in circuits.

As shown in Fig. 4(c), several prior works optimized the DCIM circuit to improve its throughput and functionality. Fujiwara et al. designed a DCIM macro in 5nm CMOS technology and integrated the adder tree into the SRAM array, achieving 254 TOPS/W for 4-bit and 63 TOPS/W for 8-bit computation [14]. Yue et al. combined DCIM with float MAC rules, enabling DCIM to support both integer and float computation [20]. H. Zhu et al. proposed a novel DCIM structure, computing-on-memory-boundary (COMB), which allows multiple columns of an SRAM array to share one digital computation circuit on the boundary of the array, achieving a 1152Kb array size and 0.25 TOPS peak performance (65nm CMOS technology) while balancing computation and memory resources [16].

As shown in Fig. 4(b), several prior works incorporated DCIM into DNN accelerators to leverage its high MVM computation efficiency and throughput. F. Tu et al. designed a pipeline/parallel reconfigurable accelerator with DCIM for transformer models [19]. The accelerator overcame the challenge of weight updating in DCIM and achieved 12.08× to 36.82× lower energy than previous CIM-based accelerators. Moreover, different DCIM chips can be connected through chiplet technology [21], which expands the capacity and improves the performance of the accelerator.

C. Motivation

Although much work has been done to improve the performance of CIM accelerators, most of it optimizes a single aspect, either circuits [14], [15], [16], [18], [20], [24], [28], [30], [33], [34] or architecture [19], [21], [25], [27], [29], [31], [32]; the two aspects are rarely combined. Motivated by the previous works, we address the three challenges above by proposing a novel DCIM solution that integrates circuit optimization and architecture design at both levels.

III. RDCIM ARCHITECTURE

The overall architecture of RDCIM is shown in Fig. 5, with three features highlighted in different colors. As depicted in Fig. 5, facing the three challenges, we introduce three key features (1, 2a, and 2b), one per challenge, to improve energy efficiency and reduce area overhead synergistically at both the circuit and architecture levels. Meanwhile, by introducing the RISC-V CPU, the architectural computation efficiency is expected to improve further.

Fig. 5. The overall architecture of RDCIM with three features: AOMB+MAT, MPAA, and SPBUF.

A. Adding-on-Memory-Boundary (AOMB)

The key component of MVM computation is the multiplication between input and weight data. In the typical DCIM scheme, the multiplications are performed by NOR gates inside the SRAM array and then summed by the adder tree. The total power consumption during MVM computation can be expressed by Eq. (1):

$$P_{\mathrm{sum1}} = P_{\mathrm{read}} + P_{\mathrm{NOR}} + \sum_{i=1}^{n} P_{\mathrm{adder\,tree}}[i] \tag{1}$$

Here, $P_{\mathrm{sum1}}$ represents the total power consumption, $P_{\mathrm{read}}$ the power consumed while reading the weight data, $P_{\mathrm{NOR}}$ the power consumed while executing the NOR operations, and $P_{\mathrm{adder\,tree}}[i]$ the power consumed by each stage of the adder tree ($n$ stages in total). The strategy for tackling challenge-1 is to analyze Eq. (1) and optimize $P_{\mathrm{NOR}} + \sum_{i=1}^{n} P_{\mathrm{adder\,tree}}[i]$.

Given the structure of DCIM, the weight data in the SRAM array should be relatively static to achieve better energy efficiency. To utilize this characteristic, we propose the AOMB scheme to pre-process the weight data and reduce the burden on the digital computation circuits, and we design a MAT that cooperates with the AOMB scheme to complete the computation.

Fig. 6 illustrates the structure of the AOMB scheme. In RDCIM, the SRAM array adopts a novel 8T SRAM cell, similar to the structure in Fig. 2(b) but with two different types that are described in detail later. The AOMB scheme decouples the SRAM cells from the NOR gates and places adders beside them. The weight data stored in the SRAM array are split into a_x and b_x (x represents the index of the data), and s_x is the sum of a_x and b_x. The output of AOMB is a_x[4:0], b_x[4:0], and s_x[4:0]. As shown in Fig. 6(a), adders are placed on the boundary of the SRAM array, each between two SRAM rows. The two rows of SRAM cells, together with the adder, form a basic computing unit. Each computing unit passes its carry to the next one and outputs the bits of a_x[k], b_x[k], and s_x[k] in inverted or uninverted form. Two 4-bit weight data are stored in an interleaved manner to meet the demand of the adder chain in Fig. 6(d).

Fig. 6. The design of AOMB and its related circuits. (a) The boundary of one SRAM array with an adder chain placed beside it. (b) The design of one computing unit in AOMB. (c) The mirror adder circuit with 26 transistors. (d) The adder chain with 26T full adders.

In the AOMB scheme, we adopt the 26T mirror adder instead of the 28T adder, as shown in Fig. 6(c). Its advantage is that it reduces the delay and the transistor count by removing the output inverters of the carry and alternating positive and negative logic. When the 26T mirror adder receives uninverted inputs (a_x[k], b_x[k], c_x[k-1]), it outputs an inverted carry (c_x[k]) and an uninverted sum (s_x[k]). As shown in Fig. 6(d), cascading these adders forms an adder chain. This adder chain differs from a regular one in that its sums and carries take alternant inverted-uninverted forms: for an input a_x[3:0] and b_x[3:0], the chain outputs the sum s_x[3:0] with the odd bit positions inverted. The input bits must therefore also be supplied in the alternant inverted-uninverted form. To obtain this input form, we make a specific optimization of the SRAM cell structure. Fig. 6(b) shows that the internal connection of the SRAM cells changes with the bit position of the weight data. For odd bit positions, point C (PC) connects to point B (PB), and the read bit line (RBL[m]) outputs an inverted bit after the read word line (RWL[0]) is turned on and RBL[m] is pre-charged. For even bit positions, PC connects to point A (PA) and an uninverted bit is output.

Furthermore, the sign control signal of the weight (w_sgn) is introduced to indicate whether the weight data are signed or unsigned, as shown in Fig. 6(a). To add signed and unsigned weight data, the most significant bit (MSB) of the data should be extended according to w_sgn. In the AOMB scheme, a NOR gate is utilized to extend the MSB of the weight data: if the weight data are signed, w_sgn is set to 1 and a_x[4] equals a_x[3]; if unsigned, w_sgn is set to 0 and a_x[4] equals 0.

However, the AOMB scheme cannot complete the MAC operations independently, so we designed the MAT to implement multiplication and accumulation. As shown in Fig. 7(a), the MAT is placed next to the SRAM array, receives its output, and sums the selected data. Unlike a regular adder tree, the first stage of MAT is composed of MUXs that perform the multiplication. Specifically, transmission-gate logic is adopted to build the MUXs, which reduces the number of transistors from the 12 of a static implementation and lowers the parasitic capacitance. Two bits of input data (i[l+1]i[l]) are grouped to select the output of the SRAM array. Fig. 7(b) illustrates the first MUX, where i[1]i[0] select the output of the first computing unit in the SRAM array: '00' means the MUX outputs 0, '01' means it outputs a_0[k], '10' means it outputs b_0[k], and '11' means it outputs s_0[k]. The logical expression of the MUX can be written as output = a_0[k] × i[0] + b_0[k] × i[1], which performs the multiplication indirectly. In the subsequent stages of MAT, the 26T mirror adders are used, and the alternant inverted-uninverted data format is passed on until the final stage of MAT. The sign-bit processing is similar to the structure in Fig. 6(a). To transform the output into normal form, the odd bit positions are inverted.

Fig. 7. The structure of MAT and its related circuits. (a) The components of MAT. (b) The MUX circuit.
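The sketch below illustrates, in Python, how this {0, a, b, s} selection replaces one NOR stage plus the first adder-tree stage: two weights a and b share one MUX, and in each cycle one bit of each of their inputs picks 0, a, b, or s = a + b. This is our behavioral reading of Figs. 6 and 7, not the paper's RTL; all names are ours.

```python
def aomb_mat_dot2(a, b, xa, xb, in_bits=8):
    """Two weights a, b stored with their precomputed sum s = a + b
    (the AOMB adder chain); per cycle, bits i[0] of xa and i[1] of xb
    select 0, a, b, or s, merging the first adder-tree stage into the
    MUX: output = a*i[0] + b*i[1]."""
    s = a + b                              # computed once, on the memory boundary
    acc = 0
    for k in range(in_bits - 1, -1, -1):   # bit-serial, MSB first (unsigned here)
        i0 = (xa >> k) & 1
        i1 = (xb >> k) & 1
        partial = [0, a, b, s][(i1 << 1) | i0]
        acc = (acc << 1) + partial         # accumulator shift-and-add
    return acc

assert aomb_mat_dot2(3, 5, 10, 6, 4) == 3 * 10 + 5 * 6
```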
By pre-processing the weight data, the power consumption during computation can be expressed as shown in Eq. (2):

$$P_{\mathrm{sum2}} = P_{\mathrm{read}} + P_{\mathrm{adder}} \times freq_w + P_{\mathrm{MAT}} \tag{2a}$$

$$P_{\mathrm{MAT}} = P_{\mathrm{MUX}} + \sum_{i=2}^{n} P_{\mathrm{adder\,tree}}[i] \tag{2b}$$

Here, $P_{\mathrm{sum2}}$ represents the total power consumption, $P_{\mathrm{read}}$ the power consumed while reading the weight data, $P_{\mathrm{adder}}$ the power consumed by the adders next to the SRAM array, $freq_w$ the frequency of weight updating, and $P_{\mathrm{MAT}}$ the power consumed by the MAT. Inside the MAT, the power consists of two parts: the MUXs ($P_{\mathrm{MUX}}$) and the remaining adder-tree stages ($\sum_{i=2}^{n} P_{\mathrm{adder\,tree}}[i]$). The total power saving can be expressed as $P_{\mathrm{sum1}} - P_{\mathrm{sum2}}$, as shown in Eq. (3):

$$P_{\mathrm{sum1}} - P_{\mathrm{sum2}} = P_{\mathrm{NOR}} + P_{\mathrm{adder\,tree}}[1] - P_{\mathrm{adder}} \times freq_w - P_{\mathrm{MUX}} \tag{3}$$

Since the computation of DCIM is data-centric, DNN mapping typically focuses on weight reuse, which reduces $freq_w$. As described in Eq. (3), the smaller the $freq_w$, the greater the power saving. In this case, the AOMB scheme reduces the dynamic power of the first stage of the conventional adder tree and collaborates with the MAT to lower the overall power consumption. Fig. 8(a) compares the AOMB scheme ($P_{\mathrm{adder}} \times freq_w + P_{\mathrm{MUX}}$) with the non-AOMB scheme ($P_{\mathrm{NOR}} + P_{\mathrm{adder\,tree}}[1]$) at different input and weight toggle rates. We evaluated them in the post-simulation environment at 55nm, 100MHz, and 300K. The baseline is the conventional scheme at a 0.1 toggle rate, with weight data updated every 8 cycles ($freq_w = 1/8$). The AOMB scheme achieves up to 1.48× power saving at a 0.1 toggle rate. As the toggle rate increases, the AOMB scheme saves even more power, finally reaching 1.79× at a 0.5 toggle rate. We attribute this increase to $freq_w$, which attenuates the effect of the toggle rate on power consumption. Fig. 8(b) shows the power comparison between the AOMB+MAT scheme ($P_{\mathrm{sum2}}$) and the non-AOMB & non-MAT scheme ($P_{\mathrm{sum1}}$) at different input and weight toggle rates in the same evaluation environment. With the conventional scheme at a 0.1 toggle rate as the baseline, the AOMB+MAT scheme achieves up to 1.22× power saving over the non-AOMB & non-MAT scheme at a 0.5 toggle rate.

Fig. 8. The comparison of power consumption between the conventional scheme and our scheme in the post-simulation environment (55nm, 100MHz, 300K). (a) The comparison between the AOMB scheme and the non-AOMB scheme. (b) The comparison between the AOMB+MAT scheme and the non-AOMB & non-MAT scheme.
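The role of $freq_w$ in Eq. (3) can be checked with a few lines of arithmetic. The four power terms below are unit-less placeholders chosen by us for illustration, not measured values from the paper.

```python
# Illustrative evaluation of Eq. (3): the AOMB saving grows as the
# weight-update frequency freq_w shrinks. All four constants are
# made-up placeholders, not silicon measurements.
P_NOR, P_TREE1, P_ADDER, P_MUX = 1.0, 4.0, 1.5, 0.6

def aomb_saving(freq_w):
    # Eq. (3): P_sum1 - P_sum2
    return P_NOR + P_TREE1 - P_ADDER * freq_w - P_MUX

for fw in (1.0, 1/2, 1/8, 1/64):
    print(f"freq_w = {fw:>7.4f} -> saving = {aomb_saving(fw):.3f}")
# The more static the weights (small freq_w), the better the boundary
# adders' pre-processing cost is amortized.
```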
B. Multi-Precision Adaptive Accumulator (MPAA)

To support multi-precision computation, various methods have been proposed in prior works [14]. As shown in Fig. 9(a), if the basic bank is 4 bits, 8-bit weight data should be split into two different SRAM arrays, one for the high bits and the other for the low bits. Both SRAM arrays perform the VVM computation in parallel, and their results are summed by multi-precision computation-supported circuits (MPCSC) outside DCIM. In the MPCSC, the DCIM result of the high bits is left-shifted by 4 bits and then added to the result of the low bits. In this way, the result for 8-bit data is obtained in one computation cycle. Similarly, 16-bit weight data can be computed by dividing it into four SRAM arrays and summing their outputs within the MPCSC: the 16-bit weight data is first split into two 8-bit data sets, each of which repeats the computation step above in stage 1 of the MPCSC; then the MPCSC left-shifts the high 8-bit result by 8 bits and adds it to the low 8-bit result in stage 2. However, this method supports multi-precision computation at the cost of additional circuits, which take a quarter of the accumulator area, as described in challenge-2a. Moreover, more precision levels need more shifters and adders in the MPCSC, forming a tree structure that increases the area overhead. Meanwhile, selecting the results of different precision levels is challenging because they come from different stages of the MPCSC.

Fig. 9. The comparison between the conventional multi-precision supported scheme and our MPAA scheme.

To address challenge-2a, we propose the MPAA, which supports multi-precision computation inside DCIM. As shown in Fig. 9(b), to cooperate with the MPAA, the SRAM array contains multiple columns, and a selector chooses the column to perform the computation. The columns of the SRAM array share the same adder tree and MPAA. Unlike the traditional scheme, if the weight data is 8 bits and the basic bank is 4 bits, the high and low bits are placed in the first and second columns of the SRAM array, respectively. They share a common digital computation circuit and execute the VVM in series: first, the DCIM macro finishes the computation of the high bits and stores the result in the accumulator register; then, without clearing the previous result, it finishes the computation of the low bits and stores the final result in the same register.

Fig. 10(a) shows the structure and the computation steps of MPAA. To support both signed and unsigned computation, the MPAA is configured by the MSB indicator signal (x_MSB) and the sign indicator signals of the input data (x_sgn) and weight data (w_sgn). For unsigned computation, x_MSB, x_sgn, and w_sgn are set to 0, meaning the accumulator input (data_acc) needs no processing before being sent to the register. For signed computation, x_MSB, x_sgn, and w_sgn cooperate: after receiving data_acc, the MPAA first extends the sign bit based on w_sgn; then it judges from x_MSB whether the current input bits are the MSB. If so, the sign-extended data (D_sgn) undergo the operation ~D_sgn + 1'b1; otherwise, D_sgn remains unchanged. Since the data in the register may need to be right-shifted by 1 bit, a left shift of 1 bit is performed before the adder to avoid accuracy loss. After the register, two optional shifters, controlled by the cycle control signal (w_cycle), solve the problem of continuous computation between different columns of banks.

It is noteworthy that the shift amount inside MPAA depends on the bit widths of both the input and the weight data. Assume that the input data is M bits, the basic bank of the SRAM array is N bits, and the weight data is 2N bits. After computing the high N bits of the weight data, the data stored in the accumulator register must be right-shifted by M-N bits before computing the low N bits. Otherwise, the result of the high N bits would be left-shifted by M bits during the computation of the low N bits, whereas it actually needs to be left-shifted by only N bits. To realize this function (accounting for the built-in 1-bit left shift), the data in the register are right-shifted by M-N-1 bits during the first computation cycle of the low N bits. To support higher precision levels of weight data (e.g., 3N, 4N, and so on), the data stored in the same register only need to be right-shifted by M-N-1 bits each time at the first computation cycle of the next lower N bits.

Fig. 10. The structure and the computation steps of MPAA. (a) The structure of MPAA with sign bit processing circuit, MSB processing circuit, and reconfigurable shifter. (b)-(f) The computation steps of MPAA with 8-bit signed input data and weight data.

As shown in Fig. 10(b) to (f), based on the structure above, the computation steps for 8-bit signed weight and input data are as follows:

1) Step 1: The input bit is the MSB, and the high 4 bits of the weight data are computed first. In this step, x_MSB, x_sgn, and w_sgn are set to 1 and w_cycle is set to 0, so the input undergoes the sign-bit and MSB pre-processing before being sent to the register.
2) Step 2: The input bit is not the MSB, so x_MSB is set to 0 and the remaining signals are unchanged. Step 2 lasts 7 cycles; in every cycle, the register data are left-shifted by 1 bit and summed with the sign-extended input.
3) Step 3: To compute the low 4 bits of the weight data, w_sgn is set to 0. The input bit is the MSB, so x_MSB is set to 1. Since computing the low bits requires switching the bank to the second column, w_cycle is set to 1, and the data in the register are right-shifted by 3 bits.
4) Step 4: This step is similar to step 2, with w_sgn set to 0.
5) Step 5: w_cycle is set to 1, and the output is the right-shifted register data, which is 32 bits.
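As a cross-check of this serial multi-precision scheme, the behavioral Python sketch below computes a signed 8-bit VVM with two 4-bit weight columns sharing one accumulator. It folds the inter-column register shifts into one net left shift per column (Horner form), so it is arithmetic-equivalent rather than cycle-exact; all names are ours.

```python
def bitserial_col(xs, ws, x_bits=8):
    """One bank column: bit-serial signed inputs, shift-and-add partials
    (the shared adder tree plus accumulator of Fig. 9(b))."""
    acc = 0
    for k in range(x_bits - 1, -1, -1):
        partial = sum(w * ((x >> k) & 1) for x, w in zip(xs, ws))
        if k == x_bits - 1:
            partial = -partial          # two's-complement MSB weighs negatively
        acc = (acc << 1) + partial
    return acc

def mpaa_serial(xs, ws, bank=4, w_bits=8, x_bits=8):
    """Weight columns computed in series into one running register;
    only the top slice keeps its sign, lower slices are unsigned."""
    acc = 0
    for col in range(w_bits // bank):
        shift = w_bits - bank * (col + 1)
        top = (col == 0)
        sl = [(w >> shift) if top else (w >> shift) & ((1 << bank) - 1)
              for w in ws]
        # Net effect of the register re-alignment between columns:
        # previous columns gain weight 2**bank relative to the new one.
        acc = (acc << bank) + bitserial_col(xs, sl, x_bits)
    return acc

xs, ws = [3, -7, 12, -1], [45, -96, 127, -128]
assert mpaa_serial(xs, ws) == sum(x * w for x, w in zip(xs, ws))
```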
Furthermore, as illustrated in Fig. 11, the MPAA supports more precision levels by combining these steps. To support 4-bit input data, steps 2 and 4 are configured into two types: one lasting 7 cycles (2a and 4a) and the other lasting 3 cycles (2b and 4b). When the weight data is 4 bits and the input data is 8 bits, only steps 1, 2a, and 5 are executed. When the weight data is 12 bits or 16 bits, steps 3 and 4a are repeated several times after steps 1 and 2a, and then step 5 returns the output. When the input data is 4 bits, steps 2a (4a) are replaced by 2b (4b), respectively, and the other steps are unchanged. When the input data is 12 bits or 16 bits and the input buffer only supports 8 bits of input data, the input data are loaded twice. The high 8 bits of the input data, which are computed first, follow computation steps 1, 2a, 3, 4a, and 5. The low 4 or 8 bits of the input data follow the same computation steps with x_sgn set to 0. To compute unsigned input data and weight data, x_sgn and w_sgn are set to 0, and the computation steps are the same. Since the register inside the accumulator is 32 bits, any combination in which the bit width of the input data plus that of the weight data is less than 32 is acceptable.

Fig. 11. Computation steps for 4/8/12/16-bit input data and weight data. Different gray levels represent different steps. In the computation of each precision, the steps are executed from left to right.

Assuming a weight matrix size of 64 × 64 with 4-bit banks and 8-bit weight data support as our goal, we compare the MPAA with the MPCSC scheme. The MPCSC scheme can compute the 8-bit weight data with a matrix size of 32 × 64 in a single computation cycle; a larger matrix size, such as 64 × 64, requires an additional computation cycle. By adopting the MPAA, our RDCIM computes the 8-bit weight data with a matrix size of 64 × 64 in two computation cycles. When the matrix size exceeds 32 × 64, the MPAA scheme achieves the same throughput as the MPCSC scheme. However, our scheme demonstrates greater bit-width flexibility than the MPCSC scheme. In particular, the MPCSC scheme limits the weight bit width by the number of SRAM arrays: to support a 16-bit width with the same parallelism, 4× SRAM arrays are required, and the input bit width is likewise limited by the MPCSC. By contrast, our scheme can support more weight bits by extending the columns of the SRAM array, and the bit width is only limited by the register in the accumulator: the weight and input bit widths can be 4/8/12/16 bits as long as their sum is less than 32 bits. Moreover, by eliminating the MPCSC, the accumulator area overhead is reduced by 1.22×, as shown in Fig. 12.

Fig. 12. The area overhead and features comparison between the conventional scheme and our MPAA in 55nm CMOS technology.

C. Serial-Parallel Conversion Supported SRAM Buffer (SPBUF)

The input buffer is a crucial component of the DCIM: it receives input data and outputs serial bits. To support this function, the input buffer is typically made up of registers, which are easy to control and output steady bit-serial data. However, registers suffer from significant area overhead, as mentioned in challenge-2b. As shown in Fig. 13(a), a register is 4.5× larger than the 8T SRAM cell. Due to this area limitation, it is challenging to increase the storage capacity of the input buffer, resulting in a large hardware overhead in high-bandwidth application scenarios.

To address challenge-2b, we propose the SPBUF, shown in Fig. 13(c). The SRAM buffer is composed of 64 columns, and each column holds 8 bits of input data. A write selector is placed at the bottom of the SRAM buffer to select a column to store the input data arriving from its right side, and a bit selector on the left side selects a specific bit of each input datum. Read-out circuits are placed on top of the SRAM buffer to provide steady output data.
Fig. 13. The structure of SPBUF and its related circuits. (a) The circuit of the register with 36 transistors. (b) The processing steps of SPBUF. (c) The overall structure of SPBUF, including the SRAM cells and their peripheral circuits. (d) The design of the read-out circuit (ROC). (e) The circuit of the 8T SRAM cell and its line-crossing relation.

The processing steps are shown in Fig. 13(b). To start with, the input data are loaded into the SRAM buffer: as in a regular SRAM, the write address (write_addr) is given to the write selector, which selects a column to store the 8-bit input data. Then, the bit position (sel_bit) is given to the bit selector to select the specific bit to be computed, and the SPBUF outputs the selected bit of each input datum.

To achieve the structure in Fig. 13(c), the 8T SRAM cell is customized as shown in Fig. 13(e). In our design, there are two groups of perpendicular control lines (RWL&WWL, RBL&WBL), so the write data and the read data are perpendicular: when the input data are stored in different columns, each row represents the same bit position. In the write stage, the WWL activates the SRAM cell and the WBL passes the data into it. In the read stage, the RWL activates the read path and the RBL is pre-charged: if the data inside the cell is 0, the RBL remains high; otherwise, it discharges to 0. The output of the SRAM cell is thus the inverse of the data inside it.

The 8T SRAM cell introduces a problem: the RBLs need to be pre-charged, which results in unstable output. To overcome this problem, a read-out circuit is designed as shown in Fig. 13(d). During the low level of the clock signal, the RBL is pre-charged; if the RBL is at a high level while the clock is low, the signal t is locked. During the high level of the clock signal, the RBL carries the output of the SRAM cell, which is the inverse of the stored data, and the signal t takes the inverse of the RBL. The inverter placed between the output and the signal t stabilizes the output data. From the control-signal point of view, the bit position should be selected at the falling edge of the clock, and the output is obtained at the rising edge of the clock; the output then stays stable for one clock cycle.

We compared the area overhead of the register-based input buffer with our SPBUF in 55nm CMOS technology at a capacity of 64 × 8 bits. As shown in Fig. 14(a), the area overhead is reduced by 3.12× when adopting the SPBUF with the same capacity.
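Functionally, the SPBUF behaves like a small transposed memory: words are written column-wise and read out bit-plane-wise. The Python sketch below captures that behavior; it is a behavioral model with our own names, not the circuit itself.

```python
class SPBuf:
    """Behavioral model of the SPBUF (Fig. 13): 64 columns x 8 bits,
    written one word per column, read one bit position of all columns
    at a time (serial-parallel conversion)."""
    def __init__(self, cols=64, bits=8):
        self.bits = bits
        self.mem = [0] * cols

    def write(self, write_addr, data):
        """Write selector: store one 8-bit input word into a column."""
        self.mem[write_addr] = data & ((1 << self.bits) - 1)

    def read_bit(self, sel_bit):
        """Bit selector + read-out circuits: bit 'sel_bit' of every
        stored input, one bit per column."""
        return [(w >> sel_bit) & 1 for w in self.mem]

buf = SPBuf()
buf.write(0, 0b1010_0001)
buf.write(1, 0b0110_1100)
print(buf.read_bit(7)[:2])   # MSBs of the first two inputs -> [1, 0]
```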
To verify the correctness of the SPBUF, we simulated the circuit at 55nm and 100MHz; the partial waveform is shown in Fig. 14(b). We take the two high-bit cells of the first column, called cell A and cell B, as an illustration. In the first clock cycle, cell A and cell B receive the input data. At the falling edge of the second clock cycle, cell A is chosen with RWL[7] set to 1. At the rising edge of the third clock cycle, the read-out circuit outputs the data (Q) in cell A: since the data in cell A is 0, the RBL does not need to discharge, so the output remains high. At the falling edge of the third clock cycle, cell B is chosen with RWL[6] set to 1. At the rising edge of the fourth clock cycle, the read-out circuit outputs the data (Q) in cell B: since the data in cell B is 1, its RBL discharges and the output changes to 0. At the falling edge of the fourth clock cycle, signal t is locked, so the output remains unchanged. The output of each datum lasts one clock cycle, which matches the previous discussion.

Fig. 14. (a) The area overhead comparison between the register-based input buffer and our SPBUF in 55nm CMOS technology. (b) The waveform of the read-out process.

D. RISC-V Extended Instructions

To support more application scenarios and computation types, a CPU is usually connected to the DCIM to deal with control-intensive tasks. In previous works, the DCIM is typically connected to an ARM-architecture CPU over the AXI bus. However, since the instructions of DCIM are usually coarse-grained, the bus-connection scheme suffers from weak flexibility. Moreover, since the main features incorporated into RDCIM come at the cost of increased control difficulty, a tighter, fine-grained coupling between the CPU and the DCIM is needed.

To couple the DCIM with the CPU more tightly and to support the three features, a RISC-V CPU is introduced into RDCIM. The overall architecture of the RISC-V CPU is shown in Fig. 15(a). This architecture supports the RISC-V extended instructions, which provide two additional parameters to control the DCIM flexibly. Moreover, the memory inside the DCIM is directly connected to the data local memory (DLM) of the RISC-V CPU, forming an AXI-stream-like connection. To support the RISC-V extended instructions, we modified the execution unit: it incorporates a judging module that distinguishes between standard and extended instructions. Standard instructions are executed by the arithmetic logic unit inside the RISC-V CPU, whereas extended instructions are forwarded to an external decoder that further decodes them. The external decoder also obtains the register data required by the instruction from the register file. If the instruction is legitimate, the controller uses the register data and the instruction to control the DCIM state. The SPBUF and the SRAM array exchange data with the DLM under the control of the controller. Upon completion of an instruction, the response module emits a finish signal. In essence, the extended instructions function like the standard ones, but they operate on different hardware components.

Fig. 15. (a) The hardware foundations of the RISC-V extended instructions. (b) The description of the extended instructions. (IFU: instruction fetch unit; EXU: execution unit; LSU: load/store unit; WB: write back; BIU: bus interface unit.)
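The dispatch path can be pictured as a small routing function in front of the ALU. The sketch below is our schematic reading of Fig. 15(a); the opcode value (a RISC-V custom opcode slot) and the decode details are assumptions, not RDCIM's actual encoding.

```python
# Sketch of the execution-unit dispatch in Fig. 15(a). The opcode value
# and field layout are illustrative assumptions, not RDCIM's encoding.
CUSTOM0 = 0x0B                       # a RISC-V "custom-0" opcode slot

def execute(instr, regfile, alu, cim_decoder):
    opcode = instr & 0x7F
    rs1 = (instr >> 15) & 0x1F       # the two register parameters
    rs2 = (instr >> 20) & 0x1F
    if opcode != CUSTOM0:
        return alu(instr)            # standard instruction: normal pipeline
    # Extended instruction: forward it, with the operands fetched from
    # the register file, to the external decoder/controller that drives
    # the DCIM state and raises a finish signal when done.
    return cim_decoder(instr, regfile[rs1], regfile[rs2])

regs = [0] * 32
regs[5], regs[6] = 0x100, 3
instr = CUSTOM0 | (5 << 15) | (6 << 20)
print(execute(instr, regs, lambda i: "alu", lambda i, a, b: ("dcim", hex(a), b)))
```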
Based on the proposed architecture, we designed five RISC-V extended instructions, as illustrated in Fig. 15(b). These instructions resemble the standard ones and carry two 32-bit register parameters, namely register 1 and register 2. Working together, they can fully utilize the DCIM functionality. The cim_clr instruction resets the DCIM state, setting the accumulator register to 0 and clearing the data in the latches and the read-out circuit. The data_trans_1 and data_trans_2 instructions transfer the input data and the weight data from the DLM to the SPBUF and the SRAM array, respectively. In the program, the DLM address of the input or weight data should be specified in register 1; additionally, for data_trans_2, the column of the SRAM array to receive the weight data should be specified in register 2 (write_column_sel). During the execution of data_trans_1, the SRAM buffer is filled with the input data from the DLM, and during the execution of data_trans_2, the selected column of the SRAM array is filled with the weight data from the DLM. The run_cim instruction performs the DCIM computation. To ensure correct computation, register 1 should specify the w_sgn, x_sgn, w_cycle, and bit_w signals, which are related to the AOMB, MAT, MPAA, and SPBUF. The w_sgn and x_sgn signals indicate the sign of the weight and input data in the AOMB, MAT, and MPAA, where 1 means signed and 0 means unsigned. The bit_w signal indicates the bit width of the input data in the SPBUF, where 1 means 8 bits and 0 means 4 bits. The w_cycle signal indicates whether the register data in the MPAA need right shifting, where 1 means necessary and 0 means unnecessary. Since the DCIM supports the computation of only one column of each SRAM array at a time, register 2 (read_column_sel) should specify the column to be computed. The read_out instruction transfers the DCIM results to the DLM; register 1 should specify the DLM address that receives the results.
There are three main categories of matrix computation using RDCIM, and the following examples illustrate each of them (a driver-level sketch of the first one follows below).

If the input and the weight data are both 4-bit signed data, the VVM execution follows these steps:
1) cim_clr: Initialize the DCIM state;
2) data_trans_2: Store the weight data in the selected column of the SRAM array. Register 1 specifies the address of the weight data and register 2 specifies the column to be written;
3) data_trans_1: Store the input data into the SRAM buffer. Register 1 specifies the address of the input data;
4) run_cim: Start the computation with the selected column of the SRAM array. The w_sgn and x_sgn should be set to 1, since the input data and the weight data are signed, and the w_cycle and bit_w should be set to 0. Register 2 specifies the column to be computed;
5) read_out: Load the results into the DLM. Register 1 specifies the address that receives the results.

Moreover, the instructions can be combined to support multi-precision computation. Suppose the weight data is signed 12-bit and the input data is signed 8-bit. The input data can be stored in the SRAM buffer entirely, but the weight data needs to be stored in three different columns of the SRAM array. The computation can be simplified as follows:
1) cim_clr: Initialize the DCIM state;
2) data_trans_2: Repeat three times to store the 12-bit weight data into three different columns of the SRAM array;
3) data_trans_1: Store the input data into the SRAM buffer;
4) run_cim: Start the computation with the column of the high 4 bits selected. The w_sgn, x_sgn, and bit_w should be set to 1, and the w_cycle should be set to 0;
5) run_cim: Start the computation with the column of the middle 4 bits selected. The w_cycle, x_sgn, and bit_w should be set to 1, and the w_sgn should be set to 0;
6) run_cim: Start the computation with the column of the low 4 bits selected. The w_cycle, x_sgn, and bit_w should be set to 1, and the w_sgn should be set to 0;
7) read_out: Load the results into the DLM.

Suppose instead the input data is signed 12-bit and the weight data is signed 4-bit. The computation can be simplified to the following steps:
1) cim_clr: Initialize the DCIM state;
2) data_trans_2: Store the weight data into the selected column of the SRAM array;
3) data_trans_1: Store the high 8 bits of the input data into the SRAM buffer;
4) run_cim: Start the computation with the selected column. The w_sgn, x_sgn, and bit_w should be set to 1, and the w_cycle should be set to 0;
5) data_trans_1: Store the low 4 bits of the input data into the SRAM buffer;
6) run_cim: Start the computation with the selected column. The w_sgn should be set to 1, and the x_sgn, bit_w, and w_cycle should be set to 0;
7) read_out: Load the results into the DLM.

For the computation of unsigned data, the x_sgn and w_sgn should be set to 0.
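The first sequence can be pictured from the driver side as a list of (instruction, register-1, register-2) operations. The mnemonics follow Fig. 15(b); the addresses, the column index, and the register-1 field packing are invented for illustration.

```python
# Hypothetical driver-side view of the 4-bit signed VVM sequence above.
WGT_ADDR, IN_ADDR, OUT_ADDR = 0x100, 0x200, 0x300     # invented DLM addresses
COLUMN = 0                                            # invented column index
RUN_CFG = dict(w_sgn=1, x_sgn=1, w_cycle=0, bit_w=0)  # per the steps above

program = [
    ("cim_clr",      0,        0),        # reset accumulator, latches, ROC
    ("data_trans_2", WGT_ADDR, COLUMN),   # DLM -> selected SRAM column
    ("data_trans_1", IN_ADDR,  0),        # DLM -> SPBUF
    ("run_cim",      RUN_CFG,  COLUMN),   # bit-serial VVM on that column
    ("read_out",     OUT_ADDR, 0),        # results -> DLM
]
for op, r1, r2 in program:
    print(f"{op:12} r1={r1} r2={r2}")
```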
IV. EXPERIMENTAL RESULTS

A 64KB macro in 55nm CMOS technology was fabricated to evaluate the principal innovations of RDCIM. The detailed technical specifications of the RDCIM macro are shown in Table I. The RDCIM macro operates at a nominal voltage of 1.2V and a clock frequency of 200MHz, with a total area of 9.8mm2; it contains a 2.8mm2 DCIM macro with a 64KB SRAM array inside. The RDCIM macro achieves a peak energy efficiency of 66.3 TOPS/W with an area efficiency of 0.288 TOPS/mm2 at 4-bit precision, and 16.6 TOPS/W with an area efficiency of 0.072 TOPS/mm2 at 8-bit precision. Multi-precision computation covering 4/8/12/16-bit weight and input data can be conducted in RDCIM to support various application scenarios. To support the innovative circuits mentioned above, a RISC-V CPU with 64KB of instruction and data cache, expandable to 4MB, is incorporated. This RISC-V CPU has a 3-stage pipeline that executes RISC-V instructions seamlessly.

TABLE I. Chip specification of our designed RDCIM macro.

A. Performance Evaluation

Fig. 16(a) shows a technique-breakdown analysis of our proposed DCIM for VVM computation at 55nm and 200MHz. We use power efficiency per area (PEPA) as the indicator to compare the area and power advantages of the different features. The baseline is a conventional DCIM without any optimization. By adopting the SPBUF and MPAA features, the area overhead is reduced and the PEPA increases by about 1.08× and 1.18×, respectively. The AOMB feature reduces the power consumption by pre-processing the weight data and increases the PEPA by about 1.22×. Combining these features increases the PEPA by about 1.59×, the result of co-optimizing the area and power consumption of DCIM.

We use the RISC-V instructions to control the DCIM, which increases the computation efficiency compared to bus-type control. For a fair comparison, we set the bit width of the bus-type control to 32 bits and the frequency to 200MHz, the same as RDCIM. The bus-type control has to load the configuration parameters before the computation, which consumes extra clock cycles. Fig. 16(b) shows the clock-cycle comparison between the bus-type control and the extended-instruction control. Our scheme shows high computation efficiency in the stages of loading input data and executing computation, which require fewer clock cycles because the configuration parameters arrive with the instructions. For the stage of loading weight data, which requires a long execution time, the improvement is not significant; however, the weight data should be rather static to achieve better energy efficiency, so these instructions are rarely used.

B. Evaluation on Neural Network Models

We use ResNet-10 and ResNet-18 to execute inference on our chip and evaluate the efficiency of RDCIM. Each network model is implemented on the chip at 4-bit and 8-bit precision. The models are pre-trained in floating point and quantized to 4 bits and 8 bits. We extend the memory inside the RISC-V CPU to 4MB in our evaluation environment to maximize the performance of the proposed structure. We write the models as C programs and pre-load the weights into the DLM to execute inference. During chip execution, the RISC-V CPU first identifies the operations involved in the models. These operations divide into MVM and non-MVM parts. The convolutional part consists mainly of MVM operations, which the RISC-V CPU hands over to the DCIM through the extended instructions. The non-MVM part includes pooling, ReLU, etc., which the RISC-V CPU executes in its own ALU in SIMD mode. In the evaluation environment, the whole system operates at 1.2V and 200MHz. We measure the latency of each layer by the time register inside the RISC-V CPU. We evaluate the power consumption of the digital part in the Design Compiler environment and that of the analog part in the Virtuoso environment. The analog power mainly consists of two parts: the writing and the reading of the weight data and input data. Assuming a toggle rate of 0.5 for the models, we evaluate the write and read power of the SRAM array and the input buffer, and we obtain the total power consumption of RDCIM by adding up the power of the digital part and the analog part.
causes a main decrease in the inference accuracy of neural networks The accuracy of ResNet-18 is 94.5% in bits and 94.8% in bits The accuracy of ResNet-10 is 93.7% in bits and 94.1% in bits To make full use of the DCIM macro, the weight inside the SRAM array should be rather static We use the convolution-kernel reuse strategy to reduce the weight reload [35] In the simulation environment, the inference latency of ResNet-10 is 0.32ms in bits and 1.4ms in bits with the whole model on the chip The inference latency of ResNet-18 is 0.71ms in bits and 3.1ms in bits The DCIM macro achieves 55.49 TOPS/W for ResNet-10 in bits and 13.87 TOPS/W in bits, and 52.22 TOPS/W for ResNet-18 in bits and 13.06 TOPS/W in bits in terms of energy efficiency The layer-wise efficiency of DCIM ranges from 27.97 TOPS/W to 65.28 TOPS/W C Energy and Area Breakdown Fig 17(a) and (b) show the energy breakdown and the area breakdown of DCIM visually As Fig 17(a) shows, the adder tree accounts for 65% of the main energy breakdown of our DCIM However, it reduces significantly compared to recent related work ∼79% Moreover, Fig 17(b) reveals that the SRAM array causes 78.1% of the area overhead Since RDCIM adopts the 8T SRAM cell strategy where 32 SRAM columns share one digital computation circuit, the area of the SRAM array and the digital computation circuits are balanced The accumulator takes only 3.6% area by adopting the MPAA Fig 17(c) and 17(d) show the layout of DCIM and the die micrograph of RDCIM, respectively The entire system consists of the DCIM macro and the RISC-V CPU In the DCIM macro, the 64KB SRAM array is split into two sections, with digital computation circuits sandwiched between them The SPBUF is placed on the bottom of DCIM In the die micrograph of RDCIM, the RISC-V CPU is placed on the boundary of DCIM In TABLE III, we compare our work with international state-of-the-art computation-in-memory designs, and the results demonstrate an improvement in energy efficiency, array size, and flexibility The compared works represent two Authorized licensed use limited to: NASATI Downloaded on January 27,2024 at 07:26:13 UTC from IEEE Xplore Restrictions apply This article has been accepted for inclusion in a future issue of this journal Content is final as presented, with the exception of pagination 12 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS Fig 16 (a) Technique breakdown analysis of our proposed DCIM (b) Computation efficiency comparison between bus-type control and extended instructions control TABLE III S UMMARY Fig 17 (a) Power breakdown of DCIM (b) area breakdown of DCIM (c) The layout of DCIM (d) The die micrograph of RDCIM mainstream CIM accelerator architectures: ACIM [22] and DCIM [17], [19], [23] JSSC’2022 [17] utilized data reuse and multiplication fusion to reduce the amount of MAC operations In addition, the paper presents a CIM macro that is optimized for bit-level sparsity and uses 1’s/2’s complement mixed computation mode JSSC’2023 [22] designed a highly compact CIM computing bit cell, which supported standard read/write access, 1-b × 1-b MAC, reference voltage generation, and in-memory A/D conversion By utilizing this method, the efficiency of its area became significantly higher JSSC’2022 [19] connected its CIM engines through a reconfigurable streaming network with dedicated modes for different layers in transformer models, avoiding additional offchip loading TVLSI’2023 [23] proposed a DCIM that can reuse memory space through a multi-cycle input activation scheme and 
These methods can all improve the energy efficiency of CIM macros to a certain extent. By adopting both the MPAA and the SPBUF, RDCIM improves area efficiency, while its energy efficiency is enhanced by the AOMB scheme compared to other works. With multiple SRAM columns sharing one adder tree, RDCIM offers a larger array size than previous works [19], [22], [23], which reduces data movement. By adopting the AOMB and MAT, the power consumption of the adder tree is reduced. For 4-bit input and weight data, RDCIM improves the energy efficiency by 4.82×, 2.77×, and 1.29× over [17], [19], and [23], respectively. For 8-bit computation, RDCIM achieves 16.6 TOPS/W, which is 2.81× and 2.78× higher than [17] and [22], respectively. RDCIM has the advantage of configurable arithmetic precision, supporting 4/8/12/16-bit computation and providing more flexibility. Moreover, RDCIM uses RISC-V extended instructions for control instead of an FPGA as in [17] and [19], which shows better computation efficiency at the architectural level.

V. CONCLUSION

DCIM can avoid the inaccuracies caused by process variations, data conversion overhead, and the poor scaling of analog circuits; however, it also comes with new challenges in the power and area aspects. This paper proposes a RISC-V supported full-digital computing-in-memory processor with high energy efficiency and low area overhead. While most previous works focus on optimizing circuits and architecture separately, RDCIM co-optimizes circuit and architecture by adopting the AOMB, MAT, MPAA, SPBUF, and RISC-V extended instructions. This work advances the frontier of DCIM for energy-efficient and area-saving MACs, promising transformative benefits for diverse DNN-based artificial intelligence applications.

REFERENCES

[1] M. Liang, B. Yang, S. Wang, and R. Urtasun, "Deep continuous fusion for multi-sensor 3D object detection," in Proc. ECCV, 2018, pp. 641–656.
[2] A. Canziani, A. Paszke, and E. Culurciello, "An analysis of deep neural network models for practical applications," 2016, arXiv:1605.07678.
[3] O. Caglayan, R. Sanabria, S. Palaskar, L. Barrault, and F. Metze, "Multimodal grounding for sequence-to-sequence speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2019, pp. 8648–8652.
[4] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[5] A. Shafiee et al., "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," ACM SIGARCH Comput. Archit. News, vol. 44, no. 3, pp. 14–26, 2016.
[6] Y. Chen, T. Chen, Z. Xu, N. Sun, and O. Temam, "DianNao family: Energy-efficient hardware accelerators for machine learning," Commun. ACM, vol. 59, no. 11, pp. 105–112, Oct. 2016.
[7] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017.
[8] T. Yuan, W. Liu, J. Han, and F. Lombardi, "High performance CNN accelerators based on hardware and algorithm co-optimization," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 68, no. 1, pp. 250–263, Jan. 2021.
[9] M. M. Waldrop, "More than Moore," Nature, vol. 530, no. 7589, pp. 144–148, 2016.
V. CONCLUSION

DCIM can avoid the inaccuracies caused by process variations, data conversion overhead, and poor scaling of analog circuits. However, it also comes with new challenges in the power and area aspects. This paper proposes a RISC-V supported full-digital computing-in-memory processor with high energy efficiency and low area overhead. While most previous works optimize circuits and architecture separately, RDCIM co-optimizes circuit and architecture by adopting the AOMB, MAT, MPAA, SPBUF, and RISC-V extended instructions. This work advances the frontier of DCIM for energy-efficient and area-saving MACs, promising transformative benefits for diverse DNN-based artificial intelligence applications.

REFERENCES

[1] M. Liang, B. Yang, S. Wang, and R. Urtasun, "Deep continuous fusion for multi-sensor 3D object detection," in Proc. ECCV, 2018, pp. 641–656.
[2] A. Canziani, A. Paszke, and E. Culurciello, "An analysis of deep neural network models for practical applications," 2016, arXiv:1605.07678.
[3] O. Caglayan, R. Sanabria, S. Palaskar, L. Barraul, and F. Metze, "Multimodal grounding for sequence-to-sequence speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2019, pp. 8648–8652.
[4] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[5] A. Shafiee et al., "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," ACM SIGARCH Comput. Archit. News, vol. 44, no. 3, pp. 14–26, 2016.
[6] Y. Chen, T. Chen, Z. Xu, N. Sun, and O. Temam, "DianNao family: Energy-efficient hardware accelerators for machine learning," Commun. ACM, vol. 59, no. 11, pp. 105–112, Oct. 2016.
[7] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017.
[8] T. Yuan, W. Liu, J. Han, and F. Lombardi, "High performance CNN accelerators based on hardware and algorithm co-optimization," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 68, no. 1, pp. 250–263, Jan. 2021.
[9] M. M. Waldrop, "More than Moore," Nature, vol. 530, no. 7589, pp. 144–148, 2016.
[10] C.-J. Jhang, C.-X. Xue, J.-M. Hung, F.-C. Chang, and M.-F. Chang, "Challenges and trends of SRAM-based computing-in-memory for AI edge devices," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 68, no. 5, pp. 1773–1786, May 2021.
[11] X. J. Guo and S. D. Wang, "Overview of edge intelligent computing-in-memory chips," Micro/Nano Electron. Intell. Manuf., vol. 1, no. 2, pp. 73–82, Jun. 2019.
[12] H. Zhang et al., "HD-CIM: Hybrid-device computing-in-memory structure based on MRAM and SRAM to reduce weight loading energy of neural networks," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 69, no. 11, pp. 4465–4474, Nov. 2022.
[13] N. Bruschi et al., "End-to-end DNN inference on a massively parallel analog in memory computing architecture," 2022, arXiv:2211.12877.
[14] H. Fujiwara et al., "A 5-nm 254-TOPS/W 221-TOPS/mm2 fully-digital computing-in-memory macro supporting wide-range dynamic-voltage-frequency scaling and simultaneous MAC and write operations," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, vol. 65, Feb. 2022, pp. 1–3.
[15] Y.-D. Chih et al., "An 89 TOPS/W and 16.3 TOPS/mm2 all-digital SRAM-based full-precision compute-in-memory macro in 22 nm for machine-learning edge applications," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2021, pp. 252–254.
[16] H. Zhu et al., "COMB-MCM: Computing-on-memory-boundary NN processor with bipolar bitwise sparsity optimization for scalable multi-chiplet-module edge machine learning," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, vol. 65, Feb. 2022, pp. 1–3.
[17] R. Guo et al., "TT@CIM: A tensor-train in-memory-computing processor using bit-level-sparsity optimization and variable precision quantization," IEEE J. Solid-State Circuits, vol. 58, no. 3, pp. 852–866, Mar. 2023.
[18] F. Tu et al., "A 28 nm 29.2 TFLOPS/W BF16 and 36.5 TOPS/W INT8 reconfigurable digital CIM processor with unified FP/INT pipeline and bitwise in-memory booth multiplication for cloud deep learning acceleration," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, vol. 65, Feb. 2022, pp. 1–3.
[19] F. Tu et al., "TranCIM: Full-digital bitline-transpose CIM-based sparse transformer accelerator with pipeline/parallel reconfigurable modes," IEEE J. Solid-State Circuits, vol. 58, no. 6, pp. 1798–1809, Oct. 2022.
[20] J. Yue et al., "A 28 nm 16.9–300 TOPS/W computing-in-memory processor supporting floating-point NN inference/training with intensive-CIM sparse-digital architecture," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2023, pp. 1–3.
[21] F. Tu et al., "16.4 TensorCIM: A 28 nm 3.7 nJ/gather and 8.3 TFLOPS/W FP32 digital-CIM tensor processor for MCM-CIM-based beyond-NN acceleration," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2023, pp. 254–256.
[22] C.-Y. Yao, T.-Y. Wu, H.-C. Liang, Y.-K. Chen, and T.-T. Liu, "A fully bit-flexible computation in memory macro using multi-functional computing bit cell and embedded input sparsity sensing," IEEE J. Solid-State Circuits, vol. 58, no. 5, pp. 1487–1495, May 2023.
[23] Z. Lin et al., "A fully digital SRAM-based four-layer in-memory computing unit achieving multiplication operations and results store," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 31, no. 6, pp. 776–788, Jun. 2023.
[24] Y. S. Lee, Y.-H. Gong, and S. W. Chung, "Scale-CIM: Precision-scalable computing-in-memory for energy-efficient quantized neural networks," J. Syst. Archit., vol. 134, Jan. 2023, Art. no. 102787.
[25] H. Jia, H. Valavi, Y. Tang, J. Zhang, and N. Verma, "A programmable heterogeneous microprocessor based on bit-scalable in-memory computing," IEEE J. Solid-State Circuits, vol. 55, no. 9, pp. 2609–2621, Sep. 2020.
[26] S. Spetalnick and A. Raychowdhury, "A practical design-space analysis of compute-in-memory with SRAM," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 69, no. 4, pp. 1466–1479, Apr. 2022.
[27] H. Jia et al., "Scalable and programmable neural network inference accelerator based on in-memory computing," IEEE J. Solid-State Circuits, vol. 57, no. 1, pp. 198–211, Jan. 2022.
[28] L. Han et al., "Efficient discrete temporal coding spike-driven in-memory computing macro for deep neural network based on nonvolatile memory," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 69, no. 11, pp. 4487–4498, Nov. 2022.
[29] B. Zhang et al., "PIMCA: A programmable in-memory computing accelerator for energy-efficient DNN inference," IEEE J. Solid-State Circuits, vol. 58, no. 5, pp. 1436–1449, May 2023.
[30] H. Zhang et al., "CP-SRAM: Charge-pulsation SRAM marco for ultra-high energy-efficiency computing-in-memory," in Proc. 59th ACM/IEEE Design Autom. Conf., Jul. 2022, pp. 109–114.
[31] Y. Wang et al., "Trainer: An energy-efficient edge-device training processor supporting dynamic weight pruning," IEEE J. Solid-State Circuits, vol. 57, no. 10, pp. 3164–3178, Oct. 2022.
[32] S. Yu, H. Jiang, S. Huang, X. Peng, and A. Lu, "Compute-in-memory chips for deep learning: Recent trends and prospects," IEEE Circuits Syst. Mag., vol. 21, no. 3, pp. 31–56, 3rd Quart., 2021.
[33] J.-W. Su et al., "A 28 nm 384 kb 6T-SRAM computation-in-memory macro with 8-b precision for AI edge chips," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2021, pp. 250–252.
[34] P.-C. Wu et al., "A 28 nm 1 Mb time-domain computing-in-memory 6T-SRAM macro with a 6.6 ns latency, 1241 GOPS and 37.01 TOPS/W for 8-b MAC operations for edge-AI devices," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, vol. 65, Feb. 2022, pp. 1–3.
[35] F. Tu, S. Yin, P. Ouyang, S. Tang, L. Liu, and S. Wei, "Deep convolutional neural network architecture with reconfigurable computation patterns," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 8, pp. 2220–2233, Aug. 2017.

Wente Yi (Graduate Student Member, IEEE) received the B.S. degree in electronic science and technology from Beihang University, Beijing, China, in 2022, where he is currently pursuing the M.S. degree in electronic and information engineering. His research interests include computing-in-memory circuits and CIM-based neural network design.

Kefan Mo received the B.S. degree in optoelectronic information science and engineering from the Nanjing University of Aeronautics and Astronautics, China, in 2021. He is currently pursuing the M.S. degree in electronic science and technology with Beihang University. His research interests include computing-in-memory circuits and AI processors.

Wenjia Wang received the B.S. degree in electronic science and technology from Jilin University, Changchun, China, in 2022. He is currently pursuing the M.S. degree in integrated circuit science and engineering with Beihang University, Beijing, China. His research interests include brain-like computing circuits and computing-in-memory circuits.

Yitong Zhou received the B.S. degree in automation from the Nanjing University of Science and Technology, Nanjing, China, in 2021. He is currently pursuing the M.S. degree in electronic science and technology with Beihang University. His research interests include compute-in-memory circuits.
Yejun Zeng received the B.S. degree in electronic and information engineering from Beihang University, Beijing, China, in 2022, where he is currently pursuing the M.S. degree in electronic science and technology. His current research interests include neural network deployment and accelerator design.

Zihan Yuan received the B.S. degree in electronic science and technology from the Hefei University of Technology in 2022. She is currently pursuing the M.S. degree in electronic and information engineering with Beihang University. Her research interests include computing-in-memory, embedded deep learning, and reconfigurable computing.

Bojun Cheng (Member, IEEE) received the B.Sc. degree from the School of the Gifted Young, University of Science and Technology of China, in 2012, and the Ph.D. degree from ETH Zürich in 2021, advised by Prof. Juerg Leuthold. He is currently an Assistant Professor with the Thrust of Microelectronics, The Hong Kong University of Science and Technology (Guangzhou), China. His research interests include emerging memory devices and their application in neuromorphic computing and optical interconnects.

Biao Pan (Member, IEEE) received the Ph.D. degree in optical engineering from the Huazhong University of Science and Technology, Wuhan, in 2015. He is currently an Associate Professor with the School of Integrated Circuit Science and Engineering, Beihang University, Beijing, China. His research interests include computing-in-memory circuit design and neuromorphic computing with emerging nonvolatile memory devices.
