Compute-in-memory designs for deep neural network and combinatorial optimization problems accelerators




Xie, Shanshan, Ph. D.



The unprecedented growth in Deep Neural Network (DNN) model size has resulted in a massive amount of data movement from off-chip memory to on-chip processing cores in modern Machine Learning (ML) accelerators. Compute-In-Memory (CIM) designs, which perform analog DNN computations within a memory array alongside peripheral data converter circuits, are being explored to mitigate the latency and energy overheads of this ‘Memory Wall’ bottleneck. Embedded non-volatile magnetic [Wei et al., 2019; Chih et al., 2020; Dong et al., 2018; Shih et al., 2019] and resistive [Jain et al., 2019; Chou et al., 2020; Chang et al., 2014; Lee et al., 2017] memories, as well as standalone Flash memories, suffer from low write speeds and poor write endurance and cannot be used for programmable accelerators that require fast and frequent model updates. Similarly, cost-sensitive commodity DRAM (Dynamic Random Access Memory) cannot be leveraged for high-speed, custom CIM designs because its limited metal layers and dense floorplan constraints often lead to compute-near-memory designs, which limits the throughput benefits [Aga et al., 2019]. Among the prevalent semiconductor memories, eDRAM (embedded DRAM), which integrates the DRAM bitcell monolithically with high-performance logic transistors and interconnects, can enable custom CIM designs by offering the densest embedded bitcell, low pJ/bit access energy, high endurance, high performance, and high bandwidth, all desired attributes for ML accelerators [Fredeman et al., 2015; Berry et al., 2020]. Yet eDRAM has been used only in niche applications due to its high cost/bit, low retention time, and high noise sensitivity. On the DNN algorithms front, the landscape is rapidly changing with the adoption of 8-bit integer arithmetic for both DNN inference and training algorithms [Jouppi et al., 2017; Yang et al., 2020].
These reduced bit-width computations are well suited to CIM designs, which have shown promising results for integer arithmetic [Biswas and Chandrakasan, 2018; Gonugondla et al., 2018a; Zhang et al., 2017; Si et al., 2019; Yang et al., 2019; Khwa et al., 2018; Chen et al., 2019; Dong et al., 2020; Valavi et al., 2019; Dong et al., 2017; Jiang et al., 2019; Yin et al., 2020]. Thus, the high cost/bit of eDRAM can now be amortized by repurposing existing eDRAM in high-end processors to enable CIM circuits. Despite the potential of eDRAM technology and the progress in DNN integer arithmetic, no hardware demonstration of an eDRAM-based CIM design has been reported so far. Therefore, the first project in this dissertation explores the compute-in-memory concept using dense 1T1C eDRAM bitcells as charge-domain circuits for convolutional neural network (CNN) multiply-accumulation-averaging (MAV) computation. This method minimizes area overhead by leveraging existing 1T1C eDRAM columns to implement an adaptive data converter, dot-product, averaging, pooling, and ReLU activation on the memory array. The second project presents a leakage- and read-bitline (RBL) swing-aware CIM design that leverages a promising high-density gain-cell embedded DRAM bitcell and the intrinsic RBL capacitors to perform CIM computations within the limited RBL swing available in a 2T1C eDRAM. The CIM D/A converters (DACs) are realized intrinsically with variable RBL precharge voltage levels. The A/D converters (ADCs) are realized using Schmitt Triggers (STs) as compact and reconfigurable Flash comparators. Like machine learning applications, combinatorial optimization problems (COPs) require data-intensive computations, which makes them naturally suited to the compute-in-memory concept as well. COPs arise in many real-world social and industrial data-intensive computing applications.
Examples include optimization of mRNA sequences for COVID-19 vaccines [Leppek et al., 2021; Pardi et al., 2018], semiconductor supply chains [Crama, 1997; Kempf, 2004], and financial index tracking [Benidis et al., 2018], to name a few. Such COPs are predominantly NP-hard [Yuqi Su and Kim, 2020], and an exhaustive brute-force search becomes untenable as the COP size increases. An efficient way to solve COPs is to let nature perform the exhaustive search in the physical world using the Ising model, which can map many types of COPs [Lucas, 2014]. The Ising model describes spin dynamics in a ferromagnet [Peierls, 1936], wherein spins naturally orient to achieve the lowest ensemble energy state of the Ising model, which represents the optimal COP solution [Yoshimura et al., 2015]. Therefore, to accelerate COP computations, the third project focuses on implementing analog compute-in-memory techniques for Ising computation to eliminate unnecessary data movement and to reduce energy costs. The COPs are mapped into a generic Ising model framework, and the computations are performed directly on the bitlines. Spin updates are performed locally using the existing sense amplifier in the peripheral circuits and the write-after-read mechanism in the memory array controller. Beyond that, the fourth project explores CIM designs for solving Boolean Satisfiability (SAT) problems, which are non-deterministic polynomial time (NP)-complete and have many practical and industrial data-intensive applications. An all-digital SAT solver, called Snap-SAT, is presented that accelerates the iterative computations using a static random-access memory (SRAM) array, reducing frequent memory accesses and minimizing the hardware implementation cost.
This design demonstrates a promising, fast, reliable, reconfigurable, and scalable compute-in-memory design for solving and accelerating large-scale hard SAT problems, suggesting its potential for solving time-critical SAT problems in real-life applications (e.g., defense and vaccine development).
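
As an illustration of the Ising formulation mentioned in the abstract, the following is a minimal software sketch, not the dissertation's analog in-memory circuit. It assumes the common convention H(s) = -sum_{i<j} J_ij s_i s_j - sum_i h_i s_i with spins in {-1, +1} and a symmetric coupling matrix J. The function names (`ising_energy`, `local_field`, `greedy_descent`) and the 4-node example are hypothetical. The update rule shown is plain greedy descent, which can stall in a local minimum; practical Ising solvers typically add some form of annealing.

```python
def ising_energy(J, h, spins):
    """Ising energy H(s) = -sum_{i<j} J[i][j]*s_i*s_j - sum_i h[i]*s_i."""
    n = len(spins)
    pair = sum(J[i][j] * spins[i] * spins[j]
               for i in range(n) for j in range(i + 1, n))
    field = sum(h[i] * spins[i] for i in range(n))
    return -pair - field


def local_field(J, h, spins, i):
    """Effective field on spin i produced by its neighbours and its bias."""
    return sum(J[i][j] * spins[j] for j in range(len(spins)) if j != i) + h[i]


def greedy_descent(J, h, spins, max_sweeps=100):
    """Align each spin with its local field until no spin changes.

    Assumes J is symmetric. Every flip strictly lowers the energy,
    so the loop terminates, possibly in a local minimum.
    """
    spins = list(spins)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(spins)):
            f = local_field(J, h, spins, i)
            new = 1 if f > 0 else -1 if f < 0 else spins[i]
            if new != spins[i]:
                spins[i] = new
                changed = True
        if not changed:
            break
    return spins


# Hypothetical example: max-cut on a 4-node ring, encoded with
# antiferromagnetic couplings (J = -1 on each edge) and no bias.
J = [[0, -1, 0, -1],
     [-1, 0, -1, 0],
     [0, -1, 0, -1],
     [-1, 0, -1, 0]]
h = [0, 0, 0, 0]
solution = greedy_descent(J, h, [1, 1, 1, 1])
```

In this encoding, neighbouring spins with opposite signs correspond to a cut edge, so a lower energy means a larger cut. The local field is a dot product between one row of J and the spin vector, which is the kind of multiply-accumulate operation the abstract describes performing on the bitlines.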


LCSH Subject Headings