Browsing by Subject "Deep neural networks"
Now showing 1 - 7 of 7
Item: Analysis of storage bottlenecks in Deep Learning models (2019-06-25)
Tripathi, Aastha; Chidambaram, Vijay
Deep Learning (DL) is gaining prominence and is widely used for a plethora of problems. DL models, however, take on the order of days to train, and optimizing hyper-parameters adds further to the training time. This thesis analyzes the training behavior of Convolutional Neural Networks from a systems perspective. We perform a thorough study of the effects of system resources such as DRAM, persistent storage (SSD/HDD space), and the GPU on training time, and we explore how bottlenecks in the data-processing pipeline of the training phase can be avoided. Our analysis illustrates how GPU utilization can be maximized in the training pipeline by choosing the right combination of two hyper-parameters: the batch size and the number of data-prefetching worker processes (see the data-loading sketch below). We also propose a novel strategy to optimize these hyper-parameters by estimating the maximum usable batch size; the strategy additionally provides an approximately efficient combination of batch size and number of worker processes for the given resources.

Item: Bottlenecks in big data analytics and AI applications and opportunities for improvement (2022-12-21)
Richins, Daniel J.; Janapa Reddi, Vijay; John, Lizy Kurian; Wu, Carole-Jean; Julien, Christine L.; Marculescu, Diana
From shopping to social interaction, the related domains of big data analytics and artificial intelligence (AI) applications affect many aspects of our daily activities. Their success arises in part from their highly parallelizable compute, which allows them to process massive data sets in data centers, serve large numbers of users simultaneously, and perform almost innumerable simple calculations very quickly. Despite the success and ubiquity of big data analytics and AI, I show that the foundational principle of high performance in these paradigms, abundant and easily exploited parallel computation, has been pushed to the point where the limitations of parallel computing have come to dictate application performance. Using the industry benchmark TPCx-BB, I demonstrate that most of the compute is spent in code regions unable to fully utilize the available cores; in accordance with Amdahl's law, overall performance is dictated by these less-parallel regions of compute. In a data center deployment of an end-to-end AI application, the abundant parallelism of DNN inference is overshadowed by the non-parallel portions of the application pipeline: pre- and post-processing and inter-server communication. In a study of accelerated AI, I show that at a modest 8x compute speedup, performance improvement is completely halted by the limited storage bandwidth of just a handful of servers. Even within DNN inference itself, the demand for higher performance is pushing current hardware to its limits, to the point where DNN accuracy must sometimes be sacrificed for latency. To address the limitations at the boundaries of parallel computing in these domains, I propose solutions targeted to each domain. In big data analytics, I demonstrate that restricting big data software to a small subset of the available cores on each server can substantially improve performance, and I propose a combined hardware/software solution called core packing that would extend these benefits (up to 20% latency reduction) to a wide range of big data applications.
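The Amdahl's law ceiling described in the preceding abstract can be made concrete with a short back-of-the-envelope calculation. This is an illustrative sketch only; the parallel fraction and speedup values below are hypothetical placeholders, not figures from the dissertation.

```python
# Amdahl's law: speedup = 1 / ((1 - p) + p / s), where p is the parallelizable
# fraction of the work and s is the speedup applied to that fraction.
def amdahl_speedup(parallel_fraction: float, parallel_speedup: float) -> float:
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / parallel_speedup)

# Illustrative numbers: even if 90% of the compute parallelizes perfectly,
# the serial 10% caps the overall speedup at 10x no matter how large s grows.
for s in (2, 8, 64, 1_000_000):
    print(f"s = {s:>9}: overall speedup = {amdahl_speedup(0.9, s):.2f}x")
```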
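Relating to the storage-bottleneck thesis earlier in this listing, the following is a minimal sketch of sweeping the two hyper-parameters it studies (batch size and the number of data-prefetching worker processes), assuming a PyTorch-style input pipeline. The dataset, tensor shapes, and candidate values are hypothetical placeholders, not taken from the thesis.

```python
# Hypothetical sketch: sweep batch size and prefetch-worker count and time one
# pass over the data, keeping the combination that feeds the GPU fastest.
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in, in-memory dataset; the thesis studies real image data read from SSD/HDD.
data = TensorDataset(torch.randn(2_000, 3, 64, 64), torch.randint(0, 10, (2_000,)))

def time_one_pass(batch_size: int, num_workers: int) -> float:
    loader = DataLoader(data, batch_size=batch_size,
                        num_workers=num_workers,  # data-prefetching worker processes
                        pin_memory=True)
    start = time.time()
    for x, y in loader:
        pass                                      # the training step would go here
    return time.time() - start

if __name__ == "__main__":
    # Candidate values are illustrative placeholders, not the thesis's numbers.
    for bs in (32, 64, 128):
        for workers in (0, 2, 4, 8):
            print(f"batch={bs:<4} workers={workers}: {time_one_pass(bs, workers):.2f}s")
```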
In data center AI applications, I demonstrate how an edge data center, carefully tailored to the specific behavior of accelerated AI applications, can accommodate up to 32x accelerated AI at 15% lower total cost of ownership than a comparable data center that does not tailor itself to the needs of the application. Within DNN inference, I show that an additional source of parallelism, between adjacent layers in the DNN graph, can be exploited to offer latency reductions of up to 39%.

Item: Compute-in-memory designs for deep neural network and combinatorial optimization problems accelerators (2023-04-23)
Xie, Shanshan, Ph.D.; Kulkarni, Jaydeep P.; Pan, David Z.; Orshansky, Michael; Jia, Yaoyao; Hamzaoglu, Fatih
The unprecedented growth in Deep Neural Network (DNN) model size has resulted in a massive amount of data movement from off-chip memory to on-chip processing cores in modern Machine Learning (ML) accelerators. Compute-In-Memory (CIM) designs, which perform analog DNN computations within a memory array along with peripheral data-converter circuits, are being explored to mitigate this 'Memory Wall' bottleneck of latency and energy overheads. Embedded non-volatile magnetic [Wei et al. [2019]; Chih et al. [2020]; Dong et al. [2018]; Shih et al. [2019]] and resistive [Jain et al. [2019]; Chou et al. [2020]; Chang et al. [2014]; Lee et al. [2017]] memories, as well as standalone Flash memories, suffer from low write speeds and poor write endurance and cannot be used for programmable accelerators requiring fast and frequent model updates. Similarly, cost-sensitive commodity DRAM (Dynamic Random Access Memory) cannot be leveraged for high-speed, custom CIM designs due to limited metal layers and dense floorplan constraints, which often lead to compute-near-memory designs with limited throughput benefits [Aga et al. [2019]]. Among the prevalent semiconductor memories, eDRAM (embedded DRAM), which integrates the DRAM bitcell monolithically along with high-performance logic transistors and interconnects, can enable custom CIM designs by offering the densest embedded bitcell, low pJ/bit access energy, high endurance, high performance, and high bandwidth, all desired attributes for ML accelerators [Fredeman et al. [2015]; Berry et al. [2020]]. Yet eDRAM has been used only in niche applications due to its high cost/bit, low retention time, and high noise sensitivity. On the DNN algorithms front, the landscape is rapidly changing with the adoption of 8-bit integer arithmetic for both DNN inference and training [Jouppi et al. [2017]; Yang et al. [2020]]. These reduced bit-width computations are extremely conducive to CIM designs, which have shown promising results for integer arithmetic [Biswas and Chandrakasan [2018]; Gonugondla et al. [2018a]; Zhang et al. [2017]; Si et al. [2019]; Yang et al. [2019]; Khwa et al. [2018]; Chen et al. [2019]; Dong et al. [2020]; Valavi et al. [2019]; Dong et al. [2017]; Jiang et al. [2019]; Yin et al. [2020]]. Thus, the high cost/bit of eDRAM can now be amortized by repurposing the eDRAM already present in high-end processors for CIM circuits. Despite the potential of eDRAM technology and the progress in DNN integer arithmetic, no hardware demonstration of an eDRAM-based CIM design has been reported so far. Therefore, the first project in this dissertation explores the compute-in-memory concept with dense 1T1C eDRAM bitcells as charge-domain circuits for convolutional neural network (CNN) multiply-accumulation-averaging (MAV) computation.
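As a purely functional reference for the multiply-accumulation-averaging (MAV) operation that the charge-domain eDRAM array computes in analog, here is a short digital sketch with 8-bit integer operands. The vector length and operand values are hypothetical; this shows only the arithmetic, not the circuit.

```python
# Digital reference for one multiply-accumulate-average (MAV) step:
# multiply 8-bit inputs by 8-bit weights, accumulate, then average.
import numpy as np

inputs  = np.random.randint(-128, 128, size=64, dtype=np.int8)   # activations
weights = np.random.randint(-128, 128, size=64, dtype=np.int8)   # filter weights

# Accumulate in a wider type so the dot product cannot overflow.
acc = np.dot(inputs.astype(np.int32), weights.astype(np.int32))

# Averaging over the column length, loosely mirroring a charge-sharing readout.
mav = acc / inputs.size
print("accumulated:", acc, "averaged:", mav)
```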
This method minimizes area overhead by leveraging existing 1T1C eDRAM columns to construct an adaptive data converter, dot-product, averaging, pooling, and ReLU activation on the memory array. The second project presents a leakage- and read-bitline (RBL) swing-aware compute-in-memory (CIM) design that leverages a promising high-density gain-cell embedded DRAM bitcell and the intrinsic RBL capacitors to perform CIM computations within the limited RBL swing available in a 2T1C eDRAM. The CIM D/A converters (DACs) are realized intrinsically with variable RBL precharge voltage levels, and the A/D converters (ADCs) are realized using Schmitt Triggers (STs) as compact and reconfigurable Flash comparators. Similar to machine learning applications, combinatorial optimization problems (COPs) also require data-intensive computations, which makes them naturally suited to the compute-in-memory concept as well. Combinatorial optimization problems arise in many real-world social and industrial data-intensive computing applications; examples include optimization of mRNA sequences for COVID-19 vaccines [Leppek et al. [2021]; Pardi et al. [2018]], semiconductor supply chains [Crama [1997]; Kempf [2004]], and financial index tracking [Benidis et al. [2018]], to name a few. Such COPs are predominantly NP-hard [Yuqi Su and Kim [2020]], and an exhaustive brute-force search becomes untenable as the COP size increases. An efficient way to solve COPs is to let nature perform the exhaustive search in the physical world using the Ising model, which can map many types of COPs [Lucas [2014]]. The Ising model describes spin dynamics in a ferromagnet [Peierls [1936]], wherein spins naturally orient themselves to achieve the lowest ensemble energy state of the Ising model, which represents the optimal COP solution [Yoshimura et al. [2015]] (a small numerical sketch of this energy function is given below). Therefore, to accelerate COP computations, the third project focuses on implementing analog compute-in-memory techniques for Ising computation, eliminating unnecessary data movement and reducing energy costs. The COPs are mapped into a generic Ising model framework, and the computations are performed directly on the bitlines. Spin updates are performed locally using the existing sense amplifiers in the peripheral circuits and the write-after-read mechanism in the memory array controller. Beyond that, the fourth project explores CIM designs for solving Boolean Satisfiability (SAT) problems, a class of non-deterministic polynomial time (NP)-complete problems with many practical and industrial data-intensive applications. An all-digital SAT solver, called Snap-SAT, is presented to accelerate the iterative computations using a static random-access memory (SRAM) array, reducing frequent memory accesses and minimizing hardware implementation cost. This design demonstrates a promising, fast, reliable, reconfigurable, and scalable compute-in-memory approach for accelerating large-scale hard SAT problems, suggesting its potential for time-critical SAT problems in real-life applications (e.g., defense and vaccine development).

Item: Data-driven methodologies for supporting decision-making in roadway safety and pavement management (2023-08)
Xu, Yang, M.S. in Engineering; Bhasin, Amit; Li, Jenny; Caldas, Carlos H.; Boyles, Stephen D.
There has been a significant rise in the utilization of data-driven methods in contemporary transportation engineering.
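Returning to the Ising formulation mentioned in the compute-in-memory item above, the following is a small numerical sketch of the Ising energy and a greedy local spin update. The coupling matrix is a hypothetical toy instance, not one from the dissertation, and the update rule is a deliberately simplified stand-in for the sense-amplifier-based spin updates it describes.

```python
# Toy Ising model: E(s) = -sum_{i<j} J_ij * s_i * s_j, with spins s_i in {-1, +1}.
# Lower energy corresponds to a better solution of the mapped optimization problem.
import numpy as np

rng = np.random.default_rng(0)
n = 8
J = np.triu(rng.choice([-1, 1], size=(n, n)), k=1)   # keep only i < j couplings
spins = rng.choice([-1, 1], size=n)

def energy(s):
    return -np.sum(J * np.outer(s, s))

# Greedy spin updates: flip a spin whenever the flip lowers the ensemble energy.
for _ in range(200):
    i = rng.integers(n)
    flipped = spins.copy()
    flipped[i] *= -1
    if energy(flipped) < energy(spins):
        spins = flipped

print("final spins:", spins, "energy:", energy(spins))
```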
This trend is primarily attributed to the limitations of experience-based methods, such as subjectivity and non-reproducibility. In contrast, data-driven methods offer a more objective and effective approach to problem analysis, providing decision-makers with a reliable basis for informed decisions. The present research focuses on two types of data-driven methodologies: geostatistical analyses utilizing geographic information systems (GIS) and cutting-edge algorithms associated with artificial intelligence (AI). In numerical analysis, data provides a means to gain valuable insights into a problem of interest; while AI-oriented methods have been shown in many studies to be more effective than traditional approaches, the accuracy of the analysis still heavily depends on the quality of the data. This dissertation sheds light on the pivotal role that data plays in both roadway safety analysis and pavement management. To accomplish this, four studies are proposed that examine different aspects of data-driven methods: an evaluation of data consistency in motor vehicle crash databases, the identification of crash hot spots within a road network, a synthesis of advancements in the application of AI algorithms to pavement management activities, and an exploration of the relationship between pavement conditions and roadway safety using AI-oriented methods. The knowledge acquired from these studies serves as a foundation for future research and for the adoption of innovative approaches that improve the efficiency of safety analysis and pavement management, ultimately facilitating informed decision-making, effective resource allocation, and cost-effective interventions to enhance roadway safety and optimize pavement management practices.

Item: Distributed deep neural networks (2017-11-09)
Mullapudi, Subhash Venkat; Caramanis, Constantine; Khurshid, Sarfraz
Deep neural networks have become popular for solving machine learning problems in the field of computer vision. Although computers have reached parity in the task of image classification in machine learning competitions, mining massive training data often takes expensive hardware a long time. Distributed protocols for model training are attractive because less powerful distributed nodes are cheaper to operate than a specialized high-performance cluster. Stochastic gradient descent (SGD) is a popular optimizer at the heart of many deep learning systems. To investigate the performance of distributed asynchronous SGD, the TensorFlow deep learning framework was tested with Downpour SGD and Delay Compensated SGD to study their effect on model training in typical commercial environments (a simplified parameter-server sketch of the Downpour scheme appears below). Experimental results show that both Downpour and Delay Compensated SGD are viable protocols for distributed deep learning.

Item: Efficient deep learning for sequence data (2020-05)
Zhang, Jiong, Ph.D.; Dhillon, Inderjit S.; Ward, Rachel A.; Bajaj, Chandrajit L.; Martinsson, Per-Gunnar J.; Hsieh, Cho-Jui
Deep learning has achieved great success in many sequence learning tasks such as machine translation, speech recognition, and time series prediction. Powerful deep sequence learning models, including recurrent neural networks (RNNs) and Transformers, have tremendous expressive power to fit very complex functions. However, they sometimes cannot be applied to real-world scenarios due to a lack of efficiency.
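A simplified sketch of the Downpour-style asynchronous SGD referenced in the distributed deep neural networks item above, written as a single-process simulation with worker threads. The model, data, and learning rate are hypothetical placeholders, and a real deployment would use a deep learning framework's distributed runtime rather than this toy linear-regression setup.

```python
# Simplified Downpour-style asynchronous SGD: each worker pulls a (possibly
# stale) copy of the parameters, computes a mini-batch gradient, and pushes
# its update to a shared parameter vector without synchronizing with peers.
import threading
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.arange(1.0, 6.0)
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(5)                          # shared "parameter server" state
lock = threading.Lock()
lr = 0.05

def worker(seed, steps=200, batch=32):
    global w
    local_rng = np.random.default_rng(seed)
    for _ in range(steps):
        with lock:
            w_snapshot = w.copy()        # pull parameters (may already be stale)
        idx = local_rng.choice(len(X), size=batch, replace=False)
        grad = 2.0 * X[idx].T @ (X[idx] @ w_snapshot - y[idx]) / batch
        with lock:
            w -= lr * grad               # push the asynchronous update

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("learned weights:", np.round(w, 2))   # should approach [1, 2, 3, 4, 5]
```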
On one hand, deep learning models usually have millions of parameters and require computationally intensive algorithms to train, leading to tediously long training processes even with the most powerful hardware. On the other hand, capturing long-term dependencies within a sequence remains a contemporary challenge for most deep architectures. To overcome these challenges, we develop a series of methods to improve the efficiency of these deep learning architectures. In particular, we make the following contributions: (1) we propose methods to solve the vanishing and exploding gradient issues that arise in RNNs, enabling the capture of dependencies over longer ranges by exploiting the orthogonality of Householder matrices or the expressive power of the Fourier basis (a small sketch of this norm-preserving property appears at the end of this listing); (2) we develop a GPU-efficient training algorithm that improves the hardware efficiency of the proposed recurrent architectures with advanced linear algebra tools, achieving training speed similar to vanilla RNNs while allowing explicit management of recurrent memories; (3) to address the scalability of self-attentional Transformer models, we design a dynamic training scheme called AutoAssist and an advanced Transformer model with memory summarization (Transformer-FS). We show that the proposed AutoAssist pipeline can save up to 40% of SGD updates and that Transformer-FS can capture long-term dependencies with relatively few additional memory cells.

Item: RTL design and analysis of Softmax Layer in Deep Neural Networks (2020-05-07)
Xavier, Jim; John, Lizy Kurian
Deep neural networks (DNNs) are widely used in modern machine learning systems in the big data era for their superior accuracy, but these artificial neural networks suffer from high computational complexity. The structure of DNN layers varies depending on the nature of the training and inference tasks. The softmax layer is a critical layer in DNNs and is usually used as the output layer in multi-category classification tasks; it involves exponentiation and division, resulting in high computational complexity and long critical paths. This report focuses on the front-end implementation of an efficient microarchitecture for the softmax layer that addresses some of the problems associated with a simple, direct implementation. Techniques such as pipelining are employed to boost the performance of the complex datapath logic. Error analysis of the hardware is performed against software results from MATLAB. Synthesis of the RTL code is performed for a Xilinx Artix-7 FPGA, resulting in a clock frequency of 274.3 MHz.
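As a software reference for the softmax computation whose hardware microarchitecture the report above describes, here is the standard numerically stable formulation. This is not the report's RTL, and the input vector is a hypothetical example.

```python
# Softmax: exponentiate each logit and normalize by the sum, i.e. the
# exponentiation-and-division structure that drives the hardware's critical path.
# Subtracting the maximum first keeps exp() from overflowing (standard software
# practice; a fixed-point RTL implementation would handle range differently).
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / np.sum(exp)

print(softmax(np.array([2.0, 1.0, 0.1])))   # approximately [0.659, 0.242, 0.099]
```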
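And, referring back to the efficient deep learning for sequence data item above: a small sketch of why Householder matrices help with vanishing and exploding gradients. A Householder reflection is orthogonal, so it preserves the norm of the hidden state it multiplies; the vector sizes and values here are arbitrary illustrative choices, not the dissertation's construction.

```python
# A Householder matrix H = I - 2 v v^T / (v^T v) is orthogonal, so ||H x|| = ||x||.
# Chaining such norm-preserving transitions keeps recurrent gradients from
# vanishing or exploding as they propagate through time.
import numpy as np

rng = np.random.default_rng(1)
v = rng.normal(size=6)
H = np.eye(6) - 2.0 * np.outer(v, v) / (v @ v)

x = rng.normal(size=6)
print(np.allclose(H @ H.T, np.eye(6)))             # True: H is orthogonal
print(np.linalg.norm(x), np.linalg.norm(H @ x))    # identical norms
```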