Efficient deep neural network model training by reducing memory and compute demands
MetadataShow full item record
Deep neural network models are commonly used in various real-life applications due to their high prediction accuracy for different tasks. In particular, CNN (convolutional neural network) models have become the de facto choices for most vision applications such as image classification, object segmentation, and object detection. Modern CNN models contain hundreds of million of parameters and training them requires millions of computation- and memory access-heavy iterations. To reduce this expensive CNN model training cost, this dissertation presents computation and memory cost-efficient training mechanisms with a combination of workload scheduling, learning algorithm, and accelerator architecture optimizations. This dissertation also introduces a performance model for data-parallel accelerators as a fast and accurate method to estimate the performance impact of the proposed architectural optimizations and to help fine-grain accelerator design space exploration. The first part of this dissertation discusses reducing the memory bandwidth demand for CNN training. I first analyze data reuse opportunities in CNN training and show that CNN training has high data locality between network layers but that conventional training mechanisms fail to utilize this inter-layer locality. Then, I develop a CNN training scheduling mechanism that modifies the network execution ordering in a way that captures the inter-layer locality while supporting high compute resource utilization. I also introduce a training accelerator that adopts architectural optimizations that hide additional data transfers caused by the proposed scheduling modification and realize effective training speedup. The proposed training accelerator has 45 mixed precision FLOPS and, with the memory bandwidth-efficient network training scheduling, beats a state-of-the-art GPU that has ∼3X higher peak FLOPs. The second part of this dissertation focuses on reducing the computation cost of CNN training. To reduce computations during training, I use neural network model pruning from the beginning of training. The insight is that a fully trained CNN model contains many non-critical parameters and pruning such parameters during training has only a minor impact on the learning quality. I also choose to structurally prune these parameters to provide high data parallelism avoiding complex data indexing, thus maintaining high compute resource utilization. For the practical implementation of pruning while training, I propose three algorithmic optimizations. Theses optimizations are designed to remove the need for the memory accesses caused by tensor reshaping, reduce the number of training runs in finding the desired pruning hyper-parameters, and maintain high data parallelism even for processing a highly pruned CNN model. Overall, the proposed algorithm speeds up the training of commonly used state-of-the-art image classifiers by 39% with only 1.9% accuracy loss. The third part of this dissertation deals with training pruned CNN models on accelerators with large systolic arrays. I first show my observation that processing structurally-pruned CNN models on a large systolic array severely underutilizes its PEs (processing elements) because the reduced number of channels decreases parallelism. Then, I show that naively splitting a large core into multiple small cores improves PE utilization but decreases input reuse and incurs >4% area overhead. To improve PE utilization and maintain high input reuse, I propose a flexible systolic array architecture that can reconfigure its structure to one of several modes, each designed for efficient execution of CNN layers with different dimensions. I also develop compile-time heuristics that optimize mapping the layer workload to the flexible systolic array resources for both high performance and energy efficiency. My new mechanisms increase PE utilization by 36% compared to a single large-core design and improve training energy efficiency by 18% compared to many-small-core designs. The last part of this dissertation is about developing an accelerator performance model for accurate CNN execution time estimation. For accurate performance modeling, I introduce a memory traffic model that predicts the data traffic at different levels of the GPU memory system hierarchy. This involves an in-depth analysis of the memory access patterns of data-parallel convolution kernels and the spatial locality. I demonstrate that the proposed performance model can provide guideline to fine-tune the GPU resources for efficient CNN performance scaling.