Analysis of storage bottlenecks in Deep Learning models
Abstract
Deep Learning (DL) is gaining prominence and is widely used for a plethora of problems. DL models, however, take on the order of days to train, and hyper-parameter optimization further adds to the training time. This thesis analyzes the training of Convolutional Neural Networks (CNNs) from a systems perspective. We perform a thorough study of the effects of system resources such as DRAM, persistent storage (SSD/HDD), and the GPU on training time, and explore how bottlenecks in the data processing pipeline can be avoided during training. Our analysis illustrates how GPU utilization in the training pipeline can be maximized by choosing the right combination of two hyper-parameters: the batch size and the number of data-prefetching worker processes. We go a step further and propose a novel strategy to optimize these hyper-parameters by estimating the maximum usable batch size. Additionally, our strategy provides an approximately efficient combination of batch size and number of worker processes for the given resources.
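To make the two hyper-parameters concrete, the sketch below shows where they appear in a PyTorch-style input pipeline. This is an illustrative assumption rather than the thesis's prescribed setup: the framework, the dataset (torchvision's synthetic FakeData), and the specific values of `batch_size` and `num_workers` are placeholders standing in for the quantities the proposed strategy would estimate from the available DRAM, storage, and GPU memory.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Placeholder values; the thesis proposes estimating these from the
# available system resources rather than fixing them by hand.
batch_size = 128   # batch size (assumed to fit in GPU memory)
num_workers = 4    # data-prefetching worker processes (assumption)

# Synthetic image dataset standing in for a real CNN training set.
dataset = datasets.FakeData(size=1024,
                            image_size=(3, 224, 224),
                            transform=transforms.ToTensor())

loader = DataLoader(dataset,
                    batch_size=batch_size,
                    num_workers=num_workers,  # parallel prefetching
                    pin_memory=True,          # faster host-to-GPU copies
                    shuffle=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
for images, labels in loader:
    images = images.to(device, non_blocking=True)
    # ... forward/backward pass of the CNN would run here ...
    break
```

If `num_workers` is too low the GPU stalls waiting for data; if the batch size exceeds GPU memory, training fails outright, which is why the thesis treats the pair jointly.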