TexasScholarWorks

    Efficient deep learning for sequence data

    View/Open
    ZHANG-DISSERTATION-2020.pdf (6.039 MB)
    Date
    2020-05
    Author
    Zhang, Jiong, Ph. D.
    ORCID: 0000-0003-3192-3281
    Abstract
    Deep learning has achieved great success in many sequence learning tasks such as machine translation, speech recognition, and time series prediction. Powerful deep sequence learning models, including recurrent neural networks (RNNs) and Transformers, have tremendous expressive power to fit very complex functions. However, they sometimes cannot be applied to real-world scenarios due to a lack of efficiency. On one hand, deep learning models usually have millions of parameters and require computationally intensive algorithms to train. This leads to tediously long training processes, even with the most powerful hardware. On the other hand, capturing long-term dependencies within a sequence remains a contemporary challenge for most deep architectures. To overcome these challenges, we develop a series of methods to improve the efficiency of these deep learning architectures. In particular, we make the following contributions: (1) We propose methods to solve the vanishing and exploding gradient issues that arise in RNNs. These methods enable capturing dependencies over longer ranges by exploiting the orthogonality of Householder matrices or the expressive power of the Fourier basis; (2) We develop a GPU-efficient training algorithm to improve the hardware efficiency of the proposed recurrent architectures with advanced linear algebra tools. The GPU-efficient algorithm achieves training speed similar to that of vanilla RNNs while allowing explicit management of recurrent memories; (3) To solve the scalability issue of the self-attentional Transformer models, we design a dynamic training scheme called AutoAssist and an advanced Transformer model with memory summarization (Transformer-FS). We show that the proposed AutoAssist pipeline can save up to 40% of SGD updates and that Transformer-FS can capture long-term dependencies with fewer additional memory cells.
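    The abstract's first contribution relies on the orthogonality of Householder matrices to address vanishing and exploding gradients in RNNs. As a minimal illustration (not code from the dissertation, and all names here are purely illustrative), the NumPy sketch below builds a recurrent weight matrix from a product of Householder reflections and checks that it is orthogonal, and therefore norm-preserving, which is the property that keeps backpropagated gradients from shrinking or blowing up over long ranges.

        # Minimal sketch: a Householder reflection H = I - 2 v v^T / ||v||^2 is orthogonal,
        # so a recurrent transition built from a product of such reflections preserves the
        # hidden-state norm, the mechanism the abstract's first contribution exploits.
        import numpy as np

        def householder(v):
            """Return the Householder reflection I - 2 v v^T / ||v||^2 (an orthogonal matrix)."""
            v = v / np.linalg.norm(v)
            return np.eye(len(v)) - 2.0 * np.outer(v, v)

        rng = np.random.default_rng(0)
        d = 8  # hidden-state dimension, chosen only for illustration

        # Recurrent weight matrix as a product of a few Householder reflections.
        W = np.linalg.multi_dot([householder(rng.standard_normal(d)) for _ in range(4)])

        # Orthogonality check: W^T W = I, hence ||W h|| = ||h|| for any hidden state h.
        print(np.allclose(W.T @ W, np.eye(d)))                               # True
        h = rng.standard_normal(d)
        print(np.isclose(np.linalg.norm(W @ h), np.linalg.norm(h)))          # True

    Because products of orthogonal matrices remain orthogonal, stacking more reflections adds expressive power without sacrificing the norm-preserving property.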
    Department
    Computational Science, Engineering, and Mathematics
    Subject
    Machine learning
    Deep neural networks
    URI
    https://hdl.handle.net/2152/83141
    http://dx.doi.org/10.26153/tsw/10140
    Collections
    • UT Electronic Theses and Dissertations
