Deep neural networks with contextual probabilistic units

Date

2021-04-30

Authors

Fan, Xinjie

Abstract

Deep neural networks (NNs) have become ubiquitous and achieved state-of-the-art results in a wide variety of research fields. Unlike traditional machine learning techniques, which require hand-crafted feature extractors to transform raw data, deep learning methods automatically learn useful representations from the data. Despite this success, many challenges remain. In this thesis, we propose new contextual probabilistic units to make progress along three directions in deep learning: uncertainty estimation, generalization, and optimization.

Unlike traditional probabilistic models that learn a distribution over predictions, deep learning models, composed of deterministic mappings, typically give only point estimates and lack a sense of uncertainty. Dropout is an effective probabilistic unit for estimating uncertainty in neural networks, but the quality of the uncertainty estimates depends heavily on the dropout probabilities. Existing methods treat dropout probabilities as global parameters shared across all data samples. We introduce contextual dropout, a sample-dependent dropout module that parameterizes dropout probabilities as a function of the input covariates. This generalization greatly enhances a neural network's capacity to model uncertainty and helps bridge the gap between traditional probabilistic models and deep neural networks.
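As a concrete illustration, the sketch below shows how dropout probabilities can be computed from the input covariates rather than shared globally. It is a minimal PyTorch sketch: the one-layer `dropout_net`, the relaxed-Bernoulli (concrete) mask used for differentiability, and the temperature value are illustrative assumptions, not the thesis's exact architecture.

```python
import torch
import torch.nn as nn

class ContextualDropout(nn.Module):
    """Sample-dependent dropout: dropout probabilities are a function
    of the input covariates rather than a shared global parameter."""

    def __init__(self, dim, temperature=0.1):
        super().__init__()
        # Small network mapping each input to per-unit dropout probabilities.
        self.dropout_net = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.temperature = temperature

    def forward(self, x):
        keep_prob = 1.0 - self.dropout_net(x)  # per-sample, per-unit
        if self.training:
            # Relaxed Bernoulli (concrete) mask keeps sampling differentiable.
            u = torch.rand_like(keep_prob).clamp(1e-6, 1 - 1e-6)
            logits = (keep_prob.log() - (1 - keep_prob).log()
                      + u.log() - (1 - u).log())
            mask = torch.sigmoid(logits / self.temperature)
            return x * mask
        # At test time, scale by the expected mask instead of sampling.
        return x * keep_prob
```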

To obtain uncertainty estimates for attention neural networks, we propose Bayesian attention modules, in which the attention weights are derived from continuous latent alignment variables that depend on contextual information and are learned in a probabilistic manner. The whole training process is made differentiable via the reparameterization trick. Our method captures complicated probabilistic dependencies and yields better uncertainty estimates than previous methods while maintaining scalability.
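The sketch below illustrates one way such stochastic attention can be made differentiable via reparameterization: the unnormalized attention scores parameterize a lognormal-style distribution whose samples are normalized with a softmax. The noise scale `sigma` is an illustrative assumption, and the learned prior and KL regularization that a full Bayesian treatment would include are omitted.

```python
import torch
import torch.nn.functional as F

def stochastic_attention(q, k, v, sigma=0.5, n_samples=4):
    """Attention with stochastic weights: treat the unnormalized scores
    as the location of a reparameterizable lognormal-style distribution,
    sample, then normalize per query. Minimal sketch only."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5        # (..., L_q, L_k)
    # Reparameterized samples: log-weights = scores + sigma * noise.
    eps = torch.randn((n_samples,) + scores.shape, device=scores.device)
    log_w = scores + sigma * eps
    attn = F.softmax(log_w, dim=-1)                    # normalize per query
    out = attn @ v                                     # (n_samples, ..., L_q, d_v)
    # The mean over samples is the prediction; the spread across samples
    # provides a simple uncertainty signal.
    return out.mean(dim=0), attn
```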

Deep NNs learn representations from data implicitly, making them prone to learning features that do not generalize across domains. We study how transferring training-domain statistics to the test domain through normalization layers affects domain generalization. We propose a novel normalization approach that learns both the standardization and rescaling statistics via neural networks, transforming input features into useful contextual statistics. This new form of normalization can be viewed as a general form of the traditional normalizations. Because the statistics are learned to adapt to data from different domains, the model generalizes better across domains.
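A minimal sketch of the idea follows, assuming an image-shaped input and toy one-layer networks `stat_net` and `rescale_net` that predict per-channel statistics from a pooled context vector; the actual architecture in the thesis may differ.

```python
import torch
import torch.nn as nn

class AdaptiveNorm(nn.Module):
    """Normalization whose standardization and rescaling statistics are
    predicted from the input itself, instead of being fixed batch or
    domain statistics. Minimal sketch with illustrative simplifications."""

    def __init__(self, channels, eps=1e-5):
        super().__init__()
        # Small nets map a pooled context vector to per-channel statistics.
        self.stat_net = nn.Linear(channels, 2 * channels)     # -> (mu, log_var)
        self.rescale_net = nn.Linear(channels, 2 * channels)  # -> (gamma, beta)
        self.eps = eps

    def forward(self, x):                  # x: (N, C, H, W)
        ctx = x.mean(dim=(2, 3))           # per-sample context vector, (N, C)
        mu, log_var = self.stat_net(ctx).chunk(2, dim=1)
        gamma, beta = self.rescale_net(ctx).chunk(2, dim=1)
        mu = mu[:, :, None, None]
        std = (log_var.exp() + self.eps).sqrt()[:, :, None, None]
        x_hat = (x - mu) / std             # learned standardization
        # Learned, input-dependent rescaling replaces fixed affine params.
        return gamma[:, :, None, None] * x_hat + beta[:, :, None, None]
```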

Stochastic gradient descent has achieved great success in optimizing deterministic neural networks. However, standard backpropagation does not apply to neural networks with stochastic latent variables, and one often resorts to the REINFORCE gradient estimator, which suffers from high variance. We address this issue on challenging contextual categorical sequence generation tasks, where the learning signal is noisy and/or sparse and the search space is exponentially large. We adapt the augment-REINFORCE-swap-merge (ARSM) estimator to this setting, using correlated Monte Carlo rollouts to reduce gradient variance. Our methods significantly reduce gradient variance and consistently outperform related baselines.
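The sketch below does not implement ARSM itself, whose swap-and-merge construction over shared uniform noise is more involved; it shows the same underlying idea of reducing score-function gradient variance with correlated Monte Carlo rollouts, here via a leave-one-out baseline. The flat per-step `logits` and the task-supplied `reward_fn` are simplifying assumptions.

```python
import torch

def reinforce_loo(logits, reward_fn, n_rollouts=4):
    """Variance-reduced score-function gradient for categorical sequence
    generation. A simpler stand-in for ARSM: several rollouts are drawn
    from the same policy and each is judged against the mean reward of
    the others, a control variate that preserves unbiasedness.

    logits: (T, V) per-step categorical logits (a toy, non-autoregressive
    stand-in for a sequence model); reward_fn maps a sampled sequence of
    shape (T,) to a scalar reward."""
    dist = torch.distributions.Categorical(logits=logits)
    seqs = dist.sample((n_rollouts,))                     # (n_rollouts, T)
    rewards = torch.stack([reward_fn(s) for s in seqs])   # (n_rollouts,)
    # Leave-one-out baseline built from the correlated rollouts.
    baseline = (rewards.sum() - rewards) / (n_rollouts - 1)
    advantages = rewards - baseline
    log_probs = dist.log_prob(seqs).sum(dim=1)            # (n_rollouts,)
    # Surrogate loss whose gradient is the variance-reduced estimator.
    return -(advantages.detach() * log_probs).mean()
```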
