Nondeterminism as a reproducibility challenge for deep reinforcement learning
In recent years, deep neural networks have powered many successes in deep reinforcement learning (DRL) and artificial intelligence by serving as effective function approximators in high-dimensional domains. However, there are several difficulties in reproducing such successes. These difficulties have arisen from several factors, including researchers' limited access to compute power and a general lack of knowledge of implementation details that are critical for reproducing results successfully. Nondeterminism, though, is a reproducibility challenge that is perhaps less emphasized despite being particularly relevant in DRL. DRL algorithms tend to have high variance, largely because agents must learn from a nonstationary training distribution in the presence of additional sources of randomness that are absent from other machine learning paradigms. The high variance of DRL algorithms, combined with the low sample sizes used in research, makes it difficult to match reported results. As such, the ability to control for sources of nondeterminism is especially important for achieving reproducibility in DRL. If we are to maximize progress in DRL, we need research to be reproducible and verifiable, ensuring the validity of our claims. Reproducibility is a necessary prerequisite for improving upon or comparing algorithms, both of which are done frequently in DRL research. In this thesis, we take steps towards studying the impact of nondeterminism on two important pillars of DRL research: the reproducibility of results and the statistical comparison of algorithms.
We do so by (1) enabling deterministic training in DRL by identifying and controlling for all sources of nondeterminism present during training, (2) performing a sensitivity analysis that shows how these sources of nondeterminism can impact a DRL agent's performance and policy, and (3) showing how nondeterminism negatively impacts algorithm comparison in DRL and describing how deterministic training can mitigate this harm. We find that individual sources of nondeterminism, such as random network initialization, can affect an agent's performance substantially. We also find that the sample sizes currently used in DRL may not satisfactorily capture differences in performance between two algorithms. Lastly, we make available our deterministic implementation of deep Q-learning.
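The core idea behind contribution (1) — controlling every source of randomness a training run touches so that repeated runs are identical — can be illustrated with a minimal, hypothetical sketch. This is not the thesis implementation: a real DRL setup would additionally seed NumPy, the deep-learning framework (including CUDA-level determinism), and the environment; here only the Python standard library is used, and `seeded_rollout` is an invented toy stand-in for a training loop.

```python
# Minimal sketch (assumption-laden, not the thesis code): seed the only
# source of randomness a toy "training" loop uses, so two runs with the
# same seed produce identical trajectories.
import random


def seeded_rollout(seed, steps=5):
    """Run a toy loop whose randomness is fully controlled by `seed`."""
    rng = random.Random(seed)  # explicit RNG instead of the global one
    returns = []
    total = 0.0
    for _ in range(steps):
        action_noise = rng.gauss(0.0, 1.0)  # stand-in for exploration noise
        total += action_noise
        returns.append(total)
    return returns


# With all randomness seeded, runs are bit-identical:
assert seeded_rollout(42) == seeded_rollout(42)
# Different seeds expose run-to-run variance:
assert seeded_rollout(42) != seeded_rollout(7)
```

In practice the same principle extends to each source of nondeterminism the thesis identifies and controls: the harder part is enumerating all of them, not seeding any one of them.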
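The sample-size finding can also be made concrete with a small, hypothetical simulation (not data from the thesis): model each algorithm's per-seed return as a noisy draw, where one algorithm is truly better, and count how often the worse algorithm appears better at a given number of seeds. The means, noise scale, and function names below are illustrative assumptions.

```python
# Illustrative simulation (assumed numbers, not thesis results): with high
# run-to-run variance, a small number of seeds often ranks two algorithms
# incorrectly, and more seeds shrink that error rate.
import random
import statistics


def mean_return(rng, true_mean, noise_sd, n_seeds):
    """Average return over n_seeds independent simulated training runs."""
    return statistics.fmean(rng.gauss(true_mean, noise_sd) for _ in range(n_seeds))


def wrong_ranking_rate(n_seeds, trials=2000, seed=0):
    """Fraction of simulated experiments where the worse algorithm wins."""
    rng = random.Random(seed)
    wrong = 0
    for _ in range(trials):
        alg_a = mean_return(rng, true_mean=1.0, noise_sd=2.0, n_seeds=n_seeds)
        alg_b = mean_return(rng, true_mean=1.5, noise_sd=2.0, n_seeds=n_seeds)
        if alg_a > alg_b:  # the truly worse algorithm appears better
            wrong += 1
    return wrong / trials


# Few seeds flip the ranking often; larger samples make it far rarer.
assert wrong_ranking_rate(n_seeds=3) > wrong_ranking_rate(n_seeds=50)
```

Under these assumed parameters the ranking error at three seeds is substantial, which mirrors the thesis's observation that current DRL sample sizes may fail to capture real performance differences.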