Scalable smoothing algorithms for massive graph-structured data
Probabilistically modeling noisy data is a crucial step in virtually all scientific experiments and engineering pipelines. Recent years have seen the rise of several high-throughput techniques in science and a proliferation of cheap sensors in engineering. Together, these trends have produced massive datasets, many of which contain rich, problem-dependent dependencies within and between their observations. Classical ``scalable'' modeling procedures for common tasks such as hypothesis testing and conditional density estimation make the simplifying assumption that the data contain little or no underlying dependency structure. More sophisticated techniques that correct for latent correlations in the data have historically been applied only to small datasets, where computational complexity was not a consideration. This gap creates a clear need for scalable, dependency-aware methods in many areas of computational statistics.
To this end, we develop novel graph-based smoothing algorithms that form the foundations of three new methodologies for large-scale structured statistical inference: False Discovery Rate Smoothing (FDRS), Spatial Density Smoothing (SDS), and Smoothed Dyadic Partitioning (SDP). FDRS improves the power of classical multiple hypothesis testing when a dependency graph can be defined over the test sites. SDS provides a more sample-efficient marginal density estimator when a dependency graph is defined over multiple distributions, such as when samples are observed on a spatial grid. Finally, when the dependence is among the possible outcome values of a discrete conditional probability distribution, SDP leverages the structure of the outcome space to improve the accuracy of its predictions. We demonstrate the utility of our new procedures via a series of benchmarks and three real-world case studies: fMRI analysis with FDRS, detecting radiological anomalies with SDS, and generative modeling of image data with SDP. All code for FDR smoothing, spatial density smoothing, and smoothed dyadic partitioning is publicly available.