Distributed and dynamic factor modeling of online data
The domain of data mining and machine learning has expanded rapidly in recent years to include both large-scale distributed and streaming computation. Many open-source and cloud-based frameworks are available for these tasks, and many of them are used in production in industry. Even so, the technology landscape is evolving rapidly, and the gap has grown between academic algorithm development and discovery on the one hand and code available for use with real-world data on the other. In addition, although there is a rich history of mathematical models for streaming data on continuous vector spaces, there has been significantly less work on streaming data on discrete spaces. However, much if not most of the data available online is composed of high-dimensional sparse counts, such as text corpora and interaction networks. We attempt to help bridge this gap by extending promising Bayesian Poisson factorization and co-factorization models that can be used, for example, to model not only text corpora but also related user interactions in a social network. We construct a dependent process prior that enables dynamic latent factor modeling in the natural probability space of the factors, rather than in the raw data. These models are then scaled to and implemented for distributed compute systems and streaming data. We develop an adaptive hashing method (AdaHash) for lambda architectures that can use latent factors calculated during periodic batch-mode updates as a similarity metric for hierarchical grouping, or for finding similar factors to reconcile parameters in a distributed compute scenario. In addition, we develop a novel Hidden Markov variant that uses particle filters to update prior factors and probabilistically group them with new factors in a dynamic inference model (D-GaPS). We show experimentally that the distributed model converges to factors similar to those found by single-process inference, and that the dynamic model yields higher-quality topics than batch-mode alternatives.
Empirical studies are presented on a U.S. Senate voting and bill summary data set whose latent factors are readily interpretable.
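
As a rough illustration of the kind of model the abstract builds on, the sketch below samples sparse counts from a basic gamma-Poisson factorization and then recovers nonnegative factors by maximum likelihood. It is not the thesis's method: the function names, hyperparameter values, and matrix sizes are hypothetical, and the fitting step uses standard multiplicative updates for the Poisson likelihood in place of Bayesian inference, the dependent process prior, AdaHash, or D-GaPS.

```python
import numpy as np
from math import lgamma


def sample_gamma_poisson(n_rows, n_cols, n_factors, shape=0.3, rate=1.0, seed=0):
    """Draw counts from y_ij ~ Poisson(sum_k theta_ik * beta_kj) with gamma factors.

    The shape and rate values are illustrative assumptions; a small shape
    gives sparse factors and therefore sparse counts.
    """
    rng = np.random.default_rng(seed)
    theta = rng.gamma(shape, 1.0 / rate, size=(n_rows, n_factors))
    beta = rng.gamma(shape, 1.0 / rate, size=(n_factors, n_cols))
    counts = rng.poisson(theta @ beta)
    return counts, theta, beta


def poisson_log_likelihood(counts, theta, beta, eps=1e-12):
    """Log-likelihood of the counts under Poisson rates theta @ beta."""
    rates = theta @ beta + eps
    log_factorial = np.vectorize(lgamma)(counts + 1.0)
    return float(np.sum(counts * np.log(rates) - rates - log_factorial))


def fit_poisson_mle(counts, n_factors, n_iters=100, seed=1, eps=1e-12):
    """Maximum-likelihood Poisson factorization via multiplicative updates.

    This is a swapped-in, non-Bayesian stand-in (equivalent to nonnegative
    matrix factorization under KL divergence), used only to keep the
    example short.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = counts.shape
    theta = rng.gamma(1.0, 1.0, size=(n_rows, n_factors)) + eps
    beta = rng.gamma(1.0, 1.0, size=(n_factors, n_cols)) + eps
    for _ in range(n_iters):
        # Update row factors, holding column factors fixed.
        ratio = counts / (theta @ beta + eps)
        theta *= (ratio @ beta.T) / (beta.sum(axis=1) + eps)
        # Update column factors, holding row factors fixed.
        ratio = counts / (theta @ beta + eps)
        beta *= (theta.T @ ratio) / (theta.sum(axis=0)[:, None] + eps)
    return theta, beta


if __name__ == "__main__":
    counts, _, _ = sample_gamma_poisson(n_rows=50, n_cols=40, n_factors=3)
    theta_hat, beta_hat = fit_poisson_mle(counts, n_factors=3)
    print("fraction of zero counts:", float(np.mean(counts == 0)))
    print("fitted log-likelihood:", poisson_log_likelihood(counts, theta_hat, beta_hat))
```

In a text setting, the rows would correspond to documents and the columns to vocabulary terms, so each row of `beta_hat` can be read as an unnormalized topic.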