Discovering Spatially Coherent Gene Modules from Spatial Transcriptomics Data





Larina, Maria
Singh, Salvi
Samee, Md. Abul Hassan




Spatial transcriptomics (ST) is an emerging technology that quantifies gene expression at spatial resolution in intact tissue sections. Although ST is enabling unprecedented studies of spatial gene expression, it poses new challenges for biological data science. A typical ST dataset contains measurements of ~20K genes in 50K-100K cells, and it is challenging to design efficient, scalable algorithms that generate new biological insights from data of this size. Here we present an efficient and scalable non-negative matrix factorization (NMF) algorithm for identifying “spatial gene modules” (spatial-gems), i.e., groups of genes that are expressed at spatially adjacent locations, in ST data. Spatial-gems are a fundamental aspect of multi-cellular organisms. NMF is suitable for this problem because, in theory, it can identify the “informative parts” that constitute a dataset, e.g., lips and eyes in human facial images and spatial-gems in ST data. The basic NMF formulation, however, can give sub-optimal results for spatial datasets: it ignores the spatial locations of data points and thus does not guarantee that the informative parts are spatially coherent. Graph-regularized NMF (GNMF) overcomes this issue by constraining the informative parts to comprise spatially adjacent data points. We introduce three changes that tailor the state-of-the-art GNMF algorithm to ST data. First, we statistically determine the optimal number of spatial-gems in an ST dataset. Second, we introduce regularizations that minimize the number of genes shared between spatial-gems. Third, we leverage numerical libraries and efficient data structures to obtain a scalable implementation. We benchmarked our GNMF against alternative algorithms on a brain ST dataset. Our algorithm comprehensively charted the spatial-gems in this dataset with a 20x speedup in execution time, making it an attractive tool for large-scale ST consortia such as HuBMAP (Human BioMolecular Atlas Program).
This tool and our multifaceted approach to enhancing efficiency and scalability will be of major interest to the broad user base of TACC.
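
To make the graph-regularization idea concrete, the following is a minimal sketch of standard GNMF with multiplicative updates. It is not the implementation described in this abstract: it omits the statistical choice of the number of spatial-gems and the regularization that limits gene sharing between modules, and it uses dense NumPy arrays for brevity where a scalable version would use sparse structures. The function names (`knn_affinity`, `gnmf`, `gnmf_objective`), the k-nearest-neighbour graph, and all parameter defaults are assumptions made for illustration. The expression matrix X is taken to be genes by locations, so that U holds gene loadings per module and V holds module activity per location.

```python
import numpy as np

def knn_affinity(coords, k=6):
    """Symmetric 0/1 k-nearest-neighbour affinity matrix over spatial locations.

    Assumption: spatial adjacency is modelled with a kNN graph (requires k < n).
    """
    n = coords.shape[0]
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)              # a location is not its own neighbour
    nbrs = np.argsort(d2, axis=1)[:, :k]      # indices of the k closest locations
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    return np.maximum(W, W.T)                 # symmetrise the graph

def gnmf_objective(X, U, V, W, lam):
    """||X - U V^T||_F^2 + lam * trace(V^T L V), with graph Laplacian L = D - W."""
    D = np.diag(W.sum(axis=1))
    recon = np.linalg.norm(X - U @ V.T, "fro") ** 2
    smooth = np.trace(V.T @ (D - W) @ V)
    return recon + lam * smooth

def gnmf(X, W, n_modules, lam=1.0, n_iter=200, eps=1e-10, seed=0):
    """Factor a non-negative X (genes x locations) as U @ V.T.

    The graph term pulls the rows of V for neighbouring locations together,
    which is what encourages spatially coherent modules.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_locs = X.shape
    U = rng.random((n_genes, n_modules))      # gene loadings per module
    V = rng.random((n_locs, n_modules))       # module activity per location
    D = np.diag(W.sum(axis=1))
    history = []
    for _ in range(n_iter):
        # Multiplicative updates keep U and V non-negative.
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U + lam * (W @ V)) / (V @ (U.T @ U) + lam * (D @ V) + eps)
        history.append(gnmf_objective(X, U, V, W, lam))
    return U, V, history
```

With `lam=0` the updates reduce to basic NMF, so comparing the two settings on the same data is one simple way to see the effect of the spatial constraint. In this sketch, the genes with the largest entries in a column of U would be read as the members of that module.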

