Developing and Using Reference Datasets to Support Reproducible, Big Data Neuroscience
Abstract
"Advancements in data collection in neuroimaging have ushered in an “Age of Big Data” in neuroscience(Fair et al., 2021; Poldrack & Gorgolewski, 2014; Webb-Vargas et al., 2017). With the growing size of neuroimaging datasets(Alexander et al., 2017; Casey et al., 2018; Sudlow et al., 2015), and the continued persistence of the Replicability Crisis in neuroscience(Tackett et al., 2019), data quality assurance becomes a challenge requiring new approaches for quality assurance at scale. The traditional methods for QA do not scale well. More specifically, the gold standard for QA requires a combination of visual inspection of each individual data derivative, and complex reports that require expertise and time (fMRIPrep, Freesurfer, QSIPrep, Fibr)(Cieslak et al., 2021; Dale et al., 1999; Esteban et al., 2019; Jenkinson et al., 2012; Richie-Halford et al., 2022). Some attempts have been made to approach this at scale(Richie-Halford et al., 2022), however few approaches exist to bridge the gap between community-based visual inspection and expertise-required technical reports. To address this gap, we propose a data-driven approach that uses the natural statistics and variability of large datasets and provides a reference whose variability in value can be compared against. To do this, we processed over 2,000 individual brains from 3 large-scale, open datasets using TACC supercomputers (i.e. PING(Jernigan et al., 2016), HCP(Van Essen et al., 2012), CAMCAN(Shafto et al., 2014)), across multiple imaging modalities and statistical brain properties. For each brain property and dataset, distributions were computed, statistical outliers were removed, and the cleaned distributions were released via brainlife.io(Avesani et al., 2019). The goal of this work is to provide the greater community with tools to perform efficient, automated, data-drive quality assurance, ultimately allowing for the scaling up and increasing of value of large scale datasets processed on supercomputers."