Learning and validating clinically meaningful phenotypes from electronic health data
MetadataShow full item record
The ever-growing adoption of electronic health records (EHR) to record patients' health journeys has resulted in vast amounts of heterogeneous, complex, and unwieldy information [Hripcsak and Albers, 2013]. Distilling this raw data into clinical insights presents great opportunities and challenges for the research and medical communities. One approach to this distillation is called computational phenotyping. Computational phenotyping is the process of extracting clinically relevant and interesting characteristics from a set of clinical documentation, such as that which is recorded in electronic health records (EHRs). Clinicians can use computational phenotyping, which can be viewed as a form of dimensionality reduction where a set of phenotypes form a latent space, to reason about populations, identify patients for randomized case-control studies, and extrapolate patient disease trajectories. In recent years, high-throughput computational approaches have made strides in extracting potentially clinically interesting phenotypes from data contained in EHR systems. Tensor factorization methods have shown particular promise in deriving phenotypes. However, phenotyping methods via tensor factorization have the following weaknesses: 1) the extracted phenotypes can lack diversity, which makes them more difficult for clinicians to reason about and utilize in practice, 2) many of the tensor factorization methods are unsupervised and do not utilize side information that may be available about the population or about the relationships between the clinical characteristics in the data (e.g., diagnoses and medications), and 3) validating the clinical relevance of the extracted phenotypes requires domain training and expertise. This dissertation addresses all three of these limitations. First, we present tensor factorization methods that discover sparse and concise phenotypes in unsupervised, supervised, and semi-supervised settings. Second, via two tools we built, we show how to leverage domain expertise in the form of publicly available medical articles to evaluate the clinical validity of the discovered phenotypes. Third, we combine tensor factorization and the phenotype validation tools to guide the discovery process to more clinically relevant phenotypes.