Outcome prediction and structure discovery in healthcare data
The growing use of electronic medical records, advances in data mining and machine learning, and the continually increasing cost of healthcare in the United States create a need for algorithmic solutions with the potential to improve patient care and reduce healthcare costs. Such algorithms can identify the most relevant parameters for predicting adverse events, reveal underlying physiological mechanisms of disease, and estimate the likelihood of complications that may lead to rehospitalization of discharged patients. Key limitations of computational tools that are currently used in healthcare, or that could greatly benefit the healthcare system, can be overcome by methods that allow for soft constraints or promote smoothness. In this dissertation, we develop three main algorithms that incorporate softness or smoothness in the constraints or the solution, and we demonstrate applications in diverse aspects of healthcare with the potential to greatly reduce healthcare costs. We first develop an outcome prediction algorithm that preserves the clinical knowledge gained from the development of additive risk scores with hard thresholds (of the form "add p points if variable x is above/below threshold t"). This novel method is not only easily optimizable for different patient sub-populations but also reveals clinically interpretable information, such as the maximum contribution of a physiologic variable to the risk score and the range of values over which risk increases. We then turn to overcoming limitations in two clustering settings. In a semi-supervised setting, where pairwise constraints (relationships between pairs of points) are available, we develop an algorithm capable of performing accurate clustering under noisy constraints. This is achieved via soft constraints that impose a penalty on the objective when violated.
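As a rough illustration of what a soft pairwise constraint can look like, the sketch below adds a fixed penalty to a k-means-style distortion for every violated must-link or cannot-link pair. It is not the algorithm developed in the dissertation: the function name `penalized_objective`, the k-means distortion, the single `weight` parameter, and the toy data are all assumptions made for this example.

```python
import numpy as np

def penalized_objective(X, labels, centers, must_link, cannot_link, weight=1.0):
    """k-means-style distortion plus a penalty per violated pairwise constraint.

    Hypothetical helper for illustration only; the penalty form is an assumption.
    """
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(labels)
    # Sum of squared distances from each point to its assigned center.
    distortion = float(np.sum((X - centers[labels]) ** 2))
    # A must-link pair is violated when the two points are in different clusters.
    ml_violations = sum(int(labels[i] != labels[j]) for i, j in must_link)
    # A cannot-link pair is violated when the two points share a cluster.
    cl_violations = sum(int(labels[i] == labels[j]) for i, j in cannot_link)
    return distortion + weight * (ml_violations + cl_violations)

# Toy data: two well-separated groups and one noisy must-link constraint (0, 3).
X = [[0, 0], [0, 1], [5, 5], [5, 6]]
centers = [[0, 0.5], [5, 5.5]]
must_link = [(0, 1), (0, 3)]   # (0, 3) contradicts the geometry
cannot_link = [(1, 2)]

# Natural clustering: violates the noisy constraint once.
natural = penalized_objective(X, [0, 0, 1, 1], centers, must_link, cannot_link, weight=2.0)
# Clustering that obeys every constraint: pays a large distortion instead.
forced = penalized_objective(X, [0, 0, 1, 0], centers, must_link, cannot_link, weight=2.0)
```

In this toy case the natural clustering scores lower than the constraint-satisfying one, so a minimizer would tolerate the noisy constraint instead of distorting the clusters to obey it. A hard-constraint formulation would not have that option.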
Finally, we examine the scenario where clustering data are available at multiple points in time under the assumption of temporal smoothness, i.e., data points are more likely to remain in the same cluster than to change cluster membership between consecutive time steps. In this setting, we develop an evolutionary clustering algorithm that automatically infers the number of clusters at each time step and matches the clusters across time steps while finding a global clustering solution. The proposed schemes outperform existing methods on benchmark and non-healthcare datasets as well as in the tasks of mortality prediction from clinical data and breast cancer metastasis prediction from gene expression data. As an additional healthcare application, we use our proposed evolutionary clustering algorithm to study the evolution of health plan clusters inferred from medication adherence data and provide a detailed analysis of the resulting clusters.
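The temporal smoothness assumption can be made concrete by counting how many points change cluster between consecutive time steps. The sketch below is only an illustration of that quantity, not the evolutionary clustering algorithm itself: the function name `membership_changes` and the example labels are hypothetical, and the code assumes cluster labels have already been matched across time steps.

```python
import numpy as np

def membership_changes(labels_over_time):
    """Count points that switch cluster between each pair of consecutive steps.

    labels_over_time: array-like of shape (T, n), one row of cluster labels per
    time step. Assumes labels are already aligned across time (an assumption of
    this sketch).
    """
    L = np.asarray(labels_over_time)
    # Compare each time step with the next one, then count mismatches per step.
    return (L[1:] != L[:-1]).sum(axis=1)

# Hypothetical labels for 4 points over 3 time steps.
labels = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],  # the last point moves from cluster 1 to cluster 0
    [0, 0, 1, 0],  # no further changes
]
changes = membership_changes(labels)
```

Under temporal smoothness, these counts are expected to be small relative to the number of points. A smoothness term in a clustering objective could, for instance, penalize their sum.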