Reliable predictions for structured and corrupted data


The growing use of machine learning models has spurred the use of diverse datasets that are collated, processed, and analyzed in various ways. To facilitate storage and analysis, data is often stored in a structured format, characterized by an organizational structure or by specific constraints on certain features. Such implicit or explicit constraints impose extra considerations when building predictive models, and current methodologies struggle to capture the complex relationships inherent in the data. Factors like measurement errors, faulty equipment, or adversarial attacks can also corrupt training data, making it challenging to achieve high performance. This thesis presents several approaches that provide reliable predictions from data in a variety of complex formats. First, a probabilistic quantile forecasting framework is introduced to tackle the challenges of forecasting large-scale time series that are subject to hierarchical or grouped constraints. This framework reconciles time series across aggregation levels, taking into account any imposed constraints. It also dynamically combines heterogeneous forecasting models customized for different time series. Additionally, a multilevel clustering approach is proposed to reduce the computational cost of producing a very large number of forecasts. The next set of contributions consists of novel interpretable and robust machine learning approaches that ensure trustworthy inferences are drawn from corrupted data. These include counterfactual explanations and strategies to guard against outliers and adversarial examples, guarantees of monotonicity for neural networks, and robust estimation for datasets with missing values.
Finally, a conformal prediction method with conditional coverage guarantees in the asymptotic limit is introduced to provide adaptive and informative prediction intervals for heterogeneous data, free of distributional assumptions. Collectively, these contributions strengthen our ability to provide reliable predictions for data with complex structures or quality issues. They also have potential applications in sectors such as healthcare and finance.
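
For readers unfamiliar with conformal prediction, the following is a minimal sketch of the standard split conformal procedure for regression. It is not the method developed in the thesis: this baseline yields fixed-width intervals with marginal coverage only, whereas the thesis targets adaptive intervals with asymptotic conditional coverage. The function name `split_conformal_interval`, its parameters, and the absolute-residual score are illustrative assumptions.

```python
import math


def split_conformal_interval(cal_preds, cal_targets, new_pred, alpha=0.1):
    """Illustrative split conformal interval (hypothetical helper).

    cal_preds / cal_targets: model predictions and true values on a held-out
    calibration set. new_pred: the point prediction for a new input.
    Returns (lower, upper) with marginal coverage of at least 1 - alpha,
    assuming calibration and test points are exchangeable.
    """
    # Nonconformity score: absolute residual on the calibration set.
    scores = sorted(abs(y - p) for p, y in zip(cal_preds, cal_targets))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for the requested level:
        # only the trivial interval is valid.
        return (float("-inf"), float("inf"))
    q = scores[k - 1]
    return (new_pred - q, new_pred + q)


# Toy calibration set: predictions of 0.0 against targets 1..9.
cal_preds = [0.0] * 9
cal_targets = list(range(1, 10))
print(split_conformal_interval(cal_preds, cal_targets, new_pred=5.0, alpha=0.5))
# prints (0.0, 10.0)
```

Because the same quantile `q` is used for every input, the interval width does not adapt to the input. Making intervals adaptive for heterogeneous data is the gap the thesis's conditional-coverage method is described as addressing.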

