Data reduction methods for human decision making and learning
The rapidly increasing size of data is becoming a major challenge for both humans and machines to process. While more data means more information, less uncertainty, and consequently better performance, it also means more processing time and more storage. This motivated my thesis, which centers on reducing the size of the data shown to humans, as well as the data fed into machine learning algorithms, without compromising performance.

First, in the context of human decision making, we aim to reduce and reorder the data shown to a human subject so as to enhance their decision performance. We propose a statistical model of human decision making that incorporates cognitive biases, and an algorithm that constructs, in polynomial time, an ordered subset of the data such that the human's performance approximately matches the optimal performance.

Second, we propose an algorithm for selecting a subset of the training data on which to train an SVM. The algorithm maximizes a submodular set function that captures the diversity and relevance of the candidate subset, while providing performance guarantees. We then propose an algorithm for selecting a weighted subset of the training data; the weighted subset is built from a maximal independent set of the graph induced by the approximate nearest-neighbor structure of the dataset.

Third, we propose two algorithms for online selective training of neural networks. The first picks batches that maximize the reduction in the entropy of the estimator; the second constructs batches containing only datapoints whose predicted probabilities fall below a threshold. Both approaches preserve the epoch-based framework of neural network training and make selection decisions based on up-to-date values.
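To illustrate the second contribution, the following is a minimal sketch of greedy maximization of a monotone submodular set function under a cardinality constraint, the standard approach that enjoys a (1 - 1/e) approximation guarantee. The particular objective here (a facility-location diversity term plus a norm-based relevance term) is a hypothetical surrogate, not the exact function used in the thesis.

```python
import numpy as np

def greedy_submodular_subset(X, k, alpha=0.5):
    """Greedily pick k training points maximizing a monotone submodular
    score mixing diversity (facility location over inner-product
    similarities) and relevance (feature norm, as an illustrative proxy)."""
    n = X.shape[0]
    sim = X @ X.T                     # pairwise similarity matrix
    relevance = np.linalg.norm(X, axis=1)
    selected = []
    coverage = np.zeros(n)            # best similarity to the selected set, per point
    for _ in range(k):
        # marginal gain of adding each candidate j to the current set:
        # increase in total coverage, plus its relevance contribution
        gains = np.maximum(sim, coverage).sum(axis=1) - coverage.sum() \
                + alpha * relevance
        gains[selected] = -np.inf     # never re-pick a selected point
        j = int(np.argmax(gains))
        selected.append(j)
        coverage = np.maximum(coverage, sim[j])
    return selected
```

The greedy loop adds, at each step, the point with the largest marginal gain; because the objective is monotone submodular, this simple rule is what yields the approximation guarantee mentioned above.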
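The second online selection rule for neural networks can be sketched as a simple filtering step: keep only the datapoints on which the current model is still unsure. The function below is a minimal, hypothetical illustration; the probability matrix, the threshold value, and the batch-size cap are all assumptions for the sketch.

```python
import numpy as np

def select_batch_by_threshold(probs, labels, batch_size, threshold=0.9):
    """Build a training batch from points whose predicted probability for
    the true class is below `threshold`, i.e. points the model has not
    yet learned confidently. `probs` is (n, num_classes); `labels` is (n,)."""
    p_true = probs[np.arange(len(labels)), labels]   # per-point confidence on true class
    uncertain = np.where(p_true < threshold)[0]      # indices still worth training on
    return uncertain[:batch_size]
```

Because the filter is applied when each batch is formed, the selection uses up-to-date model predictions and fits naturally inside an ordinary epoch loop.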