Creating diverse ensemble classifiers to reduce supervision
MetadataShow full item record
Ensemble methods like Bagging and Boosting which combine the decisions of multiple hypotheses are some of the strongest existing machine learning methods. The diversity of the members of an ensemble is known to be an important factor in determining its generalization error. In this thesis, we present a new method for generating ensembles, DECORATE (Diverse Ensemble Creation by Oppositional Relabeling of Artificial Training Examples), that directly constructs diverse hypotheses using additional artificially-generated training examples. The technique is a simple, general meta-learner that can use any strong learner as a base classifier to build diverse committees. The diverse ensembles produced by DECORATE are very effective for reducing the amount of supervision required for building accurate models. The first task we demonstrate this on is classification given a fixed trainvii ing set. Experimental results using decision-tree induction as a base learner demonstrate that our approach consistently achieves higher predictive accuracy than the base classifier, Bagging and Random Forests. Also, DECORATE attains higher accuracy than Boosting on small training sets, and achieves comparable performance on larger training sets. Additional experiments demonstrate DECORATE’s resilience to imperfections in data, in the form of missing features, classification noise, and feature noise. DECORATE ensembles can also be used to reduce supervision through active learning, in which the learner selects the most informative examples from a pool of unlabeled examples, such that acquiring their labels will increase the accuracy of the classifier. Query by Committee is one effective approach to active learning in which disagreement within the ensemble of hypotheses is used to select examples for labeling. Query by Bagging and Query by Boosting are two practical implementations of this approach that use Bagging and Boosting respectively, to build the committees. For efficient active learning it is critical that the committee be made up of consistent hypotheses that are very different from each other. Since DECORATE explicitly builds such committees, it is well-suited for this task. We introduce a new algorithm, ACTIVEDECORATE, which uses DECORATE committees to select good training examples. Experimental results demonstrate that ACTIVEDECORATE typically requires labeling fewer examples to achieve the same accuracy as Query by Bagging and Query by Boosting. Apart from optimizing classification accuracy, in many applications, producing good class probability estimates is also important, e.g., in fraud detection, which has unequal misclassification costs. This thesis introduces a novel approach to active learning based on ACTIVEDECORATE which uses Jensen-Shannon divergence (a similarity measure for probability distributions) to improve the selection of training examples for optimizing probability estimation. Comprehensive experimental results demonstrate the benefits of our approach. Unlike the active learning setting, in many learning problems the class labels for all instances are known, but feature values may be missing and can be acquired at a cost. For building accurate predictive models, acquiring complete information for all instances is often quite expensive, while acquiring information for a random subset of instances may not be optimal. We formalize the task of active feature-value acquisition, which tries to reduce the cost of achieving a desired model accuracy by identifying instances for which obtaining complete information is most informative. We present an approach, based on DECORATE, in which instances are selected for acquisition based on the current model’s accuracy and its confidence in the prediction. Experimental results demonstrate that our approach can induce accurate models using substantially fewer feature-value acquisitions than random sampling.