Learning with positive and unlabeled examples
Developing partially supervised models is becoming increasingly relevant in the context of modern machine learning applications, where supervision often comes at a cost. In particular, there are several application domains where the available training data consists only of positive and unlabeled examples (no negative examples). One motivating application in computational biology is that of predicting genes linked to human genetic disorders, where we do *not* have access to "negative" gene-disease associations but only a few positive associations. Existing methods for supervised learning (i.e., where the learner has access to both positive and negative examples) do not always work when labeled training examples are available for only one class. In this thesis, we study various machine learning problems with positive-unlabeled (PU) supervision and develop methods for the corresponding *PU learning* problems. We show that by reducing PU learning to learning with "one-sided label noise", one can obtain a family of methods applicable to diverse problems including binary classification, multi-label learning, matrix completion and multiple-instance learning. The benefits of such a reduction are twofold: (1) we can essentially reuse algorithms for supervised learning, albeit with appropriate modifications to account for the partial supervision; (2) the resulting problem formulations are amenable to analysis, leading to strong theoretical guarantees on the performance of the proposed methods in PU learning tasks. Finally, we consider performance measures widely used in PU learning applications beyond traditional measures such as classification accuracy, and extend some of the guarantees to these general performance measures.
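To make the one-sided-label-noise view concrete, here is a minimal sketch of one well-known instance of this reduction: the "selected completely at random" (SCAR) correction of Elkan and Noto (2008). Unlabeled examples are treated as noisy negatives, a scorer is fit to predict the observed label s, and its output is rescaled by an estimate of the labeling frequency c = p(s=1 | y=1) to recover p(y=1 | x). The synthetic Gaussian data, the histogram scorer, and all constants below are illustrative assumptions for this sketch, not the specific methods developed in the thesis.

```python
import random

random.seed(0)

n = 4000
c_true = 0.3           # SCAR labeling frequency: p(s=1 | y=1)

xs, ys, ss = [], [], []
for _ in range(n):
    y = random.random() < 0.5                  # true class (hidden in the PU setting)
    x = random.gauss(2.0 if y else -2.0, 1.0)  # positives ~ N(2,1), negatives ~ N(-2,1)
    s = y and (random.random() < c_true)       # observed label: positive AND selected
    xs.append(x); ys.append(y); ss.append(s)

# Step 1: estimate g(x) ~ p(s=1 | x) with a simple histogram scorer trained on
# positive-vs-unlabeled data -- the one-sided-noise view: every unlabeled
# example is treated as a (noisy) negative.
lo, hi, nbins = -6.0, 6.0, 24

def bin_of(x):
    return min(max(int((x - lo) / (hi - lo) * nbins), 0), nbins - 1)

counts = [0] * nbins
labeled = [0] * nbins
for x, s in zip(xs, ss):
    b = bin_of(x)
    counts[b] += 1
    labeled[b] += s
g = [labeled[b] / counts[b] if counts[b] else 0.0 for b in range(nbins)]

# Step 2: under SCAR, p(s=1|x) = c * p(y=1|x), so estimate c as the average
# score over labeled positives and rescale: p(y=1|x) ~ g(x) / c_hat.
pos_scores = [g[bin_of(x)] for x, s in zip(xs, ss) if s]
c_hat = sum(pos_scores) / len(pos_scores)

preds = [g[bin_of(x)] / c_hat >= 0.5 for x in xs]
accuracy = sum(p == y for p, y in zip(preds, ys)) / n
```

Although only ~30% of positives carry a label, the rescaled scores recover a decision rule close to the fully supervised one on this data, which is the kind of behavior the theoretical guarantees in the thesis formalize for more general models and performance measures.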