Fine-grained visual comparisons
Beyond recognizing objects, a computer vision system ought to be able to compare them. A promising way to represent visual comparisons is through attributes, which are mid-level properties that appear across category boundaries. The ability to compare attributes opens up new opportunities in areas such as online shopping, object recognition, and human biometrics, where a relative decision is often more informative than its binary counterpart. For example, given two human faces, a decision that one face is smiling more than the other may be more informative -- and even more appropriate -- than a hard yes or no decision.
In this thesis, I explore the task of fine-grained visual comparisons, or relative attributes. Given two images, we want to predict which exhibits a particular visual attribute more than the other. Specifically, I explore the scenario where the images exhibit subtle -- thus fine-grained -- visual differences. I propose improvements on two fronts, through the algorithms and the source data, to target these fine-grained comparison tasks that standard models fail to handle.
On the algorithmic front, existing relative attribute methods rely exclusively on global ranking functions. However, rarely will the visual cues relevant to a comparison be constant for all data, nor will humans' perception of the attribute necessarily permit a global ordering. Furthermore, not every image pair is even orderable for a given attribute. Attempting to map relative attribute ranks to "equality" predictions is non-trivial, particularly since the span of indistinguishable pairs in attribute space may vary in different parts of the feature space. To address these issues, we introduce local learning approaches for fine-grained visual comparisons, where a predictive model is trained on-the-fly using only the data most relevant to the novel input. In particular, given a novel pair of images, we develop local learning methods to (1) infer their relative attribute ordering with a ranking function trained using only analogous labeled image pairs, (2) infer the optimal "neighborhood", i.e., the subset of the training instances most relevant for training a given local model, and (3) infer whether the pair is even distinguishable, based on a local model for just noticeable differences in attributes.
On the source data front, we address the sparsity of supervision issue that affects all ranking algorithms for fine-grained tasks. Due to the pairwise nature of the supervision labels, the space of all possible comparisons is quadratic with respect to the total number of images. Even if we could hypothetically obtain complete supervision, we still cannot ensure sufficient diversity of fine-grained differences, at least not using only the provided real images. The problem is that we lack a direct way to curate photos demonstrating all sorts of subtle attribute changes. We propose to overcome this challenge using synthetic images that are conditionally generated based on the strength of a set of attributes. Building on a state-of-the-art image generation engine, we generate pairs of training images both passively and actively. In the passive case, we sample pairs of pre-generated training images exhibiting slight modifications of individual attributes. The proposed "semantic jitter" approach can be seen as a new form of data augmentation where training samples with subtly different attributes are automatically created. In the active case, we jointly learn the attribute ranking task while also learning to generate realistic image pairs that will benefit that task. We introduce an end-to-end framework that dynamically "imagines" image pairs that would confuse the current model, presents them to human annotators for labeling, then improves the predictive model with the new examples. Whether generated actively or passively, we augment real training image pairs with these generated pairs, and then train attribute ranking models to predict the relative strength of an attribute in novel pairs of real images. Our results demonstrate the effectiveness of bootstrapping imperfect image generators to counteract supervision sparsity in learning-to-rank models.
Overall, our proposed approaches outperform state-of-the-art baselines for relative attribute prediction on challenging datasets, including UT-Zap50K, a large new shoe dataset curated specifically for fine-grained comparison tasks. We find that for fine-grained comparisons, performance is optimized when the algorithm works in conjunction with the data source. In this thesis, the optimal pipeline functions by first densifying the attribute space through generating the "right" data, and then applying fine-grained algorithms that leverage and learn from these "right" data.
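
As a rough illustration of the local learning idea summarized above, the sketch below trains a ranker on the fly from only the labeled pairs nearest to a novel pair. It is a toy under stated assumptions, not the method developed in the thesis: the 2-D feature vectors and TRAIN_PAIRS are made-up data, pair_distance is a hypothetical fixed metric (the thesis instead infers the neighborhood), and a simple perceptron on difference vectors stands in for the ranking function.

```python
import math

# Hypothetical labeled pairs (more, less): the first image shows MORE of the
# attribute. In one region of feature space the attribute tracks feature 0;
# in the other region it tracks feature 1.
TRAIN_PAIRS = [
    ([0.9, 0.2], [0.1, 0.3]),
    ([0.8, 0.5], [0.3, 0.4]),
    ([0.7, 0.1], [0.2, 0.6]),
    ([5.1, 5.9], [5.8, 5.2]),
    ([5.3, 5.8], [5.6, 5.1]),
    ([5.2, 5.7], [5.9, 5.3]),
]

def pair_distance(query, labeled):
    """Distance between an unlabeled pair and a labeled (more, less) pair.

    Assumption: sum of member-wise Euclidean distances, minimized over the
    two possible alignments, since the query's ordering is unknown.
    """
    (x1, x2), (more, less) = query, labeled
    return min(math.dist(x1, more) + math.dist(x2, less),
               math.dist(x1, less) + math.dist(x2, more))

def train_local_ranker(pairs, epochs=50):
    """Perceptron on difference vectors: seek w with w . (more - less) > 0."""
    w = [0.0] * len(pairs[0][0])
    for _ in range(epochs):
        updated = False
        for more, less in pairs:
            d = [m - l for m, l in zip(more, less)]
            if sum(wi * di for wi, di in zip(w, d)) <= 0:
                w = [wi + di for wi, di in zip(w, d)]
                updated = True
        if not updated:  # every neighbor pair is ranked correctly
            break
    return w

def compare_locally(query, train_pairs, k=3):
    """Return +1 if the first query image is predicted to show more of the
    attribute, -1 if the second is, and 0 if the scores tie."""
    # Keep only the k labeled pairs most similar to the query pair.
    neighbors = sorted(train_pairs, key=lambda p: pair_distance(query, p))[:k]
    w = train_local_ranker(neighbors)
    x1, x2 = query
    score = sum(wi * (a - b) for wi, a, b in zip(w, x1, x2))
    return (score > 0) - (score < 0)

if __name__ == "__main__":
    print(compare_locally(([0.6, 0.3], [0.4, 0.35]), TRAIN_PAIRS))
    print(compare_locally(([5.5, 5.2], [5.4, 5.8]), TRAIN_PAIRS))
```

Because each query selects its own neighbors, the two example queries are ranked by different learned weight vectors, which is the behavior a single global ranking function cannot provide. The choice of k and of the pair distance are the free parameters here.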