Interactive image search with attributes
An image retrieval system needs to communicate with people in a common language if it is to serve its user's information need. I propose techniques for interactive image search with the help of visual attributes, which are high-level semantic visual properties of objects (like "shiny" or "natural") that are understandable by both people and machines. My thesis explores attributes as a novel form of user input for search. I show how to use attributes to provide relevance feedback for image search; how to optimally choose what to seek feedback on; how to ensure that the attribute models learned by a system align with the user's perception of those attributes; how to automatically discover the shades of meaning that users employ when applying an attribute term; and how attributes can help learn object category models. Attributes give the user of an image retrieval system a channel on which to communicate her information need precisely and with as little effort as possible.

One-shot retrieval is generally insufficient, so interactive retrieval systems seek feedback from the user on the currently retrieved results and adapt their relevance ranking function accordingly. In traditional interactive search, users mark some images as "relevant" and others as "irrelevant", but this form of feedback is limited. I propose a novel mode of feedback in which a user directly describes how high-level properties of the retrieved images should be adjusted to more closely match her envisioned target images, using relative attribute feedback statements. For example, when conducting a query on a shopping website, the user might state: "I want shoes like these, but more formal." I demonstrate that relative attribute feedback is more powerful than traditional binary feedback.
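The relative attribute feedback idea can be sketched in a few lines. This is a minimal, illustrative version only: it assumes per-attribute strength predictions are already available (in the thesis these come from learned ranking functions; here `attribute_scores` is simply a given lookup, and `rerank` is a hypothetical name), and it ranks images by how many of the user's relative constraints they satisfy.

```python
# Minimal sketch of re-ranking with relative attribute feedback.
# Assumes `attribute_scores[attr][img]` gives a predicted attribute
# strength for each image (a stand-in for learned ranking functions).

def rerank(database, feedback, attribute_scores):
    """Rank images by how many relative constraints they satisfy.

    feedback: (attribute, reference_image, direction) triples, e.g.
    ("formal", ref, "more") means "the target is more formal than ref".
    """
    def satisfies(img, attr, ref, direction):
        s, r = attribute_scores[attr][img], attribute_scores[attr][ref]
        return s > r if direction == "more" else s < r

    scored = [(sum(satisfies(img, a, ref, d) for a, ref, d in feedback), img)
              for img in database]
    scored.sort(key=lambda pair: -pair[0])  # most constraints satisfied first
    return [img for _, img in scored]
```

Images consistent with every statement ("more formal than these shoes") float to the top, while binary relevant/irrelevant feedback could only reward or penalize whole images.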
The images believed to be most relevant need not be the most informative for reducing the system's uncertainty, so it can be beneficial to seek feedback on something other than the top-ranked images. I propose to guide the user through a coarse-to-fine search using a relative attribute image representation. At each iteration of feedback, the user visually compares her envisioned target with a "pivot" exemplar with respect to an attribute, where the pivot separates the database images into two balanced sets. The system actively determines along which of multiple such attributes the user's comparison should next be requested, based on the expected information gain that would result. The proposed attribute search trees limit the scan for candidate images on which to seek feedback to just one image per attribute, so the approach is efficient for both the system and the user.

No matter how powerful the form of feedback the system offers, search efficiency will suffer if there is noise on the communication channel between the user and the system. Therefore, I also study ways to capture the user's true perception of the attribute vocabulary used in the search. Existing work assumes that an image has a single "true" label for each attribute that objective viewers would agree upon; in practice, however, viewers frequently have slightly different internal models of a visual property. I pose user-specific attribute learning as an adaptation problem: the system leverages commonalities in perception to learn a generic prediction function, then uses a small number of user-labeled examples to adapt that model into a user-specific prediction function. To further lighten the labeling load, I introduce two ways to extrapolate beyond the labels explicitly provided by a given user.
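The pivot-based question selection described above can be sketched as a toy Python routine. The sketch assumes a uniform prior over which database image is the target, takes each attribute's pivot to be the median-scoring candidate, and picks the attribute whose comparison minimizes the expected remaining entropy; all function names are illustrative, not the thesis's actual implementation.

```python
import math

def choose_pivot(candidates, scores):
    """Pivot = candidate whose attribute score is the median, so a
    'more'/'less' answer splits the set roughly in half."""
    ordered = sorted(candidates, key=lambda img: scores[img])
    return ordered[len(ordered) // 2]

def expected_remaining_entropy(candidates, scores, pivot):
    """Expected entropy (uniform prior over candidates) after the user
    says whether her target has less or more of the attribute than pivot."""
    n = len(candidates)
    lower = sum(1 for img in candidates if scores[img] < scores[pivot])
    upper = n - lower
    ent = lambda k: math.log2(k) if k > 0 else 0.0
    return (lower / n) * ent(lower) + (upper / n) * ent(upper)

def select_question(candidates, attr_scores):
    """Pick the attribute (and its pivot) whose comparison is expected
    to shrink uncertainty about the target the most."""
    best = None
    for attr, scores in attr_scores.items():
        pivot = choose_pivot(candidates, scores)
        e = expected_remaining_entropy(candidates, scores, pivot)
        if best is None or e < best[0]:
            best = (e, attr, pivot)
    return best[1], best[2]
```

Because only one pivot per attribute is ever evaluated, the cost per feedback round grows with the number of attributes, not the size of the database, mirroring the efficiency argument above.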
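The user-specific adaptation step can likewise be sketched under strong simplifying assumptions. The thesis adapts the generic model itself from a few user-labeled examples; as a lightweight stand-in, this hypothetical `adapt_threshold` keeps the generic model's scores fixed and learns only a user-specific decision threshold from the user's handful of labels.

```python
def adapt_threshold(generic_scores, user_labels):
    """Choose the decision threshold on a generic attribute model's
    scores that best reproduces one user's few labels.

    A toy stand-in for model adaptation: the generic model (learned from
    many users' commonalities) supplies the scores; only the cut-point
    between "attribute absent" and "attribute present" is personalized.
    """
    candidates = sorted(set(generic_scores))
    midpoints = [(a + b) / 2 for a, b in zip(candidates, candidates[1:])]
    midpoints = [candidates[0] - 1.0] + midpoints + [candidates[-1] + 1.0]

    def accuracy(t):
        return sum((s > t) == bool(y)
                   for s, y in zip(generic_scores, user_labels))

    return max(midpoints, key=accuracy)  # threshold with fewest mistakes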
While users differ in how they use the attribute vocabulary, there exist some commonalities and groupings of users around their attribute interpretations. Automatically discovering and exploiting these groupings can help the system learn more robust personalized models. I propose an approach to discover the latent factors behind how users label images with the presence or absence of a given attribute, from a sparse label matrix. I then show how to cluster users in this latent space to expose the underlying "shades of meaning" of the attribute, and subsequently learn personalized models for these user groups. Discovering the shades of meaning also serves to disambiguate attribute terms and expand a core attribute vocabulary with finer-grained attributes. Finally, I show how attributes can help learn object categories faster. I develop an active learning framework where the computer vision learning system actively solicits annotations from a pool of both object category labels and the objects' shared attributes, depending on which will most reduce total uncertainty for multi-class object predictions in the joint object-attribute model. Knowledge of an attribute's presence in an image can immediately influence many object models, since attributes are by definition shared across subsets of the object categories. The resulting object category models can be used when the user initiates a search via keywords such as "Show me images of cats" and then (optionally) refines that search with the attribute-based interactions I propose. My thesis exploits properties of visual attributes that allow search to be both effective and efficient, in terms of both user time and computation time. Further, I show how the search experience for each individual user can be improved, by modeling how she uses attributes to communicate with the retrieval system. 
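The latent-factor discovery step above can be illustrated with a tiny stochastic-gradient matrix factorization; this is a sketch only, and the hypothetical `user_factors`, its rank `k`, and its learning-rate settings are not the thesis's actual model. Clustering the recovered per-user vectors (e.g. with k-means) would then expose the attribute's shades of meaning.

```python
import random

def user_factors(labels, n_users, n_images, k=2, lr=0.05,
                 epochs=1000, seed=0):
    """Factor a sparse user-by-image attribute-label matrix by SGD.

    labels: {(user, image): 0 or 1} for the entries each user annotated.
    Returns one k-dimensional latent vector per user; users with similar
    interpretations of the attribute end up with similar vectors.
    """
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_images)]
    for _ in range(epochs):
        for (u, i), y in labels.items():
            # squared-error gradient step on both factors
            err = y - sum(U[u][f] * V[i][f] for f in range(k))
            for f in range(k):
                U[u][f], V[i][f] = (U[u][f] + lr * err * V[i][f],
                                    V[i][f] + lr * err * U[u][f])
    return U
```

Because the factorization only needs the entries a user actually labeled, it tolerates the sparse label matrices that arise when each annotator sees a small subset of images.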
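The annotation-selection criterion from the active learning framework above can also be sketched in miniature. This toy version assumes each category carries a known binary attribute signature and that the system holds a posterior over one image's category; the annotation costs are made-up illustrative numbers, and the real framework reasons jointly over multi-class predictions for the whole pool.

```python
import math

def entropy(p):
    return -sum(q * math.log2(q) for q in p if q > 0)

def expected_entropy_after(posterior, signatures, attr):
    """Expected entropy of the category posterior once we learn whether
    the attribute is present (answer probability follows the posterior)."""
    p_yes = sum(p for p, sig in zip(posterior, signatures) if sig[attr])
    remaining = 0.0
    for present, p_ans in ((True, p_yes), (False, 1.0 - p_yes)):
        if p_ans == 0:
            continue
        # keep only categories consistent with the answer, renormalize
        cond = [p / p_ans if sig[attr] == present else 0.0
                for p, sig in zip(posterior, signatures)]
        remaining += p_ans * entropy(cond)
    return remaining

def best_question(posterior, signatures, n_attrs,
                  label_cost=4.0, attr_cost=1.0):
    """Pick the annotation with the largest uncertainty reduction per
    unit cost: a category label resolves everything but costs more; an
    attribute answer is cheap and prunes all inconsistent categories."""
    gains = {"label": entropy(posterior) / label_cost}
    for a in range(n_attrs):
        gain = entropy(posterior) - expected_entropy_after(posterior,
                                                           signatures, a)
        gains[("attr", a)] = gain / attr_cost
    return max(gains, key=gains.get)
```

An attribute shared by half the categories halves the candidate set whichever way the annotator answers, which is why cheap attribute questions can beat direct category labels in this accounting.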
I focus on the modes in which an image retrieval system communicates with its users, integrating the computer vision and information retrieval perspectives on image search. The techniques I propose are thus a promising step toward closing the semantic gap.