Densifying Supervision for Fine-Grained Visual Comparisons

Abstract

Detecting subtle differences in visual attributes requires inferring which of two images exhibits a property more strongly, e.g., which face is smiling slightly more, or which shoe is slightly more sporty. While valuable for applications ranging from biometrics to online shopping, fine-grained attributes are challenging to learn. Unlike in traditional recognition tasks, the supervision is inherently comparative. As a result, the space of all possible training comparisons is vast, and learning algorithms face a sparse-supervision problem: it is difficult to curate enough subtly different image pairs for each attribute of interest. We propose to overcome this problem by densifying the space of training images with attribute-conditioned image generation. The main idea is to create synthetic but realistic training images exhibiting slight modifications of one or more attributes, obtain their comparative labels from human annotators, and use the labeled image pairs to augment real image pairs when training ranking functions for the attributes. We introduce two variants of our idea. The first passively synthesizes training images by “jittering” individual attributes in real training images. Building on this idea, our second model actively synthesizes training image pairs that would confuse the current attribute model, training both the attribute ranking functions and a generation controller simultaneously in an adversarial manner. For both models, we employ a conditional Variational Autoencoder (CVAE) to perform image synthesis. We demonstrate the effectiveness of bootstrapping imperfect image generators to counteract supervision sparsity in learning-to-rank models. Our approach yields state-of-the-art performance on challenging datasets from two distinct domains.
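
As a rough illustration of the passive "jittering" variant, the following sketch trains a pairwise ranking function on real comparison pairs augmented with synthetic ones. It is a toy under stated assumptions, not the method's actual implementation: images are replaced by two-dimensional feature vectors whose first coordinate stands in for attribute strength, the hypothetical `jitter` helper stands in for CVAE image synthesis, the jitter direction stands in for the human-provided comparative label, and a linear scorer with a logistic pairwise loss stands in for the attribute ranking function.

```python
import math

def score(w, x):
    """Linear attribute strength; a higher score means 'more' of the attribute."""
    return sum(wi * xi for wi, xi in zip(w, x))

def jitter(x, delta):
    """Hypothetical stand-in for attribute-conditioned synthesis:
    shift only the attribute coordinate (index 0) by a small amount."""
    return [x[0] + delta] + list(x[1:])

def train_ranker(pairs, dim, epochs=200, lr=0.1):
    """Fit w so that score(w, hi) > score(w, lo) for each (hi, lo) pair,
    minimizing the logistic pairwise loss log(1 + exp(-(s_hi - s_lo)))."""
    w = [0.0] * dim
    for _ in range(epochs):
        for hi, lo in pairs:
            margin = score(w, hi) - score(w, lo)
            # Derivative of the loss with respect to the margin.
            g = -1.0 / (1.0 + math.exp(margin))
            for k in range(dim):
                w[k] -= lr * g * (hi[k] - lo[k])
    return w

# Toy "real images": [attribute strength, nuisance feature].
real = [[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]]

# Sparse real supervision: (stronger, weaker) pairs.
real_pairs = [(real[1], real[0]), (real[1], real[2]), (real[2], real[0])]

# Densified supervision: each real example paired with a slightly jittered copy.
# Assumption: the jitter direction is used as the label in place of annotators.
synthetic_pairs = [(jitter(x, 0.05), x) for x in real]

w = train_ranker(real_pairs + synthetic_pairs, dim=2)
```

Because each synthetic pair differs only in the attribute coordinate, it supplies a comparison at a finer granularity than the real pairs do, which is the sense in which the supervision is densified. The actively synthesized variant, where a controller picks pairs that confuse the current ranker, is not covered by this sketch.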