SUM: Serialized Updating and Matching for text-based person retrieval

Abstract

The central problem of text-based person retrieval is how to bridge the gap between heterogeneous cross-modal data. Many previous works learn a latent common space to bridge the modality gap and extract modality-invariant feature vectors. In these methods, common space mapping and cross-modal information matching are conducted in a one-off manner that aims to extract sufficient discriminative clues from the high-dimensional multi-modal data at first glance. This is inconsistent with the fact that humans usually follow a step-by-step process to recognize and match two objects. Intuitively, the large heterogeneity gap between multi-modal data can be better bridged by gradually analyzing the complex cross-modal relationships. In this paper, we propose a Serialized Updating and Matching (SUM) method for text-based person retrieval that bridges the heterogeneity gap between cross-modal data in a step-by-step manner. The core component of SUM is the proposed Memory Gating Module (MGM), which can be stacked to gradually update and match features extracted from the visual and textual modalities. To fully exploit the correlations that lie within multi-granular cross-modal data, two variants are designed to handle both global and fine-grained local information, namely the Global Memory Gating Module (GMGM) and the Fine-grained Memory Gating Module (FMGM), in which the updating rate of information at each step is dynamically determined after observing the feature in the opposite modality. Moreover, SUM can be flexibly used as an add-on to any multi-granular text-based person retrieval method to further improve performance. We evaluate the proposed method on two text-based person retrieval datasets, CUHK-PEDES and RSTPReid, and on two general cross-modal retrieval datasets, Flickr8K and Flickr30K, to assess its generalization ability. Experimental results show that the proposed SUM outperforms existing methods and achieves state-of-the-art performance.
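
To make the step-by-step idea concrete, the following is a minimal toy sketch of a gated update applied over several stacked steps, with a matching score recorded after each step. It is not the MGM defined in this paper: the per-dimension gate, the choice of the opposite-modality feature as the update candidate, and all names (`gated_update`, `serialized_update_and_match`, `gate_weights`, `gate_bias`) are assumptions made purely for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def gated_update(own, other, gate_weights, gate_bias):
    """One hypothetical gating step (an assumption, not the paper's formula).

    The update rate for each dimension is computed after observing both the
    feature itself and the feature from the opposite modality. The feature is
    then moved toward the opposite-modality feature by that rate.
    """
    updated = []
    for i, (o, c) in enumerate(zip(own, other)):
        w_own, w_other = gate_weights[i]
        rate = sigmoid(w_own * o + w_other * c + gate_bias[i])
        updated.append((1.0 - rate) * o + rate * c)
    return updated


def serialized_update_and_match(visual, textual, steps, gate_weights, gate_bias):
    """Stack `steps` gated updates and record a matching score after each."""
    scores = []
    for _ in range(steps):
        # Both modalities are updated from the features of the previous step.
        new_visual = gated_update(visual, textual, gate_weights, gate_bias)
        new_textual = gated_update(textual, visual, gate_weights, gate_bias)
        visual, textual = new_visual, new_textual
        scores.append(cosine(visual, textual))
    return visual, textual, scores


if __name__ == "__main__":
    # Toy 2-dimensional features and zero-initialized gate parameters.
    v, t, s = serialized_update_and_match(
        [1.0, 0.0], [0.0, 1.0], steps=3,
        gate_weights=[(0.0, 0.0), (0.0, 0.0)], gate_bias=[0.0, 0.0],
    )
    print(v, t, s)
```

In a learned model the gate parameters would be trained and the features would come from image and text encoders; here they are fixed toy values so that the flow of updating first and matching after each step is easy to follow.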