Stacked squeeze-and-excitation recurrent residual network for visual-semantic matching

Authors:

Highlights:

• This paper proposes a novel stacked Squeeze-and-Excitation Recurrent Residual Network (SER2-Net) for visual-semantic matching (a minimal architectural sketch follows this list).

• This paper develops an effective and efficient cross-modal representation learning module that generates semantically complementary multi-level features for both modalities.

• This paper presents a novel objective function for aligning cross-modal data, which captures the interdependency among multiple semantic levels to alleviate the distribution inconsistency between the visual and textual modalities (see the loss sketch after this list).

• Extensive experiments on two benchmark datasets demonstrate the superiority of the proposed model over state-of-the-art approaches.
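
The squeeze-and-excitation recurrent residual design named in the first two highlights can be illustrated with a minimal PyTorch sketch. This is a hedged reconstruction built from the standard SE block (Hu et al.), not the paper's verified architecture: the class names, the reduction ratio, and the decision to share one block's weights across recurrent steps are all illustrative assumptions.

    import torch
    import torch.nn as nn

    class SEResidualBlock(nn.Module):
        """One residual block with a squeeze-and-excitation gate
        (illustrative; channels must exceed the reduction ratio)."""
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.transform = nn.Sequential(
                nn.Linear(channels, channels),
                nn.ReLU(inplace=True),
            )
            # Excitation: compress channel statistics, then re-weight each
            # channel of the transformed features with a sigmoid gate.
            self.excite = nn.Sequential(
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
                nn.Sigmoid(),
            )

        def forward(self, x):                # x: (batch, channels)
            h = self.transform(x)
            return x + self.excite(h) * h    # gated residual update

    class StackedSER2(nn.Module):
        """Applies one SE residual block recurrently and collects every
        intermediate output as a multi-level feature (assumption: the
        'stacked/recurrent' wording means weight-shared repetition)."""
        def __init__(self, channels, steps=3):
            super().__init__()
            self.block = SEResidualBlock(channels)
            self.steps = steps

        def forward(self, x):
            levels = []
            for _ in range(self.steps):
                x = self.block(x)
                levels.append(x)
            return levels                    # one embedding per semantic level

Applied to both image and sentence embeddings, such a module would yield the "semantically complementary multi-level features" the second highlight refers to.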

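The alignment objective in the third highlight can be sketched in the same spirit. Below is a standard bidirectional max-of-hinges triplet ranking loss (in the style of VSE++) summed over corresponding semantic levels; the margin value and the plain per-level sum are assumptions, since this listing does not describe the paper's actual inter-level interdependency term.

    import torch
    import torch.nn.functional as F

    def multi_level_ranking_loss(img_levels, txt_levels, margin=0.2):
        """Sum a bidirectional hinge triplet loss over matching semantic
        levels of image and text embeddings (illustrative sketch)."""
        total = 0.0
        for v, t in zip(img_levels, txt_levels):
            v = F.normalize(v, dim=1)            # cosine similarity via
            t = F.normalize(t, dim=1)            # L2-normalized dot products
            scores = v @ t.t()                   # scores[i, j] = sim(img_i, txt_j)
            diag = scores.diag().view(-1, 1)     # similarities of matched pairs
            mask = torch.eye(scores.size(0), dtype=torch.bool,
                             device=scores.device)
            # Hinge costs for both retrieval directions, positives masked out.
            cost_s = (margin + scores - diag).clamp(min=0).masked_fill(mask, 0)
            cost_im = (margin + scores - diag.t()).clamp(min=0).masked_fill(mask, 0)
            # Hardest negative per row (image->text) and per column (text->image).
            total = total + cost_s.max(1)[0].sum() + cost_im.max(0)[0].sum()
        return total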

Keywords: Vision and language, Cross-modal retrieval, Visual-semantic embedding

Article history: Received 3 December 2019, Revised 23 March 2020, Accepted 29 March 2020, Available online 22 April 2020, Version of Record 5 June 2020.

DOI: https://doi.org/10.1016/j.patcog.2020.107359