Coarse-to-fine dual-level attention for video-text cross modal retrieval

Abstract:

The effective representation of video features plays an important role in video vs. text cross-modal retrieval, yet many researchers either use a single modal feature of the video or simply combine its multi-modal features, which makes the learned video features less robust. To enhance the robustness of the video feature representation, we use a coarse-fine-grained parallel attention model and a feature fusion module to learn a more effective video feature representation. Coarse-grained attention learns the relationships between different feature blocks within the same modality, while fine-grained attention is applied to global features and strengthens the connections between individual feature points; the two forms of attention complement each other. We integrate a multi-head attention network into the model to expand the receptive field of the features, and use the feature fusion module to further reduce the semantic gap between the video's different modalities. Our proposed architecture not only strengthens the relationship between global and local features, but also compensates for the differences between the video's modality features. Evaluation on three widely used datasets, ActivityNet-Captions, MSRVTT and LSMDC, demonstrates its effectiveness.
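The abstract describes the architecture only at a high level. Below is a minimal PyTorch sketch of how two parallel attention branches and a fusion step could be wired together; the module name `CoarseFineParallelAttention`, the mean-pooled block construction, the `block_size` parameter, and the concatenation-based fusion are all illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch of coarse-fine-grained parallel attention with feature
# fusion, under the assumptions stated above. Not the paper's official code.
import torch
import torch.nn as nn


class CoarseFineParallelAttention(nn.Module):
    """Runs block-level (coarse) and point-level (fine) multi-head attention
    in parallel over a sequence of video features, then fuses the two views."""

    def __init__(self, dim=512, num_heads=8, block_size=4):
        super().__init__()
        self.block_size = block_size
        # Coarse branch: attention between pooled feature blocks.
        self.coarse_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Fine branch: attention between all individual feature points.
        self.fine_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Fusion: concatenate both views and project back to `dim`
        # (a simple stand-in for the paper's feature fusion module).
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x):
        # x: (batch, seq_len, dim); seq_len is assumed divisible by block_size.
        b, n, d = x.shape
        # Coarse branch: mean-pool contiguous blocks, attend between blocks.
        blocks = x.view(b, n // self.block_size, self.block_size, d).mean(dim=2)
        coarse, _ = self.coarse_attn(blocks, blocks, blocks)
        # Broadcast each block's context back to its member feature points.
        coarse = coarse.repeat_interleave(self.block_size, dim=1)
        # Fine branch: attend between every pair of feature points.
        fine, _ = self.fine_attn(x, x, x)
        # Fuse the two complementary views per feature point.
        return self.fuse(torch.cat([coarse, fine], dim=-1))


if __name__ == "__main__":
    model = CoarseFineParallelAttention(dim=512, num_heads=8, block_size=4)
    video_feats = torch.randn(2, 16, 512)  # e.g. 16 frame-level features
    print(model(video_feats).shape)  # torch.Size([2, 16, 512])
```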

Keywords: Video vs. text cross-modal retrieval, Coarse-fine-grained parallel attention, Multi-head attention, Feature fusion

Article history: Received 29 November 2021, Revised 7 January 2022, Accepted 29 January 2022, Available online 8 February 2022, Version of Record 15 February 2022.

DOI: https://doi.org/10.1016/j.knosys.2022.108354