Linguistically-aware attention for reducing the semantic gap in vision-language tasks

Highlights:

• Proposal of a generic Linguistically-aware Attention (LAT) mechanism to reduce the semantic gap between modalities in vision-language tasks (see the sketch after this list).

• Proposal of a novel Counting-VQA model that achieves state-of-the-art results on five counting-specific VQA datasets.

• Adaptation of LAT into several state-of-the-art VQA models (UpDn, MUREL, and BAN); LAT improves the performance of all of them.

• Adaptation of LAT into the best-performing object-level-attention captioning model (UpDn); incorporating LAT improves the captioning performance of the baseline.
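
The highlights only name the LAT mechanism, so the following is a minimal PyTorch sketch of what a linguistically-aware attention module of this kind could look like: each detected region contributes both a visual feature and a word embedding of its object label, and the question attends over the fused region representations. The class name, the dimensions, and the additive fusion scheme are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinguisticallyAwareAttention(nn.Module):
    """Illustrative LAT-style sketch (assumptions, not the paper's exact model):
    each image region carries a visual feature and a word embedding of its
    detected object label; the question attends over the fused regions."""

    def __init__(self, v_dim=2048, w_dim=300, q_dim=1024, h_dim=512):
        super().__init__()
        self.v_proj = nn.Linear(v_dim, h_dim)  # visual region features
        self.w_proj = nn.Linear(w_dim, h_dim)  # label word embeddings (linguistic cue)
        self.q_proj = nn.Linear(q_dim, h_dim)  # question encoding
        self.score = nn.Linear(h_dim, 1)       # per-region attention logit

    def forward(self, v, w, q):
        # v: (B, K, v_dim) region features, w: (B, K, w_dim) label embeddings,
        # q: (B, q_dim) question vector
        fused = self.v_proj(v) + self.w_proj(w)            # linguistically-aware regions
        joint = torch.tanh(fused * self.q_proj(q).unsqueeze(1))
        alpha = F.softmax(self.score(joint), dim=1)        # attention over K regions
        return (alpha * fused).sum(dim=1)                  # attended image representation
```

In the VQA adaptations listed above, an attended representation of this form would presumably replace the purely visual attended feature that is combined with the question before answer prediction.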


Keywords: Attention models, Visual question answering, Counting in visual question answering, Image captioning

Article history: Received 12 March 2020; Revised 14 December 2020; Accepted 26 December 2020; Available online 1 January 2021; Version of Record 8 January 2021.

DOI: https://doi.org/10.1016/j.patcog.2020.107812