Linguistically-aware attention for reducing the semantic gap in vision-language tasks

Highlights:

• Proposal of a generic Linguistically-aware Attention (LAT) mechanism to reduce the semantic gap between modalities in vision-language tasks (see the sketch after this list).

• Proposal of a novel Counting-VQA model that achieves state-of-the-art results on five counting-specific VQA datasets.

• Adaptation of LAT into several state-of-the-art VQA models (UpDn, MUREL, and BAN); LAT improves the performance of all of them.

• Adaptation of LAT into the best-performing object-level-attention captioning model (UpDn); incorporating LAT improves the captioning performance of the baseline.
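
The highlights only name the LAT mechanism, so the following is a minimal PyTorch sketch of what a linguistically-aware attention module of this kind could look like: each detected region contributes both a visual feature and a word embedding of its object label, and the question attends over the fused region representations. The class name, the dimensions, and the additive fusion scheme are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinguisticallyAwareAttention(nn.Module):
    """Illustrative LAT-style sketch (assumptions, not the paper's exact model):
    each image region carries a visual feature and a word embedding of its
    detected object label; the question attends over the fused regions."""

    def __init__(self, v_dim=2048, w_dim=300, q_dim=1024, h_dim=512):
        super().__init__()
        self.v_proj = nn.Linear(v_dim, h_dim)  # visual region features
        self.w_proj = nn.Linear(w_dim, h_dim)  # label word embeddings (linguistic cue)
        self.q_proj = nn.Linear(q_dim, h_dim)  # question encoding
        self.score = nn.Linear(h_dim, 1)       # per-region attention logit

    def forward(self, v, w, q):
        # v: (B, K, v_dim) region features, w: (B, K, w_dim) label embeddings,
        # q: (B, q_dim) question vector
        fused = self.v_proj(v) + self.w_proj(w)            # linguistically-aware regions
        joint = torch.tanh(fused * self.q_proj(q).unsqueeze(1))
        alpha = F.softmax(self.score(joint), dim=1)        # attention over K regions
        return (alpha * fused).sum(dim=1)                  # attended image representation
```

In the VQA adaptations listed above, an attended representation of this form would presumably replace the purely visual attended feature that is combined with the question before answer prediction.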


Keywords: Attention models, Visual question answering, Counting in visual question answering, Image captioning

Article history: Received 12 March 2020; Revised 14 December 2020; Accepted 26 December 2020; Available online 1 January 2021; Version of Record 8 January 2021.

DOI: https://doi.org/10.1016/j.patcog.2020.107812