Image captioning for effective use of language models in knowledge-based visual question answering

Authors:

Highlights:

• Captions are more effective than images for OK-VQA, a knowledge-intensive VQA task.

• Increasing the capacity of language models makes it possible to reach state-of-the-art results.

• Our best model obtains results comparable to those of five GPT-3 runs, although GPT-3 is 15x larger.

• Our system is effective when external knowledge is needed.

Keywords: Visual question answering, Image captioning, Language models, Deep learning

Review history: Received 1 April 2022, Revised 12 July 2022, Accepted 21 August 2022, Available online 28 August 2022, Version of Record 12 September 2022.

Official paper URL: https://doi.org/10.1016/j.eswa.2022.118669