Image captioning for effective use of language models in knowledge-based visual question answering
Authors:
Highlights:
• Captions are more effective than images for OK-VQA, a knowledge-intensive VQA task.
• Increasing the capacity of language models makes it possible to reach state-of-the-art results.
• Our best model obtains results comparable to five runs of GPT-3, which is 15x larger.
• Our system is effective when external knowledge is needed.
Keywords: Visual question answering, Image captioning, Language models, Deep learning
Review history: Received 1 April 2022, Revised 12 July 2022, Accepted 21 August 2022, Available online 28 August 2022, Version of Record 12 September 2022.
Official paper URL: https://doi.org/10.1016/j.eswa.2022.118669