A multimodal attention fusion network with a dynamic vocabulary for TextVQA

Authors:

Highlights:

• A novel encoder-decoder method for TextVQA is proposed.

• The proposed method uses multimodal features to improve model accuracy.

• Attention map loss is used to address the dynamic vocabulary problem.

• The method achieved first place in the ICDAR ST-VQA 2019 challenge.

Keywords: Dynamic vocabulary, Attention map, Multimodal fusion, ST-VQA

Review history: Received 2 November 2020, Revised 10 July 2021, Accepted 27 July 2021, Available online 19 August 2021, Version of Record 24 August 2021.

Official paper URL: https://doi.org/10.1016/j.patcog.2021.108214