Text-instance graph: Exploring the relational semantics for text-based visual question answering

Authors:

Highlights:

Abstract:

It is time to stop neglecting the text in the world around us. In VQA, surrounding text helps humans understand complete visual scenes and reason about question semantics efficiently. Here, we address the challenging Text-based Visual Question Answering (TextVQA) problem, which requires a model to answer visual questions by reading scene text. Existing TextVQA methods mainly focus on the latent relationships among detected object instances, scene texts, and the given question, but ignore the spatial location relationships and complex relational semantics between visual object instances and OCR texts (e.g., the A of B on C). To deal with these challenges, we propose a novel Text-Instance Graph (TIG) network for TextVQA. The TIG builds an OCR-OBJ graph to model spatial overlapping relationships, where each graph node is updated using related objects or OCR texts. To handle questions with complex logic, we propose a dynamic OCR-OBJ graph network that extends the perception space of graph nodes, capturing information from nodes that are not directly adjacent. Consider a scene about “the brand of the computer on the table”: the model would build correlations between “brand” and “table” using the “computer” node as an intermediate node. Extensive experiments on three benchmarks demonstrate the effectiveness and superiority of the proposed method. In addition, our TIG achieves 0.505 ANLS on the ST-VQA challenge leaderboard, setting a new state of the art.
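The intermediate-node idea in the abstract can be illustrated with a toy sketch: after two rounds of message passing on an OCR-object graph, a node absorbs information from a node two hops away via the intermediate node. This is a minimal, hypothetical illustration of multi-hop propagation, not the authors' TIG implementation; the node names and the simple averaging update are assumptions for demonstration only.

```python
def propagate(features, edges, hops=2):
    """Toy message passing: average each node's scalar feature with its
    neighbors' features, repeated `hops` times (not the paper's model)."""
    feats = dict(features)
    for _ in range(hops):
        new_feats = {}
        for node, value in feats.items():
            # Undirected neighborhood: gather features from both edge ends.
            neighbors = [feats[m] for n, m in edges if n == node]
            neighbors += [feats[n] for n, m in edges if m == node]
            new_feats[node] = (value + sum(neighbors)) / (1 + len(neighbors))
        feats = new_feats
    return feats

# Toy scene for "the brand of the computer on the table":
# "brand" (OCR) touches "computer"; "computer" overlaps "table";
# "brand" and "table" are NOT directly connected.
features = {"brand": 1.0, "computer": 0.0, "table": 4.0}
edges = [("brand", "computer"), ("computer", "table")]

one_hop = propagate(features, edges, hops=1)   # "table" cannot reach "brand" yet
two_hop = propagate(features, edges, hops=2)   # "table" reaches "brand" via "computer"
```

After one round, "brand" is unaffected by "table"; after two rounds, "table" pulls the "brand" feature up through the intermediate "computer" node, mirroring the correlation the dynamic graph is meant to capture.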

Keywords: Text-based visual question answering, Spatial overlapping, Text-instance graph, Copy mechanism

Article history: Received 1 February 2021; Revised 27 September 2021; Accepted 25 November 2021; Available online 27 November 2021; Version of Record 25 December 2021.

DOI: https://doi.org/10.1016/j.patcog.2021.108455