Research on Visual Question Answering Based on GAT Relational Reasoning

Authors: Yalin Miao, Wenfang Cheng, Shuyun He, Hui Jiang

Abstract

The diversity of questions in VQA poses new challenges for constructing VQA models. Existing VQA models focus on designing new attention mechanisms, which makes them increasingly complex. In addition, most of them concentrate on object recognition while neglecting spatial reasoning, semantic relations, and even scene understanding. Therefore, this paper proposes a Graph Attention Network Relational Reasoning (GAT2R) model, which mainly consists of scene graph generation and scene graph answer prediction. The scene graph generation module extracts the regional and spatial features of objects with an object detection model and uses a relation decoder to predict the relations between object pairs. The scene graph answer prediction module dynamically updates node representations through a question-guided graph attention network, then performs multi-modal fusion with the question features to obtain the answer. Experiments show that the proposed model achieves an accuracy of 54.45% on GQA, a natural-scene dataset built mainly around relational reasoning, and 68.04% on the widely used VQA 2.0 dataset. Compared with the benchmark models, the accuracy of the proposed model increases by 4.71% and 2.37% on GQA and VQA 2.0 respectively, demonstrating its effectiveness and generalization.
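The abstract describes the core update step only at a high level: node representations of the scene graph are re-weighted by attention scores conditioned on the question, then fused with the question features. The PyTorch sketch below illustrates what such a question-guided graph attention layer might look like; the class name QuestionGuidedGATLayer, the feature dimensions, the concatenation-based attention scoring, and the mean-pool plus elementwise-product fusion in the usage note are all assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedGATLayer(nn.Module):
    """Minimal sketch of a question-guided graph attention layer (hypothetical)."""

    def __init__(self, node_dim, q_dim, hidden_dim):
        super().__init__()
        self.node_proj = nn.Linear(node_dim, hidden_dim)
        self.q_proj = nn.Linear(q_dim, hidden_dim)
        # Attention score over [node_i, node_j, question] triples.
        self.attn = nn.Linear(3 * hidden_dim, 1)

    def forward(self, nodes, question, adj):
        # nodes:    (N, node_dim) region features of scene-graph nodes
        # question: (q_dim,) pooled question embedding
        # adj:      (N, N) adjacency built from the predicted relations
        h = self.node_proj(nodes)                        # (N, H)
        q = self.q_proj(question).expand(h.size(0), -1)  # (N, H)
        N = h.size(0)
        hi = h.unsqueeze(1).expand(N, N, -1)             # source node
        hj = h.unsqueeze(0).expand(N, N, -1)             # neighbor node
        qq = q.unsqueeze(1).expand(N, N, -1)             # question context
        e = F.leaky_relu(self.attn(torch.cat([hi, hj, qq], dim=-1)).squeeze(-1))
        e = e.masked_fill(adj == 0, float("-inf"))       # attend only along edges
        alpha = torch.softmax(e, dim=-1)                 # question-conditioned weights
        alpha = torch.nan_to_num(alpha)                  # isolated nodes -> zero rows
        return F.elu(alpha @ h)                          # updated node representations

# Usage with made-up dimensions (36 detected regions is a common detector setting):
layer = QuestionGuidedGATLayer(node_dim=2048, q_dim=1024, hidden_dim=512)
nodes = torch.randn(36, 2048)
question = torch.randn(1024)
adj = (torch.rand(36, 36) > 0.7).float()
updated = layer(nodes, question, adj)                    # (36, 512)
# One plausible multi-modal fusion for answer prediction (operator assumed):
# logits = classifier(updated.mean(dim=0) * layer.q_proj(question))
```

Conditioning the attention score on the question is what makes the node update "question-guided": the same scene graph yields different attention distributions, and hence different pooled representations, for different questions.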

Keywords: Visual question answering, Relational reasoning, Attention mechanism, Graph attention network, Multi-modal feature fusion


Paper URL: https://doi.org/10.1007/s11063-021-10689-2