Multimodal graph inference network for scene graph generation

Authors: Jingwen Duan, Weidong Min, Deyu Lin, Jianfeng Xu, Xin Xiong

Abstract

A scene graph describes an image concisely and structurally. However, existing scene graph generation methods have limited ability to infer certain relationships because they lack semantic information and depend heavily on the statistical distribution of the training set. To alleviate these problems, this study proposes a Multimodal Graph Inference Network (MGIN) comprising two modules: Multimodal Information Extraction (MIE) and Target with Multimodal Feature Inference (TMFI). MGIN improves the inference of relationship triplets, especially for uncommon samples. In the proposed MIE module, prior statistical knowledge of the training set is reprocessed before being incorporated into the network, which relieves overfitting to the training set. Visual and semantic features extracted by the MIE module are fused into unified multimodal features in the TMFI module. These features enable the inference module to improve the prediction capability of MGIN, particularly on uncommon samples. The proposed method achieves 27.0% average mean recall and 55.9% average recall, improvements of 0.48% and 0.50%, respectively, over state-of-the-art methods. It also increases the average recall of the 20 lowest-probability relationships by 4.91%.
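The abstract does not detail how visual and semantic features are fused or how the statistical prior enters the prediction, so the following is only a minimal sketch of the general idea, assuming a PyTorch-style implementation. The feature dimensions (2048-d visual, 300-d word-embedding semantic), the additive fusion, the 50 predicate classes (as in Visual Genome), and the optional log-frequency prior on the logits are all illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class MultimodalFusionSketch(nn.Module):
    """Hypothetical fusion of visual and semantic features into a unified
    multimodal representation, loosely in the spirit of the TMFI module.
    Every dimension and layer choice here is an assumption."""

    def __init__(self, visual_dim=2048, semantic_dim=300,
                 fused_dim=512, num_predicates=50):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, fused_dim)
        self.semantic_proj = nn.Linear(semantic_dim, fused_dim)
        self.classifier = nn.Linear(fused_dim, num_predicates)

    def forward(self, visual_feat, semantic_feat, freq_prior_logits=None):
        # Project each modality into a shared space, then fuse by addition.
        fused = torch.relu(self.visual_proj(visual_feat)
                           + self.semantic_proj(semantic_feat))
        logits = self.classifier(fused)
        # One common way to inject prior statistical knowledge of the
        # training set is to add a log-frequency prior to the logits
        # (hypothetical here; the paper may incorporate it differently).
        if freq_prior_logits is not None:
            logits = logits + freq_prior_logits
        return logits

# Illustrative usage with random placeholder tensors:
model = MultimodalFusionSketch()
v = torch.randn(8, 2048)    # pooled visual features of union regions
s = torch.randn(8, 300)     # word-embedding features of object labels
prior = torch.zeros(8, 50)  # log P(predicate | subject, object) from counts
scores = model(v, s, prior)
```

Fusing in a shared projected space, rather than classifying each modality separately, lets rare relationship triplets borrow evidence from semantic similarity, which is consistent with the abstract's claim of improved inference on uncommon samples.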

Keywords: Scene graph generation, Visual relationship detection, Image understanding, Semantic analysis

Paper link: https://doi.org/10.1007/s10489-021-02304-7