Hierarchical Temporal Fusion of Multi-grained Attention Features for Video Question Answering

Authors: Shaoning Xiao, Yimeng Li, Yunan Ye, Long Chen, Shiliang Pu, Zhou Zhao, Jian Shao, Jun Xiao

Abstract

This work addresses video question answering (VideoQA) with a novel model and a new open-ended VideoQA dataset. VideoQA is a challenging task in visual information retrieval that aims to generate an answer according to the video content and the question. Ultimately, VideoQA is a video understanding task, and efficiently combining multi-grained representations is the key to understanding a video. Existing works mostly rely on overall frame-level visual understanding, neglecting finer-grained and temporal information inside the video, or combine multi-grained representations simply by concatenation or addition. We therefore propose a multi-granularity temporal attention network that can search for the specific frames in a video that are holistically and locally related to the answer. We first learn mutual attention representations of the multi-grained visual content and the question. The mutually attended features are then fused hierarchically by a double-layer LSTM to generate the answer. Furthermore, we evaluate several alternative multi-grained fusion configurations to demonstrate the advantage of this hierarchical architecture. The effectiveness of our model is demonstrated on a large-scale video question answering dataset built on the ActivityNet dataset.
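
To make the described pipeline concrete, the following is a minimal sketch (not the authors' released code) of the two stages the abstract names: question-visual co-attention over two feature granularities, followed by a hierarchical double-layer LSTM that fuses the attended features before answer prediction. All module names, dimensions, and the residual attention form are illustrative assumptions.

```python
import torch
import torch.nn as nn


class CoAttention(nn.Module):
    """Mutual (co-)attention between question tokens and visual features."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, visual, question):
        # visual: (B, T, D) frame- or object-level features
        # question: (B, L, D) encoded question tokens
        affinity = torch.bmm(self.proj(visual), question.transpose(1, 2))  # (B, T, L)
        v2q = torch.softmax(affinity, dim=2)  # question weights per visual step
        q2v = torch.softmax(affinity, dim=1)  # visual weights per question token
        attended_visual = visual + torch.bmm(v2q, question)                    # (B, T, D)
        attended_question = question + torch.bmm(q2v.transpose(1, 2), visual)  # (B, L, D)
        return attended_visual, attended_question


class HierarchicalTemporalFusion(nn.Module):
    """Double-layer LSTM fusing object-level, then frame-level, attended features."""

    def __init__(self, dim, num_answers):
        super().__init__()
        self.obj_coattn = CoAttention(dim)
        self.frame_coattn = CoAttention(dim)
        self.lstm1 = nn.LSTM(dim, dim, batch_first=True)  # fine-grained (object) layer
        self.lstm2 = nn.LSTM(dim, dim, batch_first=True)  # holistic (frame) layer
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, obj_feats, frame_feats, question):
        # obj_feats: (B, T, D) object-level features pooled per frame
        # frame_feats: (B, T, D) frame-level appearance features
        obj_att, _ = self.obj_coattn(obj_feats, question)
        frame_att, _ = self.frame_coattn(frame_feats, question)
        h1, _ = self.lstm1(obj_att)         # temporal fusion of fine-grained cues
        h2, _ = self.lstm2(h1 + frame_att)  # hierarchical fusion with holistic cues
        return self.classifier(h2[:, -1])   # answer logits from the final state


# Example shapes: batch of 2, 16 frames, 20 question tokens, 256-dim features.
model = HierarchicalTemporalFusion(dim=256, num_answers=1000)
logits = model(torch.randn(2, 16, 256), torch.randn(2, 16, 256), torch.randn(2, 20, 256))
```

The hierarchy here, feeding the fine-grained LSTM output into the holistic layer rather than concatenating the two granularities, mirrors the abstract's contrast with simple concatenation or addition baselines.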

Keywords: Video question answering, Multi-grained representation, Temporal co-attention

Paper URL: https://doi.org/10.1007/s11063-019-10003-1