Pyramid regional graph representation learning for content-based video retrieval

Authors:

Highlights:

Abstract

Conventional video retrieval methods typically aggregate the visual feature representations of every frame into a single video-level feature, treating each frame as an isolated, static image. Such methods cannot model the intra-frame and inter-frame relationships among local regions, and are often vulnerable to the visual redundancy and noise introduced by various types of video transformation and editing, such as adding image patches or banners. From the perspective of video retrieval, a video’s key information is more often than not conveyed by geometrically centered, dynamic visual content, whereas static areas tend to reside in regions farther from the center and often exhibit heavy temporal visual redundancy. This phenomenon is rarely investigated by conventional retrieval methods. In this article, we propose an unsupervised video retrieval method that simultaneously models intra-frame and inter-frame contextual information for video representation, using a graph topology constructed on top of pyramid regional feature maps. By decomposing each frame into a pyramid regional sub-graph and thereby transforming a video into a regional graph, we use graph convolutional networks to extract features that incorporate information from multiple types of context. Our method is unsupervised and uses only the frame features extracted by a pre-trained network. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art video retrieval methods.
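To make the abstract's construction concrete, below is a minimal sketch of the pyramid-regional-graph idea, not the authors' implementation: it assumes a 1×1/2×2/4×4 pooling pyramid, parent–child intra-frame edges, same-region inter-frame edges between consecutive frames, a single vanilla GCN layer, and PyTorch as the framework. All of these specifics (including the names `GRIDS`, `video_graph`, and `gcn_layer`) are illustrative assumptions rather than details from the paper.

```python
# Hedged sketch: build a regional graph over pyramid feature maps and run
# one graph-convolution pass. Pyramid grids, edge scheme, and the GCN layer
# are assumptions for illustration, not the paper's exact design.
import torch
import torch.nn.functional as F

GRIDS = [1, 2, 4]                             # assumed pyramid pooling grids
NODES_PER_FRAME = sum(g * g for g in GRIDS)   # 1 + 4 + 16 = 21 regions/frame

def pyramid_nodes(feat_map):
    """Pool a (C, H, W) frame feature map into (21, C) region descriptors."""
    parts = []
    for g in GRIDS:
        pooled = F.adaptive_avg_pool2d(feat_map.unsqueeze(0), g)  # (1, C, g, g)
        parts.append(pooled.squeeze(0).flatten(1).t())            # (g*g, C)
    return torch.cat(parts, dim=0)

def video_graph(frame_maps):
    """List of (C, H, W) maps -> node features X and adjacency A for a video."""
    X = torch.cat([pyramid_nodes(f) for f in frame_maps], dim=0)
    T = len(frame_maps)
    N = T * NODES_PER_FRAME
    A = torch.zeros(N, N)
    for t in range(T):
        off = t * NODES_PER_FRAME
        # Intra-frame edges: each 2x2 region to the 1x1 root (index 0), and
        # each 4x4 region to its 2x2 parent one pyramid level up.
        for i in range(4):                       # 2x2 level at indices 1..4
            A[off, off + 1 + i] = A[off + 1 + i, off] = 1.0
        for r in range(4):                       # 4x4 level at indices 5..20
            for c in range(4):
                child = off + 5 + 4 * r + c
                parent = off + 1 + 2 * (r // 2) + (c // 2)
                A[child, parent] = A[parent, child] = 1.0
        # Inter-frame edges: same region index in consecutive frames.
        if t + 1 < T:
            nxt = off + NODES_PER_FRAME
            for i in range(NODES_PER_FRAME):
                A[off + i, nxt + i] = A[nxt + i, off + i] = 1.0
    return X, A

def gcn_layer(X, A, W):
    """One GCN pass: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    A_hat = A + torch.eye(A.size(0))             # add self-loops
    d_inv_sqrt = torch.diag(A_hat.sum(dim=1).rsqrt())
    return F.relu(d_inv_sqrt @ A_hat @ d_inv_sqrt @ X @ W)

# Usage: 8 frames of CNN feature maps (e.g., 512 channels, 14x14 spatial)
frames = [torch.randn(512, 14, 14) for _ in range(8)]
X, A = video_graph(frames)
W = torch.randn(512, 128)                        # projection to embedding space
Z = gcn_layer(X, A, W)                           # contextualized region features
video_descriptor = Z.mean(dim=0)                 # pool nodes into one retrieval vector
```

In this toy version, the intra-frame parent–child links let each region aggregate context within its frame, while the inter-frame links propagate information across time, mirroring the two kinds of context the abstract says the regional graph is meant to capture.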

Keywords: Graph embedding, Video retrieval, Regional graph, Pyramid feature map

Article history: Received 31 July 2020, Revised 24 December 2020, Accepted 26 December 2020, Available online 11 January 2021, Version of Record 11 January 2021.

Article link: https://doi.org/10.1016/j.ipm.2020.102488