Multi-semantic long-range dependencies capturing for efficient video representation learning

Authors:

Highlights:

Abstract:

Capturing long-range dependencies has proven effective for video understanding tasks. However, previous works address this problem at the level of pixel pairs, which can be inaccurate since pixel pairs carry only limited semantic information. Moreover, those methods introduce considerable computation and additional parameters. Following the feature-aggregation pattern of Graph Convolutional Networks (GCNs), we aggregate pixels with their neighbors into semantic units, which carry stronger semantic information than pixel pairs. We design an efficient, parameter-free, semantic-unit-based dependency-capturing framework, named the Multi-semantic Long-range Dependencies Capturing (MLDC) block. We verify our method on a large-scale, challenging video classification benchmark, Kinetics. Experiments demonstrate that our method substantially outperforms pixel-pair-based methods and achieves state-of-the-art performance without introducing any parameters or much computation.
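The abstract only outlines the idea, so the MLDC block itself is not reproduced here. The following is a minimal sketch of the general pattern it describes: pixels are pooled with their neighbors into "semantic units", and long-range dependencies are then captured between units rather than between pixel pairs, using only parameter-free operations (pooling, dot-product affinities, softmax). The pooling size, affinity function, and all helper names are assumptions for illustration, not the paper's actual design.

```python
import numpy as np

def pool_units(x, k):
    # Hypothetical helper: aggregate each k x k pixel neighborhood into one
    # "semantic unit" by average pooling (stride k). x has shape (C, H, W).
    C, H, W = x.shape
    return x.reshape(C, H // k, k, W // k, k).mean(axis=(2, 4))

def parameter_free_long_range(x, k=2):
    # Sketch of parameter-free long-range dependency capturing over semantic
    # units: dot-product affinities between pooled units, row-wise softmax
    # normalization, then feature aggregation. Design choices are assumed.
    C, H, W = x.shape
    units = pool_units(x, k)                  # (C, H/k, W/k) semantic units
    u = units.reshape(C, -1)                  # (C, N), N = (H/k) * (W/k)
    affinity = u.T @ u                        # (N, N) unit-to-unit similarity
    affinity = np.exp(affinity - affinity.max(axis=1, keepdims=True))
    affinity /= affinity.sum(axis=1, keepdims=True)   # softmax over units
    out = u @ affinity.T                      # aggregate unit features, (C, N)
    return out.reshape(C, H // k, W // k)
```

Because the block uses only pooling and matrix products, it adds no learnable parameters, and computing affinities over N = HW/k² units instead of HW pixels reduces the cost of the pairwise step by a factor of roughly k⁴.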

Keywords: Video representation learning, Long-range dependencies capturing, Video classification

Article history: Received 23 June 2020, Accepted 19 July 2020, Available online 3 August 2020, Version of Record 26 August 2020.

DOI: https://doi.org/10.1016/j.imavis.2020.103988