Video representation learning by identifying spatio-temporal transformations

Authors: Sheng Geng, Shimin Zhao, Hu Liu

Abstract

Self-supervised learning has become a prevalent paradigm in both the image and video domains due to the difficulty of obtaining large amounts of annotated data. In this paper, we adopt the self-supervised learning paradigm and propose to learn 3D video representations by identifying spatio-temporal transformations. Specifically, we choose a set of transformations and apply them to unlabelled videos to change their spatio-temporal structure. By identifying these spatio-temporal transformations, the network learns knowledge about both the spatial appearance and the temporal relations of video frames. In this paper, we choose spatio-temporal rotations as the transformations. We conduct extensive experiments to validate the effectiveness of the proposed method. After fine-tuning on action recognition benchmarks, our model yields remarkable gains of 29.6% on UCF101 and 25.1% on HMDB51 over models trained from scratch, placing it among current state-of-the-art methods.
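The pretext task above can be sketched in a few lines: apply each transformation from a fixed set to an unlabelled clip, and use the transformation's index as a free pseudo-label for classification. The following is a minimal NumPy sketch assuming a hypothetical transformation set of four spatial rotations (0°, 90°, 180°, 270°), each with and without time reversal (eight classes); the paper's exact set of spatio-temporal rotations may differ.

```python
import numpy as np

def spatiotemporal_rotations(clip):
    """Generate pseudo-labelled pairs from one unlabelled clip.

    clip: array of shape (T, H, W) -- a short greyscale video clip.
    Returns a list of (transformed_clip, label) pairs, one per
    transformation in the (assumed) eight-class set.
    """
    pairs = []
    for reverse in (False, True):
        # optionally reverse the temporal order of the frames
        t = clip[::-1] if reverse else clip
        for k in range(4):
            # rotate every frame in the spatial plane by k * 90 degrees
            rotated = np.rot90(t, k=k, axes=(1, 2))
            label = k + (4 if reverse else 0)
            pairs.append((rotated, label))
    return pairs
```

A recognition network would then be trained to predict `label` from `rotated`, so that solving the pretext task forces it to model both spatial appearance (to detect the rotation angle) and temporal structure (to detect time reversal), before fine-tuning on a labelled benchmark.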

Keywords: Self-supervised learning, 3D video representation, Unlabelled videos, Spatio-temporal transformations

DOI: https://doi.org/10.1007/s10489-021-02790-9