Cycle representation-disentangling network: learning to completely disentangle spatial-temporal features in video

Authors: Pengfei Sun, Xin Su, Shangqi Guo, Feng Chen

Abstract

Video representation learning is a fundamental problem in video understanding. However, the complex, entangled spatiotemporal information in frames makes it a challenging task. Many studies have attempted to decompose video representations into dynamic and static features; existing works use either a two-stream structure or a compression strategy to learn static and motion features from videos. However, neither approach can guarantee that the network learns features with a low degree of coupling. To address this problem, we propose the exchangeable property (EP), a constraint that encourages networks to learn disentangled features from videos. Based on the EP, we propose a novel network called the Cycle Representation-Disentangling Network (CRD-Net), which adopts a strategy of exchanging features and reconstructing videos to factorize videos into stationary and temporally varying components. CRD-Net adopts a new training paradigm: it is trained on paired videos with different static features but similar dynamic features. In addition, we introduce the pair loss and the cycle loss, which force the motion encoder to discard time-invariant features, and the consistency loss, which forces the static encoder to discard features that vary over time within a video. In experiments, we demonstrate the advantage of CRD-Net in completely disentangling video features and obtain better results than the state of the art on several video understanding tasks.
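The abstract does not include implementation details, but the training paradigm it describes (paired videos, feature exchange and reconstruction, plus the pair, cycle, and consistency losses) can be sketched as follows. This is a minimal illustrative sketch in a PyTorch style, not the authors' code: the module names (static_enc, motion_enc, decoder), the tensor layout, and the exact form of each loss are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F

def crd_training_step(static_enc, motion_enc, decoder, v1, v2):
    """Illustrative training step on a paired batch.

    v1, v2: paired videos of shape (B, T, C, H, W) with different static
    content but similar motion, as the abstract describes. All loss forms
    below are assumed stand-ins, not the paper's exact formulations.
    """
    # Factorize each video into a static (time-invariant) code and a
    # motion (time-variant) code.
    s1, s2 = static_enc(v1), static_enc(v2)
    m1, m2 = motion_enc(v1), motion_enc(v2)

    # Plain reconstruction with the original static/motion pairing.
    loss_rec = F.mse_loss(decoder(s1, m1), v1) + F.mse_loss(decoder(s2, m2), v2)

    # Exchangeable property (pair loss): swapping the similar motion codes
    # across the pair should still reconstruct each original video.
    swap_1 = decoder(s1, m2)
    swap_2 = decoder(s2, m1)
    loss_pair = F.mse_loss(swap_1, v1) + F.mse_loss(swap_2, v2)

    # Cycle loss: re-encoding the swapped reconstructions should recover the
    # motion codes that produced them, discouraging static leakage into them.
    loss_cycle = (F.mse_loss(motion_enc(swap_1), m2)
                  + F.mse_loss(motion_enc(swap_2), m1))

    # Consistency loss: static codes from different temporal clips of the
    # same video should agree, discouraging time-variant information in them.
    half = v1.shape[1] // 2
    loss_consist = F.mse_loss(static_enc(v1[:, :half]), static_enc(v1[:, half:]))

    return loss_rec + loss_pair + loss_cycle + loss_consist
```

In this sketch, the pair and cycle terms push time-invariant information out of the motion code, while the consistency term pushes time-variant information out of the static code, which is the division of labor the abstract attributes to the three losses.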

Keywords: Representation-disentangling, Representation-decoupling, Video representation learning, Motion and static features

Paper link: https://doi.org/10.1007/s10489-020-01750-z