Spatio-temporal deformable 3D ConvNets with attention for action recognition

Highlights:

• We are the first to propose spatio-temporal deformable 3D convolutions with an attention mechanism (STDA for short).

• The proposed module is generic and can be used with many 3D CNNs; in practice it only needs to be appended at a later convolution layer, so it adds little computational cost.

• Our attention mechanism exploits both long-range temporal dependencies across multiple frames and long-distance spatial dependencies within each frame, and thus helps extract discriminative global information at both the inter-frame and intra-frame levels.

• Experiments validate the superior performance and efficiency of the proposed approach.

Article history: Received 19 April 2019, Revised 8 August 2019, Accepted 3 September 2019, Available online 7 September 2019, Version of Record 13 September 2019.