Learning motion representation for real-time spatio-temporal action localization

Highlights:

• The interaction between the action detector and the flow subnet enables the detector to learn parameters from appearance and motion simultaneously, and guides the flow subnet to compute task-specific optical flow.

• An effective fusion method combines deep appearance and optical flow features in a multi-scale fashion. The multi-scale temporal and spatial features are combined interactively to model a more discriminative spatio-temporal action representation.

• The presented method is the first to achieve real-time computation while using both RGB appearance and optical flow. It outperforms the state-of-the-art method [1] by 1.3% in accuracy.

Abstract

• We propose a novel method for localizing human actions in videos spatio-temporally by integrating an optical flow subnet. The new architecture performs action localization and optical flow estimation jointly in an end-to-end manner.
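The multi-scale fusion of appearance and optical flow features named in the highlights can be illustrated with a toy sketch. The source does not state which fusion operator is used, so the element-wise sum below is an assumption, as are the function names `fuse_scale` and `fuse_multi_scale` and the toy feature values. Plain Python lists stand in for the deep feature tensors a real detector would produce.

```python
# Hypothetical sketch: fuse appearance (RGB) and optical-flow feature maps
# at every scale of a feature pyramid. Element-wise sum is an ASSUMPTION;
# the source does not name the actual fusion operator.

def fuse_scale(appearance, flow):
    """Element-wise sum of two equally shaped 2-D feature maps (lists of rows)."""
    same_rows = len(appearance) == len(flow)
    if not same_rows or any(len(a) != len(f) for a, f in zip(appearance, flow)):
        raise ValueError("feature maps at the same scale must share a shape")
    return [
        [a + f for a, f in zip(row_a, row_f)]
        for row_a, row_f in zip(appearance, flow)
    ]


def fuse_multi_scale(appearance_pyramid, flow_pyramid):
    """Fuse the two streams scale by scale; returns one fused map per scale."""
    if len(appearance_pyramid) != len(flow_pyramid):
        raise ValueError("both streams must provide the same number of scales")
    return [fuse_scale(a, f) for a, f in zip(appearance_pyramid, flow_pyramid)]


# Toy pyramids with two scales: a 4x4 map and a 2x2 map per stream.
appearance_pyramid = [
    [[1.0] * 4 for _ in range(4)],
    [[2.0] * 2 for _ in range(2)],
]
flow_pyramid = [
    [[0.5] * 4 for _ in range(4)],
    [[0.25] * 2 for _ in range(2)],
]

fused = fuse_multi_scale(appearance_pyramid, flow_pyramid)
for level, feature_map in enumerate(fused):
    print(f"scale {level}: {len(feature_map)}x{len(feature_map[0])} fused map")
```

In this sketch each scale is fused independently, so the fused pyramid keeps the spatial resolution of its inputs and a detection head could read from every level. Concatenating the two streams along a channel axis would be an equally plausible choice under the same description.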