Efficient dual attention SlowFast networks for video action recognition

Authors:

Highlights:

Abstract

Video data differ from static image data mainly in the temporal dimension. Many video action recognition networks adopt two-stream models that learn spatial and temporal information separately and then fuse them to improve performance. We propose a cross-modality dual attention fusion module, named CMDA, that explicitly exchanges spatial–temporal information between the two pathways of two-stream SlowFast networks. In addition, considering the computational cost of these heavy models and the low accuracy of existing lightweight models, we propose several efficient two-stream SlowFast networks built on well-designed efficient 2D networks such as GhostNet and ShuffleNetV2. Experiments demonstrate that the proposed fusion module CMDA improves the performance of SlowFast, and that our efficient two-stream models achieve a consistent increase in accuracy with only a small overhead in FLOPs. Our code and pre-trained models will be made available at https://github.com/weidafeng/Efficient-SlowFast.
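
The abstract does not spell out how CMDA is built, so the following is only a generic sketch of what "exchanging information between two pathways through attention" can look like, not the authors' module. It assumes a slow pathway with many channels and few frames, a fast pathway with few channels and `alpha` times more frames, a channel gate computed from the fast features and applied to the slow ones, and a spatial–temporal gate computed from the slow features and applied to the fast ones. The function name `cross_pathway_gate`, the projection matrix `w_fast_to_slow`, and all shapes are hypothetical choices made for illustration.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function, maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def cross_pathway_gate(slow, fast, w_fast_to_slow, alpha):
    """Illustrative two-way attention gating between two pathways.

    slow: array of shape (C_s, T_s, H, W)
    fast: array of shape (C_f, T_f, H, W) with T_f == alpha * T_s
    w_fast_to_slow: hypothetical projection of shape (C_s, C_f)
    This is an assumed design, NOT the CMDA module from the paper.
    """
    # Channel attention: summarise the fast pathway per channel,
    # project to the slow pathway's channel count, squash to (0, 1).
    fast_desc = fast.mean(axis=(1, 2, 3))                # (C_f,)
    channel_gate = sigmoid(w_fast_to_slow @ fast_desc)   # (C_s,)
    slow_out = slow * channel_gate[:, None, None, None]

    # Spatial-temporal attention: average the slow pathway over channels,
    # then repeat each slow frame alpha times to match the fast frame count.
    st_gate = sigmoid(slow.mean(axis=0))                 # (T_s, H, W)
    st_gate = np.repeat(st_gate, alpha, axis=0)          # (T_f, H, W)
    fast_out = fast * st_gate[None, :, :, :]

    return slow_out, fast_out


# Toy inputs with made-up sizes; real feature maps would come from a backbone.
rng = np.random.default_rng(0)
alpha = 4
slow = rng.standard_normal((16, 2, 7, 7))
fast = rng.standard_normal((4, 2 * alpha, 7, 7))
w_fast_to_slow = 0.1 * rng.standard_normal((16, 4))

slow_out, fast_out = cross_pathway_gate(slow, fast, w_fast_to_slow, alpha)
print(slow_out.shape, fast_out.shape)
```

Because both gates lie strictly between 0 and 1, each output keeps the shape of its input and can only attenuate features. A trained module would typically learn the projection and add the gated features back through a residual connection; both points are assumptions here rather than details taken from the paper.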

Keywords:

Review history: Received 20 November 2020, Revised 5 April 2022, Accepted 14 June 2022, Available online 21 June 2022, Version of Record 30 June 2022.

Official paper URL: https://doi.org/10.1016/j.cviu.2022.103484