A novel spatiotemporal attention enhanced discriminative network for video salient object detection

Authors: Bing Liu, Kezhou Mu, Mingzhu Xu, Fangyuan Wang, Lei Feng

Abstract

In contrast to image salient object detection, where substantial progress has been made, video salient object detection remains a considerable challenge. Not all features are useful for salient object detection, and some even introduce interference. In this paper, we propose a novel multiscale spatiotemporal ConvLSTM model based on an attention mechanism. It introduces spatial and channel-wise attention mechanisms and improves the network's ability to extract both high-level semantic information and low-level spatial structural features. First, to obtain more effective spatiotemporal information, a ConvLSTM module with an embedded attention mechanism (CSAtt-ConvLSTM) is placed at the higher layers of the network to weight the salient features among the extracted spatiotemporally consistent features. Second, a multiscale attention (MSA) module for discriminating between features is designed, which introduces two attention mechanisms: channel-wise attention (CA) units and spatial-wise attention (SA) units. The CA units are applied to the high-level feature maps produced by the CSAtt-ConvLSTM module, and the SA units are applied to the shallow feature maps; their outputs are then fused into the final output feature maps. Extensive experiments on multiple datasets verify the effectiveness of the proposed model, which runs at a real-time speed of 20 fps on a single GPU.
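The abstract does not give the internals of the CA and SA units, so the following is only a minimal NumPy sketch of the general idea, not the authors' implementation. It assumes a squeeze-and-excitation style channel gate, a spatial gate built from channel-wise average and max maps, and element-wise summation as the fusion step. All function names, weight shapes, and the reduction ratio are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(feat, w1, w2):
    """Assumed CA unit: reweight each channel of a (C, H, W) feature map."""
    squeezed = feat.mean(axis=(1, 2))          # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)    # bottleneck + ReLU -> (C // r,)
    gate = sigmoid(w2 @ hidden)                # per-channel weight in (0, 1)
    return feat * gate[:, None, None]


def spatial_attention(feat, w_avg=1.0, w_max=1.0):
    """Assumed SA unit: reweight each spatial position of a (C, H, W) map."""
    avg_map = feat.mean(axis=0)                # (H, W)
    max_map = feat.max(axis=0)                 # (H, W)
    gate = sigmoid(w_avg * avg_map + w_max * max_map)
    return feat * gate[None, :, :]


def fuse(high, low):
    """Assumed fusion: element-wise sum of same-shaped feature maps."""
    return high + low


# Toy inputs standing in for the two feature streams (hypothetical shapes).
rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 4
high = rng.standard_normal((C, H, W))   # stands in for CSAtt-ConvLSTM output
low = rng.standard_normal((C, H, W))    # stands in for shallow features
w1 = rng.standard_normal((C // r, C))   # hypothetical bottleneck weights
w2 = rng.standard_normal((C, C // r))

out = fuse(channel_attention(high, w1, w2), spatial_attention(low))
print(out.shape)
```

In this sketch both gates lie in (0, 1), so each unit can only attenuate features. That matches the abstract's motivation of suppressing features that interfere with saliency, but the actual units in the paper may differ.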

Keywords: Video salient object detection, Attention mechanism, Multiscale, CSAtt-ConvLSTM

Paper URL: https://doi.org/10.1007/s10489-021-02649-z