Video anomaly detection with spatio-temporal dissociation

Authors:

Highlights:

• We propose a novel autoencoder architecture that dissociates the spatio-temporal representation and learns regularity in both the spatial and motion feature spaces to detect anomalies in videos.

• We design an efficient motion autoencoder that takes consecutive video frames as input and predicts the RGB difference as output, imitating the motion captured by optical flow. The proposed method is much faster than optical flow-based motion representation learning, achieving an average processing speed of 32 fps.

• We exploit a variance attention module that automatically assigns importance weights to the moving parts of video clips, which helps improve the performance of the motion autoencoder.

• To learn normality in both the spatial and motion feature spaces, we concatenate the representations extracted by the two streams at the same spatial location, and jointly optimize the two streams and the deep K-means clustering with an early-fusion strategy.

• We fuse the spatio-temporal information with its distance from the deep K-means cluster centers at the pixel level to compute the anomaly score. Compared with our prior frame-level fusion scheme, experimental results show that the new architecture improves performance.
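The pipeline sketched in the highlights can be illustrated with a minimal NumPy toy, under loose assumptions: `rgb_difference` stands in for the motion autoencoder's target (RGB difference as an optical-flow surrogate), `variance_attention` weights spatial locations by temporal variance so moving regions dominate, and `anomaly_score` measures the distance of fused features to their nearest K-means center. All function names, shapes, and the normalization are illustrative, not the paper's implementation.

```python
import numpy as np

def rgb_difference(frames):
    """Motion cue: RGB difference between consecutive frames,
    used as a cheap surrogate for optical flow.
    frames: (T, H, W, C) float array -> (T-1, H, W, C)."""
    return frames[1:] - frames[:-1]

def variance_attention(frames, eps=1e-8):
    """Assign each spatial location a weight proportional to the
    temporal variance of its pixel values, so moving parts of the
    clip receive higher attention. Returns an (H, W) map in [0, 1]."""
    var = frames.var(axis=0).mean(axis=-1)  # variance over time, averaged over channels
    return var / (var.max() + eps)

def anomaly_score(features, centers):
    """Pixel-level anomaly score: Euclidean distance of each fused
    spatio-temporal feature vector to its nearest cluster center.
    features: (N, D), centers: (K, D) -> (N,) scores."""
    d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=-1)
    return d.min(axis=1)

# Toy usage: one pixel "moves" in frame 2, so it gets maximal attention.
frames = np.zeros((4, 2, 2, 3))
frames[2, 0, 0] = 1.0
att = variance_attention(frames)      # att[0, 0] is (near) 1.0, rest 0
diff = rgb_difference(frames)         # shape (3, 2, 2, 3)

# Features near a learned center score low; far-away features score high.
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
normal = np.array([[0.1, 0.0]])
scores = anomaly_score(normal, centers)
```

At test time, a feature vector lying far from every cluster center yields a high score, which is the intuition behind using the deep K-means distance as the anomaly signal.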


Keywords: Video anomaly detection, Spatio-temporal dissociation, Simulate motion of optical flow, Deep K-means cluster

Article history: Received 29 December 2020, Revised 23 June 2021, Accepted 27 July 2021, Available online 5 August 2021, Version of Record 22 August 2021.

DOI: https://doi.org/10.1016/j.patcog.2021.108213