Learning facial action units with spatiotemporal cues and multi-label sampling

Authors:

Highlights:

Abstract

Facial action units (AUs) can be represented spatially, temporally, and in terms of their correlation. Previous research has typically focused on only one of these aspects or addressed them disjointly. We propose a hybrid network architecture that jointly models spatial and temporal representations and their correlation. In particular, we use a Convolutional Neural Network (CNN) to learn spatial representations and a Long Short-Term Memory (LSTM) network to model temporal dependencies among them. The outputs of the CNNs and LSTMs are aggregated by a fusion network to produce per-frame predictions of multiple AUs. The hybrid network was compared to previous state-of-the-art approaches on two large FACS-coded video databases, GFT and BP4D, with over 400,000 AU-coded frames of spontaneous facial behavior in varied social contexts. Relative to a standard multi-label CNN and feature-based state-of-the-art approaches, the hybrid system reduced person-specific biases and achieved higher accuracy for AU detection. To address class imbalance within and between batches during network training, we introduce multi-label sampling strategies that further increase accuracy when AUs are relatively sparse. Finally, we provide visualizations of the learned AU models, which, to the best of our knowledge, reveal for the first time how machines see AUs.
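To make the hybrid architecture concrete, below is a minimal, hypothetical PyTorch sketch of the CNN + LSTM + fusion idea described in the abstract. The layer sizes, module names, and fusion-by-concatenation choice are assumptions for illustration, not the authors' exact architecture; the key structure is a per-frame CNN for spatial features, an LSTM over those features for temporal dependencies, and a fusion layer producing per-frame multi-label (sigmoid) AU outputs.

```python
# Minimal sketch of a CNN + LSTM + fusion network for multi-label AU detection.
# All sizes and the concatenation-based fusion are illustrative assumptions.
import torch
import torch.nn as nn

class HybridAUNet(nn.Module):
    def __init__(self, num_aus=12, feat_dim=256, lstm_dim=128):
        super().__init__()
        # CNN: learns a spatial representation for each frame.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # LSTM: models temporal dependencies among per-frame CNN features.
        self.lstm = nn.LSTM(feat_dim, lstm_dim, batch_first=True)
        # Fusion: combines spatial and temporal streams into per-frame AU scores.
        self.fusion = nn.Linear(feat_dim + lstm_dim, num_aus)

    def forward(self, frames):                    # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        spatial = self.cnn(frames.flatten(0, 1))  # (B*T, feat_dim)
        spatial = spatial.view(b, t, -1)          # (B, T, feat_dim)
        temporal, _ = self.lstm(spatial)          # (B, T, lstm_dim)
        logits = self.fusion(torch.cat([spatial, temporal], dim=-1))
        return torch.sigmoid(logits)              # per-frame AU probabilities (B, T, num_aus)

# Example: 2 clips of 8 frames each, 12 AUs.
model = HybridAUNet()
probs = model(torch.randn(2, 8, 3, 64, 64))
print(probs.shape)  # torch.Size([2, 8, 12])
```

Because each AU is predicted with an independent sigmoid, the model naturally handles multiple co-occurring AUs per frame; the multi-label sampling strategies mentioned in the abstract would then act on how frames are drawn into training batches so that sparse AUs are adequately represented.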

Keywords: Multi-label learning, Deep learning, Spatio-temporal learning, Multi-label sampling, Facial action unit detection, Video analysis

Article history: Received 16 October 2017, Revised 17 May 2018, Accepted 22 October 2018, Available online 28 October 2018, Version of Record 27 November 2018.

DOI: https://doi.org/10.1016/j.imavis.2018.10.002