Deep learning based multimodal emotion recognition using model-level fusion of audio–visual modalities

Authors:

Highlights:

• Deep learning-based feature extractor networks for video and audio data are proposed.

• Model-level fusion of video and audio features for multimodal emotion recognition.

• Case studies for exploring the variability of emotional states in audio–visual media.

• Evaluation of the performance of the models for multimodal emotion recognition.

Keywords: Multimodal emotion recognition, Audio features, Video features, Classification, Deep learning

Review history: Received 8 October 2021, Revised 31 January 2022, Accepted 9 March 2022, Available online 16 March 2022, Version of Record 23 March 2022.

Official paper URL: https://doi.org/10.1016/j.knosys.2022.108580