A multimodal hierarchical approach to speech emotion recognition from audio and text

Authors:

Highlights:

• A deep learning-based hierarchical approach to SER is developed and successfully tested.

• The developed approach is successfully tested on both unimodal and multimodal datasets.

• Audio and lexical features are used as the modalities.

• Application of ELMo, for extracting lexical features, is successfully demonstrated.

• The proposed hierarchical approach is found to be superior to its recent counterparts.

Keywords: Speech emotion recognition, Hierarchical approach, Multimodal, Deep learning, Lexical features

Review history: Received 27 December 2020, Revised 15 July 2021, Accepted 16 July 2021, Available online 22 July 2021, Version of Record 7 August 2021.

Official paper URL: https://doi.org/10.1016/j.knosys.2021.107316