Ensemble transfer learning-based multimodal sentiment analysis using weighted convolutional neural networks

Abstract

Users of social networks continuously share large amounts of multimodal content and comments that mix text, images, and emoji. Most of these comments carry emotional content, which makes multimodal sentiment analysis (MSA) an important and attractive research topic. This paper proposes a hybrid MSA model that combines ensemble transfer learning with weighted convolutional neural networks. The extended Dempster–Shafer (Yager) theory is used to fuse the outputs of the text and image classifiers and determine the final polarity at the decision level. First, the pre-trained VGG16 network is fine-tuned on the MVSA-Multiple and T4SA datasets to extract visual features for image sentiment classification. The Mask-RCNN model is then used to detect the objects in the images and convert them to text. The BERT model receives this output together with the textual descriptions of the images, embeds the words, and extracts the text features. The output of the BERT model is then fed into a weighted convolutional neural network ensemble (WCNNE). The texts are classified by several weak learners using AdaBoost, an ensemble learning technique in which classifiers are trained sequentially. Combining several weak classifiers yields a strong classifier. The WCNNE improves performance and increases the accuracy of the results. Finally, in the decision-level fusion phase, the outputs of the VGG16 and WCNNE models are merged using the extended Dempster–Shafer theory to obtain the final sentiment label. Experiments on the MVSA-Multiple and T4SA datasets show that the proposed model outperforms the compared methods, achieving an accuracy of 0.9348 on MVSA-Multiple and 0.9689 on T4SA.
Moreover, the proposed model reduces training time through transfer learning, and the proposed AdaBoostCNN achieves better results than a single CNN.
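
To illustrate the decision-level fusion step, the following is a minimal sketch of Yager's variant of the Dempster–Shafer combination rule, in which the conflicting mass is assigned to the whole frame of discernment rather than normalized away. It is not the implementation used in this work. The three-class label set, the reliability values, the example probabilities, and the helper names `to_mass`, `yager_combine`, and `decide` are assumptions made for illustration only.

```python
from itertools import product

# Assumed frame of discernment: three sentiment labels (illustrative only).
FRAME = frozenset({"positive", "neutral", "negative"})


def to_mass(probs, reliability):
    """Turn class probabilities into a mass function.

    Each probability is discounted by an assumed reliability factor in [0, 1];
    the remaining mass is assigned to the whole frame (ignorance).
    """
    mass = {frozenset({label}): reliability * p for label, p in probs.items()}
    mass[FRAME] = mass.get(FRAME, 0.0) + (1.0 - reliability)
    return mass


def yager_combine(m1, m2):
    """Combine two mass functions with Yager's rule.

    Products of focal sets with a non-empty intersection go to that
    intersection; products of disjoint focal sets (conflict) are added
    to the mass of the whole frame instead of being renormalized.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb
    combined[FRAME] = combined.get(FRAME, 0.0) + conflict
    return combined


def decide(mass):
    """Return the single label with the largest combined mass."""
    singletons = {next(iter(s)): w for s, w in mass.items() if len(s) == 1}
    return max(singletons, key=singletons.get)


# Hypothetical classifier outputs for one image-text pair.
image_probs = {"positive": 0.6, "neutral": 0.3, "negative": 0.1}
text_probs = {"positive": 0.7, "neutral": 0.2, "negative": 0.1}

fused = yager_combine(to_mass(image_probs, 0.8), to_mass(text_probs, 0.9))
label = decide(fused)
```

Because conflict is moved to the frame rather than divided out, the fused masses still sum to one, and strong disagreement between the two classifiers shows up as increased ignorance instead of an inflated winning class.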