Attention-based Multi-modal Sentiment Analysis and Emotion Detection in Conversation using RNN

Authors

M. G. Huddar, S. S. Sannakki, and V. S. Rajpurohit

DOI:

https://doi.org/10.9781/ijimai.2020.07.004

Keywords:

Attention Model, Interlocutor State, Context Awareness, Emotion Recognition, Multimodal, Sentiment Analysis

Abstract

With the availability of an enormous quantity of multimodal data and its widespread applications, automatic sentiment analysis and emotion classification in conversation have become an interesting research topic in the research community. The interlocutor state, the contextual state between neighboring utterances, and multimodal fusion play an important role in multimodal sentiment analysis and emotion detection in conversation. In this article, a recurrent neural network (RNN) based method is developed to capture the interlocutor state and the contextual state between utterances. A pair-wise attention mechanism is used to model the relationship between the modalities and their relative importance before fusion. First, the modalities are fused two at a time, and finally all the modalities are fused to form the trimodal representation feature vector. Experiments are conducted on three standard datasets: IEMOCAP, CMU-MOSEI, and CMU-MOSI. The proposed model is evaluated using two metrics, accuracy and F1-score, and the results demonstrate that it performs better than the standard baselines.
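The paper does not publish an implementation, but the fusion scheme in the abstract can be sketched concretely. The following PyTorch snippet is a minimal illustration under stated assumptions, not the authors' code: the PairwiseAttentionFusion module, the GRU context encoder, and all tensor dimensions are invented for the example. It runs an RNN over the utterance sequence (standing in for the contextual state), attends each pair of modalities over one another, and concatenates the three bimodal results into a trimodal feature vector.

    # Minimal, illustrative sketch of the fusion scheme described in the abstract.
    # NOT the authors' implementation: module names, dimensions, and the toy data
    # below are assumptions made purely for demonstration.
    import torch
    import torch.nn as nn

    class PairwiseAttentionFusion(nn.Module):
        """Attend two modalities over each other and fuse them per utterance."""
        def __init__(self, dim):
            super().__init__()
            self.proj = nn.Linear(4 * dim, dim)  # compress [x, x_att, y, y_att]

        def forward(self, x, y):
            # x, y: (batch, T, dim) utterance-level features of two modalities.
            scores = torch.matmul(x, y.transpose(1, 2))                     # (batch, T, T)
            x_att = torch.matmul(torch.softmax(scores, dim=-1), y)          # y attended by x
            y_att = torch.matmul(torch.softmax(scores.transpose(1, 2), dim=-1), x)
            return self.proj(torch.cat([x, x_att, y, y_att], dim=-1))       # (batch, T, dim)

    dim = 100
    text = torch.randn(8, 20, dim)   # toy batch: 8 conversations, 20 utterances each
    audio = torch.randn(8, 20, dim)
    video = torch.randn(8, 20, dim)

    # An RNN pass over the utterance sequence stands in for the contextual
    # state between neighboring utterances that the abstract describes.
    context_rnn = nn.GRU(dim, dim, batch_first=True)
    text, _ = context_rnn(text)

    # Bimodal fusion of the two-two combinations, then trimodal concatenation.
    fuse = PairwiseAttentionFusion(dim)
    bimodal = [fuse(text, audio), fuse(text, video), fuse(audio, video)]
    trimodal = torch.cat(bimodal, dim=-1)  # (8, 20, 3 * dim) trimodal feature vector
    print(trimodal.shape)                  # torch.Size([8, 20, 300])

In practice one would likely train a separate fusion module per modality pair and feed the trimodal vector to a classifier; the sketch only shows the tensor flow of the "two at a time, then all together" scheme.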

Published

2021-06-01

How to Cite

Huddar, M. G., Sannakki, S. S., and Rajpurohit, V. S. (2021). Attention-based Multi-modal Sentiment Analysis and Emotion Detection in Conversation using RNN. International Journal of Interactive Multimedia and Artificial Intelligence, 6(6), 112–121. https://doi.org/10.9781/ijimai.2020.07.004