
A Deep Multi-task Contextual Attention Framework for Multi-modal Affect Analysis

Published: 13 May 2020

Abstract

Multi-modal affect analysis (e.g., sentiment and emotion analysis) is an interdisciplinary study and has become an emerging and prominent field in Natural Language Processing and Computer Vision. The effective fusion of multiple modalities (e.g., text, acoustic, or visual frames) is a non-trivial task, as these modalities often carry distinct and diverse information and do not contribute equally. The issue further escalates when the data contain noise. In this article, we study the concept of multi-task learning for multi-modal affect analysis and explore a contextual inter-modal attention framework that aims to leverage the association among the neighboring utterances and their multi-modal information. In general, sentiments and emotions are inter-dependent (e.g., anger → negative or happy → positive). In our current work, we exploit the relatedness among the participating tasks in the multi-task framework. We define three different multi-task setups, each having two tasks, i.e., sentiment & emotion classification, sentiment classification & sentiment intensity prediction, and emotion classification & emotion intensity prediction. Our evaluation of the proposed system on the CMU Multi-modal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) benchmark dataset suggests that, in comparison with the single-task learning framework, our multi-task framework yields better performance for the inter-related participating tasks. Further, comparative studies show that our proposed approach attains state-of-the-art performance in most cases.
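
To make the architecture sketched in the abstract concrete, the listing below gives one plausible reading of a contextual inter-modal attention model with two task-specific heads (sentiment and emotion) trained jointly. This is a minimal PyTorch illustration, not the authors' released implementation: the bidirectional GRU context encoders, the pairwise attention with element-wise gating, the layer widths, the feature dimensionalities, and the class counts are all assumptions made for the example.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextualInterModalAttention(nn.Module):
    """Let each utterance of one modality attend over the other modality's
    utterance sequence, then gate the attended features element-wise."""

    def forward(self, m1, m2):
        # m1, m2: (batch, utterances, dim) contextual representations
        scores = torch.matmul(m1, m2.transpose(1, 2))                     # (B, U, U)
        att12 = torch.matmul(F.softmax(scores, dim=-1), m2)               # m1 attends to m2
        att21 = torch.matmul(F.softmax(scores.transpose(1, 2), dim=-1), m1)
        return torch.cat([att12 * m1, att21 * m2], dim=-1)                # (B, U, 2*dim)


class MultiTaskAffectModel(nn.Module):
    def __init__(self, d_text, d_acoustic, d_visual, hidden=100,
                 n_sentiment=2, n_emotion=6):
        super().__init__()
        # Bidirectional GRUs model the context among neighboring utterances
        self.gru_t = nn.GRU(d_text, hidden, batch_first=True, bidirectional=True)
        self.gru_a = nn.GRU(d_acoustic, hidden, batch_first=True, bidirectional=True)
        self.gru_v = nn.GRU(d_visual, hidden, batch_first=True, bidirectional=True)
        self.attn = ContextualInterModalAttention()
        d = 2 * hidden                  # bi-GRU output size per modality
        fused = 3 * 2 * d + 3 * d       # three attended pairs + three self representations
        # A shared fused representation feeds two task-specific heads (multi-task learning)
        self.sentiment_head = nn.Linear(fused, n_sentiment)
        self.emotion_head = nn.Linear(fused, n_emotion)

    def forward(self, text, acoustic, visual):
        t, _ = self.gru_t(text)
        a, _ = self.gru_a(acoustic)
        v, _ = self.gru_v(visual)
        shared = torch.cat([self.attn(t, a), self.attn(t, v), self.attn(a, v),
                            t, a, v], dim=-1)
        return self.sentiment_head(shared), self.emotion_head(shared)


if __name__ == "__main__":
    # Toy batch: 2 videos of 20 utterances each; feature sizes are placeholders.
    model = MultiTaskAffectModel(d_text=300, d_acoustic=74, d_visual=35)
    sent_logits, emo_logits = model(torch.randn(2, 20, 300),
                                    torch.randn(2, 20, 74),
                                    torch.randn(2, 20, 35))
    print(sent_logits.shape, emo_logits.shape)  # (2, 20, 2) and (2, 20, 6)

In the multi-task setups described above, both heads share the fused representation and are optimized jointly, for example by summing a classification loss (sentiment or emotion) with a regression loss when the second task is the corresponding intensity prediction.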





    Published In

    ACM Transactions on Knowledge Discovery from Data, Volume 14, Issue 3
    June 2020
    381 pages
    ISSN:1556-4681
    EISSN:1556-472X
    DOI:10.1145/3388473
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 13 May 2020
    Online AM: 07 May 2020
    Accepted: 01 January 2020
    Revised: 01 November 2019
    Received: 01 May 2019
    Published in TKDD Volume 14, Issue 3


    Author Tags

    1. Multi-task learning
    2. emotion analysis
    3. emotion intensity prediction
    4. inter-modal attention
    5. multi-modal analysis
    6. sentiment analysis
    7. sentiment intensity prediction

    Qualifiers

    • Research-article
    • Research
    • Refereed


    Cited By

    • Advanced Multimodal Sentiment Analysis with Enhanced Contextual Fusion and Robustness (AMSA-ECFR): Symmetry in Feature Integration and Data Alignment. Symmetry 16, 7 (2024), 934. https://doi.org/10.3390/sym16070934. Online publication date: 22-Jul-2024.
    • Multi-Task Learning with Sequential Dependence Toward Industrial Applications: A Systematic Formulation. ACM Transactions on Knowledge Discovery from Data 18, 5 (2024), 1--29. https://doi.org/10.1145/3640468. Online publication date: 28-Feb-2024.
    • Multimodal Fusion for Precision Personality Trait Analysis: A Comprehensive Model Integrating Video, Audio, and Text Inputs. In 2024 International Conference on Smart Systems for Electrical, Electronics, Communication and Computer Engineering (ICSSEECC). 327--332. https://doi.org/10.1109/ICSSEECC61126.2024.10649528. Online publication date: 28-Jun-2024.
    • Intermediate Layer Attention Mechanism for Multimodal Fusion in Personality and Affect Computing. IEEE Access 12 (2024), 112776--112793. https://doi.org/10.1109/ACCESS.2024.3442377. Online publication date: 2024.
    • Moving From Narrative to Interactive Multi-Modal Sentiment Analysis: A Survey. ACM Transactions on Asian and Low-Resource Language Information Processing (2023). https://doi.org/10.1145/3610288. Online publication date: 22-Jul-2023.
    • Multimodal Affective Computing With Dense Fusion Transformer for Inter- and Intra-Modality Interactions. IEEE Transactions on Multimedia 25 (2023), 6575--6587. https://doi.org/10.1109/TMM.2022.3211197. Online publication date: 1-Jan-2023.
    • AIA-Net: Adaptive Interactive Attention Network for Text–Audio Emotion Recognition. IEEE Transactions on Cybernetics 53, 12 (2023), 7659--7671. https://doi.org/10.1109/TCYB.2022.3195739. Online publication date: Dec-2023.
    • MIA-Net: Multi-Modal Interactive Attention Network for Multi-Modal Affective Analysis. IEEE Transactions on Affective Computing 14, 4 (2023), 2796--2809. https://doi.org/10.1109/TAFFC.2023.3259010. Online publication date: 1-Oct-2023.
    • Multi-Modal Sarcasm Detection and Humor Classification in Code-Mixed Conversations. IEEE Transactions on Affective Computing 14, 2 (2023), 1363--1375. https://doi.org/10.1109/TAFFC.2021.3083522. Online publication date: 1-Apr-2023.
    • A novel microseismic classification model based on bimodal neurons in an artificial neural network. Tunnelling and Underground Space Technology 131 (2023), 104791. https://doi.org/10.1016/j.tust.2022.104791. Online publication date: Jan-2023.
