TY - JOUR
T1 - MULTICAUSENET temporal attention for multimodal emotion cause pair extraction
AU - Ma, Junchi
AU - Chaudhry, Hassan Nazeer
AU - Kulsoom, Farzana
AU - Yang, Guihua
AU - Khan, Sajid Ullah
AU - Biswas, Sujit
AU - Khan, Zahid Ullah
AU - Khan, Faheem
N1 - Publisher Copyright:
© Crown 2025.
PY - 2025/12
Y1 - 2025/12
N2 - In the realm of emotion recognition, understanding the intricate relationships between emotions and their underlying causes remains a significant challenge. This paper presents MultiCauseNet, a novel framework designed to extract emotion-cause pairs by leveraging multimodal data, including text, audio, and video. The proposed approach integrates advanced multimodal feature extraction techniques with attention mechanisms to enhance the understanding of emotional contexts. Text, audio, and video features are extracted using BERT, Wav2Vec, and Vision Transformers (ViTs) and are then used to construct a comprehensive multimodal graph. The graph encodes the relationships between emotions and potential causes, and Graph Attention Networks (GATs) weigh and prioritize relevant features across the modalities. To further improve performance, Transformers model intra-modal and inter-modal dependencies through self-attention and cross-attention mechanisms, enabling more robust multimodal information fusion that captures the global context of emotional interactions. This dynamic attention mechanism allows MultiCauseNet to capture complex interactions between emotional triggers and causes, improving extraction accuracy. Experiments on the emotion benchmark datasets IEMOCAP and MELD achieved weighted F1 (WF1) scores of 73.02 and 53.67, respectively. Cause-pair analysis is evaluated on ECF and ConvECPE, yielding cause recognition F1 scores of 65.12 and 84.51 and pair extraction F1 scores of 55.12 and 51.34, respectively.
KW - Emotion triggers
KW - Emotion–cause pair extraction
KW - Feature fusion
KW - Graph attention networks (GATs)
KW - Multimodal emotion recognition
KW - Multimodal graphs
KW - Self and cross attention
KW - Transformers and attention mechanisms
KW - Vision transformers (ViTs)
UR - http://www.scopus.com/inward/record.url?scp=105007162701&partnerID=8YFLogxK
U2 - 10.1038/s41598-025-01221-w
DO - 10.1038/s41598-025-01221-w
M3 - Article
C2 - 40461499
AN - SCOPUS:105007162701
SN - 2045-2322
VL - 15
JO - Scientific Reports
JF - Scientific Reports
IS - 1
M1 - 19372
ER -