TY - JOUR
T1 - MULTICAUSENET temporal attention for multimodal emotion cause pair extraction
AU - Ma, Junchi
AU - Chaudhry, Hassan Nazeer
AU - Kulsoom, Farzana
AU - Yang, Guihua
AU - Khan, Sajid Ullah
AU - Biswas, Sujit
AU - Khan, Zahid Ullah
AU - Khan, Faheem
N1 - Publisher Copyright:
© Crown 2025.
PY - 2025/12
Y1 - 2025/12
N2 - In the realm of emotion recognition, understanding the intricate relationships between emotions and their underlying causes remains a significant challenge. This paper presents MultiCauseNet, a novel framework designed to extract emotion-cause pairs by leveraging multimodal data, including text, audio, and video. The proposed approach integrates advanced multimodal feature extraction techniques with attention mechanisms to enhance the understanding of emotional contexts. Text, audio, and video features are extracted using BERT, Wav2Vec, and Vision Transformers (ViTs) and are then used to construct a comprehensive multimodal graph. The graph encodes the relationships between emotions and potential causes, and Graph Attention Networks (GATs) weigh and prioritize relevant features across the modalities. To further improve performance, Transformers model intra-modal and inter-modal dependencies through self-attention and cross-attention mechanisms, enabling more robust multimodal information fusion that captures the global context of emotional interactions. This dynamic attention mechanism allows MultiCauseNet to capture complex interactions between emotional triggers and causes, improving extraction accuracy. Experiments on the emotion benchmark datasets IEMOCAP and MELD achieved weighted F1 (WF1) scores of 73.02 and 53.67, respectively. Cause-pair analysis is evaluated on ECF and ConvECPE, yielding cause recognition F1 scores of 65.12 and 84.51 and pair extraction F1 scores of 55.12 and 51.34, respectively.
KW - Emotion triggers
KW - Emotion–cause pair extraction
KW - Feature fusion
KW - Graph attention networks (GATs)
KW - Multimodal emotion recognition
KW - Multimodal graphs
KW - Self and cross attention
KW - Transformers and attention mechanisms
KW - Vision transformers (ViTs)
UR - http://www.scopus.com/inward/record.url?scp=105007162701&partnerID=8YFLogxK
U2 - 10.1038/s41598-025-01221-w
DO - 10.1038/s41598-025-01221-w
M3 - Article
C2 - 40461499
AN - SCOPUS:105007162701
SN - 2045-2322
VL - 15
JO - Scientific Reports
JF - Scientific Reports
IS - 1
M1 - 19372
ER -