Bilateral collaborative streams with multi-modal attention network for accurate polyp segmentation

Rahim Khan; Nada Alzaben; Yousef Ibrahim Daradkeh; Mi Young Lee; Inam Ullah

doi:10.1038/s41598-025-15401-1

Computer Engineering

Research output: Contribution to journal › Article › peer-review

Abstract

Accurate segmentation of colorectal polyps in colonoscopy images is a critical prerequisite for early cancer detection and prevention. However, existing segmentation approaches struggle to handle the inherent diversity of polyp presentations (variations in size, morphology, and texture) while maintaining the computational efficiency required for clinical deployment. To address these challenges, we propose a novel dual-stream architecture, the Bilateral Convolutional Multi-Attention Network (BiCoMA). The proposed network integrates global contextual information and local spatial details through parallel processing streams that leverage the complementary strengths of convolutional neural networks and vision transformers. The architecture employs a hybrid backbone in which the convolutional stream uses ConvNeXt V2 Large to extract high-resolution spatial features, while the transformer stream uses Pyramid Vision Transformer to model global dependencies and long-range contextual relationships. Spatial Refinement (SR) modules process the high-resolution convolutional features, preserving critical boundary information through asymmetric convolutional processing. Channel Refinement (CR) and Non-Local Attention (NLA) mechanisms are applied to the transformer features to enhance discriminative capacity and capture global contextual relationships. These refined multi-scale representations from both streams are progressively integrated through a hierarchical decoder incorporating a Pyramidal Attention Block (PAB) for multi-scale feature processing and Convolutional Block Attention Modules (CBAM) for enhanced feature discrimination. The fusion strategy employs systematic upsampling with lateral connections, enabling information to flow from semantic understanding to fine-grained spatial localization. Channel alignment through convolutions keeps computation efficient while preserving essential semantic information. The proposed BiCoMA architecture achieves state-of-the-art performance across five benchmark datasets (Endoscene, ClinicDB, ColonDB, ETIS, and Kvasir-SEG), demonstrating superior generalization capabilities and practical computational requirements for real-time clinical applications.
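
The fusion step described in the abstract (upsampling with lateral connections) can be illustrated with a minimal, dependency-free sketch. This is not the authors' decoder: it assumes single-channel feature maps stored as nested lists, nearest-neighbour 2x upsampling, and element-wise addition as the merge, and it leaves out the PAB and CBAM blocks entirely. The names `upsample2x`, `add_maps`, and `top_down_fuse` are hypothetical.

```python
from typing import List

Grid = List[List[float]]  # one single-channel feature map (assumption)


def upsample2x(x: Grid) -> Grid:
    """Nearest-neighbour 2x upsampling: repeat every value along both axes."""
    out: Grid = []
    for row in x:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add_maps(a: Grid, b: Grid) -> Grid:
    """Element-wise sum of two maps with identical shapes."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("feature maps must have the same shape")
    return [[va + vb for va, vb in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_down_fuse(pyramid: List[Grid]) -> List[Grid]:
    """Fuse a feature pyramid ordered fine -> coarse.

    Each level is assumed to have half the resolution of the previous one.
    Starting from the coarsest (most semantic) map, the running result is
    upsampled and added to the next finer lateral map.
    """
    fused = [pyramid[-1]]
    for lateral in reversed(pyramid[:-1]):
        fused.append(add_maps(lateral, upsample2x(fused[-1])))
    return fused[::-1]  # same fine -> coarse order as the input


# Toy three-level pyramid: 4x4, 2x2, and 1x1 maps.
levels = [
    [[1.0] * 4 for _ in range(4)],
    [[10.0] * 2 for _ in range(2)],
    [[100.0]],
]
fused_levels = top_down_fuse(levels)
# In the finest fused map every entry is 1 + 10 + 100 = 111.
```

In a real network the additive merge would typically be replaced by learned operations, and channel counts would first be aligned across the two streams, as the abstract indicates.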

Original language	English
Article number	34182
Journal	Scientific Reports
Volume	15
Issue number	1
DOIs	https://doi.org/10.1038/s41598-025-15401-1
State	Published - Dec 2025

Keywords

Hybrid CNN-transformer
Multi-attention
Multi-scale attention
Polyp segmentation
Semantic fusion
Visual intelligence

Cite this

@article{57893978611248809c32f217379ed1b3,

title = "Bilateral collaborative streams with multi-modal attention network for accurate polyp segmentation",

abstract = "Accurate segmentation of colorectal polyps in colonoscopy images is a critical prerequisite for early cancer detection and prevention. However, existing segmentation approaches struggle to handle the inherent diversity of polyp presentations (variations in size, morphology, and texture) while maintaining the computational efficiency required for clinical deployment. To address these challenges, we propose a novel dual-stream architecture, the Bilateral Convolutional Multi-Attention Network (BiCoMA). The proposed network integrates global contextual information and local spatial details through parallel processing streams that leverage the complementary strengths of convolutional neural networks and vision transformers. The architecture employs a hybrid backbone in which the convolutional stream uses ConvNeXt V2 Large to extract high-resolution spatial features, while the transformer stream uses Pyramid Vision Transformer to model global dependencies and long-range contextual relationships. Spatial Refinement (SR) modules process the high-resolution convolutional features, preserving critical boundary information through asymmetric convolutional processing. Channel Refinement (CR) and Non-Local Attention (NLA) mechanisms are applied to the transformer features to enhance discriminative capacity and capture global contextual relationships. These refined multi-scale representations from both streams are progressively integrated through a hierarchical decoder incorporating a Pyramidal Attention Block (PAB) for multi-scale feature processing and Convolutional Block Attention Modules (CBAM) for enhanced feature discrimination. The fusion strategy employs systematic upsampling with lateral connections, enabling information to flow from semantic understanding to fine-grained spatial localization. Channel alignment through convolutions keeps computation efficient while preserving essential semantic information. The proposed BiCoMA architecture achieves state-of-the-art performance across five benchmark datasets (Endoscene, ClinicDB, ColonDB, ETIS, and Kvasir-SEG), demonstrating superior generalization capabilities and practical computational requirements for real-time clinical applications.",

keywords = "Hybrid CNN-transformer, Multi-attention, Multi-scale attention, Polyp segmentation, Semantic fusion, Visual intelligence",

author = "Khan, Rahim and Alzaben, Nada and Daradkeh, {Yousef Ibrahim} and Lee, {Mi Young} and Ullah, Inam",

note = "Publisher Copyright: {\textcopyright} The Author(s) 2025.",

year = "2025",

month = dec,

doi = "10.1038/s41598-025-15401-1",

language = "English",

volume = "15",

journal = "Scientific Reports",

issn = "2045-2322",

publisher = "Nature Publishing Group",

number = "1",

}

TY - JOUR

T1 - Bilateral collaborative streams with multi-modal attention network for accurate polyp segmentation

AU - Khan, Rahim

AU - Alzaben, Nada

AU - Daradkeh, Yousef Ibrahim

AU - Lee, Mi Young

AU - Ullah, Inam

N1 - Publisher Copyright: © The Author(s) 2025.

PY - 2025/12

Y1 - 2025/12

AB - Accurate segmentation of colorectal polyps in colonoscopy images is a critical prerequisite for early cancer detection and prevention. However, existing segmentation approaches struggle to handle the inherent diversity of polyp presentations (variations in size, morphology, and texture) while maintaining the computational efficiency required for clinical deployment. To address these challenges, we propose a novel dual-stream architecture, the Bilateral Convolutional Multi-Attention Network (BiCoMA). The proposed network integrates global contextual information and local spatial details through parallel processing streams that leverage the complementary strengths of convolutional neural networks and vision transformers. The architecture employs a hybrid backbone in which the convolutional stream uses ConvNeXt V2 Large to extract high-resolution spatial features, while the transformer stream uses Pyramid Vision Transformer to model global dependencies and long-range contextual relationships. Spatial Refinement (SR) modules process the high-resolution convolutional features, preserving critical boundary information through asymmetric convolutional processing. Channel Refinement (CR) and Non-Local Attention (NLA) mechanisms are applied to the transformer features to enhance discriminative capacity and capture global contextual relationships. These refined multi-scale representations from both streams are progressively integrated through a hierarchical decoder incorporating a Pyramidal Attention Block (PAB) for multi-scale feature processing and Convolutional Block Attention Modules (CBAM) for enhanced feature discrimination. The fusion strategy employs systematic upsampling with lateral connections, enabling information to flow from semantic understanding to fine-grained spatial localization. Channel alignment through convolutions keeps computation efficient while preserving essential semantic information. The proposed BiCoMA architecture achieves state-of-the-art performance across five benchmark datasets (Endoscene, ClinicDB, ColonDB, ETIS, and Kvasir-SEG), demonstrating superior generalization capabilities and practical computational requirements for real-time clinical applications.

KW - Hybrid CNN-transformer

KW - Multi-attention

KW - Multi-scale attention

KW - Polyp segmentation

KW - Semantic fusion

KW - Visual intelligence

UR - https://www.scopus.com/pages/publications/105017684329

U2 - 10.1038/s41598-025-15401-1

DO - 10.1038/s41598-025-15401-1

M3 - Article

C2 - 41034439

AN - SCOPUS:105017684329

SN - 2045-2322

VL - 15

JO - Scientific Reports

JF - Scientific Reports

IS - 1

M1 - 34182

ER -
