TY - JOUR
T1 - MAPE-ViT: multimodal scene understanding with novel wavelet-augmented Vision Transformer
AU - Ahmed, Muhammad Waqas
AU - Sadiq, Touseef
AU - Rahman, Hameedur
AU - Alateyah, Sulaiman Abdullah
AU - Alnusayri, Mohammed
AU - Alatiyyah, Mohammed
AU - AlHammadi, Dina Abdulaziz
N1 - Publisher Copyright:
© 2025, PeerJ Inc. All rights reserved.
PY - 2025
Y1 - 2025
N2 - This article introduces Multimodal Adaptive Patch Embedding with Vision Transformer (MAPE-ViT), a novel approach for RGB-D scene classification that effectively addresses fundamental challenges of sensor misalignment, depth noise, and object boundary preservation. Our framework integrates maximally stable extremal regions (MSER) with wavelet coefficients to create comprehensive patch embeddings that capture both local and global image features. These MSER-guided patches, incorporating original pixels and multi-scale wavelet information, serve as input to a Vision Transformer, which leverages its attention mechanisms to extract high-level semantic features. The feature discrimination capability is further enhanced through optimization using the Gray Wolf algorithm. The processed features then flow into a dual-stream architecture, where an extreme learning machine handles multi-object classification, while conditional random fields (CRF) manage scene-level categorization. Extensive experimental results demonstrate the effectiveness of our approach, showing significant improvements in classification accuracy compared to existing methods. Our system provides a robust solution for RGB-D scene understanding, particularly in challenging conditions where traditional approaches struggle with sensor artifacts and noise.
AB - This article introduces Multimodal Adaptive Patch Embedding with Vision Transformer (MAPE-ViT), a novel approach for RGB-D scene classification that effectively addresses fundamental challenges of sensor misalignment, depth noise, and object boundary preservation. Our framework integrates maximally stable extremal regions (MSER) with wavelet coefficients to create comprehensive patch embeddings that capture both local and global image features. These MSER-guided patches, incorporating original pixels and multi-scale wavelet information, serve as input to a Vision Transformer, which leverages its attention mechanisms to extract high-level semantic features. The feature discrimination capability is further enhanced through optimization using the Gray Wolf algorithm. The processed features then flow into a dual-stream architecture, where an extreme learning machine handles multi-object classification, while conditional random fields (CRF) manage scene-level categorization. Extensive experimental results demonstrate the effectiveness of our approach, showing significant improvements in classification accuracy compared to existing methods. Our system provides a robust solution for RGB-D scene understanding, particularly in challenging conditions where traditional approaches struggle with sensor artifacts and noise.
KW - Deep learning
KW - Multimodal
KW - Pattern recognition
KW - Scene classification
KW - Vision Transformer
UR - http://www.scopus.com/inward/record.url?scp=105007293048&partnerID=8YFLogxK
U2 - 10.7717/peerj-cs.2796
DO - 10.7717/peerj-cs.2796
M3 - Article
AN - SCOPUS:105007293048
SN - 2376-5992
VL - 11
JO - PeerJ Computer Science
JF - PeerJ Computer Science
M1 - e2796
ER -