TY - JOUR
T1 - MAPE-ViT: multimodal scene understanding with novel wavelet-augmented Vision Transformer
AU - Ahmed, Muhammad Waqas
AU - Sadiq, Touseef
AU - Rahman, Hameedur
AU - Alateyah, Sulaiman Abdullah
AU - Alnusayri, Mohammed
AU - Alatiyyah, Mohammed
AU - AlHammadi, Dina Abdulaziz
N1 - Publisher Copyright:
© 2025, PeerJ Inc. All rights reserved.
PY - 2025
Y1 - 2025
N2 - This article introduces Multimodal Adaptive Patch Embedding with Vision Transformer (MAPE-ViT), a novel approach for RGB-D scene classification that effectively addresses fundamental challenges of sensor misalignment, depth noise, and object boundary preservation. Our framework integrates maximally stable extremal regions (MSER) with wavelet coefficients to create comprehensive patch embeddings that capture both local and global image features. These MSER-guided patches, incorporating original pixels and multi-scale wavelet information, serve as input to a Vision Transformer, which leverages its attention mechanisms to extract high-level semantic features. The feature discrimination capability is further enhanced through optimization using the Gray Wolf algorithm. The processed features then flow into a dual-stream architecture, where an extreme learning machine handles multi-object classification, while conditional random fields (CRF) manage scene-level categorization. Extensive experimental results demonstrate the effectiveness of our approach, showing significant improvements in classification accuracy compared to existing methods. Our system provides a robust solution for RGB-D scene understanding, particularly in challenging conditions where traditional approaches struggle with sensor artifacts and noise.
AB - This article introduces Multimodal Adaptive Patch Embedding with Vision Transformer (MAPE-ViT), a novel approach for RGB-D scene classification that effectively addresses fundamental challenges of sensor misalignment, depth noise, and object boundary preservation. Our framework integrates maximally stable extremal regions (MSER) with wavelet coefficients to create comprehensive patch embeddings that capture both local and global image features. These MSER-guided patches, incorporating original pixels and multi-scale wavelet information, serve as input to a Vision Transformer, which leverages its attention mechanisms to extract high-level semantic features. The feature discrimination capability is further enhanced through optimization using the Gray Wolf algorithm. The processed features then flow into a dual-stream architecture, where an extreme learning machine handles multi-object classification, while conditional random fields (CRF) manage scene-level categorization. Extensive experimental results demonstrate the effectiveness of our approach, showing significant improvements in classification accuracy compared to existing methods. Our system provides a robust solution for RGB-D scene understanding, particularly in challenging conditions where traditional approaches struggle with sensor artifacts and noise.
KW - Deep learning
KW - Multimodal
KW - Pattern recognition
KW - Scene classification
KW - Vision Transformer
UR - http://www.scopus.com/inward/record.url?scp=105007293048&partnerID=8YFLogxK
U2 - 10.7717/peerj-cs.2796
DO - 10.7717/peerj-cs.2796
M3 - Article
AN - SCOPUS:105007293048
SN - 2376-5992
VL - 11
JO - PeerJ Computer Science
JF - PeerJ Computer Science
M1 - e2796
ER -