MAPE-ViT: multimodal scene understanding with novel wavelet-augmented Vision Transformer

Muhammad Waqas Ahmed, Touseef Sadiq, Hameedur Rahman, Sulaiman Abdullah Alateyah, Mohammed Alnusayri, Mohammed Alatiyyah, Dina Abdulaziz AlHammadi

Research output: Contribution to journal › Article › peer-review

Abstract

This article introduces Multimodal Adaptive Patch Embedding with Vision Transformer (MAPE-ViT), a novel approach for RGB-D scene classification that effectively addresses the fundamental challenges of sensor misalignment, depth noise, and object boundary preservation. Our framework integrates maximally stable extremal regions (MSER) with wavelet coefficients to create comprehensive patch embeddings that capture both local and global image features. These MSER-guided patches, incorporating original pixels and multi-scale wavelet information, serve as input to a Vision Transformer, which leverages its attention mechanisms to extract high-level semantic features. The feature discrimination capability is further enhanced through optimization with the Gray Wolf algorithm. The processed features then flow into a dual-stream architecture, where an extreme learning machine handles multi-object classification, while conditional random fields (CRF) manage scene-level categorization. Extensive experimental results demonstrate the effectiveness of our approach, showing significant improvements in classification accuracy over existing methods. Our system provides a robust solution for RGB-D scene understanding, particularly in challenging conditions where traditional approaches struggle with sensor artifacts and noise.
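The abstract does not specify implementation details, but the core idea of a wavelet-augmented patch embedding can be sketched as follows. This is a minimal illustrative example, not the authors' code: it assumes 8×8 patches, a one-level Haar decomposition, and simple concatenation of raw pixels with the four wavelet subbands; the function name `mape_patch_embedding` is hypothetical.

```python
import numpy as np

def haar_dwt2(patch: np.ndarray):
    """One-level 2D Haar wavelet transform of a square patch
    with even side length. Returns (cA, cH, cV, cD) subbands."""
    # Average/difference pairs along rows.
    lo = (patch[:, 0::2] + patch[:, 1::2]) / 2.0
    hi = (patch[:, 0::2] - patch[:, 1::2]) / 2.0
    # Average/difference pairs along columns.
    cA = (lo[0::2] + lo[1::2]) / 2.0  # approximation (low-low)
    cH = (lo[0::2] - lo[1::2]) / 2.0  # horizontal detail
    cV = (hi[0::2] + hi[1::2]) / 2.0  # vertical detail
    cD = (hi[0::2] - hi[1::2]) / 2.0  # diagonal detail
    return cA, cH, cV, cD

def mape_patch_embedding(patch: np.ndarray) -> np.ndarray:
    """Hypothetical wavelet-augmented embedding for one patch:
    original pixels concatenated with multi-scale Haar coefficients.
    For an 8x8 patch this yields a 64 + 4*16 = 128-dim vector."""
    cA, cH, cV, cD = haar_dwt2(patch.astype(np.float64))
    return np.concatenate([patch.ravel(), cA.ravel(),
                           cH.ravel(), cV.ravel(), cD.ravel()])
```

In the full pipeline described above, such patches would be selected by MSER (rather than a fixed grid) before being embedded and fed to the Vision Transformer.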

Original language: English
Article number: e2796
Journal: PeerJ Computer Science
Volume: 11
State: Published - 2025

Keywords

  • Deep learning
  • Multimodal
  • Pattern recognition
  • Scene classification
  • Vision Transformer
