Abstract
This article introduces Multimodal Adaptive Patch Embedding with Vision Transformer (MAPE-ViT), a novel approach for RGB-D scene classification that effectively addresses fundamental challenges of sensor misalignment, depth noise, and object boundary preservation. Our framework integrates maximally stable extremal regions (MSER) with wavelet coefficients to create comprehensive patch embedding that capture both local and global image features. These MSER-guided patches, incorporating original pixels and multi-scale wavelet information, serve as input to a Vision Transformer, which leverages its attention mechanisms to extract high-level semantic features. The feature discrimination capability is further enhanced through optimization using the Gray Wolf algorithm. The processed features then flow into a dual-stream architecture, where an extreme learning machine handles multi-object classification, while conditional random fields (CRF) manage scene-level categorization. Extensive experimental results demonstrate the effectiveness of our approach, showing significant improvements in classification accuracy compared to existing methods. Our system provides a robust solution for RGB-D scene understanding, particularly in challenging conditions where traditional approaches struggle with sensor artifacts and noise.
| Original language | English |
|---|---|
| Article number | e2796 |
| Journal | PeerJ Computer Science |
| Volume | 11 |
| DOIs | |
| State | Published - 2025 |
Keywords
- Deep learning
- Multimodal
- Patterns recognition
- Scene classification
- Vision Transformer
Fingerprint
Dive into the research topics of 'MAPE-ViT: multimodal scene understanding with novel wavelet-augmented Vision Transformer'. Together they form a unique fingerprint.Cite this
- APA
- Author
- BIBTEX
- Harvard
- Standard
- RIS
- Vancouver