TY - JOUR
T1 - Multimodal scene recognition using semantic segmentation and deep learning integration
AU - Naseer, Aysha
AU - Alnusayri, Mohammed
AU - Alhasson, Haifa F.
AU - Alatiyyah, Mohammed
AU - AlHammadi, Dina Abdulaziz
AU - Jalal, Ahmad
AU - Park, Jeongmin
N1 - Publisher Copyright:
Copyright 2025 Naseer et al. Distributed under Creative Commons CC-BY 4.0
PY - 2025
Y1 - 2025
N2 - Semantic modeling and recognition of indoor scenes present a significant challenge because generic scenes have a complex composition, containing a variety of features including themes and objects. The gap between high-level scene interpretation and low-level visual features further increases the complexity of scene recognition. To overcome these obstacles, this study presents a novel multimodal deep learning technique that enhances scene recognition accuracy and robustness by combining depth information with conventional red-green-blue (RGB) image data. A depth-aware segmentation methodology first identifies the objects in an image; convolutional neural networks (CNNs) and spatial pyramid pooling (SPP) then analyze these objects, enabling more precise image classification. Experimental findings demonstrate the effectiveness of this method, showing 91.73% accuracy on the RGB-D scene dataset and 90.53% accuracy on the NYU Depth v2 dataset. These results show how the multimodal approach can improve scene detection and classification, with potential uses in fields including robotics, sports analysis, and security systems.
AB - Semantic modeling and recognition of indoor scenes present a significant challenge because generic scenes have a complex composition, containing a variety of features including themes and objects. The gap between high-level scene interpretation and low-level visual features further increases the complexity of scene recognition. To overcome these obstacles, this study presents a novel multimodal deep learning technique that enhances scene recognition accuracy and robustness by combining depth information with conventional red-green-blue (RGB) image data. A depth-aware segmentation methodology first identifies the objects in an image; convolutional neural networks (CNNs) and spatial pyramid pooling (SPP) then analyze these objects, enabling more precise image classification. Experimental findings demonstrate the effectiveness of this method, showing 91.73% accuracy on the RGB-D scene dataset and 90.53% accuracy on the NYU Depth v2 dataset. These results show how the multimodal approach can improve scene detection and classification, with potential uses in fields including robotics, sports analysis, and security systems.
KW - Artificial intelligence
KW - Features optimization
KW - Image analysis
KW - Machine learning
KW - Scene modeling
KW - Spatial pyramid pooling
KW - Voxel grid representation
UR - http://www.scopus.com/inward/record.url?scp=105005201498&partnerID=8YFLogxK
U2 - 10.7717/peerj-cs.2858
DO - 10.7717/peerj-cs.2858
M3 - Article
AN - SCOPUS:105005201498
SN - 2376-5992
VL - 11
JO - PeerJ Computer Science
JF - PeerJ Computer Science
M1 - e2858
ER -