TY - JOUR
T1 - Multimodal scene recognition using semantic segmentation and deep learning integration
AU - Naseer, Aysha
AU - Alnusayri, Mohammed
AU - Alhasson, Haifa F.
AU - Alatiyyah, Mohammed
AU - AlHammadi, Dina Abdulaziz
AU - Jalal, Ahmad
AU - Park, Jeongmin
N1 - Publisher Copyright:
Copyright 2025 Naseer et al. Distributed under Creative Commons CC-BY 4.0
PY - 2025
Y1 - 2025
N2 - Semantic modeling and recognition of indoor scenes present a significant challenge because generic scenes have a complex composition, containing a variety of features including themes and objects. The gap between high-level scene interpretation and low-level visual features further increases the complexity of scene recognition. To overcome these obstacles, this study presents a novel multimodal deep learning technique that enhances scene recognition accuracy and robustness by combining depth information with conventional red-green-blue (RGB) image data. A depth-aware segmentation methodology first identifies the objects in an image; convolutional neural networks (CNNs) and spatial pyramid pooling (SPP) then analyze these objects, enabling more precise image classification. Experimental findings demonstrate the effectiveness of this method, showing 91.73% accuracy on the RGB-D scene dataset and 90.53% accuracy on the NYU Depth v2 dataset. These results show how the multimodal approach can improve scene detection and classification, with potential uses in fields including robotics, sports analysis, and security systems.
AB - Semantic modeling and recognition of indoor scenes present a significant challenge because generic scenes have a complex composition, containing a variety of features including themes and objects. The gap between high-level scene interpretation and low-level visual features further increases the complexity of scene recognition. To overcome these obstacles, this study presents a novel multimodal deep learning technique that enhances scene recognition accuracy and robustness by combining depth information with conventional red-green-blue (RGB) image data. A depth-aware segmentation methodology first identifies the objects in an image; convolutional neural networks (CNNs) and spatial pyramid pooling (SPP) then analyze these objects, enabling more precise image classification. Experimental findings demonstrate the effectiveness of this method, showing 91.73% accuracy on the RGB-D scene dataset and 90.53% accuracy on the NYU Depth v2 dataset. These results show how the multimodal approach can improve scene detection and classification, with potential uses in fields including robotics, sports analysis, and security systems.
KW - Artificial intelligence
KW - Features optimization
KW - Image analysis
KW - Machine learning
KW - Scene modeling
KW - Spatial pyramid pooling
KW - Voxel grid representation
UR - http://www.scopus.com/inward/record.url?scp=105005201498&partnerID=8YFLogxK
U2 - 10.7717/peerj-cs.2858
DO - 10.7717/peerj-cs.2858
M3 - Article
AN - SCOPUS:105005201498
SN - 2376-5992
VL - 11
JO - PeerJ Computer Science
JF - PeerJ Computer Science
M1 - e2858
ER -