Vision Transformer with Scale-Invariant Features for Human Activity Recognition in Smart Surveillance-Based Fall Detection

Ebtisam Abdullah Alabdulqader; Asma Aldrees; Ghada Atteia; Arwa Allinjawi; Shtwai Alsubai; Amjad Qashlan; Younhyun Jung

doi:10.1142/S0219843625400213

Computer Sciences

Research output: Contribution to journal › Article › peer-review

Abstract

Human activity recognition (HAR) plays a vital role in smart surveillance, healthcare monitoring, and human–computer interaction. Among its applications, fall detection for elderly people is particularly important due to its potential to reduce severe health risks through timely intervention. However, achieving accurate HAR in real-world surveillance scenarios is challenging because of noise, occlusions, and complex environments. In this study, we leverage the capabilities of the Vision Transformer (ViT) combined with scale-invariant feature transform (SIFT) descriptors for robust fall detection and HAR. The framework is evaluated on a dataset that is acquired through video recordings simulating surveillance conditions, focusing on daily human activities such as walking and falling. Comparative experiments against baseline models, including multilayer perceptrons, convolutional neural networks, long short-term memory, EfficientNetB4, Inception, ResNet, and Xception, reveal the superiority of the proposed approach, achieving 99.69% accuracy. This research highlights the potential of ViT-based vision backbones as a building block for multisensor smart surveillance pipelines, where integration with inertial, audio, or radar sensors can further enhance robustness. This research contributes to the development of smart surveillance systems for elderly assistance, real-time monitoring, and public safety applications.

Original language	English
Article number	2540021
Journal	International Journal of Humanoid Robotics
DOIs	https://doi.org/10.1142/S0219843625400213
State	Accepted/In press - 2026

Keywords

computer vision
deep learning for action recognition
multisensor fusion
scale invariant feature transform (SIFT)
Smart surveillance
vision transformer model

Cite this

@article{b3ca1c863f6c47d497fa40f57049fddf,
  title = "Vision Transformer with Scale-Invariant Features for Human Activity Recognition in Smart Surveillance-Based Fall Detection",
  abstract = "Human activity recognition (HAR) plays a vital role in smart surveillance, healthcare monitoring, and human–computer interaction. Among its applications, fall detection for elderly people is particularly important due to its potential to reduce severe health risks through timely intervention. However, achieving accurate HAR in real-world surveillance scenarios is challenging because of noise, occlusions, and complex environments. In this study, we leverage the capabilities of the Vision Transformer (ViT) combined with scale-invariant feature transform (SIFT) descriptors for robust fall detection and HAR. The framework is evaluated on a dataset that is acquired through video recordings simulating surveillance conditions, focusing on daily human activities such as walking and falling. Comparative experiments against baseline models, including multilayer perceptrons, convolutional neural networks, long short-term memory, EfficientNetB4, Inception, ResNet, and Xception, reveal the superiority of the proposed approach, achieving 99.69\% accuracy. This research highlights the potential of ViT-based vision backbones as a building block for multisensor smart surveillance pipelines, where integration with inertial, audio, or radar sensors can further enhance robustness. This research contributes to the development of smart surveillance systems for elderly assistance, real-time monitoring, and public safety applications.",
  keywords = "computer vision, deep learning for action recognition, multisensor fusion, scale invariant feature transform (SIFT), Smart surveillance, vision transformer model",
  author = "Alabdulqader, {Ebtisam Abdullah} and Asma Aldrees and Ghada Atteia and Arwa Allinjawi and Shtwai Alsubai and Amjad Qashlan and Younhyun Jung",
  note = "Publisher Copyright: {\textcopyright} 2025 World Scientific Publishing Company.",
  year = "2026",
  doi = "10.1142/S0219843625400213",
  language = "English",
  journal = "International Journal of Humanoid Robotics",
  issn = "0219-8436",
  publisher = "World Scientific Publishing Co. Pte Ltd",
}

TY  - JOUR
T1  - Vision Transformer with Scale-Invariant Features for Human Activity Recognition in Smart Surveillance-Based Fall Detection
AU  - Alabdulqader, Ebtisam Abdullah
AU  - Aldrees, Asma
AU  - Atteia, Ghada
AU  - Allinjawi, Arwa
AU  - Alsubai, Shtwai
AU  - Qashlan, Amjad
AU  - Jung, Younhyun
PY  - 2026
Y1  - 2026
N2  - Human activity recognition (HAR) plays a vital role in smart surveillance, healthcare monitoring, and human–computer interaction. Among its applications, fall detection for elderly people is particularly important due to its potential to reduce severe health risks through timely intervention. However, achieving accurate HAR in real-world surveillance scenarios is challenging because of noise, occlusions, and complex environments. In this study, we leverage the capabilities of the Vision Transformer (ViT) combined with scale-invariant feature transform (SIFT) descriptors for robust fall detection and HAR. The framework is evaluated on a dataset that is acquired through video recordings simulating surveillance conditions, focusing on daily human activities such as walking and falling. Comparative experiments against baseline models, including multilayer perceptrons, convolutional neural networks, long short-term memory, EfficientNetB4, Inception, ResNet, and Xception, reveal the superiority of the proposed approach, achieving 99.69% accuracy. This research highlights the potential of ViT-based vision backbones as a building block for multisensor smart surveillance pipelines, where integration with inertial, audio, or radar sensors can further enhance robustness. This research contributes to the development of smart surveillance systems for elderly assistance, real-time monitoring, and public safety applications.
AB  - Human activity recognition (HAR) plays a vital role in smart surveillance, healthcare monitoring, and human–computer interaction. Among its applications, fall detection for elderly people is particularly important due to its potential to reduce severe health risks through timely intervention. However, achieving accurate HAR in real-world surveillance scenarios is challenging because of noise, occlusions, and complex environments. In this study, we leverage the capabilities of the Vision Transformer (ViT) combined with scale-invariant feature transform (SIFT) descriptors for robust fall detection and HAR. The framework is evaluated on a dataset that is acquired through video recordings simulating surveillance conditions, focusing on daily human activities such as walking and falling. Comparative experiments against baseline models, including multilayer perceptrons, convolutional neural networks, long short-term memory, EfficientNetB4, Inception, ResNet, and Xception, reveal the superiority of the proposed approach, achieving 99.69% accuracy. This research highlights the potential of ViT-based vision backbones as a building block for multisensor smart surveillance pipelines, where integration with inertial, audio, or radar sensors can further enhance robustness. This research contributes to the development of smart surveillance systems for elderly assistance, real-time monitoring, and public safety applications.
KW  - computer vision
KW  - deep learning for action recognition
KW  - multisensor fusion
KW  - scale invariant feature transform (SIFT)
KW  - Smart surveillance
KW  - vision transformer model
UR  - https://www.scopus.com/pages/publications/105026784056
U2  - 10.1142/S0219843625400213
DO  - 10.1142/S0219843625400213
M3  - Article
AN  - SCOPUS:105026784056
SN  - 0219-8436
JO  - International Journal of Humanoid Robotics
JF  - International Journal of Humanoid Robotics
M1  - 2540021
ER  -
