TY - JOUR
T1 - A hybrid framework for malware determination
T2 - generative pre-trained transformer-inspired approach
AU - Ahanger, Tariq Ahamed
AU - Bhatia, Munish
AU - Alabduljabbar, Abdulrahman
AU - Albanyan, Abdullah
AU - Fazal Din, Imdad
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025.
PY - 2025/10
Y1 - 2025/10
N2 - Cyber attacks and malware incidents have surged in prevalence, posing significant risks across various system domains. As a result, the development of automated machine learning techniques for robust malware defense has become increasingly vital. Among the prominent methodologies, two deep learning architectures for malware detection stand out: Generative Pre-trained Transformer 3 (GPT-3) and Stacked Bidirectional Long Short-Term Memory (SBiLSTM). These language models are engineered by parsing both malicious and benign Portable Executable (PE) files, specifically focusing on the .text segment which contains assembly instructions. Each instruction is treated as a discrete phrase, with the .text segments considered as individual documents. The categorization process classifies each phrase as either safe or vulnerable based on the underlying data source. This approach led to the creation of three distinct datasets. The first dataset comprises complete documents, which are analyzed using an SBiLSTM-based Document Level Analysis framework. The second dataset is constructed from individual sentences, which are processed through SBiLSTM Sentence Level Analysis mechanisms. Additionally, both the Domain-Specific Language (DSL) model and the General Language Model (GLM) based on GPT-3 are employed for enhanced contextual understanding. Ultimately, a pre-trained model is proposed, leveraging a dataset enriched with unlabeled assembly instructions. The efficacy of this malware detection framework is benchmarked against leading-edge research in the field. Notably, the detection performance of the GPT-3 integrated mechanism has shown substantial improvement, underscoring its potential as a formidable tool in the ongoing battle against malware threats.
AB - Cyber attacks and malware incidents have surged in prevalence, posing significant risks across various system domains. As a result, the development of automated machine learning techniques for robust malware defense has become increasingly vital. Among the prominent methodologies, two deep learning architectures for malware detection stand out: Generative Pre-trained Transformer 3 (GPT-3) and Stacked Bidirectional Long Short-Term Memory (SBiLSTM). These language models are engineered by parsing both malicious and benign Portable Executable (PE) files, specifically focusing on the .text segment which contains assembly instructions. Each instruction is treated as a discrete phrase, with the .text segments considered as individual documents. The categorization process classifies each phrase as either safe or vulnerable based on the underlying data source. This approach led to the creation of three distinct datasets. The first dataset comprises complete documents, which are analyzed using an SBiLSTM-based Document Level Analysis framework. The second dataset is constructed from individual sentences, which are processed through SBiLSTM Sentence Level Analysis mechanisms. Additionally, both the Domain-Specific Language (DSL) model and the General Language Model (GLM) based on GPT-3 are employed for enhanced contextual understanding. Ultimately, a pre-trained model is proposed, leveraging a dataset enriched with unlabeled assembly instructions. The efficacy of this malware detection framework is benchmarked against leading-edge research in the field. Notably, the detection performance of the GPT-3 integrated mechanism has shown substantial improvement, underscoring its potential as a formidable tool in the ongoing battle against malware threats.
KW - GPT-3
KW - Malware identification
KW - Stacked LSTM
KW - Static assessment
UR - http://www.scopus.com/inward/record.url?scp=105008081069&partnerID=8YFLogxK
U2 - 10.1007/s10586-024-05034-w
DO - 10.1007/s10586-024-05034-w
M3 - Article
AN - SCOPUS:105008081069
SN - 1386-7857
VL - 28
JO - Cluster Computing
JF - Cluster Computing
IS - 6
M1 - 371
ER -