Intelligent Deep Machine Learning Cyber Phishing URL Detection Based on BERT Features Extraction

Muna Elsadig; Ashraf Osman Ibrahim; Shakila Basheer; Manal Abdullah Alohali; Sara Alshunaifi; Haya Alqahtani; Nihal Alharbi; Wamda Nagmeldin

doi:10.3390/electronics11223647

Research output: Contribution to journal › Article › peer-review

53 Scopus citations

Abstract

Recently, phishing attacks have been a crucial threat to cyberspace security. Phishing is a form of fraud that attracts people and businesses to access malicious uniform resource locators (URLs) and submit their sensitive information such as passwords, credit card ids, and personal information. Enormous intelligent attacks are launched dynamically with the aim of tricking users into thinking they are accessing a reliable website or online application to acquire account information. Researchers in cyberspace are motivated to create intelligent models and offer secure services on the web as phishing grows more intelligent and malicious every day. In this paper, a novel URL phishing detection technique based on BERT feature extraction and a deep learning method is introduced. BERT was used to extract the URLs’ text from the Phishing Site Predict dataset. Then, the natural language processing (NLP) algorithm was applied to the unique data column and extracted a huge number of useful data features in terms of meaningful text information. Next, a deep convolutional neural network method was utilised to detect phishing URLs. It was used to constitute words or n-grams in order to extract higher-level features. Then, the data were classified into legitimate and phishing URLs. To evaluate the proposed method, a famous public phishing website URLs dataset was used, with a total of 549,346 entries. However, three scenarios were developed to compare the outcomes of the proposed method by using similar datasets. The feature extraction process depends on natural language processing techniques. The experiments showed that the proposed method had achieved 96.66% accuracy in the results, and then the obtained results were compared to other literature review works. The results showed that the proposed method was efficient and valid in detecting phishing websites’ URLs.
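
The abstract says the convolutional network operates over words or n-grams taken from the URL text. The following is only a minimal sketch of that idea, not the authors' pipeline: it omits BERT and the network entirely, and the helper names `url_tokens` and `ngrams` as well as the sample URL are assumptions made for illustration. It shows how a URL might be split into word-like tokens and how the n-token windows, which a convolution filter of width n would slide over, can be enumerated.

```python
import re


def url_tokens(url: str) -> list[str]:
    """Split a URL into lowercase alphanumeric tokens (hypothetical helper)."""
    # Any run of non-alphanumeric characters (:, /, ., -, ?) acts as a separator.
    return [t for t in re.split(r"[^0-9a-z]+", url.lower()) if t]


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return every window of n consecutive tokens (empty if too few tokens)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Illustrative URL, not taken from the dataset.
tokens = url_tokens("http://secure-login.example.com/verify/account")
bigrams = ngrams(tokens, 2)
```

In a full model, each token would first be mapped to a feature vector (for example by BERT), and the classifier would learn which of these windows indicate phishing.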

Original language	English
Article number	3647
Journal	Electronics (Switzerland)
Volume	11
Issue number	22
DOIs	https://doi.org/10.3390/electronics11223647
State	Published - Nov 2022
Externally published	Yes

Keywords

deep neural network
natural language processing
phishing detection
website URL classification

Cite this

@article{16e2b8ad8db44b82b69b99e0b1c2f002,

title = "Intelligent Deep Machine Learning Cyber Phishing URL Detection Based on BERT Features Extraction",

abstract = "Recently, phishing attacks have been a crucial threat to cyberspace security. Phishing is a form of fraud that attracts people and businesses to access malicious uniform resource locators (URLs) and submit their sensitive information such as passwords, credit card ids, and personal information. Enormous intelligent attacks are launched dynamically with the aim of tricking users into thinking they are accessing a reliable website or online application to acquire account information. Researchers in cyberspace are motivated to create intelligent models and offer secure services on the web as phishing grows more intelligent and malicious every day. In this paper, a novel URL phishing detection technique based on BERT feature extraction and a deep learning method is introduced. BERT was used to extract the URLs{\textquoteright} text from the Phishing Site Predict dataset. Then, the natural language processing (NLP) algorithm was applied to the unique data column and extracted a huge number of useful data features in terms of meaningful text information. Next, a deep convolutional neural network method was utilised to detect phishing URLs. It was used to constitute words or n-grams in order to extract higher-level features. Then, the data were classified into legitimate and phishing URLs. To evaluate the proposed method, a famous public phishing website URLs dataset was used, with a total of 549,346 entries. However, three scenarios were developed to compare the outcomes of the proposed method by using similar datasets. The feature extraction process depends on natural language processing techniques. The experiments showed that the proposed method had achieved 96.66\% accuracy in the results, and then the obtained results were compared to other literature review works. The results showed that the proposed method was efficient and valid in detecting phishing websites{\textquoteright} URLs.",

keywords = "deep neural network, natural language processing, phishing detection, website URL classification",

author = "Muna Elsadig and Ibrahim, {Ashraf Osman} and Shakila Basheer and Alohali, {Manal Abdullah} and Sara Alshunaifi and Haya Alqahtani and Nihal Alharbi and Wamda Nagmeldin",

note = "Publisher Copyright: {\textcopyright} 2022 by the authors.",

year = "2022",

month = nov,

doi = "10.3390/electronics11223647",

language = "English",

volume = "11",

journal = "Electronics (Switzerland)",

issn = "2079-9292",

publisher = "Multidisciplinary Digital Publishing Institute (MDPI)",

number = "22",

}

TY - JOUR

T1 - Intelligent Deep Machine Learning Cyber Phishing URL Detection Based on BERT Features Extraction

AU - Elsadig, Muna

AU - Ibrahim, Ashraf Osman

AU - Basheer, Shakila

AU - Alohali, Manal Abdullah

AU - Alshunaifi, Sara

AU - Alqahtani, Haya

AU - Alharbi, Nihal

AU - Nagmeldin, Wamda

PY - 2022/11

Y1 - 2022/11

AB - Recently, phishing attacks have been a crucial threat to cyberspace security. Phishing is a form of fraud that attracts people and businesses to access malicious uniform resource locators (URLs) and submit their sensitive information such as passwords, credit card ids, and personal information. Enormous intelligent attacks are launched dynamically with the aim of tricking users into thinking they are accessing a reliable website or online application to acquire account information. Researchers in cyberspace are motivated to create intelligent models and offer secure services on the web as phishing grows more intelligent and malicious every day. In this paper, a novel URL phishing detection technique based on BERT feature extraction and a deep learning method is introduced. BERT was used to extract the URLs’ text from the Phishing Site Predict dataset. Then, the natural language processing (NLP) algorithm was applied to the unique data column and extracted a huge number of useful data features in terms of meaningful text information. Next, a deep convolutional neural network method was utilised to detect phishing URLs. It was used to constitute words or n-grams in order to extract higher-level features. Then, the data were classified into legitimate and phishing URLs. To evaluate the proposed method, a famous public phishing website URLs dataset was used, with a total of 549,346 entries. However, three scenarios were developed to compare the outcomes of the proposed method by using similar datasets. The feature extraction process depends on natural language processing techniques. The experiments showed that the proposed method had achieved 96.66% accuracy in the results, and then the obtained results were compared to other literature review works. The results showed that the proposed method was efficient and valid in detecting phishing websites’ URLs.

KW - deep neural network

KW - natural language processing

KW - phishing detection

KW - website URL classification

UR - https://www.scopus.com/pages/publications/85142414994

U2 - 10.3390/electronics11223647

DO - 10.3390/electronics11223647

M3 - Article

AN - SCOPUS:85142414994

SN - 2079-9292

VL - 11

JO - Electronics (Switzerland)

JF - Electronics (Switzerland)

IS - 22

M1 - 3647

ER -
