Arabic Document Classification: Performance Investigation of Preprocessing and Representation Techniques

Abdullah Y. Muaad; Hanumanthappa Jayappa Davanagere; D. S. Guru; J. V. Bibal Benifa; Channabasava Chola; Hussain Alsalman; Abdu H. Gumaei; Mugahed A. Al-Antari

doi:10.1155/2022/3720358

Research output: Contribution to journal › Article › peer-review

30 Scopus citations

Abstract

With the growing volume of online social posts, review comments, and digital documents, Arabic text classification (ATC) has become essential to many natural language processing (NLP) applications, especially during the coronavirus pandemic. Variations in the meaning of the same Arabic word can directly affect the performance of any AI-based framework. This work examines how preprocessing and representation techniques affect the effectiveness of machine learning (ML) algorithms, measured across several classification techniques. The ATC process is influenced by several factors, including stemming during preprocessing, the method of feature extraction and selection, the nature of the datasets, and the classification algorithm. To improve overall classification performance, preprocessing techniques are mainly used to convert each Arabic word into its root and to reduce the dimensionality of the representation. Feature extraction and selection play crucial roles in representing Arabic text meaningfully and in improving classification accuracy. The selected classifiers are run with various feature selection algorithms. Results are compared across classifiers including Multinomial Naive Bayes (MNB), Bernoulli Naive Bayes (BNB), Stochastic Gradient Descent (SGD), Support Vector Classifier (SVC), Logistic Regression (LR), and Linear SVC. All classifiers are evaluated on five balanced and unbalanced benchmark datasets: the BBC Arabic corpus, the CNN Arabic corpus, the Open-Source Arabic Corpus (OSAc), ArCovidVac, and AlKhaleej. The evaluation results show that classification performance depends strongly on the preprocessing technique, the representation method, the classification technique, and the nature of the datasets used. For the benchmark datasets considered, Linear SVC outperformed the other classifiers overall when prominent features were selected.
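
The abstract's point that preprocessing shrinks the representation dimension can be illustrated with a minimal Python sketch. It is not the authors' pipeline: it swaps true root extraction for a much simpler light normalization (diacritic removal, alef unification, stripping a leading definite article), and the names `normalize`, `light_stem`, and `vocabulary` are hypothetical helpers introduced only for this illustration.

```python
import re

# Arabic short-vowel diacritics (U+064B..U+0652) and the tatweel stretch mark (U+0640).
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# Alef with madda / hamza above / hamza below, all mapped to bare alef (U+0627).
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")


def normalize(token: str) -> str:
    """Remove diacritics and unify common letter variants (an assumed, lossy scheme)."""
    token = _DIACRITICS.sub("", token)
    token = _ALEF_VARIANTS.sub("\u0627", token)
    return token.replace("\u0649", "\u064A")  # alef maqsura -> ya


def light_stem(token: str) -> str:
    """Normalize, then strip a leading 'wa+al' or 'al' if at least 3 letters remain.

    This is light stemming, a simplification; it does not recover the word's root.
    """
    token = normalize(token)
    for prefix in ("\u0648\u0627\u0644", "\u0627\u0644"):  # "wal", "al"
        if token.startswith(prefix) and len(token) - len(prefix) >= 3:
            return token[len(prefix):]
    return token


def vocabulary(docs, preprocess=lambda tok: tok):
    """Return the set of distinct whitespace-separated tokens after preprocessing."""
    return {preprocess(tok) for doc in docs for tok in doc.split()}
```

Comparing `len(vocabulary(docs))` with `len(vocabulary(docs, light_stem))` on a corpus gives a rough sense of how much the feature space contracts before any feature extraction or selection is applied. A real system would more likely use an established Arabic stemmer; this sketch only shows the mechanism.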

Original language	English
Article number	3720358
Journal	Mathematical Problems in Engineering
Volume	2022
DOIs	https://doi.org/10.1155/2022/3720358
State	Published - 2022
Externally published	Yes

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

SDG 3 Good Health and Well-being

Cite this

@article{eaa103880f9944c7834378f140602a07,

title = "Arabic Document Classification: Performance Investigation of Preprocessing and Representation Techniques",

author = "Muaad, {Abdullah Y.} and Davanagere, {Hanumanthappa Jayappa} and Guru, {D. S.} and Benifa, {J. V. Bibal} and Chola, Channabasava and Alsalman, Hussain and Gumaei, {Abdu H.} and Al-Antari, {Mugahed A.}",

note = "Publisher Copyright: {\textcopyright} 2022 Abdullah Y. Muaad et al.",

year = "2022",

doi = "10.1155/2022/3720358",

language = "English",

volume = "2022",

journal = "Mathematical Problems in Engineering",

issn = "1024-123X",

publisher = "John Wiley and Sons Ltd",

}

TY - JOUR

T1 - Arabic Document Classification

T2 - Performance Investigation of Preprocessing and Representation Techniques

AU - Muaad, Abdullah Y.

AU - Davanagere, Hanumanthappa Jayappa

AU - Guru, D. S.

AU - Benifa, J. V. Bibal

AU - Chola, Channabasava

AU - Alsalman, Hussain

AU - Gumaei, Abdu H.

AU - Al-Antari, Mugahed A.

PY - 2022

Y1 - 2022

UR - https://www.scopus.com/pages/publications/85129945239

U2 - 10.1155/2022/3720358

DO - 10.1155/2022/3720358

M3 - Article

AN - SCOPUS:85129945239

SN - 1024-123X

VL - 2022

JO - Mathematical Problems in Engineering

JF - Mathematical Problems in Engineering

M1 - 3720358

ER -
