Short text classification for Arabic social media tweets

Samah M. Alzanin; Aqil M. Azmi; Hatim A. Aboalsamh

doi:10.1016/j.jksuci.2022.03.020

King Saud University

Research output: Contribution to journal › Article › peer-review

52 Scopus citations

Abstract

With the rapid growth in the number of tweets published daily on Twitter, automated classification of tweets has become necessary for a broad range of applications (e.g., information retrieval, topic labeling, sentiment analysis, rumor detection) to better understand what these tweets are about and what users are expressing on this social platform. Text classification is the process of assigning one or more predefined categories to a text according to its content. Tweets are short, and short text lacks sufficient contextual information, which is part of the challenge in classifying them. Adding to the challenge is the increased ambiguity that arises because diacritical marks are not explicitly written in most Modern Standard Arabic (MSA) texts. Moreover, Arabic tweets are known to mix MSA and dialectal Arabic. In this paper, we propose a scheme to classify textual tweets in the Arabic language into five different categories based on their linguistic characteristics and content. We explore two different textual representations: word embedding using Word2vec, and stemmed text with term frequency-inverse document frequency (tf-idf). We tested three different classifiers: Support Vector Machine (SVM), Gaussian Naive Bayes (GNB), and Random Forest (RF). All the classifiers had their hyperparameters tuned. We collected and manually annotated a dataset of approximately 35,600 Arabic tweets for the experiments. Statistically, the RF and the SVM with radial basis function (RBF) kernel performed equally well when used with stemming and tf-idf, achieving macro-F₁ scores ranging between 98.09% and 98.14%. The GNB with word embedding was a disappointingly low performer. Our result tops the current state-of-the-art score of 92.95%, which was obtained using a deep learning approach, RNN-GRU (recurrent neural network-gated recurrent unit).
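As a rough illustration of two terms used in the abstract, the sketch below computes tf-idf weights for tokenized documents and a macro-F1 score for predicted labels. It is not the authors' implementation: the function names `tfidf` and `macro_f1`, the plain `ln(N / df)` idf variant, the English placeholder tokens, and the labels `"A"` and `"B"` are all assumptions made for this example. In the paper's setting the tokens would be stemmed Arabic words and there would be five categories.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumes raw term frequency and idf = ln(N / df). Other variants
    (smoothing, length normalization) are common, and the abstract does
    not say which one the paper used.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels that occur."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Placeholder data (hypothetical, not from the paper's dataset).
docs = [["goal", "match", "team"], ["market", "stock", "team"]]
weights = tfidf(docs)

y_true = ["A", "A", "B", "B"]
y_pred = ["A", "B", "B", "B"]
score = macro_f1(y_true, y_pred)
```

Because macro-F1 averages the per-class scores without weighting by class size, a classifier that does poorly on a small category is penalized as much as one that does poorly on a large category. A term that occurs in every document, such as `"team"` above, gets weight zero under this idf variant.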

Original language	English
Pages (from-to)	6595-6604
Number of pages	10
Journal	Journal of King Saud University - Computer and Information Sciences
Volume	34
Issue number	9
DOIs	https://doi.org/10.1016/j.jksuci.2022.03.020
State	Published - Oct 2022
Externally published	Yes

Keywords

Arabic tweet
Gaussian Naive Bayes
Heterogeneous data
Random forest
SVM
Short text classification

Cite this

@article{b4c5ae128983400498415f7371797e79,

title = "Short text classification for Arabic social media tweets",

abstract = "With the rapid growth in the number of tweets published daily on Twitter, automated classification of tweets becomes necessary for broad diverse applications (e.g., information retrieval, topic labeling, sentiment analysis, rumors detection) to better understand what these tweets are, and what the users are expressing in this social platform. Text classification is the process of assigning one or more pre-defined categories to text according to its content. Tweets are short, and the short text does not have enough contextual information, which is part of the challenge in their classification. Adding to the challenge is the increase in ambiguity since the diacritical marking is not explicitly specified in most Modern Standard Arabic (MSA) texts. Not to mention the Arabic tweets are known to contain fused text of MSA and dialectal Arabic. In this paper, we propose a scheme to classify the textual tweets in the Arabic language based on its linguistic characteristics and content into five different categories. We explore two different textual representations: word embedding using Word2vec and stemmed text with term frequency-inverse document frequency (tf-idf). We tested three different classifiers: Support Vector Machine (SVM), Gaussian Naive Bayes (GNB), and Random Forest (RF). All the classifiers had their hyperparameters tuned. We collected and manually annotated a dataset of approximately 35,600 Arabic tweets for the experiments. Statistically, the RF and the SVM with radial basis function (RBF) kernel performed equally well when used with stemming and tf-idf, achieving macro-F1 scores ranging between 98.09\% and 98.14\%. The GNB with word embedding was disappointingly low performer. Our result tops the current state-of-the-art score of 92.95\% using a deep learning approach, RNN-GRU (recurrent neural network-gated recurrent unit).",

keywords = "Arabic tweet, Gaussian Naive Bayes, Heterogeneous data, Random forest, SVM, Short text classification",

author = "Alzanin, {Samah M.} and Azmi, {Aqil M.} and Aboalsamh, {Hatim A.}",

note = "Publisher Copyright: {\textcopyright} 2022 The Authors",

year = "2022",

month = oct,

doi = "10.1016/j.jksuci.2022.03.020",

language = "English",

volume = "34",

pages = "6595--6604",

journal = "Journal of King Saud University - Computer and Information Sciences",

issn = "1319-1578",

publisher = "Springer International Publishing",

number = "9",

}

TY - JOUR

T1 - Short text classification for Arabic social media tweets

AU - Alzanin, Samah M.

AU - Azmi, Aqil M.

AU - Aboalsamh, Hatim A.

PY - 2022/10

Y1 - 2022/10

N2 - With the rapid growth in the number of tweets published daily on Twitter, automated classification of tweets becomes necessary for broad diverse applications (e.g., information retrieval, topic labeling, sentiment analysis, rumors detection) to better understand what these tweets are, and what the users are expressing in this social platform. Text classification is the process of assigning one or more pre-defined categories to text according to its content. Tweets are short, and the short text does not have enough contextual information, which is part of the challenge in their classification. Adding to the challenge is the increase in ambiguity since the diacritical marking is not explicitly specified in most Modern Standard Arabic (MSA) texts. Not to mention the Arabic tweets are known to contain fused text of MSA and dialectal Arabic. In this paper, we propose a scheme to classify the textual tweets in the Arabic language based on its linguistic characteristics and content into five different categories. We explore two different textual representations: word embedding using Word2vec and stemmed text with term frequency-inverse document frequency (tf-idf). We tested three different classifiers: Support Vector Machine (SVM), Gaussian Naive Bayes (GNB), and Random Forest (RF). All the classifiers had their hyperparameters tuned. We collected and manually annotated a dataset of approximately 35,600 Arabic tweets for the experiments. Statistically, the RF and the SVM with radial basis function (RBF) kernel performed equally well when used with stemming and tf-idf, achieving macro-F1 scores ranging between 98.09% and 98.14%. The GNB with word embedding was disappointingly low performer. Our result tops the current state-of-the-art score of 92.95% using a deep learning approach, RNN-GRU (recurrent neural network-gated recurrent unit).

AB - With the rapid growth in the number of tweets published daily on Twitter, automated classification of tweets becomes necessary for broad diverse applications (e.g., information retrieval, topic labeling, sentiment analysis, rumors detection) to better understand what these tweets are, and what the users are expressing in this social platform. Text classification is the process of assigning one or more pre-defined categories to text according to its content. Tweets are short, and the short text does not have enough contextual information, which is part of the challenge in their classification. Adding to the challenge is the increase in ambiguity since the diacritical marking is not explicitly specified in most Modern Standard Arabic (MSA) texts. Not to mention the Arabic tweets are known to contain fused text of MSA and dialectal Arabic. In this paper, we propose a scheme to classify the textual tweets in the Arabic language based on its linguistic characteristics and content into five different categories. We explore two different textual representations: word embedding using Word2vec and stemmed text with term frequency-inverse document frequency (tf-idf). We tested three different classifiers: Support Vector Machine (SVM), Gaussian Naive Bayes (GNB), and Random Forest (RF). All the classifiers had their hyperparameters tuned. We collected and manually annotated a dataset of approximately 35,600 Arabic tweets for the experiments. Statistically, the RF and the SVM with radial basis function (RBF) kernel performed equally well when used with stemming and tf-idf, achieving macro-F1 scores ranging between 98.09% and 98.14%. The GNB with word embedding was disappointingly low performer. Our result tops the current state-of-the-art score of 92.95% using a deep learning approach, RNN-GRU (recurrent neural network-gated recurrent unit).

KW - Arabic tweet

KW - Gaussian Naive Bayes

KW - Heterogeneous data

KW - Random forest

KW - SVM

KW - Short text classification

UR - https://www.scopus.com/pages/publications/85127347590

U2 - 10.1016/j.jksuci.2022.03.020

DO - 10.1016/j.jksuci.2022.03.020

M3 - Article

AN - SCOPUS:85127347590

SN - 1319-1578

VL - 34

SP - 6595

EP - 6604

JO - Journal of King Saud University - Computer and Information Sciences

JF - Journal of King Saud University - Computer and Information Sciences

IS - 9

ER -
