TY - JOUR
T1 - Short text classification for Arabic social media tweets
AU - Alzanin, Samah M.
AU - Azmi, Aqil M.
AU - Aboalsamh, Hatim A.
N1 - Publisher Copyright: © 2022 The Authors
PY - 2022/10
Y1 - 2022/10
N2 - With the rapid growth in the number of tweets published daily on Twitter, automated classification of tweets has become necessary for a broad range of applications (e.g., information retrieval, topic labeling, sentiment analysis, rumor detection) to better understand what these tweets are about and what users are expressing on this social platform. Text classification is the process of assigning one or more pre-defined categories to text according to its content. Tweets are short, and short text lacks sufficient contextual information, which is part of the challenge in classifying them. Adding to the challenge is increased ambiguity, since diacritical marks are not explicitly written in most Modern Standard Arabic (MSA) texts. In addition, Arabic tweets are known to mix MSA and dialectal Arabic. In this paper, we propose a scheme to classify textual Arabic tweets into five categories based on their linguistic characteristics and content. We explore two different textual representations: word embeddings using Word2vec and stemmed text with term frequency-inverse document frequency (tf-idf). We tested three different classifiers: Support Vector Machine (SVM), Gaussian Naive Bayes (GNB), and Random Forest (RF). The hyperparameters of all classifiers were tuned. We collected and manually annotated a dataset of approximately 35,600 Arabic tweets for the experiments. Statistically, the RF and the SVM with a radial basis function (RBF) kernel performed equally well when used with stemming and tf-idf, achieving macro-F1 scores between 98.09% and 98.14%. The GNB with word embeddings was a disappointingly low performer. Our result tops the current state-of-the-art score of 92.95%, which was obtained using a deep learning approach, RNN-GRU (recurrent neural network-gated recurrent unit).
KW - Arabic tweet
KW - Gaussian Naive Bayes
KW - Heterogeneous data
KW - Random forest
KW - Short text classification
KW - SVM
UR - http://www.scopus.com/inward/record.url?scp=85127347590&partnerID=8YFLogxK
U2 - 10.1016/j.jksuci.2022.03.020
DO - 10.1016/j.jksuci.2022.03.020
M3 - Article
AN - SCOPUS:85127347590
SN - 1319-1578
VL - 34
SP - 6595
EP - 6604
JO - Journal of King Saud University - Computer and Information Sciences
JF - Journal of King Saud University - Computer and Information Sciences
IS - 9
ER -