TY - JOUR
T1 - Revisiting K-Means and Topic Modeling, a Comparison Study to Cluster Arabic Documents
AU - Alhawarat, M.
AU - Hegazi, M.
N1 - Publisher Copyright: © 2018 IEEE.
PY - 2018/7/3
Y1 - 2018/7/3
N2 - Clustering Arabic text documents is of high importance for many natural language technologies. This paper uses a combined method to cluster Arabic text documents. Mainly, we use generative models and clustering techniques. The study uses latent Dirichlet allocation and the k-means clustering algorithm and applies them to a news data set used in previous similar studies. The aim of this paper is twofold: it first shows that normalizing the weights in the vector space, for the document-term matrix of the text documents, dramatically improves the quality of clusters and hence the accuracy of clustering when using the k-means algorithm. The results are compared to a recent study on clustering Arabic text documents. Second, it shows that the combined method is superior in terms of clustering quality for Arabic text documents according to external measures, such as purity, F-measure, entropy, accuracy, and other measures. It is shown in this paper that the purity of the combined method is 0.933 compared to 0.82 for the k-means algorithm, and these figures are higher in comparison to a recent similar study. This is also confirmed by the other validation measures used. The correctness of the combined method is then confirmed using different Arabic data sets.
KW - Arabic language
KW - Clustering text documents
KW - K-means
KW - latent Dirichlet allocation (LDA)
KW - topic modeling
UR - http://www.scopus.com/inward/record.url?scp=85049453671&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2018.2852648
DO - 10.1109/ACCESS.2018.2852648
M3 - Article
AN - SCOPUS:85049453671
SN - 2169-3536
VL - 6
SP - 42740
EP - 42749
JO - IEEE Access
JF - IEEE Access
M1 - 8402221
ER -