Revisiting K-Means and Topic Modeling, a Comparison Study to Cluster Arabic Documents

M. Alhawarat; M. Hegazi

doi:10.1109/ACCESS.2018.2852648


Computer Sciences

Prince Sattam Bin Abdulaziz University

Research output: Contribution to journal › Article › peer-review

96 Scopus citations

Abstract

Clustering Arabic text documents is of high importance for many natural language technologies. This paper uses a combined method to cluster Arabic text documents, drawing mainly on generative models and clustering techniques. The study uses latent Dirichlet allocation and the k-means clustering algorithm and applies them to a news data set used in previous similar studies. The aim of this paper is twofold. First, it shows that normalizing the weights in the vector space for the document-term matrix of the text documents dramatically improves the quality of clusters, and hence the accuracy of clustering, when using the k-means algorithm. The results are compared to a recent study on clustering Arabic text documents. Second, it shows that the combined method is superior in terms of clustering quality for Arabic text documents according to external measures such as purity, F-measure, entropy, and accuracy. The purity of the combined method is 0.933 compared to 0.82 for the k-means algorithm, and these figures are higher than those of a recent similar study. This is also confirmed by the other validation measures used. The correctness of the combined method is then confirmed using different Arabic data sets.
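
As an illustration of two ideas named in the abstract, the following is a minimal Python sketch of weight normalization for one row of a document-term matrix and of the purity measure. It is not the authors' implementation. The abstract does not state which normalization is used, so unit-length (L2) scaling is an assumption here, and the function names `l2_normalize` and `purity` as well as the toy data are hypothetical.

```python
from collections import Counter
from math import sqrt


def l2_normalize(row):
    """Scale one document's term weights to unit Euclidean length.

    Assumption: L2 scaling; the abstract does not name the norm used.
    """
    norm = sqrt(sum(w * w for w in row))
    # An all-zero row (empty document) is returned unchanged.
    return [w / norm for w in row] if norm > 0 else list(row)


def purity(cluster_ids, class_labels):
    """Fraction of documents that belong to their cluster's majority class."""
    if len(cluster_ids) != len(class_labels):
        raise ValueError("inputs must have the same length")
    if not cluster_ids:
        raise ValueError("inputs must be non-empty")
    # Count the true class labels inside each cluster.
    by_cluster = {}
    for cid, label in zip(cluster_ids, class_labels):
        by_cluster.setdefault(cid, Counter())[label] += 1
    # Sum the size of the majority class over all clusters.
    majority_total = sum(max(c.values()) for c in by_cluster.values())
    return majority_total / len(cluster_ids)


# Toy data, made up for illustration (not taken from the paper).
clusters = [0, 0, 0, 1, 1, 1]
labels = ["sport", "sport", "economy", "economy", "economy", "sport"]
print(purity(clusters, labels))
print(l2_normalize([3.0, 4.0]))
```

Purity ranges from 0 to 1, with 1 meaning every cluster contains documents from a single class, which is how figures such as 0.933 and 0.82 above can be read.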

Original language	English
Article number	8402221
Pages (from-to)	42740-42749
Number of pages	10
Journal	IEEE Access
Volume	6
DOIs	https://doi.org/10.1109/ACCESS.2018.2852648
State	Published - 3 Jul 2018

Keywords

Arabic language
Clustering text documents
K-means
latent Dirichlet allocation (LDA)
topic modeling

Cite this

@article{3b8c157f1f96474685193740682717a7,

title = "Revisiting K-Means and Topic Modeling, a Comparison Study to Cluster Arabic Documents",

abstract = "Clustering Arabic text documents is of high importance for many natural language technologies. This paper uses a combined method to cluster Arabic text documents. Mainly, we use generative models and clustering techniques. The study uses latent Dirichlet allocation and k-means clustering algorithm and applies them to a news data set used in previous similar studies. The aim of this paper is twofold: it first shows that normalizing the weights in the vector space, for the document-term matrix of the text documents, dramatically improves the quality of clusters and hence the accuracy of clustering when using k-means algorithm. The results are compared to a recent study on clustering Arabic text documents. Second, it shows that the combined method is superior in terms of clustering quality for Arabic text documents according to external measures, such as purity, F-measure, entropy, accuracy, and other measures. It is shown in this paper that the purity of the combined method is 0.933 compared to 0.82 for k-means algorithm, and these figures are higher in comparison to a recent similar study. This is also confirmed by the other used validation measures. The correctness of the combined method is then confirmed using different Arabic data sets.",

keywords = "Arabic language, Clustering text documents, K-means, latent Dirichlet allocation (LDA), topic modeling",

author = "M. Alhawarat and M. Hegazi",

note = "Publisher Copyright: {\textcopyright} 2018 IEEE.",

year = "2018",

month = jul,

day = "3",

doi = "10.1109/ACCESS.2018.2852648",

language = "English",

volume = "6",

pages = "42740--42749",

journal = "IEEE Access",

issn = "2169-3536",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Revisiting K-Means and Topic Modeling, a Comparison Study to Cluster Arabic Documents

AU - Alhawarat, M.

AU - Hegazi, M.

PY - 2018/7/3

Y1 - 2018/7/3

N2 - Clustering Arabic text documents is of high importance for many natural language technologies. This paper uses a combined method to cluster Arabic text documents. Mainly, we use generative models and clustering techniques. The study uses latent Dirichlet allocation and k-means clustering algorithm and applies them to a news data set used in previous similar studies. The aim of this paper is twofold: it first shows that normalizing the weights in the vector space, for the document-term matrix of the text documents, dramatically improves the quality of clusters and hence the accuracy of clustering when using k-means algorithm. The results are compared to a recent study on clustering Arabic text documents. Second, it shows that the combined method is superior in terms of clustering quality for Arabic text documents according to external measures, such as purity, F-measure, entropy, accuracy, and other measures. It is shown in this paper that the purity of the combined method is 0.933 compared to 0.82 for k-means algorithm, and these figures are higher in comparison to a recent similar study. This is also confirmed by the other used validation measures. The correctness of the combined method is then confirmed using different Arabic data sets.

KW - Arabic language

KW - Clustering text documents

KW - K-means

KW - latent Dirichlet allocation (LDA)

KW - topic modeling

UR - https://www.scopus.com/pages/publications/85049453671

U2 - 10.1109/ACCESS.2018.2852648

DO - 10.1109/ACCESS.2018.2852648

M3 - Article

AN - SCOPUS:85049453671

SN - 2169-3536

VL - 6

SP - 42740

EP - 42749

JO - IEEE Access

JF - IEEE Access

M1 - 8402221

ER -
