An effective dimension reduction algorithm for clustering Arabic text

A. A. Mohamed

doi:10.1016/j.eij.2019.05.002

Prince Sattam Bin Abdulaziz University

Research output: Contribution to journal › Article › peer-review

25 Scopus citations

Abstract

Text clustering is a challenging task in natural language processing because of the very high-dimensional space it produces (the curse of dimensionality problem). Because texts contain considerable ambiguity and redundancy, they also introduce noise. An efficient and accurate clustering algorithm therefore needs to extract the main concepts of the text by eliminating the noise and reducing the dimensionality of the data. This paper compares three well-known dimension reduction algorithms for text clustering, namely Principal Component Analysis (PCA), Nonnegative Matrix Factorization (NMF) and Singular Value Decomposition (SVD), to show the pros and cons of each. It presents an effective dimension reduction algorithm for Arabic text clustering using PCA. To that end, a series of experiments was conducted using two linguistic corpora for English and Arabic, and the results were analyzed from a clustering quality point of view. The experiments show that PCA improves the quality of the clustering process and gives more interpretable results in less clustering time for both Arabic and English documents.
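As a rough illustration of the pipeline the abstract describes (reduce a high-dimensional document-term matrix with PCA, then cluster in the reduced space), here is a minimal NumPy sketch. It is not the paper's implementation: the toy English corpus, the helper names (`term_matrix`, `pca_reduce`, `kmeans`), the raw term counts, and the choice of k-means are all assumptions made for the example, since the abstract does not specify the weighting scheme or the clustering algorithm.

```python
import numpy as np

# Hypothetical toy corpus with two topics; NOT the corpora used in the paper.
docs = [
    "cat dog pet animal",
    "dog pet animal vet",
    "cat pet animal vet",
    "stock market trade price",
    "market trade price bank",
    "stock trade price bank",
]

def term_matrix(documents):
    """Build a dense document-term count matrix and its sorted vocabulary."""
    vocab = sorted({w for d in documents for w in d.split()})
    index = {w: j for j, w in enumerate(vocab)}
    X = np.zeros((len(documents), len(vocab)))
    for i, d in enumerate(documents):
        for w in d.split():
            X[i, index[w]] += 1.0
    return X, vocab

def pca_reduce(X, k):
    """Project the rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)  # PCA operates on mean-centred features
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def kmeans(Z, k, iters=20):
    """Lloyd's k-means with a deterministic farthest-point initialisation."""
    centers = [Z[0]]
    for _ in range(1, k):
        # Distance from every point to its nearest already-chosen centre.
        d = np.min([np.linalg.norm(Z - c, axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        dist = np.linalg.norm(Z[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave an empty cluster's centre unchanged
                centers[j] = Z[labels == j].mean(axis=0)
    return labels

X, vocab = term_matrix(docs)   # 6 documents x 10 terms
Z = pca_reduce(X, k=2)         # 6 documents x 2 components
labels = kmeans(Z, k=2)
print(X.shape, Z.shape, labels)
```

On this toy data the two topics share no vocabulary, so the first principal component alone separates them. On real corpora the number of retained components and the term weighting (for example TF-IDF) would need tuning.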

Original language	English
Pages (from-to)	1-5
Number of pages	5
Journal	Egyptian Informatics Journal
Volume	21
Issue number	1
DOIs	https://doi.org/10.1016/j.eij.2019.05.002
State	Published - Mar 2020
Externally published	Yes

Keywords

Arabic NLP
Clustering
Dimensionality reduction
NMF
PCA
SVD

Cite this

@article{4edad7a7497642f9b937e1fea6281319,

title = "An effective dimension reduction algorithm for clustering Arabic text",

abstract = "Text clustering is a challenging task in natural language processing due to the very high dimensional space produced by this process (i.e. curse of dimensionality problem). Since these texts contain considerable amounts of ambiguities and redundancies, they produce different noise effects. For an efficient and accurate clustering algorithm, we need to extract the main concepts of the text by eliminating the noise and reducing the high dimensionality of the data. This paper compares among three of the famous dimension reduction algorithms for text clustering to show the pros and cons of each one, namely Principal Component Analysis (PCA), Nonnegative Matrix Factorization (NMF) and Singular Value Decomposition (SVD). It presents an effective dimension reduction algorithm for Arabic text clustering using PCA. For that purpose, a series of the experiments has been conducted using two linguistic corpora for both English and Arabic and analyzed the results from a clustering quality point of view. The experiments have shown that PCA improves the quality of the clustering process and that it gives more interpretable results with less time needed for the clustering process for both Arabic and English documents.",

keywords = "Arabic NLP, Clustering, Dimensionality reduction, NMF, PCA, SVD",

author = "Mohamed, {A. A.}",

note = "Publisher Copyright: {\textcopyright} 2019",

year = "2020",

month = mar,

doi = "10.1016/j.eij.2019.05.002",

language = "English",

volume = "21",

pages = "1--5",

journal = "Egyptian Informatics Journal",

issn = "1110-8665",

publisher = "Elsevier B.V.",

number = "1",

}

TY - JOUR

T1 - An effective dimension reduction algorithm for clustering Arabic text

AU - Mohamed, A. A.

PY - 2020/3

Y1 - 2020/3

N2 - Text clustering is a challenging task in natural language processing due to the very high dimensional space produced by this process (i.e. curse of dimensionality problem). Since these texts contain considerable amounts of ambiguities and redundancies, they produce different noise effects. For an efficient and accurate clustering algorithm, we need to extract the main concepts of the text by eliminating the noise and reducing the high dimensionality of the data. This paper compares among three of the famous dimension reduction algorithms for text clustering to show the pros and cons of each one, namely Principal Component Analysis (PCA), Nonnegative Matrix Factorization (NMF) and Singular Value Decomposition (SVD). It presents an effective dimension reduction algorithm for Arabic text clustering using PCA. For that purpose, a series of the experiments has been conducted using two linguistic corpora for both English and Arabic and analyzed the results from a clustering quality point of view. The experiments have shown that PCA improves the quality of the clustering process and that it gives more interpretable results with less time needed for the clustering process for both Arabic and English documents.

AB - Text clustering is a challenging task in natural language processing due to the very high dimensional space produced by this process (i.e. curse of dimensionality problem). Since these texts contain considerable amounts of ambiguities and redundancies, they produce different noise effects. For an efficient and accurate clustering algorithm, we need to extract the main concepts of the text by eliminating the noise and reducing the high dimensionality of the data. This paper compares among three of the famous dimension reduction algorithms for text clustering to show the pros and cons of each one, namely Principal Component Analysis (PCA), Nonnegative Matrix Factorization (NMF) and Singular Value Decomposition (SVD). It presents an effective dimension reduction algorithm for Arabic text clustering using PCA. For that purpose, a series of the experiments has been conducted using two linguistic corpora for both English and Arabic and analyzed the results from a clustering quality point of view. The experiments have shown that PCA improves the quality of the clustering process and that it gives more interpretable results with less time needed for the clustering process for both Arabic and English documents.

KW - Arabic NLP

KW - Clustering

KW - Dimensionality reduction

KW - NMF

KW - PCA

KW - SVD

UR - https://www.scopus.com/pages/publications/85066333933

U2 - 10.1016/j.eij.2019.05.002

DO - 10.1016/j.eij.2019.05.002

M3 - Article

AN - SCOPUS:85066333933

SN - 1110-8665

VL - 21

SP - 1

EP - 5

JO - Egyptian Informatics Journal

JF - Egyptian Informatics Journal

IS - 1

ER -
