An effective dimension reduction algorithm for clustering Arabic text

  • A. A. Mohamed

Research output: Contribution to journal › Article › peer-review

25 Scopus citations

Abstract

Text clustering is a challenging task in natural language processing because of the very high-dimensional space it produces (the curse of dimensionality problem). Because texts contain considerable ambiguity and redundancy, they also introduce noise. An efficient and accurate clustering algorithm therefore needs to extract the main concepts of the text by eliminating this noise and reducing the high dimensionality of the data. This paper compares three well-known dimension reduction algorithms for text clustering, namely Principal Component Analysis (PCA), Nonnegative Matrix Factorization (NMF) and Singular Value Decomposition (SVD), to show the pros and cons of each. It presents an effective dimension reduction algorithm for Arabic text clustering using PCA. For that purpose, a series of experiments was conducted on two linguistic corpora, for English and Arabic, and the results were analyzed from a clustering quality point of view. The experiments show that PCA improves the quality of the clustering process and gives more interpretable results, with less time needed for clustering, for both Arabic and English documents.
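As a rough illustration of the idea in the abstract, the sketch below reduces a small term-count matrix with PCA before grouping documents. It is not the paper's implementation: the toy corpus, the helper names term_matrix and pca_reduce, and the use of the sign of the first component as a stand-in for a real clustering step are all assumptions made for this example. PCA is computed here through an SVD of the mean-centered matrix, which also shows how closely the two methods are related.

```python
import numpy as np

# Toy corpus (hypothetical; not the corpora used in the paper).
docs = [
    "cat dog pet animal",
    "dog pet animal vet",
    "cat pet vet animal",
    "stock market trade price",
    "market price trade bank",
    "stock bank price market",
]


def term_matrix(texts):
    """Build a documents-by-terms count matrix from whitespace tokens."""
    vocab = sorted({w for t in texts for w in t.split()})
    index = {w: j for j, w in enumerate(vocab)}
    X = np.zeros((len(texts), len(vocab)))
    for i, t in enumerate(texts):
        for w in t.split():
            X[i, index[w]] += 1.0
    return X, vocab


def pca_reduce(X, k):
    """Project the rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)  # PCA requires mean-centered columns
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T


X, vocab = term_matrix(docs)
Z = pca_reduce(X, k=2)

# Simplified grouping (assumption): split on the sign of the first component.
# A real pipeline would run a clustering algorithm such as k-means on Z.
labels = (Z[:, 0] > 0).astype(int)
print(X.shape, "->", Z.shape)
print(labels)
```

In this toy case the ten-dimensional term space is reduced to two dimensions, and the first component alone separates the two topics. Which label each group receives depends on the sign convention of the SVD, so only the grouping itself is meaningful.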

Original language: English
Pages (from-to): 1-5
Number of pages: 5
Journal: Egyptian Informatics Journal
Volume: 21
Issue number: 1
State: Published - Mar 2020
Externally published: Yes

Keywords

  • Arabic NLP
  • Clustering
  • Dimensionality reduction
  • NMF
  • PCA
  • SVD
