Abstract
Text clustering is a challenging task in natural language processing due to the very high dimensional space produced by this process (i.e. curse of dimensionality problem). Since these texts contain considerable amounts of ambiguities and redundancies, they produce different noise effects. For an efficient and accurate clustering algorithm, we need to extract the main concepts of the text by eliminating the noise and reducing the high dimensionality of the data. This paper compares among three of the famous dimension reduction algorithms for text clustering to show the pros and cons of each one, namely Principal Component Analysis (PCA), Nonnegative Matrix Factorization (NMF) and Singular Value Decomposition (SVD). It presents an effective dimension reduction algorithm for Arabic text clustering using PCA. For that purpose, a series of the experiments has been conducted using two linguistic corpora for both English and Arabic and analyzed the results from a clustering quality point of view. The experiments have shown that PCA improves the quality of the clustering process and that it gives more interpretable results with less time needed for the clustering process for both Arabic and English documents.
| Original language | English |
|---|---|
| Pages (from-to) | 1-5 |
| Number of pages | 5 |
| Journal | Egyptian Informatics Journal |
| Volume | 21 |
| Issue number | 1 |
| DOIs | |
| State | Published - Mar 2020 |
| Externally published | Yes |
Keywords
- Arabic NLP
- Clustering
- Dimensionality reduction
- NMF
- PCA
- SVD