Big data clustering techniques based on Spark: a literature review

Mozamel M. Saeed; Zaher Al Aghbari; Mohammed Alsharidah

doi:10.7717/PEERJ-CS.321

Computer Sciences

Research output: Contribution to journal › Article › peer-review

35 Scopus citations

Abstract

Clustering, a popular unsupervised learning method, is used extensively in data mining, machine learning and pattern recognition. The procedure groups individual, distinct points so that points within a cluster are similar to each other and dissimilar to points in other clusters. Traditional clustering methods are greatly challenged by the recent massive growth of data. Therefore, several research works have proposed novel designs for clustering methods that leverage the benefits of Big Data platforms such as Apache Spark, which is designed for fast, distributed processing of massive data. However, Spark-based clustering research is still in its early days. In this systematic survey, we investigate the existing Spark-based clustering methods in terms of their support for the characteristics of Big Data. Moreover, we propose a new taxonomy for Spark-based clustering methods. To the best of our knowledge, no survey has been conducted on Spark-based clustering of Big Data. Therefore, this survey aims to present a comprehensive summary of previous studies in the field of Big Data clustering using Apache Spark during 2010-2020. This survey also highlights new research directions in the field of clustering massive data.

Original language	English
Pages (from-to)	1-28
Number of pages	28
Journal	PeerJ Computer Science
Volume	6
DOIs	https://doi.org/10.7717/PEERJ-CS.321
State	Published - 2020

Keywords

Big Data
Big Data clustering
Spark
Spark-based clustering

Access to Document

10.7717/PEERJ-CS.321

Cite this

@article{67e42f1748284418adf09bbf6c41eb3a,

title = "Big data clustering techniques based on Spark: a literature review",

abstract = "A popular unsupervised learning method, known as clustering, is extensively used in data mining, machine learning and pattern recognition. The procedure involves grouping of single and distinct points in a group in such a way that they are either similar to each other or dissimilar to points of other clusters. Traditional clustering methods are greatly challenged by the recent massive growth of data. Therefore, several research works proposed novel designs for clustering methods that leverage the benefits of Big Data platforms, such as Apache Spark, which is designed for fast and distributed massive data processing. However, Spark-based clustering research is still in its early days. In this systematic survey, we investigate the existing Spark-based clustering methods in terms of their support to the characteristics Big Data. Moreover, we propose a new taxonomy for the Spark-based clustering methods. To the best of our knowledge, no survey has been conducted on Spark-based clustering of Big Data. Therefore, this survey aims to present a comprehensive summary of the previous studies in the field of Big Data clustering using Apache Spark during the span of 2010-2020. This survey also highlights the new research directions in the field of clustering massive data.",

keywords = "Big Data, Big Data clustering, Spark, Spark-based clustering",

author = "Saeed, {Mozamel M.} and Aghbari, {Zaher Al} and Mohammed Alsharidah",

year = "2020",

doi = "10.7717/PEERJ-CS.321",

language = "English",

volume = "6",

pages = "1--28",

journal = "PeerJ Computer Science",

issn = "2376-5992",

publisher = "PeerJ Inc.",

}

TY - JOUR

T1 - Big data clustering techniques based on Spark

T2 - a literature review

AU - Saeed, Mozamel M.

AU - Aghbari, Zaher Al

AU - Alsharidah, Mohammed

PY - 2020

Y1 - 2020

N2 - A popular unsupervised learning method, known as clustering, is extensively used in data mining, machine learning and pattern recognition. The procedure involves grouping of single and distinct points in a group in such a way that they are either similar to each other or dissimilar to points of other clusters. Traditional clustering methods are greatly challenged by the recent massive growth of data. Therefore, several research works proposed novel designs for clustering methods that leverage the benefits of Big Data platforms, such as Apache Spark, which is designed for fast and distributed massive data processing. However, Spark-based clustering research is still in its early days. In this systematic survey, we investigate the existing Spark-based clustering methods in terms of their support to the characteristics Big Data. Moreover, we propose a new taxonomy for the Spark-based clustering methods. To the best of our knowledge, no survey has been conducted on Spark-based clustering of Big Data. Therefore, this survey aims to present a comprehensive summary of the previous studies in the field of Big Data clustering using Apache Spark during the span of 2010-2020. This survey also highlights the new research directions in the field of clustering massive data.

AB - A popular unsupervised learning method, known as clustering, is extensively used in data mining, machine learning and pattern recognition. The procedure involves grouping of single and distinct points in a group in such a way that they are either similar to each other or dissimilar to points of other clusters. Traditional clustering methods are greatly challenged by the recent massive growth of data. Therefore, several research works proposed novel designs for clustering methods that leverage the benefits of Big Data platforms, such as Apache Spark, which is designed for fast and distributed massive data processing. However, Spark-based clustering research is still in its early days. In this systematic survey, we investigate the existing Spark-based clustering methods in terms of their support to the characteristics Big Data. Moreover, we propose a new taxonomy for the Spark-based clustering methods. To the best of our knowledge, no survey has been conducted on Spark-based clustering of Big Data. Therefore, this survey aims to present a comprehensive summary of the previous studies in the field of Big Data clustering using Apache Spark during the span of 2010-2020. This survey also highlights the new research directions in the field of clustering massive data.

KW - Big Data

KW - Big Data clustering

KW - Spark

KW - Spark-based clustering

UR - https://www.scopus.com/pages/publications/85098147323

U2 - 10.7717/PEERJ-CS.321

DO - 10.7717/PEERJ-CS.321

M3 - Article

AN - SCOPUS:85098147323

SN - 2376-5992

VL - 6

SP - 1

EP - 28

JO - PeerJ Computer Science

JF - PeerJ Computer Science

ER -
