TY - JOUR
T1 - Big data clustering techniques based on Spark
T2 - a literature review
AU - Saeed, Mozamel M.
AU - Aghbari, Zaher Al
AU - Alsharidah, Mohammed
N1 - Publisher Copyright: © Copyright 2020 Saeed et al.
PY - 2020
Y1 - 2020
AB - Clustering, a popular unsupervised learning method, is used extensively in data mining, machine learning and pattern recognition. It groups data points so that points in the same cluster are similar to each other and dissimilar to points in other clusters. Traditional clustering methods are greatly challenged by the recent massive growth of data. Several research works have therefore proposed novel designs for clustering methods that leverage the benefits of Big Data platforms such as Apache Spark, which is designed for fast, distributed processing of massive data. However, Spark-based clustering research is still in its early days. In this systematic survey, we investigate the existing Spark-based clustering methods in terms of their support for the characteristics of Big Data. Moreover, we propose a new taxonomy for Spark-based clustering methods. To the best of our knowledge, no survey has been conducted on Spark-based clustering of Big Data. This survey therefore aims to present a comprehensive summary of previous studies in the field of Big Data clustering using Apache Spark from 2010 to 2020. It also highlights new research directions in the field of clustering massive data.
KW - Big Data
KW - Big Data clustering
KW - Spark
KW - Spark-based clustering
UR - http://www.scopus.com/inward/record.url?scp=85098147323&partnerID=8YFLogxK
U2 - 10.7717/PEERJ-CS.321
DO - 10.7717/PEERJ-CS.321
M3 - Article
AN - SCOPUS:85098147323
SN - 2376-5992
VL - 6
SP - 1
EP - 28
JO - PeerJ Computer Science
JF - PeerJ Computer Science
ER -