Overview
Over the past few years, the development of technology has led to a rapid increase in the amount of data. Twenty years ago, almost all data came from the sciences, but the explosion of the Internet has taken us into a new era of data: data now exist everywhere around us. For instance, the growth of smart cities and Internet of Things (IoT) devices such as aerial platforms (remote sensing), cameras, microphones, radio-frequency identification (RFID) readers, wireless sensor networks, etc. creates a large amount of data every second. Besides, the increasing number of mobile and personal devices such as smartphones or tablets also makes the data bigger every day, especially through photo and video sharing on popular social networks like Facebook, Twitter, YouTube, etc. Many studies have shown that the amount of data created each year is growing faster than ever before; they estimate that by 2020, every human on the planet will be creating 1.7 megabytes of information each second, and in only a year the accumulated world data will grow to 44 zettabytes¹ [46]. Another study from IDC predicts that the amount of global data captured in 2025 will reach 163 zettabytes, a tenfold increase compared to 2016 [55].
Consequently, researchers now face a new, hard situation: solving problems for data that are big in volume, variety, velocity, veracity and value (Figure 1.1²). Understanding and explaining these data in order to solve real-world problems is very hard for humans without help from machines. That is why machine learning plays an important role in this decade as well as in the future. By applying machine learning combined with artificial intelligence (AI), scientists can create systems with the ability to automatically learn and improve from experience without being explicitly programmed.
For each specific purpose, machine learning is divided into two categories: supervised and unsupervised. Supervised learning is a kind of training model where the training sets come with provided target labels; the system learns from these
1 One zettabyte is equivalent to one billion gigabytes.
2 Image source: https://www.edureka.co/blog/what-is-big-data/
FIGURE 1.1: Big Data properties

training sets and is then used to predict or classify future instances. In contrast, unsupervised machine learning approaches extract information from data sets where such explicit labels are not available. The importance of this field is expected to grow, as it is estimated that 85% of global data in 2025 will be unlabeled [55]. In particular, data clustering - the task of grouping similar objects together into clusters - seems to be a fruitful approach for analyzing such data [13]. Applications are broad and include fields such as computer vision [61], information retrieval [35], computational geometry [36] and recommendation systems [41]. Furthermore, clustering techniques can also be used to learn data representations that are used in downstream prediction tasks such as classification and regression [16]. The machine learning categories are described briefly in Figure 1.2³.
In general, clustering is one of the most popular techniques in machine learning and is widely used in large-scale data analysis. The goal of clustering is to partition a set of objects into groups such that objects in the same group are similar to each other and objects in different groups are dissimilar. Because of its importance and practical applications, this technique has been studied extensively and many algorithms exist. For example, we can use BIRCH [68] or CURE [27], which belong to hierarchical clustering, also known as connectivity-based clustering, for problems based on the idea that objects are more related to nearby objects than to objects farther away. If the problem is closely related to statistics, we can use distribution-based clustering such as the Gaussian Mixture Model (GMM) [66] or DBCLASD [53]. For problems based on density clustering, in which data lying in a
3 Image source: https://towardsdatascience.com/supervised-vs-unsupervised-learning
FIGURE 1.2: Machine Learning: Supervised vs Unsupervised

region of high density in the data space are considered to belong to the same cluster [38], we can use Mean-shift [17], DBSCAN [20] - the most well-known density-based clustering algorithm - or OPTICS [4], an improvement of DBSCAN. One of the most common approaches for clustering is based on partitioning, in which the basic idea is to assign the cluster centers to some random objects; the actual centers are then revealed through several iterations until a stopping condition is satisfied. Some common algorithms of this kind are k-means [45], k-medoids [49], CLARA [37] and CLARANS [48]. For more detail, we refer readers to the surveys of clustering algorithms by R. Xu (2005) [65] and by D. Xu (2015) [64].
In fact, there are many clustering algorithms and improvements that can be used in applications, and each one has its own benefits and drawbacks. Choosing a suitable clustering algorithm is an important and difficult problem that users must deal with when solving situations with specific configurations and settings. There is some research about this, such as [14], [39], [67], which discusses the quality of clusters in certain circumstances. However, in the scope of this thesis, we do not cover this issue and the various clustering algorithms; instead, we fix and select one of the most popular clustering algorithms - k-means clustering. We use this algorithm throughout this report and investigate methods that can deal with k-means clustering for large-scale data sets.
Moreover, designing a complete solution that can cluster and analyze large-scale data is still a challenge for data scientists. Many methods have been proposed over the years to deal with machine learning for big data. One of the simplest approaches relies on infrastructure and hardware: the more powerful and modern the machines we have, the more complicated problems and the larger amounts of data we can handle. This solution is quite easy but costs a lot of money, and few can afford it. Another option is finding suitable algorithms that reduce the computational complexity caused by input sizes that may reach millions or billions of data points. Approaches include data compression [69], [1], data deduplication [19], dimension
reduction [25], [60], [51], etc. For a survey of these techniques, readers can find more useful information in [54]. Among big data reduction methods, data sampling is one of the most popular options closely related to machine learning and data mining. The key idea of data sampling is that instead of solving a problem on the full, large-scale data, we can find the answer on a subset of this data; this result is then used as the baseline for finding the actual solution on the original data set. This leads to a new difficulty: finding a subset that is small enough to effectively reduce computational complexity but keeps all the representative characteristics of the original data. This difficulty is the motivation for this research and for this thesis.
The Scope of Research
In this thesis, we will propose a solution for a problem of clustering large datasets.
We use the word "large" to indicate data that is "big" in volume, not data having all the characteristics of big data described in the previous section with the 5 V's (Volume, Variety, Value, Velocity and Veracity) (Figure 1.1). However, the Volume, in other words the data size, is one of the most non-trivial difficulties that researchers have to face when solving a big-data-related problem.
For the clustering algorithm, even though there are many methods and lines of investigation, we consider a fixed clustering problem: the prototypical k-means clustering.
We select k-means because it is the most well-known clustering algorithm and is widely applied in practice, both in industry and in scientific research.
While there is a wealth of prior work on clustering of small and medium sized data sets, there are unique challenges in the massive data setting. Traditional algorithms have a super-linear computational complexity in the size of the data set, making them infeasible when there are many data points. In the scope of this thesis, we apply data sampling to deal with the massive data setting. A very basic approach of this kind is random sampling, or uniform sampling. In fact, while uniform sampling is feasible for some problems, there are instances where it performs very poorly due to the naive nature of the sampling strategy. For example, real-world data is often imbalanced and contains clusters of different sizes. As a consequence, a small fraction of data points can be very important and have an enormous impact on the objective function. Such imbalanced data sets are problematic for methods based on uniform sampling since, with high probability, these methods only sample points in large clusters and the information in small clusters is discarded [13].
The idea of finding a relevant subset of the original data to decrease the computational cost brings us to the concept of coresets, which were first applied in geometric approximation by Agarwal et al. in 2004 [2], [3]. The problem of coreset construction for k-median and k-means clustering was then stated and investigated by Har-Peled et al. in [28], [29]. Since then, many coreset construction algorithms have been proposed for a wide variety of clustering problems. In this thesis,
by using state-of-the-art coreset methods, we propose two methods for coreset construction for k-means clustering.
In addition to devising machine learning algorithms for solving big data problems, data scientists also invent, create and apply frameworks to speed up the processing of big data. Some of the most popular open-source and free-to-use frameworks are Apache Hadoop, Apache Spark, Apache Storm, Apache Samza, Apache Flink, etc. Each one is designed with a different architecture and has its own strengths. For a survey and more details about these frameworks, readers can refer to [34]. In this thesis, along with data sampling via coresets, we apply the built-in k-means clustering algorithm of Apache Spark to shorten the runtime of the whole process.
Research Contributions
Scientific Significance
In this thesis, based on prior work about coresets, we propose new algorithms for coreset construction for k-means clustering.
• Based on the farthest-first-traversal algorithm and the ProTraS algorithm by Ros & Guillaume [58], we propose an FFT-based coreset construction. This part is explained and proved in Chapter 3.
• Based on the lightweight coreset of Bachem, Lucic and Krause [12], we propose a general model for the α-lightweight coreset, and then a general lightweight coreset construction that is very fast and practical. This is proved in Chapter 4.
Practical Significance
• Due to its high runtime, our proposed FFT-based coreset construction is hard to use in practice. However, through experiments with some state-of-the-art coreset constructions, we show that this algorithm is able to produce one of the best sample coresets, which can be used in experiments.
• Our proposed α-lightweight coreset model is a generalization of the traditional lightweight coreset. It can be used in various practical cases, especially in situations that need to focus on either the multiplicative error or the additive error of the samples.
Organization of Thesis
The remainder of this thesis is organized as follows.
• Chapter 2. This chapter is an overview of prior work related to this thesis, including the k-means and k-means++ algorithms, the definition of coresets, a brief introduction to Apache Spark and a theorem about bounds on sample complexity.
• Chapter 3. We introduce the farthest-first-traversal algorithm as well as ProTraS for finding coresets. Then we propose an FFT-based algorithm for coreset construction.
• Chapter 4. This chapter is about the lightweight coreset and our general lightweight coreset model. We also prove the correctness of this model and propose a general algorithm for this α-lightweight coreset.
• Chapter 5. This chapter shows the experimental runs for clustering large datasets. We use the α-lightweight coreset for the sampling process and k-means++ for clustering on the Apache Spark framework.
• Chapter 6. This chapter concludes the thesis.
Publications relevant to this Thesis
• Nguyen Le Hoang, Tran Khanh Dang and Le Hong Trang. A Comparative Study of the Use of Coresets for Clustering Large Datasets. Future Data and Security Engineering (FDSE 2019), LNCS 11814, pp. 45-55.
• Le Hong Trang, Nguyen Le Hoang and Tran Khanh Dang. A Farthest-First-Traversal-based Sampling Algorithm for k-clustering. International Conference on Ubiquitous Information Management and Communication (IMCOM 2020).
In this chapter, we provide a short introduction to the background and prior work related to this thesis.
k-Means and k-Means++ Clustering
k-Means Clustering
The k-means clustering problem is one of the oldest and most important questions in machine learning. Given an integer $k$ and a data set $X \subset \mathbb{R}^d$, the goal is to choose $k$ centers so as to minimize the total squared distance between each point and its closest center. The k-means clustering problem can be described as follows.
Let $X \subset \mathbb{R}^d$. The k-means clustering problem is to find a set $Q \subset \mathbb{R}^d$ with $|Q| = k$ such that the function $\varphi_X(Q)$ is minimized, where
$$\varphi_X(Q) = \sum_{x \in X} d(x, Q)^2 = \sum_{x \in X} \min_{q \in Q} \|x - q\|^2.$$
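To make the objective concrete, the short snippet below evaluates $\varphi_X(Q)$ on a toy data set. It is an illustration added here; the function and variable names are ours, not part of the thesis's code.

```python
import numpy as np

def kmeans_cost(X, Q):
    """phi_X(Q): sum over x in X of the squared distance to its closest center in Q."""
    # Pairwise squared distances between the n points and the k centers: shape (n, k).
    d2 = ((X[:, None, :] - Q[None, :, :]) ** 2).sum(axis=2)
    # For each point, keep only the distance to its closest center, then sum.
    return d2.min(axis=1).sum()

X = np.random.rand(1000, 2)                          # toy data set
Q = X[np.random.choice(len(X), 3, replace=False)]    # k = 3 arbitrary centers
print(kmeans_cost(X, Q))
```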
In 1957, an algorithm, now often referred to simply as "k-means", was proposed by S. Lloyd of Bell Labs; it was eventually published in 1982 [42]. Lloyd's algorithm begins with k arbitrary "centers," typically chosen uniformly at random from the data points. Each point is then assigned to the nearest center, and each center is recomputed as the center of mass of all points assigned to it. These last two steps are repeated until the process stabilizes [6].
The Lloyd’s algorithm is described in Algorithm1
Algorithm 1: k-Means Clustering - Lloyd's Algorithm [42]
Require: data set X, number of clusters k
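The body of Algorithm 1 did not survive in this extracted text. As an illustration only, the following Python sketch implements Lloyd's iteration exactly as described in the paragraph above (random initial centers, nearest-center assignment, recomputation of the centers of mass); the function and parameter names are our own, not the thesis's original listing.

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Choose k arbitrary centers uniformly at random from the data points.
    centers = X[rng.choice(len(X), k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # 2. Assign each point to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # 3. Recompute each center as the center of mass of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # 4. Stop when the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```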
The algorithm was subsequently developed further by Inaba et al. [33], Matousek [47], Vega et al. [63], and others. However, one of the most notable improvements of k-means is k-means++ by Arthur and Vassilvitskii [6]. We give an overview of this algorithm in the next section.
k-Means++ Clustering
In Algorithm 1, the initial set of cluster centers (line 1) is obtained by random sampling, where k points are selected uniformly at random from the data set. This simple approach is fast and easy to implement. However, there are many natural examples for which the algorithm generates arbitrarily bad clusterings. This happens due to poor placement of the starting centers and, in particular, it can hold with high probability even when the centers are chosen uniformly at random from the data points [6].
To overcome this problem, Arthur and Vassilvitskii [6] proposed the algorithm named k-means++, which uses adaptive seeding based on a technique called $D^2$-sampling to create its initial seed set before running Lloyd's algorithm to convergence [8]. Given an existing set of centers $S$, the $D^2$-sampling strategy, as the name suggests, samples each point $x \in X$ with probability proportional to its squared distance to the selected centers, i.e.,
$$p(x \mid S) = \frac{d(x, S)^2}{\sum_{x' \in X} d(x', S)^2}.$$
The $D^2$-sampling procedure is described in Algorithm 2.
Require: data set X, number of clusters k
Ensure: initial set S used for k-means
The set S produced by this algorithm is used to replace line 1 of the original k-means procedure in Algorithm 1.
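The steps of Algorithm 2 are likewise missing from the extracted text; the sketch below illustrates $D^2$-sampling seeding as just described (first center chosen uniformly, each further center sampled with probability proportional to its squared distance to the already selected centers). The function name is ours.

```python
import numpy as np

def d2_sampling(X, k, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # First center: chosen uniformly at random.
    S = [X[rng.integers(n)]]
    for _ in range(1, k):
        # Squared distance of every point to its closest already-selected center.
        d2 = np.min(((X[:, None, :] - np.array(S)[None, :, :]) ** 2).sum(axis=2), axis=1)
        # Sample the next center with p(x | S) = d(x, S)^2 / sum_x' d(x', S)^2.
        probs = d2 / d2.sum()
        S.append(X[rng.choice(n, p=probs)])
    return np.array(S)
```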
Coresets
Definition
In this thesis, we apply data sampling via coresets to deal with the massive data setting. In computational geometry, a coreset is a small set of points that approximates the shape of a larger point set, in the sense that applying some geometric measure to the two sets gives approximately equal results [Wikipedia]. In clustering terms, a coreset is a weighted subset of the data such that the quality of any clustering evaluated on the coreset closely approximates the quality on the full data set.
In most cases, it is not easy to find this most relevant subset. Consequently, attention has shifted to developing approximation algorithms. The goal now is to compute a $(1+\varepsilon)$-approximation subset, for some $0 < \varepsilon < 1$.

Theorem 1 (Bounds on the Sample Complexity of Learning)
Let $\alpha > 0$, $v > 0$ and $\delta > 0$. Fix a countably infinite domain $X$ and let $p(\cdot)$ be any probability distribution over $X$. Let $\mathcal{F}$ be a set of functions from $X$ to $[0,1]$ with $\mathrm{Pdim}(\mathcal{F}) = d$, and denote by $C$ a sample of $m$ points from $X$ sampled independently according to $p(\cdot)$.
Then, for
$$m \ge \frac{c}{\alpha^2 v}\left(d \log \frac{1}{v} + \log \frac{1}{\delta}\right),$$
where $c$ is an absolute constant, it holds with probability at least $1 - \delta$ that
$$\sup_{f \in \mathcal{F}} d_v\!\left(\mathbb{E}_{p}[f],\; \frac{1}{m}\sum_{x \in C} f(x)\right) \le \alpha,$$
where $d_v(a, b) = \frac{|a - b|}{a + b + v}$. Over all choices of $\mathcal{F}$ with $\mathrm{Pdim}(\mathcal{F}) = d$, this bound on $m$ is tight.
In this chapter, we propose a coreset construction for k-means and k-median clustering based on the FFT algorithm. Even though this algorithm has a high runtime and requires a lot of computation, it produces what can be considered one of the most representative coresets, suitable for research and scientific purposes.
• Firstly, we show that the Farthest-First-Traversal algorithm (FFT) can yield a (k,ε)-coreset for both k-median and k-means clustering.
• Secondly, we illustrate some existing limitations of ProTraS [58], the state-of-the-art coreset construction based on FFT.
• From that, based on FFT combined with the good points of ProTraS, we propose an algorithm for coreset construction for both k-means and k-median clustering.
• We compare this proposed coreset with other state-of-the-art sample coresets - the Lightweight Coreset of Bachem et al. [12] and the Adaptive Sampling of Feldman et al. [23] - and with Uniform Sampling as a baseline, to show that the proposed coreset can be considered the most suitable subset of the original full data.
Even though this thesis is mainly about coresets for k-means clustering, in this section we also prove results about coresets for k-median clustering. Therefore, the FFT-based coresets can be applied not only to k-means but also to k-median clustering.
Farthest-First-Traversal Algorithm
We start this chapter with a short introduction to the Farthest-First-Traversal (FFT) algorithm. In computational geometry, the FFT of a metric space is a set of points selected sequentially; after the first point is chosen arbitrarily, each successive point is the one farthest from the set of previously selected points. The first use of the FFT was by Rosenkrantz, Stearns & Lewis [59] in connection with heuristics for the traveling salesman problem. Then, Gonzalez [26] used it as part of a greedy approximation algorithm for the problem of finding k clusters that minimize the maximum diameter of a cluster. Later, Arthur & Vassilvitskii [6] used an FFT-like strategy to propose the k-means++ algorithm.
The FFT is described in Algorithm 3.
Algorithm 3: Farthest-First-Traversal algorithm
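The body of Algorithm 3 is also missing from the extracted text. The following sketch illustrates farthest-first traversal as described above (arbitrary first point, then repeatedly picking the point farthest from the selected set); the function name and the fixed sample size m are illustrative choices, not the thesis's original listing.

```python
import numpy as np

def farthest_first_traversal(X, m, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    selected = [int(rng.integers(n))]           # first point chosen arbitrarily
    # Distance of every point to the current selected set.
    dist = np.linalg.norm(X - X[selected[0]], axis=1)
    for _ in range(1, m):
        nxt = int(dist.argmax())                # farthest point from the selected set
        selected.append(nxt)
        # Update distances with the newly selected point.
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(selected)                   # indices of the selected sample
```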
As mentioned, the FFT algorithm can be used to solve many complicated problems in data mining and machine learning. In the next section, we prove that the process of FFT yields coresets for both k-median and k-means clustering; we can therefore find a coreset by applying the FFT algorithm.
FFT-based Coresets for k-Median and k-Means Clustering
First, we define some notation used in this section.
Let $X \subset \mathbb{R}^d$ be a data set and $x \in X$. Let $C \subset X$ be a subset of $X$. For each $c \in C$, we denote:
• $T(c)$ = the set of items from $X$ whose closest point in $C$ is $c$;
• $w_c = |T(c)|$ = the number of items from $X$ whose closest point in $C$ is $c$, used as the weight of $c$;
• $d_c = \max_{t \in T(c)} d(t, c)$ = the largest distance from a point in $T(c)$ to $c$.
Theorem 2 (FFT-based Coresets for k-Median Clustering)
There exists an $\varepsilon > 0$ such that the subset obtained from the FFT algorithm on a data set $X$ is a $(k,\varepsilon)$-coreset of $X$ for k-median clustering.
Proof: k-median is a variation of k-means where, instead of calculating the mean of each cluster to determine its centroid, one calculates the median. With k-median, the functions $\varphi$ of $X$ and $C$ are defined as
$$\varphi_X(Q) = \sum_{x \in X} d(x, Q) = \sum_{x \in X} \min_{q \in Q} \|x - q\|,$$
$$\varphi_C(Q) = \sum_{c \in C} w_c \, d(c, Q) = \sum_{c \in C} w_c \min_{q \in Q} \|c - q\|.$$
Using the triangle inequality, $\forall x \in X$ and $c \in C$, we have
$$d(x, Q) \le d(x, c) + d(c, Q).$$
Summing this inequality over all $x \in T(c)$,
Denote $|X| = n$ and $|C| = m$. We now prove that the value of $\varepsilon$ approaches zero when the size of the coreset increases, i.e. $\lim_{m \to n} \varepsilon = 0$.
From the definition of $T(c)$, we have
Hence, $\sum_{c \in C} w_c = n - m$. Since $w_i > 0$, $\forall i = 1, 2, \dots, m$, and by applying the Cauchy-Schwarz inequality, we have
So, the subset $C$ obtained from FFT on the data $X$ is a $(k,\varepsilon)$-coreset of $X$ for k-median clustering.
Theorem 3 (FFT-based Coresets for k-Means Clustering)
There exists an $\varepsilon > 0$ such that the subset obtained from the FFT algorithm on a data set $X$ is a $(k,\varepsilon)$-coreset of $X$ for k-means clustering.
Proof: With k-means, the functions $\varphi$ of $X$ and $C$ are
$$\varphi_X(Q) = \sum_{x \in X} d(x, Q)^2 = \sum_{x \in X} \min_{q \in Q} \|x - q\|^2,$$
$$\varphi_C(Q) = \sum_{c \in C} w_c \, d(c, Q)^2 = \sum_{c \in C} w_c \min_{q \in Q} \|c - q\|^2.$$
We denote $d_{\max} = \max_{x \in X} d(x, Q)$.
Using the triangle inequality, $\forall x \in X$ and $c \in C$, we have
$$d(x, Q) \le d(x, c) + d(c, Q),$$
$$d(x, Q) \le d_c + d(c, Q), \quad \forall x \in T(c),$$
$$d(x, Q)^2 \le \left(d_c + d(c, Q)\right)^2,$$
$$d(x, Q)^2 \le d_c^2 + d(c, Q)^2 + 2 d_c d_{\max}.$$
Summing this inequality over all $x \in T(c)$, then over all $c \in C$,
$$\sum_{x \in T(c)} d(x, Q)^2 \le w_c d_c^2 + w_c d(c, Q)^2 + 2 w_c d_c d_{\max},$$
$$\sum_{c \in C} \sum_{x \in T(c)} d(x, Q)^2 \le \sum_{c \in C} w_c d_c^2 + \sum_{c \in C} w_c d(c, Q)^2 + 2 \sum_{c \in C} w_c d_c d_{\max}.$$
Similarly, from the triangle inequality,
$$d(c, Q) \le d(x, c) + d(x, Q),$$
$$d(c, Q) \le d_c + d(x, Q), \quad \forall x \in T(c),$$
$$d(c, Q)^2 \le d_c^2 + d(x, Q)^2 + 2 d_c d_{\max}.$$
Summing this inequality over all $x \in T(c)$, then over all $c \in C$, this implies
$$w_c d(c, Q)^2 \le w_c d_c^2 + \sum_{x \in T(c)} d(x, Q)^2 + 2 w_c d_c d_{\max},$$
$$\sum_{c \in C} w_c d(c, Q)^2 \le \sum_{c \in C} w_c d_c^2 + \sum_{c \in C} \sum_{x \in T(c)} d(x, Q)^2 + 2 \sum_{c \in C} w_c d_c d_{\max}.$$
Let
$$\Delta = \sum_{c \in C} w_c d_c \left(d_c + 2 d_{\max}\right)$$
and choose
$$\varepsilon = \frac{\Delta}{\varphi_X(Q)} = \frac{\sum_{c \in C} w_c d_c \left(d_c + 2 d_{\max}\right)}{\varphi_X(Q)}. \quad (3.6)$$
Then, combining with (3.4) and (3.5), we have
To finish the proof, we also need to show that the value of this $\varepsilon$ approaches zero when the size of the coreset increases.
Similarly to the proof of Theorem 2 for the case of k-median clustering, denote $|X| = n$ and $|C| = m$; we have $\sum_{c \in C} w_c = n - m$. Since $w_i > 0$, $\forall i = 1, 2, \dots, m$, and by applying the Cauchy-Schwarz inequality, we have
$$\lim_{m \to n} \varepsilon \;\le\; \lim_{m \to n} \, (n - m) \, \frac{\sqrt{\sum_{c \in C} d_c^2 \left(d_c + 2 d_{\max}\right)^2}}{\varphi_X(Q)} \;=\; 0.$$
So, the subset $C$ obtained from FFT on the data $X$ is a $(k,\varepsilon)$-coreset of $X$ for k-means clustering.
From Theorem 2 for k-median and Theorem 3 for k-means clustering, we conclude this section with a combined theorem as follows.
Theorem 4 (FFT-based Coresets for both k-Median and k-Means Clustering)
The sample obtained by applying the FFT algorithm to a data set $X$ is a $(k,\varepsilon)$-coreset of $X$ for both k-median and k-means clustering.
ProTraS algorithm and limitations
ProTraS algorithm
In 2017, Ros and Guillaume proposed DENDIS [56] and DIDES [57], which are iterative algorithms based on a hybridization of distance and density concepts. They differ in the priority given to distance or density, and in the stopping criterion defined accordingly; however, both have drawbacks. In 2018, based on the FFT algorithm and the good points of DENDIS and DIDES, Ros and Guillaume proposed a new algorithm named ProTraS [58] that is both easy to tune and scalable. ProTraS is based on a sampling cost which is computed according to the within-group distance and the representativeness of each sample item. The algorithm is designed to produce a $(k,\varepsilon)$-coreset and uses the approximation level $\varepsilon$ as the stopping criterion. It has since been used with good results in research such as [62] by Le Hong Trang et al.
The original ProTraS is described in Algorithm 4.
Drawbacks in ProTraS
ProTraS is a good method to build a coreset, but it still has some drawbacks. In Algorithm 4, the sampling cost formulation at line 18 and the stopping criterion at line 24 lead to problems on real data sets, such as the inability to distinguish data sets with the same shape but a different scale. Figure 3.1 shows an example of two data sets: the original R15¹ and a bigger one, R15 scaled by a factor of 10. With the same shape and the same size, these two data sets should produce two coresets of the same size. In fact, for $\varepsilon = 0.2$, Algorithm 4 creates a coreset of size 128 for the original R15 and a coreset of size 412 for the scaled R15. The error comes from the cost function of Algorithm 4. According to line 18 of Algorithm 4, the cost function can be expressed as
$$\mathrm{cost} = \sum_{y_k \in C} \frac{p_k}{n} = \sum_{y_k \in C} \frac{w_k d_k}{n}. \quad (3.7)$$
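To make equation (3.7) concrete, the toy snippet below computes this cost for a given set of representatives, assuming (as in the FFT notation earlier in this chapter) that $w_k$ counts the points attached to the k-th representative and $d_k$ is their largest distance to it; the names are ours. Because $d_k$ scales with the data, the cost of the scaled R15 is roughly ten times that of the original, which is exactly the drawback discussed here.

```python
import numpy as np

def protras_cost(X, reps):
    """cost = sum_k w_k * d_k / n for representatives reps (shape (m, d))."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - reps[None, :, :], axis=2)  # (n, m) distances
    assign = d.argmin(axis=1)                                     # nearest representative
    cost = 0.0
    for k in range(len(reps)):
        group = d[assign == k, k]          # distances of points attached to rep k
        if group.size:
            cost += group.size * group.max() / n   # w_k * d_k / n
    return cost
```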
1 R15 is a sample dataset from https://cs.joensuu.fi/sipu/datasets
1: Select an initial pattern $x_{init} \in T$
12: Store $d_{\max}(y_k)$, $x_{\max}(y_k)$ where $d_{\max}(y_k) = d(x_{\max}(y_k), y_k)$
FIGURE 3.1: Original-R15 (small cluster) and scaling-R15
ProTraS stops when $\mathrm{cost} < \varepsilon$.
Let $\varepsilon > 0$, $\alpha > 0$ and $k \in \mathbb{N}$. Let $X \subset \mathbb{R}^d$ be a set of points with mean $\mu_X$, and denote by $p(x)$ a probability distribution on $X$.
Let $C$ be a subset of $X$ obtained by sampling $|C| = m$ points from $X$, where each point $x \in C$ has weight $w_x^C = \frac{1}{m \, p(x)}$ and is sampled with probability $p(x)$.
We have, for an absolute constant $c$: if $m$ satisfies
$$m \ge \frac{c}{\varepsilon^2}\left(dk \log k + \log \frac{1}{\delta}\right),$$
then, with probability at least $1 - \delta$, the set $C$ is an $(\alpha,\varepsilon,k)$-lightweight coreset of $X$ for k-means clustering.
Proof: We have to prove that the set $C$ satisfies Definition 4 of the $(\alpha,\varepsilon,k)$-lightweight coreset, i.e. we need to prove
$$|\varphi_X(Q) - \varphi_C(Q)| \le \alpha \varepsilon \, \varphi_X(Q) + (1 - \alpha)\varepsilon \, \varphi_X(\{\mu_X\}). \quad (4.3)$$
From the definition of k-means clustering,
$$\varphi_X(Q) = \sum_{x \in X} d(x, Q)^2, \qquad \varphi_X(\{\mu_X\}) = \sum_{x \in X} d(x, \mu_X)^2, \qquad \varphi_C(Q) = \sum_{x \in C} w_x^C \, d(x, Q)^2.$$
Then $\frac{d(x,Q)^2}{H(Q)} \le g(x)$. Let $G$ be the average of all $g(x)$ over $x \in X$; we obtain,
by applying Lemma 1 with the relation among $g(x)$, $G$, $|X|$ and $p(x)$, the function
$$f_Q(x) = \frac{d(x, Q)^2}{\cdots}$$
We apply Theorem 1 on the bounds on the sample complexity of learning with parameters as follows.
Then, there exist $\delta > 0$ and an absolute constant $c$ such that, for
$$m \ge \frac{c}{\varepsilon^2}\left(dk \log k + \log \frac{1}{\delta}\right),$$
with probability at least $1 - \delta$, we have
$$d_{\frac{1}{k}}\!\left(\mathbb{E}_p[f_Q^*],\; \frac{1}{|C|}\sum_{x \in C} f_Q^*(x)\right) \le \frac{\alpha(1-\alpha)\varepsilon}{12(2\alpha+1)}, \quad (4.5)$$
where $d_v(a, b) = \frac{|a-b|}{a+b+v}$. Since $k$ is the number of clusters, $k \ge 1$, hence $0 \le \frac{1}{k} \le 1$.
Then, combining with inequality (4.5), we obtain
$$\left|\,\mathbb{E}_p[f_Q^*] - \frac{1}{|C|}\sum_{x \in C} f_Q^*(x)\,\right| \le \frac{\alpha(1-\alpha)\varepsilon}{4(2\alpha+1)}.$$
Substituting $f_Q^*(x)$ in this inequality with $f_Q^*(x) = \alpha(1-\alpha)\cdots$
Multiplying both sides by $|X| \cdot H(Q)$, we obtain
$\cdots \frac{\varphi_X(\{\mu_X\})}{|X|}$ and $w_x^C = \frac{1}{m \, p(x)}$. Then, inequality (4.8) is equivalent to
Lemma 1
Let $X \subset \mathbb{R}^d$ be a set of points with mean $\mu_X$. Denote
$$H(Q) = \frac{\alpha}{|X|}\varphi_X(Q) + \frac{1-\alpha}{|X|}\varphi_X(\{\mu_X\}).$$
For all $x \in X$ and $Q \subset \mathbb{R}^d$, it holds that
$$\frac{d(x, Q)^2}{H(Q)} \le \cdots$$
Proof: Reminder 1 (Cauchy-Schwarz inequality): for $a$ and $b$,
$$(a + b)^2 \le 2a^2 + 2b^2. \quad (4.9)$$
Reminder 2 (basic fraction inequality): for $a, b, c, d > 0$,
$$\frac{a+b}{c+d} \le \frac{a}{c} + \frac{b}{d}. \quad (4.10)$$
By the triangle inequality, we have
$$d(\mu_X, Q) \le d(x, \mu_X) + d(x, Q).$$
Applying the Cauchy-Schwarz inequality (4.9), then
$$d(\mu_X, Q)^2 \le \left(d(x, \mu_X) + d(x, Q)\right)^2 \le 2 d(x, \mu_X)^2 + 2 d(x, Q)^2.$$
Averaging over all $x \in X$, we obtain
$$\sum_{x \in X} d(\mu_X, Q)^2 \le \sum_{x \in X} \left(2 d(x, \mu_X)^2 + 2 d(x, Q)^2\right)$$
Similarly, by the triangle inequality and the Cauchy-Schwarz inequality (4.9),
$$d(x, Q) \le d(x, \mu_X) + d(\mu_X, Q).$$
$$d(x, Q)^2 \le 2 d(x, \mu_X)^2 + \frac{4}{|X|}\varphi_X(\{\mu_X\}) + \frac{4}{|X|}\varphi_X(Q). \quad (4.12)$$
Dividing both sides of (4.12) by $H(Q)$, we obtain
$$\frac{d(x, Q)^2}{H(Q)} \le \frac{2 d(x, \mu_X)^2 + \frac{4}{|X|}\varphi_X(\{\mu_X\}) + \frac{4}{|X|}\varphi_X(Q)}{\frac{\alpha}{|X|}\varphi_X(Q) + \frac{1-\alpha}{|X|}\varphi_X(\{\mu_X\})}.$$
Applying the basic fraction inequality (4.10) to the right-hand side gives the desired bound on $\frac{d(x, Q)^2}{H(Q)}$.
In this chapter, we use our proposed model for the α-lightweight coreset from Chapter 4, combined with Apache Spark, a framework for big data processing, to solve the main problem of this thesis - clustering large datasets. This chapter includes the following.
• We describe the process of clustering large datasets using the α-lightweight coreset from Chapter 4 and the built-in library of Apache Spark. This process also includes a data generalization step that clusters the full data from the clustering solution on its subset.
• We run experiments with the above method and test how well it performs compared to clustering directly on the original dataset.
• We use the Adjusted Rand Index to evaluate the comparison and give a discussion at the end of this chapter.
Processing Method
Data Generalization
When the dataset is big enough, k-means clustering, even with the help of a big data processing framework such as Apache Hadoop or Spark, will still be very slow. Therefore, instead of solving the problem directly on the original dataset, we find the solution on its coresets, which are proved to be representative of the full dataset. Since the size of the coresets is much smaller than that of the original dataset, k-means clustering becomes much faster and easier. Then, from the solutions on these coresets, we generalize to the full dataset and retrieve the final answer on the original dataset.
For the data generalization process, we apply a very intuitive and easy-to-apply method: the cluster index of any point not belonging to the coreset is the same as that of its nearest point in the coreset. Moreover, since in Chapter 4 we have proved that the (α,ε,k)-lightweight coreset can be used as a "traditional" coreset, the data generalization method can be stated as follows.
Given a dataset $X \subset \mathbb{R}^d$ and $C$, the $(\alpha,\varepsilon,k)$-lightweight coreset of $X$.
Then, for each $x^* \in X \setminus C$, the cluster label of $x^*$ is the same as the cluster label of $c^*$, where $c^* \in C$ and $d(x^*, c^*) = \min_{c \in C} d(x^*, c)$.
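A minimal sketch of this generalization rule, assuming the data fit in memory as NumPy arrays (function and variable names are illustrative):

```python
import numpy as np

def generalize_labels(X, coreset, coreset_labels):
    """Assign to every point of X the cluster label of its nearest coreset point."""
    coreset_labels = np.asarray(coreset_labels)
    d = np.linalg.norm(X[:, None, :] - coreset[None, :, :], axis=2)  # (n, m)
    nearest = d.argmin(axis=1)            # index of the nearest coreset point
    return coreset_labels[nearest]        # propagated cluster labels
```

For very large data sets, the pairwise-distance matrix should be computed in chunks, or with a spatial index, to keep memory bounded.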
Built-in k-Means clustering in Spark
(This section is based on the original article on the Spark website¹.) k-means is one of the most commonly used clustering algorithms; it clusters the data points into a predefined number of clusters. The spark.mllib implementation includes a parallelized variant of the k-means++ method called kmeans||.
The implementation in spark.mllib has the following parameters:
• k is the number of desired clusters. Note that it is possible for fewer than k clusters to be returned, for example, if there are fewer than k distinct points to cluster.
• maxIterations is the maximum number of iterations to run.
• initializationMode specifies either random initialization or initialization via k-means||.
1 https://spark.apache.org/docs/latest/mllib-clustering.html
• initializationSteps determines the number of steps in the k-means|| algorithm.
• epsilon determines the distance threshold within which we consider k-means to have converged.
• initialModel is an optional set of cluster centers used for initialization. If this parameter is supplied, only one run is performed.
In general, it is quite easy to apply kmeans|| in Spark; a program written in Python needs just a few lines of code, as follows.
LISTING 5.1: Python code for kmeans|| in Spark
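The code of Listing 5.1 did not survive extraction; the sketch below follows the official spark.mllib k-means example referenced above, with a placeholder input path and illustrative parameter values rather than the thesis's exact settings.

```python
from numpy import array
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="KMeansOnCoreset")

# Load and parse the data: one space-separated point per line (placeholder path).
data = sc.textFile("data/coreset_sample.txt")
parsed = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Cluster with the parallel k-means++ variant (k-means||) as initialization.
model = KMeans.train(parsed, k=100, maxIterations=20,
                     initializationMode="k-means||")

centers = model.clusterCenters            # learned cluster centers
labels = model.predict(parsed).collect()  # cluster index for every sample point
```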
Realistic Method
With kmeans|| in Spark and Algorithm 7 for constructing an (α,ε,k)-lightweight coreset, clustering a large-scale dataset is executed step by step as follows.
1. Step 1. Apply Algorithm 7 to generate an (α,ε,k)-lightweight coreset of the original data set.
2. Step 2. Use kmeans|| in Spark to cluster the sample obtained from Step 1. After this step, we have the labels of all data points in the sample.
3. Step 3. Based on the results from Step 2, and by using the method described in Section 5.1.1, we generalize to the full data set to get all data labels. Now, all data points are assigned cluster labels and the clustering process is finished.
Experiments
Experimental Method
To evaluate how well this clustering process works, we compare several approaches against the results of kmeans|| in Apache Spark applied directly to the original data. The labels and running time on the full data are then used as the baseline for the other methods. In this thesis, we use three different approaches, as follows.
• Method 1 (denoted as Uniform): Uniform Sampling combined with kmeans|| in Spark: a naive approach to coreset construction based on uniform sub-sampling of the data. Uniform Sampling replaces Algorithm 7 in Step 1 of Section 5.1.3.
• Method 2 (denoted as Lightweight): the traditional Lightweight Coreset: Algorithm 6 of the lightweight coreset combined with kmeans|| in Spark (a sketch of this construction is given after this list).
• Method 3 (denoted as Alpha34): the α-Lightweight Coreset with kmeans|| in Spark. For this experiment, we choose α = 3/4; the probability distribution on X for this method is the α-lightweight coreset distribution proposed in Chapter 4.
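As a concrete reference for Method 2, the following is a sketch of the standard lightweight coreset sampling of Bachem et al. [12], in which the probability mass is split equally between a uniform term and a distance-to-mean term; Method 3 replaces this distribution with the α-variant defined in Chapter 4. The function and variable names are ours.

```python
import numpy as np

def lightweight_coreset(X, m, seed=0):
    """Standard lightweight coreset sampling (Bachem et al. [12])."""
    rng = np.random.default_rng(seed)
    n = len(X)
    mu = X.mean(axis=0)
    dist2 = ((X - mu) ** 2).sum(axis=1)
    # q(x) = 1/2 * 1/n + 1/2 * d(x, mu)^2 / sum_x' d(x', mu)^2
    q = 0.5 / n + 0.5 * dist2 / dist2.sum()
    idx = rng.choice(n, size=m, replace=True, p=q)
    weights = 1.0 / (m * q[idx])          # importance weights w_x = 1 / (m q(x))
    return X[idx], weights
```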
To compare the quality of the predicted labels from Methods 1, 2 and 3 with the labels obtained on the full data set (the baseline), we use the Adjusted Rand Index (ARI) proposed by Hubert and Arabie [32], an adjusted version of the Rand Index by William Rand [52].
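For reference, the ARI comparison itself is a one-liner with scikit-learn's adjusted_rand_score; the labels below are a toy example, not results from the thesis.

```python
from sklearn.metrics import adjusted_rand_score

# Toy example: baseline labels from clustering the full data directly,
# and labels produced by the coreset-based pipeline after generalization.
labels_full = [0, 0, 1, 1, 2, 2]
labels_from_coreset = [1, 1, 0, 0, 2, 2]   # same partition, permuted label ids

print(adjusted_rand_score(labels_full, labels_from_coreset))  # prints 1.0
```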
Since all methods are based on sampling, in order to make the experiments more precise, each configuration of each method is run 20 times, and we use the average value to represent the result of each experiment.
All experiments were implemented in Python and run on an Intel Core i7 machine with 8×2.8 GHz processors and 16 GB memory.
Experimental Data Sets
We use 5 large datasets with multiple dimensions from the data clustering repository of the School of Computing of the University of Eastern Finland², and from the GitHub clustering benchmark³. These datasets are described in Table 5.1.
2 https://cs.joensuu.fi/sipu/datasets
3 https://github.com/deric/clustering-benchmark
TABLE 5.1: Data sets for experiments
Data Name | Size | No. of Clusters | No. of Dimensions
Results
The relationship between the ARI and the coreset size is shown in Figures 5.1, 5.2, 5.3, 5.4 and 5.5 for each dataset described in Table 5.1. For all results, when the coreset size increases, the ARI also has a higher value; this means that all algorithms create better coresets when more data points are chosen. Tables 5.2, 5.3, 5.4, 5.5 and 5.6 show the full experimental results, including
• Coreset Runtime: the time to execute the coreset construction algorithm
• Spark Runtime: the time to cluster the samples with k-means in Spark
• DataGen Runtime: the time for the data generalization process
• Total Runtime: the total time of the clustering process, from finding the coreset to the clustering of the original dataset
The results can be briefly summarized as follows.
• In most cases, Uniform Sampling gives the lowest ARI, which means Uniform Sampling is the worst coreset construction. This is understandable, since Uniform Sampling is the simplest and most naive method.
• The Lightweight Coreset and the (α = 3/4)-lightweight coreset create coresets having nearly equal ARI in most cases. These results are better than those of Uniform Sampling.
• These three methods are extremely fast, with nearly the same runtime.
• During the process of clustering a large-scale dataset, the Coreset Runtime is very short (less than 0.1 s), the DataGen process needs about 2-3 s, and the most expensive part is the clustering time in Spark.
• In Figures 5.1, 5.2, 5.3, 5.4 and 5.5, the big red points represent the runtime of clustering the full dataset with Spark. This shows that clustering via coresets and Spark is much faster than solving the problem directly.
FIGURE 5.1: ARI and Runtime of Birch1 in relation to full data
FIGURE 5.2: ARI and Runtime of Birch2 in relation to full data
FIGURE 5.3: ARI and Runtime of Birch3 in relation to full data
FIGURE 5.4: ARI and Runtime of ConfLongDemo in relation to full data
FIGURE 5.5: ARI and Runtime of KDDCupBio in relation to full data
TABLE 5.2: Experimental results for dataset Birch1
Algorithm | Coreset Size | Coreset Runtime | Spark Runtime | DataGen Runtime | Total Runtime | ARI
TABLE 5.3: Experimental results for dataset Birch2
Algorithm | Coreset Size | Coreset Runtime | Spark Runtime | DataGen Runtime | Total Runtime | ARI
TABLE 5.4: Experimental results for dataset Birch3
Algorithm | Coreset Size | Coreset Runtime | Spark Runtime | DataGen Runtime | Total Runtime | ARI
TABLE 5.5: Experimental results for dataset ConfLongDemo
Algorithm | Coreset Size | Coreset Runtime | Spark Runtime | DataGen Runtime | Total Runtime | ARI
TABLE 5.6: Experimental results for dataset KDDCup Bio
Algorithm | Coreset Size | Coreset Runtime | Spark Runtime | DataGen Runtime | Total Runtime | ARI
Alpha34 | 16000 | 0.0178 | 9.6949 | 48.0922 | 57.8050 | 0.3195
Lightweight | 16000 | 0.3706 | 9.2364 | 47.5339 | 57.1409 | 0.3101
Uniform | 16000 | 0.3806 | 9.0320 | 47.7005 | 57.1130 | 0.2993
Alpha34 | 32000 | 0.0306 | 16.0353 | 44.5965 | 60.6624 | 0.3314
Lightweight | 32000 | 0.4077 | 16.1055 | 44.6503 | 61.1635 | 0.3309
Uniform | 32000 | 0.4074 | 15.0087 | 44.6620 | 60.0780 | 0.3111
In this thesis, we solve the problem of clustering large-scale datasets. The approaches we use are based on data sampling via coresets and on Apache Spark. While investigating data sampling via coresets, we propose two coreset constructions for k-means clustering. The whole thesis can be summarized as follows.
In Chapter 2, we introduce and give an overview of the background and related works. While k-means clustering is a very classical topic of machine learning, the "coreset" seems to be more fascinating. The meaning of a coreset is that instead of solving problems on big data, which costs a lot of computation, one can find a subset such that the solutions on this subset approximate the solutions on the original dataset. We also provide a brief introduction to Apache Spark in this chapter, as well as some definitions and theorems that are useful for and related to this thesis.
In Chapter 3, we have proved that the Farthest-First-Traversal algorithm itself is a very good method to find coresets for both the k-median and k-means problems. Based on this, we propose a novel coreset construction algorithm for k-means and k-median that depends on the desired coreset size. The disadvantage of this proposed algorithm, as well as of all other FFT-related algorithms, when compared with sampling methods, is the speed and runtime. This is obvious, since all FFT-related algorithms need to examine each point of the full data set.
However, the results of this algorithm contain not only a coreset of the full data with high accuracy but also very useful characteristics of the data: the maximal distance and the number of elements of each representative in the coreset. This information can be used for further purposes in research and applications that need to estimate the distribution, density or structure of the original data set. Moreover, unlike other sampling-based methods, which create different samples each time we re-run the process, the coreset from the proposed algorithm is unique and unchanged; this means that we only need to run it once and the obtained subset is truly a coreset of the full data for k-means and k-median clustering.
In Chapter 4, based on prior work about the Lightweight Coreset [12], we propose a general lightweight coreset, named the α-lightweight coreset, which allows both multiplicative and additive errors. Unlike the traditional lightweight coreset, where both multiplicative and additive errors are treated the same and have equal weight, the α-lightweight coreset allows adjusting the proportion between these two errors: α = 1/2 gives the traditional lightweight coreset, a larger α means more focus on the multiplicative error, and, conversely, a smaller α means more focus on the additive error.
In this chapter, we also propose and prove a general algorithm for the α-lightweight coreset construction. Since this approach is a sampling-based method, the algorithm executes extremely fast and can be used in practice.
In Chapter 5, we apply the method proposed in Chapter 4 to solve the main problem of this thesis - clustering large-scale datasets. To solve it properly and faster, we apply a framework for big data, Apache Spark. This approach allows the problem to run smoothly and quickly. Fortunately, Spark is a wonderful framework with various libraries and built-in functions for machine learning and data mining. With Spark, k-means++ clustering is easy to implement and deploy.
To evaluate the method, we run experiments on some large-scale sample data sets. The results show that data sampling through uniform sampling should be replaced by the α-lightweight coreset: with nearly equal running time, the results of both the traditional lightweight coreset and the α-lightweight coreset (in these experiments, we choose α = 3/4) outperform the results of random uniform sampling.
Overall, in this thesis, we propose and prove two methods for coreset construction for k-means clustering. The FFT-based coreset construction is very slow, but its accuracy makes its output one of the most representative subsets of the original data; this method should be used for research and science, and can serve as a baseline for comparison with future proposed methods. On the other hand, our second coreset construction, the α-lightweight coreset, is a sampling-based method. This approach can find a coreset very quickly, but its accuracy is not as high as that of the FFT-based coreset. Nevertheless, both methods are clearly better than random uniform sampling, a very naive and widely used method.
Finally, each method mentioned in this thesis has its own advantages and disadvantages. The options 'slow but more accurate' or 'fast but less correct' should be weighed before applying any of these algorithms in practice.