
Master's Thesis in Computer Science: Clustering Large Datasets Based on Sampling and the Spark Platform


DOCUMENT INFORMATION

Basic information

Title: Clustering Large Datasets Based on Sampling and the Spark Platform (Phân cụm các tập dữ liệu có kích thước lớn dựa vào lấy mẫu và nền tảng Spark)
Author: Nguyễn Lê Hoàng
Supervisors: Assoc. Prof. Dr. Đặng Trần Khánh, Dr. Lê Hồng Trang
University: University of Technology
Major: Computer Science
Document type: Master Thesis
Year: 2020
City: Ho Chi Minh City
Format
Pages: 82
Size: 2.58 MB

Structure

  • 1.1 Overview (17)
  • 1.2 The Scope of Research (20)
  • 1.3 Research Contributions (21)
    • 1.3.1 Scientific Significance (21)
    • 1.3.2 Practical Significance (22)
  • 1.4 Organization of Thesis (22)
  • 1.5 Publications relevant to this Thesis (22)
  • 2.1 k-Means and k-Means++ Clustering (24)
    • 2.1.1 k-Means Clustering (24)
    • 2.1.2 k-Means++ Clustering (25)
  • 2.2 Coresets (26)
    • 2.2.1 Definition (26)
    • 2.2.2 Some Coreset Constructions (27)
  • 2.3 Apache Spark (28)
    • 2.3.1 What is Apache Spark? (28)
    • 2.3.2 Why Apache Spark? (29)
  • 2.4 Bounds on Sample Complexity of Learning (30)
  • 3.1 Farthest-First-Traversal Algorithm (32)
  • 3.2 FFT-based Coresets for k-Median and k-Means Clustering (32)
  • 3.3 ProTraS algorithm and limitations (37)
    • 3.3.1 ProTraS algorithm (37)
    • 3.3.2 Drawbacks in ProTraS (37)
  • 3.4 Proposed FFT-based Coreset Construction (40)
    • 3.4.1 Proposed Algorithm (40)
    • 3.4.2 Initial Step (41)
    • 3.4.3 Decrease the Computational Complexity (41)
  • 3.5 Experiments (42)
    • 3.5.1 Experiment Setup (42)
    • 3.5.2 Results and Discussion (45)
  • 4.1 Lightweight Coreset (50)
    • 4.1.1 Definition (50)
    • 4.1.2 Algorithm (50)
  • 4.2 The α -Lightweight Coreset (51)
    • 4.2.1 Definition (51)
    • 4.2.2 Theorem about the Optimal Solutions (52)
  • 4.3 Algorithm (54)
  • 4.4 Analysis (55)
  • 5.1 Processing Method (62)
    • 5.1.1 Data Generalization (62)
    • 5.1.2 Built-in k-Means clustering in Spark (62)
    • 5.1.3 Realistic Method (63)
  • 5.2 Experiments (64)
    • 5.2.1 Experimental Method (64)
    • 5.2.2 Experimental Data Sets (64)
    • 5.2.3 Results (65)
  • Figure 1.1 Big Data properties
  • Figure 1.2 Machine Learning: Supervised vs Unsupervised
  • Figure 2.1 Spark Logo - https://spark.apache.org
  • Figure 2.2 The Components of Spark
  • Figure 3.1 Original-R15 (small cluster) and scaling-R15
  • Figure 3.2 Some data sets for experiments
  • Figure 3.3 ARI in relation to subsample size for datasets D1 - D8
  • Figure 3.4 ARI in relation to subsample size for datasets D9 - D16
  • Figure 5.1 ARI and Runtime of Birch1 in relation to full data
  • Figure 5.2 ARI and Runtime of Birch2 in relation to full data
  • Figure 5.3 ARI and Runtime of Birch3 in relation to full data
  • Figure 5.4 ARI and Runtime of ConfLongDemo in relation to full data
  • Figure 5.5 ARI and Runtime of KDDCupBio in relation to full data
  • Table 3.1 Data sets for Experiments
  • Table 3.2 Experimental Results - Adjusted Rand Index Comparison
  • Table 3.3 Experimental Results - Time Comparison
  • Table 5.1 Data sets for Experiments
  • Table 5.2 Experimental Results for dataset Birch1
  • Table 5.3 Experimental Results for dataset Birch2
  • Table 5.4 Experimental Results for dataset Birch3
  • Table 5.5 Experimental Results for dataset ConfLongDemo
  • Table 5.6 Experimental Results for dataset KDDCup Bio

Contents

Overview

Over the past few years, the development of technology has led to a rapid increase in the amount of data. Twenty years ago, most data came only from the sciences, but the explosion of the Internet has taken humanity into a new era of data. Data now exist everywhere around us. For instance, the growth of smart cities and Internet of Things (IoT) devices such as aerial remote sensing, cameras, microphones, radio-frequency identification (RFID) readers, wireless sensor networks, etc. creates a lot of data every second. Besides, the increase in mobile and personal devices such as smartphones or tablets also makes data grow every day, especially through photo and video sharing on popular social networks like Facebook, Twitter, YouTube, etc. Many studies have shown that the amount of data created each year is growing faster than ever before; they estimate that by 2020, every human on the planet will be creating 1.7 megabytes of information each second, and in only a year the accumulated world data will grow to 44 zettabytes 1 [46]. Another study from IDC predicts that the amount of global data captured in 2025 will reach 163 zettabytes, a tenfold increase compared to 2016 [55].

Consequently, researchers now face a new and difficult situation: solving problems for data that are big in volume, variety, velocity, veracity and value (Figure 1.1 2). Understanding and explaining these data in order to solve real-world problems is very hard for humans without help from machines. That is why machine learning plays an important role in this decade as well as in the future. By applying machine learning combined with artificial intelligence (AI), scientists can create systems that automatically learn and improve from experience without being explicitly programmed.

For each specific purpose, machine learning is divided into two categories: supervised and unsupervised. Supervised learning is a kind of training model where the training sets come with provided target labels; the system learns from these

1 One zettabyte is equivalent to one billion gigabytes.

2 Image source: https://www.edureka.co/blog/what-is-big-data/

FIGURE 1.1: Big Data properties

training sets and is then used to predict or classify future instances. In contrast, unsupervised machine learning approaches extract information from data sets where such explicit labels are not available. The importance of this field is expected to grow, as it is estimated that 85% of global data in 2025 will be unlabeled [55]. In particular, data clustering, the task of grouping similar objects together into clusters, seems to be a fruitful approach for analyzing such data [13]. Applications are broad and include fields such as computer vision [61], information retrieval [35], computational geometry [36] and recommendation systems [41]. Furthermore, clustering techniques can also be used to learn data representations that are used in downstream prediction tasks such as classification and regression [16]. The machine learning categories are described briefly in Figure 1.2 3.

In general, clustering is one of the most popular techniques in machine learning and is widely used in large-scale data analysis. The goal of clustering is to partition a set of objects into groups such that objects in the same group are similar to each other and objects in different groups are dissimilar. Due to its importance and its practical applications, this technique has received a lot of investigation and has many algorithms. For example, we can use BIRCH [68] or CURE [27], which belong to hierarchical clustering, also known as connectivity-based clustering, for solving problems based on the idea that objects are more related to nearby objects than to objects farther away. If the problems are closely related to statistics, we can use distribution-based clustering such as the Gaussian Mixture Model (GMM) [66] or DBCLASD [53]. For matters based on density clustering, in which data lying in a

3 Image source: https://towardsdatascience.com/supervised-vs-unsupervised-learning

FIGURE 1.2: Machine Learning: Supervised vs Unsupervised

region of high density of the data space is considered to belong to the same cluster [38], we can use Mean-shift [17], DBSCAN [20], the most well-known density-based clustering algorithm, or OPTICS [4], an improvement of DBSCAN. One of the most common approaches to clustering is based on partitioning, in which the basic idea is to assign the centers of the data to some random objects; the actual centers are then revealed through several iterations until a stopping condition is satisfied. Some common algorithms of this kind are k-means [45], k-medoids [49], CLARA [37] and CLARANS [48]. For more detail, we refer readers to the surveys of clustering algorithms by R. Xu (2005) [65] and by D. Xu (2015) [64].

In fact, there are many clustering algorithms and improvements that can be used in applications, and each one has its own benefits and drawbacks. Choosing a suitable clustering algorithm is an important and difficult problem that users must deal with when they face situations with specific configurations and settings. There is some research about this, such as [14], [39], [67], which discusses the quality of clusters in certain circumstances. However, in the scope of this thesis we do not cover this issue or the various clustering algorithms; instead, we fix and select one of the most popular clustering algorithms, k-means clustering. We use this algorithm throughout this report and investigate methods that can deal with k-means clustering for large-scale data sets.

Moreover, designing a complete solution that can cluster and analyze large-scale data is still a challenge for data scientists. Many methods have been proposed over the years to deal with machine learning for big data. One of the simplest ways depends on infrastructure and hardware: the more powerful and modern the machines, the more complicated and larger the amount of data we can handle. This solution is quite easy but costs a lot of money, and few people can afford it. Another option is to find suitable algorithms that reduce the computational complexity caused by an input that may contain millions or billions of data points. There are several such approaches, including data compression [69], [1], data deduplication [19], and dimension reduction [25], [60], [51], etc. For a survey, readers can find more useful information in [54]. Among big data reduction methods, data sampling is one of the popular options closely related to machine learning and data mining. The key idea of data sampling is that instead of solving a problem on the full large-scale data, we find the answer on a subset of the data; this result is then used as the baseline for finding the actual solution on the original data set. This leads us to a new difficulty: finding a subset that is small enough to effectively reduce computational complexity but keeps all the representative characteristics of the original data. This difficulty is the motivation for this research and this thesis.

The Scope of Research

In this thesis, we will propose a solution for the problem of clustering large datasets.

We use the word "large" to indicate data that is "big" in volume, not in all the characteristics of big data described in the previous section with the 5 V's (Volume, Variety, Value, Velocity and Veracity) (Figure 1.1). However, the Volume, in other words the data size, is one of the most non-trivial difficulties that most researchers have to face when solving a big-data-related problem.

Regarding the clustering algorithm, even though there are many investigations and methods, we consider a fixed clustering problem with the prototypical k-means clustering.

We select it because k-means is the most well-known clustering algorithm and is widely applied in practice, in industry and in scientific research.

While there is a wealth of prior work on clustering small and medium sized data sets, there are unique challenges in the massive data setting. Traditional algorithms have a super-linear computational complexity in the size of the data set, making them infeasible when there are many data points. In the scope of this thesis, we apply data sampling to deal with the massive data setting. A very basic approach of this kind is random sampling, or uniform sampling. In fact, while uniform sampling is feasible for some problems, there are instances where it performs very poorly due to the naive nature of the sampling strategy. For example, real-world data is often imbalanced and contains clusters of different sizes. As a consequence, a small fraction of data points can be very important and have an enormous impact on the objective function. Such imbalanced data sets are problematic for methods based on uniform sampling since, with high probability, these methods only sample points in large clusters and the information in small clusters is discarded [13].
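As a rough illustration of this failure mode (the numbers below are ours, not the thesis's): suppose a rare cluster contains only 100 of n = 1,000,000 points. A uniform sample of m = 1,000 points misses that cluster entirely with probability

\[ \left( 1 - \frac{100}{10^6} \right)^{1000} = \left( 1 - 10^{-4} \right)^{1000} \approx e^{-0.1} \approx 0.90, \]

so roughly nine out of ten uniform samples of this size carry no information about the small cluster at all.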

The idea of finding a relevant subset of the original data to decrease the computational cost led scientists to the concept of the coreset, which was first applied in geometric approximation by Agarwal et al. in 2004 [2], [3]. The problem of coreset construction for k-median and k-means clustering was then stated and investigated by Har-Peled et al. in [28], [29]. Since then, many coreset construction algorithms have been proposed for a wide variety of clustering problems. In this thesis, using state-of-the-art coreset methods, we propose two methods of coreset construction for k-means clustering.

In addition to devising machine learning algorithms for solving big data problems, data scientists also invent, create and apply frameworks to speed up the processing of big data. Some of the most popular open-source and free-to-use frameworks are Apache Hadoop, Apache Spark, Apache Storm, Apache Samza, Apache Flink, etc. Each one is designed with a different architecture and has its own strengths. For a survey and more details about these frameworks, readers can refer to [34]. In this thesis, along with data sampling via coresets, we will apply the built-in k-means clustering algorithm of Apache Spark to shorten the runtime of the whole problem.

Research Contributions

Scientific Significance

In this thesis, based on prior works about coresets, we propose new algorithms for coreset construction for k-means clustering.

• Based on the farthest-first-traversal algorithm and the ProTraS algorithm of Ros & Guillaume [58], we propose an FFT-based coreset construction. This part is explained and proved in Chapter 3.

• Based on the lightweight coreset of Bachem, Lucic and Krause [12], we propose a general model, the α-lightweight coreset, and then a general lightweight coreset construction that is very fast and practical. This is proved in Chapter 4.

Practical Significance

• Due to its high runtime, our proposed FFT-based coreset construction is hard to use in practice. However, experiments against some state-of-the-art coreset constructions show that the proposed algorithm is able to produce one of the best sample coresets for use in experiments.

• Our proposed α-lightweight coreset model is a generalization of the traditional lightweight coreset. It can be used in various practical cases, especially in situations that need to focus on the multiplicative errors or on the additive errors of the samples.

Organization of Thesis

The remainder of this thesis is organized as follows.

• Chapter 2 gives an overview of prior works related to this thesis, including the k-means and k-means++ algorithms, the definition of coresets, a short brief on Apache Spark, and a theorem about bounds on sample complexity.

• Chapter 3 introduces the farthest-first-traversal algorithm as well as ProTraS for finding coresets. We then propose an FFT-based algorithm for coreset construction.

• Chapter 4 presents the lightweight coreset and our general lightweight coreset model. We also prove the correctness of this model and propose a general algorithm for the α-lightweight coreset.

• Chapter 5 presents the experimental runs for clustering large datasets. We use the α-lightweight coreset for the sampling process and k-means++ for clustering on the Apache Spark framework.

• Chapter 6 concludes the thesis.

Publications relevant to this Thesis

• Nguyen Le Hoang, Tran Khanh Dang and Le Hong Trang. A Comparative Study of the Use of Coresets for Clustering Large Datasets. pp. 45-55, LNCS 11814, Future Data and Security Engineering, FDSE 2019.

• Le Hong Trang, Nguyen Le Hoang and Tran Khanh Dang. A Farthest-First-Traversal-based Sampling Algorithm for k-clustering. International Conference on Ubiquitous Information Management and Communication, IMCOM 2020.

In this chapter, we provide a short introduction to the background and prior works related to this thesis.


k-Means and k-Means++ Clustering

k-Means Clustering

The k-means clustering problem is one of the oldest and most important questions in machine learning. Given an integer k and a data set X ⊂ R^d, the goal is to choose k centers so as to minimize the total squared distance between each point and its closest center. The k-means clustering problem can be described as follows.

Let X ⊂ R^d. The k-means clustering problem is to find a set Q ⊂ R^d with |Q| = k such that the function φ_X(Q) is minimized, where

\[ \phi_X(Q) = \sum_{x \in X} d(x, Q)^2 = \sum_{x \in X} \min_{q \in Q} \lVert x - q \rVert^2 . \]

In 1957, an algorithm now often referred to simply as "k-means" was proposed by S. Lloyd of Bell Labs; it was later published in 1982 [42]. Lloyd's algorithm begins with k arbitrary "centers", typically chosen uniformly at random from the data points. Each point is then assigned to the nearest center, and each center is recomputed as the center of mass of all points assigned to it. These last two steps are repeated until the process stabilizes [6].

Lloyd's algorithm is described in Algorithm 1.

Algorithm 1: k-Means Clustering - Lloyd's Algorithm [42]

Require: data set X, number of clusters k
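The body of Algorithm 1 is not reproduced in this copy; the following Python sketch of Lloyd's iterations is our own illustration of the steps described above (random initialization, assignment, recomputation of centers), not the thesis's exact pseudocode.

```python
import numpy as np

def lloyd_kmeans(X, k, max_iters=100, tol=1e-6, rng=None):
    """Plain Lloyd's algorithm: random init, assign, update, repeat."""
    rng = np.random.default_rng(rng)
    # Line 1 of Algorithm 1: pick k arbitrary centers among the data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.linalg.norm(new_centers - centers) < tol:  # process stabilized
            centers = new_centers
            break
        centers = new_centers
    return centers, labels
```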

The algorithm was later refined by Inaba et al. [33], Matousek [47], Vega et al. [63], etc. However, one of the most notable improvements of k-means is k-means++ by Arthur and Vassilvitskii [6]. We give an overview of this algorithm in the next section.


k-Means++ Clustering

In Algorithm 1, the initial set of cluster centers (line 1) is based on random sampling, where k points are selected uniformly at random from the data set. This simple approach is fast and easy to implement. However, there are many natural examples for which the algorithm generates arbitrarily bad clusterings. This happens due to an unfortunate placement of the starting centers, and in particular it can occur with high probability even if the centers are chosen uniformly at random from the data points [6].

To overcome this problem, Arthur and Vassilvitskii [6] proposed the algorithm named k-means++, which uses adaptive seeding based on a technique called D²-sampling to create its initial seed set before running Lloyd's algorithm to convergence [8]. Given an existing set of centers S, the D²-sampling strategy, as the name suggests, samples each point x ∈ X with probability proportional to its squared distance to the already selected centers, i.e.,

\[ p(x \mid S) = \frac{d(x, S)^2}{\sum_{x' \in X} d(x', S)^2} . \]

D²-sampling is described in Algorithm 2.

Require: data set X, number of clusters k

Ensure: initial set S used for k-means

The set S produced by this algorithm replaces line 1 of the original k-means in Algorithm 1.
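The body of Algorithm 2 is likewise absent from this extract; the sketch below is our own hedged rendering of D²-sampling as described above (first center uniform, each subsequent center drawn with probability proportional to d(x, S)²).

```python
import numpy as np

def d2_seeding(X, k, rng=None):
    """k-means++ / D^2-sampling: return an initial set S of k centers."""
    rng = np.random.default_rng(rng)
    n = len(X)
    # First center: chosen uniformly at random.
    S = [X[rng.integers(n)]]
    # Squared distance of every point to its closest selected center.
    d2 = np.linalg.norm(X - S[0], axis=1) ** 2
    for _ in range(1, k):
        probs = d2 / d2.sum()              # p(x | S) proportional to d(x, S)^2
        idx = rng.choice(n, p=probs)
        S.append(X[idx])
        d2 = np.minimum(d2, np.linalg.norm(X - X[idx], axis=1) ** 2)
    return np.array(S)
```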


Coresets

Definition

In this thesis, we apply data sampling via coresets to deal with the massive data setting. In computational geometry, a coreset is a small set of points that approximates the shape of a larger point set, in the sense that applying some geometric measure to the two sets gives approximately equal results [Wikipedia]. In clustering terms, a coreset is a weighted subset of the data such that the quality of any clustering evaluated on the coreset closely approximates the quality on the full data set.
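A hedged formalization of this idea, using the φ notation introduced earlier (the exact numbering of the corresponding definition in the thesis is not visible in this extract): a weighted subset C of X with weights w_c is a (k, ε)-coreset of X if, for every set Q of k centers,

\[ \lvert \phi_X(Q) - \phi_C(Q) \rvert \;\le\; \varepsilon \, \phi_X(Q), \qquad \text{where } \phi_C(Q) = \sum_{c \in C} w_c \, d(c, Q)^2 . \]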

In most cases, it is not easy to find such a most relevant subset. Consequently, attention has shifted to developing approximation algorithms: the goal now is to compute a (1+ε)-approximation subset, for some 0 < ε < 1.

Bounds on Sample Complexity of Learning

Theorem 1. Let α > 0, v > 0 and δ > 0. Fix a countably infinite domain X and let p(.) be any probability distribution over X.

Let F be a set of functions from X to [0,1] with Pdim(F) = d, and denote by C a sample of m points from X drawn independently according to p(.).

Then, for

\[ m \ge \frac{c}{\alpha^2} \left( d\, v \log\frac{1}{v} + \log\frac{1}{\delta} \right), \]

where c is an absolute constant, it holds with probability at least 1 − δ that, for every f ∈ F,

\[ d_v\!\left( \mathbb{E}_{p}[f],\ \frac{1}{m}\sum_{x \in C} f(x) \right) \le \alpha, \qquad \text{where } d_v(a,b) = \frac{|a-b|}{a+b+v} . \]

Over all choices of F with Pdim(F) = d, this bound on m is tight.

In this chapter, we propose a coreset construction for k-means and k-median clustering based on the FFT algorithm. Even though this algorithm has a high runtime and requires a lot of computation, it can be considered to produce one of the most relevant coresets for research and scientific purposes.

• Firstly, we show that the Farthest-First-Traversal (FFT) algorithm can yield a (k,ε)-coreset for both k-median and k-means clustering.

• Secondly, we illustrate some existing limitations of ProTraS [58], the state-of-the-art coreset construction based on FFT.

• From that, based on FFT combined with the good points of ProTraS, we propose an algorithm for coreset construction for both k-means and k-median clustering.

• We compare this proposed coreset with other state-of-the-art sample coresets, namely the Lightweight Coreset of Bachem et al. [12], the Adaptive Sampling of Feldman et al. [23] and Uniform Sampling as a baseline, to show that the proposed coreset can be considered the most suitable subset of the original full data.

Even though this thesis is mainly about coresets for k-means clustering, in this section we also prove results about coresets for k-median clustering. Therefore, the FFT-based coresets can be applied not only to k-means but also to k-median clustering.

Farthest-First-Traversal Algorithm

We start this chapter with a short introduction to the Farthest-First-Traversal (FFT) algorithm. In computational geometry, the FFT of a metric space is a sequence of points selected one by one; after the first point is chosen arbitrarily, each successive point is the one farthest from the set of previously selected points. The first use of the FFT was by Rosenkrantz, Stearns & Lewis [59] in connection with heuristics for the traveling salesman problem. Then, Gonzalez [26] used it as part of a greedy approximation algorithm for the problem of finding k clusters that minimize the maximum diameter of a cluster. Later, Arthur & Vassilvitskii [6] used an FFT-like procedure to propose the k-means++ algorithm.

The FFT is described in Algorithm 3.

Algorithm 3: Farthest-First-Traversal algorithm
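The listing of Algorithm 3 did not survive extraction; the following Python sketch is our own hedged rendering of the traversal described above (arbitrary first point, then repeatedly take the point farthest from the current sample).

```python
import numpy as np

def farthest_first_traversal(X, m, rng=None):
    """Return the indices of an m-point FFT sample of X."""
    rng = np.random.default_rng(rng)
    n = len(X)
    # First point: chosen arbitrarily (here, uniformly at random).
    selected = [int(rng.integers(n))]
    # Distance of every point to the closest selected point so far.
    dist = np.linalg.norm(X - X[selected[0]], axis=1)
    while len(selected) < m:
        nxt = int(dist.argmax())           # farthest point from the sample
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(selected)
```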

As mentioned, the FFT algorithm can be used to solve many complicated problems in data mining and machine learning. In the next section, we prove that the FFT process yields coresets for both k-median and k-means clustering, so a coreset can be found by applying the FFT algorithm.

FFT-based Coresets for k-Median and k-Means Clustering

First, we define some notation used in this section.

Let X ⊂ R^d be a data set and x ∈ X. Let C ⊂ X be a subset of X. For each c ∈ C, we denote:

• w_c = |T(c)|: the number of items from X whose closest point in C is c, where T(c) denotes the set of these items;

• d_c = max_{t∈T(c)} d(t, c): the largest distance from any point in T(c) to c.

Theorem 2 (FFT-based Coresets for k-Median Clustering)

There exists an ε > 0 such that the subset obtained by running the FFT algorithm on a dataset X is a (k,ε)-coreset of X for k-median clustering.

Proof: k-median is a variation of k-means where, instead of calculating the mean of each cluster to determine its centroid, one calculates the median. With k-median, the functions φ of X and C are defined as

\[ \phi_X(Q) = \sum_{x \in X} d(x,Q) = \sum_{x \in X} \min_{q \in Q} \lVert x - q \rVert, \qquad
   \phi_C(Q) = \sum_{c \in C} w_c \, d(c,Q) = \sum_{c \in C} w_c \min_{q \in Q} \lVert c - q \rVert . \]

Using the triangle inequality, for all x ∈ X and c ∈ C we have

\[ d(x,Q) \le d(x,c) + d(c,Q) . \]

Summing this inequality over all x ∈ T(c), and then over all c ∈ C, gives the desired relation between φ_X(Q) and φ_C(Q).

Denote |X| = n and |C| = m. We now prove that the value of ε approaches zero when the size of the coreset increases, i.e., \( \lim_{m \to n} \varepsilon = 0 \).

From the definition of T(c), we have

\[ \sum_{c \in C} w_c = n - m . \]

Since w_i > 0 for all i = 1, 2, ..., m, applying the Cauchy-Schwarz inequality shows that ε tends to zero as m approaches n.

So, the subset C obtained from FFT on the data X is a (k,ε)-coreset of X for k-median clustering.

Theorem 3 (FFT-based Coresets for k-Means Clustering)

There exists an ε > 0 such that the subset obtained by running the FFT algorithm on a dataset X is a (k,ε)-coreset of X for k-means clustering.

Proof: With k-means, the functions φ of X and C are

\[ \phi_X(Q) = \sum_{x \in X} d(x,Q)^2 = \sum_{x \in X} \min_{q \in Q} \lVert x - q \rVert^2, \qquad
   \phi_C(Q) = \sum_{c \in C} w_c \, d(c,Q)^2 = \sum_{c \in C} w_c \min_{q \in Q} \lVert c - q \rVert^2 . \]

We denote \( d_{\max} = \max_{x \in X} d(x,Q) \).

Using the triangle inequality, for all x ∈ X and c ∈ C we have

\[ d(x,Q) \le d(x,c) + d(c,Q), \]
\[ d(x,Q) \le d_c + d(c,Q), \quad \forall x \in T(c), \]
\[ d(x,Q)^2 \le \big( d_c + d(c,Q) \big)^2, \]
\[ d(x,Q)^2 \le d_c^2 + d(c,Q)^2 + 2\, d_c\, d_{\max} . \]

Summing this inequality over all x ∈ T(c), and then over all c ∈ C,

\[ \sum_{x \in T(c)} d(x,Q)^2 \le w_c\, d_c^2 + w_c\, d(c,Q)^2 + 2\, w_c\, d_c\, d_{\max}, \]
\[ \sum_{c \in C} \sum_{x \in T(c)} d(x,Q)^2 \le \sum_{c \in C} w_c\, d_c^2 + \sum_{c \in C} w_c\, d(c,Q)^2 + 2 \sum_{c \in C} w_c\, d_c\, d_{\max} . \]

Similarly, from the triangle inequality,

\[ d(c,Q) \le d(x,c) + d(x,Q), \]
\[ d(c,Q) \le d_c + d(x,Q), \quad \forall x \in T(c), \]
\[ d(c,Q)^2 \le d_c^2 + d(x,Q)^2 + 2\, d_c\, d_{\max} . \]

Summing this inequality over all x ∈ T(c), and then over all c ∈ C, implies

\[ w_c\, d(c,Q)^2 \le w_c\, d_c^2 + \sum_{x \in T(c)} d(x,Q)^2 + 2\, w_c\, d_c\, d_{\max}, \]
\[ \sum_{c \in C} w_c\, d(c,Q)^2 \le \sum_{c \in C} w_c\, d_c^2 + \sum_{c \in C} \sum_{x \in T(c)} d(x,Q)^2 + 2 \sum_{c \in C} w_c\, d_c\, d_{\max} . \]

Let

\[ \Delta = \sum_{c \in C} w_c\, d_c \,( d_c + 2\, d_{\max} ) \quad \text{and choose} \quad
   \varepsilon = \frac{\Delta}{\phi_X(Q)} = \frac{\sum_{c \in C} w_c\, d_c\, ( d_c + 2\, d_{\max} )}{\phi_X(Q)} . \tag{3.6} \]

Then, combining with (3.4) and (3.5), we have

\[ \lvert \phi_X(Q) - \phi_C(Q) \rvert \le \Delta = \varepsilon\, \phi_X(Q) . \]

To finish the proof, we also need to show that the value of this ε approaches zero when the size of the coreset increases.

Similarly to the proof of Theorem 2 for the case of k-median clustering, denote |X| = n and |C| = m; we have \( \sum_{c \in C} w_c = n - m \). Since w_i > 0 for all i = 1, 2, ..., m, applying the Cauchy-Schwarz inequality bounds ε by a quantity that vanishes as m approaches n (when every point joins the coreset, all the distances d_c tend to zero), so \( \lim_{m \to n} \varepsilon = 0 \).

So, the subset C obtained from FFT on the data X is a (k,ε)-coreset of X for k-means clustering.

From Theorem 2 for k-median and Theorem 3 for k-means clustering, we conclude this section with a combined theorem as follows.

Theorem 4 (FFT-based Coresets for both k-Median and k-Means Clustering). The sample obtained by applying the FFT algorithm on a data set X is a (k,ε)-coreset of X for both k-median and k-means clustering.

ProTraS algorithm and limitations

ProTraS algorithm

In 2017, Ros and Guillaume proposed DENDIS [56] and DIDES [57], which are iterative algorithms based on a hybridization of distance and density concepts. They differ in the priority given to distance or density, and in the stopping criterion defined accordingly. However, both have drawbacks. In 2018, building on the FFT algorithm and the good points of DENDIS and DIDES, Ros and Guillaume proposed a new algorithm named ProTraS [58] that is both easy to tune and scalable. The ProTraS algorithm is based on a sampling cost computed from the within-group distance and the representativeness of each sample item. The algorithm is designed to produce a (k,ε)-coreset and uses the approximation level ε as the stopping criterion. It has since been used with good results in research such as [62] by Le Hong Trang et al.

The original ProTraS is described in Algorithm 4.

Drawbacks in ProTraS

ProTraS is a good method to build a coreset, but it still has some drawbacks. In Algorithm 4, the sampling cost formulation at line 18 and the stopping criterion at line 24 lead to problems on real datasets, such as the inability to distinguish data sets with the same shape but different scale. Figure 3.1 shows an example of two datasets: the original R15 1 and a larger version, R15 scaled by a factor of 10. Having the same shape and the same number of points, these two datasets should yield two coresets of the same size. In fact, for ε = 0.2, Algorithm 4 creates a coreset of size 128 for the original R15 and a coreset of size 412 for the scaled R15. The error comes from the cost function of Algorithm 4. According to line 18 of Algorithm 4, the cost function can be expressed as

\[ \mathrm{cost} = \sum_{y_k \in C} \frac{p_k}{n} = \sum_{y_k \in C} \frac{w_k\, d_k}{n} . \tag{3.7} \]

1 R15 is a sample dataset from https://cs.joensuu.fi/sipu/datasets

Algorithm 4 ProTraS [58] (fragment)

1: Select an initial pattern x_init ∈ T

12: Store d_max(y_k), x_max(y_k), where d_max(y_k) = d(x_max(y_k), y_k)

FIGURE 3.1: Original-R15 (small cluster) and scaling-R15

The ProTraS algorithm stops when the cost falls below ε.

Let ε > 0 and k ∈ N. Let X ⊂ R^d be a set of points with mean μ_X, and denote by p(x) a probability distribution on X.

Let C be the subset of X obtained by sampling |C| = m points from X, where each point x ∈ C has weight \( w_x^C = \frac{1}{m\, p(x)} \) and is sampled with probability p(x).

For a suitable absolute constant c, if m satisfies

\[ m \ge \frac{c}{\varepsilon^2} \left( d\, k \log k + \log\frac{1}{\delta} \right), \]

then, with probability at least 1 − δ, the set C is an (α,ε,k)-lightweight coreset of X for k-means clustering.
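The exact sampling distribution p(x) used by the thesis is not legible in this extract. The sketch below is therefore only a plausible rendering, assuming p(x) mixes a uniform term and a distance-to-the-mean term with weights (1 − α) and α, in the spirit of the standard lightweight coreset of Bachem et al. (which corresponds to α = 1/2); the weights w = 1/(m·p(x)) follow the statement above.

```python
import numpy as np

def alpha_lightweight_coreset(X, m, alpha=0.5, rng=None):
    """Sample an m-point weighted coreset of X.

    Assumed (hypothetical) sampling distribution, see the note above:
        p(x) = (1 - alpha) / n  +  alpha * d(x, mean)^2 / sum_x' d(x', mean)^2
    Each sampled point receives weight 1 / (m * p(x)).
    """
    rng = np.random.default_rng(rng)
    n = len(X)
    mu = X.mean(axis=0)
    d2 = ((X - mu) ** 2).sum(axis=1)
    p = (1 - alpha) / n + alpha * d2 / d2.sum()
    idx = rng.choice(n, size=m, replace=True, p=p)
    weights = 1.0 / (m * p[idx])
    return X[idx], weights
```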

Proof: We have to prove that the set C satisfies Definition 4 of the (α,ε,k)-lightweight coreset, i.e., we need to prove

\[ \lvert \phi_X(Q) - \phi_C(Q) \rvert \;\le\; \alpha\,\varepsilon\,\phi_X(Q) + (1-\alpha)\,\varepsilon\,\phi_X(\{\mu_X\}) . \tag{4.3} \]

From the definition of k-means clustering,

\[ \phi_X(Q) = \sum_{x \in X} d(x,Q)^2, \qquad
   \phi_X(\{\mu_X\}) = \sum_{x \in X} d(x,\mu_X)^2, \qquad
   \phi_C(Q) = \sum_{x \in C} w_x^C\, d(x,Q)^2 . \]

Then \( \frac{d(x,Q)^2}{H(Q)} \le g(x) \). Let G be the average of all g(x) for x ∈ X; we obtain

by applying Lemma 1, together with the relation among g(x), G, |X| and p(x),

\[ f_Q(x) = \frac{d(x,Q)^2}{\,\cdots\,} . \]

We apply Theorem 1 on the bounds on the sample complexity of learning with the following parameters.

Then there exist δ > 0 and a constant c such that, for

\[ m \ge \frac{c}{\varepsilon^2} \left( d\, k \log k + \log\frac{1}{\delta} \right), \]

it holds with probability at least 1 − δ that

\[ d_{1/k}\!\left( \mathbb{E}_{p}[f_Q],\ \frac{1}{|C|} \sum_{x \in C} f_Q(x) \right) \le \frac{\alpha(1-\alpha)\,\varepsilon}{12\,(2\alpha+1)}, \tag{4.5} \]

where \( d_v(a,b) = \frac{|a-b|}{a+b+v} \). Since k is the number of clusters, k ≥ 1 implies \( 0 \le \frac{1}{k} \le 1 \).

Then, combining with inequality (4.5), we obtain

\[ \frac{1}{|C|} \sum_{x \in C} f_Q^*(x) \;\le\; \frac{\alpha(1-\alpha)\,\varepsilon}{4\,(2\alpha+1)} . \]

We substitute \( f_Q^*(x) \), which carries the factor α(1−α), into this inequality and

multiply both sides by \( |X| \cdot H(Q) \), where \( H(Q) = \frac{\alpha}{|X|}\,\phi_X(Q) + \frac{1-\alpha}{|X|}\,\phi_X(\{\mu_X\}) \), recalling that \( w_x^C = \frac{1}{m\, p(x)} \). Then inequality (4.8) is equivalent to

Lemma 1. Let X ⊂ R^d be a set of points with mean μ_X. Denote

\[ H(Q) = \frac{\alpha}{|X|}\,\phi_X(Q) + \frac{1-\alpha}{|X|}\,\phi_X(\{\mu_X\}) . \]

For all x ∈ X and Q ⊂ R^d, it holds that \( \frac{d(x,Q)^2}{H(Q)} \le g(x) \).

Proof: Reminder 1 (Cauchy-Schwarz inequality): for a and b,

\[ (a+b)^2 \le 2a^2 + 2b^2 . \tag{4.9} \]

Reminder 2 (basic fraction inequality): for a, b, c, d > 0,

\[ \frac{a+b}{c+d} \le \frac{a}{c} + \frac{b}{d} . \tag{4.10} \]

By the triangle inequality, we have \( d(\mu_X,Q) \le d(x,\mu_X) + d(x,Q) \). Applying the Cauchy-Schwarz inequality (4.9),

\[ d(\mu_X,Q)^2 \le \big( d(x,\mu_X) + d(x,Q) \big)^2 \le 2\, d(x,\mu_X)^2 + 2\, d(x,Q)^2 . \]

Averaging over all x ∈ X, we obtain

\[ d(\mu_X,Q)^2 \le \frac{2}{|X|}\,\phi_X(\{\mu_X\}) + \frac{2}{|X|}\,\phi_X(Q) . \]

Similarly, by the triangle inequality and the Cauchy-Schwarz inequality (4.9),

\[ d(x,Q) \le d(x,\mu_X) + d(\mu_X,Q), \]
\[ d(x,Q)^2 \le 2\, d(x,\mu_X)^2 + 2\, d(\mu_X,Q)^2 \le 2\, d(x,\mu_X)^2 + \frac{4}{|X|}\,\phi_X(\{\mu_X\}) + \frac{4}{|X|}\,\phi_X(Q) . \tag{4.12} \]

Dividing both sides of (4.12) by H(Q), we obtain

\[ \frac{d(x,Q)^2}{H(Q)} \;\le\; \frac{ 2\, d(x,\mu_X)^2 + \frac{4}{|X|}\,\phi_X(\{\mu_X\}) + \frac{4}{|X|}\,\phi_X(Q) }
        { \frac{\alpha}{|X|}\,\phi_X(Q) + \frac{1-\alpha}{|X|}\,\phi_X(\{\mu_X\}) } . \]

Applying the basic fraction inequality (4.10) to the right-hand side gives the claimed bound on \( \frac{d(x,Q)^2}{H(Q)} \).

In this chapter, we use the α-lightweight coreset model proposed in Chapter 4, combined with Apache Spark, the framework for big data processing, to solve the main problem of this thesis: clustering large datasets. This chapter includes the following.

• We describe the process of clustering large datasets using the α-lightweight coreset from Chapter 4 and the built-in library of Apache Spark. This process also includes a data generalization step that clusters the large data from the clustering solution obtained on its subset.

• We run experiments with the above method and test how well it performs compared to clustering directly on the original dataset.

• We use the Adjusted Rand Index to evaluate the comparison and give a discussion at the end of this chapter.


Processing Method

Data Generalization

When the dataset is big enough, k-means clustering, even with the help of a big data processing framework such as Apache Hadoop or Spark, will still be very slow. Therefore, instead of solving the problem directly on the original dataset, we find the solution on its coresets, which are proved to be representative of the full dataset. Since the size of a coreset is much smaller than that of the original dataset, k-means clustering becomes much faster and easier. Then, from the solutions on these coresets, we generalize to the full dataset and retrieve the final answer on the original dataset.

For the data generalization process, we apply a very intuitive and easy-to-apply method: the cluster index of any point not belonging to the coreset is the same as that of its nearest point in the coreset. Moreover, since we proved in Chapter 4 that the (α,ε,k)-lightweight coreset can be used as a "traditional" coreset, the data generalization method can be stated as follows.

Given a dataset X ⊂ R^d and C, the (α,ε,k)-lightweight coreset of X.

Then, for each x* ∈ X \ C, the cluster label of x* is the same as the cluster label of c*, where c* ∈ C and d(x*, c*) = min_{c∈C} d(x*, c).
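A minimal sketch of this nearest-coreset-point generalization step, assuming the coreset points and their cluster labels are held in NumPy arrays (the function and variable names are ours, not the thesis's):

```python
import numpy as np

def generalize_labels(X, coreset_points, coreset_labels):
    """Assign every point of X the label of its nearest coreset point."""
    labels = np.empty(len(X), dtype=coreset_labels.dtype)
    for i, x in enumerate(X):
        nearest = np.linalg.norm(coreset_points - x, axis=1).argmin()
        labels[i] = coreset_labels[nearest]
    return labels
```

In practice a spatial index (e.g., a KD-tree) would replace the linear scan, but the rule applied is exactly the one stated above.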

Built-in k-Means clustering in Spark

(This section is based on the original article on the Spark website 1.) k-means is one of the most commonly used clustering algorithms; it partitions the data points into a predefined number of clusters. The spark.mllib implementation includes a parallelized variant of the k-means++ method called k-means||.

The implementation in spark.mllib has the following parameters.

• k is the number of desired clusters. Note that it is possible for fewer than k clusters to be returned, for example, if there are fewer than k distinct points to cluster.

• maxIterations is the maximum number of iterations to run.

• initializationMode specifies either random initialization or initialization via k-means||.

1 https://spark.apache.org/docs/latest/mllib-clustering.html


• initializationSteps determines the number of steps in the k-means|| algorithm.

• epsilon determines the distance threshold within which we consider k-means to have converged.

• initialModel is an optional set of cluster centers used for initialization. If this parameter is supplied, only one run is performed.

In general, it is quite easy to apply k-means in Spark; a program written in Python needs just a few lines of code, as follows.

LISTING 5.1: Python code for k-means in Spark
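The body of Listing 5.1 is not reproduced in this copy; the following is a hedged sketch, closely following the spark.mllib RDD-based clustering guide cited above. The input path and the parameter values are placeholders, not the thesis's actual settings.

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="CoresetKMeans")

# Load the (coreset) points as an RDD of vectors; the path is a placeholder.
data = sc.textFile("hdfs:///path/to/coreset.txt")
points = data.map(lambda line: [float(v) for v in line.split()])

# Train the built-in k-means (k-means|| initialization by default).
model = KMeans.train(points, k=100, maxIterations=20,
                     initializationMode="k-means||", epsilon=1e-4)

# Predict the cluster label of every point in the sample.
labels = model.predict(points).collect()

sc.stop()
```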

Realistic Method

With k-means in Spark and Algorithm 7 for constructing an (α,ε,k)-lightweight coreset, clustering a large-scale dataset is executed step by step as follows.

1. Step 1: Apply Algorithm 7 to generate an (α,ε,k)-lightweight coreset of the original data set.

2. Step 2: Use k-means in Spark to cluster the sample obtained from Step 1. After this step, we have the labels of all data points in the sample.

3. Step 3: Based on the results from Step 2, and by using the method described in Section 5.1.1, we generalize to the full data set to get all data labels. At this point, all data points are assigned cluster labels and the clustering process is finished.


Experiments

Experimental Method

To evaluate how well this clustering process works, we compare several approaches against the results of k-means in Apache Spark applied directly to the original data. The labels and running time obtained on the full data are used as the baseline for the other methods. In this thesis, we use three different approaches as follows.

• Method 1 (denoted as Uniform): Uniform Sampling combined with k-means in Spark, a naive approach to coreset construction based on uniform sub-sampling of the data. Uniform Sampling replaces Algorithm 7 in Step 1 of Section 5.1.3.

• Method 2 (denoted as Lightweight): the traditional Lightweight Coreset, Algorithm 6, combined with k-means in Spark.

• Method 3 (denoted as Alpha34): the α-Lightweight Coreset combined with k-means in Spark. For this experiment, we choose α = 3/4, with the probability distribution defined in Chapter 4.

To compare the quality of the predicted labels from Methods 1-3 with the labels obtained on the full data set, we use the Adjusted Rand Index (ARI) proposed by Hubert and Arabie [32], an adjusted version of the Rand Index by William Rand [52].
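As a small illustration of this evaluation step (assuming scikit-learn is available; the library is not named in the thesis), the ARI between the baseline labels and a method's predicted labels can be computed as follows:

```python
from sklearn.metrics import adjusted_rand_score

baseline_labels = [0, 0, 1, 1, 2, 2]    # labels from clustering the full data
predicted_labels = [1, 1, 0, 0, 2, 2]   # labels from a coreset-based method

# ARI is invariant to label permutations; identical partitions give 1.0.
print(adjusted_rand_score(baseline_labels, predicted_labels))  # 1.0
```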

Since all the methods are based on sampling, in order to make the experiments more precise, each configuration of each method is run 20 times and the average value is used to represent the result of each experiment.

All experiments were implemented in Python and run on an Intel Core i7 machine with 8 × 2.8 GHz processors and 16 GB of memory.

Experimental Data Sets

We use 5 large datasets with multiple dimensions from the data clustering repository of the School of Computing of the University of Eastern Finland 2 and from the GitHub clustering benchmark 3. These datasets are described in Table 5.1.

2 https://cs.joensuu.fi/sipu/datasets

3 https://github.com/deric/clustering-benchmark


TABLE 5.1: Data sets for Experiments

Data Name    Size    No. of Clusters    No. of Dimensions

Results

The relationship between the ARI and the coreset size is shown in Figures 5.1, 5.2, 5.3, 5.4 and 5.5 for each dataset described in Table 5.1. In all results, as the coreset size increases, the ARI also reaches higher values. This means that all algorithms create better coresets when more data points are chosen. Tables 5.2, 5.3, 5.4, 5.5 and 5.6 show the full experimental results, including:

• Coreset Runtime: the time to execute the coreset construction algorithm;

• Spark Runtime: the time to cluster the samples with k-means in Spark;

• DataGen Runtime: the time for the data generalization process;

• Total Runtime: the total time of the clustering process, from finding the coreset to clustering the original dataset.

The results can be briefly summarized as follows.

• In most cases, Uniform Sampling yields the lowest ARI, which means Uniform Sampling is the worst coreset construction. This is understandable, since Uniform Sampling is the simplest and most naive method.

• The Lightweight Coreset and the (α = 3/4)-lightweight coreset create coresets with nearly equal ARI in most cases. These results are better than Uniform Sampling.

• All three methods are extremely fast, with nearly the same runtime.

• During the process of clustering a large-scale dataset, the Coreset Runtime is very short (less than 0.1 s), the DataGen process needs about 2-3 s, and the most expensive part is the clustering time in Spark.

• In Figures 5.1, 5.2, 5.3, 5.4 and 5.5, the big red points represent the runtime of clustering the full dataset with Spark. This shows that clustering via coresets and Spark is much faster than solving the problem directly.


FIGURE 5.1: ARI and Runtime of Birch1 in relation to full data

FIGURE 5.2: ARI and Runtime of Birch2 in relation to full data


FIGURE 5.3: ARI and Runtime of Birch3 in relation to full data

FIGURE 5.4: ARI and Runtime of ConfLongDemo in relation to full data

FIGURE 5.5: ARI and Runtime of KDDCupBio in relation to full data


TABLE 5.2: Experimental Results for dataset Birch1

Algorithm    Coreset Size    Coreset Runtime    Spark Runtime    DataGen Runtime    Total Runtime    ARI


TABLE 5.3: Experimental Results for dataset Birch2

Algorithm    Coreset Size    Coreset Runtime    Spark Runtime    DataGen Runtime    Total Runtime    ARI


TABLE 5.4: Experimental Results for dataset Birch3

Algorithm    Coreset Size    Coreset Runtime    Spark Runtime    DataGen Runtime    Total Runtime    ARI


TABLE 5.5: Experimental Results for dataset ConfLongDemo

Algorithm    Coreset Size    Coreset Runtime    Spark Runtime    DataGen Runtime    Total Runtime    ARI


TABLE 5.6: Experimental Results for dataset KDDCup Bio

Algorithm      Coreset Size    Coreset Runtime    Spark Runtime    DataGen Runtime    Total Runtime    ARI
Alpha34        16000           0.0178             9.6949           48.0922            57.8050          0.3195
Lightweight    16000           0.3706             9.2364           47.5339            57.1409          0.3101
Uniform        16000           0.3806             9.0320           47.7005            57.1130          0.2993
Alpha34        32000           0.0306             16.0353          44.5965            60.6624          0.3314
Lightweight    32000           0.4077             16.1055          44.6503            61.1635          0.3309
Uniform        32000           0.4074             15.0087          44.6620            60.0780          0.3111

In this thesis, we solve the problem of clustering large-scale datasets. The approaches we use are based on data sampling via coresets and on Apache Spark. While investigating data sampling via coresets, we propose two coreset constructions for k-means clustering. The whole thesis can be summarized as follows.

In Chapter 2, we introduce and give an overview of the background and related works. While k-means clustering is a very classical topic in machine learning, the "coreset" seems to be a more fascinating one. The idea of a coreset is that, instead of solving problems on big data, which costs a lot of computation, one can find a subset such that the solutions on this subset approximate the solutions on the original dataset. We also provide a brief introduction to Apache Spark in this chapter, together with some definitions and theorems that are useful and related to this thesis.

In Chapter 3, we prove that the Farthest-First-Traversal algorithm itself is a very good method for finding coresets for both the k-median and k-means problems. Based on this, we propose a novel coreset construction algorithm for k-means and k-median that depends on the coreset size. The disadvantage of this proposed algorithm, as of all other FFT-related algorithms, compared with sampling methods is its speed and runtime. This is obvious, since all FFT-related algorithms need to visit every point in the full data set.

However, the result of this algorithm is not only a coreset of the full data with high quality; it also provides very useful characteristics of the data: the maximum distance and the number of elements of each representative in the coreset. This information can be used for further purposes in research and applications that need to estimate the distribution, density or structure of the original data set. Moreover, unlike other sampling-based methods, which create different samples each time the process is re-run, the coreset from the proposed algorithm is unique and unchanged; this means we only need to run it once and the received subset is truly a coreset of the full data for k-means and k-median clustering.

In Chapter 4, based on prior work about the Lightweight Coreset [12], we propose a general lightweight coreset, named the α-lightweight coreset, which allows both multiplicative and additive errors. Unlike the traditional lightweight coreset, where the multiplicative and additive errors are treated the same and carry equal weight, the α-lightweight coreset allows adjusting the proportion between these two errors: α = 1/2 gives the traditional lightweight coreset, a larger α means more focus on the multiplicative error, and conversely, a smaller α means more concentration on the additive error.

In this chapter, we also propose and prove the general algorithm for the α-lightweight coreset construction. Since this approach is a sampling-based method, the algorithm runs extremely fast and can be used in practice.

In Chapter 5, we apply the method proposed in Chapter 4 to solve the main problem of this thesis: clustering large-scale datasets. To solve it properly and faster, we apply a framework for big data, Apache Spark, which makes the problem run smoothly and quickly. Fortunately, Spark is a powerful framework with various libraries and built-in functions for machine learning and data mining. With Spark, k-means++ clustering is easy to implement and deploy.

To evaluate the method, we run experiments on several large-scale sample data sets. The results show that data sampling through uniform sampling should be replaced by the α-lightweight coreset: with nearly equal running time, the results from both the traditional lightweight coreset and the α-lightweight coreset (in these experiments, we chose α = 3/4) outperform the results from random uniform sampling.

Overall, in this thesis, we propose and prove two methods of coreset construction for k-means clustering. The FFT-based coreset construction is very slow, but in terms of accuracy it produces one of the most relevant subsets of the original data; it is suited to research and scientific use and can serve as a baseline against which future proposals are compared. On the other hand, our second coreset construction, the α-lightweight coreset, is a sampling-based method: it can find a coreset very quickly, but its accuracy is not as high as that of the FFT-based coreset. Nevertheless, both methods are clearly better than random uniform sampling, a very naive and widely used method.

Finally, each method mentioned in this thesis has its own advantages and disadvantages. The options 'slow but more accurate' and 'fast but less accurate' should be weighed before applying any of these algorithms in practice.
