Master's Thesis in Computer Science: Clustering Large Datasets based on Data Sampling and Spark


Major: Computer Science. Code: 8480101

Ho Chi Minh City, January 2020

Supervisor: ASSOC PROF DR DANG TRAN KHANH

Co-supervisor: DR LE HONG TRANG

Examiner Board

• Examiner 1: DR PHAN TRONG NHAN

• Examiner 2: ASSOC PROF DR HUYNH TRUNG HIEU

This thesis was reviewed and defended at University of Technology, VNU-HCMC on December 30, 2019.

The members of the Thesis Defense Committee are:

1. ASSOC PROF DR NGUYEN THANH BINH

2. DR NGUYEN AN KHUONG

Confirmation from the President of the Thesis Defense Committee and the Dean of the Faculty of Computer Science and Engineering after the thesis has been revised (if any)

President of Thesis Defense Committee: Assoc. Prof. Dr. Nguyen Thanh Binh (signed)

Dean of Faculty of Computer Science and Engineering: Assoc. Prof. Dr. Pham Tran Vu (signed)

SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness

Student: NGUYEN LE HOANG. Student ID: 1770472

Date of Birth: March 12, 1988. Place of Birth: Ho Chi Minh City

Major: Computer Science. Code: 8480101

• Study and research clustering problems, data sampling methods, data generalization and the Apache Spark framework for big data.

• Based on data sampling, we propose and prove algorithms for coreset constructions in order to find the most suitable subsets, which can both reduce the computational cost and serve as representative subsets of the full original datasets in clustering problems.

• Conduct experiments and evaluate the proposed methods.

III START DATE: January 4, 2019

IV END DATE: December 7, 2019


Ho Chi Minh City, December 7, 2019

Supervisor: Assoc. Prof. Dr. Dang Tran Khanh (signed)

Dean of Faculty of Computer Science and Engineering: Assoc. Prof. Dr. Pham Tran Vu (signed)

I am very grateful to my supervisor, Assoc. Prof. Dr. DANG TRAN KHANH, and my co-supervisor, Dr. LE HONG TRANG, for the guidance, inspiration and constructive suggestions that helped me in the preparation of this graduation thesis.

I would like to thank my family very much, especially my parents, who have always been by my side and supported me in whatever I do.

Ho Chi Minh City, December 7, 2019.

With the development of technology, data has become one of the most essential factors in the 21st century. However, the explosion of the Internet has turned these data into big data that are very hard to handle and process. In this thesis, we propose solutions for clustering large-scale data, a vital problem in machine learning and a widely applied one in industry.

To solve this problem, we use data sampling methods based on the concept of coresets: subsets of the data that must be small enough to reduce computational complexity but must keep all representative characteristics of the original data. In other words, we can scale big datasets down to much smaller ones that can be clustered efficiently, while the results can be considered solutions for the whole original datasets. In addition, to make the solving process for large-scale datasets much faster, we apply the open framework for big data, Apache Spark.

In the scope of this thesis, we propose and prove two methods for coreset constructions for k-means clustering. We also conduct experiments and evaluate the proposed algorithms to assess the advantages and disadvantages of each one. This thesis is divided into four parts as follows:

• Chapter 1 and Chapter 2 give the introduction and an overview of coresets and related background. These chapters also provide a brief description of Apache Spark, along with some definitions and theorems that are used in this thesis.

• In Chapter 3, we propose and prove the first coreset construction, which is based on the Farthest-First-Traversal algorithm and the ProTraS algorithm [58], for k-median and k-means clustering. We also evaluate this method at the end of the chapter.

• In Chapter 4, based on prior work on the lightweight coreset [12], we propose and prove the correctness of the second coreset construction, the α-lightweight coreset for k-means clustering, a general form of the lightweight coreset with an adjustable parameter.

• In Chapter 5, we apply the α-lightweight coreset and the data generalization method to solve the overall problem of this thesis: clustering large-scale datasets. We also apply Apache Spark to solve the problem faster. To evaluate the correctness, we run experiments on several large-scale benchmark data samples.

Summary

With the development of technology, data has become one of the most important factors of the 21st century. However, the explosion of the Internet has turned these data into extremely large data, which makes processing and exploiting them very difficult. In this thesis, we propose a solution to the clustering problem for large-scale data, which is regarded as a very important problem in machine learning and is also widely applied in industry.

To solve the problem, we use a sampling method based on the concept of a coreset, defined as a subset that satisfies two conditions: it must be small enough to reduce computational complexity, yet it must carry all the representative characteristics of the original set. In other words, we can now shrink a large data set into a smaller one that can be clustered efficiently, while the result obtained on the subset is also regarded as the result for the whole original set. In addition, to make processing of large data sets faster, we also use the big data processing platform Apache Spark.

Within the scope of this thesis, we propose and prove two methods for constructing coresets for the k-means clustering problem. We also run experiments and evaluate the proposed algorithms to identify the advantages and disadvantages of each method. The thesis is divided into four main parts as follows:

• Chapters 1 and 2 introduce the concept of coresets and the related background. In these chapters, we also briefly summarize Apache Spark and the theorems used in the thesis.

• In Chapter 3, we propose and prove the first method for constructing coresets, based on the Farthest-First-Traversal algorithm and the ProTraS algorithm [58], for the k-median and k-means clustering problems. We also evaluate this algorithm at the end of the chapter.

• In Chapter 4, based on prior work on the lightweight coreset [12], we propose and prove the correctness of the second coreset construction method, the α-lightweight coreset, for the k-means clustering problem. It can be seen as a general form of the lightweight coreset with an adjustable coefficient.

• In Chapter 5, we use the α-lightweight coreset method together with the data generalization method to solve the overall problem: clustering large-scale data sets. We also use the Apache Spark platform so that the problem is solved faster. To evaluate accuracy, we run experiments and compare the results on large-scale benchmark samples.


Declaration of Authorship

I, NGUYEN LE HOANG, declare that this thesis, Clustering Large Datasets based on Data Sampling and Spark, and the work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a Master of Science at University of Technology, VNU-HCMC.

• No part of this thesis has previously been submitted for any degree or any other qualification at this University or any other institution.

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.

• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.


1.5 Publications relevant to this Thesis

2 Background and Related Works

2.1 k-Means and k-Means++ Clustering

2.1.1 k-Means Clustering

2.1.2 k-Means++ Clustering

2.2 Coresets

2.2.1 Definition

2.2.2 Some Coreset Constructions

2.3 Apache Spark

2.3.1 What is Apache Spark?

2.3.2 Why Apache Spark?

2.4 Bounds on Sample Complexity of Learning

3 FFT-based Coresets

3.1 Farthest-First-Traversal Algorithm

3.2 FFT-based Coresets for k-Median and k-Means Clustering

3.3 ProTraS algorithm and limitations

3.5.2 Results and Discussion

4 General Lightweight Coresets

4.1 Lightweight Coreset

4.1.1 Definition

4.1.2 Algorithm

4.2 The α-Lightweight Coreset

4.2.1 Definition

List of Figures

1.1 Big Data properties

1.2 Machine Learning: Supervised vs Unsupervised

2.1 Spark Logo - https://spark.apache.org

2.2 The Components of Spark

3.1 Original-R15 (small cluster) and scaling-R15

3.2 Some data sets for experiments

3.3 ARI in relation to subsample size for datasets D1 - D8

3.4 ARI in relation to subsample size for datasets D9 - D16

5.1 ARI and Runtime of Birch1 in relation to full data

5.2 ARI and Runtime of Birch2 in relation to full data

5.3 ARI and Runtime of Birch3 in relation to full data

5.4 ARI and Runtime of ConfLongDemo in relation to full data

5.5 ARI and Runtime of KDDCupBio in relation to full data

List of Tables

3.1 Data sets for Experiments

3.2 Experimental Results - Adjusted Rand Index Comparison

3.3 Experimental Results - Time Comparison

5.1 Data sets for Experiments

5.2 Experimental Results for dataset Birch1

5.3 Experimental Results for dataset Birch2

5.4 Experimental Results for dataset Birch3

5.5 Experimental Results for dataset ConfLongDemo

5.6 Experimental Results for dataset KDDCup Bio

List of Algorithms

1 k-Means Clustering - Lloyd's Algorithm [42]

2 D² Sampling for k-Means++ [6]

Chapter 1 Introduction

Consequently, researchers now face a new and difficult situation: solving problems for data that are large in volume, variety, velocity, veracity and value (Figure 1.1)². Understanding and explaining these data in order to solve real-world problems is very hard for humans without help from machines. That is why machine learning plays an important role in this decade as well as in the future. By applying machine learning combined with artificial intelligence (AI), scientists can create systems that automatically learn and improve from experience without being explicitly programmed.

FIGURE 1.1: Big Data properties

For each specific purpose, machine learning is divided into two categories: supervised and unsupervised. Supervised learning is a kind of training model in which the training sets come with provided target labels; the system learns from these training sets and is then used to predict or classify future instances. In contrast, unsupervised machine learning approaches extract information from data sets where such explicit labels are not available. The importance of this field is expected to grow, as it is estimated that 85% of global data in 2025 will be unlabeled [55]. In particular, data clustering, the task of grouping similar objects together into clusters, seems to be a fruitful approach for analyzing such data [13]. Applications are broad and include fields such as computer vision [61], information retrieval [35], computational geometry [36] and recommendation systems [41]. Furthermore, clustering techniques can also be used to learn data representations that are used in downstream prediction tasks such as classification and regression [16]. The machine learning categories are summarized in Figure 1.2³.

FIGURE 1.2: Machine Learning: Supervised vs Unsupervised

In general, clustering is one of the most popular techniques in machine learning and is widely used in large-scale data analysis. The goal of clustering is to partition a set of objects into groups such that objects in the same group are similar to each other and objects in different groups are dissimilar. Because of its importance and practical applications, this technique has been studied extensively and many algorithms exist. For example, we can use BIRCH [68] or CURE [27], which belong to hierarchical clustering, also known as connectivity-based clustering, to solve problems based on the idea that objects are more related to nearby objects than to objects farther away. If the problems are closely related to statistics, we can use distribution-based clustering such as the Gaussian Mixture Model (GMM) [66] or DBCLASD [53]. For problems suited to density-based clustering, in which data lying in a high-density region of the data space are considered to belong to the same cluster [38], we can use Mean-shift [17], DBSCAN [20], the most well-known density-based clustering algorithm, or OPTICS [4], an improvement of DBSCAN. One of the most common approaches to clustering is based on partitioning: the basic idea is to assign the cluster centers to some random objects, and the actual centers are revealed through several iterations until a stopping condition is satisfied. Some common algorithms of this kind are k-means [45], k-medoids [49], CLARA [37] and CLARANS [48]. For more detail, we refer readers to the surveys of clustering algorithms by R. Xu (2005) [65] and by D. Xu (2015) [64].

In fact, there are many clustering algorithms and improvements that can be used in applications, each with its own benefits and drawbacks. Choosing a suitable clustering algorithm is an important and difficult problem that users must deal with when they face specific configurations and settings. Some research addresses this question, such as [14], [39], [67], which discuss the quality of clusters in certain circumstances. However, in the scope of this thesis, we do not cover this issue or the variety of clustering algorithms. Instead, we fix one of the most popular clustering algorithms, k-means clustering. We use this algorithm throughout this report and investigate methods that can handle k-means clustering for large-scale data sets.

¹ One zettabyte is equivalent to one trillion gigabytes.

² Image source: https://www.edureka.co/blog/what-is-big-data/

³ Image source: https://towardsdatascience.com/supervised-vs-unsupervised-learning

Moreover, designing a complete solution that can cluster and analyze large-scale data is still a challenge for data scientists. Many methods have been proposed over the years to deal with machine learning for big data. One of the simplest ways relies on infrastructure and hardware: the more powerful and modern the machines we have, the larger and more complicated the data we can handle. This solution is quite easy but costs a lot of money, and few people can afford it. Another option is to find suitable algorithms that reduce the computational complexity for inputs that may contain millions or billions of data points. Approaches include data compression [69], [1], data deduplication [19], dimension reduction [25], [60], [51], etc. For a survey of these approaches, readers can find more useful information in [54]. Among big data reduction methods, data sampling is one of the popular options and is closely related to machine learning and data mining. The key idea of data sampling is that, instead of solving a problem on the full large-scale data, we find the answer for a subset of the data; this result is then used as the baseline for finding the actual solution for the original data set. This leads to a new difficulty: finding a subset that is small enough to effectively reduce computational complexity but keeps all representative characteristics of the original data. This difficulty is the motivation for this research and this thesis.

1.2 The Scope of Research

In this thesis, we propose a solution to the problem of clustering large datasets. We use the word "large" to indicate data that are "big" in volume, not the full set of big data characteristics described in the previous section with the 5 V's (Volume, Variety, Value, Velocity and Veracity) (Figure 1.1). However, Volume, in other words the data size, is one of the most non-trivial difficulties that most researchers face when solving a big data problem.

Regarding the clustering algorithm, even though there are many studies and methods, we consider a fixed clustering problem: the prototypical k-means clustering. We select it because k-means is the most well-known clustering algorithm and is widely applied in practice, in industry and in scientific research.

While there is a wealth of prior work on clustering small and medium-sized data sets, there are unique challenges in the massive data setting. Traditional algorithms have a super-linear computational complexity in the size of the data set, making them infeasible when there are many data points. In the scope of this thesis, we apply data sampling to deal with the massive data setting. A very basic form of this method is random, or uniform, sampling. While uniform sampling is feasible for some problems, there are instances where it performs very poorly due to the naive nature of the sampling strategy. For example, real-world data are often imbalanced and contain clusters of different sizes. As a consequence, a small fraction of data points can be very important and have an enormous impact on the objective function. Such imbalanced data sets are problematic for methods based on uniform sampling since, with high probability, these methods only sample points in large clusters and the information in small clusters is discarded [13].
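To make the weakness of uniform sampling concrete, the following minimal sketch computes the probability that a uniform sample contains no point from a small cluster. The numbers (one million points, a cluster of 100 points, a sample of 1,000 points) are illustrative assumptions and not figures from the experiments in this thesis; sampling with replacement is also assumed for simplicity, and the function name is hypothetical.

```python
def prob_cluster_missed(n_total: int, n_cluster: int, sample_size: int) -> float:
    """Probability that a uniform sample (drawn with replacement) of
    `sample_size` points contains no point of a cluster of `n_cluster` points."""
    if n_total <= 0 or not 0 <= n_cluster <= n_total or sample_size < 0:
        raise ValueError("invalid sizes")
    # Each draw misses the cluster with probability 1 - n_cluster / n_total,
    # and the draws are independent.
    return (1.0 - n_cluster / n_total) ** sample_size


# Assumed example: a 100-point cluster inside one million points, 1,000 draws.
p = prob_cluster_missed(1_000_000, 100, 1_000)
print(f"probability that the small cluster is missed: {p:.3f}")
```

Under these assumed sizes the small cluster is missed in roughly nine out of ten samples, which is the behavior that importance-based sampling schemes such as coresets are meant to avoid.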

The idea of finding a relevant subset of the original data to decrease the computational cost leads to the concept of a coreset, which was first applied in geometric approximation by Agarwal et al. in 2004 [2], [3]. The problem of coreset constructions for k-median and k-means clustering was then stated and investigated by Har-Peled et al. in [28], [29]. Since then, many coreset construction algorithms have been proposed for a wide variety of clustering problems. In this thesis,

1.3 Research Contributions

In this thesis, we solve the problem of clustering large data sets by using data sampling methods and the big data framework Apache Spark. Since Spark is a technical framework maintained by Apache, we neither change its configuration nor improve any part of Spark itself. Instead, our research focuses on data sampling methods that find the most relevant subsets, called coresets, of a full data set. A coreset, in other words, can be described as a compact subset such that models trained on it provide a good fit with models trained on the full data set. By using coresets, we can scale a big data set down to a tiny one in order to reduce the computational cost of a machine learning problem. Through in-depth research on coresets, our thesis makes the following scientific and practical contributions.

• Based on the lightweight coreset of Bachem, Lucic and Krause [12], we propose a general model, the α-lightweight coreset, and then a general lightweight coreset construction that is very fast and practical. This is proved in Chapter 4.

1.3.2 Practical Significance

• Due to its high runtime, our proposed FFT-based coreset construction is very hard to use in practice. However, experiments against some state-of-the-art coreset constructions show that this algorithm is able to produce one of the best sample coresets that can be used in experiments.

• Our proposed α-lightweight coreset model is a generalization of the traditional lightweight coreset. This proposal can be used in various practical cases, especially in situations that need to focus on the multiplicative errors or additive errors of the samples.

1.4 Organization of Thesis

The remainder of this thesis is organized as follows.

• Chapter 2. This chapter is an overview of prior work related to this thesis, including the k-means and k-means++ algorithms, the definition of coresets, a brief description of Apache Spark and a theorem about bounds on sample complexity.

• Chapter 3. We introduce the farthest-first-traversal algorithm as well as ProTraS for finding coresets. Then we propose an FFT-based algorithm for coreset construction.

• Chapter 4. This chapter covers the lightweight coreset and our general lightweight coreset model. We also prove the correctness of this model and propose a general algorithm for the α-lightweight coreset.

• Chapter 5. This chapter presents the experiments on clustering large datasets. We use the α-lightweight coreset for the sampling process and k-means++ for clustering on the Apache Spark framework.

• Chapter 6. This chapter concludes the thesis.

1.5 Publications relevant to this Thesis

• Nguyen Le Hoang, Tran Khanh Dang and Le Hong Trang. A Comparative Study of the Use of Coresets for Clustering Large Datasets. pp. 45-55. LNCS 11814. Future Data and Security Engineering, FDSE 2019.

• Le Hong Trang, Nguyen Le Hoang and Tran Khanh Dang. A Farthest-First-Traversal-based Sampling Algorithm for k-clustering. International Conference on Ubiquitous Information Management and Communication, IMCOM 2020.

Chapter 2

Background and Related Works

In this chapter, we provide a short introduction to the background and prior work related to this thesis.

• k-Means and k-Means++ Clustering

• Coresets

• Apache Spark

• Bounds & Pseudo-dimension

2.1 k-Means and k-Means++ Clustering

The k-means clustering problem is one of the oldest and most important questions in machine learning. Given an integer k and a data set X ⊂ ℝ^d, the goal is to choose k centers so as to minimize the total squared distance between each point and its closest center. The k-means clustering problem can be stated as follows.

Let X ⊂ ℝ^d. The k-means clustering problem is to find a set Q ⊂ ℝ^d with |Q| = k such that the function φ_X(Q) is minimized, where

$$\phi_X(Q) = \sum_{x \in X} d(x, Q)^2 = \sum_{x \in X} \min_{q \in Q} \lVert x - q \rVert_2^2 .$$

Lloyd's algorithm is described in Algorithm 1.

Algorithm 1 k-Means Clustering - Lloyd's Algorithm [42]

Require: data set X, number of clusters k

Ensure: k separated clusters

1: Randomly initialize k centers C = {c_j}_{j=1}^{k} ∈ ℝ^{d×k}

2: while (not converged) do

The algorithm was later developed further by Inaba et al. [33], Matousek [47], Vega et al. [63], etc. However, one of the most significant improvements of k-means is k-means++ by Arthur and Vassilvitskii [6]. We give an overview of this algorithm in the next section.
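As a hedged illustration of Lloyd's iteration, the sketch below uses the standard textbook form of the two steps inside the loop (assign each point to its closest center, then move each center to the mean of its assigned points). These steps are an assumption based on the usual description of the algorithm and are not copied from Algorithm 1; the function names, the `max_iter` cap and the `seed` parameter are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(points, centers):
    """phi_X(Q): total squared distance from each point to its closest center."""
    return sum(min(squared_distance(x, c) for c in centers) for x in points)


def lloyd(points, k, max_iter=100, seed=0):
    """Standard Lloyd iteration with uniform random initialization."""
    rng = random.Random(seed)
    # Line 1 of Algorithm 1: pick k initial centers uniformly at random.
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: attach every point to its closest center.
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda i: squared_distance(x, centers[i]))
            clusters[j].append(x)
        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                mean = tuple(sum(m[t] for m in members) / len(members)
                             for t in range(dim))
                new_centers.append(mean)
            else:
                # An empty cluster keeps its previous center.
                new_centers.append(centers[j])
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return centers
```

Each iteration costs O(|X| · k · d) distance evaluations, which is why reducing |X| through sampling directly reduces the runtime of the clustering step.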


In Algorithm 1, the initial set of cluster centers (line 1) is based on random sampling, where k points are selected uniformly at random from the data set. This simple approach is fast and easy to implement. However, there are many natural examples for which the algorithm generates arbitrarily bad clusterings. This happens because of poor placement of the starting centers and, in particular, it can hold with high probability even if the centers are chosen uniformly at random from the data points [6].

To overcome this problem, Arthur and Vassilvitskii [6] propose the k-means++ algorithm, which uses adaptive seeding based on a technique called D²-sampling to create its initial seed set before running Lloyd's algorithm to convergence [8]. Given an existing set of centers S, the D²-sampling strategy, as the name suggests, samples each point x ∈ X with probability proportional to its squared distance to the selected centers, i.e.,

$$p(x \mid S) = \frac{d(x, S)^2}{\sum_{x' \in X} d(x', S)^2}$$

The D²-sampling is described in Algorithm 2.

Algorithm 2 D² Sampling for k-Means++ [6]

Require: data set X, number of clusters k

Ensure: initial set S used for k-means
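The sampling rule p(x | S) can be sketched in a few lines of Python. This is a minimal sketch of the commonly described k-means++ seeding procedure (first center uniform, each further center drawn with probability proportional to the squared distance to the current centers), not a transcription of Algorithm 2; the function name `d2_sampling` and the `seed` parameter are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d2_sampling(points, k, seed=0):
    """Choose k initial centers by D^2-sampling."""
    rng = random.Random(seed)
    # The first center is chosen uniformly at random.
    centers = [rng.choice(points)]
    # closest[i] holds d(points[i], S)^2 for the current center set S.
    closest = [squared_distance(x, centers[0]) for x in points]
    while len(centers) < k:
        if sum(closest) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            nxt = rng.choice(points)
        else:
            # Draw x with probability d(x, S)^2 / sum_x' d(x', S)^2.
            nxt = rng.choices(points, weights=closest, k=1)[0]
        centers.append(nxt)
        # Update the distance of every point to its nearest center.
        closest = [min(d, squared_distance(x, nxt))
                   for d, x in zip(closest, points)]
    return centers
```

Keeping the running array `closest` avoids recomputing distances to all earlier centers, so the whole seeding costs O(|X| · k · d).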

2.2 Coresets

In this thesis, we apply data sampling via coresets to deal with the massive data setting. In computational geometry, a coreset is a small set of points that approximates the shape of a larger point set, in the sense that applying some geometric measure to the two sets results in approximately equal numbers [Wikipedia]. In clustering terms, a coreset is a weighted subset of the data such that the quality of any clustering evaluated on the coreset closely approximates its quality on the full data set.

In most cases, it is not easy to find this most relevant subset. Consequently, attention has shifted to developing approximation algorithms. The goal now is to compute a (1 + ε)-approximation subset, for some 0 < ε < 1. The framework of coresets has recently emerged as a general approach to achieve this goal [3].

The definition of a coreset depends on the machine learning problem. For k-means clustering, the definition of coresets can be stated as follows.

Definition 1 (Coresets for k-Means Clustering) (S. Har-Peled and S. Mazumdar [28])

Let ε > 0. The weighted set C is a (k, ε)-coreset of X if for any Q ⊂ ℝ^d of cardinality at most k

$$|\phi_X(Q) - \phi_C(Q)| \le \varepsilon \, \phi_X(Q).$$

This is equivalent to

$$(1 - \varepsilon)\,\phi_X(Q) \le \phi_C(Q) \le (1 + \varepsilon)\,\phi_X(Q).$$

This is a strong theoretical guarantee, as the cost evaluated on the coreset, φ_C(Q), has to approximate the cost on the full data set, φ_X(Q), up to a 1 ± ε multiplicative factor uniformly for all possible sets of cluster centers. As a direct consequence, solving on the coreset yields provably competitive solutions when evaluated on the full data set [43]. More formally, Lucic et al. [43] showed that, if C is a coreset of X with ε ∈ (0, 1/3), then

$$\phi_X(Q^*_C) \le (1 + 3\varepsilon)\,\phi_X(Q^*_X)$$

where Q*_C denotes the optimal solution of k centers on C and Q*_X denotes the optimal solution on X. This means that the optimal solution on the coreset yields a (1 + 3ε)-approximation on the original data. As a result, we can solve the clustering problem on the coreset while retaining strong theoretical guarantees.
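The inequality in Definition 1 can be checked numerically for a given candidate solution Q. The sketch below is only an empirical check under stated assumptions: the definition requires the bound for every Q of cardinality at most k, whereas the code evaluates a single Q at a time, and the function names `weighted_cost` and `coreset_relative_error` are hypothetical.

```python
import math


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def weighted_cost(points, weights, centers):
    """phi_C(Q) for a weighted set C; with unit weights this is phi_X(Q)."""
    return sum(w * min(squared_distance(x, c) for c in centers)
               for x, w in zip(points, weights))


def coreset_relative_error(data, coreset, weights, centers):
    """|phi_X(Q) - phi_C(Q)| / phi_X(Q) for one candidate solution Q."""
    full = weighted_cost(data, [1.0] * len(data), centers)
    approx = weighted_cost(coreset, weights, centers)
    if full == 0:
        return 0.0 if approx == 0 else math.inf
    return abs(full - approx) / full


# Toy data: two duplicated points, so a weighted subset can be exact.
data = [(0.0,), (0.0,), (2.0,), (2.0,)]
good = coreset_relative_error(data, [(0.0,), (2.0,)], [2.0, 2.0], [(0.0,)])
bad = coreset_relative_error(data, [(0.0,)], [4.0], [(0.0,)])
print(good, bad)
```

In this toy case the first weighted subset reproduces the cost exactly, while the second one, which drops the points at 2.0, fails the bound for Q = {0.0} for any ε < 1. A single passing Q does not establish the coreset property; it can only falsify it.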


Many coreset construction algorithms have been proposed in recent years for k-means clustering problems. One of the first methods is based on exponential grids, by Har-Peled and Mazumdar [28], with an improved version by Har-Peled and Kushal [29]. The sampling-based approach to coreset construction was first used by Chen [15] and was investigated in depth by Feldman et al., with extensive research on coresets for k-means (Feldman, Monemizadeh and Sohler [21]), high-dimensional subspaces (Feldman, Monemizadeh, Sohler and Woodruff [22]), coresets for mixture models (Feldman, Faulkner and Krause [23]), PCA and projective clustering (Feldman, Schmidt and Sohler [24]), etc. Based on prior work, Lucic et al. constructed coresets for the estimation of Gaussian mixture models (Lucic, Faulkner and Krause [44]) and for clustering with Bregman divergences (Lucic, Bachem and Krause [43]). Recently, Bachem et al. proposed coresets for nonparametric clustering (Bachem, Lucic and Krause [7]), one-shot coresets for k-clustering (Bachem, Lucic and Lattanzi [11]) and lightweight coresets (Bachem, Lucic and Krause [12]). In addition, Ros and Guillaume proposed a coreset construction based on the Farthest-First-Traversal algorithm in [58]. For a more detailed survey of results on coresets, we refer the reader to the paper [9] by Bachem, Lucic and Krause.

In this thesis, we continue this prior work and propose new coreset construction algorithms for k-means clustering.

• Based on the farthest-first-traversal algorithm and the ProTraS algorithm by Ros & Guillaume [58], we propose an FFT-based coreset construction. This part is explained and proved in Chapter 3.

• Based on the lightweight coreset of Bachem, Lucic and Krause [12], we propose a general model called the α-lightweight coreset. This is proved in Chapter 4.

2.3 Apache Spark

Apache Spark, an open-source distributed cluster computing framework, was originally developed at the University of California, Berkeley's AMPLab. The Spark codebase was then donated to the Apache Software Foundation. There are many resources about this on the Internet. To provide brief and clear details about Spark, most of the content of this section is taken from

Apache Spark is a powerful unified analytics engine for large-scale distributed data processing and machine learning. On top of the Spark core data processing engine are libraries for SQL, machine learning, graph computation, and stream processing. These libraries can be used together in many stages of modern data pipelines and allow for code reuse across batch, interactive, and streaming applications.

Spark is useful for ETL processing, analytics and machine learning workloads, and for batch and interactive processing of SQL queries, machine learning inference, and artificial intelligence applications.

FIGURE 2.1: Spark Logo - https://spark.apache.org

Data Pipelines. Much of Spark's power lies in its ability to combine very different techniques and processes into a single, coherent whole. Outside Spark, the discrete tasks of selecting data, transforming that data in various ways, and analyzing the transformed results might easily require a series of separate processing frameworks, such as Apache Oozie. Spark, on the other hand, offers the ability to combine these, crossing boundaries between batch, streaming, and interactive workflows in ways that make the user more productive.

Spark jobs perform multiple operations consecutively, in memory, only spilling to disk when required by memory limitations. Spark simplifies the management of these disparate processes, offering an integrated whole: a data pipeline that is easier to configure, run, and maintain.


Challenges with Previous Technologies Before Spark, there was MapReduce.With MapReduce, iterative algorithms require chaining multiple MapReduce jobstogether This causes a lot of reading and writing to disk For each MapReduce job,data is read from a distributed file block into a map process, written to and read froma file in between, and then written to an output file from a reducer process.

FIGURE2.2: The Components of Spark

Advantages of Spark The goal of the Spark project was to keep the benefits ofMapReduce’s scalable, distributed, fault-tolerant processing framework while mak-ing it more efficient and easier to use Spark is designed for speed:

• Spark runs multi-threaded lightweight tasks inside of JVM processes, ing fast job startup and parallel multi-core CPU utilization.

provid-• Spark caches data in memory across multiple parallel operations, making itespecially fast for parallel processing of distributed data with iterative algo-rithms.

• Spark provides a rich functional programming model and comes packaged withhigher level libraries for SQL, machine learning, streaming, and graphs.

The components of Apache Spark are shown in Figure 2.2.



2.4 Bounds on Sample Complexity of Learning

In Chapter 4 of this thesis, we consider families of functions that map a dataset X to [0, 1], where each function in the family corresponds to a solution in our solution space. Intuitively, the pseudo-dimension provides a measure of the combinatorial complexity of the underlying machine learning problem and may be viewed as the generalization of the VC-dimension to [0, 1]-valued functions [13]. The definition of pseudo-dimension was first proposed by Haussler [31]. For an overview of the pseudo-dimension, we refer to Anthony and Bartlett [5].

Definition 2 (Pseudo-Dimension) (Haussler [31])

Fix a countably infinite domain $X$. The pseudo-dimension of a set $F$ of functions from $X$ to $[0, 1]$, denoted by $\mathrm{Pdim}(F)$, is the largest $d$ such that there is a sequence $x_1, x_2, \ldots, x_d$ of domain elements from $X$ and a sequence $r_1, r_2, \ldots, r_d$ of real thresholds such that for each $b_1, b_2, \ldots, b_d \in \{\text{above}, \text{below}\}$, there is an $f \in F$ such that for all $i = 1, 2, \ldots, d$, we have $f(x_i) \ge r_i \iff b_i = \text{above}$.
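To make the definition concrete, the following is a minimal sketch that checks by brute force whether a small, finite family of functions pseudo-shatters a given sequence of points with given thresholds. The function name `pseudo_shatters` and the toy family are illustrative assumptions, not part of the thesis.

```python
from itertools import product


def pseudo_shatters(functions, points, thresholds):
    """Return True if `functions` pseudo-shatters `points` with `thresholds`.

    Every above/below pattern b in {True, False}^d must be realized by some f,
    i.e. f(x_i) >= r_i exactly when b_i is "above" (True).
    """
    realized = {
        tuple(f(x) >= r for x, r in zip(points, thresholds))
        for f in functions
    }
    return all(
        pattern in realized
        for pattern in product([True, False], repeat=len(points))
    )


# Hypothetical family on the domain {0, 1}: each f is a lookup table into [0, 1].
tables = [(0.2, 0.2), (0.2, 0.8), (0.8, 0.2), (0.8, 0.8)]
family = [lambda x, t=t: t[x] for t in tables]

# All four above/below patterns are realized by the full family ...
shattered = pseudo_shatters(family, points=[0, 1], thresholds=[0.5, 0.5])
# ... but the pattern (above, above) is lost once the last function is removed.
not_shattered = pseudo_shatters(family[:3], points=[0, 1], thresholds=[0.5, 0.5])
```

For this toy family the two points are pseudo-shattered, so its pseudo-dimension is at least 2; because the domain has only two elements, it is exactly 2.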

By using the properties of the pseudo-dimension, Li, Long, and Srinivasan proposed a theorem giving a tight bound on the number of samples required for all functions in the function family [40].

Theorem 1 (Bounds on the Sample Complexity of Learning) (Y. Li, P. M. Long and A. Srinivasan [40])

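Let $\alpha > 0$, $v > 0$ and $\delta > 0$. Fix a countably infinite domain $X$ and let $p(\cdot)$ be any probability distribution over $X$. Let $F$ be a set of functions from $X$ to $[0, 1]$ with $\mathrm{Pdim}(F) = d$, and denote by $C$ a sample of $m$ points from $X$ sampled independently according to $p(\cdot)$.

Then, for

$$m \ge \frac{c}{\alpha^2 v}\left(d \log\frac{1}{v} + \log\frac{1}{\delta}\right),$$

where $c$ is an absolute constant, it holds with probability at least $1 - \delta$ that

$$\forall f \in F:\quad d_v\left(\sum_{x \in X} p(x) f(x),\ \frac{1}{|C|}\sum_{x \in C} f(x)\right) \le \alpha,$$

where

$$d_v(a, b) = \frac{|a - b|}{a + b + v}.$$

Over all choices of $F$ with $\mathrm{Pdim}(F) = d$, this bound on $m$ is tight.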
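As a rough illustration of how the bound behaves, the sketch below evaluates the sample size $m \ge \frac{c}{\alpha^2 v}(d \log\frac{1}{v} + \log\frac{1}{\delta})$ and the relative distance $d_v$. The theorem only guarantees that an absolute constant $c$ exists; the default `c=1.0` and the function names are assumptions made purely for illustration.

```python
import math


def d_v(a, b, v):
    """Relative distance d_v(a, b) = |a - b| / (a + b + v)."""
    return abs(a - b) / (a + b + v)


def sample_size(alpha, v, delta, d, c=1.0):
    """Smallest integer m with m >= c / (alpha^2 v) * (d log(1/v) + log(1/delta)).

    c=1.0 is a placeholder: the theorem does not give the value of the constant.
    """
    bound = c / (alpha ** 2 * v) * (d * math.log(1 / v) + math.log(1 / delta))
    return math.ceil(bound)


# Halving alpha multiplies the bound by four (up to rounding).
m_loose = sample_size(alpha=0.2, v=0.5, delta=0.1, d=3)
m_tight = sample_size(alpha=0.1, v=0.5, delta=0.1, d=3)
```

The quadratic dependence on $1/\alpha$ is the dominant cost: tightening the approximation level is much more expensive than lowering the failure probability $\delta$, which enters only logarithmically.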


• Firstly, we show that the Farthest-First-Traversal algorithm (FFT) can yield a (k, ε)-coreset for both k-median and k-means clustering.

• Secondly, we illustrate some existing limitations of ProTraS [58], the state-of-the-art coreset construction based on FFT.

• From that, by combining FFT with the good points of ProTraS, we propose an algorithm for coreset construction for both k-means and k-median clustering.

• We compare the proposed coreset with other state-of-the-art sample coresets, namely the Lightweight Coreset of Bachem et al. [12] and the Adaptive Sampling of Feldman et al. [23], with Uniform Sampling as the baseline, to show that the proposed coreset can be considered the most suitable subset of the original full data.

Even though this thesis is mainly about coresets for k-means clustering, in this section we also prove results about coresets for k-median clustering. Therefore, the FFT-based coresets can be applied not only to k-means but also to k-median clustering.


Chapter 3 FFT-based Coresets

3.1 Farthest-First-Traversal Algorithm

We start this chapter with a short introduction to the Farthest-First-Traversal (FFT) algorithm. In computational geometry, the FFT of a metric space is a sequence of points selected one after another: after the first point is chosen arbitrarily, each successive point is the one farthest from the set of previously selected points. The first use of the FFT was by Rosenkrantz, Stearns & Lewis [59] in connection with heuristics for the traveling salesman problem. Then, Gonzalez [26] used it as part of a greedy approximation algorithm for the problem of finding k clusters that minimize the maximum diameter of a cluster. Later, Arthur & Vassilvitskii [6] used an FFT-like algorithm to propose the k-means++ algorithm.

The FFT is described in Algorithm 3.

Algorithm 3 Farthest-First-Traversal algorithm
Require: dataset X with |X| = n
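The traversal described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a transcription of Algorithm 3: the function name `farthest_first_traversal`, the Euclidean metric, the parameter `m` (number of points to select) and the choice of index 0 as the arbitrary first point are all assumptions.

```python
import math
import random


def farthest_first_traversal(points, m, first=0):
    """Return the indices of m points chosen by farthest-first traversal."""
    n = len(points)
    if not 1 <= m <= n:
        raise ValueError("m must be between 1 and len(points)")
    selected = [first]
    # dist_to_set[i] = distance from points[i] to its nearest selected point.
    dist_to_set = [math.dist(p, points[first]) for p in points]
    while len(selected) < m:
        # The next point is the one farthest from the current selection.
        nxt = max(range(n), key=dist_to_set.__getitem__)
        selected.append(nxt)
        # Only the newly selected point can shorten a nearest-selected distance.
        for i, p in enumerate(points):
            dist_to_set[i] = min(dist_to_set[i], math.dist(p, points[nxt]))
    return selected


random.seed(0)
data = [(random.random(), random.random()) for _ in range(200)]
subset = farthest_first_traversal(data, m=10)
```

Keeping the running array `dist_to_set` means each new selection costs one pass over the data, so selecting `m` points takes on the order of `n * m` distance evaluations.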

3.2 FFT-based Coresets for k-Median and k-Means Clustering

Firstly, we define some expressions used in this section.

Let $X \subset \mathbb{R}^d$ be a data set and $x \in X$. Let $C \subset X$ be a subset of $X$. For each $c \in C$, we denote


Theorem 2 (FFT-based Coresets for k-Median Clustering)

There exists an ε > 0 such that the subset obtained from the FFT algorithm on dataset X is a (k, ε)-coreset of X for k-median clustering.

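Proof: k-median is a variation of k-means where, instead of calculating the mean of each cluster to determine its centroid, one calculates the median. With k-median, the functions $\phi$ of $X$ and $C$ are defined as

$$\phi_X(Q) = \sum_{x \in X} d(x, Q), \qquad \phi_C(Q) = \sum_{c \in C} w_c\, d(c, Q).$$

By the triangle inequality, for all $x \in X$ and $c \in C$,

$$d(x, Q) \le d(x, c) + d(c, Q).$$

If $x \in T(c)$, then $d(x, c) \le \max_{t \in T(c)} d(t, c) = d_c$, so

$$d(x, Q) \le d_c + d(c, Q), \quad \forall x \in T(c).$$

Summing this inequality over all $x \in T(c)$ gives

$$\sum_{x \in T(c)} d(x, Q) \le w_c d_c + w_c\, d(c, Q).$$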


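$$\phi_X(Q)\left(1 - \frac{\Delta}{\phi_X(Q)}\right) \le \phi_C(Q) \le \phi_X(Q)\left(1 + \frac{\Delta}{\phi_X(Q)}\right)$$

By choosing

$$\varepsilon = \frac{\Delta}{\phi_X(Q)} = \frac{\sum_{c \in C} w_c d_c}{\phi_X(Q)}$$

and counting the points of $X$,

$$|X| = \sum_{c \in C} \left(|T(c)| + 1\right) = \sum_{c \in C} |T(c)| + |C|.$$

Hence, with $|X| = n$ and $|C| = m$, we have $\sum_{c \in C} |T(c)| = n - m$ and

$$\varepsilon = \frac{\Delta}{\phi_X(Q)} \le \frac{(n - m)\sqrt{\sum_{c \in C} d_c^2}}{\phi_X(Q)}$$

$$\implies \lim_{m \to n} \varepsilon \le \lim_{m \to n} \frac{(n - m)\sqrt{\sum_{c \in C} d_c^2}}{\phi_X(Q)} = 0.$$

Then, $\lim_{m \to n} \varepsilon = 0$.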

So the subset C obtained from FFT on data X is a (k, ε)-coreset of X for k-median clustering. ∎
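The k-median bound can be checked numerically on toy data. The sketch below is only an illustration under explicit assumptions: each $T(c)$ is taken to contain $c$ itself, $w_c$ is the size of $T(c)$, $\Delta = \sum_{c \in C} w_c d_c$, distances are Euclidean, and the subset $C$ is a random stand-in rather than an FFT sample (the inequality being checked relies only on the nearest-element assignment). The helper names are hypothetical.

```python
import math
import random


def assign(points, centers):
    """Attach every point to its nearest center; return {center_index: [points]}."""
    groups = {j: [] for j in range(len(centers))}
    for p in points:
        j = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        groups[j].append(p)
    return groups


def dist_to_query(p, Q):
    """d(p, Q): distance from p to the nearest query center in Q."""
    return min(math.dist(p, q) for q in Q)


random.seed(1)
X = [(random.random(), random.random()) for _ in range(300)]
C = random.sample(X, 20)   # stand-in subset; the thesis uses an FFT sample here
groups = assign(X, C)      # T(c), assumed here to include c itself

w = {j: len(g) for j, g in groups.items()}                               # w_c
d = {j: max(math.dist(t, C[j]) for t in g) for j, g in groups.items()}   # d_c
delta = sum(w[j] * d[j] for j in groups)                                 # Delta

Q = [(random.random(), random.random()) for _ in range(3)]  # k = 3 query centers
phi_X = sum(dist_to_query(x, Q) for x in X)
phi_C = sum(w[j] * dist_to_query(C[j], Q) for j in groups)
eps = delta / phi_X
```

Under these assumptions `abs(phi_X - phi_C) <= delta` holds for any query `Q`, which is the two-sided coreset inequality with `eps = delta / phi_X`.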


Theorem 3 (FFT-based Coresets for k-Means Clustering)

There exists an ε > 0 such that the subset obtained from the FFT algorithm on dataset X is a (k, ε)-coreset of X for k-means clustering.

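Proof: With k-means, the functions $\phi$ of $X$ and $C$ are

$$\phi_X(Q) = \sum_{x \in X} d(x, Q)^2, \qquad \phi_C(Q) = \sum_{c \in C} w_c\, d(c, Q)^2.$$

Using the triangle inequality, for all $x \in X$ and $c \in C$, we have

$$d(x, Q) \le d(x, c) + d(c, Q)$$

$$d(x, Q) \le d_c + d(c, Q), \quad \forall x \in T(c)$$

$$d(x, Q)^2 \le \left(d_c + d(c, Q)\right)^2$$

$$d(x, Q)^2 \le d_c^2 + d(c, Q)^2 + 2 d_c d_{\max}$$

Summing the inequality over all $x \in T(c)$, and then over all $c \in C$,

$$\sum_{c \in C} \sum_{x \in T(c)} d(x, Q)^2 \le \sum_{c \in C} w_c d_c^2 + \sum_{c \in C} w_c\, d(c, Q)^2 + 2 \sum_{c \in C} w_c d_c d_{\max}$$

and, symmetrically,

$$\sum_{c \in C} w_c\, d(c, Q)^2 \le \sum_{c \in C} w_c d_c^2 + \sum_{c \in C} \sum_{x \in T(c)} d(x, Q)^2 + 2 \sum_{c \in C} w_c d_c d_{\max}$$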


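Let

$$\Delta = \sum_{c \in C} w_c d_c \left(d_c + 2 d_{\max}\right)$$

and choose

$$\varepsilon = \frac{\Delta}{\phi_X(Q)} \le \frac{(n - m)\sqrt{\sum_{c \in C} d_c^2 \left(d_c + 2 d_{\max}\right)^2}}{\phi_X(Q)}$$

$$\lim_{m \to n} \frac{(n - m)\sqrt{\sum_{c \in C} d_c^2 \left(d_c + 2 d_{\max}\right)^2}}{\phi_X(Q)} = 0 \implies \lim_{m \to n} \varepsilon = 0$$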

So the subset C obtained from FFT on data X is a (k, ε)-coreset of X for k-means clustering. ∎

From Theorem 2 for k-median and Theorem 3 for k-means clustering, we conclude this section with a combined theorem as follows.

Theorem 4 (FFT-based Coresets for both k-Median and k-Means Clustering)

The sample obtained by applying the FFT algorithm to data set X is a (k, ε)-coreset of X for both k-median and k-means clustering.
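The squared-distance (k-means) gap used in the proof of Theorem 3 can be checked in the same spirit. This sketch is an illustration only: it assumes $T(c)$ includes $c$, takes $w_c$ as the size of $T(c)$, and interprets $d_{\max}$ as the largest distance from a selected point to the query set $Q$, which is an assumption because the definition of $d_{\max}$ is not shown in this section. The function name `kmeans_gap` is hypothetical, and `C` is a random stand-in for an FFT sample.

```python
import math
import random


def kmeans_gap(X, C, Q):
    """Return (phi_X, phi_C, Delta) for squared-distance (k-means) costs."""

    def dq(p):
        # d(p, Q): distance from p to the nearest query center.
        return min(math.dist(p, q) for q in Q)

    w = [0] * len(C)      # w_c: number of points attached to c
    d = [0.0] * len(C)    # d_c: largest distance from c to a point attached to it
    for x in X:
        j = min(range(len(C)), key=lambda j: math.dist(x, C[j]))
        w[j] += 1
        d[j] = max(d[j], math.dist(x, C[j]))
    d_max = max(dq(c) for c in C)  # assumed meaning of d_max
    phi_X = sum(dq(x) ** 2 for x in X)
    phi_C = sum(wj * dq(c) ** 2 for wj, c in zip(w, C))
    delta = sum(wj * dj * (dj + 2 * d_max) for wj, dj in zip(w, d))
    return phi_X, phi_C, delta


random.seed(3)
X = [(random.random(), random.random()) for _ in range(300)]
C = random.sample(X, 25)   # stand-in for an FFT sample
Q = [(random.random(), random.random()) for _ in range(4)]
phi_X, phi_C, delta = kmeans_gap(X, C, Q)
```

With this reading of $d_{\max}$, each attached point contributes at most $d_c (d_c + 2 d_{\max})$ to the difference of squared distances, so `abs(phi_X - phi_C) <= delta` holds.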


3.3 ProTraS algorithm and limitations

In the previous section, we have shown that the FFT algorithm can be used to build a coreset for both k-median and k-means clustering. However, there has been little research on FFT-based coresets. In 2018, the FFT algorithm was first used to find coresets by Ros and Guillaume [58]. The sample from their coreset construction can be considered a coreset for k-median clustering. Even though their work has some limitations, it is still a very valuable resource and a good starting point for further study.

In 2017, Ros and Guillaume proposed DENDIS [56] and DIDES [57], which are iterative algorithms based on the hybridization of distance and density concepts. They differ in the priority given to distance or density, and in the stopping criterion defined accordingly. However, they have drawbacks. In 2018, building on the FFT algorithm and the good points of DENDIS and DIDES, Ros and Guillaume proposed a new algorithm named ProTraS [58] that is both easy to tune and scalable. The ProTraS algorithm is based on a sampling cost that is computed according to the within-group distance and the representativeness of the sample item. This algorithm is designed to produce a (k, ε)-coreset and uses the approximation level, ε, as the stopping criterion. It has since been used with good results in other research, such as [62] by Le Hong Trang et al.

The original ProTraS is described in Algorithm 4.

According to line 18 of Algorithm 4, the cost function can be expressed as

$$\text{cost} = \sum_{k=1}^{s} \frac{p_k}{n} = \sum_{k=1}^{s} \frac{|T(y_k)|\, d_{\max}(y_k)}{n}$$


Algorithm 4 ProTraS Algorithm [58]
Require: T = {x_i}, for i = 1, 2, ..., n, and ε
Ensure: S = {y_j} and T(y_j), for j = 1, 2, ..., s

1: Select an initial pattern x_init ∈ T

11: Find d_max(y_k) = max_{x_m ∈ T(y_k)} d(x_m, y_k)

12: Store d_max(y_k), x_max(y_k) where d_max(y_k) = d(x_max(y_k), y_k)


FIGURE 3.1: Original-R15 (small cluster) and scaling-R15

ProTraS stops when cost < ε0, where ε0 is a given input constant (line 4). This cost function is taken from equation 3.3 for ε in k-median clustering,

$$\varepsilon = \frac{\sum_{c \in C} w_c d_c}{\phi_X(Q)}.$$

To remove the denominator of this equation easily, Ros and Guillaume used the assumption that

$$d(x, Q) \ge 1, \quad \forall x \in X,$$

which gives $\text{cost}_X(Q) \ge n$. This assumption leads to some problems, as follows.

• For data with large distances between points, Algorithm 4 needs many iterations before reaching the stop condition; in some cases, with a given ε0 = 0.1, Algorithm 4 returns a coreset whose size equals the size of the full data set.

• Algorithm 4 cannot recognize two data sets that have the same shape but different scale parameters as equivalent (Figure 3.1). With the same given ε0, the data set with the larger scale will have a bigger coreset.
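The scale dependence in the second point can be reproduced with a small experiment. The sketch below assumes the cost has the form $\sum_k |T(y_k)|\, d_{\max}(y_k) / n$ for a fixed sample with nearest-element assignment; the function name `protras_style_cost` and the toy data are assumptions, and the code does not reproduce the full ProTraS loop.

```python
import math
import random


def protras_style_cost(points, sample):
    """cost = sum_k |T(y_k)| * d_max(y_k) / n for a fixed sample (assumed form)."""
    size = [0] * len(sample)        # |T(y_k)|
    radius = [0.0] * len(sample)    # d_max(y_k)
    for x in points:
        k = min(range(len(sample)), key=lambda k: math.dist(x, sample[k]))
        size[k] += 1
        radius[k] = max(radius[k], math.dist(x, sample[k]))
    return sum(s * r for s, r in zip(size, radius)) / len(points)


random.seed(2)
data = [(random.random(), random.random()) for _ in range(200)]
sample = data[:15]

# Same shape, different scale: every coordinate is multiplied by the same factor.
scale = 8.0
scaled_data = [(scale * a, scale * b) for a, b in data]
scaled_sample = scaled_data[:15]

cost_small = protras_style_cost(data, sample)
cost_large = protras_style_cost(scaled_data, scaled_sample)
```

Because every distance is multiplied by `scale`, the cost of the same sample grows by the same factor, so a fixed threshold ε0 is reached later on the scaled data and more points are selected before the stop condition holds.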

In fact, with the above assumption, Algorithm 4 can only be optimized and create good samples in circumstances that satisfy the given assumption. Although the original ProTraS is only for k-median clustering and has some limitations, the idea of constructing coresets from the FFT algorithm is worth deeper research. In the next section, we propose an FFT-based coreset construction for both k-means and k-median clustering that is independent of the value of ε.


3.4 Proposed FFT-based Coreset Construction

Based on the FFT algorithm and the three theorems in Section 3.2, we propose an algorithm for coreset construction for both k-median and k-means clustering. The pseudocode of the proposed algorithm is given in Algorithm 5. To complete the proposed algorithm, we also propose a strategy for the Initial Step and a strategy for decreasing the computational complexity.

Algorithm 5 Proposed FFT-based Coreset Construction
Require: X = {x_i}; i = 1, 2, ..., n; and m

The main loop (lines 5-20) includes two sub-loops.

• In the first one (lines 6-9), each unselected pattern, $x_i \in X \setminus C$, is attached to the closest selected element in $C$.

• The second loop finds the next selected representative, $x^* = x_{\max}(c^*)$, which is the farthest item from the current $C$. In this step, if we simply choose $x^*$ as the point at the farthest distance from the current set $C$ among all points in $X$, as in the original FFT, i.e.

$$d(x^*, C) = \max_{(x \in X \setminus C) \wedge (c \in C)} d(x, c)$$


