Clustering Large Datasets based on Data Sampling and Spark


VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
UNIVERSITY OF TECHNOLOGY

NGUYEN LE HOANG

CLUSTERING LARGE DATASETS BASED ON DATA SAMPLING AND SPARK

Major: Computer Science
Code: 8480101

MASTER THESIS

Ho Chi Minh City, January 2020

THE RESEARCH WORK FOR THIS THESIS HAS BEEN CARRIED OUT AT THE UNIVERSITY OF TECHNOLOGY, VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY

• Supervisor: Assoc. Prof. Dr. DANG TRAN KHANH
• Co-supervisor: Dr. LE HONG TRANG

• Examiner 1: Dr. PHAN TRONG NHAN
• Examiner 2: Assoc. Prof. Dr. HUYNH TRUNG HIEU

This thesis was reviewed and defended at the University of Technology, VNU-HCMC, on December 30, 2019. The members of the Thesis Defense Committee were:

1. Assoc. Prof. Dr. NGUYEN THANH BINH
2. Dr. NGUYEN AN KHUONG
3. Dr. PHAN TRONG NHAN
4. Assoc. Prof. Dr. HUYNH TRUNG HIEU
5. Assoc. Prof. Dr. NGUYEN TUAN DANG

Confirmed by the President of the Thesis Defense Committee and the Dean of the Faculty of Computer Science and Engineering after the thesis was revised (if any):

President of the Thesis Defense Committee: Assoc. Prof. Dr. Nguyen Thanh Binh (signed)
Dean of the Faculty of Computer Science and Engineering: Assoc. Prof. Dr. Pham Tran Vu (signed)

MASTER THESIS OBLIGATIONS

Student: NGUYEN LE HOANG
Student ID: 1770472
Date of birth: March 12, 1988
Place of birth: Ho Chi Minh City
Major: Computer Science
Code: 8480101

I. THESIS TITLE: CLUSTERING LARGE DATASETS BASED ON DATA SAMPLING AND SPARK

II. OBLIGATIONS AND CONTENTS:
• Study clustering problems, data sampling methods, data generalization, and the Apache Spark framework for big data.
• Based on data sampling, propose and prove algorithms for coreset construction, in order to find subsets that both reduce the computational cost and serve as representative subsets of the full original datasets in clustering problems.
• Run experiments and evaluate the proposed methods.

III. START DATE: January 4, 2019
IV. END DATE: December 7, 2019
V. SUPERVISORS: Assoc. Prof. Dr. DANG TRAN KHANH and Dr. LE HONG TRANG

Ho Chi Minh City, December 7, 2019

Supervisor: Assoc. Prof. Dr. Dang Tran Khanh (signed)
Dean of the Faculty of Computer Science and Engineering: Assoc. Prof. Dr. Pham Tran Vu (signed)

Acknowledgements

I am very grateful to my supervisor, Assoc. Prof. Dr. DANG TRAN KHANH, and my co-supervisor, Dr. LE HONG TRANG, for the guidance, inspiration, and constructive suggestions that helped me in the preparation of this thesis. I would also like to thank my family very much, especially my parents, who have always been by my side and supported me in whatever I do.

Ho Chi Minh City, December 7, 2019

Abstract

With the development of technology, data has become one of the most essential resources of the 21st century. However, the explosion of the Internet has turned these data into very large datasets that are hard to handle and process. In this thesis, we propose solutions for clustering large-scale data, a vital problem in machine learning that is widely applied in industry. To solve this problem, we use data sampling methods based on the concept of coresets: subsets of the data that must be small enough to reduce computational complexity, yet must preserve all representative characteristics of the original dataset. In other words, we can scale big datasets down to much smaller ones that can be clustered efficiently, while the results can be regarded as solutions for the whole original datasets. Besides, in order to make the solving process much faster for large-scale datasets, we apply the open-source big data framework Apache Spark. Within the scope of this thesis, we propose and prove two methods of coreset construction for k-means clustering. We also run experiments and evaluate the proposed algorithms to assess the advantages and disadvantages of each. This thesis can be divided into four parts as follows:

• Chapter 1 and Chapter 2 are the introduction and an overview of coresets and the related background. These chapters also provide a brief description of Apache Spark and the definitions and theorems used in this thesis.
• In Chapter 3, we propose and prove the first coreset construction, based on the Farthest-First-Traversal algorithm and the ProTraS algorithm [58], for k-median and k-means clustering. We also evaluate this method at the end of the chapter.
• In Chapter 4, based on prior work on lightweight coresets [12], we propose and prove the correctness of the second coreset construction, the α-lightweight coreset for k-means clustering, a general form of the lightweight coreset with an adjustable parameter.
• In Chapter 5, we apply the α-lightweight coreset and the data generalization method to solve the overall problem of this thesis: clustering large-scale datasets. We also apply Apache Spark to solve the problem faster. To evaluate correctness, we run experiments on some large-scale benchmark datasets.

Declaration of Authorship

I, NGUYEN LE HOANG, declare that this thesis, Clustering Large Datasets based on Data Sampling and Spark, and the work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a Master of Science at the University of Technology, VNU-HCMC.
• No part of this thesis has previously been submitted for any degree or any other qualification at this university or any other institution.
• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.
• I have acknowledged all main sources of help.
• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed:
Date:

Chapter 5: Clustering Large Datasets via Coresets and Spark

TABLE 5.2: Experimental Results for dataset Birch1

Algorithm    Coreset Size  Coreset Runtime  Spark Runtime  DataGen Runtime  Total Runtime  ARI
Alpha34      1000          0.0038           2.3503         2.1930            4.5471        0.6713
Lightweight  1000          0.0103           2.3500         2.2467            4.6071        0.6458
Uniform      1000          0.0094           2.2821         2.2577            4.5493        0.6369
Alpha34      2000          0.0038           2.3503         2.1930            4.5471        0.6713
Lightweight  2000          0.0103           2.3500         2.2467            4.6071        0.6458
Uniform      2000          0.0094           2.2821         2.2577            4.5493        0.6369
Alpha34      4000          0.0046           2.8099         2.3425            5.1570        0.7311
Lightweight  4000          0.0115           2.7684         2.2752            5.0551        0.7277
Uniform      4000          0.0106           2.7968         2.2743            5.0816        0.7186
Alpha34      8000          0.0072           3.3656         2.4064            5.7791        0.7471
Lightweight  8000          0.0127           3.4976         2.3997            5.9101        0.7399
Uniform      8000          0.0131           3.5200         2.4318            5.9648        0.7302
Alpha34      16000         0.0103           4.7815         2.7711            7.5629        0.7592
Lightweight  16000         0.0176           4.7319         2.7988            7.5484        0.7550
Uniform      16000         0.0168           4.9040         2.8434            7.7641        0.7498
Alpha34      32000         0.0197           7.3382         3.5552           10.9131        0.7978
Lightweight  32000         0.0230           7.1153         3.5684           10.7068        0.7914
Uniform      32000         0.0234           7.3908         3.6035           11.0177        0.7632

TABLE 5.3: Experimental Results for dataset Birch2

Algorithm    Coreset Size  Coreset Runtime  Spark Runtime  DataGen Runtime  Total Runtime  ARI
Alpha34      1000          0.0038           2.2264         2.1448            4.3750        0.7347
Lightweight  1000          0.0101           2.2480         2.1778            4.4360        0.7118
Uniform      1000          0.0091           2.3082         2.1941            4.5114        0.7333
Alpha34      2000          0.0038           2.2264         2.1448            4.3750        0.7447
Lightweight  2000          0.0101           2.2480         2.1778            4.4360        0.7118
Uniform      2000          0.0091           2.3082         2.1941            4.5114        0.7333
Alpha34      4000          0.0044           2.6986         2.3050            5.0080        0.7943
Lightweight  4000          0.0114           2.6105         2.2338            4.8557        0.7864
Uniform      4000          0.0108           2.6509         2.2372            4.8988        0.7799
Alpha34      8000          0.0063           3.1804         2.3344            5.5211        0.8193
Lightweight  8000          0.0124           3.2386         2.3505            5.6015        0.7932
Uniform      8000          0.0126           3.1801         2.3658            5.5585        0.7967
Alpha34      16000         0.0112           4.2391         2.6625            6.9128        0.8157
Lightweight  16000         0.0169           4.4452         2.7806            7.2427        0.7918
Uniform      16000         0.0175           4.3759         2.7152            7.1086        0.7998
Alpha34      32000         0.0190           6.7060         3.4141           10.1391        0.7868
Lightweight  32000         0.0237           6.3036         3.5076            9.8349        0.7958
Uniform      32000         0.0231           6.4106         3.4681            9.9019        0.7795

TABLE 5.4: Experimental Results for dataset Birch3

Algorithm    Coreset Size  Coreset Runtime  Spark Runtime  DataGen Runtime  Total Runtime  ARI
Alpha34      1000          0.0033           2.2814         2.1710            4.4556        0.7437
Lightweight  1000          0.0105           2.2729         2.2208            4.5042        0.7371
Uniform      1000          0.0092           2.2887         2.2327            4.5306        0.7434
Alpha34      2000          0.0033           2.2814         2.1710            4.4556        0.7437
Lightweight  2000          0.0105           2.2729         2.2208            4.5042        0.7371
Uniform      2000          0.0092           2.2887         2.2327            4.5306        0.7434
Alpha34      4000          0.0046           2.8330         2.2485            5.0862        0.7727
Lightweight  4000          0.0114           2.7355         2.2602            5.0070        0.7635
Uniform      4000          0.0108           2.7106         2.2393            4.9607        0.7586
Alpha34      8000          0.0068           3.5519         2.4074            5.9661        0.7966
Lightweight  8000          0.0132           3.3772         2.3619            5.7524        0.7756
Uniform      8000          0.0122           3.5111         2.3880            5.9113        0.8060
Alpha34      16000         0.0108           4.8743         2.6743            7.5594        0.7791
Lightweight  16000         0.0167           4.8968         2.6477            7.5612        0.8054
Uniform      16000         0.0164           4.8520         2.6786            7.5470        0.7717
Alpha34      32000         0.0198           7.2676         3.3381           10.6254        0.8100
Lightweight  32000         0.0236           7.1766         3.4145           10.6147        0.7982
Uniform      32000         0.0229           7.3618         3.4107           10.7954        0.8050

TABLE 5.5: Experimental Results for dataset ConfLongDemo

Algorithm    Coreset Size  Coreset Runtime  Spark Runtime  DataGen Runtime  Total Runtime  ARI
Alpha34      1000          0.0038           2.3503         2.1930            4.5471        0.6713
Lightweight  1000          0.0103           2.3500         2.2467            4.6071        0.6458
Uniform      1000          0.0094           2.2821         2.2577            4.5493        0.6369
Alpha34      2000          0.0038           2.3503         2.1930            4.5471        0.6713
Lightweight  2000          0.0103           2.3500         2.2467            4.6071        0.6458
Uniform      2000          0.0094           2.2821         2.2577            4.5493        0.6369
Alpha34      4000          0.0046           2.8099         2.3425            5.1570        0.7311
Lightweight  4000          0.0115           2.7684         2.2752            5.0551        0.7277
Uniform      4000          0.0106           2.7968         2.2743            5.0816        0.7186
Alpha34      8000          0.0072           3.3656         2.4064            5.7791        0.7471
Lightweight  8000          0.0127           3.4976         2.3997            5.9101        0.7399
Uniform      8000          0.0131           3.5200         2.4318            5.9648        0.7302
Alpha34      16000         0.0103           4.7815         2.7711            7.5629        0.7592
Lightweight  16000         0.0176           4.7319         2.7988            7.5484        0.7550
Uniform      16000         0.0168           4.9040         2.8434            7.7641        0.7498
Alpha34      32000         0.0197           7.3382         3.5552           10.9131        0.7978
Lightweight  32000         0.0230           7.1153         3.5684           10.7068        0.7914
Uniform      32000         0.0234           7.3908         3.6035           11.0177        0.7632

TABLE 5.6: Experimental Results for dataset KDDCup Bio

Algorithm    Coreset Size  Coreset Runtime  Spark Runtime  DataGen Runtime  Total Runtime  ARI
Alpha34      1000          0.0052            2.6302        56.2187          58.8541        0.2247
Lightweight  1000          0.3664            2.5161        55.5587          58.4412        0.2141
Uniform      1000          0.3674            2.5022        56.1563          59.0259        0.2080
Alpha34      2000          0.0052            2.6302        56.2187          58.8541        0.2247
Lightweight  2000          0.3664            2.5161        55.5587          58.4412        0.2141
Uniform      2000          0.3674            2.5022        56.1563          59.0259        0.2080
Alpha34      4000          0.0080            3.8037        53.2827          57.0944        0.2574
Lightweight  4000          0.3638            3.9257        54.2313          58.5208        0.2452
Uniform      4000          0.3728            3.8220        53.8871          58.0818        0.2401
Alpha34      8000          0.0115            6.1304        52.0077          58.1496        0.2894
Lightweight  8000          0.3723            5.2533        51.5868          57.2125        0.2640
Uniform      8000          0.3677            5.3248        52.0609          57.7533        0.2607
Alpha34      16000         0.0178            9.6949        48.0922          57.8050        0.3195
Lightweight  16000         0.3706            9.2364        47.5339          57.1409        0.3101
Uniform      16000         0.3806            9.0320        47.7005          57.1130        0.2993
Alpha34      32000         0.0306           16.0353        44.5965          60.6624        0.3314
Lightweight  32000         0.4077           16.1055        44.6503          61.1635        0.3309
Uniform      32000         0.4074           15.0087        44.6620          60.0780        0.3111

Conclusions

In this thesis, we solve the problem of clustering large-scale datasets. The approaches we use are based on data sampling via coresets and on Apache Spark. While investigating data sampling via coresets, we propose two coreset constructions for k-means clustering. The whole thesis can be summarized as follows.

In Chapter 2, we introduce and give an overview of the background and related works. If k-means clustering is a very classical term in machine learning, "coreset" may be more fascinating: instead of solving problems on big data, which costs a lot of computation, one can find a subset such that the solutions on this subset approximate the solutions on the original dataset. We also provide a brief introduction to Apache Spark in this chapter, and
some definitions as well as theorems that are useful for and related to this thesis.

In Chapter 3, we have proved that the Farthest-First-Traversal algorithm is itself a very good method for finding coresets for both the k-median and k-means problems. Based on this, we propose a novel coreset construction for k-means and k-median that depends on the coreset size. The disadvantage of this algorithm, as of all FFT-related algorithms, compared with sampling methods is speed and runtime. This is obvious, since all FFT-related algorithms need to visit every point in the full dataset. However, the result of this algorithm contains not only a coreset of the full data with high accuracy but also very useful characteristics of the data: the maximum distance and the number of elements associated with each representative in the coreset. This information can be used for further purposes in research and applications that need to estimate the distribution, density, or structure of the original dataset. Moreover, unlike other sampling-based methods, which create a different sample each time the process is re-run, the coreset from the proposed algorithm is unique and unchanged; we only need to run it once, and the resulting subset is truly a coreset of the full data for k-means and k-median clustering.

In Chapter 4, based on prior work on lightweight coresets [12], we propose a general lightweight coreset, named the α-lightweight coreset, which allows both multiplicative and additive errors. Unlike the traditional lightweight coreset, where the multiplicative and additive errors are treated the same and have equal weight, the α-lightweight coreset allows the proportion between these two errors to be adjusted: α = 1/2 gives the traditional lightweight coreset, a larger α means more focus on the multiplicative error, and a smaller α means more focus on the additive error. In this chapter, we also propose and prove the general algorithm for the α-lightweight coreset construction. Since this approach is a sampling-based method, the algorithm executes extremely fast and can be used in practice.

In Chapter 5, we apply the method proposed in Chapter 4 to solve the main problem of this thesis: clustering large-scale datasets. To solve it properly and faster, we apply a framework for big data, Apache Spark, which lets the problem run smoothly and quickly. Spark is a wonderful framework with various libraries and built-in functions for machine learning and data mining; with Spark, k-means++ clustering is easy to implement and deploy. To evaluate the method, we ran experiments on some large-scale sample datasets. The results show that data sampling through uniform sampling should be replaced by the α-lightweight coreset: with nearly equal running time, the results from both the traditional lightweight coreset and the α-lightweight coreset (in these experiments, we chose α = 3/4) outperform the results from random uniform sampling.

Overall, in this thesis, we propose and prove two methods of coreset construction for k-means clustering. The FFT-based coreset construction is slow, but its accuracy makes it one of the most representative subsets of any original data; it is suitable for research and can be used as a baseline against which future methods are compared. On the other hand, our second coreset construction, the α-lightweight coreset, is a sampling-based method: it can find a coreset very quickly, but its accuracy is not as high as that of the FFT-based coreset. Both methods, however, are clearly better than random uniform sampling, a very naive and widely used method. Finally, each method mentioned in this thesis has its own advantages and disadvantages; the options "slow but more accurate" or "fast but less correct" must be weighed before applying any of these algorithms in practice.
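As a concrete illustration of the sampling-based approach summarized above, the following Python sketch draws a weighted coreset using a lightweight-coreset-style distribution that mixes a uniform term with a distance-to-the-mean term. The function name, the exact role of the `alpha` parameter, and the weighting scheme are illustrative assumptions, not the thesis's exact algorithm; `alpha = 0.5` corresponds to the standard lightweight coreset distribution of Bachem et al. [12].

```python
import random

def alpha_lightweight_coreset(points, m, alpha=0.5, seed=0):
    """Sample a weighted coreset of size m from `points` (a list of tuples).

    Sketch of a lightweight-coreset-style construction: each point x is drawn
    with probability q(x) mixing a uniform term and a term proportional to its
    squared distance to the dataset mean.  The `alpha` parameter here is an
    illustrative stand-in for the thesis's adjustable error trade-off.
    """
    rng = random.Random(seed)
    n, d = len(points), len(points[0])
    # mean of the dataset
    mu = tuple(sum(p[j] for p in points) / n for j in range(d))
    # squared distance of each point to the mean
    sq = [sum((p[j] - mu[j]) ** 2 for j in range(d)) for p in points]
    total = sum(sq) or 1.0  # guard against a degenerate all-identical dataset
    # sampling distribution: uniform part + distance part (sums to 1)
    q = [(1 - alpha) / n + alpha * s / total for s in sq]
    idx = rng.choices(range(n), weights=q, k=m)
    # weight each sample by 1/(m * q(x)) so the weighted coreset cost is an
    # unbiased estimator of the full k-means cost
    return [(points[i], 1.0 / (m * q[i])) for i in idx]
```

A weighted k-means run on the returned (point, weight) pairs then stands in for clustering the full dataset, which is the cost reduction the chapters above exploit.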
Future Research

In this thesis, we only solve the problem of clustering large-scale datasets via data sampling. Specifically, we solve it for k-means clustering, and our proposed coreset constructions are also only for k-means clustering. There is plenty of room for deeper research, such as:

• Coreset constructions for other types of clustering, for example Gaussian mixture models, or k-means clustering over general metrics rather than only Euclidean distances.
• Coreset constructions for deep learning or other fields besides machine learning and data mining.

Bibliography

[1] K. Ackermann and S. D. Angus. "A Resource Efficient Big Data Analysis Method for the Social Sciences: the Case of Global IP Activity". Procedia Computer Science, vol. 29, pp. 2360–2369, 2014.
[2] P. K. Agarwal, C. M. Procopiuc and K. R. Varadarajan. "Approximating Extent Measures of Points". Journal of the ACM (JACM), vol. 51(4), pp. 606–635, 2004.
[3] P. K. Agarwal, C. M. Procopiuc and K. R. Varadarajan. "Geometric Approximation via Coresets". Combinatorial and Computational Geometry, vol. 52, pp. 1–30, 2005.
[4] M. Ankerst, M. Breunig, H. Kriegel and J. Sander. "OPTICS: Ordering Points to Identify the Clustering Structure". Proceedings of the ACM SIGMOD International Conference on Management of Data, vol. 28, pp. 49–60, 1999.
[5] M. Anthony and P. L. Bartlett. "Neural Network Learning: Theoretical Foundations". Cambridge University Press, 2009.
[6] D. Arthur and S. Vassilvitskii. "k-Means++: The Advantages of Careful Seeding". Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035, 2007.
[7] O. Bachem, M. Lucic and A. Krause. "Coresets for Nonparametric Estimation - the Case of DP-Means". International Conference on Machine Learning (ICML), 2015.
[8] O. Bachem, M. Lucic and A. Krause. "Distributed and Provably Good Seedings for k-Means in Constant Rounds". International Conference on Machine Learning (ICML), 2017.
[9] O. Bachem, M. Lucic and A. Krause. "Practical Coreset Constructions for Machine Learning". arXiv preprint, 2017.
[10] O. Bachem, M. Lucic, S. H. Hassani and A. Krause. "Uniform Deviation Bounds for k-Means Clustering". International Conference on Machine Learning (ICML), 2017.
[11] O. Bachem, M. Lucic and S. Lattanzi. "One-Shot Coresets: The Case of k-Clustering". International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.
[12] O. Bachem, M. Lucic and A. Krause. "Scalable and Distributed Clustering via Lightweight Coresets". International Conference on Knowledge Discovery and Data Mining (KDD), 2018.
[13] O. Bachem. "Sampling for Large-scale Clustering". PhD thesis, ETH Zurich Library, 2018.
[14] S. Ben-David and M. Ackerman. "Measures of Clustering Quality: A Working Set of Axioms for Clustering". Neural Information Processing Systems (NIPS), 2009.
[15] K. Chen. "On Coresets for k-Median and k-Means Clustering in Metric and Euclidean Spaces and Their Applications". SIAM Journal on Computing, vol. 39(3), pp. 923–947, 2009.
[16] A. Coates and A. Y. Ng. "Learning Feature Representations with k-Means". Neural Networks: Tricks of the Trade, Springer, pp. 561–580, 2012.
[17] D. Comaniciu and P. Meer. "Mean Shift: A Robust Approach Toward Feature Space Analysis". IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 603–619, 2002.
[18] T. K. Dang and K. T. K. Tran. "The Meeting of Acquaintances: A Cost-Efficient Authentication Scheme for Light-Weight Objects with Transient Trust Level and Plurality Approach". Security and Communication Networks, 2019.
[19] W. Dong, F. Douglis, K. Li, H. Patterson, S. Reddy and P. Shilane. "Tradeoffs in Scalable Data Routing for Deduplication Clusters". The 9th USENIX Conference on File and Storage Technologies, 2011.
[20] M. Ester, H. Kriegel, J. Sander and X. Xu. "A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise". Proceedings of the Second ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 226–231, 1996.
[21] D. Feldman, M. Monemizadeh and C. Sohler. "A PTAS for k-Means Clustering based on Weak Coresets". Symposium on Computational Geometry (SoCG), ACM, pp. 11–18, 2007.
[22] D. Feldman, M. Monemizadeh, C. Sohler and D. P. Woodruff. "Coresets and Sketches for High Dimensional Subspace Approximation Problems". Symposium on Discrete Algorithms (SODA), Society for Industrial and Applied Mathematics, pp. 630–649, 2010.
[23] D. Feldman, M. Faulkner and A. Krause. "Scalable Training of Mixture Models via Coresets". Advances in Neural Information Processing Systems (NIPS), pp. 2142–2150, 2011.
[24] D. Feldman, M. Schmidt and C. Sohler. "Turning Big Data into Tiny Data: Constant-size Coresets for k-Means, PCA and Projective Clustering". Symposium on Discrete Algorithms (SODA), Society for Industrial and Applied Mathematics, pp. 1434–1453, 2013.
[25] J. Friedman, T. Hastie and R. Tibshirani. "Sparse Inverse Covariance Estimation with the Graphical Lasso". Biostatistics, 9(3), pp. 432–441, 2008.
[26] T. F. Gonzalez. "Clustering to Minimize the Maximum Inter-cluster Distance". Theoretical Computer Science, vol. 38, pp. 293–306, 1985.
[27] S. Guha, R. Rastogi and K. Shim. "CURE: An Efficient Clustering Algorithm for Large Databases". ACM SIGMOD Record, 27, pp. 73–84, 1998.
[28] S. Har-Peled and S. Mazumdar. "On Coresets for k-Means and k-Median Clustering". ACM Symposium on Theory of Computing (STOC), pp. 291–300, 2004.
[29] S. Har-Peled and A. Kushal. "Smaller Coresets for k-Median and k-Means Clustering". ACM Symposium on Computational Geometry (SoCG), pp. 126–134, 2005.
[30] S. Har-Peled. "Geometric Approximation Algorithms". American Mathematical Society, vol. 173, 2011.
[31] D. Haussler. "Decision Theoretic Generalizations of the PAC Model for Neural Net and Other Learning Applications". Information and Computation, vol. 100(1), pp. 78–150, 1992.
[32] L. Hubert and P. Arabie. "Comparing Partitions". Journal of Classification, vol. 2, pp. 193–218, 1985.
[33] M. Inaba, N. Katoh and H. Imai. "Applications of Weighted Voronoi Diagrams and Randomization to Variance-based k-Clustering". Proceedings of the 10th Annual Symposium on Computational Geometry, pp. 332–339, 1994.
[34] W. Inoubli, S. Aridhi, H. Mezni and A. Jung. "Big Data Frameworks: A Comparative Study". arXiv:1610.09962v2, 2017.
[35] M. Iwayama and T. Tokunaga. "Cluster-Based Text Categorization: A Comparison of Category Search Strategies". Conference on Research and Development in Information Retrieval (SIGIR), ACM, pp. 273–280, 1995.
[36] D. Jiang, C. Tang and A. Zhang. "Cluster Analysis for Gene Expression Data: A Survey". Transactions on Knowledge and Data Engineering (TKDE), 16(11), pp. 1370–1386, 2004.
[37] L. Kaufman and P. Rousseeuw. "Finding Groups in Data: An Introduction to Cluster Analysis", vol. 344. Wiley Online Library, DOI:10.1002/9780470316801, 2005.
[38] H. Kriegel, P. Kroger, J. Sander and A. Zimek. "Density-based Clustering". Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 1(3), pp. 231–240, 2011.
[39] J. Kleinberg. "An Impossibility Theorem for Clustering". Neural Information Processing Systems (NIPS), 2003.
[40] Y. Li, P. M. Long and A. Srinivasan. "Improved Bounds on the Sample Complexity of Learning". Journal of Computer and System Sciences, vol. 62(3), pp. 516–527, 2001.
[41] G. Linden, B. Smith and J. York. "Amazon.com Recommendations: Item-to-Item Collaborative Filtering". IEEE Internet Computing, 7(1), pp. 76–80, 2003.
[42] S. P. Lloyd. "Least Squares Quantization in PCM". IEEE Transactions on Information Theory, vol. 28, pp. 129–137, 1982.
[43] M. Lucic, O. Bachem and A. Krause. "Strong Coresets for Hard and Soft Bregman Clustering with Applications to Exponential Family Mixtures". International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1–9, 2016.
[44] M. Lucic, M. Faulkner and A. Krause. "Training Mixture Models at Scale via Coresets". Journal of Machine Learning Research (JMLR), 2017.
[45] J. MacQueen. "Some Methods for Classification and Analysis of Multivariate Observations". Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297, 1967.
[46] B. Marr. "Big Data: 20 Mind-Boggling Facts Everyone Must Read". https://www.forbes.com/sites/bernardmarr/2015/09/30/big-data-20-mindboggling-facts-everyone-must-read/, 2015.
[47] J. Matousek. "On Approximate Geometric k-Clustering". Discrete & Computational Geometry, vol. 24, pp. 61–84, 2000.
[48] R. T. Ng and J. Han. "CLARANS: A Method for Clustering Objects for Spatial Data Mining". IEEE Transactions on Knowledge and Data Engineering, vol. 14, pp. 1003–1016, 2002.
[49] H. Park and C. Jun. "A Simple and Fast Algorithm for k-Medoids Clustering". Expert Systems with Applications, vol. 36(2), pp. 3336–3341, 2009.
[50] T. N. Phan and T. K. Dang. "A Lightweight Indexing Approach for Efficient Batch Similarity Processing with MapReduce". SN Computer Science, 1(1), 2020.
[51] J. Qiu and B. Zhang. "Mammoth Data in the Cloud: Clustering Social Images". Clouds, Grids, Big Data, vol. 23, p. 231, 2013.
[52] W. M. Rand. "Objective Criteria for the Evaluation of Clustering Methods". Journal of the American Statistical Association, vol. 66, 1971.
[53] C. E. Rasmussen. "The Infinite Gaussian Mixture Model". Advances in Neural Information Processing Systems 12, pp. 554–560, 1999.
[54] M. H. ur Rehman, C. S. Liew, A. Abbas, et al. "Big Data Reduction Methods: A Survey". Data Science and Engineering, vol. 1(4), pp. 265–284, https://doi.org/10.1007/s41019-016-0022-0, 2016.
[55] D. Reinsel, J. Gantz and J. Rydning. "Data Age 2025: The Evolution of Data to Life-Critical". IDC White Paper, International Data Corporation, 2017.
[56] F. Ros and S. Guillaume. "DENDIS: A New Density-based Sampling for Clustering Algorithm". Expert Systems with Applications, vol. 56, pp. 349–359, 2016.
[57] F. Ros and S. Guillaume. "DIDES: A Fast and Effective Sampling for Clustering Algorithm". Knowledge and Information Systems, vol. 50, pp. 543–568, 2017.
[58] F. Ros and S. Guillaume. "ProTraS: A Probabilistic Traversing Sampling Algorithm". Expert Systems with Applications, vol. 105, pp. 65–76, 2018.
[59] D. J. Rosenkrantz, R. E. Stearns and P. M. Lewis II. "An Analysis of Several Heuristics for the Traveling Salesman Problem". SIAM Journal on Computing, vol. 6, pp. 563–581, 1977.
[60] K. Scheinberg, S. Ma and D. Goldfarb. "Sparse Inverse Covariance Selection via Alternating Linearization Methods". NIPS'10: Proceedings of the 23rd International Conference on Neural Information Processing Systems, vol. 2, pp. 2101–2109, 2010.
[61] J. Shi and J. Malik. "Normalized Cuts and Image Segmentation". Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 22(8), pp. 888–905, 2000.
[62] L. H. Trang, P. V. Ngoan and N. V. Duc. "A Sample-Based Algorithm for Visual Assessment of Cluster Tendency (VAT) with Large Datasets". Future Data and Security Engineering, LNCS 11251, pp. 145–157, 2018.
[63] W. F. d. l. Vega, M. Karpinski, C. Kenyon and Y. Rabani. "Approximation Schemes for Clustering Problems". Proceedings of the 35th Annual ACM Symposium on Theory of Computing, pp. 50–58, 2003.
[64] D. Xu and Y. Tian. "A Comprehensive Survey of Clustering Algorithms". Annals of Data Science, vol. 2(2), pp. 165–193, https://doi.org/10.1007/s40745-015-0040-1, 2015.
[65] R. Xu and D. Wunsch II. "Survey of Clustering Algorithms". IEEE Transactions on Neural Networks, vol. 16(3), 2005.
[66] X. Xu, M. Ester, H. Kriegel and J. Sander. "A Distribution-based Clustering Algorithm for Mining in Large Spatial Databases". Proceedings of the Fourteenth International Conference on Data Engineering, pp. 324–331, 1998.
[67] R. B. Zadeh and S. Ben-David. "A Uniqueness Theorem for Clustering". Uncertainty in Artificial Intelligence (UAI), 2009.
[68] T. Zhang, R. Ramakrishnan and M. Livny. "BIRCH: An Efficient Data Clustering Method for Very Large Databases". ACM SIGMOD Record, 25, pp. 103–104, 1996.
[69] H. Zou, Y. Yu, W. Tang and H. M. Chen. "FlexAnalytics: a Flexible Data Analytics Framework for Big Data Applications with I/O Performance Improvement". Big Data Research, vol. 1, pp. 4–13, 2014.
[70] Nguyen Le Hoang, Tran Khanh Dang and Le Hong Trang. "A Comparative Study of the Use of Coresets for Clustering Large Datasets". Future Data and Security Engineering (FDSE), LNCS 11814, pp. 45–55, 2019.
[71] Le Hong Trang, Nguyen Le Hoang and Tran Khanh Dang. "A Farthest-First-Traversal-based Sampling Algorithm for k-clustering". International Conference on Ubiquitous Information Management and Communication (IMCOM), 2020.

CURRICULUM VITAE

I. PERSONAL INFORMATION
• Full name: NGUYEN LE HOANG
• Date of birth: March 12, 1988
• Place of birth: Ho Chi Minh City
• Address: 599 Nguyen Trai street, ward 7, district 5, Ho Chi Minh City
• Phone: 0938 744 938
• Email: nlhoang@hcmut.edu.vn

II. EDUCATION
• 2008 - 2013: Bachelor in Information Systems, Advanced Program, University of Information Technology, Vietnam National University - HCMC
• 2017 - now: Master in Computer Science, University of Technology, Vietnam National University - HCMC

III. EXPERIENCE
• 05/2013 - 02/2014: Developer, Vigilant Video - Vigilant Solutions
• 05/2014 - 09/2018: Startup, Orochi Ltd. Co.
• 10/2019 - now: Research Staff, AC Lab, University of Technology, Vietnam National University - HCMC

IV. PUBLICATIONS
• "A Fast Algorithm for Predicting Topics of Scientific Papers based on Co-Authorship Graph Model". Truong Hoang Nhut, Nguyen Le Hoang, Do Phuc. Advanced Methods for Computational Collective Intelligence, ICCCI 2012.
• "Predicting Preferred Topics of Authors Based on Co-Authorship Network". Nguyen Le Hoang, Pham Vu Dang Khoa, Do Phuc. International Conference on Information and Communication Technologies, RIVF 2013.
• "Finding the Cluster of Actors in Social Network Based on the Topic of Messages". Tran Quang Hoa, Vo Ho Tien Hung, Nguyen Le Hoang, Ho Trung Thanh, Do Phuc. Asian Conference on Intelligent Information and Database Systems, ACIIDS 2014.
• "A Comparative Study of the Use of Coresets for Clustering Large Datasets". Nguyen Le Hoang, Le Hong Trang, Tran Khanh Dang. International Conference on Future Data and Security Engineering, FDSE 2019.
• "A Farthest-First-Traversal-based Sampling Algorithm for k-clustering". Le Hong Trang, Nguyen Le Hoang, Tran Khanh Dang. International Conference on Ubiquitous Information Management and Communication, IMCOM 2020.

Ho Chi Minh City, February 14, 2020


Table of Contents

• Acknowledgements
• Abstract
• List of Figures
• List of Tables
• List of Algorithms
• Introduction
  • Overview
  • The Scope of Research
  • Research Contributions
    • Scientific Significance
    • Practical Significance
  • Organization of Thesis
  • Publications relevant to this Thesis
• Background and Related Works
  • k-Means and k-Means++ Clustering
    • k-Means Clustering
    • k-Means++ Clustering
  • Coresets
    • Definition
    • Some Coreset Constructions
  • Apache Spark
    • What is Apache Spark?
    • Why Apache Spark?
  • Bounds on Sample Complexity of Learning
• FFT-based Coresets
  • Farthest-First-Traversal Algorithm
  • FFT-based Coresets for k-Median and k-Means Clustering
