COMPLEX QUERY PROCESSING AND RECOVERY IN DISTRIBUTED SYSTEMS

SHEN YANYAN
Bachelor of Science, Peking University, China

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF COMPUTER SCIENCE
NATIONAL UNIVERSITY OF SINGAPORE
2015

DECLARATION

I hereby declare that this thesis is my original work and it has been written by me in its entirety. I have duly acknowledged all the sources of information which have been used in the thesis. This thesis has not been submitted for any degree in any university previously.

Shen Yanyan
August 11, 2015

ACKNOWLEDGMENT

I want to express my sincere gratitude to my supervisor, Prof. Beng Chin Ooi, for his continuous guidance and support over the past five years. I knew little about research when I started my PhD study. It was Prof. Ooi who taught me how to become a good researcher and enlightened me on challenging research problems. No matter how busy he is, he has always been available to answer my questions and offer his wise advice. I am very grateful to his encouragement when my papers got rejected and his forgiveness to my poor written English.

I would like to thank Divesh Srivastava, Luna Xin Dong, Laks V.S. Lakshmanan, Luciano Barbosa, my mentors during my summer internships at AT&T Lab in year 2011 and 2012. They taught me valuable research skills and right working attitude. Thank you to Divesh, for innumerable technical discussions, informal chats about life and insightful advice on our research projects. I would also like to thank Kaushik Chakrabarti, Surajit Chaudhuri, Bolin Ding, my internship mentors at Microsoft Research Redmond, for their guidance and support on the search problem. It has been such a pleasure working with all of my mentors. In addition, I would like to thank all the interns I met at AT&T Lab and Microsoft DMX group. Without them, I would not have had such great and productive summers.

I would like to thank my thesis committee members, Prof. Kian-Lee Tan and Prof. Chee-Yong Chan, for their helpful suggestions and insightful comments on this dissertation.

I would like to thank all my colleagues in the database group for their company during my entire PhD life. Special thanks to Prof. Wei Lu, who helped me through all the three thesis works and provided helpful advice. Thanks to my seniors, Sai Wu, Su Chen, Shanshan Ying, Ju Fan, Xuan Liu, Meihui Zhang, Meiyu Lu, Feng Li, Peng Lu, and my junior fellows, Jinyang Gao, Sheng Wang, Qian Lin, for their assistance and support to my research and life.

I am always grateful to my long-term house mates, Jingwen Bian, Chao Chen, Xiao Liu, Guanfeng Wang, Jing Yang and Jie Yang, who have shared many exciting and joyful day and night with me. Thank all of you for the assistance to my life and putting up with my bad temper.

I would like to thank my best friends, Qi Sun, Minhui Xu, Chengyuan Yang and Yiqing Wu, who were shocked by my intention to pursue a PhD degree and missing me all the time when I am in Singapore. We have known each other for over 12 years and I believe our friendship will live forever.

Finally, I want to express my deepest gratitude to my parents for their endless love, support, understanding and encouragement to me.

CONTENTS

Acknowledgment
Abstract
1 Introduction
  1.1 Brief Review of Distributed Systems
  1.2 Research Challenges in Distributed Systems
    1.2.1 Overview
    1.2.2 Complex Query Processing
    1.2.3 Resilience to Failures
  1.3 Objective and Contributions
    1.3.1 k Nearest Neighbor Join
    1.3.2 Efficient Graph Processing Engine
    1.3.3 Recovery in Distributed Graph Processing Systems
  1.4 Synopsis of the Thesis
2 Literature Review
  2.1 Answering k Nearest Neighbor Join Query
    2.1.1 Objects under Metric Space
    2.1.2 Existing Solutions to kNN Join
  2.2 Advanced Distributed Graph Processing Systems
    2.2.1 Synchronous Graph Processing
    2.2.2 Asynchronous Graph Processing
  2.3 Recovery Mechanisms in Distributed Systems
    2.3.1 Modeling Failures
    2.3.2 Failure Recovery
    2.3.3 Summary
3 kNN Join using MapReduce Framework
  3.1 Introduction
  3.2 Preliminaries
    3.2.1 kNN Join
    3.2.2 Voronoi Diagram-based Partitioning
    3.2.3 MapReduce Framework and epiC
  3.3 An Overview of kNN Join Using MapReduce
  3.4 Handling kNN Join Using MapReduce
    3.4.1 Data Preprocessing
    3.4.2 First MapReduce Job
    3.4.3 Second MapReduce Job
  3.5 Minimizing Replication of S
    3.5.1 Cost Model
    3.5.2 Grouping Strategies
  3.6 Experimental Evaluation
    3.6.1 Study of Parameters of Our Techniques
    3.6.2 Effect of k
    3.6.3 Effect of Dimensionality
    3.6.4 Scalability
    3.6.5 Speedup
  3.7 Summary
4 epiCG: An Efficient Distributed Graph Engine on epiC
  4.1 Introduction
    4.1.1 Issues and Opportunities
    4.1.2 Our Solution and Contributions
  4.2 Overview of epiCG
  4.3 Implementation Details
    4.3.1 Distributed Graph Structure
    4.3.2 Graph Loading and Output
    4.3.3 Iterative Computation
  4.4 Fault Tolerance
  4.5 Experimental Evaluation
    4.5.1 Experiment Setup
    4.5.2 Benchmark Tasks and Datasets
    4.5.3 Effect of Vertex-cut Degree Threshold θ
    4.5.4 Scalability
    4.5.5 Speedup
  4.6 Summary
5 Failure Recovery in epiCG
  5.1 Introduction
  5.2 Preliminaries
    5.2.1 Background of epiCG
    5.2.2 Failure Recovery in epiCG
  5.3 Partition-based Recovery
    5.3.1 Recomputing Failed Partitions
    5.3.2 Handling Cascading Failures
    5.3.3 Correctness and Completeness
  5.4 Reassignment Generation
    5.4.1 Estimation of T_low
    5.4.2 Cost-Sensitive Reassignment Algorithm
  5.5 Implementation
    5.5.1 A Brief Review of epiCG
    5.5.2 Major APIs
    5.5.3 Implementation Details in epiCG
  5.6 Experimental Evaluation
    5.6.1 Experiment Setup
    5.6.2 Benchmark Tasks and Datasets
    5.6.3 k-means
    5.6.4 Semi-clustering
    5.6.5 PageRank
    5.6.6 Simulation Study
  5.7 Summary
6 Conclusion
  6.1 Conclusion
  6.2 Future Work
Bibliography

[...]

... effective and efficient approaches to two challenging issues in distributed systems: complex query processing and fault tolerance. More specifically, we first focus on answering a complex analytics query, k nearest neighbor join, in a distributed manner. We then propose an efficient graph processing engine to handle graph-related analytics queries. Finally, we address the recovery problem in distributed systems ...
...ware used in distributed systems, etc. The development of distributed systems should be able to guarantee the anonymity of sensitive data and the correctness of computation results. In this thesis, we mainly focus on two challenging issues in distributed systems: complex query processing and fault tolerance.

1.2.2 Complex Query Processing

To discover the value of Big Data, modern distributed systems such ...

... effective and efficient solutions to address two challenging issues in epiC: complex query processing and failure recovery. We employ epiC as our underlying distributed system due to its simplicity, efficiency and extensibility, but our approaches can be implemented in other distributed systems as well. For the query processing, we first focus on the problem of answering k nearest neighbor join queries in epiC ...

... challenges in distributed systems. We then elaborate on two important challenges: complex query processing and resilience to failures.

1.2.1 Overview

In order to ensure that distributed systems are efficient, scalable and reliable, we have to address the following challenging issues.

• Storage. Data storage is a fundamental challenge in distributed systems. Data processed by distributed systems ...

... in distributed systems. Typically, we consider two kinds of queries: offline data analytics queries and online transactional queries. In general, query processing in distributed systems has to address several problems: correctness, efficiency, scalability, accuracy and speedup. Noting that there does not exist a distributed system that can fit all requirements with one size, different kinds of distributed systems ...
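To make the k nearest neighbor (kNN) join named in these excerpts concrete, the following is a minimal single-machine sketch of the operation itself: for every object r in a set R, return the k objects of a set S that are closest to r. It is a naive baseline that examines every pair, not the distributed method the thesis develops; the function name `knn_join`, the use of Euclidean distance, and the sample points are all assumptions made for illustration.

```python
import heapq
import math


def knn_join(R, S, k):
    """Naive kNN join: map each index i of R to the k points of S nearest to R[i].

    Assumes points are equal-length tuples of floats and uses Euclidean distance.
    Every (r, s) pair is examined, so the cost grows with |R| * |S|.
    """
    result = {}
    for i, r in enumerate(R):
        # nsmallest returns the k candidates with the smallest distance to r,
        # ordered from nearest to farthest.
        result[i] = heapq.nsmallest(k, S, key=lambda s: math.dist(r, s))
    return result


# Hypothetical sample data.
R = [(0.0, 0.0), (5.0, 5.0)]
S = [(1.0, 0.0), (0.0, 2.0), (4.0, 5.0), (9.0, 9.0)]
joined = knn_join(R, S, k=2)
print(joined)
```

The quadratic number of distance computations in this baseline is the cost that partitioning and pruning techniques aim to reduce.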
... the above two challenging issues (i.e., complex query processing and failure recovery), in this thesis, we first study the problem of answering k nearest neighbor join query in epiC. We then extend epiC and develop an efficient graph processing engine, called epiCG, on top of epiC, to handle graph analytics queries efficiently. For the recovery issue, the traditional checkpoint-based recovery method works ...

... kNN join [71], but incurs long recovery latency for the iterative graph analytics tasks. Hence, we propose a novel parallel recovery mechanism and implement it in epiCG to accelerate the recovery process. In the remainder of this chapter, we first review several advanced distributed systems. We then present research challenges in distributed systems and provide background of complex query processing and ...

... ReduceUnits) and introduce several pruning rules to eliminate the examination of dissimilar object pairs.

Contributions. Our proposed method is the first distributed solution for answering kNN join query. Compared with the existing index-based approaches [15, 16], our distributed solution allows us to perform pair-wise examinations for candidate object pairs in parallel, thus accelerating the processing of kNN join ...

... rollback and re-execute the lost computation since the latest checkpoint. To address the problem, we study the problem of efficient failure recovery in distributed graph processing systems. We first formalize the failure recovery problem in graph processing systems. We then propose a novel partition-based recovery method to parallelize the failure recovery processing. Different from the traditional checkpoint-based ...
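The rollback-and-re-execute behaviour of checkpoint-based recovery described in the last excerpt can be illustrated with a small single-process simulation. This is a sketch under stated assumptions, not epiCG's implementation: the names `run_with_checkpoints`, `step`, `interval` and `fail_at` are hypothetical, the "computation" is a toy iterative update, and the failure is simulated once at a chosen superstep.

```python
import copy


def run_with_checkpoints(state, step, num_steps, interval, fail_at=None):
    """Run `step` for `num_steps` supersteps, checkpointing every `interval` steps.

    If `fail_at` is given, a single failure is simulated right after that
    superstep: the run rolls back to the latest checkpoint and re-executes.
    Returns the final state and the number of supersteps that had to be redone.
    """
    checkpoint = (0, copy.deepcopy(state))  # (superstep, saved state)
    i = 0
    reexecuted = 0
    while i < num_steps:
        state = step(state)
        i += 1
        if i % interval == 0:
            checkpoint = (i, copy.deepcopy(state))
        if fail_at is not None and i == fail_at:
            fail_at = None  # fail only once
            reexecuted = i - checkpoint[0]  # work lost since the checkpoint
            i, state = checkpoint[0], copy.deepcopy(checkpoint[1])
    return state, reexecuted


def step(values):
    # Toy superstep: every value is incremented by one.
    return [v + 1 for v in values]


clean_state, clean_redo = run_with_checkpoints([0, 10], step, 10, interval=4)
failed_state, failed_redo = run_with_checkpoints([0, 10], step, 10, interval=4, fail_at=7)
print(clean_state, clean_redo, failed_state, failed_redo)
```

In this toy run the failure after superstep 7 forces supersteps 5 to 7 to be repeated from the checkpoint taken at superstep 4, which is the recovery latency that grows with the checkpoint interval.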
... vertex-cut generation. In terms of fault tolerance, epiCG achieves automatic failure detection and recovery. We compare epiCG with two advanced distributed graph processing systems, Giraph [2] and PowerGraph [36]. The results illustrate the high efficiency and scalability of epiCG.

1.3.3 Recovery in Distributed Graph Processing Systems

In the third piece of this thesis, we focus on the recovery issue in epiC/epiCG ...

... Facebook, LinkedIn), spatial networks (e.g., Google Maps, FedEx) and the Web. Querying and mining large graphs are becoming increasingly important in many real applications. Examples include two-hop friend ...

... We then introduce our graph processing ...