Developing a Hadoop-based distributed system for metagenomic binning

VIETNAM NATIONAL UNIVERSITY OF HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING

THESIS
DEVELOPING A HADOOP-BASED DISTRIBUTED SYSTEM FOR METAGENOMIC BINNING

Major: Computer Engineering
Committee: Computer Engineering (English Program)
Supervisor: Assoc. Prof. Dr. Tran Van Hoai
Reviewer: Assoc. Prof. Dr. Thoai Nam

Student 1: Tran Duong Huy (1752242)
Student 2: Pham Nhat Phuong (1752042)
Student 3: Nguyen Huu Trung Nhan (1752392)

VIETNAM NATIONAL UNIVERSITY OF HO CHI MINH CITY - UNIVERSITY OF TECHNOLOGY
Faculty: Computer Science and Engineering
Department: Computer Science
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness

GRADUATION THESIS ASSIGNMENT
Note: students must attach this sheet to the thesis report.

Student IDs: 1752242, 1752042, 1752392
Thesis title: Developing a Hadoop-based Distributed System for Metagenomic Binning
Tasks (content requirements and initial data):
- Study metagenomic binning and the BiMeta algorithm.
- Study Hadoop and set up a demonstration system.
- Develop the BiMeta algorithm on Hadoop.
- Build an interface that lets users run BiMeta on the Hadoop computing system.
- Evaluate the usability of Hadoop for the metagenomic binning problem on datasets used by the community.
Assignment date: …
Completion date: …
Supervisor: Assoc. Prof. Dr. Tran Van Hoai (supervises the entire thesis)
Declaration

We commit that our topic, "Developing a Hadoop-based distributed system for metagenomic binning", is our own thesis proposal. We declare that this work was carried out with our own effort and time, under the guidance of our supervisor, Assoc. Prof. Dr. Tran Van Hoai. All research results were produced by ourselves and not copied from any other source. If there is any evidence of plagiarism, we will take responsibility for all consequences.

Ho Chi Minh City, 2020
Tran Duong Huy, Pham Nhat Phuong, Nguyen Huu Trung Nhan

Acknowledgement

This thesis could not have been completed without the help and support of many others. First and foremost, we would like to express our sincere gratitude to our supervisor, Assoc. Prof. Dr. Tran Van Hoai. His insight and expert knowledge in the field improved our research work. He also helped us organize our thinking in research and in writing the thesis. We sincerely thank the teachers of the Faculty of Computer Science and Engineering at Ho Chi Minh City University of Technology for their enthusiasm in imparting knowledge during our time at the university. The knowledge accumulated throughout that learning process helped us complete this thesis. Finally, we wish our teachers and our supervisor good health and success in their noble careers.

Abstract

Bioinformatics is a research area in which biological data is vast, extensive, and complex. Biological data are constantly evolving and often unlabeled, so a controlled test method cannot be used. One of the most difficult and urgent problems in this area is detecting new symptoms of a virus, or at least grouping them together (for example, determining the proximity of SARS-CoV-2 to the bat virus). One way to approach the problem is to create scalable clustering tools that can handle very large amounts of data. Genomics and next-generation technologies such as Illumina and Roche 454 are producing 200 billion kits a week, transferring 60 thousand genes, and call for efficient computers. Our goal in this thesis, inspired by previous research, is to create a Hadoop-based tool for metagenomic binning.

Contents

Declaration
Acknowledgement
Abstract
1 Introduction
  1.1 Overview of this thesis
  1.2 Scope and Objectives
  1.3 Thesis Outline
2 Background
  2.1 Metagenomic
    2.1.1 Background
    2.1.2 Basic Concepts
    2.1.3 Metagenomic Binning
  2.2 Hadoop
    2.2.1 Hadoop Components
    2.2.2 HDFS - Hadoop Distributed File System
    2.2.3 Hadoop MapReduce
    2.2.4 YARN
  2.3 Spark
    2.3.1 What is Spark?
    2.3.2 Introduction to Spark
    2.3.3 How does Spark run on a cluster?
3 Related Work
  3.1 A two-phase binning algorithm using l-mer frequency on groups of non-overlapping reads
    3.1.1 Background
    3.1.2 Method
    3.1.3 Fundamentals of proposed method
    3.1.4 Datasets
    3.1.5 Result
    3.1.6 Conclusion
  3.2 Libra: scalable k-mer–based tool for massive all-vs-all metagenome comparisons
    3.2.1 Background
    3.2.2 Method
    3.2.3 Datasets and Hadoop cluster configuration
    3.2.4 Result
4 Methodology
  4.1 BiMetaReduce
    4.1.1 Step 1: Load Fasta file
    4.1.2 Step 2: Create Document
    4.1.3 Step 3: Create Corpus
    4.1.4 Step 4: Build Overlap Graph
    4.1.5 Steps 5 and 6: Find Connected Component and Clustering
  4.2 System Design
    4.2.1 Overview
    4.2.2 Proposed System Architecture
    4.2.3 Devices and Components
    4.2.4 Web Application Design
5 Experiments and Evaluation
  5.1 Datasets
  5.2 Experiments
6 Conclusions
References

List of Figures

2.1 Reads and Sequence
2.2 Hadoop Components
2.3 A Basic Hadoop Cluster
2.4 Upload A File To HDFS
2.5 Read A File From HDFS
2.6 The first few lines of the file S8.fna
2.7 The pipeline of the phase Reading Fasta File
2.8 Map-Reduce Logical Data Flow for Reading Fasta File
2.9 Hadoop YARN - How YARN manages a running job
2.10 Spark's toolkit
2.11 The architecture of a Spark Application
2.12 A narrow dependency
2.13 A wide dependency
2.14 A cluster driver and worker (no Spark Application yet)
2.15 Spark's Cluster mode
2.16 Spark's Client mode
2.17 Requesting resources for a driver
2.18 Launching the Spark Application
2.19 Application execution
2.20 Shutting down the application
3.1 Binning process of BiMeta
3.2 The Libra workflow
3.3 Scalability testing for Libra
4.1 Workflow of the BiMetaReduce program
4.2 Overview of the proposed system
4.3 Python & Django frameworks

Sequence Diagram

(a) Login and Register Sequence Diagram
Figure 4.7: Login and Register Sequence Diagram
(b) Homepage, "About" page and "Your Project" page Sequence Diagram
Figure 4.8: Homepage, "About" page and "Your Project" page Sequence Diagram
(c) "System" page Sequence Diagram
Figure 4.9: "System" page Sequence Diagram

User Interface

(a) "Homepage"
Figures 4.10, 4.11, and 4.12: Homepage
(b) "System" page
Figure 4.13: "System" page
(c) "Your Project" page
Figure 4.14: "Your Project" page UI
(d) "About" page
Figure 4.15: "About" page UI

Chapter 5: Experiments and Evaluation

5.1 Datasets

Simulated datasets are widely used to evaluate the performance of binning algorithms because standard metagenomic datasets are lacking. In this thesis, we planned to use the simulated datasets from the BiMeta paper to compare performance, but because of limited resources we could not run the algorithm on all of them. There are 25 synthetic datasets used in our experiments. Table 5.1 presents the datasets that contain Roche 454 single-end long reads with a length of approximately 700 bp and a sequencing error rate of 1% (denoted R1 to R9). The paired-end short reads, with a length of approximately 80 bp, were created following the Illumina error profile with an error rate of 1% (denoted S1 to S8 and L1 to L6) and are shown in Table 5.2 and Table 5.3.

Samples  No. of species  Phylogenetic distance                        Ratio         No. of reads
R1       2               Species                                      1:1           82960
R2       2               Genus                                        1:1           77293
R3       2               Genus                                        1:1           93267
R4       2               Family                                       1:1           34457
R5       2               Family                                       1:1           40043
R6       2               Order                                        1:1           70550
R7       3               Family and Order                             1:1:8         290473
R8       3               Family and Phylum                            1:1:8         374830
R9       6               Species, Order, Family, Phylum, and Kingdom  1:1:1:1:2:14  588258
Table 5.1: Simulated datasets of long reads

Samples  No. of species  Phylogenetic distance       Ratio       No. of reads
S1       2               Species                     1:1         96367
S2       2               Species                     1:1         195339
S3       2               Order                       1:1         338725
S4       2               Phylum                      1:1         375302
S5       3               Species, Family             1:1:1       325400
S6       3               Phylum, Kingdom             3:2:1       713388
S7       5               Order, Order, Genus, Order  1:1:4:4     1653550
S8       5               Genus, Order, Order, Order  3:5:7:9:11  456224
Table 5.2: Simulated datasets of short reads (S*)

Samples  No. of species  Phylogenetic distance  Ratio  No. of reads
L1       2               Class                  1:1    176688
L2       2               Class                  1:2    259568
L3       2               Class                  1:3    342448
L4       2               Class                  1:4    425328
L5       2               Class                  1:5    508209
L6       2               Class                  1:6    591089
Table 5.3: Simulated datasets of short reads (L*)

5.2 Experiments

Figure 5.1 compares the runtime of the BiMeta (sequential) and BiMetaReduce (MapReduce and Spark) programs. The BiMetaReduce program runs on a 3-node system (1 master and 2 worker nodes). BiMetaReduce takes more time to run than the sequential program. This follows from the nature of Hadoop and MapReduce: a MapReduce program runs across many worker nodes and reads and writes its data through HDFS, which is backed by hard drives, while the sequential program loads the data into random-access memory for the computation.

In our system, the user can set a different run type (Hadoop or sequential) for each step of the computation. The experiment for this function configures various scenarios and checks whether the results are the same. Figure 5.2 illustrates the runtime of the different settings, where each step can run sequentially or in a distributed manner. The experiment is run on a subset of the R4 data file (235 reads from group 1 and 351 reads from group 2). The settings are shown in Table 5.4 (✓ means the step runs in Hadoop and blank means it runs sequentially). The maximum runtime occurs when the program runs fully distributed and the minimum when it runs fully sequentially. Although each step can run either sequentially or distributed, the precision, recall, and F-measure of the settings are similar to each other.

Figure 5.3 compares the runtime of setting … and setting …, whereas Figure 5.4 gives the runtime of setting … and setting …. The results suggest that a step run in Hadoop takes more time, and from Figure 5.4 we found that the step that takes the longest to execute is step 5, finding connected components.

Figures 5.5, 5.6, and 5.7 show the graph options of our system. The Fruchterman-Reingold force-directed layout [3] is the default option and makes the connections in the metagenomic reads dataset easiest to see (Figure 5.5). The Kamada-Kawai force-directed algorithm [7] and the deterministic layout that places the vertices on a circle are shown in Figures 5.6 and 5.7, respectively.

Figure 5.1: Sequential and MapReduce Runtime

ID   Precision  Recall  F-measure
s1   0.5340     0.5990  0.5647
s2   0.5270     0.5990  0.5609
s3   0.5341     0.5990  0.5647
s4   0.5290     0.5990  0.5618
s5   0.5324     0.5990  0.5637
s6   0.5205     0.5990  0.5570
s7   0.5068     0.5990  0.5491
s8   0.5410     0.5990  0.5685
s9   0.5512     0.5990  0.5741
s10  0.5102     0.5990  0.5511
Step columns: the per-setting ✓ marks are not recoverable from the extracted text.
Table 5.4: Setting of the runtime experiment

A limitation of our experiments is that the scalability of the Hadoop and Spark modules has not been tested, due to the lack of available worker nodes. However, our system can still be scaled up and deployed on a cloud computing platform.

Figure 5.2: Runtime of different settings
Figure 5.3: Runtime of setting … and setting …
Figure 5.4: Runtime of setting … and setting …
Figure 5.5: Fruchterman-Reingold graph of the R4 file
Figure 5.6: Kamada-Kawai graph of the R4 file
Figure 5.7: Circular graph of the R4 file

Chapter 6: Conclusions

In this chapter, we summarize the work done in this thesis and outline our future plans. We learned the fundamentals of genes and genomes and studied genomic technologies. We also studied the foundations of the Hadoop distributed processing framework for data processing and storage in big data applications, and we learned about metagenomic clustering problems. We proposed the architecture of a Hadoop-based distributed system for the metagenomic binning problem. The system is implemented using Hadoop, Spark, and Django for the back end and HTML, CSS, and JavaScript for the front end. It provides features ranging from uploading a genomic data file to computing the clustering. Finally, we ran experiments to test our system, and the results are promising: BiMeta and BiMetaReduce return similar output. In the future, we will learn how to optimize the MapReduce program, deploy the system on the cloud, and experiment with many computing nodes.

References

[1] Illyoung Choi et al. "Libra: scalable k-mer–based tool for massive all-vs-all metagenome comparisons". In: GigaScience 8.2 (Dec. 2018), giy165. ISSN: 2047-217X. DOI: 10.1093/gigascience/giy165. eprint: https://academic.oup.com/gigascience/article-pdf/8/2/giy165/27654619/giy165.pdf. URL: https://doi.org/10.1093/gigascience/giy165
[2] Supratim Choudhuri. "Bioinformatics for Beginners". In: Bioinformatics for Beginners. Ed. by Supratim Choudhuri. Oxford: Academic Press, 2014, pp. 209–218. ISBN: 978-0-12-410471-6. DOI: 10.1016/B978-0-12-410471-6.00009-8. URL: http://www.sciencedirect.com/science/article/pii/B9780124104716000098
[3] Thomas M. J. Fruchterman and Edward M. Reingold. "Graph Drawing by Force-Directed Placement". In: Softw. Pract. Exper. 21.11 (Nov. 1991), pp. 1129–1164. ISSN: 0038-0644. DOI: 10.1002/spe.4380211102. URL: https://doi.org/10.1002/spe.4380211102
[4] Arpita Ghosh, Aditya Mehta, and Asif M. Khan. "Metagenomic Analysis and its Applications". In: Encyclopedia of Bioinformatics and Computational Biology. Ed. by Shoba Ranganathan et al. Oxford: Academic Press, 2019, pp. 184–193. ISBN: 978-0-12-811432-2. DOI: 10.1016/B978-0-12-809633-8.20178-7. URL: http://www.sciencedirect.com/science/article/pii/B9780128096338201787
[5] Samuele Girotto, Cinzia Pizzi, and Matteo Comin. "MetaProb: accurate metagenomic reads binning based on probabilistic sequence signatures". In: Bioinformatics (2016). DOI: 10.1093/bioinformatics/btw466
[6] D. H. Huson et al. "MEGAN analysis of metagenomic data". In: Genome Research (Oct. 2007), pp. 377–386. DOI: 10.1101/gr.5969107
[7] T. Kamada and S. Kawai. "An Algorithm for Drawing General Undirected Graphs". In: Inf. Process. Lett. 31.1 (Apr. 1989), pp. 7–15. ISSN: 0020-0190. DOI: 10.1016/0020-0190(89)90102-6. URL: https://doi.org/10.1016/0020-0190(89)90102-6
[8] National Research Council (US) Committee on Metagenomics: Challenges and Functional Applications. "The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet". Washington (DC): National Academies Press (US), 2007. URL: https://www.ncbi.nlm.nih.gov/books/NBK54011/
[9] Rachid Ounit et al. "CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers". In: BMC Genomics (Jan. 2015). DOI: 10.1186/s12864-015-1419-2
[10] Nicola Segata et al. "Metagenomic microbial community profiling using unique clade-specific marker genes". In: Nature Methods (2012). DOI: 10.1038/nmeth.2066
[11] J. Shendure and H. Ji. "Next-generation DNA sequencing". In: Nat. Biotechnol. (Oct. 2008), pp. 1135–1145. DOI: 10.1038/nbt1486
[12] Le Van Vinh et al. "A two-phase binning algorithm using l-mer frequency on groups of non-overlapping reads". In: Algorithms for Molecular Biology (2015). DOI: 10.1186/s13015-014-0030-4
[13] Yi Wang et al. "MetaCluster 5.0: a two-round binning approach for metagenomic data for low-abundance species in a noisy sample". In: Bioinformatics (2012). DOI: 10.1093/bioinformatics/bts397
[14] D. E. Wood and S. L. Salzberg. "Kraken: ultrafast metagenomic sequence classification using exact alignments". In: Genome Biol. (Nov. 2013). DOI: 10.1186/gb-2014-15-3-r46
[15] Yu-Wei Wu and Yuzhen Ye. "A novel abundance-based algorithm for binning metagenomic sequences using l-tuples". In: Journal of Computational Biology (2011). DOI: 10.1089/cmb.2010.0245
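Table 5.4 reports precision, recall, and F-measure for each setting, but the text does not spell out how they are computed. The Python sketch below illustrates one definition that is common in the binning literature, and it is an assumption that the thesis uses the same one: precision credits each cluster with its dominant species, recall credits each species with the cluster that holds most of its reads (unassigned reads lower recall), and the F-measure is the harmonic mean of the two. The function name `binning_scores` and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def binning_scores(true_labels, cluster_ids):
    """Return (precision, recall, f_measure) for one binning result.

    true_labels[i] is the species of read i; cluster_ids[i] is the cluster
    assigned to read i, or None when the read was left unassigned.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("true_labels and cluster_ids must have the same length")

    # counts[cluster][species] = number of reads of that species in the cluster
    counts = defaultdict(Counter)
    for species, cluster in zip(true_labels, cluster_ids):
        if cluster is not None:
            counts[cluster][species] += 1

    assigned = sum(sum(c.values()) for c in counts.values())
    if assigned == 0:
        return 0.0, 0.0, 0.0

    # Precision: each cluster is credited with its dominant species.
    precision = sum(max(c.values()) for c in counts.values()) / assigned

    # Recall: each species is credited with the cluster holding most of its
    # reads; unassigned reads count against recall through the denominator.
    best_per_species = Counter()
    for c in counts.values():
        for species, n in c.items():
            best_per_species[species] = max(best_per_species[species], n)
    recall = sum(best_per_species.values()) / len(true_labels)

    # F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy example (hypothetical data): six reads, two species, one unassigned read.
p, r, f = binning_scores(["A", "A", "A", "B", "B", "B"], [0, 0, 1, 1, 1, None])
print(round(p, 4), round(r, 4), round(f, 4))  # prints: 0.8 0.6667 0.7273
```

As a consistency check on the harmonic-mean step only, setting s1 in Table 5.4 has precision 0.5340 and recall 0.5990, which gives 2PR/(P+R) ≈ 0.5646, matching the reported 0.5647 to within rounding of the inputs.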

Title	Developing A Hadoop-Based Distributed System For Metagenomic Binning
Authors	Tran Duong Huy, Pham Nhat Phuong, Nguyen Huu Trung Nhan
Supervisor	Assoc. Prof. Dr. Tran Van Hoai
University	Vietnam National University of Ho Chi Minh City
Major	Computer Engineering
Type	thesis
City	Ho Chi Minh City

Format
Pages	69
File size	2.92 MB