MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

MASTER'S THESIS

BUILDING A PLATFORM FOR MANAGING, ANALYZING, AND SHARING BIOMEDICAL BIG DATA

Master student: Dao Dang Toan
Supervisors: Dr. Nguyen Thanh Huong, Assoc. Prof. Dr. Dao Trung Kien

A thesis submitted in fulfilment of the requirements for the degree of Master of Science in the Pervasive, Space and Interaction Department, International Research Institute MICA.

Hanoi – 2021

SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness

CONFIRMATION OF MASTER'S THESIS REVISION

Author: Dao Dang Toan
Thesis title: Xây dựng nền tảng quản lý, phân tích, chia sẻ dữ liệu lớn y sinh học (Building a platform for managing, analyzing, and sharing biomedical big data)
Major: Computer Science
Student ID: CB190241

The author, the scientific supervisors, and the thesis examination committee confirm that the author has revised and supplemented the thesis according to the minutes of the committee meeting on 21/09/2021, with the following contents:
- Revise and restructure the thesis, clarifying the author's own contributions and taking into account the supervisors' feedback.
- Add the obtained results to the thesis.

October 18, 2021
Supervisor – COMMITTEE CHAIR – Author

THESIS TOPIC
Title (Vietnamese): Xây dựng nền tảng quản lý, phân tích, chia sẻ dữ liệu lớn y sinh học
Title (English): Building a platform for managing, analyzing and sharing biomedical big data
Supervisor

Acknowledgement

It is an honor for me to write these words of thanks to those who have supported, guided and inspired me from the moment I started my work at the Vingroup Big Data Institute and the International Research Institute MICA until now, as I write my master's thesis.

I owe my deepest gratitude to my supervisor, Dr. Nguyen Thanh Huong. Her expertise, understanding and generous guidance made it possible for me to work on a new topic. She has supported me in many ways in finding solutions to my problems, and it has been a pleasure to work with her.

I would like to show my gratitude to Assoc. Prof. Dao Trung Kien and all members of the Pervasive Space
and Interaction Department for their guidance, which helped me a lot in learning how to study and research in the right way, and for their valuable advice on my work. Special thanks to Dr. Vo Sy Nam and my colleagues at the Vingroup Big Data Institute for their support; their suggestions helped keep my thesis in the right direction. Finally, this thesis would not have been possible without the encouragement of my family and friends. Their words gave me the strength to overcome all the discouragement and other difficulties. Thank you for everything that helped me reach this day.

Abstract

With the advancement of hardware and software technologies, the data explosion in biomedical research and healthcare systems in recent years has demanded urgent solutions for managing, analyzing and sharing data. In particular, research in omics science is moving from a hypothesis-driven approach to a data-driven approach. Additionally, the healthcare industry has always required tighter integration with biomedical data to promote personalized medicine and deliver better treatments. However, dealing with the huge amount of information generated every day requires complex solutions. Many solutions, from hardware to software, have been created to address the big data problem, such as high-performance computing (HPC) or solutions that utilize distributed computing and storage systems (Spark, Hadoop). Recognizing the challenges in managing biomedical data, we leveraged existing technologies to build a data management, analysis and sharing system that we call MASH.

Hanoi, October 18th, 2021
Dao Dang Toan

Table of Contents

CHAPTER 1. INTRODUCTION
1.1 Motivation
1.2 System's Main Objective
1.3 System Requirements
1.3.1 Functional Requirements
1.3.2 Non-functional Requirements
1.4 Main Contributions
CHAPTER 2. THEORETICAL BACKGROUND ON MASH CONSTRUCTION
2.1 Distribution of Data Samples
2.2 System Input Files
2.2.1 The FASTQ Format
2.2.2 The SAM/BAM Format
2.2.3 The VCF Format
2.3 Big Data
Technologies
2.3.1 Hadoop
2.3.2 Spark
2.3.3 Elasticsearch
2.3.4 Data Lake
2.3.5 Data Warehouse
2.3.6 Distributed Object Storage
2.4 Literature Review
2.4.1 Cloud-based Computing
2.4.2 Data Commons
2.4.3 Summary
CHAPTER 3. MASH SYSTEM DESIGN AND DEVELOPMENT
3.1 Solution Overview
3.2 Data Model
3.2.1 Graph Data Model
3.2.2 Document Data Model
3.3 Overall Architecture of the System
3.3.1 Overview of System Architecture
CHAPTER 4. SOLUTIONS TO SPEED UP DATA INSERTION AND QUERYING
4.1 Data Insertion
4.2 Data Querying
4.3 Application of Genetic Algorithm in Optimal Parameter Selection
4.3.1 Introduction to Genetic Algorithm
4.3.2 Parameter Tuning
CHAPTER 5. MASH CONSTRUCTION RESULTS
5.1 Test Environment
5.2 Result of Parameter Optimization by Genetic Algorithm
5.3 Insertion Performance
5.4 Query Performance
CONCLUSION AND PERSPECTIVES
1. Research Questions and Outcomes
2. Contributions and Perspectives
2.1 Contributions
2.2 Perspectives
PUBLICATIONS
REFERENCES

List of Figures

Figure 2.1: FASTQ format
Figure 2.2: VCF format
Figure 2.3: Data Warehouse Overview
Figure 2.4: Distributed Object Storage System Architecture [8]
Figure 3.1: MASH data model
Figure 3.2: MASH system architecture
Figure 3.3: Layer diagram – MASH system architecture
Figure 3.4: System authentication and authorization architecture
Figure 3.5: Workflow service architecture
Figure 4.1: Data insertion steps
Figure 4.2: Flat data type
Figure 4.3: Nested data type
Figure 4.4: Support data analysis and search by selecting filter options
Figure 4.5: Querying data interface
Figure 4.6: Representation of a parameter set
Figure 4.7: Specific value of a parameter set
Figure 4.8: Genetic Algorithm flow chart for parameter tuning [30]
Figure 5.1: Performance of data insertion phase
Figure 5.2: Query performance (10 CCR)
Figure 5.3: Query performance (100 CCR)
Figure 5.4: Query performance
(500 CCR)

List of Tables

Table 4.1: Query schema
Table 4.2: Parameters for tuning
Table 5.1: Configuration parameters of the server in the test environment
Table 5.2: Result of parameter optimization by Genetic Algorithm
Table 5.3: Insertion performance
Table 5.4: Test types
Table 5.5: Test cases
Table 5.6: Query performance (10 CCR)
Table 5.7: Query performance (100 CCR)
Table 5.8: Query performance (500 CCR)

List of Abbreviations

MASH – Management, Analysis, Sharing and Harmonization
FAIR – Findable, Accessible, Interoperable, Reusable
DGV4VN – the Database of Genomic Variants for Vietnamese population project
VM – Virtual Machine
DDoS – Distributed Denial of Service
CI/CD – Continuous Integration/Continuous Delivery
BCL – Base Call
SAM – Sequence Alignment Map
BAM – Binary Alignment Map
VCF – Variant Call Format
GUID – Globally Unique Identifier
DNA – Deoxyribonucleic Acid
RAM – Random Access Memory
ID – Identifier
ETL – Extract, Transform, Load
CRUD – Create, Read, Update, and Delete
DDBJ – DNA Data Bank of Japan
OSDC – the Open Science Data Cloud
OCC – the Open Commons Consortium
NCI – the National Cancer Institute
GDC – Genomic Data Commons
AAA – Administration, Authorization, and Authentication
CWL – Common Workflow Language
HTTP/HTTPS – HyperText Transfer Protocol / HyperText Transfer Protocol Secure
I/O – Input/Output
CPU – Central Processing Unit
SSD – Solid-State Drive
HDD – Hard Disk Drive
SNV – Single-Nucleotide Variant
LncRNA – Long non-coding RNA
CCR – Concurrent Requests
SQL – Structured Query Language
GA – Genetic Algorithm

CHAPTER 1. INTRODUCTION

1.1 Motivation

This thesis focuses on some of the problems that research projects on the human genome are facing. To better understand these issues, let us look at the following story of human genome research:
● In the 2000s, the first human genome project was completed after 13 years, with approximately 1,000 scientists involved. That project had cost billions of USD to decode the first human genome, and it has
a very high impact on the genomics field. The completion of this project was a major scientific event at the time.
● Today, thanks to technological development, decoding a human genome takes only a few days and approximately 1,000 USD.
● In the near future, perhaps within the next few years, decoding one human genome may cost no more than 100 USD and take only a few hours.

The cost of decoding a human genome has been dropping rapidly. This means genomics research will generate a huge amount of data, which raises challenges in data analysis and management, and those challenges will continue to grow. Many projects and systems have been established to address these problems. The Database of Genomic Variants for Vietnamese Population project (DGV4VN), funded by Vinbigdata, is one of them; within this project we built a biomedical big data platform named MASH to address the challenges of managing, analyzing and sharing biomedical big data.

1.2 System's Main Objective

At the time we started building the MASH system, we had a requirement to build a system holding over 1,200 terabytes of data. We needed to share part of that data with our partners and the research community, and the amount of data grows quickly over time. It was therefore necessary to build a scalable system that can store very large, continuously growing data, and that data must be Findable, Accessible, Interoperable, and Reusable. One of the biggest challenges of every big data platform is ensuring system performance, and high performance is also one of the goals of MASH.

1.3 System Requirements

1.3.1 Functional Requirements

MASH comprises four main functional groups: management, analysis, sharing and visualization of biomedical data. The specific functions are set out as follows:
- Integrate and manage data from various projects, ranging from the ongoing DGV4VN project to data obtained from future projects such as cancer genomics, other health
data, or projects of different fields.

[...]

(c) In the gene index: select and count all documents that have "high" values in the "impact" subfield of the "variant_list" nested field.

The above design of the data model helps avoid joining indexes, thereby improving data querying performance.

4.3 Application of Genetic Algorithm in Optimal Parameter Selection

In Section 4.1, several parameters that can affect the performance of data insertion into the Elasticsearch database were proposed:
- bulk_size: size of bulk requests.
- n_workers: number of workers/threads used to insert data into Elasticsearch.
- n_shards: number of shards of each Elasticsearch index.
- n_replicas: number of replicas of each Elasticsearch index.
- n_masters: number of master nodes in the Elasticsearch cluster.
- n_data: number of data nodes in the Elasticsearch cluster.

The search space of potential combinations of these parameters is almost limitless. In such cases, heuristic methods prove more effective than exhaustive search, so a genetic algorithm is a suitable choice. In this section, the genetic algorithm (GA) is briefly described; then its application to finding optimal parameters for inserting data into, and querying data from, the Elasticsearch database is presented in detail.

4.3.1 Introduction to Genetic Algorithm

The Genetic Algorithm (GA) belongs to the larger class of Evolutionary Algorithms and follows the Darwinian theory of natural selection [21]. Genetic algorithms are mostly used to solve problems with a very large search space; the Darwinian "survival of the fittest" principle is their basis [22]. A GA includes operations such as mutation, crossover and selection:
- Mutation: Each child inherits characteristics from both parents. After some time, a population reaches a limit on the number of gene combinations that can be formed from the parents' genes alone. That is when
the mutation happens. Mutation is one of the main drivers of evolution and natural selection.
- Natural selection: Over time, weaker, unadapted individuals are eliminated by factors such as competition for food, harsh environments and destruction by other species. Only individuals with superior adaptive traits are kept.
- Crossover: During reproduction, offspring inherit traits from both parents; usually an individual receives half of its genes from each parent. Children may adapt better or worse than their parents.

4.3.2 Parameter Tuning

In order to find the optimal values with a GA, all parameters need to be tuned. The tuning parameters, together with their possible ranges, are listed in the following table:

Table 4.2: Parameters for tuning
- bulk_size: size of bulk requests; possible values: 100 – 10000 (step: 100)
- n_workers: number of workers/threads used to insert data into Elasticsearch; possible values: – 100
- n_shards: number of shards of each Elasticsearch index; possible values: – 1000 (step: 10)
- n_replicas: number of replicas of each Elasticsearch index
- n_masters: number of master nodes in the Elasticsearch cluster
- n_data: number of data nodes in the Elasticsearch cluster; possible values: – 20

Each combination of all parameters generates a parameter set, in which each element represents a specific value. A parameter set is represented as the tuple:

bulk_size | n_workers | n_shards | n_replicas | n_masters | n_data

Figure 4.6: Representation of a parameter set

Consequently, each element of the tuple takes a specific value (e.g., a bulk_size of 100):

Figure 4.7: Specific value of a parameter set

After defining the parameters and the selection procedure, we propose an optimization procedure applying the genetic algorithm as suggested in a previous study [23]:

Figure 4.8: Genetic Algorithm flow chart for parameter tuning [30]

In Figure 4.8, after generating a population of P individuals, we need to measure the fitness of each individual by experimental methods.
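The tuning loop of Figure 4.8 can be sketched as follows. This is a minimal illustration, not the implementation used in MASH: the GA settings (population size, mutation rate, keep-the-fitter-half selection) and the toy fitness function are assumptions, and in the real system the fitness of a parameter set would be the measured running time of a bulk-insertion benchmark.

```python
import random

# Parameter ranges following Table 4.2; lower bounds are assumed where the
# source table is incomplete.
RANGES = {
    "bulk_size":  range(100, 10001, 100),
    "n_workers":  range(1, 101),
    "n_shards":   range(10, 1001, 10),
    "n_replicas": range(0, 4),
    "n_masters":  range(1, 4),
    "n_data":     range(1, 21),
}
KEYS = list(RANGES)

def random_individual(rng):
    # One parameter set = one chromosome (Figure 4.6)
    return {k: rng.choice(RANGES[k]) for k in KEYS}

def crossover(a, b, rng):
    # Single-point crossover over the ordered parameter tuple
    point = rng.randrange(1, len(KEYS))
    return {k: (a if i < point else b)[k] for i, k in enumerate(KEYS)}

def mutate(ind, rng, p=0.1):
    # Each gene is re-drawn from its range with probability p
    return {k: (rng.choice(RANGES[k]) if rng.random() < p else v)
            for k, v in ind.items()}

def tune(fitness, pop_size=20, generations=30, seed=0):
    """Minimize `fitness` (lower = better) over the parameter space."""
    rng = random.Random(seed)
    pop = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        survivors = pop[: pop_size // 2]       # selection: keep the fitter half
        children = [mutate(crossover(rng.choice(survivors),
                                     rng.choice(survivors), rng), rng)
                    for _ in range(pop_size - len(survivors))]
        pop = survivors + children
    return min(pop, key=fitness)

# Toy stand-in fitness: in MASH this would run a real insertion benchmark
# and return its elapsed time for the given parameter set.
def toy_fitness(ind):
    return abs(ind["bulk_size"] - 400) + abs(ind["n_workers"] - 12)

best = tune(toy_fitness)
```

Because the survivors are carried over unchanged each generation, the best individual found so far is never lost (elitism), so the loop improves monotonically.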
The results of the GA parameter tuning are described in detail in Section 5.2.

CHAPTER 5. MASH CONSTRUCTION RESULTS

5.1 Test Environment

For testing, the entire system was deployed on a physical server with the following configuration:

Table 5.1: Configuration parameters of the server in the test environment
- Server platform: ProLiant DL360 Gen10
- Processor: Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz
- Memory: 396 GB
- Local storage: SSD, 3PARdata SAN 10 TB
- Hyper-threading: enabled
- Core(s) per socket: 12
- On-line CPU(s) list: 0-47

We undertook tests of data insertion and querying performance. To test query performance, queries were generated by a specialized performance-testing tool. Details of the test results are provided in the next sections.

5.2 Result of Parameter Optimization by Genetic Algorithm

The following parameter set was found after applying the genetic algorithm. All results in the next sections were obtained on an Elasticsearch cluster deployed with this parameter set.

Table 5.2: Result of parameter optimization by Genetic Algorithm
- bulk_size = 400
- n_workers = 12
- n_shards = 30
- n_replicas, n_masters and n_data were likewise selected by the GA.

5.3 Insertion Performance

The results of the data insertion test are shown below. I ran the tests for MASH with Spark and MASH without Spark, to show the impact of integrating bioinformatics tools with Spark and taking advantage of this distributed computing platform.

Table 5.3: Insertion performance
All runs used an 11 GB input file on nodes with 8 CPU cores, 32 GB of RAM and SSD storage. MASH with Spark completed in 1027 s, 912 s and 653 s (the node and process counts varied between runs), while MASH without Spark took 5524 s.

The results are shown in the bar chart of Figure 5.1.

Figure 5.1: Performance of data insertion phase

When running on the same node (same number of CPU cores, same RAM and disk), the bioinformatics tools integrated with Spark gave a very impressive performance improvement: approximately 5.4x faster data insertion.
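The bulk_size and n_workers parameters discussed above govern how documents are batched and parallelized during insertion. A minimal sketch of that interaction is below; the `send_bulk` function is a stub standing in for the real Elasticsearch bulk call (in practice something like `elasticsearch.helpers.parallel_bulk` plays this role), so the structure, not the transport, is what this illustrates.

```python
import concurrent.futures
from itertools import islice

def chunks(docs, bulk_size):
    """Split a document stream into bulk requests of at most bulk_size docs."""
    it = iter(docs)
    while True:
        batch = list(islice(it, bulk_size))
        if not batch:
            return
        yield batch

def send_bulk(batch):
    # Stand-in for an Elasticsearch _bulk call; returns the number of
    # documents "indexed" so the caller can total them up.
    return len(batch)

def parallel_insert(docs, bulk_size=400, n_workers=12):
    """Insert documents as bulk requests using a pool of n_workers threads."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(send_bulk, chunks(docs, bulk_size)))
    return sum(results)

inserted = parallel_insert(({"variant_id": i} for i in range(1000)),
                           bulk_size=400, n_workers=4)
# 1000 docs in bulks of 400 -> 3 bulk requests (400 + 400 + 200)
```

With the GA-selected values (bulk_size = 400, n_workers = 12), each worker thread repeatedly ships batches of 400 documents, which is the behavior the tuning procedure optimizes.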
This solution demonstrably improves the data insertion phase, which saves a lot of time when many larger files must be processed in a real-life system.

5.4 Query Performance

The data stored in the Elasticsearch database uses the document model for data exploration and visualization. Different queries have completely different response times, and this difference becomes evident as the amount of data or the number of requests sent to the system increases. Test cases were created based on the queries sent to the system; each type of query performs different tasks and stresses the system's infrastructure differently. The queries sent to the system are very diverse, and testing all of them is impossible, so we selected five common queries in our system for testing:

Table 5.4: Test types
1. Aggregation: COUNT the number of genes of all female subjects.
2. Aggregation: COUNT the number of variants of all female subjects.
3. Selection: select all variants whose impact is "Modifier".
4. Selection: list details about all female subjects.
5. Selection: select all subjects whose original area is "Ha Noi".

Test cases are listed in the table below:

Table 5.5: Test cases
- T1 (10 CCR), T2 (100 CCR), T3 (500 CCR): aggregation COUNT of the number of genes of all female subjects.
- T4 (10 CCR), T5 (100 CCR), T6 (500 CCR): aggregation COUNT of the number of
variants of all female subjects.
- T7 (10 CCR), T8 (100 CCR), T9 (500 CCR): select all variants whose impact is "Modifier".
- T10 (10 CCR), T11 (100 CCR), T12 (500 CCR): list details about all female subjects, with a specific gene symbol as an additional condition.
- T13 (10 CCR), T14 (100 CCR), T15 (500 CCR): select all subjects whose original area is "Ha Noi".

These test cases were run before and after optimizing the data and the database mapping configuration, corresponding to the nested field (NF) type and the parent-child (PC) type. Let us look at the query results of the first five test cases:

Table 5.6: Query performance (10 CCR), response time in ms
Test case   parent_child   nested
T1          423            36
T4          815            102
T7          1098           360
T10         85             70
T13         14             12

Figure 5.2: Query performance (10 CCR)

With 10 CCR, there is already a clear difference in query performance between the two schema types: the nested type outperforms the parent_child type by up to 11.75 times. Tests T10 and T13 do not show much difference between the two schemas, which can be explained by the relatively small number of subjects in the data (at the time of testing the dataset contained 504 subjects), while the other queries work with up to 27,019,368 variants and 60,343 genes.

In the next five test cases, when the number of CCR increases to 100, the difference in query performance between the two schema types becomes even more obvious: the nested type is up to 38 times faster than parent_child.

Table 5.7: Query performance (100 CCR), response time in ms
Test case   parent_child   nested
T2          9282           320
T5          15621          504
T8          61890          1617
T11         239            215
T14         81             76

Figure 5.3: Query performance (100 CCR)
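The two schema variants compared in these tables can be sketched as Elasticsearch mappings. The index and field names below (`gender`, `variant_list`, `impact`) follow the examples given earlier in the chapter, but the exact mappings used in MASH are not reproduced in this text, so treat this as an illustrative sketch rather than the production schema.

```python
# Nested schema: variants live inside the subject document, so a filter on
# subject fields plus variant fields is answered within one index.
nested_mapping = {
    "mappings": {
        "properties": {
            "gender": {"type": "keyword"},
            "variant_list": {
                "type": "nested",  # Elasticsearch nested field type
                "properties": {"impact": {"type": "keyword"}},
            },
        }
    }
}

# Parent-child schema: subjects and variants are separate documents tied
# together by a join field, which forces a join-like lookup at search time.
parent_child_mapping = {
    "mappings": {
        "properties": {
            "gender": {"type": "keyword"},
            "impact": {"type": "keyword"},
            "doc_relation": {"type": "join",
                             "relations": {"subject": "variant"}},
        }
    }
}

# A nested query matching documents whose variant_list contains a variant
# with impact "high" (cf. the gene-index example in Section 4.2).
count_high_impact = {
    "query": {
        "nested": {
            "path": "variant_list",
            "query": {"term": {"variant_list.impact": "high"}},
        }
    }
}
```

Because the nested variant co-locates a subject and its variants in one document, such a query avoids the parent/child join entirely, which is consistent with the response-time gap measured above.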
In the last five test cases, for the most time-consuming query, the nested type still has an acceptable response time when the number of CCR rises to 500. In contrast, parent_child shows poor performance: its most time-consuming query takes approximately 14 minutes. The nested type is up to 78.4 times faster than the parent-child type.

Table 5.8: Query performance (500 CCR), response time in ms
Test case   parent_child   nested
T3          130435         2021
T6          180129         2983
T9          834986         10653
T12         1622           1542
T15         751            748

Figure 5.4: Query performance (500 CCR)

The results show a significant improvement in query performance from the application of the denormalization technique. This is also one of the important contributions of this thesis.

CONCLUSION AND PERSPECTIVES

1. Research Questions and Outcomes

The main research question is answered through developing, testing, evaluating and then recommending suitable data models for efficient, high-performance data management, insertion and retrieval, along with proposing a highly scalable and maintainable system architecture that meets the requirements of a large biomedical data storage system. Several specific questions were raised during the development of the system; the following paragraphs describe the main ones and summarize the solutions to them.

(a) How to aggregate data from different sources?
The system's raw data is generated from two sources: the sequencing machines and data uploaded to the system by users. Metadata, however, needs to be collected from more sources, including sequencing machines, the community, hospitals, and third-party APIs. Data from different sources has diverse structures and requires different extraction techniques; in addition, the data sources can be updated on a regular basis. So that the data in the target storage area is updated every time the source data changes, ensuring data integrity, ETL tools were developed around the three steps Extract, Transform and Load. All ETL tools are centrally managed, with monitoring to ensure that every ETL task runs successfully. To ensure performance, the ETL tools limit their interaction with the databases; instead, the databases (data sources) are dumped into files. Working with files instead of APIs improved the performance of the ETL tools by approximately 20 times.

(b) Does the data stored in MASH satisfy the criteria of being findable, accessible, interoperable, and reusable (FAIR)?

Data in data lakes is easy to forget and hard to reuse: the amount of data is large, but the necessary context is often missing. To ensure that the data contained in MASH is findable, accessible, interoperable, and reusable, each file object in the MASH data lake is assigned a GUID, and each GUID is associated with its metadata. All metadata is stored in the graph data model.

(c) What are the limitations of the proposed models?
MASH uses two types of data models for different purposes. The graph data model is used for raw data management; it is suitable for storing the close relationships between data elements in the system, thus supporting the findable, accessible, interoperable, and reusable properties. However, the query performance of the graph data model is poor: the larger the tables, the worse joins perform. Therefore, in MASH, the graph data model is only used to store metadata and small-sized data. The document model gives higher query performance than the graph data model because it takes advantage of data locality. To ensure response times, user data needs to be pre-seeded into the indexes, thereby minimizing joins between indexes. However, the amount of data duplicated between tables increases, and ensuring consistency between indexes becomes a big challenge. In MASH, the document data model is used to store the data of VCF files and to make it available to users via the exploration service.

2. Contributions and Perspectives

2.1 Contributions

Technically, through system research and development, we made the following major contributions:
- Proposed an architecture for a flexible platform that can be deployed on the cloud or on-premises and can handle data coming from many different sources at the same time.
- Offered a data model suitable for raw biomedical data management.
- Provided a method and data model that reduce the time needed to insert and retrieve data.

Besides, the system has gone live and brought real benefits to users, including researchers, students and doctors: it gives them a reliable access point in terms of data quality and provides simple, efficient tools to search, filter and analyze data. We also published an article in the Journal of Science and Technology of Technical Universities, ISSN: 2354-1083, vol. 147, pp. 14-21.

2.2 Perspectives

The data contained in MASH is findable, accessible, interoperable, and reusable. We
propose the following research directions to allow users to exploit the system's data more effectively:
(a) Allow users to develop pluggable applications that provide them with the results they want.
(b) Allow the system to be updated with the latest technologies while still ensuring backwards compatibility.

PUBLICATIONS

Nguyen Thanh Huong, Dao Dang Toan, "Cluster-based Routing Approach in Hierarchical Wireless Sensor Networks toward Energy Efficiency using Genetic Algorithm", Journal of Science and Technology of Technical Universities, ISSN: 2354-1083, vol. 147, pp. 14-21.

REFERENCES

[1] Miloslavskaya, N., & Tolstoy, A. (2016). Big Data, Fast Data and Data Lake Concepts. Procedia Computer Science, 88, 300–305. https://doi.org/10.1016/j.procs.2016.07.439
[2] McLaren, W., Gil, L., Hunt, S. E., Riat, H. S., Ritchie, G. R. S., Thormann, A., Flicek, P., & Cunningham, F. (2016). The Ensembl Variant Effect Predictor. Genome Biology, 17(1). https://doi.org/10.1186/s13059-016-0974-4
[3] Borthakur, D. (2007). The Hadoop Distributed File System: Architecture and Design. Hadoop Project Website, 11:21.
[4] Dean, J., & Ghemawat, S. (2008). MapReduce. Communications of the ACM, 51(1), 107–113. https://doi.org/10.1145/1327452.1327492
[5] Zaharia, M., Chowdhury, M., Franklin, M., Shenker, S., & Stoica, I. (2010). Spark: Cluster Computing with Working Sets. Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, 10-10.
[6] O'Brien, A. R., Saunders, N. F. W., Guo, Y., Buske, F. A., Scott, R. J., & Bauer, D. C. (2015). VariantSpark: population scale clustering of genotype information. BMC Genomics, 16(1). https://doi.org/10.1186/s12864-015-2269
[7] Atzeni, P., Bugiotti, F., Cabibbo, L., & Torlone, R. (2020). Data modeling in the NoSQL world. Computer Standards & Interfaces, 67, 103149. https://doi.org/10.1016/j.csi.2016.10.003
[8] Pollack, K., & Brandt, S. (2005). Efficient Access Control for Distributed Hierarchical File Systems. 253–260. 10.1109/MSST.2005.11
[9] 1000 Genomes Project Consortium (2012). An integrated map of genetic variation from 1,092 human genomes. Nature, 491(7422), 56–65. doi:10.1038/nature11632
[10] Wu, D., Dou, J., Chai, X., Bellis, C., Wilm, A., Shih, C. C., Soon, W. W. J., Bertin, N., Lin, C. B., Khor, C. C., DeGiorgio, M., Cheng, S., Bao, L., Karnani, N., Hwang, W. Y. K., Davila, S., Tan, P., Shabbir, A., Moh, A., … Zhao, Y. (2019). Large-Scale Whole-Genome Sequencing of Three Diverse Asian Populations in Singapore. Cell, 179(3), 736-749.e15. https://doi.org/10.1016/j.cell.2019.09.019
[11] Tateno, Y. (2002). DNA Data Bank of Japan (DDBJ) for genome scale research in life science. Nucleic Acids Research, 30(1), 27–30. https://doi.org/10.1093/nar/30.1.27
[12] (2019). The GenomeAsia 100K Project enables genetic discoveries across Asia. Nature, 576(7785), 106–111. https://doi.org/10.1038/s41586-019-1793-z
[13] Afgan, E., Baker, D., Coraor, N., Chapman, B., Nekrutenko, A., & Taylor, J. (2010). Galaxy CloudMan: delivering cloud compute clusters. BMC Bioinformatics, 11(Suppl 12), S4. https://doi.org/10.1186/1471-2105-11-s12-s4
[14] Afgan, E., Baker, D., Coraor, N., Goto, H., Paul, I. M., Makova, K. D., Nekrutenko, A., & Taylor, J. (2011). Harnessing cloud computing with Galaxy Cloud. Nature Biotechnology, 29(11), 972–974. https://doi.org/10.1038/nbt.2028
[15] Heath, A. P., Greenway, M., Powell, R., Spring, J., Suarez, R., Hanley, D., Bandlamudi, C., McNerney, M. E., White, K. P., & Grossman, R. L. (2014). Bionimbus: a cloud for managing, analyzing and sharing large genomics datasets. Journal of the American Medical Informatics Association, 21(6), 969–975. https://doi.org/10.1136/amiajnl-2013-002155
[16] Yung, C. K., Mihaiescu, G. L., Tiernay, B., Zhang, J., Gerthoffert, F., Yang, A., Baker, J., Bourque, G., Boutros, P. C., Knoppers, B. M., Ouellette, B. F., Sahinalp, C., Shah, S. P., Ferretti, V., & Stein, L. D. (2017, July 1). Abstract 378: The Cancer Genome Collaboratory. Molecular and Cellular Biology, Genetics. Proceedings: AACR Annual Meeting 2017;
April 1-5, 2017; Washington, DC. https://doi.org/10.1158/1538-7445.am2017-378
[17] Grossman, R. L., Heath, A., Murphy, M., Patterson, M., & Wells, W. (2016). A Case for Data Commons: Toward Data Science as a Service. Computing in Science & Engineering, 18(5), 10–20. https://doi.org/10.1109/mcse.2016.92
[18] Grossman, R. L., Greenway, M., Heath, A. P., Powell, R., Suarez, R. D., Wells, W., & Harvey, C. (2012, November). The design of a community science cloud: The Open Science Data Cloud perspective. In 2012 SC Companion: High Performance Computing, Networking Storage and Analysis (pp. 1051–1057). IEEE.
[19] Jensen, M. A., Ferretti, V., Grossman, R. L., & Staudt, L. M. (2017). The NCI Genomic Data Commons as an engine for precision medicine. Blood, 130(4), 453–459. https://doi.org/10.1182/blood-2017-03-735654
[20] Kaushik, G., Ivkovic, S., Simonovic, J., Tijanic, N., Davis-Dusenbery, B., & Kural, D. (2016). Graph theory approaches for optimizing biomedical data analysis using reproducible workflows. bioRxiv. doi:10.1101/074708
[21] Man, K. F., Tang, K. S., & Kwong, S. (1996). Genetic Algorithms: Concepts and Applications. IEEE Transactions on Industrial Electronics, 43(5).
[22] Haldurai, L., Madhubala, T., & Rajalakshmi, R. A Study on Genetic Algorithm and its Applications. International Journal of Computer Sciences and Engineering, 4(10), ISSN 2347-2693.
[23] Nguyen Thanh Huong, Dao Dang Toan. Cluster-based Routing Approach in Hierarchical Wireless Sensor Networks toward Energy Efficiency using Genetic Algorithm. Journal of Science and Technology of Technical Universities, ISSN: 2354-1083, vol. 147, pp. 14-21.