Master's thesis: deep learning in classifying cancer subtypes, extracting relevant genes and identifying novel mutations

DOCUMENT INFORMATION

Basic information

Format
Pages	102
Size	830.52 KB

Contents

DEEP LEARNING IN CLASSIFYING CANCER SUBTYPES, EXTRACTING RELEVANT GENES AND IDENTIFYING NOVEL MUTATIONS

A thesis submitted in fulfilment of the requirements for the degree of Master of Engineering

Raktim Kumar Mondol
(B.Sc in Electrical and Electronic Engineering, BRAC University)

School of Engineering
College of Science, Engineering and Health
RMIT University
Melbourne, Australia
February 2019

DECLARATION

I certify that, except where due acknowledgment has been made, the work presented in this thesis is that of the author alone; the work has not been submitted previously, in whole or in part, to qualify for any other academic award; the content of the thesis is the result of work which has been carried out since the official commencement date of the approved research program; any editorial work, paid or unpaid, carried out by a third party is acknowledged; and ethics procedures and guidelines have been followed.

I acknowledge the support I have received for my research through the provision of an Australian Government Research Training Program Scholarship.

Raktim Kumar Mondol
10/01/2019

ACKNOWLEDGMENTS

It is a pleasure to acknowledge my appreciation to the people who have made this thesis possible. I wish to express my sincere gratitude to my supervisors Dr Omid Kavehei, Dr Samuel Ippolito and Dr Reza Bonyadi for their continued guidance, motivation and support during my research. I would like to thank them for their patience, trust and encouragement throughout my master's. Furthermore, I would like to thank Dr Esmaeil Ebrahimie from the University of Adelaide for collaborating in my research and providing me with critical feedback. I would also like to extend my sincere thanks to my fellow PhD colleague Nhan Duy Truong for his guidance in my project and his kind words. Finally, I would like to thank my family for their support and understanding during this period.

PREFACE

This dissertation is original and unpublished work by the author, R.K. Mondol.

TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ABBREVIATIONS
NOMENCLATURE
ABSTRACT
1 Introduction
  1.1 Background
  1.2 Problem Statement
  1.3 Scope and Rationale of the Research
  1.4 Objectives of the Research
  1.5 Research Questions
    1.5.1 Research Question
    1.5.2 Research Question
    1.5.3 Research Question
  1.6 Structure of the Thesis
2 Literature Review
  2.1 Overview
  2.2 Deep Learning Theoretical Background
    2.2.1 Artificial Neural Network & Various Types of Data Mining Methods
    2.2.2 Types of Neural Networks
    2.2.3 Training Neural Network using Backpropagation
  2.3 Bioinformatics Theoretical Background
    2.3.1 Background
    2.3.2 Molecular Biology
    2.3.3 Various Pipelines in Bioinformatics
    2.3.4 Gene Ontology and Pathway Analysis
3 Feature Extraction using Adversarial Autoencoder
  3.1 Overview
  3.2 Challenges
  3.3 Data Collection
  3.4 Evaluating Performance of AAE using Classification
    3.4.1 Background
    3.4.2 Methodology
    3.4.3 AAE Model Implementation
    3.4.4 Performance Metrics of Classification
    3.4.5 Results
    3.4.6 Discussion
  3.5 Evaluating Performance of AAE by Analyzing Connectivity Matrices
    3.5.1 Overview
    3.5.2 Methodology
    3.5.3 AAE Model Implementation
    3.5.4 Results
    3.5.5 Discussion
  3.6 Chapter Summary
4 Identification of Novel Mutation using Bioinformatics Pipelines
  4.1 Overview
  4.2 Data Collection
  4.3 Methodology
  4.4 Results & Discussion
  4.5 Gene Ontology (GO) & Protein-Protein Interaction (PPI) Analysis
  4.6 Chapter Summary
5 Conclusions and Future Work
  5.1 Conclusions
  5.2 Recommendation for Future Work
REFERENCES
A Algorithms
B Tables
C Figures
D Codes
  D.1 AAE Architecture
  D.2 Fine Tuning
  D.3 Extract Weight from Latent Space
  D.4 Analyze and Sort Weight Matrix

LIST OF TABLES

3.1 Summary of the proposed AAE architecture. Each network in the AAE contains one hidden layer with 1000 neurons. To train the model, the adadelta optimizer is used with a learning rate of [...] and binary cross entropy is used as the loss function.
3.2 Results of gene enrichment analysis using various feature extraction methods on the BRCA cancer data-set.
4.1 Eight leukemia data samples with their corresponding sample types.
4.2 List of novel variants obtained from annotated data.
B.1 Specification of the computer used in the experiment.
B.2 Comparison of various feature extraction techniques using the BRCA dataset with twelve different classifiers. Five-fold cross validation was performed during evaluation.
B.3 Benchmarking of various feature extractors while classifying various cancer subtypes using the BRCA dataset.
B.4 Benchmarking of various feature extractors while extracting biologically relevant genes using the BRCA dataset.
B.5 Results of pathway analysis using various feature extraction methods on the BRCA data-set.
B.6 Results of gene enrichment analysis using various feature extraction methods, further validated on the UCEC data-set.
B.7 The novel variants with associated cancer and their primary expression.
B.8 Results of Gene Ontology (GO) enrichment analysis using variant genes.

LIST OF FIGURES

1.1 The cost of genome sequencing has been decreasing over the last decade [1].
1.2 Nucleotide sequence, mass spectrometry, and microarray data keep increasing [3].
2.1 Biological neuron compared with the artificial neural network [7].
2.2 Multilayer perceptron with input layer, one hidden layer, and output layer [15].
2.3 An autoencoder reconstructs its actual input at the output layer, with the encoded data stored in the hidden layer [16].
2.4 Structure of a gene, which includes a promoter region, RNA-coding region and terminator sites [26].
2.5 DNA is transcribed into RNA, and this transcript may then be translated into protein [28].
2.6 Pipeline for differential gene analysis using RNA-Seq data [31].
2.7 Pipeline for variant analysis using DNA-Seq data [33].
3.1 Architecture of AAE for classifying cancer subtypes.
3.2 Various classifiers such as DT, GB, KNN etc. are used to evaluate the performance of feature extraction methods such as AAE, PCA, AE, VAE and DAE. (a) Precision score is compared among feature extraction methods using twelve different classifiers. (b) Recall score is compared among feature extraction methods using twelve different classifiers.
3.3 Performance of various feature extraction methods in terms of their computation time.
3.4 Proposed AAE compared with other methods in terms of seven performance metrics: accuracy, F1-score, recall, precision, AUC, MCC and Kappa.
3.5 Block diagram for weight matrix analysis using the AAE architecture.
3.6 Histogram of AAE weights after applying the TopGene algorithm.
4.1 Block diagram of the whole pipeline for variant analysis. First, in the pre-processing steps, the quality of the data is inspected and the data are trimmed to improve quality. Then the data are aligned with the reference human genome. Next, PCR duplicates are removed before variant calling. Important variants are then obtained through filtering. After that, the filtered variants are annotated with reference variant annotation databases. Finally, the data are further refined in order to find, validate and visualize novel variants.
4.2 Three different variant calling tools shown in a Venn diagram for samples SRR40862 & SRR408630.
4.3 The variant obtained from the SRR408623 sample. The novel variant is located at position chromosome X:150470758, where the nucleotide base changes from C to G. The MAMLD1 gene is found to be responsible for this mutation.
C.1 The olfactory transduction pathway obtained from the AAE model. The GCD expressing ORNs and Odorant (highlighted in red) are oncogenic activators responsible for triggering the olfactory transduction pathway.
C.2 Interaction between proteins. Here, the PLK3 variant gene is connected to the SF1 variant gene via CDC5L.
... this kind of problem. Furthermore, deep learning has revolutionised several areas of biology and medicine by examining a variety of issues such as diagnosing diseases, protein integrations and ... reduced chance of error. Thus, developing such a method and pipeline is of current research interest. At the beginning of the research, the high-dimensionality problem of the dataset was addressed and tried ... heterogeneous, often has many missing values and requires sophisticated tools to analyse. This complex nature of the data can be handled using deep learning-based algorithms [2]. Some of the problems regarding deep

Date posted: 11/03/2023, 11:45