Effective model for processing gene expression data (doctoral thesis summary, in English)


DOCUMENT INFORMATION

Structure

  • 1 INTRODUCTION

    • 1.1 The urgency of the thesis

    • 1.2 Objectives, objects and scope of research of the thesis

    • 1.3 Contribution of the thesis

    • 1.4 Thesis structure

  • 2 BACKGROUND AND REVIEW OF LITERATURE

    • 2.1 Gene expression data

    • 2.2 Evaluation protocol

    • 2.3 Datasets

    • 2.4 Related research works

    • 2.5 Conclusion

  • 3 FEATURE EXTRACTION MODEL FOR GENE EXPRESSION DATA

    • 3.1 Introduction

    • 3.2 Methods

    • 3.3 Evaluation

      • 3.3.1 Classifying DNA Microarray gene expression data

      • 3.3.2 Classifying RNA-Seq gene expression data

      • 3.3.3 Classifying a large RNA-Seq gene expression dataset

    • 3.4 Conclusion

  • 4 ENHANCING GENE EXPRESSION DATA USING SMOTE ALGORITHM

    • 4.1 Introduction

    • 4.2 Methods

    • 4.3 Evaluation

    • 4.4 Conclusion

  • 5 ENHANCING DATA MODEL FOR GENE EXPRESSION DATA

    • 5.1 Introduction

    • 5.2 Methods

    • 5.3 Evaluation

    • 5.4 Conclusion

  • 6 ENSEMBLE RANDOM OBLIQUE DECISION STUMPS

    • 6.1 Introduction

    • 6.2 Ensemble random oblique decision stumps

    • 6.3 Evaluation

      • 6.3.1 Classification results on the original dimension

      • 6.3.2 Classification results after enhancing data using GAN

      • 6.3.3 Classification results based on features extracted by DCNN

    • 6.4 Conclusion

  • 7 CONCLUSION AND FUTURE WORKS

    • 7.1 Results of the study

    • 7.2 Future works

Content

MINISTRY OF EDUCATION & TRAINING
CAN THO UNIVERSITY

DOCTORAL THESIS SUMMARY
Specialization: Information Systems
Code: 62 48 01 04

HUYNH PHUOC HAI

EFFECTIVE MODEL FOR GENE EXPRESSION DATA ANALYSIS

Can Tho, 2019

The dissertation is completed at: CAN THO UNIVERSITY.
Academic instructors: Do Thanh Nghi, Assoc. Prof., PhD; Nguyen Van Hoa, PhD.
The dissertation will be defended before the Board of thesis review, meeting at …………………………… at … hour, … day, … month, … year.
The dissertation is available at the National Library and at the Information and Learning Center, Can Tho University.

PUBLISHED ARTICLES

[CT1] Phuoc-Hai Huynh, Van Hoa Nguyen, Thanh-Nghi Do, "Novel hybrid DCNN-SVM model for classifying RNA-Sequencing gene expression data", Journal of Information and Telecommunication (JIT), Taylor & Francis, 3:4, pp. 533-547, 2019.

[CT2] Phuoc-Hai Huynh, Van Hoa Nguyen, Thanh-Nghi Do, "Enhancing gene expression classification of support vector machines with generative adversarial networks", Journal of Information and Communication Convergence Engineering (JICCE), Vol. 17, pp. 14-20, 2019 (SCOPUS).

[CT3] Phuoc-Hai Huynh, Van Hoa Nguyen, Thanh-Nghi Do, "So sánh mô hình học sâu với các kỹ thuật học tự động khác trong phân lớp dữ liệu biểu hiện gene microarray" (Comparing deep learning models with other machine learning techniques for classifying microarray gene expression data), in proc. of the 10th National Conference on Fundamental and Applied Information Technology Research (FAIR'10), pp. 841-850, 2017.

[CT4] Phuoc-Hai Huynh, Van Hoa Nguyen, Thanh-Nghi Do, "A Coupling Support Vector Machines with the Feature Learning of Deep Convolutional Neural Networks for Classifying Microarray Gene Expression Data", in proc. of the 10th Asian Conference on Intelligent Information and Database Systems (ACIIDS), Springer, pp. 233-243, 2018.

[CT5] Phuoc-Hai Huynh, Van Hoa Nguyen, Thanh-Nghi Do, "Random ensemble oblique decision stumps for classifying gene expression data", in proc. of the International Symposium on Information and Communication Technology 2018 (SoICT), Association for Computing Machinery, pp. 137-144, 2018.

[CT6] Phuoc-Hai Huynh, Van Hoa Nguyen, Thanh-Nghi Do, "A combined enhancing and feature extraction algorithm to improve learning accuracy for gene expression classification", in proc. of the 6th International Conference on Future Data and Security Engineering 2019 (FDSE), Springer, pp. 255-273, 2019.

[CT7] Phuoc-Hai Huynh, Van Hoa Nguyen, Thanh-Nghi Do, "Improvements in the Large p, Small n Classification Issue", SN Computer Science, Springer, 1, 207, 2020.

CHAPTER 1: INTRODUCTION

1.1 The urgency of the thesis

In recent years, cancer has been one of the leading causes of death worldwide. Therefore, more and more studies have been conducted to find effective solutions for diagnosing and treating cancer. However, many challenges remain in cancer treatment, because the possible causes of cancer are genetic disorders or epigenetic alterations in the cells. Gene expression data support the accurate classification of cancers, which addresses problems relating to cancer causes and treatment regimens. However, gene expression data are very high-dimensional with small sample sizes, which leads to over-fitting of classification models.

High-dimensional data classification is a fundamental task in machine learning. A characteristic of gene expression data is that the number of variables (genes) n far exceeds the number of samples m, a situation commonly known as the "curse of dimensionality" problem. The vast number of features leads to statistical and analytical challenges, and conventional statistical methods give improper results due to the high dimensionality of gene expression data combined with a limited number of patterns.
It is not feasible to build machine learning models directly on such extremely large feature sets, with millions of features and a high computing cost. Several classification methods have been applied to the analysis of gene expression data. Although there has been much research on cancer classification from gene expression data, a critical need remains to improve classification accuracy. In addition, the training sample size is relatively small compared to the size of the feature vector, so classifiers may give poor classification performance due to over-fitting. Therefore, the dissertation "Effective model for gene expression data analysis" is conducted to contribute a small part to the research field of gene expression data classification.

1.2 Objectives, objects and scope of research of the thesis

The main objective of this dissertation is to propose new approaches to gene expression data classification that improve the accuracy of the proposed models. The specific objectives include the following:

• The first objective is to propose a new model that extracts features from gene expression data to improve the accuracy of classification models.

• The second objective is to propose a new model that enhances gene expression data to improve the accuracy of classification models.

• The final objective is to investigate a new algorithm that efficiently classifies very-high-dimensional gene expression data and that also improves accuracy when combined with the feature extraction and data enhancement models.

The main objects of the study are feature extraction, data enhancement, and classification models for human gene expression data. Literature review and experimentation are the two main research methods used in this dissertation. Within the scope of this dissertation, solutions are suggested to address the very-high-dimensional and small-sample-size issues of gene expression data classification.

1.3 Contribution of the thesis

In this dissertation, we tackle these issues with the following contributions. Firstly, we propose a new feature extraction model that learns latent features of gene expression data with a deep convolutional neural network (DCNN). The model improves classification accuracy on gene expression data from both DNA Microarray and RNA-Seq technologies, and experimental results show that the DCNN is effective at extracting features from gene expression data. In addition, we propose a combined enhancing and extraction model to address both challenges of classification models on gene expression data: the SMOTE algorithm generates new data from the features extracted by the DCNN. These models are used in conjunction with classifiers that efficiently classify gene expression data.

Secondly, we propose a new model for enhancing gene expression data with a generative adversarial network (GAN). The GAN generates new data from the original training datasets and is used in conjunction with classifiers that efficiently classify gene expression data. Numerical test results show that our proposed model improves the classification accuracy of k nearest neighbors, decision trees, support vector machines and random forests.

Finally, we investigate random oblique decision stumps (RODS) based on the linear support vector machine (SVM), which are suitable for classifying very-high-dimensional microarray gene expression data. Our classification algorithms (called Bag-RODS and Boost-RODS) learn multiple oblique decision stumps in the way of bagging and boosting to form an ensemble of classifiers that is more accurate than a single model.
Numerical test results show that our proposed algorithms are more accurate than state-of-the-art classification models, including k nearest neighbors, support vector machines, decision trees, and ensembles of decision trees such as random forests, bagging and AdaBoost. In addition, these models further improve classification accuracy when combined with the data enhancement model using the GAN and the feature extraction model using the DCNN.

1.4 Thesis structure

The rest of this dissertation is structured as follows. In Chapter 2, we cover the theoretical background of this work and introduce some breakthrough related works. Chapter 3 proposes a new feature extraction model using a deep convolutional neural network (DCNN); this model extracts latent features from gene expression data, which are then used in conjunction with support vector machines, k nearest neighbors and random forests to classify gene expression data efficiently. In Chapter 4, we propose a novel gene expression classification model that couples multiple classification algorithms with the synthetic minority oversampling technique (SMOTE) applied to features extracted by the DCNN. We propose enhancing gene expression data with a generative adversarial network in Chapter 5. In Chapter 6, we investigate random oblique decision stumps (RODS) based on the linear SVM, suitable for classifying very-high-dimensional gene expression data. We then conclude in Chapter 7 and outline some future works.

CHAPTER 2: BACKGROUND AND REVIEW OF LITERATURE

2.1 Gene expression data

Gene expression is the process by which genes are turned into functional gene products, i.e. proteins. It can be used to study the effect of treatments or to discover diseases by comparing the expression of healthy genes with the expression of genes that are infected or changed by a treatment. The development of gene expression analysis technology enables researchers to investigate issues once thought to be impractical, by simultaneously measuring the expression levels of thousands of genes in a single experiment. Models that classify gene expression data have provided useful information for diagnosing cancer and for drug discovery. The purpose is to learn classifiers from gene expression data that automatically assign a label to a given expression profile; after the quality of the prediction has been assessed, they can be applied to classify new expression profiles.

2.2 Evaluation protocol

We are interested in the accuracy of our proposals for classifying gene expression data. Therefore, we report a comparison of the classification performance obtained by our models and the best state-of-the-art algorithms. The paired Student's t-test was used to assess the classification results of the learning algorithms. The experiments in this dissertation use three evaluation protocols. Firstly, some datasets are already divided into a training set (Trn) and a testing set (Tst); for these datasets, we used the training data to build our model and then classified the testing set with the resulting model. Secondly, for datasets having fewer than 300 data points, the test protocol is leave-one-out cross-validation (loo). For the others, we used the 10-fold cross-validation protocol, which remains the most widely used way to evaluate performance. Our evaluation is based on classification accuracy. All experiments are run on a Linux Mint machine with an Intel(R) Xeon(R) 3.07 GHz CPU.
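To make this protocol concrete, the sketch below shows how such an evaluation could be wired up with scikit-learn and SciPy: leave-one-out for datasets under 300 samples, 10-fold cross-validation otherwise, and a paired Student's t-test over per-dataset accuracies of two competing models. This is our own minimal illustration of the protocol described above, not the dissertation's code; names such as choose_protocol and compare are ours.

    import numpy as np
    from scipy import stats
    from sklearn.model_selection import LeaveOneOut, StratifiedKFold, cross_val_score
    from sklearn.svm import LinearSVC
    from sklearn.neighbors import KNeighborsClassifier

    def choose_protocol(n_samples):
        # Fewer than 300 points: leave-one-out; otherwise 10-fold cross-validation.
        return LeaveOneOut() if n_samples < 300 else StratifiedKFold(n_splits=10)

    def accuracy(model, X, y):
        cv = choose_protocol(len(y))
        return cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()

    def compare(model_a, model_b, datasets):
        # Paired Student's t-test over per-dataset accuracies of two models.
        acc_a = [accuracy(model_a, X, y) for X, y in datasets]
        acc_b = [accuracy(model_b, X, y) for X, y in datasets]
        t, p = stats.ttest_rel(acc_a, acc_b)
        return np.mean(acc_a), np.mean(acc_b), p

    # Example: compare the linear SVM (C = 1e5, as in Section 3.3) against 3-NN.
    # mean_a, mean_b, p = compare(LinearSVC(C=1e5), KNeighborsClassifier(3), datasets)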
2.3 Datasets

In our experiments, we use datasets provided by the ArrayExpress, Kent Ridge and TCGA (http://gdac.broadinstitute.org) repositories. In this dissertation, experiments were conducted on various datasets. The DNA Microarray collection consists of 50 very-high-dimensional datasets from the Kent Ridge and ArrayExpress repositories (Datasets I). The small-sample-size DNA Microarray collection consists of 20 very-small-sample datasets (fewer than 130 samples) from the Kent Ridge and ArrayExpress repositories (Datasets II). The RNA-Seq collection consists of 25 binary-class RNA-Seq gene expression datasets of small and medium sample sizes, ranging from 66 to 1,100 samples, each sample having 20,531 features. In addition, a large RNA-Seq gene expression dataset is used, which contains 12,181 data samples representing 36 tumor types (Datasets III). All datasets and their characteristics are summarized in the full-text dissertation.

2.4 Related research works

During the past decade, many algorithms have been used to classify gene expression data, including support vector machines, neural networks, k nearest neighbors, decision trees, random forests, random forests of oblique decision trees, bagging and boosting. In this section, we briefly overview popular classification models for gene expression data and discuss related works. In addition, we also introduce the deep convolutional neural network, DCNN (LeCun, 85), and the generative adversarial network, GAN (Goodfellow, 14).

2.5 Conclusion

In short, this chapter covers the theoretical background of this work and introduces some breakthrough related works. Its main contributions are the collected datasets and the discussion of related works on gene expression data models.

CHAPTER 3: FEATURE EXTRACTION MODEL FOR GENE EXPRESSION DATA

3.1 Introduction

In this chapter, we propose a DCNN that extracts features from the original gene expression data, together with classification algorithms that classify the new features. This model addresses the very-high-dimensional issue of gene expression data. Experimental results show that the DCNN improves classification accuracy on gene expression data from both DNA Microarray and RNA-Seq technologies. The results of this chapter are published in CT1, CT3 and CT4.

3.2 Methods

First of all, we implement a new DCNN that extracts new features from the original gene expression data. A new DCNN architecture is designed to extract latent features from the original gene expression data; the network architecture is shown in Figure 3.1.

Figure 3.1: A new DCNN architecture for feature extraction from gene expression data (Input → CONV1, kernel size 3×3 → POOLING1, kernel size 3×3 → CONV2, kernel size 3×3 → POOLING2, kernel size 3×3 → new features → SVM, kNN, RF, C4.5).

The layers are respectively named CONV1, POOLING1, CONV2, POOLING2, and output. The input layer receives the gene expression profile in a 2-D matrix format. In the network structure, the successive layers are designed to learn progressively higher-level features, up to the last layer, which produces the categories. Once training is complete, the last layer acts as a linear classifier operating on the features extracted by the previous layers. The numbers of feature maps and the kernel sizes are shown in Figure 3.1. Our model takes advantage of the DCNN to learn latent features from very-high-dimensional input spaces; this process can be viewed as a projection of the data from a higher-dimensional space to a lower-dimensional one.

Secondly, the non-linear SVM (SVM), linear SVM (LSVM), random forests (RF), k nearest neighbors (kNN) and decision trees (C4.5) learn to classify gene expression data from the new features extracted by the DCNN. In our algorithm, the training and testing samples are fed through the trained network, and the output of the feature layer is extracted as the features; the feature sets from the training samples are then used as input to train the various classifiers in the usual way. In our approach, we propose to use the RBF kernel in the non-linear SVM model because it is general and efficient. Moreover, the DCNN can improve the classification accuracy of LSVM and kNN.
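As an illustration, a Keras sketch of a feature extractor in the spirit of Figure 3.1 might look as follows: two 3×3 convolution/pooling stages, a softmax output used only during training, and the penultimate layer taken as the new feature vector. The input shape, numbers of feature maps and dense-layer size are placeholders, since the summary defers exact values to the full-text dissertation; the Adam learning rate of 0.00002 follows Section 3.3.

    import tensorflow as tf
    from tensorflow.keras import layers, models

    def build_dcnn(input_shape=(145, 145, 1), n_classes=2):
        # Placeholder input shape: e.g. a 20,531-gene profile zero-padded into a
        # 145x145 matrix. CONV1 -> POOLING1 -> CONV2 -> POOLING2 as in Figure 3.1.
        inputs = layers.Input(shape=input_shape)
        x = layers.Conv2D(32, (3, 3), activation="relu", padding="same", name="CONV1")(inputs)
        x = layers.MaxPooling2D((3, 3), name="POOLING1")(x)
        x = layers.Conv2D(64, (3, 3), activation="relu", padding="same", name="CONV2")(x)
        x = layers.MaxPooling2D((3, 3), name="POOLING2")(x)
        x = layers.Flatten()(x)
        features = layers.Dense(128, activation="relu", name="new_features")(x)
        outputs = layers.Dense(n_classes, activation="softmax")(features)
        return models.Model(inputs, outputs)

    model = build_dcnn()
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    # After training, reuse the penultimate layer as the feature extractor:
    extractor = models.Model(model.input, model.get_layer("new_features").output)
    # new_X = extractor.predict(X)  # features fed to SVM, kNN, RF, C4.5

The extracted feature matrix then replaces the raw, very-high-dimensional profile as the input to the downstream classifiers.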
3.3 Evaluation

We implemented the DCNN in Python using TensorFlow, and the other algorithms such as SVM, RF and C4.5 with the Scikit-learn and LibSVM libraries. To train the network, we use Adam for optimization with the batch size set to 32. We start training with a learning rate of 0.00002 for all layers, and then adjust it manually whenever the validation error rate stops improving. The number of epochs is 200. The RF learns 200 decision trees; kNN tries k among 1, 3 and 5; and C = 10^5 is used for the LSVM. Finally, the parameters C and γ of the RBF kernel are tuned to obtain good accuracy for the non-linear SVM. All hyper-parameters are shown in the full-text dissertation.

3.3.1 Classifying DNA Microarray gene expression data

The classification results on the 50 DNA Microarray datasets are shown in the full-text dissertation. Table 3.1 summarizes the results of these statistical tests with the paired Student's t-test. First and foremost, we evaluate the DCNN model by comparing the accuracy of the classification algorithms (SVM, LSVM, kNN, C4.5 and RF).

CHAPTER 4: ENHANCING GENE EXPRESSION DATA USING SMOTE ALGORITHM

4.1 Introduction

In this chapter, we propose a novel gene expression classification model that couples multiple classification algorithms with the synthetic minority oversampling technique (SMOTE) using features extracted by a deep convolutional neural network (DCNN). In our approach, the DCNN extracts latent features from the original gene expression data, then the SMOTE algorithm generates new data from the DCNN features. Experimental results show that DCNN and SMOTE improve the classification accuracy of the SVM, LSVM, kNN and RF classifiers. The results of this chapter are published in CT6.

4.2 Methods

The proposed algorithm is an effective combination of the two algorithms DCNN and SMOTE. The algorithm performs the training task in three main phases (Figure 4.1).

Figure 4.1: Proposed model using DCNN and SMOTE (original data → DCNN → new features → SMOTE → synthetic data labeled using SVM → training data → classifiers).

First of all, we implement a new DCNN that extracts new features from the original gene expression data. The new features improve the discriminative power of the gene expression representations and thus obtain a higher accuracy rate than the original features. Although the data dimension is reduced, the training sample size is still diminutive relative to the feature vector size, so classifiers may give poor classification performance due to over-fitting.

In the second training phase, our model uses SMOTE to generate new samples from the features extracted by the DCNN model. In the very-high-dimensional setting, only kNN classifiers based on the Euclidean distance seem to benefit substantially from over-sampling, provided that feature extraction is performed before this algorithm is used; the traditional over-sampling algorithm is not effective for very-high-dimensional data, and this problem is tackled by the DCNN model in our approach. We propose a SMOTE algorithm that generates synthetic gene expression data from the new features extracted by the DCNN; the synthetic data have almost the same characteristics as the training data points. Synthetic data points (x_new) are generated in the following way. Firstly, the algorithm takes a feature vector and its nearest neighbors and computes the difference between these vectors. Secondly, this difference is multiplied by a random number (γ) between 0 and 1 and added back to the feature vector; this selects a random point along the line segment between two specific feature vectors. Then, the LSVM is used to label the generated samples, with constant C = 10^3. The amount of new samples (p%) and the number of nearest neighbors k are hyper-parameters of the algorithm. Last but not least, our model produces new training data, on which the classifiers (non-linear SVM, LSVM, kNN, RF and C4.5) learn to classify gene expression data efficiently.
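A minimal sketch of this generation step (our own illustration under the description above, not the dissertation's code) could read as follows: for each synthetic point, pick a random feature vector, pick one of its k nearest neighbors, interpolate with a random γ in (0, 1), and label the synthetic points with a linear SVM trained on the real features.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors
    from sklearn.svm import LinearSVC

    def smote_on_features(F, y, p=100, k=5, C=1e3, seed=0):
        """Generate p% synthetic points from DCNN features F; label them with LSVM."""
        rng = np.random.default_rng(seed)
        n_new = int(len(F) * p / 100)
        nn = NearestNeighbors(n_neighbors=k + 1).fit(F)   # +1: a point is its own neighbor
        _, idx = nn.kneighbors(F)
        synthetic = []
        for _ in range(n_new):
            i = rng.integers(len(F))                      # random base point
            j = idx[i][rng.integers(1, k + 1)]            # one of its k nearest neighbors
            gamma = rng.random()                          # gamma in (0, 1)
            synthetic.append(F[i] + gamma * (F[j] - F[i]))  # point on the line segment
        synthetic = np.array(synthetic)
        labeler = LinearSVC(C=C).fit(F, y)                # C = 10^3 as in Section 4.2
        y_new = labeler.predict(synthetic)
        return np.vstack([F, synthetic]), np.concatenate([y, y_new])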
4.3 Evaluation

The classification results on the 50 DNA Microarray datasets are shown in the full-text dissertation. Table 4.1 summarizes the results of these statistical tests with the paired Student ratio test. Table 4.1 shows that DCNN-SMOTE-[SVM, LSVM, kNN, RF] significantly increases the mean accuracy by 4.83, 2.90, 3.74 and 2.08 percentage points compared to SVM, LSVM, kNN and RF, respectively; all p-values are less than 0.05. These results show the effectiveness of DCNN and SMOTE in improving the accuracy of the SVM, LSVM, RF and kNN classifiers. In the comparison between DCNN-SMOTE-C4.5 and C4.5, DCNN-SMOTE-C4.5 is slightly superior, with 27 wins, 2 ties and 21 defeats (p-value = 1.06E-01, not significantly different).

Table 4.1: Summary of the accuracy comparison on 50 datasets.

Mean accuracy (%): SVM 83.34; LSVM 84.64; kNN 78.77; RF 82.62; C4.5 75.70; DCNN-SMOTE-SVM 88.17; DCNN-SMOTE-LSVM 87.54; DCNN-SMOTE-kNN 82.51; DCNN-SMOTE-RF 84.70; DCNN-SMOTE-C4.5 78.47.

Win / Tie / Defeat and p-value:
DCNN-SMOTE-SVM & SVM: 29 / 11 / 10, p = 1.33E-03
DCNN-SMOTE-LSVM & LSVM: 34 / 12 / 4, p = 8.72E-03
DCNN-SMOTE-kNN & kNN: 29 / 10 / 11, p = 2.26E-03
DCNN-SMOTE-RF & RF: 29 / 4 / 17, p = 2.78E-02
DCNN-SMOTE-C4.5 & C4.5: 27 / 2 / 21, p = 0.11
DCNN-SMOTE-SVM & DCNN-SVM: 23 / 17 / 10, p = 8.20E-02
DCNN-SMOTE-LSVM & DCNN-LSVM: 25 / 11 / 14, p = 1.59E-01
DCNN-SMOTE-kNN & DCNN-kNN: 31 / 12 / 7, p = 8.45E-02
DCNN-SMOTE-RF & DCNN-RF: 28 / 10 / 12, p = 7.06E-02
DCNN-SMOTE-C4.5 & DCNN-C4.5: 27 / 5 / 18, p = 1.47E-01
DCNN-SMOTE-SVM & DCNN-SMOTE-LSVM: 29 / 17 / 4, p = 2.09E-01
DCNN-SMOTE-SVM & DCNN-SMOTE-kNN: 40 / 2 / 8, p = 2.27E-08
DCNN-SMOTE-SVM & DCNN-SMOTE-RF: 35 / 7 / 8, p = 6.99E-05
DCNN-SMOTE-SVM & DCNN-SMOTE-C4.5: 46 / – / –, p = 6.17E-11

4.4 Conclusion

We have presented a new classification scheme coupling multiple classifiers with SMOTE on features extracted by a DCNN, which tackles both issues of gene expression data. A new DCNN model extracts new features from the original gene expression data, then a SMOTE algorithm generates new data from the DCNN features. These models are used in conjunction with classifiers that efficiently classify gene expression data. From the obtained results, it is observed that DCNN-SMOTE can improve the performance of the SVM, LSVM, RF and kNN algorithms. In addition, the proposed DCNN-SMOTE-SVM approach is the most accurate when compared with the state-of-the-art classification models under consideration.
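Putting the chapter's three phases together, a hypothetical end-to-end run, reusing the extractor and smote_on_features sketches above, could look like this; X_train, y_train, X_test and y_test stand for a dataset split (already reshaped to the DCNN input format) and are not from the dissertation, and the final SVM parameters are placeholders for the tuned values.

    from sklearn.svm import SVC

    # Phase 1: project the raw profiles into the DCNN feature space.
    F_train = extractor.predict(X_train)
    F_test = extractor.predict(X_test)

    # Phase 2: oversample in feature space with SMOTE, labels set by the LSVM.
    F_aug, y_aug = smote_on_features(F_train, y_train, p=100, k=5)

    # Phase 3: train the final classifier on the enlarged feature set.
    clf = SVC(kernel="rbf", C=10.0, gamma="scale").fit(F_aug, y_aug)
    print("accuracy:", clf.score(F_test, y_test))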
CHAPTER 5: ENHANCING DATA MODEL FOR GENE EXPRESSION DATA

5.1 Introduction

In this chapter, we propose an approach for the precise classification of gene expression data in which a GAN generates new training data, after which classifiers learn to classify gene expression data efficiently. This model tackles the small-sample-size issue of gene expression data. Results on 20 low-sample-size and very-high-dimensional gene expression datasets illustrate that our proposal is more accurate than the state-of-the-art classification models, including SVM, LSVM, RF, kNN and C4.5. In addition, the GAN-enhanced data also improve the accuracy of SVM, LSVM, RF, kNN and C4.5. The results of this chapter are published in CT2.

5.2 Methods

The GAN architecture in this approach has two deep-neural-network models: a generator model G and a discriminator model D (Figure 5.1). The aim is to train G to generate new samples that are indistinguishable from the data distribution, while D is optimized to distinguish samples of the real data distribution p_data from those of the generated data distribution p_g. G takes a noise vector z ~ p_z as input and generates samples G(z) with distribution p_g. The generated samples are then sent to D to determine their similarity to the original training data. GAN optimization finds a Nash equilibrium between G and D.

The generator G takes as input a noise vector of 100 random numbers drawn from a uniform distribution, and its output is a gene expression vector. The generator network consists of five hidden layers with the following sizes: 32, 64, 128, 256 and 512. The Tanh activation function is used at the output layer.

Figure 5.1: Architecture of the GAN used to generate gene expression data.

The discriminator network D has a typical neural-network architecture that takes a gene expression vector as input. D consists of five hidden layers with sizes 512, 256, 128, 64 and 32. The sigmoid activation function is used at the output layer.

The proposed algorithm takes advantage of the GAN to enhance the gene expression training data: the low-sample-size problem of gene expression data classification is solved by generating new data to enlarge the gene expression datasets. The algorithm performs the training task in two main phases. First of all, our model uses the GAN to generate new samples from the original gene expression data; the LSVM is then used to label the generated samples, with constant C = 10^3. The amount of new samples (p) is a hyper-parameter of the algorithm. Secondly, our model adds the new data to the training set, on which the classifiers (non-linear SVM, LSVM, kNN, RF and C4.5) learn.
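The sketch below renders this generator/discriminator pair in Keras with the layer sizes quoted above (100-dimensional uniform noise, a 32 to 512 generator, a 512 to 32 discriminator, Tanh and sigmoid outputs). It is our own minimal rendering of the description: the hidden-layer activations are assumed, the adversarial training loop is omitted, and n_genes stands for the dataset's feature count (e.g. 20,531 for the RNA-Seq data).

    from tensorflow.keras import layers, models

    def build_generator(noise_dim=100, n_genes=20531):
        # Five hidden layers, 32 -> 512, Tanh at the output (Section 5.2).
        g = models.Sequential([layers.Input(shape=(noise_dim,))])
        for units in (32, 64, 128, 256, 512):
            g.add(layers.Dense(units, activation="relu"))  # hidden activation assumed
        g.add(layers.Dense(n_genes, activation="tanh"))    # generated expression vector
        return g

    def build_discriminator(n_genes=20531):
        # Five hidden layers, 512 -> 32, sigmoid output scoring real vs. generated.
        d = models.Sequential([layers.Input(shape=(n_genes,))])
        for units in (512, 256, 128, 64, 32):
            d.add(layers.Dense(units, activation="relu"))  # hidden activation assumed
        d.add(layers.Dense(1, activation="sigmoid"))
        return d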
5.3 Evaluation

The classification results on the 20 DNA Microarray datasets are shown in the full-text dissertation. Table 5.1 summarizes the results of these statistical tests with the paired Student ratio test.

Table 5.1: Accuracy comparison between these models on 20 DNA Microarray datasets.

Mean accuracy (%): SVM 73.68, GAN-SVM 78.63; LSVM 75.25, GAN-LSVM 77.16; kNN 66.85, GAN-kNN 71.58; RF 75.45, GAN-RF 76.75; C4.5 68.19, GAN-C4.5 74.72.

Win / Tie / Defeat and p-value:
GAN-SVM & SVM: 19 / 0 / 1, p = 1.95E-02
GAN-LSVM & LSVM: 17 / 0 / 3, p = 1.31E-02
GAN-RF & RF: 11 / 1 / 8, p = 0.04
GAN-kNN & kNN: 17 / 1 / 2, p = 4.80E-04
GAN-C4.5 & C4.5: 16 / – / –, p = 9.71E-04
GAN-SVM & GAN-LSVM: 13 / – / –, p = 1.15E-02
GAN-SVM & GAN-RF: 12 / – / –, p = 1.10E-02
GAN-SVM & GAN-kNN: 18 / – / –, p = 2.59E-08
GAN-SVM & GAN-C4.5: 15 / – / –, p = 5.07E-07
GAN-SVM & LSVM: 19 / 0 / 1, p = 9.18E-06
GAN-SVM & RF: 13 / 0 / 7, p = 1.10E-02
GAN-SVM & kNN: 19 / 0 / 1, p = 2.59E-08
GAN-SVM & C4.5: 19 / 0 / 1, p = 5.07E-07

The GAN parameters include the number of samples generated by the GAN and the number of epochs; the epoch parameter was tuned from 50 to 100 to find the best experimental results. The LSVM used C = 10^5 to set the labels of the generated data. Finally, the parameters C and γ of the RBF kernel were tuned to obtain good accuracy for the non-linear SVM; the best parameters are shown in the full-text dissertation.

Table 5.1 shows that GAN-[SVM, LSVM, kNN, RF, C4.5] significantly increases the mean accuracy by 4.95, 1.91, 4.73, 1.3 and 6.53 percentage points compared to SVM, LSVM, kNN, RF and C4.5, respectively; all p-values are less than 0.05. These results show the effectiveness of the GAN in improving the accuracy of the SVM, LSVM, RF, C4.5 and kNN classifiers. In detail, GAN-SVM has 19 wins, 0 ties and 1 defeat (p-value = 1.95E-02) against SVM; GAN-LSVM has 17 wins, 0 ties and 3 defeats (p-value = 1.31E-02) compared with LSVM; in the comparison with kNN, GAN-kNN has 17 wins, 1 tie and 2 defeats (p-value = 4.80E-04); and GAN-C4.5 has 16 wins (p-value = 9.71E-04) against C4.5. In addition, we focused on the classification performance of GAN-SVM compared with the four other methods GAN-kNN, GAN-C4.5, GAN-LSVM and GAN-RF: GAN-SVM shows the best performance, and all p-values are less than 0.05.

5.4 Conclusion

A new GAN-based method was proposed to improve the accuracy of gene expression data classification. The approach uses the GAN to generate new samples from the original gene expression datasets, and then SVM, LSVM, RF, C4.5 and kNN are used as the classification models. From the obtained results, it is observed that the GAN can improve the performance of the SVM, LSVM, RF, C4.5 and kNN algorithms, and that the GAN-SVM model is more accurate than the state-of-the-art classifiers, including kNN, SVM, LSVM, C4.5 and RF.

CHAPTER 6: ENSEMBLE RANDOM OBLIQUE DECISION STUMPS

6.1 Introduction

In this chapter, we investigate random oblique decision stumps (RODS), which are suitable for classifying microarray gene expression data. Our proposed algorithms (called Bag-RODS and Boost-RODS) train multiple RODS classifiers in the way of bagging and boosting to form an ensemble of classifiers that is more accurate than a single model. The main idea is to improve the strength of decision stumps (used as "weak learners") with multivariate node splitting based on the linear SVM. Numerical test results on 50 microarray gene expression datasets show that our algorithms Bag-RODS and Boost-RODS are more accurate than kNN, decision trees, SVM and ensembles of decision trees including random forests, bagging and AdaBoost. In addition, combining these models with the GAN and DCNN further improves the classification accuracy on gene expression data. The results of this chapter are published in CT5.

6.2 Ensemble random oblique decision stumps

A decision stump, used as a weak classifier in the boosting approach, is a one-level decision tree consisting of the root node directly connected to the terminal nodes. The decision stump learning algorithm selects a single attribute for node splitting, as done by decision tree algorithms; its strength is thus reduced, particularly when dealing with datasets having dependencies among attributes. Our random oblique decision stumps (RODS) algorithm learns a linear SVM to perform the oblique decision stump on n' attributes randomly sampled from the n original attributes.

Bagging of random oblique decision stumps (denoted by Bag-RODS, illustrated in Figure 6.1) constructs a collection of RODS. The Bag-RODS learning algorithm is described as follows (a code sketch follows after the figure):

• A bootstrap learning set t is created by randomly sampling, with replacement, m individuals from the original training set D.

• A linear SVM is trained to perform the oblique decision stump on n' attributes randomly sampled from the n original attributes of bootstrap t.

• The classification of a new individual x uses a majority vote over the prediction results of the RODS classifiers.

Figure 6.1: Bagging of random oblique decision stumps (a bootstrap sample of m individuals is drawn from the training set for each of the t stumps; each RODS uses a linear SVM on n' random dimensions at the root to perform an oblique split; the predicted class of x is the majority class in {ŷ1(x), ŷ2(x), ..., ŷt(x)}).

Our Bag-RODS algorithm not only improves the strength of the weak classifiers with oblique splitting but also keeps high diversity between them, as is done with usual bagging and with the random attributes used in oblique splitting.
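As referenced above, a compact scikit-learn rendering of Bag-RODS could look like the following: each stump is a LinearSVC trained on a bootstrap sample restricted to n' randomly chosen attributes, and prediction is a plain majority vote. This is our own sketch of the described procedure, not the authors' implementation; the default C follows the value 10,000 reported for Bag-RODS in Section 6.3.

    import numpy as np
    from scipy import stats
    from sklearn.svm import LinearSVC

    class BagRODS:
        """Bagging of random oblique decision stumps (sketch)."""
        def __init__(self, n_stumps=200, n_dim=1000, C=1e4, seed=0):
            self.n_stumps, self.n_dim, self.C = n_stumps, n_dim, C
            self.rng = np.random.default_rng(seed)
            self.stumps = []            # list of (attribute subset, fitted linear SVM)

        def fit(self, X, y):
            m, n = X.shape
            for _ in range(self.n_stumps):
                rows = self.rng.integers(0, m, size=m)      # bootstrap with replacement
                dims = self.rng.choice(n, size=min(self.n_dim, n), replace=False)
                svm = LinearSVC(C=self.C).fit(X[rows][:, dims], y[rows])
                self.stumps.append((dims, svm))
            return self

        def predict(self, X):
            votes = np.array([svm.predict(X[:, dims]) for dims, svm in self.stumps])
            # Majority vote across the stumps for each individual.
            return stats.mode(votes, axis=0, keepdims=False).mode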
We have also applied the boosting framework to random oblique decision stumps (denoted by Boost-RODS, illustrated in Figure 6.2). Boost-RODS calls the RODS learning algorithm repeatedly, t times, so that each boosting step concentrates mostly on the errors produced by the previous steps. The classification of a new individual x is a weighted majority vote over the prediction results of the RODS classifiers.

Figure 6.2: Boosting of random oblique decision stumps (a weighted sample of m individuals is drawn from the training set at each step; each RODS uses a linear SVM on n' random dimensions at the root to perform an oblique split; predictions on the learning sample are used to update the weights; the predicted class of x is the majority class in {β1·ŷ1(x), β2·ŷ2(x), ..., βt·ŷt(x)}).

The RODS model given by the large-margin solution of the SVM can improve the generalization capacity of Boost-RODS against over-fitting. The key idea is to set the cost constant C (a trade-off between the margin size and the errors) to a small value (e.g. 10) in the SVM learning tasks.
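For illustration, RODS stumps can be dropped into a standard AdaBoost-style loop; the sketch below is ours, written under the assumptions of a binary problem with labels in {-1, +1} and the small C = 10 suggested above, and shows one way the βt weights of Figure 6.2 could be computed.

    import numpy as np
    from sklearn.svm import LinearSVC

    def boost_rods(X, y, t=100, n_dim=1000, C=10.0, seed=0):
        """AdaBoost-style Boost-RODS sketch for labels y in {-1, +1}."""
        rng = np.random.default_rng(seed)
        m, n = X.shape
        w = np.full(m, 1.0 / m)                   # weights over the m individuals
        ensemble = []                             # (beta, dims, stump) triples
        for _ in range(t):
            dims = rng.choice(n, size=min(n_dim, n), replace=False)
            stump = LinearSVC(C=C).fit(X[:, dims], y, sample_weight=w)
            pred = stump.predict(X[:, dims])
            err = np.sum(w[pred != y]) / np.sum(w)
            if err >= 0.5:                        # weak learner no better than chance
                break
            beta = 0.5 * np.log((1 - err) / max(err, 1e-10))
            w *= np.exp(-beta * y * pred)         # re-weight: focus on mistakes
            w /= w.sum()
            ensemble.append((beta, dims, stump))
        return ensemble

    def predict_boost(ensemble, X):
        # Weighted majority vote in {beta_1*y_1(x), ..., beta_t*y_t(x)}.
        score = sum(beta * stump.predict(X[:, dims]) for beta, dims, stump in ensemble)
        return np.sign(score)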
6.3 Evaluation

We are interested in the performance of the random oblique decision stump ensembles (Bag-RODS and Boost-RODS) for classifying microarray gene expression data. We have implemented Bag-RODS and Boost-RODS in Python using the Scikit-learn library. Algorithms such as random forests, decision trees (C4.5), bagging of C4.5 and AdaBoost from the Scikit-learn library, together with the highly efficient standard linear SVM, are used as baselines. The DCNN and GAN are implemented in Python using TensorFlow. We tune the parameters of Bag-RODS and Boost-RODS, including the size of the ensemble (number of stumps) and the number of random attributes (Dim), from 20 to 6,000, for the RODS at each root node. We varied the number of RODS from 25 to 500 to find the best experimental results. Furthermore, we also tuned the cost constant C of the linear SVMs for the random oblique stumps to obtain good accuracy; the best cost constants C are 10,000 and 10 for Bag-RODS and Boost-RODS, respectively. The best parameters are shown in the full-text dissertation.

6.3.1 Classification results on the original dimension

The classification results on the 50 DNA Microarray datasets are shown in the full-text dissertation. Table 6.1 summarizes the results of these statistical tests with the paired Student ratio test.

Table 6.1: Summary of the accuracy comparison.

Mean accuracy (%): Bag-RODS 86.84; Boost-RODS 86.27; SVM 83.34; LSVM 84.64; RF 82.62; kNN 78.77; C4.5 75.70; Bag-C4.5 81.04; AdaBoost 77.50.

Win / Tie / Defeat and p-value:
Bag-RODS & SVM: 28 / 12 / 10, p = 3.25E-03
Bag-RODS & LSVM: 32 / 10 / 8, p = 3.05E-03
Bag-RODS & RF: 39 / 3 / 8, p = 6.22E-07
Bag-RODS & kNN: 44 / 3 / 3, p = 3.06E-08
Bag-RODS & C4.5: 46 / 1 / 3, p = 3.21E-12
Bag-RODS & Bag-C4.5: 41 / 2 / 7, p = 6.29E-07
Bag-RODS & AdaBoost: 43 / 2 / 5, p = 1.19E-08
Boost-RODS & SVM: 22 / 5 / 23, p = 1.64E-01
Boost-RODS & LSVM: 41 / 5 / 4, p = 8.63E-07
Boost-RODS & RF: 42 / 5 / 3, p = 3.06E-08
Boost-RODS & kNN: 43 / 5 / 2, p = 3.03E-08
Boost-RODS & C4.5: 47 / – / –, p = 1.26E-12
Boost-RODS & Bag-C4.5: 41 / – / –, p = 6.19E-06
Boost-RODS & AdaBoost: 44 / – / –, p = 2.46E-08
Bag-RODS & Boost-RODS: 27 / 7 / 16, p = 0.16

On the one hand, Table 6.1 shows that Bag-RODS significantly improves the mean accuracy by 8.07, 11.14, 3.5, 2.2, 4.22, 5.80 and 9.34 percentage points compared to kNN, C4.5, SVM, LSVM, RF, Bag-C4.5 and AdaBoost, respectively; all p-values are less than 0.05. On the other hand, Table 6.1 also shows that Boost-RODS improves the mean accuracy by 7.5, 10.57, 1.63, 3.65, 5.23 and 8.77 percentage points over kNN, C4.5, LSVM, RF, Bag-C4.5 and AdaBoost, respectively; these improvements are significant, with p-values less than 0.05. In the comparison between Bag-RODS and Boost-RODS, Bag-RODS is slightly superior, with 27 wins, 7 ties and 16 defeats (p-value = 0.16, not significantly different).

6.3.2 Classification results after enhancing data using GAN

Experiments are conducted on the 20 very-small-sample-size datasets (fewer than 130 samples). Firstly, we use the GAN to generate new samples from the original gene expression data; then the LSVM is used to label the generated samples, with constant C = 10^3. Next, our model adds the new training data, on which Bag-RODS and Boost-RODS learn to classify gene expression data efficiently. The classification results are shown in the full-text dissertation; Table 6.2 summarizes the results of these statistical tests with the paired Student ratio test.

Table 6.2 shows that GAN-Bag-RODS increases the mean accuracy by 1.9 percentage points compared to Bag-RODS, and GAN-Boost-RODS by 1.39 points compared to Boost-RODS; all p-values are less than 0.05. These results show the effectiveness of the GAN in improving the accuracy of the Bag-RODS and Boost-RODS classifiers. The accuracies of GAN-Bag-RODS and GAN-Boost-RODS are higher than those of GAN-kNN, GAN-RF and GAN-C4.5 on the 20 very-small-sample-size gene expression datasets, and they provide competitive results with GAN-SVM and GAN-LSVM.

Table 6.2: Comparison between Bag-RODS, Boost-RODS and GAN-Bag-RODS, GAN-Boost-RODS.

Mean accuracy (%): Bag-RODS 76.35, GAN-Bag-RODS 78.25; Boost-RODS 77.42, GAN-Boost-RODS 78.81; GAN-SVM 78.63; GAN-LSVM 77.16; GAN-kNN 71.58; GAN-RF 76.75; GAN-C4.5 74.72.

Win / Tie / Defeat and p-value:
GAN-Bag-RODS & Bag-RODS: 14 / 4 / 2, p = 0.009
GAN-Boost-RODS & Boost-RODS: 12 / 4 / 4, p = 0.045
GAN-Bag-RODS & GAN-Boost-RODS: 7 / 2 / 11, p = 0.5
GAN-Bag-RODS & GAN-SVM: – / – / –, p = 0.39
GAN-Bag-RODS & GAN-LSVM: – / – / –, p = 0.09
GAN-Bag-RODS & GAN-kNN: – / – / –, p = 0.00013
GAN-Bag-RODS & GAN-RF: – / – / –, p = 0.16
GAN-Bag-RODS & GAN-C4.5: – / – / –, p = 0.032
GAN-Boost-RODS & GAN-SVM: 12 / 0 / 8, p = 0.083
GAN-Boost-RODS & GAN-LSVM: 15 / 0 / 5, p = 0.07
GAN-Boost-RODS & GAN-kNN: 19 / – / –, p = 0.00014
GAN-Boost-RODS & GAN-RF: 13 / 4 / 3, p = 0.043
GAN-Boost-RODS & GAN-C4.5: 16 / 4 / 0, p = 0.026

6.3.3 Classification results based on features extracted by DCNN

Experiments are conducted on the 50 very-high-dimensional datasets; the classification results are shown in the full-text dissertation. Firstly, we use the DCNN to extract features from the original gene expression data, and then Bag-RODS and Boost-RODS to classify them. Table 6.3 summarizes the results of these statistical tests with the paired Student ratio test. Table 6.3 shows that DCNN-Bag-RODS and DCNN-Boost-RODS increase the mean accuracy by 1.61% and 1.62% compared to Bag-RODS and Boost-RODS, respectively; all p-values are less than 0.05. These results show the effectiveness of the DCNN in improving the accuracy of the Bag-RODS and Boost-RODS classifiers.
These results also show that the accuracies of DCNN-Bag-RODS and DCNN-Boost-RODS are higher than those of DCNN-LSVM, DCNN-kNN, DCNN-RF and DCNN-C4.5 on the 50 very-high-dimensional gene expression datasets, and that they provide competitive results with DCNN-SVM.

Table 6.3: Comparison between Bag-RODS, Boost-RODS and DCNN-Bag-RODS, DCNN-Boost-RODS.

Mean accuracy (%): Bag-RODS 76.35, DCNN-Bag-RODS 88.44; Boost-RODS 77.42, DCNN-Boost-RODS 87.89; DCNN-SVM 87.19; DCNN-LSVM 86.45; DCNN-kNN 81.45.

Win / Tie / Defeat and p-value:
DCNN-Bag-RODS & Bag-RODS: 23 / 13 / 14, p = 0.017
DCNN-Boost-RODS & Boost-RODS: 27 / – / –, p = 0.046
DCNN-Bag-RODS & DCNN-Boost-RODS: 21 / – / –, p = 0.133
DCNN-Bag-RODS & DCNN-SVM: 20 / 14 / 16, p = 6.95E-02
DCNN-Bag-RODS & DCNN-LSVM: 32 / 10 / 8, p = 7.46E-03
DCNN-Bag-RODS & DCNN-kNN: 42 / – / –, p = 7.31E-08
DCNN-Boost-RODS & DCNN-SVM: 15 / 15 / 20, p = 3.70E-01
DCNN-Boost-RODS & DCNN-LSVM: 27 / 11 / 12, p = 8.09E-02
DCNN-Boost-RODS & DCNN-kNN: 41 / – / –, p = 5.10E-08

6.4 Conclusion

In short, we have presented random oblique decision stump ensembles to efficiently classify very-high-dimensional microarray gene expression data. The main idea is to use the linear SVM to perform random oblique decision stumps for ensembles of classifiers in the way of bagging and boosting. Numerical test results show that our algorithms Bag-RODS and Boost-RODS are more accurate than the state-of-the-art classification models, including kNN, SVM, C4.5, random forests, and bagging and AdaBoost of C4.5. Besides, our models also further improve classification accuracy when combined with the GAN-based data enhancement model and the DCNN-based feature extraction model.

CHAPTER 7: CONCLUSION AND FUTURE WORKS

7.1 Results of the study

The dissertation proposed a new DCNN model to extract features from gene expression data; the new features are classified using SVM, LSVM and kNN. Experiments show that this model performs comparatively well on both DNA Microarray and RNA-Seq gene expression data.

It also proposed a new classification algorithm coupling multiple classifiers with SMOTE on features extracted by the DCNN, which tackles both issues of gene expression data. From the experiments, it is observed that DCNN-SMOTE can improve the performance of the SVM, LSVM, RF and kNN algorithms.

A new approach was proposed in the dissertation that uses the GAN to generate new samples from the original datasets, after which classifiers are trained on the enlarged data. From the classification results, it is clear that the GAN can improve the accuracy of the SVM, LSVM, RF and kNN algorithms.

Two new algorithms, Bag-RODS and Boost-RODS, were proposed in the dissertation to classify gene expression data efficiently. The main idea is to use the linear SVM to perform random oblique decision stumps for ensembles of classifiers in the way of bagging and boosting. Experiments show that our models are more accurate than the state-of-the-art models. In addition, combining these models with the GAN and DCNN further improves the classification accuracy on gene expression data.

7.2 Future works

Firstly, we aim to combine the GAN and DCNN models to improve feature extraction performance. In addition, we plan to build new, larger CNN architectures to improve classification accuracy, taking advantage of the GAN-generated data as well as multiple datasets. Moreover, we intend to build a new model that can learn from various data types, such as gene expression, RNA-Seq gene expression, SNP, medical images and clinical data.