
Microarray Data Analysis Tool (MAT)


MICROARRAY DATA ANALYSIS TOOL (MAT)

A Thesis Presented to The Graduate Faculty of The University of Akron

In Partial Fulfillment of the Requirements for the Degree Master of Science

Sudarshan Selvaraja
December, 2008

Thesis Approved:
  Advisor: Dr. Zhong-Hui Duan
  Committee Member: Dr. Yingcai Xiao
  Committee Member: Dr. Xuan-Hien Dang

Accepted:
  Department Chair: Dr. Wolfgang Pelz
  Dean of the College: Dr. Ronald F. Levant
  Dean of the Graduate School: Dr. George R. Newkome

ABSTRACT

Microarray technology is widely used by biologists to probe for the presence of genes in a sample of DNA or RNA. With this technology, oligonucleotide probes can be immobilized on a microarray chip in a massively parallel fashion, which allows biologists to check the expression levels of thousands of genes at once. This thesis develops a software system that includes a database repository to store different microarray datasets and a microarray data analysis tool for analyzing the stored data. The repository currently accepts datasets in GenepixPro format, although it can be expanded to include datasets in other formats. The user interface of the repository allows users to conveniently upload data files and perform their preferred data preprocessing and analysis. The analysis methods implemented include the traditional k-nearest neighbor (kNN) method and two new kNN methods developed in this study. Additional analysis methods can be added by future developers. The system was tested using a set of microRNA gene expression data. The design and implementation of the software tool are presented in the thesis along with the testing results from the microRNA dataset.
The results indicate that the new weighted kNN method proposed in this study outperforms the traditional kNN method and the proposed mean method. We conclude that the system developed in the thesis effectively provides a structured microarray data repository, a flexible graphical user interface, and rational data mining methods.

ACKNOWLEDGEMENTS

I would like to thank my advisor, Dr. Zhong-Hui Duan, for giving me the opportunity to work on this project for my Master's thesis. I was motivated to choose this topic after I took the Introduction to Bioinformatics course. I would like to thank her for her invaluable suggestions and steady guidance during the entire course of the project. I am thankful to my committee members, Dr. Yingcai Xiao and Dr. Xuan-Hien Dang, for their guidance, invaluable suggestions and time. I would like to thank my friends Shanth Anand and Prashanth Puliyadi for helping me pursue a Master's degree and change my career path. I couldn't have achieved this without their help. I would like to thank my friend Manik Dhawan for his guidance in writing and formatting this report. Finally, I would like to express my gratitude to my parents and all my family members, who were always there for me, cheering me on in all situations and taking a great interest in my venture.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER
I. INTRODUCTION
  1.1 Introduction to Bioinformatics
  1.2 Introduction to Microarray Technology
    1.2.1 Genepix Experiment Procedural
  1.3 Applications of Microarrays
  1.4 Need for Automated Analysis
  1.5 Knowledge Discovery in Data
    1.5.1 KDD Steps
  1.6 Classification
    1.6.1 General Approach
    1.6.2 Decision Trees
    1.6.3 k-Nearest Neighbor Classifiers
  1.7 Outline of the Current Study
II.
LITERATURE REVIEW
  2.1 Previous Work
  2.2 Existing Tools for Normalizing GPR Datasets
  2.3 Stanford Microarray Database (SMD)
  2.4 Microarray Tools
  2.5 Available Source for Microarray Data
III. MATERIALS AND METHODS
  3.1 Database Design
    3.1.1 Schema Design
    3.1.2 Table Details
    3.1.3 Attributes
  3.2 Description of Genepix Data Format
    3.2.1 Features and Blocks
    3.2.2 Sample Dataset
    3.2.3 Transferring Genepix Dataset to Database
  3.3 Data Selection
    3.3.1 Creation of Training and Testing Dataset
  3.4 Preprocessing
    3.4.1 Preprocessing in MAT
  3.5 Normalization
  3.6 Feature Selection
    3.6.1 Student T-Test
    3.6.2 Implementation of T-Test in MAT
  3.7 Classification
    3.7.1 Classical kNN Method
    3.7.2 Weighted kNN Method
    3.7.3 Mean kNN Method
IV. RESULTS AND DISCUSSIONS
  4.1 A Case Study
  4.2 Results
  4.3 Discussion
V. CONCLUSIONS AND FUTURE WORK
  5.1 Conclusion
  5.2 Future Work
REFERENCES
APPENDICES
  APPENDIX A COPYRIGHT PERMISSION FOR FIGURE 1.2
  APPENDIX B PERL SCRIPT FOR T-TEST - TTEST.PL
  APPENDIX C CLASSIFICATION ALGORITHMS

LIST OF TABLES

  1.1 Confusion matrix for a 2-class problem
  1.2 Software used
  2.1 List of microarray tools
  2.2 Available source for microarray data
  3.1 Tables used in MAT
  3.2 Attributes and their description
  3.3 List of default choices for feature selection
  4.1 Training and testing samples – Experiment 1
  4.2 Accuracy of three classification methods for different N features
  4.3 Training and testing samples – Experiment 2
  4.4 Accuracy of three classification methods for different N features

LIST OF FIGURES

  1.1 Schematic view of a typical microarray experiment
  1.2 Genepix experimental procedure
  1.3 Overview of KDD process
  1.4 Mapping an input attribute set x into its class label y
  1.5 A decision tree for the mammal classification problem
  1.6 Classifying an unlabeled vertebrate
  1.7 Schematic representation of k-NN classifier
  1.8 System diagram
  1.9 Application flow diagram
  2.1 Sketch of the ProGene algorithm
  3.1 Database schema
  3.2 Genepix_version table design
  3.3 Genepix_header table design
  3.4 Genepix_sequence table design
  3.5 Hypothetical arrays of blocks
  3.6 Sample dataset
  3.7 Creation of repository
  3.8 Selection of datasets
  3.9 Temporary table names for training and testing datasets
  3.10 Flowchart – Creation of dataset
  3.11 Replication of gene
  3.12 Sample training dataset with median intensity values
  3.13 Preprocessing in MAT
  3.14 T-Test formulas
  3.15 Calculated p-values for the genes
  3.16 Pseudo code of kNN classical method
  3.17 Pseudo code of kNN mean method
  4.1 Training samples selected for the experiment
  4.2 Testing samples selected for the experiment
  4.3 Attribute selection and constraint specification for normalization
  4.4 Training datasets
  4.5 Testing datasets
  4.6 Feature selection and normalization

[...]

... (Table 2.1 fragment, descriptions with providers where recoverable)
  Database: integrated toolset for data analysis and comparison (National Center for Genome Resources)
  Analysis of high density microarrays and gene chips (Applied Maths)
  A Gene Expression Data Analysis and Management Tool
  Extracting and visualizing patterns in large multivariate data
  Analysis and Visualization of Microarray Data
  Software for displaying and manipulating hierarchical clustered data
  Web ...

... Perl 5.0004_04 or later.

2.4 Microarray Tools

A list of available tools for working with microarray datasets is given in Table 2.1. Each tool performs different data mining tasks.

Table 2.1 List of microarray tools [9]
  Columns: Program; Description; Provider; Platform
  Array Designer [13] (software from other sources): Tool assisting in primer design for microarray construction
  Set of analysis tools using advanced algorithms ...

... The objective of this study is to create a database repository to store different microarray datasets and to create a microarray analysis tool (MAT) that can be used for the analysis of gene expression. The tool has been designed so that it follows the KDD steps. The database repository currently accepts Genepix datasets, although it can be expanded to include different formats. The analysis methods implemented include ...
... dependent. We can analyze GPR datasets by uploading them directly.

2.3 Stanford Microarray Database (SMD)

The Stanford Microarray Database serves as a microarray database for researchers and collaborators. It allows public login for data viewing and analysis. In addition, SMD functions as a resource for the entire scientific community by allowing users to download datasets, perform analysis, download source ...

... for Microarray Data

A few sources of microarray data that are available online are tabulated below. These sources provide different types of microarray datasets from different biological experiments.

Table 2.2 Available source for microarray data
  Name: National Center for Biotechnology Information. URL: http://www.ncbi.nlm.nih.gov/geo/
  Name: Stanford Microarray Database
  Name: University of Pittsburgh Microarray ...

... such toxicants [2].

1.4 Need for Automated Analysis

The intrinsic problem of a typical dataset produced by microarrays is its sample size and high dimensionality. A dataset created by GenePix Pro contains various measures for thousands of genes, so the samples cannot realistically be analyzed by hand. In this study we propose a microarray analysis tool (MAT) with the ability of appropriately ...

... (Table 2.1 fragment, descriptions)
  ... construction
  Set of analysis tools using advanced algorithms to reveal the true structure of gene expression data
  Identification of statistically significant hybridization signals
  Bayesian Analysis of Gene Expression Levels: a program for the statistical analysis of spotted microarray data
  Microarray database and analysis platform
  Provider and platform entries, in source order: Premier Biosoft International; JAVA; Optimal Design, Sprl; Windows, MacOS; National ...
... (Table 2.2 fragment)
  ... Microarray Dataset Collection: http://bioinformatics.upmc.edu/Help/UPITTGED.html
  Kent Ridge Bio-medical Data Set Repository: http://sdmc.lit.org.sg/GEDatasets/Datasets.html
  http://genome-www5.stanford.edu/

CHAPTER III
MATERIALS AND METHODS

3.1 Database Design

A database is a structured collection of records or data that is stored in a computer system. The structure is achieved by organizing the data according ...

Table 2.1 List of microarray tools [9] (Continued)
  Programs listed: GeneSifter [23]; GeneX [24]; GenMaths [25]; Genowiz™ [26]; Partek Pattern Recognition [27]; TIGR MultiExperiment Viewer [28]; TreeArrange and Treeps [29]
  GeneSifter [23]: The GeneSifter microarray data analysis system provides access to powerful statistical tools through a web interface, with integrated features for determining the biological significance of the data. Provider: GeneSifter ...

  ... [19]: Analysis & clustering of gene expression data. Provider: European Bioinformatics Institute (EBI). Platform: Web.
  GEDA [20]: Gene expression data analysis and simulation tools, offering a variety of options for processing and analyzing results. Provider: University of Pittsburgh and UPMC. Platform: Web.
  GeneCluster [21]: Self-organizing maps. Provider: Whitehead Institute/MIT Center for Genome Research. Platform: JAVA, Windows NT.
  GenMAPP [22]: Tools for visualizing data ...

... a microarray analysis tool (MAT) with the ability to appropriately represent new methods of classification and find new classes. The tool follows the knowledge discovery in data ...
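
The abstract names a traditional kNN classifier and a weighted kNN variant, but this excerpt does not include their pseudo code. As a rough illustration only, the Python sketch below contrasts a majority-vote kNN with an inverse-distance-weighted kNN. The inverse-distance weighting is a common textbook variant assumed here and may differ from the weighting actually proposed in the thesis; the function names, the class labels "A" and "B", and the toy expression vectors are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(train, query, k):
    """Return the k (distance, label) pairs closest to the query sample."""
    scored = sorted((euclidean(features, query), label) for features, label in train)
    return scored[:k]


def knn_classic(train, query, k=3):
    """Classical kNN: majority vote among the k nearest training samples."""
    votes = Counter(label for _, label in nearest(train, query, k))
    return votes.most_common(1)[0][0]


def knn_weighted(train, query, k=3, eps=1e-9):
    """Weighted kNN (assumed scheme): each neighbor votes with weight 1/(distance + eps)."""
    scores = defaultdict(float)
    for dist, label in nearest(train, query, k):
        scores[label] += 1.0 / (dist + eps)
    return max(scores, key=scores.get)


# Hypothetical toy data: (expression values for two selected genes, class label).
TRAIN = [
    ((0.1, 0.0), "B"),
    ((1.0, 0.0), "A"),
    ((0.0, 1.0), "A"),
    ((3.0, 3.0), "B"),
]

if __name__ == "__main__":
    query = (0.0, 0.0)
    print("classic :", knn_classic(TRAIN, query, k=3))
    print("weighted:", knn_weighted(TRAIN, query, k=3))
```

On this toy data the two rules disagree: two of the three nearest neighbors are labeled "A", so the majority vote picks "A", while the single very close "B" sample dominates the weighted vote. That is the kind of case in which a distance-weighted rule can behave differently from the classical one.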

Date posted: 30/10/2014, 20:10