CSC – Scientific Computing Ltd is a non-profit organization for high-performance computing and networking in Finland. CSC is owned by the Ministry of Education. CSC runs a national large-scale facility for computational science and engineering and supports the university and research community. CSC is also responsible for the operations of the Finnish University and Research Network (Funet).

All rights reserved. The PDF version of this book or parts of it can be used in Finnish universities as course material, provided that this copyright notice is included. However, this publication may not be sold or included as part of other publications without permission of the publisher.

© The authors and CSC – Scientific Computing Ltd 2005

Second edition
ISBN 952-5520-11-0 (print)
ISBN 952-5520-12-9 (PDF)
http://www.csc.fi/oppaat/siru/
http://www.csc.fi/molbio/arraybook/
Printed at Picaset Oy, Helsinki 2005

DNA microarray data analysis

Preface

This is the second, revised and slightly expanded edition of the DNA microarray data analysis guidebook. As a change from the previous edition, some relatively quickly changing material, such as software tutorials, has been published exclusively on the book's web site. Please see http://www.csc.fi/molbio/arraybook/ for more information and to access the extra material.

DNA microarrays generate large amounts of numerical data, which should be analyzed effectively. In this book, we hope to offer a broad view of the basic theory and techniques behind DNA microarray data analysis. Our aim was not to be comprehensive, but rather to cover the basics, which are unlikely to change much over the years. In particular, we hope that researchers starting their data analysis can benefit from the book.

The text emphasizes gene expression analysis. Topics such as genotyping are discussed only briefly. This book does not cover wet-lab practices, such as sample preparation or hybridization. Rather, we start when the microarrays have been scanned, and the resulting images are being
analyzed. We also take the files with signal intensities, which usually raise questions such as "How is the data normalized?" or "How do I identify the genes which are upregulated?", and provide some simple solutions to these specific questions and many others.

Each chapter has a section on suggested reading, which introduces some of the relevant literature. Some chapters have additional information available on the web. In the first edition the software examples were included in the book, but we have now moved them to the Internet. This allows us to better keep the material up to date.

Juha Haataja and Leena Jukka are warmly acknowledged for their support during the production of this book.

We are very interested in receiving feedback about this publication. In particular, if you feel that some essential technique has been missed, let us know. Please send your comments to the e-mail address Jarno.Tuimala@csc.fi.

Espoo, 23rd December 2005
The authors

List of Contributors

Iiris Hovatta, National Public Health Institute, Haartmaninkatu, FI-00290 Helsinki, Finland
Juha Saharinen, National Public Health Institute, Haartmaninkatu, FI-00290 Helsinki, Finland
Katja Kimppa, PerkinElmer Life Sciences and Analytical Sciences - Wallac Oy, P.O.Box 10, FI-20101 Turku, Finland
Pekka Tiikkainen, VTT, P.O.Box 106, FI-20521 Turku, Finland
M. Minna Laine, CSC, the Finnish IT center for science, Keilaranta 14, FI-02101 Espoo, Finland
Antti Lehmussola, Tampere University of Technology, P.O.Box 553, FI-33101 Tampere, Finland
Tomi Pasanen, University of Helsinki, P.O.Box 68, FI-00014 University of Helsinki, Finland
Janna Saarela, Biomedicum Biochip Center, Haartmaninkatu, FI-00290 Helsinki, Finland
Ilana Saarikko, University of Helsinki, P.O.Box 68, FI-00014 University of Helsinki, Finland
Teemu Toivanen, Centre for Biotechnology, Tykistökatu, FI-20521 Turku, Finland
Martti Tolvanen, Institute of Medical Technology, Biokatu, FI-33520 Tampere, Finland
Jarno Tuimala, CSC, the Finnish IT center for science, Keilaranta
14, FI-02101 Espoo, Finland
Mauno Vihinen, Institute of Medical Technology, Biokatu, FI-33520 Tampere, Finland
Garry Wong, A. I. Virtanen Institute, University of Kuopio, FI-70211 Kuopio, Finland

Contents

8.9.9 Empirical p-value
8.10 Analysis of variance
8.10.1 Basics of ANOVA
8.10.2 Completely randomized experiment
8.10.3 One-way and two-way ANOVA
8.10.4 Mixed model ANOVA
8.10.5 Experimental design and ANOVA
8.10.6 Balanced design
8.10.7 Common reference and loop designs
8.11 Error models
8.12 Examples
8.13 Suggested reading

II Analysis

9 Preprocessing of data
9.1 Rationale for preprocessing
9.2 Missing values
9.3 Checking the background reading
9.4 Calculation of expression change
9.4.1 Intensity ratio
9.4.2 Log ratio
9.4.3 Variance stabilizing transformation
9.4.4 Fold change
9.5 Handling of replicates
9.5.1 Types of replicates
9.5.2 Time series
9.5.3 Case-control studies
9.5.4 Power analysis
9.5.5 Averaging replicates
9.6 Checking the quality of replicates
9.6.1 Quality check of replicate chips
9.6.2 Quality check of replicate spots
9.6.3 Excluding bad replicates
9.7 Outliers
9.8 Filtering bad data
9.9 Filtering uninteresting data
9.10 Simple statistics
9.10.1 Mean and median
9.10.2 Standard deviation
9.10.3 Variance
9.11 Skewness and normality
9.11.1 Linearity
9.12 Spatial effects
9.13 Normalization
9.14 Similarity of dynamic range, mean and variance
9.15 Examples
9.16 Suggested reading

10 Normalization
10.1 What is normalization?
10.2 Sources of systematic bias
10.2.1 Dye effect
10.2.2 Scanner malfunction
10.2.3 Uneven hybridization
10.2.4 Printing tip
10.2.5 Plate and reporter effects
10.2.6 Batch effect and array design
10.2.7 Experimenter issues
10.2.8 What might help to track the sources of bias?
10.3 Normalization terminology
10.3.1 Normalization, standardization and centralization
10.3.2 Per-chip and per-gene normalization
10.3.3 Global and local normalization
10.4 Performing normalization
10.4.1 Choice of the method
10.4.2 Basic idea
10.4.3 Control genes
10.4.4 Linearity of data matters
10.4.5 Basic normalization schemes for linear data
10.4.6 Special situations
10.5 Mathematical calculations
10.5.1 Mean centering
10.5.2 Median centering
10.5.3 Trimmed mean centering
10.5.4 Standardization
10.5.5 Lowess smoothing
10.5.6 Ratio statistics
10.5.7 Analysis of variance
10.5.8 Spiked controls
10.5.9 Dye-swap experiments
10.6 RMA preprocessing for Affymetrix data
10.7 Some caution is needed
10.8 Graphical example
10.9 Example of calculations
10.10 Examples
10.11 Suggested reading

11 Finding differentially expressed genes
11.1 Identifying over- and under-expressed genes
11.1.1 Filtering by absolute expression change
11.1.2 Statistical single chip methods
11.1.3 Noise envelope
11.1.4 Sapir and Churchill's single slide method
11.1.5 Chen's single slide method
11.1.6 Newton's single slide method
11.2 What about the confidence?
11.2.1 Only some treatments have replicates
11.2.2 All the treatments have replicates: two-sample t-test
11.2.3 All the treatments have replicates: one-sample t-test
11.2.4 Non-parametric tests
11.3 Examples
11.4 Suggested reading

12 Cluster analysis of microarray information
12.1 Basic concept of clustering
12.2 Principles of clustering
12.3 Hierarchical clustering
12.4 Self-organizing map
12.5 K-means clustering
12.6 KNN classification - a simple supervised method
12.7 Principal component analysis
12.8 Pros and cons of clustering
12.9 Visualization
12.10 Function prediction
12.11 Examples
12.12 Suggested reading

III Data mining

13 Gene regulatory networks
13.1 What are gene regulatory networks?
13.2 Fundamentals
13.3 Bayesian network
13.4 Calculating Bayesian network parameters
13.5 Searching Bayesian network structure
13.6 Conclusion
13.7 Suggested reading

Part II Analysis

9 Preprocessing of data

Jarno Tuimala

9.1 Rationale for preprocessing

Preprocessing includes analytical or transformational procedures that need to be applied to the data before it is suitable for a detailed analysis. The black-box approach, where data is fed into a program and the result pops out, is simply unsound for statistical analysis, because the results coming out of the program may have been derived in a statistically erroneous way. In such cases, the biological conclusions can also be wrong. Statistical tests often have strict assumptions, which need to be fulfilled. Violation of these assumptions can lead to grossly wrong results. We strongly recommend that the researcher, even if not personally performing the data analysis, gains a basic knowledge of the data, because the results presented by the bioinformatician or statistician are more easily interpretable if one is at least somewhat familiar with the data. Here we will
introduce some methods for checking the data for violations of statistical test assumptions. We also present some things to consider before and during the data analysis. The methods are introduced in the order of applicability, although some methods are needed in several steps.

9.2 Missing values

There are often many missing values in microarray data. As you might recall, missing values are observations (intensities of spots) for which the quantification results are missing. In the context of microarrays, we define missing values as
• Missing because the spot is empty (intensity = 0)
• Missing because background intensity is higher than the spot intensity (background corrected intensity < 0)

Missing values can lead to problems in the data analysis, because they easily interfere with the computation of statistical tests and clustering. There are a couple of options for the treatment of missing values: they may be replaced with estimated values in a process called imputation, or they can be deleted from further analyses.

The default way of deleting missing data (in most software packages), for example while calculating a correlation matrix, is to exclude all cases that have missing data in at least one of the selected variables; that is, casewise deletion of missing data. However, if missing data are randomly distributed across cases, you could easily end up with no "valid" cases in the data set, because each of the genes will very likely have at least one missing observation on some chip. The most common solution in such instances is so-called pairwise deletion, where a statistic between each pair of variables is calculated from all cases that have valid data on those two variables.

Another common method is the so-called mean substitution of missing data (mean imputation, replacing all missing data in a variable by the mean of that variable). Its main advantage is that it produces internally consistent sets of results. Mean
substitution, however, artificially decreases the variation of scores, and this decrease in individual variables is proportional to the number of missing data. Because it substitutes missing data with artificially created average data points, mean substitution may considerably change the values of correlations.

Imputation is commonly carried out for intensity ratios, but can also be done for raw data. Different computer programs handle missing values very differently, and trying to draw any consensus would be futile. Statistical programs, at least, often offer the possibility to choose between imputation, pairwise deletion and casewise deletion.

9.3 Checking the background reading

There is some dispute on whether the background intensities should be subtracted from the spot intensities. At the moment, background adjustment seems to be commonly used, but strictly speaking, its applicability should be checked. Background intensities are assumed to be additive with foreground intensities. Background and foreground intensities together form the spot intensity. Background intensities should not vary multiplicatively with foreground intensities or spot intensities, because such a phenomenon usually occurs when there has been some kind of problem with hybridization. This can be assessed with a scatter plot of background intensities against spot intensities. If the spot intensities are dependent on the background intensities, it is possible either not to apply any background correction to the data (Figure 9.1 and Figure 9.2) or to discard the deviating observations from further analyses.

Another common problem with the background correction is that it might produce "pheasant tail" images on the scatter plot (Figure 9.3). Pheasant tails are formed by observations that have exactly the same intensity value on one channel, while the intensities on the other channel vary. In such cases long vertical or horizontal straight lines of observations are produced, and the resulting data cloud resembles a pheasant
tail in a scatter plot. This is especially common in the lower end of the intensity distribution, and can be a sign that the scanner is not very reliable below a certain cut-off intensity or that the image analysis software calculates background intensities in a peculiar way. When pheasant tails are observed, background uncorrected intensity values can be used for the analyses, or the deviating ...

[Figure: panel title "raw data vs corrected data"; axis labels log2(mouse1$morphR) and rmean]

... context of DNA microarray analysis. For example, if we treat a certain cell line with a cancer drug, we can set up several culture flasks, and then harvest them as ...
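
The three treatments of missing values described in Section 9.2 (casewise deletion, pairwise deletion and mean substitution) can be illustrated with a minimal Python sketch. The toy data set, the chip names and all function names below are hypothetical and are not taken from the book or from any particular software package; the sketch only aims to show how the three approaches differ on the same data.

```python
import math
from statistics import mean, pvariance

# Hypothetical toy data: three variables (chips) measured on five cases (genes).
# None marks a missing value.
data = {
    "chip1": [1.0, 2.0, None, 4.0, 5.0],
    "chip2": [2.0, None, 6.0, 8.0, 10.0],
    "chip3": [5.0, 4.0, 3.0, None, 1.0],
}

def complete_cases(table):
    """Casewise deletion: indices of cases with no missing value in any variable."""
    columns = list(table.values())
    n_cases = len(columns[0])
    return [i for i in range(n_cases)
            if all(col[i] is not None for col in columns)]

def pearson(x, y):
    """Plain Pearson correlation of two equally long sequences of numbers."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pairwise_correlation(table, a, b):
    """Pairwise deletion: use every case that is valid for both variables a and b."""
    pairs = [(u, v) for u, v in zip(table[a], table[b])
             if u is not None and v is not None]
    xs, ys = zip(*pairs)
    return pearson(xs, ys)

def mean_impute(values):
    """Mean substitution: replace each missing value by the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Casewise deletion keeps only the cases that are complete on every chip.
kept = complete_cases(data)

# Pairwise deletion keeps more cases for each individual pair of chips.
r12 = pairwise_correlation(data, "chip1", "chip2")

# Mean substitution fills the gap but shrinks the variance of the variable.
imputed = mean_impute(data["chip1"])
var_before = pvariance([v for v in data["chip1"] if v is not None])
var_after = pvariance(imputed)
```

In this toy example casewise deletion leaves only two of the five cases, whereas each pairwise correlation is still computed from three cases. Comparing var_before with var_after shows the variance reduction that the text attributes to mean substitution.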