Statistical Modeling and Machine Learning for Molecular Biology
Alan M. Moses, University of Toronto, Canada
Chapman & Hall/CRC Mathematical and Computational Biology Series
CRC Press, Taylor & Francis Group, Boca Raton, FL
© 2017 by Taylor & Francis Group, LLC
ISBN 978-1-4822-5859-2 (paperback)
FIGURE 12.4 [Four ROC panels plotting true positive rate against false positive rate: LDA; SVM, Gaussian kernel; logistic regression, L1 penalty = 0.2; logistic regression, L1 penalty = 0.05.] ROC curves on the training and test set for twofold cross-validation using various classification methods. In each panel, the performance on the training data is shown for the two random samples of the data in black traces, while the performance on the held-out samples is shown in the gray traces. Note that the classification performance depends on the particular random set selected in each cross-validation run, and is therefore a random variable. In the upper panels, no regularization is used and the classification models achieve perfect separation on the training set, but much worse performance on the test sets. In the bottom panels, regularization is used, and the training performance is similar to the test performance.

Regularization reduces the difference between the training and test sets. However, if the penalty chosen is too large, the classification performance is not as good on either the training or the test set. Note that in this example, I divided the data into two equal parts and we could look at the classification performance using ROC curves. In general, half of the data might not be enough to train the model, and you might want to use 90% of the data for training and leave only 10% for testing. In that case, you would want to repeat the analysis 10 times, so that each part of the data is used as the test set once. Since it's not easy to look at 10 ROC curves, instead you can pool all of the results from the left-out data together and make one ROC curve.

12.5 LEAVE-ONE-OUT CROSS-VALIDATION

In the previous example, I suggested dividing the training set into 2 or 10 fractions, and running the classification on each part. In the limit, you get "leave-one-out" cross-validation, where the classifier is trained leaving out each datapoint alone. The classification performance measures are then computed based on the left-out datapoints, and summarized at the end. I hope it's clear that leave-one-out cross-validation is making up for lack of data by increasing the computational burden: the parameters of the classifier are being re-estimated many times, proportional to the number of datapoints in the dataset. So, if a classification method needs a number of calculations proportional to the size of the dataset (say n) in order to estimate the parameters, the leave-one-out cross-validation estimate of performance therefore takes n² calculations. Nevertheless, even in today's "data-rich" molecular biology, we are usually data limited and not compute limited. So leave-one-out cross-validation is the most popular way to evaluate classification methods in molecular biology (and predictive machine-learning models in general). Because leave-one-out cross-validation uses almost the entire training set for each iteration, it is thought to give the most reliable estimate of the parameters, and therefore the best guess at how the classifier would perform on new data (if the whole training set was used).
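To make the pooled ROC curve concrete, here is a minimal sketch in Python (assuming NumPy and scikit-learn, which are not used in this book) of pooling held-out scores from 10-fold or leave-one-out cross-validation into a single ROC curve and AUC. The data matrix X and labels y are synthetic stand-ins for a real expression dataset, and the function name pooled_cv_roc is purely illustrative.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold, LeaveOneOut
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.metrics import roc_curve, roc_auc_score

    # Synthetic stand-in data: 192 "cells" x 100 "genes", with a weak class signal
    rng = np.random.default_rng(0)
    X = rng.normal(size=(192, 100))
    y = rng.integers(0, 2, size=192)
    X[y == 1, :5] += 1.0   # shift a few features in the positive class

    def pooled_cv_roc(X, y, splitter):
        # Train on each training split, score the held-out samples, and pool
        # all held-out scores so that a single ROC curve can be drawn.
        scores = np.zeros(len(y))
        for train_idx, test_idx in splitter.split(X, y):
            clf = LinearDiscriminantAnalysis()
            clf.fit(X[train_idx], y[train_idx])
            scores[test_idx] = clf.decision_function(X[test_idx])
        fpr, tpr, _ = roc_curve(y, scores)
        return fpr, tpr, roc_auc_score(y, scores)

    # 10-fold cross-validation: each sample is held out exactly once
    _, _, auc_10fold = pooled_cv_roc(X, y, StratifiedKFold(n_splits=10, shuffle=True, random_state=0))

    # Leave-one-out: n model fits, so roughly n^2 work if each fit costs O(n)
    _, _, auc_loo = pooled_cv_roc(X, y, LeaveOneOut())
    print(auc_10fold, auc_loo)
    # Swapping in LogisticRegression(penalty="l1", solver="liblinear", C=...)
    # in place of LDA would give the regularized comparison shown in Figure 12.4.

The same loop works for any classifier that exposes a continuous score; only the splitter changes between k-fold and leave-one-out.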
Figure 12.5 shows ROC curves for LDA and an SVM used to classify cell type based on single-cell expression data from the 100 most highly expressed genes. Note that the classifiers both achieve very good performance, and there doesn't seem to be an advantage to the nonlinear classification using the SVM. This suggests that although the data is 100-dimensional, in that high-dimensional space, the classes are linearly separable.

FIGURE 12.5 [Two ROC panels plotting true positive rate against false positive rate, based on the 100 most highly expressed genes: left, LDA and SVM; right, LDA and LDA on a biological replicate.] Leave-one-out cross-validation. In the left panel, the performance is shown for LDA and a support vector machine with a Gaussian kernel (SVM). The right panel shows leave-one-out cross-validation performance estimates on the original dataset (used for training and testing as in the left panel), and performance on a biological replicate where the model is trained on the original data and then applied to the replicate.

A very important cautionary note about cross-validation is that it only ensures that the classifier is not overfitting to the data in the training sample. Thus, the cross-validation estimates of the classification performance will reflect the performance on unseen data, provided that data has the same underlying distribution as the training sample. In many cases, when we are dealing with state-of-the-art genomics data, the data are generated from new technologies that are still in development. Both technical and biological issues might make the experiment hard to repeat. If any aspect of the data changes between the training sample and the subsequent experiments, it is no longer guaranteed that the cross-validation accuracy will reflect the true classification accuracy on new data. In machine learning, this problem is known as "covariate shift" (Shimodaira 2000) to reflect the idea that the underlying feature space might change. Because the dimensionality of the feature space is large, and the distribution of the data in that space might be complicated, it's not easy to pinpoint what kinds of changes are happening.

I can illustrate this issue using the single-cell RNA-seq data introduced in Chapter 2, because this dataset includes a "replicate" set of 96 LPS cells and 96 unstimulated cells. These are true biological replicates: different cells, sequenced on different days, from different mice. When I train the classifier on the original set and then apply it to these "replicate" data, the classification performance is not nearly as good as the leave-one-out cross-validation suggests it should be. ROC curves are shown in Figure 12.5. Note that this is not due to overfitting of the classifier (using a regularized model, such as penalized logistic regression, does not help in this case). These are real differences (either biological or technical) between the two datasets, such that features associated with the cell class in one replicate are not associated in the same way in the second replicate. Because the problem is in a 100-dimensional space, it's not easy to figure out what exactly has changed.
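Below is a rough sketch of the replicate check described here, continuing the previous code block (it reuses pooled_cv_roc, X, y, and rng). The replicate matrix X_rep and labels y_rep are synthetic placeholders standing in for the second set of 96 LPS and 96 unstimulated cells, and the deliberate change in their distribution mimics covariate shift; none of this is the book's actual data.

    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import LeaveOneOut

    # Synthetic "biological replicate" with a shifted feature distribution
    X_rep = rng.normal(size=(192, 100))
    y_rep = rng.integers(0, 2, size=192)
    X_rep[y_rep == 1, 5:10] += 1.0   # the class signal now sits in different genes

    # Within-dataset estimate: pooled leave-one-out scores on the original data
    _, _, auc_loo = pooled_cv_roc(X, y, LeaveOneOut())

    # Replicate estimate: train once on all the original data, test on the replicate
    clf = LinearDiscriminantAnalysis().fit(X, y)
    auc_rep = roc_auc_score(y_rep, clf.decision_function(X_rep))

    print(f"leave-one-out AUC on original data: {auc_loo:.2f}")
    print(f"AUC on biological replicate:        {auc_rep:.2f}")
    # A large drop from the first number to the second points to a change in the
    # feature distribution between experiments (covariate shift) rather than to
    # overfitting, which cross-validation alone would have caught.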
12.6 BETTER CLASSIFICATION METHODS VERSUS BETTER FEATURES

In the single-cell expression example given earlier, the SVM with Gaussian kernel (a sophisticated nonlinear classification method) did not perform any better than LDA (an old-fashioned, overfitting-prone, linear classification method). Although at first this might seem surprising, it actually illustrates something that is reasonably common in biology. For very high-dimensional classification problems of the type we often encounter in molecular biology, the choice of features, and the ability to integrate different types of features into a single predictive model, will usually be much more important than the choice of classifier. For example, for the problem of identifying gene signatures, if you can make a good choice of genes, you won't need a very sophisticated classification method. In the simple case of the two-dimensional T-cell classifier, once we found better features (the T-cell receptor and one other gene in Chapter 10), we could classify almost perfectly with a linear boundary. For many complicated bioinformatics problems, choosing the right features turns out to be the key to getting good accuracy: once you've found those features, simple classification methods can often do a good job.

Currently, many important classification problems are thought to be limited by available features, not classification methods. For example, a major classification bottleneck in personalized medicine, the identification of deleterious mutations or SNPs, appears to be limited by features, not classification methods. This means that in practice, feature selection and integration turns out to be the key step in high-dimensional classification problems. To illustrate this, let's return to our cell-type classification problem based on single-cell sequence data, and now use the 1000 most highly expressed genes (instead of 100 as we were doing earlier). We can achieve essentially perfect classification (Figure 12.6). On the other hand, if we only had the three most highly expressed genes (remember, this is single-cell data, so measuring three genes' expression in single cells could still be nontrivial), we wouldn't be able to classify the cell types very well at all (Figure 12.6). In this case, using a better classification method (such as an SVM) does help us, but it's unlikely that we'll even get close to the accuracy we can get if we have much better features.

FIGURE 12.6 [Two ROC panels plotting true positive rate against false positive rate, each comparing LDA and SVM: left, the 1000 most highly expressed genes; right, the three most highly expressed genes.] Better features make a much bigger difference than a better classifier. The two panels compare leave-one-out cross-validation classification performance of LDA (gray curve) to a support vector machine with a Gaussian kernel (SVM, black traces). On the left, the classifiers are trained with 1000 genes, while on the right they are trained with three genes. In both cases, the SVM performs slightly better than LDA. However, when data from 1000 genes is available, both classifiers can produce nearly perfect performance, while when data from only three genes is available, neither classifier can perform well.

Indeed, there are many biological classification problems where the features we are using simply don't have enough information to do a good job. In these cases, trying lots of classification methods probably won't increase the accuracy very much; instead, we need to be clever and find some new features, either by extracting new types of information from data we already have or by doing some new experiments.
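As a rough illustration of features mattering more than the classifier, the sketch below ranks genes by mean expression and compares LDA to an SVM with a Gaussian (RBF) kernel using the top 3, 100, or 1000 genes. The expression matrix is again synthetic, so the absolute numbers will differ from the figures; the point is the pattern, with the feature set dominating the classifier choice.

    import numpy as np
    from sklearn.base import clone
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.svm import SVC
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(1)
    X = rng.lognormal(mean=2.0, sigma=1.0, size=(192, 2000))   # cells x genes (synthetic)
    y = rng.integers(0, 2, size=192)
    X[y == 1, :300] *= 1.5    # spread a modest class signal over many genes

    def pooled_cv_auc(clf, X, y, n_splits=10):
        # Pool held-out decision scores across folds and compute a single AUC
        scores = np.zeros(len(y))
        folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
        for train_idx, test_idx in folds.split(X, y):
            model = clone(clf).fit(X[train_idx], y[train_idx])
            scores[test_idx] = model.decision_function(X[test_idx])
        return roc_auc_score(y, scores)

    ranked = np.argsort(X.mean(axis=0))[::-1]   # genes ordered by mean expression
    for k in (3, 100, 1000):
        Xk = X[:, ranked[:k]]
        lda_auc = pooled_cv_auc(LinearDiscriminantAnalysis(), Xk, y)
        svm_auc = pooled_cv_auc(SVC(kernel="rbf", gamma="scale"), Xk, y)
        print(f"top {k:4d} genes: LDA AUC = {lda_auc:.2f}, RBF-SVM AUC = {svm_auc:.2f}")

Note that ranking genes by mean expression does not use the class labels, so doing it once on the full matrix (as here, and as in the chapter's "most highly expressed genes" setup) does not leak label information into the cross-validation.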
EXERCISES

1. I said that the AUC of random guessing was 0.5. What is the AUC of the "sleeping sniffer dog" in the case where the data are truly 1% positives and 99% negatives? Why isn't it 0.5?
2. What does the P–R plot look like for random guessing if there are equal numbers of positives and negatives in the data?
3. Show that LDA with variable cutoffs corresponds to using the MAP classification rule with different assumptions about the prior probabilities.
4. In cases of limited data, one might expect leave-one-out cross-validation to produce better classification performance than, say, sixfold cross-validation. Why?
5. I have an idea: Instead of training the parameters of a naïve Bayes classifier by choosing the ML parameters for the Gaussian in each class, I will use the leave-one-out cross-validation AUC as my objective function and choose the parameters that maximize it. Am I cool?

REFERENCES AND FURTHER READING

Altschul S, Gish W, Miller W, Myers E, Lipman D (1990). Basic local alignment search tool. J Mol Biol 215(3):403–410.
Shimodaira H (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. J Stat Plan Infer 90(2):227–244.
Yuan Y, Guo L, Shen L, Liu JS (2007). Predicting gene expression from sequence: A reexamination. PLoS Comput Biol 3(11):e243.