
7 Machine Learning and Data Mining

STEFAN KRAMER, Institut für Informatik, Technische Universität München, Garching, München, Germany

CHRISTOPH HELMA, Institute for Computer Science, Universität Freiburg, Georges Köhler Allee, Freiburg, Germany

1. INTRODUCTION

In this chapter, we will review basic techniques from knowledge discovery in databases (KDD), data mining (DM), and machine learning (ML) that are suited for applications in predictive toxicology. We will primarily discuss methods that are capable of providing new insights and theories. Methods that work well for predictive purposes but do not return models that are easily interpretable in terms of toxicological knowledge (e.g., many connectionist and multivariate approaches) will not be discussed here, but are covered elsewhere in this book. Also not included in this chapter, yet important, are visualization techniques, which are valuable for giving first clues about regularities or errors in the data.

The chapter will feature data analysis techniques originating from a variety of fields, such as artificial intelligence, databases, and statistics. From artificial intelligence, we know about the structure of search spaces for patterns and models, and how to search them efficiently. The database literature is a valuable source of information about efficient storage of and access to large volumes of data, provides abstractions of data management, and has contributed the concept of query languages to data mining. Statistics is of utmost importance to data mining and machine learning, since it provides answers to many important questions arising in data analysis. For instance, it is necessary to avoid flukes, that is, patterns or models that are due to chance and do not reflect structure inherent in the data. Also, the issue of prior knowledge has been studied to some extent in the statistical literature.

One of the most important lessons in data analysis is that one cannot be too cautious with respect to the conclusions to be drawn from the data. It is never a good idea to rely too much on automatic tools without checking the results for plausibility. Data analysis tools should never be applied naively; the prime directive is "know your data." Therefore, sanity checks, (statistical) quality control, configuration management, and versioning are a necessity. One should always be aware of the possible threats to validity.

Regarding terminology, we will use the terms instances, cases, examples, and observations interchangeably. Instances are described in terms of attributes/features/variables (e.g., properties of the molecules, LD50 values), which will also be used as synonyms in this chapter. If we are considering prediction, we are aiming at the prediction of one (or a few) dependent variables (or target classes/variables, e.g., LD50 values) in terms of the independent variables (e.g., molecular properties).

In several cases, we will refer to the computational complexity of the respective methods. The time complexity of an algorithm gives us an asymptotic upper bound on the runtime of the algorithm as a function of the size of the input problem; thus, it characterizes the worst-case behavior of algorithms. It is written in the O() ("big O") notation, which in effect suppresses constants. If the input size of a dataset is measured in terms of the number of instances n, then O(n) means that the computation scales linearly with n. (Note that we are also interested in the scalability in the number of features, m.) Sometimes, we will refer to the space complexity, which makes statements about the worst-case memory usage of algorithms. Finally, we will assume basic knowledge of statistics and probability theory in the remainder of the chapter (see Ref. 2 for an introductory text).
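To make the notation concrete, the following minimal R sketch (ours, with synthetic data) contrasts an operation that is linear in the input with one whose memory grows quadratically in n:

```r
# Synthetic data: n instances described by m features.
n <- 1000; m <- 10
X <- matrix(rnorm(n * m), nrow = n, ncol = m)

# One pass over every value computes all feature means: O(n * m) time.
feature_means <- colMeans(X)

# A pairwise-distance matrix holds n * (n - 1) / 2 entries, i.e., O(n^2)
# space; this is why methods that need all pairwise distances scale
# poorly (cf. the remark on clustering in Sec. 2.2).
D <- dist(X)
length(D)  # 499500 distances for n = 1000
```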
This chapter consists of four main sections: The first part is an introduction to data mining; among other things, it introduces the terminology used in the rest of the chapter. The second part focuses on so-called descriptive data mining, the third part on predictive data mining. Each class of techniques is described in terms of the inputs and outputs of the respective algorithms, sometimes including examples thereof. We also emphasize the typical usage and the advantages of the algorithms, as well as the typical pitfalls and disadvantages. The fourth part of the chapter is devoted to references to the relevant literature, available tools, and implementations.

1.1. Data Mining (DM) and Knowledge Discovery in Databases

This section provides a non-technical introduction to data mining (DM). The book Data Mining by Witten and Frank (2) provides an excellent introduction to this area and is quite readable even for non-computer scientists. A recent review (3) covers DM applications in toxicology. Another recommended reading is Advances in Knowledge Discovery and Data Mining by Fayyad et al. (4).

First, we have to clarify the meaning of DM and its relation to other terms frequently used in this area, namely knowledge discovery in databases (KDD) and machine learning (ML). Common definitions (2,5–8) are:

- Knowledge discovery (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable structure in data.
- Data mining (DM) is the actual data analysis step within this process. It consists of the application of statistics, machine learning, and database techniques to the dataset at hand.
- Machine learning (ML) is the study of computer algorithms that improve automatically through experience. One ML task of particular interest in DM is classification, that is, to classify new unseen instances on the basis of known training instances.

This means that knowledge discovery is the process of supporting humans in their enterprise to make sense of massive amounts of data; data mining is the application of techniques to achieve this goal; and machine learning is one of the techniques suitable for this task. Other DM techniques originate from diverse fields, such as statistics, visualization, and database research. The focus in this chapter will be primarily on DM techniques based on machine learning.
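As a minimal illustration of the classification task mentioned in the ML definition (a sketch of ours with invented feature values, loosely modeled on Table 2 below), a new instance can simply inherit the class of its nearest training instance:

```r
# 1-nearest-neighbor classification: the unseen instance is assigned the
# class of the closest training instance (Euclidean distance).
train_X <- matrix(c(1.5, -9.4,
                    3.7, -9.6,
                    3.0, -9.5), ncol = 2, byrow = TRUE)
train_y <- c(1, 0, 0)      # e.g., mutagen (1) vs. non-mutagen (0)
new_x   <- c(1.6, -9.3)    # unseen compound

dists <- sqrt(rowSums(sweep(train_X, 2, new_x)^2))
train_y[which.min(dists)]  # predicted class: 1
```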
In practice, many of these terms are not used in their strict sense. In this chapter, we will also sometimes use the popular term DM when we mean KDD or ML.

Table 1 shows the typical KDD process as described by Fayyad et al. (6).

Table 1. The Knowledge Discovery (KDD) Process According to Fayyad et al. (From Ref. 6.)

1. Definition of the goals of the KDD process
2. Creation or selection of a data set
3. Data cleaning and preprocessing
4. Data reduction and projection
5. Selection of data mining methods
6. Exploratory analysis and model/hypothesis selection
7. Data mining (DM)
8. Interpretation/evaluation
9. Utilization

In the following, we sketch the adapted process for the task of extracting structure–activity relationships (SARs) from experimental data. The steps closely resemble those of the generic process by Fayyad:

1. Definition of the goal of the project and the purpose of the SAR models (e.g., predictions for untested compounds, scientific insight into toxicological mechanisms).
2. Creation or selection of the dataset (e.g., by performing experiments, downloading data).
3. Checking the dataset for mistakes and inconsistencies, and performing corrections.
4. Selection of the features that are relevant to the project and transformation of the data into a format that is readable by DM programs.
5. Selection of the DM technique.
6. Exploratory application and optimization of the DM tools to see if they provide useful results.
7. Application of the selected and optimized DM technique to the dataset.
8. Interpretation of the derived model and evaluation of its performance.
9. Application of the derived model, e.g., to predict the activity of untested compounds.

The typical KDD setting involves several iterations over these steps. Human intervention is an essential component of the KDD process. Although most research has focused on the DM step of the process (and the present chapter is no exception), the other steps are at least equally important. In practical applications, the data cleaning and preprocessing step is the most laborious and time-consuming task in the KDD process (and is therefore often neglected).

In the following sections, we will introduce a few general terms that are useful for describing and choosing DM systems on a general level. First, we will discuss the structure of the data that can be used by DM programs. Then, we will have a closer look at DM as search or optimization in the space of patterns and models. Subsequently, we will distinguish between descriptive and predictive DM.

1.2. Data Representation

Before feeding the data into a DM program, we have to transform it into a computer-readable form. From a computer scientist's point of view, there are two basic data representations relevant to DM; both will be illustrated with examples. Table 2 shows a table with physico-chemical properties of chemical compounds. For every compound, there is a fixed number of parameters or features available; therefore it is possible to represent the data in a single table. In this table, each row represents an example and each column an attribute. We call this type of representation propositional.

Table 2. Example of a Propositional Representation of Chemical Compounds Using Molecular Properties

CAS        log P   HOMO      LUMO      MUTAGEN
100-01-6    1.47   -9.42622  -1.01020  1
100-40-3    3.73   -9.62028   1.08193  0
100-41-4    3.03   -9.51833   0.37790  0
99-59-2     1.55   -9.01864  -0.98169  1
999-81-5   -1.44   -9.11503  -4.57044  0

Let us assume we want to represent chemical structures by identifying atoms and the connections (bonds) between them. It is obvious that this type of data does not fit into a single table, because each compound may have a different number of atoms and bonds. Instead, we may write down the atoms and the relations (bonds) between them as in Fig. 1. This is called a relational representation. Other biologically relevant structures (e.g., genes, proteins) may be represented in a similar manner.
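To make the contrast concrete, here is a minimal sketch of both representations in R. The property values are those of Table 2, while the atom/bond schema is only a hypothetical stand-in for Fig. 1, which is not reproduced in this extraction:

```r
# Propositional: one fixed-width row per compound (cf. Table 2).
compounds <- data.frame(
  CAS     = c("100-01-6", "100-40-3"),
  logP    = c(1.47, 3.73),
  HOMO    = c(-9.42622, -9.62028),
  LUMO    = c(-1.01020, 1.08193),
  MUTAGEN = c(1, 0)
)

# Relational: atoms and bonds live in separate tables, because their
# number varies from compound to compound. The schema and rows below are
# our own illustration, not data from the chapter.
atoms <- data.frame(
  compound = c("100-01-6", "100-01-6", "100-01-6"),
  atom_id  = c(1, 2, 3),
  element  = c("C", "N", "O")
)
bonds <- data.frame(
  compound = c("100-01-6", "100-01-6"),
  from     = c(1, 2),
  to       = c(2, 3),
  order    = c(1, 2)   # bond order: single and double
)
```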
The majority of research on ML and DM has been devoted to propositional representations. However, there exists a substantial body of work on DM in relational representations. Work in this area is published under the heading of inductive logic programming and (multi-)relational data mining. As of this writing, only very few commercial products explicitly deal with relational representations. Available non-commercial software packages include ACE by the KU Leuven (http://www.cs.kuleuven.ac.be/~ml/ACE/Doc/) and Aleph by Ashwin Srinivasan (http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph/). One of the few commercial products is Safarii (http://www.kiminkii.com/safarii.html) by the Dutch company Kiminkii.

One of the implications of choosing a relational representation is that the complexity of the DM task grows substantially. This means that the runtimes of relational DM algorithms are usually larger than those of their propositional relatives. For the sake of brevity, we will not discuss relational DM algorithms in the remainder of the chapter.

1.3. DM as Search for Patterns and Models

The DM step in the KDD process can be viewed as the search for structure in the given data. In the most extreme case, we are interested in the probability distribution of all variables (i.e., the full joint probability distribution of the data). Knowing the joint probability distribution, we would be able to answer all conceivable questions regarding the data. If we only want to predict one variable given the other variables, we are dealing with a classification or regression task: classification is the prediction of one of a finite number of discrete classes (e.g., carcinogens, non-carcinogens); regression is the prediction of a continuous, real-valued target variable (e.g., LD50 values). In prediction, we just have to model the dependent variable given the independent variables, which requires less data than estimating the full joint probability distribution.

In all of the above cases, we are looking for global regularities, that is, models of the data. However, we might just as well be satisfied with local regularities in the data. Local regularities are often called patterns in the DM literature. Frequently occurring substructures in molecules, for instance, fall into this category. Other examples of patterns are dependencies among variables (functional or multivalued dependencies) as known from the database literature (9). Again, looking for patterns is an easier task than predicting a target variable or modeling the joint probability distribution.

Most ML and DM approaches, at least conceptually, perform some kind of search for patterns or models. In many cases, we can distinguish between (a) the search for the structure of the pattern/model (e.g., a subgroup or a decision tree), and (b) the search for parameters (e.g., of a linear classifier or a Bayesian network). Almost always the goal is to optimize some scoring or loss function, be it simply the absolute or relative frequency, information-theoretic measures that evaluate the information content of a model, numerical error measures such as the root mean squared error, the degree to which the information in the data can be compressed, or the like. Sometimes, we do not explicitly perform search in the space of patterns or models but, more directly, employ optimization techniques.
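As a small worked example of one of the loss functions named above (our sketch; the numbers are invented), the root mean squared error can be computed as follows:

```r
# Root mean squared error, e.g., between measured and predicted LD50 values.
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}
rmse(observed = c(2.1, 3.4, 1.0), predicted = c(2.0, 3.9, 1.2))  # ~0.316
```

During the search, a model (or a parameter setting) with a lower RMSE on held-out data would be preferred.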
Given these preliminaries, we can summarize the elements of ML and DM as follows. First, we have to fix the representation of the data and of the patterns or models. Then, we often have a partial order and a lattice over the patterns or models that allows an efficient search for patterns and models of interest. With these ingredients, data mining often boils down to search/optimization over the structure/parameters of patterns/models with respect to some scoring/loss function.

Finally, descriptive DM is the task of describing and characterizing the data in some way, e.g., by finding frequently occurring patterns in the data. In contrast, the goal of predictive DM is to make predictions for yet unseen data. Predictive DM mostly involves the search for classification or regression models (see below). Please note that clustering should be categorized as descriptive DM, although some probabilistic variants thereof could be used indirectly for predictive purposes as well. Given complex data, one popular approach is to perform descriptive DM first (i.e., to find interesting patterns to describe the data), and to perform predictive DM as a second step (i.e., to use these patterns as descriptors in a predictive model). For instance, we might search for frequently occurring substructures in molecules and then use them as features in some statistical models.

2. DESCRIPTIVE DM

2.1. Tasks in Descriptive DM

In the subsequent sections, we will discuss two popular tasks in descriptive DM. First, we will sketch clustering, the task of finding groups of instances such that the similarity within the groups is maximized and the similarity between the groups is minimized. Second, we will sketch frequent pattern discovery and its descendants, where the task is to find all patterns with a minimum number of occurrences in the data (the threshold being specified by the user).
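The following toy sketch (ours, with invented data) reduces the second task to its simplest case, single-item patterns: count how often each item occurs and keep those reaching a user-specified minimum support. Level-wise systems such as that of Mannila and Toivonen (19) extend the same idea to combinations of items.

```r
# Frequent single-item patterns: keep every item (e.g., a substructure
# indicator) that occurs in at least min_support instances.
instances <- list(
  c("nitro_group", "aromatic_ring"),
  c("aromatic_ring", "halogen"),
  c("nitro_group", "aromatic_ring", "halogen")
)
min_support <- 2
counts <- table(unlist(instances))
counts[counts >= min_support]
# aromatic_ring: 3, halogen: 2, nitro_group: 2 -- all frequent here
```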
2.2. Clustering

The task of clustering is to find groups of observations such that the intragroup similarity is maximized and the intergroup similarity is minimized. There is a vast literature of papers and books on clustering, and it is hard to state the advantages and disadvantages of the respective methods. Part of the problem is that the evaluation and validation of clustering results is, to some degree, subjective. Clustering is unsupervised learning in the sense that there is no target value to be predicted.

The content of this section is complementary to that of Marchal et al. (10): we focus on the advantages and disadvantages of the respective techniques and their computational complexity, and we give references to recent literature. In the section on resources, several pointers to existing implementations will be given. In the following exposition, we closely follow Witten and Frank (12). Clustering algorithms can be categorized along several dimensions:

- Categorical vs. probabilistic: Are the observations assigned to clusters categorically or with some probability?
- Exclusive vs. overlapping: Does the algorithm allow for overlapping clusters, or is each instance assigned to exactly one cluster?
- Hierarchical vs. flat: Are the clusters ordered hierarchically (nested), or does the algorithm return a flat list of clusters?

In practice, clustering algorithms exhibit large differences in computational complexity (the worst-case runtime behavior as a function of the problem size). Methods depending on pair-wise distances of all instances (stored in a [...]

[...] is studied in the areas of constraint-based mining and inductive databases. Unfortunately, no industrial-strength implementations are available at the time of this writing.

3. PREDICTIVE DM

3.1. Tasks in Predictive DM

In toxicology, we can differentiate between two basic types of effects: those with a threshold (the majority of toxicological [...]

[...] Machine Learning by Mitchell (8). The most instructive and useful chapters for our purposes are those on concept learning (version spaces), Bayesian learning (Naive Bayes and a sketch of Bayesian networks), instance-based learning (k-nearest neighbor and the like), decision trees, and rule learning. Also included, but a bit outdated, are the chapters on analytical learning and hybrid analytical/empirical [...]

[...] http://www.cs.waikato.ac.nz/ml/weka/], in packages of the open-source statistical data analysis system R [(31), see also http://cran.r-project.org], and in commercial products such as Clementine (http://www.spss.com/spssbi/clementine/). The WEKA workbench is an open-source system implemented in Java. As of this writing, WEKA (version 3.4) includes:

- clustering methods: k-means, model-based clustering (EM) and a symbolic clustering [...]
- [...] networks
- instance-based learning: k-nearest neighbor
- inductive learning methods: state-of-the-art rule learners such as PART and RIPPER, and a reimplementation of the standard decision tree algorithm C4.5 called J48
- support vector machines
- logistic regression

The statistical workbench and programming language R [http://cran.r-project.org/ (31)] is the open-source variant of the commercial product S-Plus. It includes many packages (see http://spider.stat.umn.edu/R/doc/html/packages.html and http://www.bioconductor.org) implementing the most popular clustering and classification techniques:

- package mva includes: k-means, hierarchical clustering (function hclust, with single, complete and mean linkage)
- package mclust: model-based clustering [...]
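A brief usage sketch (ours) of the R functions named in this list; in current R distributions, kmeans and hclust live in the built-in stats package, the successor of the older mva package:

```r
# Flat, exclusive clustering with k-means (k = 3) on standardized features.
X <- scale(iris[, 1:4])
km <- kmeans(X, centers = 3)
table(km$cluster)  # cluster sizes

# Agglomerative hierarchical clustering on the pairwise distance matrix
# (O(n^2) space); "complete" is one of the linkage methods mentioned above.
hc <- hclust(dist(X), method = "complete")
groups <- cutree(hc, k = 3)  # cut the dendrogram into a flat 3-cluster solution
```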
[...] (http://www.norsys.com/) are two of the few industrial-strength implementations of Bayesian networks.

5. SUMMARY

This chapter started with a brief review of the knowledge discovery process and the role of DM. We distinguished between descriptive and predictive DM tasks and described the most important techniques that are suitable for predictive toxicology applications. Finally, we recommended some books [...]

REFERENCES

1. Frasconi P. Artificial Neural Networks and Kernel Machines in Predictive Toxicology. 2004. This volume.
2. Witten IH, Frank E. Data Mining. San Francisco, CA: Morgan Kaufmann Publishers, 2000.
3. Helma C, Gottmann E, Kramer S. Knowledge discovery and DM in toxicology. Stat Methods Med Res 2000; 9:329–358.
4. Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy R. Advances in Knowledge Discovery and Data Mining. [...]
[...]
8. Mitchell T. Machine Learning. The McGraw-Hill Companies, Inc., 1997.
9. O'Neil P, O'Neil E. Database: Principles, Programming, and Performance. 2nd ed. Morgan Kaufmann, 2000.
10. Marchal K, De Smet F, Engelen K, De Moor B. Computational Biology and Toxicogenomics. 2004. This volume.
11. Pelleg D, Moore A. X-means: Extending K-means with efficient estimation of the number of clusters. In: Proceedings of the 17th International Conference on Machine Learning, 2000.
[...]
15. [...] connectivity. Inf Processing Lett 2000; 76:175–181.
16. Datta S, Datta S. Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics 2003; 19(4):459–466.
17. Gustafson DE, Kessel WC. Fuzzy clustering with a fuzzy covariance matrix. In: Proceedings of the IEEE Conference on Decision and Control. IEEE Press, 1979:761–766.
18. [...] relationships for toxicity prediction. In: Helma C, ed. Predictive Toxicology. New York: Marcel Dekker, 2004.
19. Mannila H, Toivonen H. Levelwise search and borders of theories in knowledge discovery. Data Mining Knowledge Discovery 1997; 1(3):241–258.
20. Domingos P, Pazzani MJ. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning 1997; 29(2–3):103–130.
21. Jensen FV. Bayesian Networks [...]
