8 Neural Networks and Kernel Machines for Vector and Structured Data

PAOLO FRASCONI
Dipartimento di Sistemi e Informatica, Università degli Studi di Firenze, Firenze, Italy

© 2005 by Taylor & Francis Group, LLC

1. INTRODUCTION

The linear models introduced earlier in the text offer an important starting point for the development of machine learning tools but are subject to important limitations that we need to overcome in order to cover a wider application spectrum. Firstly, data in many interesting cases are not linearly separable; the need for a complex separation surface is not an artifact due to noise but rather a natural consequence of the representation. For the sake of illustration, let us construct an artificial problem involving nonlinear separation [our example is actually a rephrasing of the famous XOR problem (1)]. Suppose that we are given a problem involving the discrimination between active and nonactive chemical compounds, and suppose that we use as features two physico-chemical properties of each compound expressed as real numbers (say, charge and hydrophobicity). In this way, each compound is represented by a two-dimensional real vector, as in Fig. 1. Now suppose that active compounds (points marked by +) have either low charge and high hydrophobicity, or low hydrophobicity and high charge, while nonactive compounds (marked by −) have either high charge and low hydrophobicity, or high hydrophobicity and low charge, as in Fig. 1b. It is easy to realize that in this situation there is no possible linear separation between active and nonactive instances. By contrast, if active compounds had both high charge and high hydrophobicity (as in Fig. 1a), then linear separation would have been possible.

Figure 1  Artificial problems illustrating linear and nonlinear separability. Here the instance space $\mathcal{X}$ is $\mathbb{R}^2$ and the function $f$ is realized by a hyperplane $h$ that divides $\mathcal{X}$ into a positive and a negative semispace.
If positive and negative points are arranged as in diagram (b), then no separating hyperplane exists.

Secondly, data may not necessarily come in the form of real vectors, or in any case in the form of attribute-value representations that can be easily converted into a fixed-size vectorial representation.^a For example, a natural and appealing representation for a chemical compound is by means of an attributed graph with vertices associated with atoms and edges representing chemical bonds. Representing chemical structures in attribute-value form can be done, for example, using topological descriptors. However, while these ‘‘ad hoc’’ representations may embody domain knowledge from experts, it is not obvious that they preserve all the information that would be useful to the learning process.

In this chapter, we focus on extending the basic models presented earlier to specifically address the two above issues. We will briefly review neural networks (in particular, the multilayered perceptron) and kernel methods for nonlinearly separable data. Then we will extend basic neural networks to obtain architectures that are suitable for several interesting classes of labeled graphs.

For several years, neural networks have been the most common (but also controversial) among an increasingly large set of tools available in machine learning. The basic models have their roots in the early cybernetics studies of the 1940s and 1950s (e.g., Refs. 2 and 3), but the interest in neural networks and related learning algorithms remained relatively low for several years. After Minsky and Papert (4) published their extensive critical analysis of Rosenblatt's perceptron, the attention of the artificial intelligence community mainly diverted toward symbolic approaches.
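Rosenblatt's perceptron just mentioned also makes the separability contrast of Fig. 1 easy to check empirically. The sketch below is a minimal plain-Python perceptron (the 0/1 encodings of ‘‘low'' and ‘‘high'' charge and hydrophobicity, and the epoch cap, are illustrative assumptions): on the linearly separable arrangement of Fig. 1a it is guaranteed to converge, while on the XOR-like arrangement of Fig. 1b it keeps making mistakes forever.

```python
def perceptron(data, max_epochs=100):
    """Train a two-input perceptron; return (w, b, converged)."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(max_epochs):
        errors = 0
        for (x1, x2), y in data:            # labels y are in {-1, +1}
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:   # misclassified point
                w[0] += y * x1               # Rosenblatt update rule
                w[1] += y * x2
                b += y
                errors += 1
        if errors == 0:                      # one clean pass: separated
            return w, b, True
    return w, b, False

# Fig. 1a pattern: active iff both properties are high (separable)
and_like = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), +1)]
# Fig. 1b pattern (XOR): active iff exactly one property is high (not separable)
xor_like = [((0, 0), -1), ((0, 1), +1), ((1, 0), +1), ((1, 1), -1)]

print(perceptron(and_like)[2], perceptron(xor_like)[2])  # True False
```

By the perceptron convergence theorem, the number of updates on separable data is bounded, so the epoch cap only matters in the nonseparable case.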
In the mid-1980s, the popularity of neural networks boomed again, but this time as an interdisciplinary tool that captured the interest of cognitive scientists, computer scientists, physicists, and biologists. The backpropagation algorithm (5–8) is often mentioned as the main cause of this renaissance, and the 1990s witnessed a myriad of real-world applications of neural networks, many of them quite successful. During the same decade, Cortes and Vapnik (9) developed earlier ideas on statistical learning theory and introduced support vector machines and kernel methods, which still today represent a very active area of research in machine learning.

The material covered in this chapter is rather technical, and it is assumed that the reader is knowledgeable about basic concepts of calculus, linear algebra, and probability theory. Textbooks such as Refs. 10–14 may be useful for readers who need further background in these areas.

^a In attribute-value representations, each instance consists of values assigned to a fixed repertoire of attributes or features. Since each value belongs to a specified set (discrete or continuous), conversion to fixed-size vectors is typically straightforward.

2. SUPERVISED LEARNING

In supervised learning, we are interested in the association between some input instance $x \in \mathcal{X}$ and some output random variable $Y \in \mathcal{Y}$. The input instance $x$ is a representation of the object we are interested in making predictions about, while the output $Y$ is the predicted value.

2.1. Representation of the Data

The set $\mathcal{X}$ is called the instance space and consists of all the possible realizations of the input portion of the data. In practice, if we are interested in making predictions about chemical compounds, then each instance $x$ is a suitable representation of a particular compound. One possibility is to use a vector-based representation, i.e., $x \in \mathbb{R}^n$.
This means that, in our chemical example, each component of the input vector might be an empirical or a nonempirical descriptor for the molecule (see elsewhere in this book for details on how to represent chemicals by descriptors). However, since, in general, more expressive representations are possible, at the most abstract level we may assume that $\mathcal{X}$ is any set. Indeed, in Sec. 5, we will present architectures capable of directly exploiting graph-based representations of instances, which could be interesting in chemical domains.

The type of $\mathcal{Y}$ depends on the nature of the prediction problem. For example, if we are interested in the prediction of the normal boiling points of halogenated aliphatics, then the output $Y$ is a real-valued variable representing the boiling temperature. By contrast, if we are interested in the discrimination between potentially drug-like and nondrug-like candidates, then we will use $\mathcal{Y} = \{0, 1\}$, i.e., the output $Y$ is in this case a Bernoulli variable.^b When $\mathcal{Y}$ is a continuous set, we talk about regression problems. When $\mathcal{Y}$ is a discrete set, we talk about classification problems. In particular, if $\mathcal{Y} = \{0, 1\}$ we have binary classification and, more generally, if $\mathcal{Y} = \{1, 2, \ldots, K\}$ we have multiclass classification. Both regression and classification problems can be formulated in the framework of statistical learning that will be introduced in the next section.

2.2. Basic Ideas of Statistical Learning Theory

When introducing supervised learning systems, we typically assume that some unknown (but fixed) probability measure $p$ is defined on $\mathcal{X} \times \mathcal{Y}$, where $\times$ denotes the Cartesian product of sets.^c The purpose of learning is, in a sense, to ‘‘identify'' the unknown $p$. To this end, we are given a data set of i.i.d. examples drawn from $p$ in the form of input-output pairs: $D_m = \{(x_i, y_i)\}_{i=1}^{m}$. The supervised learning problem consists of seeking a function $f(x)$ for predicting $Y$ on new (unseen) cases.
This function is sometimes referred to as the hypothesis. In order to measure the quality of the learned function, we introduce a nonnegative loss function $L: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^{+}$, where $L(y, f(x))$ is the cost of predicting $f(x)$ when the correct prediction is $y$. For example, in the case of regression, we may use the quadratic loss

$$L(y, f(x)) = (y - f(x))^2 \quad (1)$$

^b A Bernoulli variable is simply a random variable having two possible realizations, as in tossing a coin.
^c Given two sets $\mathcal{X}$ and $\mathcal{Y}$, the Cartesian product $\mathcal{X} \times \mathcal{Y}$ is the set of all pairs $(x, y)$ with $x \in \mathcal{X}$ and $y \in \mathcal{Y}$.

In the case of classification, we could use the 0–1 loss

$$L(y, f(x)) = \begin{cases} 0 & \text{if } y = f(x) \\ 1 & \text{otherwise} \end{cases} \quad (2)$$

which can be easily generalized to quantify the difference between false positive and false negative errors. The empirical error associated with $f$ is the observed loss on the training data:

$$E_m(f) = \sum_{i=1}^{m} L(y_i, f(x_i)) \quad (3)$$

For the 0–1 loss of Eq. (2), this is simply the number of misclassified training examples (i.e., the sum of false positives and false negatives). The generalization error can be measured by the expected loss associated with $f$, where the expectation is taken over the joint probability $p$ of inputs and outputs:

$$E(f) = \mathbb{E}_p[L(Y, f(X))] = \int L(y, f(x))\, p(x, y)\, dx\, dy \quad (4)$$

Thus, it immediately appears that learning has essentially and fundamentally to do with the ability of estimating probability distributions in the joint space of the inputs $\mathcal{X}$ and the outputs $\mathcal{Y}$.

To better illustrate the idea, suppose that $\mathcal{X} = \mathbb{R}$ and $\mathcal{Y} = \{0, 1\}$ (a binary classification problem with inputs represented by a single real number). In this case, under the 0–1 loss, Eq. (4) reduces to

$$E(f) = \int_{x \in F_1} p(Y = 0, x)\, dx + \int_{x \in F_0} p(Y = 1, x)\, dx \quad (5)$$

where $F_0$ and $F_1$ are the regions that $f$ classifies as negative and positive, respectively. Thus the first integral in Eq.
(5) measures the error due to false positives (weighted by the probability $p(Y = 0, x)$ that $x$ is actually a negative instance) and the second integral measures the error due to false negatives (weighted by the probability $p(Y = 1, x)$ that $x$ is actually a positive instance). The error in Eq. (5) is also known as the Bayes error of $f$. The Bayes optimal classifier $f^*$ is defined as

$$f^*(x) = \begin{cases} 0 & \text{if } p(Y = 0 \mid x) > p(Y = 1 \mid x) \\ 1 & \text{otherwise} \end{cases} \quad (6)$$

and has the important property that it minimizes the error in Eq. (5). It can be seen as a sort of theoretical limit, since no other classifier can do better. An example of a Bayes optimal classifier is illustrated in Fig. 2. In this case, the decision function is simply

$$f(x) = \begin{cases} 0 & \text{if } x < \omega \\ 1 & \text{otherwise} \end{cases} \quad (7)$$

where $\omega$ is a parameter. According to Eq. (6), the best choice $\omega^*$ satisfies $p(Y = 0 \mid \omega^*) = p(Y = 1 \mid \omega^*)$. Using Bayes' theorem, we see that $p(Y \mid x)$ is proportional to $p(x \mid Y)\,p(Y)$ and, assuming that $p(Y = 0) = p(Y = 1) = 0.5$, the best choice also satisfies $p(\omega^* \mid Y = 0) = p(\omega^* \mid Y = 1)$. The densities $p(x \mid Y = 0)$ and $p(x \mid Y = 1)$ thus contain all the information needed to construct the best possible classifier. They are usually called class-conditional densities. In Fig. 2 we have assumed that they are normal distributions. The Bayes error [Eq. (5)] is measured in this case by the shaded area below the intersection of the two curves.

Figure 2  The Bayes optimal classifier.

Of course, in practice, we do not know $p(Y \mid x)$ and we cannot achieve the theoretical optimum error $E(f^*)$. A learning algorithm then proceeds by searching its solution $f$ in a suitable set of functions $\mathcal{F}$, called the hypothesis space, using the empirical error $E_m(f)$ as a guide.
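To make the construction concrete, here is a small plain-Python sketch that implements the rule of Eq. (6) for two normal class-conditional densities with equal priors and approximates the Bayes error of Eq. (5) numerically. The means, the common variance, and the integration grid are illustrative assumptions, not values taken from Fig. 2.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# assumed class-conditional densities p(x|Y=0), p(x|Y=1); equal priors 0.5
mu0, mu1, sigma = 3.0, 7.0, 1.0

def bayes_classifier(x):
    # Eq. (6): with equal priors, comparing posteriors reduces to
    # comparing the class-conditional densities
    return 0 if normal_pdf(x, mu0, sigma) > normal_pdf(x, mu1, sigma) else 1

# Eq. (5): the Bayes error is the integral of the smaller joint density,
# i.e., the shaded overlap area of Fig. 2 (rectangle-rule approximation)
dx = 0.001
bayes_error = sum(0.5 * min(normal_pdf(i * dx, mu0, sigma),
                            normal_pdf(i * dx, mu1, sigma)) * dx
                  for i in range(-2000, 12001))

print(bayes_classifier(4.9), bayes_classifier(5.1))  # 0 1
print(round(bayes_error, 4))  # about 0.023
```

With equal variances the optimal threshold $\omega^*$ sits at the midpoint of the two means, here 5.0, which the classifier recovers without being told.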
Unfortunately, the problem of minimizing the training set error $E_m(f)$ by choosing $f \in \mathcal{F}$ is not well posed, in the sense that the solution is not necessarily unique. For example, going back to our toy binary classification problem of Fig. 2, suppose we are given the following training set of five points, $D_m = \{(1.1, 0), (1.9, 0), (2.3, 0), (8.2, 1), (8.9, 1)\}$ (two positive and three negative examples), and suppose we continue to use the decision function of Eq. (7). Clearly, any $\omega$ in the interval (2.3, 8.2) will bring the training set error to 0, but the associated generalization error would depend on $\omega$. Without additional information (i.e., besides the training data) or constraints, we have no way of making a ‘‘good'' choice for $\omega$ (i.e., picking a value close to the theoretical optimum $\omega^*$).

Intuitively, complex models can easily fit the training data but could behave poorly on future data. As an extreme example, if $\mathcal{F}$ contains all possible functions on $\mathcal{X}$, then we can always bring $E_m(f)$ to 0 using a lookup table that just stores the training examples, but this would be a form of rote learning yielding little or no chance of generalization to new instances. Indeed, one of the main issues in statistical learning theory consists of understanding under which conditions a small empirical error also implies a small generalization error. Intuitively, in order to achieve this important goal, the function $f$ should be ‘‘stable,'' i.e., it should give ‘‘similar'' predictions for ‘‘similar'' inputs.

To illustrate this idea, let us consider another simple example. Suppose we have a scalar regression problem from real numbers to real numbers. If we are given $m$ training points, we can always find a perfect solution by fitting a polynomial of degree $m - 1$. Again, this would resemble a form of rote learning. If data points are noisy, they can appear as if they had been generated according to a more complex mechanism than the real one.
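The ill-posedness is easy to verify directly. A minimal plain-Python sketch computes the empirical 0–1 error [Eq. (3)] of the threshold rule of Eq. (7) on the five points above, confirming that an entire interval of thresholds fits the training data perfectly:

```python
# the five training points of the toy problem: (input x, label y)
D_m = [(1.1, 0), (1.9, 0), (2.3, 0), (8.2, 1), (8.9, 1)]

def empirical_error(omega):
    """Number of points misclassified by the rule of Eq. (7)."""
    return sum(1 for x, y in D_m if (0 if x < omega else 1) != y)

# every threshold strictly between 2.3 and 8.2 achieves zero training error
zero_error_thresholds = [w / 10 for w in range(24, 82)]  # 2.4, 2.5, ..., 8.1
print(all(empirical_error(w) == 0 for w in zero_error_thresholds))  # True

# thresholds outside the interval misclassify some training points
print(empirical_error(2.0), empirical_error(9.0))  # 1 2
```

The training data alone cannot distinguish among the zero-error thresholds, even though their generalization errors differ.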
In such a situation, a polynomial of high degree is likely to be a wildly oscillating function that passes through the assigned points but varies too much between the examples, and it will yield a large error on new data points. This phenomenon is commonly referred to as overfitting. Smoother solutions, such as polynomials of small degree, are more likely to yield smaller generalization error, although the training points may not be fitted perfectly (i.e., the empirical error may be greater than 0). Of course, if the degree of the polynomial is too small, the true solution cannot be adequately represented and underfitting may occur. In Sec. 3.4, we will discuss how the tradeoff between overfitting and model complexity can be controlled when using neural networks. More generally, regularization theory is a principled approach for finding smoother solutions (15,16) and is also one of the foundations of the support vector machines that we will discuss in Sec. 4.

Of course, we expect that better generalization will be achieved as more training examples become available. Technically, this fact can be characterized by studying the uniform convergence of a learner, i.e., understanding how the difference between empirical and generalization error goes to 0 as the number of training examples increases. Standard results [see, e.g., Refs. 17–19] indicate that uniform convergence of the empirical error to the generalization error essentially depends on the capacity of the class of functions $\mathcal{F}$ from which $f$ is chosen. Intuitively, the capacity of a class of functions $\mathcal{F}$ is the number of different functions contained in $\mathcal{F}$. Going back to our scalar regression example where $\mathcal{F}$ is a set of polynomials, we intuitively expect that the capacity increases with the maximum allowed degree of a polynomial.
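The polynomial scenario can be sketched as follows in plain Python. The sample points are hypothetical noisy readings of the true function $y = x$: the degree $m - 1$ Lagrange interpolant reproduces every training point exactly (rote learning), while a least-squares straight line tolerates small training errors but stays close to the true function between and beyond the samples.

```python
def lagrange_predict(pts, x):
    """Degree m-1 interpolating polynomial through the m points (rote learning)."""
    total = 0.0
    for i, (xi, yi) in enumerate(pts):
        term = yi
        for j, (xj, _) in enumerate(pts):
            if j != i:
                term *= (x - xj) / (xi - xj)   # Lagrange basis polynomial
        total += term
    return total

def line_fit(pts):
    """Closed-form least-squares straight line (a lower-capacity hypothesis)."""
    m = len(pts)
    mx = sum(x for x, _ in pts) / m
    my = sum(y for _, y in pts) / m
    slope = (sum((x - mx) * (y - my) for x, y in pts)
             / sum((x - mx) ** 2 for x, _ in pts))
    return lambda x: my + slope * (x - mx)

# hypothetical noisy samples of the true function y = x
pts = [(0, 0.1), (1, 0.9), (2, 2.2), (3, 2.8), (4, 4.3), (5, 4.9), (6, 6.2)]
f_line = line_fit(pts)

# both hypotheses explain the data, but between samples they differ
print(lagrange_predict(pts, 5.5), f_line(5.5))  # true value is 5.5
```

The interpolant has zero empirical error by construction; the line has a small positive empirical error but a prediction at the unseen point 5.5 that is close to the truth.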
In the case of binary classification, $\mathcal{F}$ is a set of dichotomies (i.e., a set of functions that split their domain into two nonoverlapping sets) and it is possible to show that, with probability at least $1 - \delta$, it holds that

$$E(f) \le E_m(f) + \sqrt{\frac{8}{m}\left[d\left(\ln\frac{2m}{d} + 1\right) - \ln\frac{\delta}{4}\right]} \quad (8)$$

In the above equation, the capacity of the learner is measured by $d$, an integer defined as the cardinality of the largest set of points that can be labeled arbitrarily by one of the dichotomies in $\mathcal{F}$; $d$ is called the Vapnik–Chervonenkis dimension (VC-dimension) of $\mathcal{F}$. For example, if $\mathcal{F}$ is the set of separating lines in the two-dimensional plane (as in Fig. 1), it is easy to see that the VC-dimension is 3. Note that for a given set of three points, there are $2^3 = 8$ possible dichotomies. It is immediate to see that in this case there is a set of three points that can be arbitrarily separated by a line, regardless of the chosen dichotomy (see Fig. 3). However, given a set of 4 points, there are 16 possible dichotomies and two of them are not linearly separable. Uniform convergence of $E_m(f)$ to $E(f)$ ensures consistency; that is, in the limit of infinite examples, the minimization of $E_m(f)$ will result in a generalization error as low as the error associated with the best possible function in $\mathcal{F}$. As Eq. (8) shows, the behavior of the learner is essentially controlled by the ratio between capacity and number of training examples, and thus we cannot afford a complex model if we do not have enough data.

2.3. The Logistic Function for Classification

There are two general approaches for developing statistical learning algorithms. In the generative approach, we create a

Figure 3  Illustration of the VC-dimension.
Left: a set of three points that can be classified arbitrarily using a two-dimensional hyperplane (i.e., a line). Right: no set of four points can be arbitrarily classified by a hyperplane.

[...]

[...] proposed a divide-and-conquer technique that splits the training data into smaller subsets and subsequently recombines the solutions (38).

4.4. Support Vector Regression

The SVM regression problem can be formulated as the minimization of the functional

$$H(f) = \frac{C}{m} \sum_{i=1}^{m} |y_i - f(x_i)|_\varepsilon + \frac{1}{2}\|f\|_K^2 \quad (68)$$

where $|a|_\varepsilon = 0$ if $|a| < \varepsilon$ and $|a| - \varepsilon$ otherwise is the $\varepsilon$-insensitive loss. As depicted in Fig. 8 for the linear case, this is equivalent to the constrained problem

$$\min \; \frac{1}{2}\|f\|_K^2 + C \sum_{i=1}^{m} (\xi_i + \xi_i^*)$$

subject to

$$f(x_i) - y_i \le \varepsilon + \xi_i, \quad y_i - f(x_i) \le \varepsilon + \xi_i^*, \quad \xi_i \ge 0, \; \xi_i^* \ge 0, \quad i = 1, \ldots, m \quad (69)$$

Figure 8  Left: the $\varepsilon$-tube of SVM regression. Right: the SVM loss function associated with violations.

whose QP dual formulation is

$$\max_{\alpha, \alpha^*} \; -\frac{1}{2}(\alpha - \alpha^*)^T G (\alpha - \alpha^*) - \varepsilon \sum_{i=1}^{m} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{m} y_i (\alpha_i - \alpha_i^*)$$

[...]

[...] in this chapter) are conceived for data represented as real vectors or in the so-called attribute-value (or propositional) representation, where each instance is a tuple of attributes that may be either continuous or categorical. Mapping tuples to real vectors is straightforward and, in the case [...]

[...]

$$-\sum_{i=1}^{m} \left[ y_i \log f(x_i; \theta) + (1 - y_i) \log\big(1 - f(x_i; \theta)\big) \right] \quad (28)$$

which has the form of a cross-entropy function (a.k.a. Kullback–Leibler divergence) (24) between the distribution obtained by reading the output of the network and the (degenerate) distribution that concentrates all the probability on the known targets $y_i$. We can generalize the cross-entropy to the multiclass case by [...]
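A quick numerical sketch (plain Python, with an illustrative target and prediction) contrasts the cross-entropy error signal of a sigmoid output unit, which for Eq. (28) reduces to $-(y_i - f(x_i))$, with the squared-error delta of Eq. (38) discussed nearby:

```python
def delta_cross_entropy(y, f):
    # sigmoid output + cross-entropy: error signal proportional to (f - y)
    return -(y - f)

def delta_squared_error(y, f):
    # sigmoid output + squared error, Eq. (38): extra factor f * (1 - f)
    return -(y - f) * f * (1.0 - f)

# a confidently wrong prediction: target 1, output saturated near 0
y, f = 1.0, 0.001
print(delta_cross_entropy(y, f))   # -0.999: a strong corrective signal
print(delta_squared_error(y, f))   # about -0.001: the gradient nearly vanishes
```

The extra $f(1 - f)$ factor shrinks toward zero exactly when the unit is saturated, so a squared-error network can stall on its largest mistakes while the cross-entropy network keeps correcting them.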
[...] inputs (normal, Bernoulli, and multinomial, respectively) (26). Interestingly, if we had used a least-squares error for classification instead of a cross-entropy, we would have obtained a different result:

$$\delta_i = -(y_i - f(x_i; \theta))\, f(x_i; \theta)\,(1 - f(x_i; \theta)) \quad (38)$$

The latter form may be less desirable. Suppose, for some weight assignment, we have $y_i = 1$ but $f(x_i) \approx 0$, i.e., we are making a large error on the [...]

[...] term is the so-called prior on the parameters. Adding a prior can be seen as a form of regularization that biases toward simpler or smoother solutions to the learning problem (see, e.g., Ref. 31 for a more general discussion of the role of inductive bias in machine learning). Weight decay is obtained if we assume that $p(\theta)$ is a zero-mean Gaussian with covariance [...]

[...] earlier foundational work by Vapnik on statistical learning theory (33). In particular, SVMs attempt to give a well-posed formulation to the learning problem through the principle of structural risk minimization, embodying a more general principle of parsimony that is also the foundation of Occam's razor, regularization theory, and the Bayesian approach to learning.

4.1. Maximal Margin Hyperplane

For simplicity, [...]

$$f(x) = \beta^T x + \beta_0 \quad (48)$$

where $\beta \in \mathbb{R}^n$ and $\beta_0 \in \mathbb{R}$ are adjustable parameters, just like in logistic regression. Let $D_m$ be a linearly separable data set, i.e., $f(x_i)\, y_i > 0$, $i = 1, \ldots, m$. We call the margin $M$ of the classifier the distance between the separating hyperplane and the closest training example. The optimal separating hyperplane is defined as the one having maximum margin.
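The margin $M$ just defined is easy to compute for a given linear classifier. A minimal plain-Python sketch (the data set and the parameter values are hypothetical, chosen only to illustrate the geometry):

```python
def margin(beta, beta0, data):
    """Distance from the hyperplane beta.x + beta0 = 0 to the closest
    training example, assuming the data are linearly separable
    (i.e., every term y_i * f(x_i) is positive)."""
    norm = sum(b * b for b in beta) ** 0.5
    return min(y * (sum(b * x for b, x in zip(beta, xi)) + beta0)
               for xi, y in data) / norm

# a separable toy set: the hyperplane x1 = 0 separates the two classes
data = [((2.0, 0.0), +1), ((3.0, 1.0), +1), ((-1.0, 0.0), -1), ((-2.0, 2.0), -1)]
print(margin((1.0, 0.0), 0.0, data))  # 1.0: the closest point is (-1, 0)
```

Among all separating hyperplanes for this set, the maximal margin one is the hyperplane for which this quantity is largest.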
[...] points are called support vectors. The decision function $f(x)$ can be computed via Eq. (48) or, equivalently, from the following dual form:

$$f(x) = \sum_{i=1}^{m} y_i \alpha_i x^T x_i + \beta_0 \quad (55)$$

The maximum margin hyperplane has two important properties. First, it is unique for a given linearly separable $D_m$. Second, it can be shown that the VC-dimension of the class of ‘‘thick'' hyperplanes having thickness $2M$ and support vectors [...] is bounded by $d \le R^2/M^2 + 1$. When plugged into uniform convergence bounds like Eq. (8), this suggests that the complexity of the hypothesis space (and consequently the convergence of the empirical error to the generalization error) does not necessarily depend on the dimension $n$ of the input space. In some application domains (e.g., gene expression data [...])
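Eq. (55) is straightforward to implement once the nonzero multipliers are known. A plain-Python sketch follows; the $\alpha_i$ values and support vectors here are made up for illustration, not the output of an actual QP solver:

```python
def dual_decision(x, support, beta0=0.0):
    """Eq. (55): f(x) = sum_i y_i * alpha_i * <x, x_i> + beta0.
    `support` holds (alpha_i, y_i, x_i) triples; alpha_i > 0 only for
    support vectors, so only they contribute to the sum."""
    return sum(a * y * sum(u * v for u, v in zip(x, xi))
               for a, y, xi in support) + beta0

# hypothetical support vectors straddling the hyperplane x1 = 0
support = [(0.5, +1, (1.0, 0.0)), (0.5, -1, (-1.0, 0.0))]
print(dual_decision((2.0, 3.0), support))   # 2.0: positive side
print(dual_decision((-2.0, 1.0), support))  # -2.0: negative side
```

Note that the input $x$ enters only through inner products $x^T x_i$, which is exactly the property that kernel functions exploit.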