By repeating this procedure for each case in the database, we compute fitted values for each variable $Y_i$, and then define the blanket residuals by $r_{ik} = y_{ik} - \hat{y}_{ik}$ for numerical variables, and by $c_{ik} = \delta(y_{ik}, \hat{y}_{ik})$ for categorical variables, where the function $\delta(a,b)$ takes value $\delta = 0$ when $a = b$ and $\delta = 1$ when $a \neq b$. Lack of significant patterns in the residuals $r_{ik}$ and approximate symmetry about 0 provide evidence in favor of a good fit for the variable $Y_i$, while anomalies in the blanket residuals can help to identify weaknesses in the dependency structure that may be due to outliers or leverage points. Significance testing of the goodness of fit can be based on the standardized residuals

$R_{ik} = \frac{r_{ik}}{\sqrt{V(y_i)}}$

where the variance $V(y_i)$ is computed from the fitted values. Under the hypothesis that the network fits the data well, we expect approximately 95% of the standardized residuals to fall within the limits $[-2, 2]$. When the variable $Y_i$ is categorical, the residuals $c_{ik}$ identify the errors made in reproducing the data and can be summarized to compute the error rate of the fit. Because these residuals measure the difference between the observed and fitted values, anomalies in the residuals can identify inadequate dependencies in the network. However, residuals that are on average not significantly different from 0 do not necessarily prove that the model is good. A better validation of the network should be done on an independent test set, to show that the model induced from one particular data set is reproducible and gives good predictions. Measures of predictive accuracy can be based on monitors that use the logarithmic scoring function (Good, 1952). The basic intuition is to measure the degree of surprise in predicting that the variable $Y_i$ will take the value $y_{ih}$ in the $h$th case of an independent test set. The measure of surprise is defined by the score

$s_{ih} = -\log p(y_{ih} \mid MB(y_i)_h)$

where $MB(y_i)_h$ is the configuration of the Markov blanket of $Y_i$ in the test case $h$, $p(y_{ih} \mid MB(y_i)_h)$ is the predictive probability computed with the model induced from the data, and $y_{ih}$ is the value of $Y_i$ in the $h$th case of the test set. The score $s_{ih}$ is 0 when the model predicts $y_{ih}$ with certainty, and it increases as the probability of $y_{ih}$ decreases. The scores can be summarized to derive local and global monitors and to define tests for predictive accuracy (Cowell et al., 1999). In the absence of an independent test set, standard cross-validation techniques are typically used to assess the predictive accuracy of one or more nodes (Hand, 1997). In K-fold cross-validation, the data are divided into K non-overlapping sets of approximately the same size. Then K − 1 sets are used for training (or inducing) the network from data, which is then tested on the remaining set using monitors or other measures of predictive accuracy (Hastie et al., 2001). By repeating this process K times, we derive independent measures of the predictive accuracy of the network induced from data, as well as measures of the robustness of the network to sampling variability. Note that the predictive accuracy based on cross-validation is usually an over-optimistic measure, and several authors have recently argued that cross-validation should be used with caution (Braga-Neto and Dougherty, 2004), particularly with small sample sizes.
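To make these diagnostics concrete, the following sketch (a minimal illustration with made-up numbers, not code from the chapter) computes standardized blanket residuals for a numerical node, checks how many fall within $[-2, 2]$, and evaluates logarithmic scores for a categorical node; here $V(y_i)$ is read as the sample variance of the fitted values, which is only one possible reading of the definition above.

```python
import numpy as np

def standardized_blanket_residuals(y_obs, y_fit):
    """Blanket residuals r_ik = y_ik - yhat_ik for one numerical node,
    standardized by the sample variance of the fitted values."""
    r = y_obs - y_fit
    R = r / np.sqrt(np.var(y_fit, ddof=1))
    # Under a well-fitting network, roughly 95% of R should lie in [-2, 2].
    coverage = np.mean(np.abs(R) <= 2)
    return R, coverage

def log_scores(p_predictive):
    """Logarithmic scores s_ih = -log p(y_ih | MB(y_i)_h), where p_predictive
    holds the predictive probability assigned to the observed value of Y_i
    in each test case h."""
    return -np.log(p_predictive)

# Hypothetical observed vs fitted values for a numerical node.
y_obs = np.array([3.1, 2.7, 4.0, 3.6, 5.2])
y_fit = np.array([3.0, 2.9, 3.8, 3.7, 4.9])
R, coverage = standardized_blanket_residuals(y_obs, y_fit)
print(R, coverage)

# Hypothetical predictive probabilities of the observed labels in a test set.
print(log_scores(np.array([0.8, 0.6, 0.95, 0.4])).mean())
```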
10.5 Bayesian Networks in Data Mining

This section describes the use of Bayesian networks to undertake other typical Data Mining tasks, such as classification, and to model more complex structures, such as nonlinear and temporal dependencies.

10.5.1 Bayesian Networks and Classification

The term "supervised classification" covers two complementary tasks: the first is to identify a function mapping a set of attributes onto a class, and the other is to assign a class label to a set of unclassified cases described by attribute values. We denote by $C$ the variable whose states represent the class labels $c_i$, and by $Y_i$ the attributes. Classification is typically performed by first training a classifier on a set of labelled cases (the training set) and then using it to label unclassified cases (the test set). The supervisory component of this classifier resides in the training signal, which provides the classifier with a way to assess a dependency measure between attributes and classes. The classification of a case with attribute values $y_{1k}, \ldots, y_{vk}$ is then performed by computing the probability distribution $p(C \mid y_{1k}, \ldots, y_{vk})$ of the class variable, given the attribute values, and by labelling the case with the most probable label.

Most of the algorithms for learning classifiers described as Bayesian networks impose a restriction on the network structure, namely that there cannot be arcs pointing to the class variable. In this case, by the local Markov property, the joint probability $p(y_{1k}, \ldots, y_{vk}, c_k)$ of class and attributes is factorized as $p(c_k) p(y_{1k}, \ldots, y_{vk} \mid c_k)$. The simplest example is known as the Naïve Bayes classifier (NBC) (Duda and Hart, 1973, Langley et al., 1992), which makes the further simplification that the attributes $Y_i$ are conditionally independent given the class $C$, so that $p(y_{1k}, \ldots, y_{vk} \mid c_k) = \prod_i p(y_{ik} \mid c_k)$. Figure 10.5 depicts the directed acyclic graph of a NBC. Because of the restriction on the network topology, the training step for a NBC consists of estimating the conditional probability distributions of each attribute, given the class, from a training data set. When the attributes are discrete, or are continuous variables that follow Gaussian distributions, the parameters are learned by using the procedure described in Section 10.4. Once trained, the NBC classifies a case by computing the posterior probability distribution over the classes via Bayes' Theorem and assigns the case to the class with the highest posterior probability. When the attributes are all continuous and modelled by Gaussian variables, and the class variable is binary, say $c_k = 0, 1$, the classification rule induced by a NBC is very similar to the Fisher discriminant rule and turns out to be a function of

$r = \sum_i \{ \log(\sigma^2_{i0}/\sigma^2_{i1}) - (y_i - \mu_{i1})^2/\sigma^2_{i1} + (y_i - \mu_{i0})^2/\sigma^2_{i0} \}$

where $y_i$ is the value of attribute $i$ in the new sample to classify, and the parameters $\sigma^2_{ik}$ and $\mu_{ik}$ are the variance and mean of the attribute Gaussian distribution, conditional on the class membership, which are usually estimated by maximum likelihood.
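As a concrete illustration of the Gaussian NBC rule above, the sketch below (a hypothetical example, not the chapter's implementation) estimates the class-conditional means and variances by maximum likelihood and evaluates the discriminant $r$; since $r$ equals twice the log-likelihood ratio of class 1 to class 0, a case is assigned to class 1 when $r > 0$ under equal class priors.

```python
import numpy as np

def fit_gaussian_nbc(X, y):
    """Maximum likelihood estimates of the class-conditional means and
    variances for a binary class (0/1) and continuous attributes."""
    params = {}
    for c in (0, 1):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), Xc.var(axis=0))  # mu_ic, sigma2_ic
    return params

def discriminant_r(x, params):
    """r = sum_i { log(s2_i0/s2_i1) - (x_i-mu_i1)^2/s2_i1 + (x_i-mu_i0)^2/s2_i0 }."""
    mu0, s20 = params[0]
    mu1, s21 = params[1]
    return np.sum(np.log(s20 / s21) - (x - mu1) ** 2 / s21 + (x - mu0) ** 2 / s20)

# Hypothetical training data: 6 cases, 2 continuous attributes.
X = np.array([[1.0, 2.1], [1.2, 1.9], [0.8, 2.3],
              [3.1, 0.9], [2.9, 1.1], [3.3, 0.7]])
y = np.array([0, 0, 0, 1, 1, 1])
params = fit_gaussian_nbc(X, y)
x_new = np.array([2.8, 1.0])
r = discriminant_r(x_new, params)
print("r =", r, "-> class", int(r > 0))  # equal class priors assumed
```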
Other classifiers have been proposed to relax the assumption that the attributes are conditionally independent given the class. Perhaps the most competitive one is the Tree Augmented Naïve Bayes (TAN) classifier (Friedman et al., 1997), in which every attribute has the class variable as a parent as well as one other attribute. To avoid cycles, the attributes have to be ordered, and the first attribute has no other parent besides the class variable. Figure 10.6 shows an example of a TAN classifier with five attributes. An algorithm to infer a TAN classifier needs to choose both the dependency structure between attributes and the parameters that quantify this dependency. Due to the simplicity of its structure, the identification of a TAN classifier does not require any search but rather the construction of a tree among the attributes. An "ad hoc" algorithm called Construct-TAN (CTAN) was proposed in (Friedman et al., 1997). One limitation of the CTAN algorithm for building TAN classifiers is that it applies only to discrete attributes, so continuous attributes need to be discretized.

Other extensions of the NBC try to relax some of the assumptions made by the NBC or the TAN classifiers. One example is the l-Limited Dependence Bayesian classifier (l-LDB), in which the maximum number of parents that an attribute can have is l (Sahami, 1996). Another example is the unrestricted Augmented Naïve Bayes classifier (ANB), in which the number of parents is unlimited, but the scoring metric used for learning, the minimum description length criterion, biases the search toward models with a small number of parents per attribute (Friedman et al., 1997). Due to the high dimensionality of the space of different ANB networks, algorithms that build this type of classifier must rely on heuristic searches. More examples are reported in (Friedman et al., 1997).

Fig. 10.5. The structure of the Naïve Bayes classifier.

Fig. 10.6. The structure of a TAN classifier.

10.5.2 Generalized Gamma Networks

Most of the work on learning Bayesian networks from data has focused on learning networks of categorical variables, or networks of continuous variables modeled by Gaussian distributions with linear dependencies. This section describes a new class of Bayesian networks, called Generalized Gamma networks (GGN), able to describe possibly nonlinear dependencies between variables with non-normal distributions (Sebastiani and Ramoni, 2003). In a GGN the conditional distribution of each variable $Y_i$ given the parents $Pa(y_i) = \{Y_{i1}, \ldots, Y_{ip(i)}\}$ follows a Gamma distribution

$Y_i \mid pa(y_i), \theta_i \sim \mathrm{Gamma}(\alpha_i, \mu_i(pa(y_i), \beta_i))$

where $\mu_i(pa(y_i), \beta_i)$ is the conditional mean of $Y_i$ and $\mu_i(pa(y_i), \beta_i)^2/\alpha_i$ is the conditional variance. We use the standard parameterization of generalized linear models (McCullagh and Nelder, 1989), in which the mean $\mu_i(pa(y_i), \beta_i)$ is not restricted to be a linear function of the parameters $\beta_{ij}$; rather, linearity in the parameters is enforced in the linear predictor $\eta_i$, which is related to the mean by the link function $\mu_i = g(\eta_i)$. Therefore, we model the conditional density function as

$p(y_i \mid pa(y_i), \theta_i) = \frac{\alpha_i^{\alpha_i}}{\Gamma(\alpha_i)\,\mu_i^{\alpha_i}}\, y_i^{\alpha_i - 1} e^{-\alpha_i y_i/\mu_i}, \quad y_i \geq 0 \qquad (10.4)$

where $\mu_i = g(\eta_i)$ and the linear predictor $\eta_i$ is parameterized as $\eta_i = \beta_{i0} + \sum_j \beta_{ij} f_j(pa(y_i))$, with the $f_j(pa(y_i))$ possibly nonlinear functions. The linear predictor $\eta_i$ is linear in the parameters $\beta$, but it is not restricted to be a linear function of the parent values, so the generality of Gamma networks lies in their ability to encode general nonlinear stochastic dependencies between the node variables.
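To make equation (10.4) concrete, the short sketch below (an illustrative stand-alone example with hypothetical parameter values, not the authors' code) evaluates the Gamma conditional density of a node given one parent, assuming a log link and a linear predictor that is nonlinear in the parent value, $\eta = \beta_0 + \beta_1 \log(\text{parent})$.

```python
import math

def gamma_conditional_density(y, mu, alpha):
    """Density of equation (10.4): Gamma with shape alpha and mean mu,
    so that the conditional variance is mu**2 / alpha."""
    if y < 0:
        return 0.0
    log_p = (alpha * math.log(alpha) - math.lgamma(alpha)
             - alpha * math.log(mu) + (alpha - 1.0) * math.log(y)
             - alpha * y / mu)
    return math.exp(log_p)

def conditional_mean(parent_values, beta):
    """Mean mu = g(eta) with a log link: eta is linear in beta but
    nonlinear in the parent values (here a log transform of each parent)."""
    eta = beta[0] + sum(b * math.log(pv) for b, pv in zip(beta[1:], parent_values))
    return math.exp(eta)

# Hypothetical parameters: one parent, beta = (beta0, beta1), shape alpha.
beta = (1.0, 0.8)
alpha = 2.5
parents = (50.0,)               # observed value of the parent node
mu = conditional_mean(parents, beta)
print(mu, gamma_conditional_density(120.0, mu, alpha))
```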
Table 10.1 shows examples of nonlinear mean functions.

Table 10.1. Link functions and parameterizations of the linear predictor.
IDENTITY: $\mu = \eta$, with $\eta_i = \beta_{i0} + \sum_j \beta_{ij} y_{ij}$
INVERSE: $\mu = \eta^{-1}$, with $\eta_i = \beta_{i0} + \sum_j \beta_{ij} y_{ij}^{-1}$
LOG: $\mu = e^{\eta}$, with $\eta_i = \beta_{i0} + \sum_j \beta_{ij} \log(y_{ij})$

Figure 10.7 shows some examples of Gamma density functions for different shape parameters $\alpha = 1, 1.5, 5$ and mean $\mu = 400$. Note that approximately symmetrical distributions are obtained for particular values of the shape parameter $\alpha$.

Fig. 10.7. Examples of Gamma density functions for shape parameters $\alpha = 1$ (continuous line), $\alpha = 1.5$ (dashed line), and $\alpha = 5$ (dotted line) and mean $\mu = 400$. For fixed mean, the parameter $\alpha$ determines the shape of the distribution, which is skewed for small $\alpha$ and approaches symmetry as $\alpha$ increases.

Unfortunately, there is no closed-form solution to learn the parameters of a GGN, and we therefore have to resort to Markov Chain Monte Carlo methods to compute stochastic estimates (Madigan and Ridgeway, 2003), or to maximum likelihood to compute numerical approximations of the posterior modes (Kass and Raftery, 1995). A well-known property of generalized linear models is that the parameters $\beta_{ij}$ can be estimated independently of $\alpha_i$, which is then estimated conditionally on $\beta_{ij}$ (McCullagh and Nelder, 1989). To compute the maximum likelihood estimates of the parameters $\beta_{ij}$ within each family $(Y_i, Pa(y_i))$, we need to solve the system of equations $\partial \log p(D \mid \theta_i)/\partial \beta_{ij} = 0$. The Fisher scoring method is the most efficient algorithm to find the solution of this system of equations. This iterative procedure is a generalization of the Newton-Raphson procedure in which the Hessian matrix is replaced by its expected value. This modification speeds up the convergence rate of the iterative procedure, which is usually very efficient: with appropriate initial values it typically converges in about 5 steps. Details can be found, for example, in (McCullagh and Nelder, 1989).

Once the ML estimates of the $\beta_{ij}$ are known, say $\hat{\beta}_i$, we compute the fitted means $\hat{\mu}_{ik} = g(\hat{\beta}_{i0} + \sum_j \hat{\beta}_{ij} f_j(pa(y_i)))$ and use these quantities to estimate the shape parameter $\alpha_i$. Estimation of the shape parameter in Gamma distributions is an open issue, and authors have suggested several estimators (see for example (McCullagh and Nelder, 1989)). A popular choice is the deviance-based estimator, defined as

$\tilde{\alpha}_i = \frac{n - q}{\sum_k (y_{ik} - \hat{\mu}_{ik})^2 / \hat{\mu}_{ik}^2}$

where $q$ is the number of parameters $\beta_{ij}$ that appear in the linear predictor. The maximum likelihood estimate $\hat{\alpha}_i$ of the shape parameter $\alpha_i$ would require the solution of the equation

$n + n \log(\alpha_i) - n \frac{\Gamma'(\alpha_i)}{\Gamma(\alpha_i)} - \sum_k \log(\hat{\mu}_{ik}) + \sum_k \log(y_{ik}) - \sum_k \frac{y_{ik}}{\hat{\mu}_{ik}} = 0$

with respect to $\alpha_i$. We have an approximate closed-form solution to this equation, based on a Taylor expansion, that is discussed in (Sebastiani, Ramoni, and Kohane, 2003, Sebastiani et al., 2004, Sebastiani, Yu, and Ramoni, 2003).

The model selection process also requires the use of approximation methods. In this case, we use the Bayesian information criterion (BIC) (Kass and Raftery, 1995) to approximate the marginal likelihood by

$2 \log p(D \mid \hat{\theta}) - n_p \log(n)$

where $\hat{\theta}$ is the maximum likelihood estimate of $\theta$ and $n_p$ is the overall number of parameters in the network. BIC is independent of the prior specification on the model space and trades off goodness of fit, measured by the term $2 \log p(D \mid \hat{\theta})$, against model complexity, measured by the term $n_p \log(n)$. We note that BIC factorizes into a product of terms, one for each variable $Y_i$, making it possible to conduct local structural learning.
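To illustrate the estimation step just described, the sketch below (an illustrative implementation under simplifying assumptions, not the authors' software) fits the $\beta$ parameters of a single family by Fisher scoring for a Gamma generalized linear model with a log link, for which the iterative weights are constant, and then computes the deviance-based estimate of the shape parameter. The design matrix X is assumed to already contain the possibly nonlinear transforms $f_j$ of the parent values, and the data are simulated.

```python
import numpy as np

def fit_gamma_glm_log_link(X, y, n_iter=25, tol=1e-8):
    """Fisher scoring (IRLS) for a Gamma GLM with log link: eta = X beta,
    mu = exp(eta).  For this link/variance combination the IRLS weights
    are constant, so each step is a least-squares solve on the working
    response z = eta + (y - mu) / mu."""
    n, q = X.shape
    beta = np.zeros(q)
    beta[0] = np.log(y.mean())          # intercept-only starting value
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu
        beta_new, *_ = np.linalg.lstsq(X, z, rcond=None)
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    mu_hat = np.exp(X @ beta)
    # Deviance-based (moment) estimate of the shape parameter alpha.
    alpha_tilde = (n - q) / np.sum((y - mu_hat) ** 2 / mu_hat ** 2)
    return beta, alpha_tilde

# Hypothetical family: one parent, design matrix [1, log(parent value)].
rng = np.random.default_rng(0)
parent = rng.uniform(10, 100, size=200)
X = np.column_stack([np.ones_like(parent), np.log(parent)])
true_mu = np.exp(0.5 + 0.9 * np.log(parent))
y = rng.gamma(shape=3.0, scale=true_mu / 3.0)
beta_hat, alpha_hat = fit_gamma_glm_log_link(X, y)
print(beta_hat, alpha_hat)
```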
While the general type of dependencies in Gamma networks makes it possible to model a variety of dependencies between the variables, exact probabilistic reasoning with the network becomes impossible, and we need to resort to Gibbs sampling (see Section 10.2). Our simulation approach uses the adaptive rejection Metropolis sampling (ARMS) of (Gilks and Roberts, 1996) when the conditional density $p(y_i \mid Y \setminus y_i, \hat{\theta})$ is log-concave, and adaptive rejection with Metropolis sampling in the other cases (Sebastiani and Ramoni, 2003).

10.5.3 Bayesian Networks and Dynamic Data

One of the limitations of Bayesian networks is their inability to represent feedback loops: by definition, the directed graph that encodes the marginal and conditional independencies between the network variables cannot have cycles. This limitation makes traditional Bayesian networks unsuitable for representing many systems in which feedback controls are a critical aspect, in application domains ranging from control engineering to the biomedical sciences. Dynamic Bayesian networks provide a general framework to integrate multivariate time series and to represent feed-forward loops and feedback mechanisms.

A dynamic Bayesian network is defined by a directed acyclic graph in which nodes continue to represent stochastic variables and arrows represent temporal dependencies that are quantified by probability distributions. The crucial assumption is that the probability distributions of the temporal dependencies are time invariant, so that the directed acyclic graph of a dynamic Bayesian network represents only the necessary and sufficient time transitions to reconstruct the overall temporal process. Figure 10.8 shows the directed acyclic graph of a dynamic Bayesian network with three variables.

Fig. 10.8. A directed acyclic graph that represents the temporal dependency of three categorical variables describing positive (+) and negative (-) regulation.

The subscript of each node denotes the time lag, so that the arrows from the nodes $Y_{2(t-1)}$ and $Y_{1(t-1)}$ to the node $Y_{1(t)}$ describe the dependency of the probability distribution of the variable $Y_1$ at time $t$ on the values of $Y_1$ and $Y_2$ at time $t-1$. Similarly, the directed acyclic graph shows that the probability distribution of the variable $Y_2$ at time $t$ is a function of the values of $Y_1$ and $Y_2$ at time $t-1$. This symmetrical dependency allows us to represent feedback loops, and we used it to describe the regulatory control of glucose in diabetic patients (Ramoni et al., 1995). A dynamic Bayesian network is not restricted to representing temporal dependencies of order 1. For example, the probability distribution of the variable $Y_3$ at time $t$ depends on the value of the variable at time $t-1$ as well as the value of the variable $Y_2$ at time $t-2$. The conditional probability table in Figure 10.8 shows an example in which the variables $Y_2, Y_3$ are categorical.
By using the local Markov property, the joint probability distribution of the three variables at time $t$, given the past history

$h_t := y_{1(t-1)}, \ldots, y_{1(t-l)}, y_{2(t-1)}, \ldots, y_{2(t-l)}, y_{3(t-1)}, \ldots, y_{3(t-l)},$

is given by the product of the three factors

$p(y_{1(t)} \mid h_t) = p(y_{1(t)} \mid y_{1(t-1)}, y_{2(t-1)})$
$p(y_{2(t)} \mid h_t) = p(y_{2(t)} \mid y_{1(t-1)}, y_{2(t-1)})$
$p(y_{3(t)} \mid h_t) = p(y_{3(t)} \mid y_{3(t-1)}, y_{2(t-2)})$

which represent the probabilities of transition over time. By assuming that these probability distributions are time invariant, they are sufficient to compute the probability that a process that starts from known values $y_{1(1)}, y_{2(1)}, y_{3(0)}, y_{3(1)}$ evolves into $y_{1(T)}, y_{2(T)}, y_{3(T)}$, by using one of the algorithms for probabilistic reasoning described in Section 10.2. The same algorithms can be used to compute the probability that a process with values $y_{1(T)}, y_{2(T)}, y_{3(T)}$ at time $T$ started from the initial states $y_{1(1)}, y_{2(1)}, y_{3(0)}, y_{3(1)}$; a small sketch of this computation is given at the end of this subsection.

Fig. 10.9. Modular learning of the dynamic Bayesian network in Figure 10.8. First a regressive model is learned for each of the three variables at time $t$, and then the three models are joined by their common ancestors $Y_{1(t-1)}$, $Y_{2(t-1)}$ and $Y_{2(t-2)}$ to produce the directed acyclic graph in Figure 10.8.

Learning dynamic Bayesian networks when all the variables are observable is a straightforward parallel application of the structural learning described in Section 10.4. To build the network, we proceed by selecting the set of parents for each variable $Y_i$ at time $t$, and then the models are joined by the common ancestors. An example is in Figure 10.9. The search for each local dependency structure is simplified by the natural ordering imposed on the variables by the temporal frame (Friedman et al., 1998), which constrains the model space of each variable $Y_i$ at time $t$: the set of candidate parents consists of the variables $Y_{i(t-1)}, \ldots, Y_{i(t-p)}$ as well as the variables $Y_{h(t-j)}$ for all $h \neq i$ and $j = 1, \ldots, p$. The K2 algorithm (Cooper and Herskovitz, 1992) discussed in Section 10.4 appears to be particularly suitable for exploring the space of dependencies for each variable $Y_{i(t)}$. The only critical issue is that the selection of the largest temporal order to explore depends on the sample size, because each temporal lag of order $p$ leads to the loss of the first $p$ temporal observations in the data set (Yu et al., 2002).
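As mentioned above, here is a minimal sketch (with hypothetical transition tables, not those of Figure 10.8) of how the time-invariant transition distributions are multiplied to obtain the probability of a trajectory of the three categorical variables; the treatment of the first few time points is a simplification, since the transition of $Y_3$ needs $Y_2$ two steps back.

```python
# Time-invariant transition distributions for three binary variables,
# following the factorization p(y1_t | y1_{t-1}, y2_{t-1}),
# p(y2_t | y1_{t-1}, y2_{t-1}), p(y3_t | y3_{t-1}, y2_{t-2}).
# The numeric values below are hypothetical.
p_y1 = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.6, 1: 0.4},
        (1, 0): {0: 0.3, 1: 0.7}, (1, 1): {0: 0.2, 1: 0.8}}
p_y2 = {(0, 0): {0: 0.8, 1: 0.2}, (0, 1): {0: 0.5, 1: 0.5},
        (1, 0): {0: 0.4, 1: 0.6}, (1, 1): {0: 0.1, 1: 0.9}}
p_y3 = {(0, 0): {0: 0.7, 1: 0.3}, (0, 1): {0: 0.4, 1: 0.6},
        (1, 0): {0: 0.5, 1: 0.5}, (1, 1): {0: 0.2, 1: 0.8}}

def trajectory_probability(y1, y2, y3, T):
    """Multiply the time-invariant transition probabilities to obtain the
    probability of reaching the values at time T from the initial values
    (y1[1], y2[1], y3[0], y3[1] are treated as known)."""
    prob = 1.0
    for t in range(2, T + 1):
        prob *= p_y1[(y1[t - 1], y2[t - 1])][y1[t]]
        prob *= p_y2[(y1[t - 1], y2[t - 1])][y2[t]]
        if t >= 3:  # Y3's transition also needs Y2 two steps back,
                    # which is first available at time 1
            prob *= p_y3[(y3[t - 1], y2[t - 2])][y3[t]]
    return prob

# Hypothetical trajectory observed up to T = 3 (values indexed by time).
y1 = {1: 0, 2: 1, 3: 1}
y2 = {1: 1, 2: 0, 3: 1}
y3 = {0: 0, 1: 1, 2: 0, 3: 1}
print(trajectory_probability(y1, y2, y3, T=3))
```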
10.6 Data Mining Applications

Bayesian networks have been used by us and others as knowledge discovery tools in a variety of fields, ranging from survey data analysis (Sebastiani and Ramoni, 2000, Sebastiani and Ramoni, 2001B) to customer profiling (Sebastiani et al., 2000) and bioinformatics (Friedman, 2004, Sebastiani et al., 2004). Here we describe two Data Mining and knowledge discovery applications based on Bayesian networks.

10.6.1 Survey Data

A major goal of surveys conducted by Federal Agencies is to provide citizens, consumers and decision makers with useful information in a compact and understandable format. Data are expected to improve the understanding that institutions, businesses, and citizens have of the current state of affairs in the country, and they play a key role in political decisions. But the size and structure of these fast-growing databases pose the challenge of how to effectively extract and present this information to enhance planning, prediction, and decision making.

An example of a fast-growing database is the Current Population Survey (CPS) database, which collects monthly surveys of about 50,000 households conducted by the U.S. Bureau of the Census. These surveys are the primary source of information on the labor force characteristics of the U.S. population; they provide estimates for the nation as a whole and serve as part of model-based estimates for individual states and other geographic areas. Estimates obtained from the CPS include employment and unemployment, earnings, hours of work, and other indicators, and are often associated with a variety of demographic characteristics including age, sex, race, marital status, and education. CPS data are used by government policymakers and legislators as important indicators of the nation's economic situation and for planning and evaluating many government programs.

For most of the surveys conducted by the U.S. Census Bureau, users can access both the microdata and summary tables. Summary tables provide easy access to findings of interest by relating a small number of preselected variables. In so doing, summary tables disintegrate the information contained in the original data into micro-components and fail to convey an overall picture of the process underlying the data. A different approach to the analysis of survey data is to employ Data Mining tools to generate hypotheses and hence to make new discoveries in an automated way (Hand et al., 2001, Hand et al., 2002).

As an example, Figure 10.10 shows a Bayesian network learned from a data set of 13 variables extracted from the 1996 General Household Survey conducted between April 1996 and March 1997 by the British Office of National Statistics in Great Britain.

Fig. 10.10. Bayesian network induced from a portion of the 1996 General Household Survey, conducted between April 1996 and March 1997 by the British Office of National Statistics in Great Britain.

Variables and their states are summarized in Table 10.2. The network structure shows interesting directed dependencies and conditional independencies. For example, there is a dependency between the ethnic group of the heads of the households and the region of birth (variables Region and HoH origin), and the conditional probability table that shapes this dependency reveals a more cosmopolitan society in England than in Wales and Scotland, with a larger proportion of Blacks and Indians as household heads. The working status of the head of the household (HoH status) is independent of the ethnic group given gender and age. The conditional probability table that shapes this dependency shows that young female heads of household are much more likely to be inactive than male heads of household (40% compared to 6% when the age group is 17–36). This difference is attenuated as the age of the head