Data Mining and Knowledge Discovery Handbook, 2nd Edition, Part 98

... shown that FD logically implies CI (Butz et al., 1999). We show how to combine the obtained FDs with the chain rule of probability to construct a DAG of a CN. Given a set of FDs obtained from data, an ordering of the variables is computed such that the Markov boundaries of some variables are determined. By representing the joint probability distribution of the variables in the resulting ordering with the chain rule, the Markov boundaries of the remaining variables are determined. A DAG of a CN is then constructed by designating the Markov boundary of each variable as that variable's parent set. During this process, we take full advantage of known results in both CNs (Pearl, 1988) and relational databases (Maier, 1983). We demonstrate the effectiveness of our approach using fifteen real-world datasets. The DAG constructed by our approach can also be used as an initial DAG for previous approaches. The work here further illustrates the intrinsic relationship between CNs and relational databases (Wong et al., 2000, Wong and Butz, 2001).

The remainder of this chapter is organized as follows. Background knowledge is given in Section 49.2. In Section 49.3, the theoretical foundation of our approach is provided. The algorithm to construct a CN is developed in Section 49.4. In Section 49.5, the experimental results are presented. Conclusions are drawn in Section 49.6.

49.2 Background Knowledge

Let $U$ be a finite set of discrete variables, each with a finite domain, and let $V$ be the Cartesian product of the variable domains. A joint probability distribution (Pearl, 1988) $p(U)$ is a function $p$ on $V$ such that $0 \le p(v) \le 1$ for each configuration $v \in V$ and $\sum_{v \in V} p(v) = 1.0$. The marginal distribution $p(X)$ for $X \subseteq U$ is defined as $\sum_{U-X} p(U)$. If $p(X) > 0$, then the conditional probability distribution $p(Y \mid X)$ for $X, Y \subseteq U$ is defined as $p(XY)/p(X)$. In this chapter, we may write $a_i$ for the singleton set $\{a_i\}$, and we use the terms attribute and variable interchangeably; similarly for the terms tuple and configuration.

Definition 1. A causal network (CN) is a directed acyclic graph (DAG) $D$ together with a conditional probability distribution (CPD) $p(a_i \mid P_i)$ for each variable $a_i$ in $D$, where $P_i$ denotes the parent set of $a_i$ in $D$. The DAG $D$ graphically encodes CIs regarding the variables in $U$.

Definition 2. (Wong et al., 2000). Let $X$, $Y$, and $Z$ be three disjoint sets of variables. $X$ is said to be conditionally independent of $Y$ given $Z$, denoted $I(X, Z, Y)$, if $p(X \mid Y, Z) = p(X \mid Z)$.

As previously mentioned, the CIs encoded in the DAG $D$ indicate that the product of the given CPDs is a joint probability distribution $p(U)$.

Example 1. One CN on the set $U = \{a_1, a_2, a_3, a_4, a_5, a_6\}$ is the DAG in Figure 49.1(i) together with the CPDs $p(a_1)$, $p(a_2 \mid a_1)$, $p(a_3 \mid a_1)$, $p(a_4 \mid a_2)$, $p(a_5 \mid a_3)$, and $p(a_6 \mid a_4, a_5)$. This DAG encodes, in particular, $I(a_3, a_1, a_2)$, $I(a_4, a_2, a_1 a_3)$, $I(a_5, a_3, a_1 a_2 a_4)$, and $I(a_6, a_4 a_5, a_1 a_2 a_3)$. By the chain rule, the joint probability distribution $p(U)$ can be expressed as:

$p(U) = p(a_1)\, p(a_2 \mid a_1)\, p(a_3 \mid a_1, a_2)\, p(a_4 \mid a_1, a_2, a_3)\, p(a_5 \mid a_1, a_2, a_3, a_4)\, p(a_6 \mid a_1, a_2, a_3, a_4, a_5).$

The above CIs can be used to rewrite $p(U)$ as:

$p(U) = p(a_1)\, p(a_2 \mid a_1)\, p(a_3 \mid a_1)\, p(a_4 \mid a_2)\, p(a_5 \mid a_3)\, p(a_6 \mid a_4, a_5).$

Fig. 49.1. Two causal networks, (i) and (ii), over the variables $a_1, \ldots, a_6$ (diagram omitted).
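To make the factorized form in Example 1 concrete, the following Python sketch evaluates $p(U)$ at one configuration by multiplying the CPDs along the DAG of Figure 49.1(i). The parent sets match the figure; the numeric CPD entries are invented solely for this illustration and do not come from the chapter.

```python
# Parent sets of the DAG in Figure 49.1(i).
parents = {
    "a1": (), "a2": ("a1",), "a3": ("a1",),
    "a4": ("a2",), "a5": ("a3",), "a6": ("a4", "a5"),
}

# CPD tables keyed by (own value, parent values...). Only the entries needed
# for the demonstration configuration are filled in, with made-up probabilities.
cpds = {
    "a1": {(0,): 0.6},
    "a2": {(1, 0): 0.3},
    "a3": {(0, 0): 0.8},
    "a4": {(1, 1): 0.5},
    "a5": {(0, 0): 0.9},
    "a6": {(1, 1, 0): 0.7},
}

def joint_probability(config):
    """p(U) as the product of p(a_i | parents(a_i)): the factorized form of Example 1."""
    prob = 1.0
    for var, pa in parents.items():
        key = (config[var],) + tuple(config[p] for p in pa)
        prob *= cpds[var][key]
    return prob

print(joint_probability({"a1": 0, "a2": 1, "a3": 0, "a4": 1, "a5": 0, "a6": 1}))
# 0.6 * 0.3 * 0.8 * 0.5 * 0.9 * 0.7 = 0.04536
```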
The term CN is somewhat misleading, because it may be possible to reverse a directed edge without disturbing the encoded CI information. For example, the two CNs in Figure 49.1 encode the same independency information. Thus, it is perhaps better to view a CN as encoding independency information rather than causal relationships. Using the CI information encoded in the DAG, the Markov boundary of a variable can be defined.

Definition 3. Let $U = \{a_1, \ldots, a_n\}$, and let $O$ be an ordering $\langle a_1, \ldots, a_n \rangle$ of the variables of $U$. Let $U_i = \{a_1, \ldots, a_{i-1}\}$ be a subset of $U$ with respect to the ordering $O$. A Markov boundary of a variable $a_i$ over $U_i$, denoted $B_i$, is any subset $X$ of $U_i$ with $a_i \notin X$ such that $p(U_i)$ satisfies $I(a_i, X, U_i - X - a_i)$, but $p(U_i)$ does not satisfy $I(a_i, X', U_i - X' - a_i)$ for any $X' \subset X$.

Example 2. Recall the DAG in Figure 49.1(i), and let the ordering $O$ be $\langle a_1, a_2, a_3, a_4, a_5, a_6 \rangle$. Then $U_1 = \{\}$, $U_2 = \{a_1\}$, $U_3 = \{a_1, a_2\}$, $U_4 = \{a_1, a_2, a_3\}$, $U_5 = \{a_1, a_2, a_3, a_4\}$, and $U_6 = \{a_1, a_2, a_3, a_4, a_5\}$. The Markov boundary of variable $a_3$ over $U_3$ is $B_3 = \{a_1\}$, since $p(U_3)$ satisfies $I(a_3, a_1, a_2)$ but does not satisfy $I(a_3, \emptyset, a_1 a_2)$.

The Markov boundary $B_i$ of a variable $a_i$ over $U_i$ encodes the CI $I(a_i, B_i, U_i - B_i - a_i)$ over $p(U_i)$. Using the Markov boundary of each variable, a boundary DAG is defined as follows.

Definition 4. Let $p(U)$ be a joint probability distribution over $U$, let $O$ be an ordering $\langle a_1, \ldots, a_n \rangle$ of the variables of $U$, and let $U_i = \{a_1, \ldots, a_{i-1}\}$ be a subset of $U$ with respect to the ordering $O$. Let $\{B_1, \ldots, B_n\}$ be an ordered set of subsets of $U$ such that each $B_i$ is a Markov boundary of $a_i$ over $U_i$. The DAG created by designating each $B_i$ as the parent set of variable $a_i$ is called a boundary DAG of $p(U)$ relative to $O$.

The next theorem (Pearl, 1988) indicates that a boundary DAG of $p(U)$ relative to an ordering $O$ is a DAG of a CN of $p(U)$.

Theorem 1. Let $p(U)$ be a joint probability distribution and $O$ be an ordering of the variables of $U$. If $D$ is a boundary DAG of $p(U)$ relative to $O$, then $D$ is a DAG of a CN of $p(U)$.

It is important to realize that, by Definition 4 and Theorem 1, we can construct a DAG of a CN once the Markov boundary of each variable has been obtained.

Example 3. Let $U = \{a_1, \ldots, a_6\}$ and let $O = \langle a_2, a_1, a_3, a_4, a_5, a_6 \rangle$ be an ordering of the variables of $U$. With respect to the ordering $O$, $U_1 = \{a_2\}$, $U_2 = \{\}$, $U_3 = \{a_1, a_2\}$, $U_4 = \{a_1, a_2, a_3\}$, $U_5 = \{a_1, a_2, a_3, a_4\}$, and $U_6 = \{a_1, a_2, a_3, a_4, a_5\}$. Supposing we assign $B_1 = \{a_2\}$, $B_2 = \{\}$, $B_3 = \{a_1\}$, $B_4 = \{a_2\}$, $B_5 = \{a_3\}$, and $B_6 = \{a_4, a_5\}$, then the DAG shown in Figure 49.1(ii) is the learned DAG of a CN of $p(U)$.

49.3 Theoretical Foundation

In this section, several theorems relevant to our approach are provided. We define a relation $r(U)$ as a finite set of tuples over $U$. We begin with functional dependency (Maier, 1983).

Definition 5. Let $r(U)$ be a relation over $U$ and $X, Y \subseteq U$. The functional dependency (FD) $X \rightarrow Y$ is satisfied by $r(U)$ if every two tuples $t_1$ and $t_2$ of $r(U)$ that agree on $X$ also agree on $Y$.

If a relation $r(U)$ satisfies the FD $X \rightarrow Y$ but not $X' \rightarrow Y$ for every $X' \subset X$, then $X \rightarrow Y$ is called left-reduced (Maier, 1983).
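Definition 5 suggests a direct mechanical test: group the tuples by their projections onto $X$ and check that no group contains two different projections onto $Y$. Here is a minimal Python sketch, assuming a relation stored as a list of attribute-to-value dictionaries; this representation is an illustrative choice, not the chapter's.

```python
def satisfies_fd(relation, X, Y):
    """Definition 5: the FD X -> Y holds in `relation` if every two tuples
    that agree on the attributes in X also agree on the attributes in Y."""
    seen = {}  # maps each X-projection to the first Y-projection observed
    for t in relation:
        x_val = tuple(t[a] for a in X)
        y_val = tuple(t[a] for a in Y)
        if seen.setdefault(x_val, y_val) != y_val:
            return False  # two tuples agree on X but differ on Y
    return True

r = [
    {"a1": 0, "a2": 0, "a3": 1},
    {"a1": 0, "a2": 1, "a3": 1},
    {"a1": 1, "a2": 0, "a3": 0},
]
print(satisfies_fd(r, ["a1"], ["a3"]))  # True:  a1 -> a3 is satisfied by r
print(satisfies_fd(r, ["a2"], ["a3"]))  # False: tuples 1 and 3 agree on a2 but not on a3
```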
The next theorem shows that a FD logically implies a CI.

Theorem 2. (Butz et al., 1999). Let $r(U)$ be a relation over $U$, let $p(U)$ be a joint distribution over $r(U)$, and let $X, Y \subseteq U$ with $Z = U - XY$. Then the FD $X \rightarrow Y$ being satisfied by $r(U)$ is a sufficient condition for the CI $I(Y, X, Z)$ to be satisfied by $p(U)$.

By exploiting this implication relationship between functional dependency and conditional independency, we can relate a left-reduced FD $X \rightarrow a_i$ to the Markov boundary of the variable $a_i$.

Theorem 3. Let $U = \{a_1, \ldots, a_n\}$, let $U_i$ be a subset of $U$, and let $X \subseteq U_i$. If the FD $X \rightarrow a_i$ is left-reduced, then $X$ is the Markov boundary of variable $a_i$ over $U_i$.

Proof: Since $X \rightarrow a_i$ is a FD and $X \subseteq U_i$, according to Definition 5, $X \rightarrow a_i$ holds over $U_i \cup \{a_i\}$. Since the FD $X \rightarrow a_i$ is left-reduced, by Theorem 2, $p(U_i)$ satisfies $I(a_i, X, U_i - X - a_i)$ but not $I(a_i, X', U_i - X' - a_i)$ for any $X' \subset X$.

Theorem 3 indicates that the Markov boundary of variable $a_i$ over $U_i$ can be learned from a left-reduced FD $X \rightarrow a_i$ if $X$ is a subset of $U_i$. We define the variables whose Markov boundaries can be learned from a set of left-reduced FDs as follows.

Definition 6. Let $U = \{a_1, \ldots, a_n\}$, $X \subset U$, $a_i \notin X$, and let $F$ be a set of left-reduced FDs over $U$. If there exists $X \rightarrow a_i \in F$, then $a_i$ is a decided variable; otherwise, $a_i$ is an undecided variable.

Calling a variable decided indicates that its Markov boundary can be learned from the FDs implied in the data.

49.4 Learning a DAG of a CN by FDs

In this section, we use learned FDs to construct a CN. We illustrate our algorithm using the heart disease dataset from the UCI Machine Learning Repository (Blake and Merz, 1998), which contains 13 attributes and 230 rows.

Example 4. The heart disease dataset has $U = \{a_1, \ldots, a_{13}\}$. Using FD_Mine, the discovered set of left-reduced FDs is $F = \{a_1 a_5 \rightarrow a_3$, $a_1 a_5 \rightarrow a_6$, $a_1 a_5 \rightarrow a_{11}$, $a_1 a_5 \rightarrow a_{13}$, $a_1 a_8 \rightarrow a_7$, $a_4 a_5 a_9 \rightarrow a_2$, $a_1 a_5 a_{10} \rightarrow a_4$, $a_1 a_2 a_5 \rightarrow a_8$, $a_1 a_5 a_{10} \rightarrow a_9$, $a_1 a_5 a_{10} \rightarrow a_{12}\}$.

As indicated by Theorem 1, a DAG of a CN is constructed once the Markov boundary of each variable relative to an ordering $O$ has been obtained. We obtain the Markov boundaries in two steps. First, in Section 49.4.1, we show how to obtain an ordering $O$ such that the Markov boundary of each decided variable with respect to $O$ can be obtained from the given FDs. Second, in Section 49.4.2, we determine the Markov boundary of each undecided variable by the chain rule.

49.4.1 Learning an Ordering of Variables from FDs

Given a set $F$ of left-reduced FDs, the algorithm in Figure 49.2 determines an ordering $O$ of the variables of $U$ such that the Markov boundary of each decided variable with respect to $O$ can be obtained from $F$. We use the FDs in Example 4 to demonstrate how Algorithm 1 works; a Python rendering of the procedure follows Example 5.

Example 5. First, the FD $a_1 a_5 \rightarrow a_3$ is selected in line 2 of Algorithm 1. In line 3, since $a_3$ is not in $Y$ for any $Y \rightarrow a_i$ ($a_i \neq a_3$) in $F$, variable $a_3$ is removed from $U$ in line 4, giving $U = \{a_1, a_2, a_4, \ldots, a_{13}\}$. We obtain $O = \langle a_3 \rangle$ in line 5. In line 6, the FD $a_1 a_5 \rightarrow a_3$ is removed from $F$. Because $F$ is not empty, Algorithm 1 continues performing lines 2-8. This time, the FD $a_1 a_5 \rightarrow a_6$ is selected in line 2. Variable $a_6$ is removed from $U$ in line 4, giving $U = \{a_1, a_2, a_4, a_5, a_7, \ldots, a_{13}\}$, and we obtain $O = \langle a_6, a_3 \rangle$ in line 5. By repeatedly performing lines 2 to 8 until $F$ is empty, we obtain the ordering $O = \langle a_{12}, a_9, a_4, a_2, a_8, a_7, a_{13}, a_{11}, a_6, a_3 \rangle$ and $U = \{a_1, a_5, a_{10}\}$. In line 9, we prepend $U$ to the head of $O$. Thus, for the variables in $U$, $O = \langle a_1, a_5, a_{10}, a_{12}, a_9, a_4, a_2, a_8, a_7, a_{13}, a_{11}, a_6, a_3 \rangle$.
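Here is that Python rendering: a minimal sketch of Algorithm 1 (its pseudocode appears in Figure 49.2 below), assuming FDs are represented as (frozenset of left-hand-side attributes, right-hand-side attribute) pairs. The function name `order_variables`, this representation, and the cycle guard are illustrative choices, not the chapter's.

```python
def order_variables(U, fds):
    """Algorithm 1: build an ordering O of U from a set of left-reduced FDs."""
    U = set(U)
    fds = set(fds)
    O = []
    while fds:  # Example 5 applies lines 2-8 repeatedly until F is empty
        progressed = False
        for X, a in sorted(fds, key=lambda fd: fd[1]):
            # line 3: a_i may be ordered only if it appears in no remaining LHS
            if all(a not in Y for (Y, b) in fds if b != a):
                U.discard(a)                                # line 4
                O.insert(0, a)                              # line 5: prepend a_i to O
                fds = {(Y, b) for (Y, b) in fds if b != a}  # line 6
                progressed = True
                break
        if not progressed:
            break  # guard against cyclic FDs; not needed for Example 4's F
    # line 9: the undecided variables are prepended to the head of O
    return sorted(U, key=lambda v: int(v[1:])) + O

# The left-reduced FDs of Example 4 (heart disease dataset):
F = {
    (frozenset({"a1", "a5"}), "a3"), (frozenset({"a1", "a5"}), "a6"),
    (frozenset({"a1", "a5"}), "a11"), (frozenset({"a1", "a5"}), "a13"),
    (frozenset({"a1", "a8"}), "a7"), (frozenset({"a4", "a5", "a9"}), "a2"),
    (frozenset({"a1", "a5", "a10"}), "a4"), (frozenset({"a1", "a2", "a5"}), "a8"),
    (frozenset({"a1", "a5", "a10"}), "a9"), (frozenset({"a1", "a5", "a10"}), "a12"),
}
print(order_variables([f"a{i}" for i in range(1, 14)], F))
# The exact order among the decided variables depends on which qualifying FD
# is picked first in each pass; any such order is valid, and one of them is
# the ordering traced in Example 5.
```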
The next theorem guarantees that the Markov boundary of each decided variable is determined by the ordering $O$ obtained by Algorithm 1.

Theorem 4. Let $U = \{a_1, \ldots, a_n\}$, let $O = \langle a_1, \ldots, a_n \rangle$ be an ordering obtained by Algorithm 1, and let $U_i = \{a_1, \ldots, a_{i-1}\}$ be a subset of $U$ with respect to $O$. Let $B_i$ be the Markov boundary of $a_i$ over $U_i$. If $a_i$ is a decided variable, then $B_i \subseteq U_i$.

Proof: Since $a_i$ is a decided variable, there exists a FD $X \rightarrow a_i$. When Algorithm 1 performs lines 2-8, $a_i$ is always deleted from $U$ before any variable of $X$, according to lines 3-5 of Algorithm 1. Thus, for any $a_j \in X$, $a_j$ always precedes $a_i$ in $O$, so we have $X \subseteq U_i$. Since $B_i = X$, according to Theorem 3, $B_i \subseteq U_i$.

Algorithm 1.
Input: $U = \{a_1, \ldots, a_n\}$, and a set $F$ of left-reduced FDs.
Output: an ordered list $O$ of the variables of $U$.
Begin
1.  $O = \langle \rangle$.
2.  for each $X \rightarrow a_i \in F$
3.    if $a_i \notin Y$ for all $Y \rightarrow y \in F$ ($y \neq a_i$)
4.      $U = U - \{a_i\}$.
5.      prepend $a_i$ to the head of $O$.
6.      $F = F - \{Y \rightarrow a_i \mid Y \rightarrow a_i \in F\}$.
7.    end if
8.  end for
9.  prepend $U$ to the head of $O$.
10. return($O$)
End
Fig. 49.2. An algorithm to obtain an ordering $O$ of the variables of $U$.

49.4.2 Learning the Markov Boundaries of Undecided Variables

Once an ordering $O = \langle a_1, \ldots, a_n \rangle$ of the variables of $U$ is obtained by Algorithm 1, the joint probability distribution $p(U)$ can be expressed using the chain rule as follows:

$p(U) = p(a_1) \cdots p(a_j \mid a_1, \ldots, a_{j-1}) \cdots p(a_n \mid a_1, \ldots, a_{n-1}).$   (49.1)

If $a_i$ is an undecided variable, then there is no FD $X \rightarrow a_i \in F$. According to Algorithm 1, $a_i$ is not deleted from $U$ and is prepended to the head of $O$ in line 9. This means that all undecided variables appear before all decided variables in $O$. Suppose $a_1, \ldots, a_j$ are all the undecided variables, and $B_{j+1}, \ldots, B_n$ are the Markov boundaries of all the decided variables. By Definition 3, the CI $I(a_i, B_i, U_i - B_i - a_i)$ holds for each variable $a_i$, $j+1 \le i \le n$. Thus, each $p(a_i \mid a_1, \ldots, a_{i-1}) = p(a_i \mid B_i)$, and Equation 49.1 can be rewritten as:

$p(U) = p(a_1) \cdots p(a_j \mid a_1, \ldots, a_{j-1})\, p(a_{j+1} \mid B_{j+1}) \cdots p(a_n \mid B_n).$   (49.2)

By assigning $B_k = \{a_1, \ldots, a_{k-1}\}$ as the Markov boundary of each undecided variable $a_k$, $1 \le k \le j$, Equation 49.2 can be expressed as:

$p(U) = p(a_1 \mid B_1) \cdots p(a_j \mid B_j) \cdots p(a_n \mid B_n) = \prod_{a_i \in U} p(a_i \mid B_i).$   (49.3)

Equation 49.3 indicates that a joint probability distribution $p(U)$ can be represented by the Markov boundaries of all the variables of $U$; thus, a boundary DAG relative to $O$ can be constructed. Based on the above analysis, we developed the algorithm shown in Figure 49.3, called FD2CN, which learns a DAG $D$ of a CN from a dataset $r(U)$.

Algorithm 2. FD2CN
Input: a dataset $r(U)$ over variable set $U$.
Output: a DAG $D\langle U, E \rangle$ of a CN learned from $r(U)$.
Begin
1.  $F = $ FD_Mine($r(U)$).  // returns a set of left-reduced FDs
2.  obtain an ordering $O$ using Algorithm 1.
3.  $U_i = \{\}$.
4.  while $O$ is not empty
5.    $a_i = $ pophead($O$).
6.    if there exists a FD $X \rightarrow a_i \in F$ then
7.      $B_i = X$
8.    else
9.      $B_i = U_i$.
10.   $U_i = U_i \cup \{a_i\}$.
11. end while
12. construct a DAG $D\langle U, E \rangle$ with $E = \{(b, a_i) \mid b \in B_i;\ a_i, b \in U\}$.
End
Fig. 49.3. The algorithm, FD2CN, to learn a DAG of a CN from data.

In line 6 of Algorithm FD2CN, we determine whether or not a variable is decided. If so, its Markov boundary is obtained from the corresponding FD in line 7; if not, it is set to $U_i$ in line 9. In line 12, a DAG of a CN is constructed by designating each $B_i$ as the parent set of $a_i$ in $U$; in other words, if $b \in B_i$, then we add an edge $b \rightarrow a_i$ to the DAG $D$. According to Definition 4 and Theorem 1, the constructed DAG $D$ is a DAG of a CN.
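Lines 3-12 of Figure 49.3 translate directly to Python. The sketch below reuses the `order_variables` function and the FD set `F` from the earlier sketch, and omits the FD discovery step (FD_Mine, line 1); these scaffolding choices are assumptions for illustration.

```python
def fd2cn_edges(fds, ordering):
    """Lines 3-12 of FD2CN: derive the DAG edge set from the ordering O
    and the Markov boundary of each variable."""
    boundary = {a: set(X) for (X, a) in fds}   # decided variables: B_i = X (line 7)
    U_i, edges = set(), set()
    for a in ordering:                         # lines 4-5: pop variables off O in order
        B = boundary.get(a, set(U_i))          # line 9: undecided variables get B_i = U_i
        edges |= {(b, a) for b in B}           # line 12: each b in B_i becomes a parent of a
        U_i.add(a)                             # line 10
    return edges

O = order_variables([f"a{i}" for i in range(1, 14)], F)
for b, a in sorted(fd2cn_edges(F, O)):
    print(f"{b} -> {a}")   # e.g. a4 -> a2, a5 -> a2, a9 -> a2 from B_2 = {a4, a5, a9}
```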
Example 6. Applying Algorithm FD2CN to the heart disease dataset, $F$ in line 1 is obtained as in Example 4, and $O$ in line 2 is obtained as in Example 5. From the obtained $O$, we have $B_1 = \{\}$, $B_5 = \{a_1\}$, $B_{10} = \{a_1, a_5\}$, $B_{12} = \{a_1, a_5, a_{10}\}$, $B_9 = \{a_1, a_5, a_{10}\}$, $B_4 = \{a_1, a_5, a_{10}\}$, $B_2 = \{a_4, a_5, a_9\}$, $B_8 = \{a_1, a_5, a_2\}$, $B_7 = \{a_1, a_8\}$, $B_{13} = \{a_1, a_5\}$, $B_{11} = \{a_1, a_5\}$, $B_6 = \{a_1, a_5\}$, and $B_3 = \{a_1, a_5\}$. By designating each $B_i$ as the parent set of $a_i$, a DAG $D$ is constructed. For example, since $B_2 = \{a_4, a_5, a_9\}$, $D$ has the edges $a_4 \rightarrow a_2$, $a_5 \rightarrow a_2$, and $a_9 \rightarrow a_2$. The DAG of a CN learned from the heart disease dataset is depicted in Figure 49.4.

Fig. 49.4. The learned DAG of a CN from the heart disease dataset, over the variables $a_1, \ldots, a_{13}$ (diagram omitted).

49.5 Experimental Results

Experiments were carried out on fifteen real-world datasets obtained from the UCI Machine Learning Repository (Blake and Merz, 1998). The results are shown in Figure 49.5. The last column gives the elapsed time to construct a CN, measured on a 1 GHz Pentium III PC with 256 MB RAM. The results show that the processing time is mainly determined by the number of attributes. Since the results also indicate that many FDs hold in some datasets, our proposed approach is a feasible way to learn a CN.

Dataset Name        # of attributes  # of rows  # of FDs  Time (seconds)
Abalone                    8             4,177        60               1
Breast-cancer             10               191         3               0
Bridge                    13               108        62               2
Cancer-Wisconsin          10               699        19               1
Chess                      7            28,056         1               3
Crx                       16               690     1,099              10
Echocardiogram            13               132       583               0
Glass                     10               142       119               0
Heart disease             13               270        10               2
Hepatitis                 20               155     8,250           1,327
Imports-85                26               205     4,176           8,322
Iris                       5               150         4               0
Led                        8                50        11               0
Nursery                    9            12,960         1              16
Pendigits                 17             7,494    29,934             920

Fig. 49.5. Experimental results using fifteen real-world datasets.

49.6 Conclusion

In this chapter, we presented a novel method for learning a CN. Although a CN encodes probabilistic conditional independencies, our method is based on learning FDs (Yao et al., 2002). Since functional dependency logically implies conditional independency (Butz et al., 1999), we described how to construct a CN from data dependencies. We implemented our approach, and encouraging experimental results have been obtained.

Since functional dependency is a special case of conditional independency, we acknowledge that our approach may not utilize all the independency information encoded in the sample data. However, previous methods also suffer from this disadvantage, as learning all CIs from sample data is an NP-hard problem (Bouckaert, 1994).

References

Bouckaert, R. (1994). Properties of learning algorithms for Bayesian belief networks. In Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence, 102-109.
Butz, C. J., Wong, S. K. M., and Yao, Y. Y. (1999). On data and probabilistic dependencies. In IEEE Canadian Conference on Electrical and Computer Engineering, 1692-1697.
Maier, D. (1983). The Theory of Relational Databases. Computer Science Press.
Neapolitan, R. E. (2003). Learning Bayesian Networks. Prentice Hall.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers.
Blake, C. L. and Merz, C. J. (1998). UCI Repository of Machine Learning Databases. Irvine, CA: University of California, Department of Information and Computer Science.
Wong, S. K. M., Butz, C. J., and Wu, D. (2000). On the implication problem for probabilistic conditional independency. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, 30(6), 785-805.
Wong, S. K. M. and Butz, C. J. (2001). Constructing the dependency structure of a multi-agent probabilistic network. IEEE Transactions on Knowledge and Data Engineering, 13(3), 395-415.
Yao, H., Hamilton, H. J., and Butz, C. J. (2002). FD_Mine: discovering functional dependencies in a database using equivalences. In Proceedings of the Second IEEE International Conference on Data Mining, 729-732.

50 Ensemble Methods in Supervised Learning

Lior Rokach
Department of Information Systems Engineering
Ben-Gurion University of the Negev
liorrk@bgu.ac.il

Summary. The idea of ensemble methodology is to build a predictive model by integrating multiple models. It is well known that ensemble methods can be used to improve prediction performance. In this chapter we provide an overview of ensemble methods in classification tasks. We present all important types of ensemble methods, including boosting and bagging. Combining methods and modeling issues such as ensemble diversity and ensemble size are discussed.

Key words: Ensemble, Boosting, AdaBoost, Windowing, Bagging, Grading, Arbiter Tree, Combiner Tree

50.1 Introduction

The main idea of ensemble methodology is to combine a set of models, each of which solves the same original task, in order to obtain a better composite global model, with more accurate and reliable estimates or decisions than can be obtained from using a single model. The idea of building a predictive model by integrating multiple models has been under investigation for a long time. Bühlmann and Yu (2003) pointed out that the history of ensemble methods starts as early as 1977 with Tukey's twicing, an ensemble of two linear regression models. Ensemble methods can also be used to improve the quality and robustness of clustering algorithms (Dimitriadou et al., 2003). Nevertheless, in this chapter we focus on classifier ensembles.

In the past few years, experimental studies conducted by the machine-learning community have shown that combining the outputs of multiple classifiers reduces the generalization error (Domingos, 1996, Quinlan, 1996, Bauer and Kohavi, 1999, Opitz and Maclin, 1999). Ensemble methods are very effective, mainly due to the phenomenon that various types of classifiers have different "inductive biases" (Geman et al., 1995, Mitchell, 1997). Indeed, ensemble methods can effectively make use of such diversity to reduce the variance-error (Tumer and Ghosh, 1999, Ali and Pazzani, 1996) without increasing the bias-error. In certain situations, an ensemble can also reduce bias-error, as shown by the theory of large margin classifiers (Bartlett and Shawe-Taylor, 1998).
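As a minimal illustration of this combining idea, the Python sketch below builds a majority-vote ensemble from arbitrary base classifiers. The class names and the toy decision stumps are invented for this illustration and are not from the chapter, which surveys far richer combining schemes.

```python
from collections import Counter

class MajorityVoteEnsemble:
    """Combine base classifiers and predict by majority vote. Any objects
    exposing a predict(x) method can serve as base models."""
    def __init__(self, models):
        self.models = models

    def predict(self, x):
        votes = Counter(m.predict(x) for m in self.models)
        return votes.most_common(1)[0][0]  # the most frequent class label wins

class Stump:
    """A one-feature threshold classifier used here as a toy base model."""
    def __init__(self, feature, threshold):
        self.feature, self.threshold = feature, threshold
    def predict(self, x):
        return int(x[self.feature] > self.threshold)

ensemble = MajorityVoteEnsemble([Stump(0, 0.5), Stump(1, 0.3), Stump(2, 0.7)])
print(ensemble.predict([0.9, 0.1, 0.8]))  # two of three stumps vote 1, so it predicts 1
```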
