Learning II: Lecture 21 - Introduction to Artificial Intelligence (CS440/ECE448)
Topics: inductive learning methods; decision trees; learning decision trees; how can we do the classification?; an ID tree consistent with the data.
(Figure from the previous slide: an ID tree consistent with the sunburn data; the same tree is reconstructed at the end of this section.)

Decision Tree Learning Algorithm

• Problem:
  – For practical problems, it is unlikely that any single test will produce one completely homogeneous subset.
• Solution:
  – Minimize a measure of inhomogeneity or disorder.
  – Such a measure is available from information theory.

Information

• Suppose a question has n possible answers v_1, ..., v_n, and answer v_i occurs with probability P(v_i). The information content (entropy) of knowing the answer, measured in bits, is:

  I(P(v_1), \ldots, P(v_n)) = -\sum_{i=1}^{n} P(v_i) \log_2 P(v_i)

• One bit of information is enough to answer a yes/no question.
• E.g., when flipping a fair coin, how much information do you gain by learning which side came up?

  I(1/2, 1/2) = -\left(\tfrac{1}{2} \log_2 \tfrac{1}{2} + \tfrac{1}{2} \log_2 \tfrac{1}{2}\right) = 1 bit

Information at a node

• In our decision tree, for a given feature (e.g., hair color), let b index the branches (one branch per possible value of the feature), and let
  – N_b: number of samples in branch b
  – N_p: total number of samples across all branches
  – N_{bc}: number of samples of class c in branch b
• Using frequencies as estimates of the probabilities, the information still required after the test is the branch-weighted average entropy (see the first sketch after this section):

  Information = \sum_b \frac{N_b}{N_p} \sum_c -\frac{N_{bc}}{N_b} \log_2 \frac{N_{bc}}{N_b}

• For a single branch, the information is simply:

  Information = \sum_c -\frac{N_{bc}}{N_b} \log_2 \frac{N_{bc}}{N_b}

Example

• Consider a single branch (b = 1) that contains only members of two classes, A and B.
  – If half of the points belong to A and half to B:

    Information = -\tfrac{1}{2} \log_2 \tfrac{1}{2} - \tfrac{1}{2} \log_2 \tfrac{1}{2} = 1

  – If all of the points belong to A (or all to B):

    Information = -1 \log_2 1 - 0 \log_2 0 = 0

  – Note: \lim_{x \to 0} x \log x = 0.
• We prefer the latter situation: the branch is homogeneous, so less additional information is needed to make a decision. Minimizing the remaining disorder is equivalent to maximizing the information gain.

What is the amount of information required for classification after we have used the hair test?

  Information = \sum_b \frac{N_b}{N_p} \sum_c -\frac{N_{bc}}{N_b} \log_2 \frac{N_{bc}}{N_b}

Hair Color:
• Blond (Sarah, Annie, Dana, Katie): -\tfrac{2}{4} \log_2 \tfrac{2}{4} - \tfrac{2}{4} \log_2 \tfrac{2}{4} = 1
• Red (Emily): -1 \log_2 1 - 0 \log_2 0 = 0
• Brown (Alex, Pete, John): -\tfrac{0}{3} \log_2 \tfrac{0}{3} - \tfrac{3}{3} \log_2 \tfrac{3}{3} = 0

  Information = 4/8 * 1 + 1/8 * 0 + 3/8 * 0 = 0.5

Selecting the top-level feature

• Using the samples we have so far, we get:

  Test          Information (bits)
  Hair          0.5
  Height        0.69
  Suit Color    0.94
  Lotion        0.61

• Hair wins: it needs the least additional information for the rest of the classification. (A sketch of this selection procedure appears after this section.)
• This is used to build the first level of the identification tree:

  Hair Color
    Blond → Sarah, Annie, Dana, Katie
    Red   → Emily
    Brown → Alex, Pete, John

Selecting the second-level feature

• Let's consider the remaining features for the blond branch (4 samples):

  Test          Information (bits)
  Height        0.5
  Suit Color
  Lotion        0

  (Lotion splits the blond samples into two pure subsets: Sarah and Annie are sunburned, Dana and Katie are not, so no further information is needed.)
• Lotion wins, needing the least additional information.
• Thus we arrive at the tree we had built earlier:

  Hair Color
    Blond → Lotion Used
      No  → Sarah, Annie (Sunburned)
      Yes → Dana, Katie (Not Sunburned)
    Red   → Emily (Sunburned)
    Brown → Alex, Pete, John (Not Sunburned)
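The entropy and branch-weighted information computations above are easy to check numerically. Below is a minimal Python sketch of them (the function names entropy and split_information are my own, not from the lecture); it reproduces two numbers from the slides: the 1 bit of a fair coin flip and the 0.5 bits remaining after the hair-color test.

```python
import math

def entropy(counts):
    """Entropy in bits of a class-count distribution, e.g. [2, 2] -> 1.0.
    Zero counts contribute nothing, matching lim_{x -> 0} x log x = 0."""
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)

def split_information(branches):
    """Branch-weighted average entropy: sum_b (N_b / N_p) * entropy(branch b).
    `branches` maps each feature value to its per-class sample counts."""
    n_p = sum(sum(counts) for counts in branches.values())
    return sum(sum(counts) / n_p * entropy(counts)
               for counts in branches.values())

# Fair coin: I(1/2, 1/2) = 1 bit.
print(entropy([1, 1]))  # 1.0

# Hair-color test; counts are (sunburned, not sunburned) from the slides.
hair = {
    "blond": [2, 2],  # Sarah, Annie / Dana, Katie
    "red":   [1, 0],  # Emily
    "brown": [0, 3],  # Alex, Pete, John
}
print(split_information(hair))  # 4/8*1 + 1/8*0 + 3/8*0 = 0.5
```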
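The full procedure (pick the test whose split needs the least additional information, split the samples, and recurse until every subset is homogeneous) can be sketched the same way. A caveat on the data: this excerpt gives each person's hair color and class label, and lotion use only within the blond branch; the height, suit-color, and lotion values below are assumptions filled in from Winston's classic version of this sunburn example, with the suit colors in particular being hypothetical relabelings, chosen because they reproduce the slide's numbers 0.69, 0.94, and 0.61.

```python
import math
from collections import Counter

# Hair colors and class labels are from the slides; height, suit-color, and
# most lotion values are ASSUMED (Winston's classic sunburn data, suit colors
# as hypothetical stand-ins), picked to match the 0.69 / 0.94 / 0.61 figures.
DATA = [
    # (hair,    height,    suit,     lotion, label)
    ("blond", "average", "yellow", "no",  "sunburned"),      # Sarah
    ("blond", "tall",    "red",    "yes", "not sunburned"),  # Dana
    ("brown", "short",   "red",    "yes", "not sunburned"),  # Alex
    ("blond", "short",   "red",    "no",  "sunburned"),      # Annie
    ("red",   "average", "blue",   "no",  "sunburned"),      # Emily
    ("brown", "tall",    "blue",   "no",  "not sunburned"),  # Pete
    ("brown", "average", "blue",   "no",  "not sunburned"),  # John
    ("blond", "short",   "yellow", "yes", "not sunburned"),  # Katie
]
FEATURES = {"hair": 0, "height": 1, "suit color": 2, "lotion": 3}

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def disorder(samples, idx):
    """Weighted average entropy remaining after splitting on feature idx."""
    branches = {}
    for row in samples:
        branches.setdefault(row[idx], []).append(row[-1])
    n = len(samples)
    return sum(len(lbls) / n * entropy(lbls) for lbls in branches.values())

def build(samples, features):
    labels = [row[-1] for row in samples]
    if len(set(labels)) == 1:   # homogeneous subset: stop with a leaf
        return labels[0]
    if not features:            # no tests left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    # Choose the test that needs the least additional information.
    name = min(features, key=lambda f: disorder(samples, features[f]))
    idx = features[name]
    rest = {f: i for f, i in features.items() if f != name}
    branches = {}
    for row in samples:
        branches.setdefault(row[idx], []).append(row)
    return (name, {v: build(rows, rest) for v, rows in branches.items()})

print(build(DATA, FEATURES))
# ('hair', {'blond': ('lotion', {'no': 'sunburned', 'yes': 'not sunburned'}),
#           'brown': 'not sunburned', 'red': 'sunburned'})
```

Under these assumptions the printed tree matches the lecture's: hair color at the root, lotion use under the blond branch, and pure leaves everywhere else.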