6.864: Natural Language Processing (Regina Barzilay, MIT OpenCourseWare)
Lecture 6 (September 27th, 2005): The EM Algorithm, Part II
Hidden Markov Models

A hidden Markov model $(N, \Sigma, \Theta)$ consists of the following elements:

• $N$ is a positive integer specifying the number of states in the model. Without loss of generality, we will take the $N$'th state to be a special state, the final or stop state.

• $\Sigma$ is a set of output symbols, for example $\Sigma = \{a, b\}$.

• $\Theta$ is a vector of parameters. It contains three types of parameters:

– $\pi_j$ for $j = 1 \ldots N$ is the probability of choosing state $j$ as an initial state.

– $a_{j,k}$ for $j = 1 \ldots (N-1)$, $k = 1 \ldots N$, is the probability of transitioning from state $j$ to state $k$.

– $b_j(o)$ for $j = 1 \ldots (N-1)$ and $o \in \Sigma$, is the probability of emitting symbol $o$ from state $j$.

Thus it can be seen that $\Theta$ is a vector of $N + (N-1)N + (N-1)|\Sigma|$ parameters.

• Note that we have the following constraints:

– $\sum_{j=1}^{N} \pi_j = 1$

– for all $j$, $\sum_{k=1}^{N} a_{j,k} = 1$

– for all $j$, $\sum_{o \in \Sigma} b_j(o) = 1$

• An HMM specifies a probability for each possible $(x, y)$ pair, where $x$ is a sequence of symbols drawn from $\Sigma$, and $y$ is a sequence of states drawn from the integers $1 \ldots (N-1)$. The sequences $x$ and $y$ are restricted to have the same length.

• E.g., say we have an HMM with $N = 3$, $\Sigma = \{a, b\}$, and with some choice of the parameters $\Theta$. Take $x = \langle a, a, b, b \rangle$ and $y = \langle 1, 2, 2, 1 \rangle$. Then in this case,

$$P(x, y \mid \Theta) = \pi_1 \, a_{1,2} \, a_{2,2} \, a_{2,1} \, a_{1,3} \, b_1(a) \, b_2(a) \, b_2(b) \, b_1(b)$$

• In general, if we have the sequence $x = x_1, x_2, \ldots, x_n$ where each $x_j \in \Sigma$, and the sequence $y = y_1, y_2, \ldots, y_n$ where each $y_j \in \{1 \ldots (N-1)\}$, then

$$P(x, y \mid \Theta) = \pi_{y_1} \, a_{y_n, N} \prod_{j=2}^{n} a_{y_{j-1}, y_j} \prod_{j=1}^{n} b_{y_j}(x_j)$$
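To make the joint probability concrete, here is a minimal Python sketch that evaluates $P(x, y \mid \Theta)$ by direct multiplication for the $N = 3$, $\Sigma = \{a, b\}$ example above. The parameter values, and the names `joint_probability`, `pi`, `a`, and `b`, are illustrative assumptions, not from the lecture.

```python
# A minimal sketch (not from the lecture): all parameter values are made up.
N = 3
pi = [0.5, 0.5, 0.0]        # pi_j for j = 1..N; pi_3 = 0, so we never start in the stop state
a = [[0.2, 0.5, 0.3],       # a_{j,k} for j = 1..N-1, k = 1..N; each row sums to 1
     [0.4, 0.4, 0.2]]
b = [{'a': 0.7, 'b': 0.3},  # b_j(o) for j = 1..N-1, o in Sigma; each dict sums to 1
     {'a': 0.1, 'b': 0.9}]

def joint_probability(x, y):
    """P(x, y | Theta) = pi_{y_1} a_{y_n,N} prod_{j=2..n} a_{y_{j-1},y_j} prod_{j=1..n} b_{y_j}(x_j)."""
    p = pi[y[0] - 1] * a[y[-1] - 1][N - 1]   # initial state, and final transition to the stop state
    for j in range(1, len(y)):
        p *= a[y[j - 1] - 1][y[j] - 1]       # transition y_{j-1} -> y_j
    for o, s in zip(x, y):
        p *= b[s - 1][o]                     # emission b_{y_j}(x_j)
    return p

# The worked example from above: x = <a, a, b, b>, y = <1, 2, 2, 1>
print(joint_probability(['a', 'a', 'b', 'b'], [1, 2, 2, 1]))
```

Running the last line multiplies out exactly the nine factors $\pi_1 \, a_{1,2} \, a_{2,2} \, a_{2,1} \, a_{1,3} \, b_1(a) \, b_2(a) \, b_2(b) \, b_1(b)$ from the worked example.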
EM: the Basic Set-up

• We have some data points (a "sample") $x^1, x^2, \ldots, x^m$.

• For example, each $x^i$ might be a sentence such as "the dog slept": this will be the case in EM applied to hidden Markov models (HMMs) or probabilistic context-free grammars (PCFGs). (Note that in this case each $x^i$ is a sequence, which we will sometimes write $x^i_1, x^i_2, \ldots, x^i_{n_i}$ where $n_i$ is the length of the sequence.)

• Or, in the three coins example (see the lecture notes), each $x^i$ might be a sequence of three coin tosses, such as HHH, THT, or TTT.

• We have a parameter vector $\Theta$. For example, see the description of HMMs in the previous section. As another example, in a PCFG, $\Theta$ would contain the probability $P(\alpha \rightarrow \beta \mid \alpha)$ for every rule expansion $\alpha \rightarrow \beta$ in the context-free grammar within the PCFG.

• We have a model $P(x, y \mid \Theta)$: a function that for any $(x, y, \Theta)$ triple returns a probability, namely the probability of seeing $x$ and $y$ together given parameter settings $\Theta$.

• This model defines a joint distribution over $x$ and $y$, but note that we can also derive a marginal distribution over $x$ alone, defined as

$$P(x \mid \Theta) = \sum_y P(x, y \mid \Theta)$$

• Given the sample $x^1, x^2, \ldots, x^m$, we define the likelihood as

$$L'(\Theta) = \prod_{i=1}^{m} P(x^i \mid \Theta) = \prod_{i=1}^{m} \sum_y P(x^i, y \mid \Theta)$$

and we define the log-likelihood as

$$L(\Theta) = \log L'(\Theta) = \sum_{i=1}^{m} \log P(x^i \mid \Theta) = \sum_{i=1}^{m} \log \sum_y P(x^i, y \mid \Theta)$$

(Both the marginal and the log-likelihood are sketched in code at the end of this section.)

• The maximum-likelihood estimation problem is to find

$$\Theta_{ML} = \arg\max_{\Theta \in \Omega} L(\Theta)$$

where $\Omega$ is a parameter space specifying the set of allowable parameter settings. In the HMM example, $\Omega$ would enforce the restrictions $\sum_{j=1}^{N} \pi_j = 1$; for all $j = 1 \ldots (N-1)$, $\sum_{k=1}^{N} a_{j,k} = 1$; and for all $j = 1 \ldots (N-1)$, $\sum_{o \in \Sigma} b_j(o) = 1$.
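As a companion to the definitions above, here is a brute-force Python sketch of the marginal $P(x \mid \Theta)$ and the log-likelihood $L(\Theta)$; it reuses `joint_probability` and the toy parameters from the earlier sketch, and the function names are my own. Enumerating all $(N-1)^n$ state sequences is only feasible for tiny examples (in practice the forward algorithm computes the marginal in $O(nN^2)$ time), but the enumeration mirrors $P(x \mid \Theta) = \sum_y P(x, y \mid \Theta)$ exactly.

```python
import itertools
import math

def marginal_probability(x):
    """P(x | Theta) = sum over all state sequences y of P(x, y | Theta).

    Brute-force enumeration over the (N-1)^n sequences y in {1..N-1}^n;
    exponential in n, so only for illustration on tiny examples.
    """
    return sum(joint_probability(x, list(y))
               for y in itertools.product(range(1, N), repeat=len(x)))

def log_likelihood(samples):
    """L(Theta) = sum_{i=1..m} log P(x^i | Theta) for a sample x^1 .. x^m."""
    return sum(math.log(marginal_probability(x)) for x in samples)

# A toy two-element sample under the made-up parameters from the sketch above.
sample = [['a', 'a', 'b', 'b'], ['b', 'a']]
print(log_likelihood(sample))
```

Note that the summation over $y$ happens inside the logarithm, one term of $L(\Theta)$ per sample point; this coupling of a log with a sum over hidden variables is exactly what makes direct maximization of $L(\Theta)$ hard, and is the motivation for the EM algorithm.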