manning schuetze statisticalnlp part 6 ppsx

9 Markov Models

9.3 The Three Fundamental Questions for HMMs

[Figure 9.5 Trellis algorithms. Axes: State versus Time, t, running from 1 to T+1. The trellis is a square array of states versus times. A node at $(s_i, t)$ can store information about state sequences which include $X_t = i$. The lines show the connections between nodes. Here we have a fully interconnected HMM where one can move from any state to any other at each step.]

The backward procedure

It should be clear that we do not need to cache results working forward through time like this, but rather that we could also work backward. The backward procedure computes backward variables, which are the total probability of seeing the rest of the observation sequence given that we were in state $s_i$ at time $t$. The real reason for introducing this less intuitive calculation, though, is that a combination of forward and backward probabilities is vital for solving the third problem, parameter reestimation.

[Figure 9.6 Trellis algorithms: Closeup of the computation of forward probabilities at one node. The forward probability $\alpha_j(t+1)$ is calculated by summing the product of the probabilities on each incoming arc with the forward probability of the originating node.]

Define backward variables

(9.11) $$\beta_i(t) = P(o_t \cdots o_T \mid X_t = i, \mu)$$

Then we can calculate backward variables working from right to left through the trellis as follows:

1. Initialization
   $$\beta_i(T+1) = 1, \quad 1 \le i \le N$$
2. Induction
   $$\beta_i(t) = \sum_{j=1}^{N} a_{ij}\, b_{ijo_t}\, \beta_j(t+1), \quad 1 \le t \le T,\ 1 \le i \le N$$
3. Total
   $$P(O \mid \mu) = \sum_{i=1}^{N} \pi_i\, \beta_i(1)$$

    Output                       lem      ice-t    cola
    Time (t):                    1        2        3        4
    $\alpha_{CP}(t)$             1.0      0.21     0.0462   0.021294
    $\alpha_{IP}(t)$             0.0      0.09     0.0378   0.010206
    $P(o_1 \cdots o_{t-1})$      1.0      0.3      0.084    0.0315
    $\beta_{CP}(t)$              0.0315   0.045    0.6      1.0
    $\beta_{IP}(t)$              0.029    0.245    0.1      1.0
    $P(o_1 \cdots o_T)$          0.0315
    $\gamma_{CP}(t)$             1.0      0.3      0.88     0.676
    $\gamma_{IP}(t)$             0.0      0.7      0.12     0.324
    $\hat{X}_t$                  CP       IP       CP       CP
    $\delta_{CP}(t)$             1.0      0.21     0.0315   0.01323
    $\delta_{IP}(t)$             0.0      0.09     0.0315   0.00567
    $\psi_{CP}(t)$                        CP       IP       CP
    $\psi_{IP}(t)$                        CP       IP       CP
    $\hat{X}_t$                  CP       IP       CP       CP
    $P(\hat{X})$                 0.01323

Table 9.2 Variable calculations for $O$ = (lem, ice-t, cola).

Table 9.2 shows the calculation of forward and backward variables, and other variables that we will come to later, for the soft drink machine from example 1, given the observation sequence $O$ = (lem, ice-t, cola).

Combining them

More generally, in fact, we can use any combination of forward and backward caching to work out the probability of an observation sequence. Observe that:

$$\begin{aligned}
P(O, X_t = i \mid \mu) &= P(o_1 \cdots o_T, X_t = i \mid \mu) \\
&= P(o_1 \cdots o_{t-1}, X_t = i, o_t \cdots o_T \mid \mu) \\
&= P(o_1 \cdots o_{t-1}, X_t = i \mid \mu) \times P(o_t \cdots o_T \mid o_1 \cdots o_{t-1}, X_t = i, \mu) \\
&= P(o_1 \cdots o_{t-1}, X_t = i \mid \mu)\, P(o_t \cdots o_T \mid X_t = i, \mu) \\
&= \alpha_i(t)\, \beta_i(t)
\end{aligned}$$

Therefore:

(9.12) $$P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(t)\, \beta_i(t), \quad 1 \le t \le T+1$$

The previous equations were special cases of this one.

9.3.2 Finding the best state sequence

The second problem was worded somewhat vaguely as "finding the state sequence that best explains the observations." That is because there is more than one way to think about doing this. One way to proceed would be to choose the states individually. That is, for each $t$, $1 \le t \le T+1$, we would find the $X_t$ that maximizes $P(X_t \mid O, \mu)$. Let

(9.13) $$\gamma_i(t) = P(X_t = i \mid O, \mu) = \frac{P(X_t = i, O \mid \mu)}{P(O \mid \mu)} = \frac{\alpha_i(t)\, \beta_i(t)}{\sum_{j=1}^{N} \alpha_j(t)\, \beta_j(t)}$$

The individually most likely state $\hat{X}_t$ is:

(9.14) $$\hat{X}_t = \arg\max_{1 \le i \le N} \gamma_i(t), \quad 1 \le t \le T+1$$

This quantity maximizes the expected number of states that will be guessed correctly. However, it may yield a quite unlikely state sequence. Therefore, this is not the method that is normally used, but rather the Viterbi algorithm, which efficiently computes the most likely state sequence.

Viterbi algorithm

Commonly we want to find the most likely complete path, that is:

$$\arg\max_{X} P(X \mid O, \mu)$$

To do this, it is sufficient to maximize for a fixed $O$:

$$\arg\max_{X} P(X, O \mid \mu)$$

An efficient trellis algorithm for computing this path is the Viterbi algorithm. Define:

$$\delta_j(t) = \max_{X_1 \cdots X_{t-1}} P(X_1 \cdots X_{t-1}, o_1 \cdots o_{t-1}, X_t = j \mid \mu)$$

This variable stores for each point in the trellis the probability of the most probable path that leads to that node. The corresponding variable $\psi_j(t)$ then records the node of the incoming arc that led to this most probable path. Using dynamic programming, we calculate the most probable path through the whole trellis as follows:

1. Initialization
   $$\delta_j(1) = \pi_j, \quad 1 \le j \le N$$
2. Induction
   $$\delta_j(t+1) = \max_{1 \le i \le N} \delta_i(t)\, a_{ij}\, b_{ijo_t}, \quad 1 \le j \le N$$
   Store backtrace
   $$\psi_j(t+1) = \arg\max_{1 \le i \le N} \delta_i(t)\, a_{ij}\, b_{ijo_t}, \quad 1 \le j \le N$$
3. Termination and path readout (by backtracking). The most likely state sequence is worked out from the right backwards:
   $$\hat{X}_{T+1} = \arg\max_{1 \le i \le N} \delta_i(T+1)$$
   $$\hat{X}_t = \psi_{\hat{X}_{t+1}}(t+1)$$
   $$P(\hat{X}) = \max_{1 \le i \le N} \delta_i(T+1)$$

In these calculations, one may get ties. We assume that in that case one path is chosen randomly. In practical applications, people commonly want to work out not only the best state sequence but the n-best sequences or a graph of likely paths. In order to do this people often store the $m < n$ best previous states at a node.

Table 9.2 above shows the computation of the most likely states and state sequence under both these interpretations; for this example, they prove to be identical.

9.3.3 The third problem: Parameter estimation

Given a certain observation sequence, we want to find the values of the model parameters $\mu = (A, B, \Pi)$ which best explain what we observed. Using Maximum Likelihood Estimation, that means we want to find the values that maximize $P(O \mid \mu)$:

(9.15) $$\arg\max_{\mu} P(O_{\text{training}} \mid \mu)$$

There is no known analytic method to choose $\mu$ to maximize $P(O \mid \mu)$. But we can locally maximize it by an iterative hill-climbing algorithm. This algorithm is the Baum-Welch or Forward-Backward algorithm, which is a special case of the Expectation Maximization (EM) method, which we will cover in greater generality in section 14.2.2. It works like this.
We don't know what the model is, but we can work out the probability of the observation sequence using some (perhaps randomly chosen) model. Looking at that calculation, we can see which state transitions and symbol emissions were probably used the most. By increasing the probability of those, we can choose a revised model which gives a higher probability to the observation sequence. This maximization process is often referred to as training the model and is performed on training data.

Define $p_t(i, j)$, $1 \le t \le T$, $1 \le i, j \le N$ as shown below. This is the probability of traversing a certain arc at time $t$ given observation sequence $O$; see figure 9.7.

(9.16) $$\begin{aligned}
p_t(i, j) &= P(X_t = i, X_{t+1} = j \mid O, \mu) \\
&= \frac{P(X_t = i, X_{t+1} = j, O \mid \mu)}{P(O \mid \mu)} \\
&= \frac{\alpha_i(t)\, a_{ij}\, b_{ijo_t}\, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t)\, \beta_m(t)} \\
&= \frac{\alpha_i(t)\, a_{ij}\, b_{ijo_t}\, \beta_j(t+1)}{\sum_{m=1}^{N} \sum_{n=1}^{N} \alpha_m(t)\, a_{mn}\, b_{mno_t}\, \beta_n(t+1)}
\end{aligned}$$

[Figure 9.7 The probability of traversing an arc. Given an observation sequence and a model, we can work out the probability that the Markov process went from state $s_i$ to $s_j$ at time $t$. Labels in the figure: $\alpha_i(t)$ and $\beta_j(t+1)$; time axis from $t-1$ to $t+2$.]

Note that $\gamma_i(t) = \sum_{j=1}^{N} p_t(i, j)$. Now, if we sum over the time index, this gives us expectations (counts):

$$\sum_{t=1}^{T} \gamma_i(t) = \text{expected number of transitions from state } i \text{ in } O$$

$$\sum_{t=1}^{T} p_t(i, j) = \text{expected number of transitions from state } i \text{ to } j \text{ in } O$$

So we begin with some model $\mu$ (perhaps preselected, perhaps just chosen randomly). We then run $O$ through the current model to estimate the expectations of each model parameter. We then change the model to maximize the values of the paths that are used a lot (while still respecting the stochastic constraints). We then repeat this process, hoping to converge on optimal values for the model parameters $\mu$.
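The forward, backward, and arc probabilities described so far can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (`forward`, `backward`, `posteriors`) are hypothetical, list index 0 stands for the text's time 1, and the emission table is indexed by the state being left, which is an assumption that fits the soft drink machine rather than the general arc-emission $b_{ijk}$.

```python
# Soft drink machine parameters as used in the chapter's running example.
STATES = ["CP", "IP"]
PI = {"CP": 1.0, "IP": 0.0}
A = {"CP": {"CP": 0.7, "IP": 0.3}, "IP": {"CP": 0.5, "IP": 0.5}}
# Assumption: the emission depends only on the state being left, so b_{ijk} is B[i][k].
B = {"CP": {"cola": 0.6, "ice-t": 0.1, "lem": 0.3},
     "IP": {"cola": 0.1, "ice-t": 0.7, "lem": 0.2}}


def forward(obs):
    """alpha[t][i] for t = 0..T (the text's times 1..T+1)."""
    alpha = [dict(PI)]
    for o in obs:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * A[i][j] * B[i][o] for i in STATES)
                      for j in STATES})
    return alpha


def backward(obs):
    """beta[t][i] for t = 0..T, filled from right to left."""
    beta = [{i: 1.0 for i in STATES}]
    for o in reversed(obs):
        nxt = beta[0]
        beta.insert(0, {i: sum(A[i][j] * B[i][o] * nxt[j] for j in STATES)
                        for i in STATES})
    return beta


def posteriors(obs):
    """Return (gamma, p, likelihood) following (9.13) and (9.16)."""
    alpha, beta = forward(obs), backward(obs)
    likelihood = sum(alpha[-1].values())  # (9.12) evaluated at the last time step
    gamma = [{i: alpha[t][i] * beta[t][i] / likelihood for i in STATES}
             for t in range(len(obs) + 1)]
    p = [{(i, j): alpha[t][i] * A[i][j] * B[i][o] * beta[t + 1][j] / likelihood
          for i in STATES for j in STATES}
         for t, o in enumerate(obs)]
    return gamma, p, likelihood


gamma, p, likelihood = posteriors(["lem", "ice-t", "cola"])
print(round(likelihood, 6))          # P(O | mu)
print(round(p[1][("IP", "CP")], 4))  # p_2(IP, CP)
```

Plain dictionaries are used so that each line can be read against the corresponding formula; a matrix library would be the more usual choice for larger models.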
The reestimation formulas are as follows:

(9.17) $$\hat{\pi}_i = \text{expected frequency in state } i \text{ at time } t = 1 = \gamma_i(1)$$

(9.18) $$\hat{a}_{ij} = \frac{\text{expected number of transitions from state } i \text{ to } j}{\text{expected number of transitions from state } i} = \frac{\sum_{t=1}^{T} p_t(i, j)}{\sum_{t=1}^{T} \gamma_i(t)}$$

(9.19) $$\hat{b}_{ijk} = \frac{\text{expected number of transitions from } i \text{ to } j \text{ with } k \text{ observed}}{\text{expected number of transitions from } i \text{ to } j} = \frac{\sum_{\{t :\, o_t = k,\ 1 \le t \le T\}} p_t(i, j)}{\sum_{t=1}^{T} p_t(i, j)}$$

Thus, from $\mu = (A, B, \Pi)$, we derive $\hat{\mu} = (\hat{A}, \hat{B}, \hat{\Pi})$. Further, as proved by Baum, we have that:

$$P(O \mid \hat{\mu}) \ge P(O \mid \mu)$$

This is a general property of the EM algorithm (see section 14.2.2). Therefore, iterating through a number of rounds of parameter reestimation will improve our model. Normally one continues reestimating the parameters until results are no longer improving significantly. This process of parameter reestimation does not guarantee that we will find the best model, however, because the reestimation process may get stuck in a local maximum (or even possibly just at a saddle point). In most problems of interest, the likelihood function is a complex nonlinear surface and there are many local maxima. Nevertheless, Baum-Welch reestimation is usually effective for HMMs.

To end this section, let us consider reestimating the parameters of the crazy soft drink machine HMM using the Baum-Welch algorithm.
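As a rough illustration of (9.17) and (9.18), the sketch below performs one reestimation step from a table of arc probabilities $p_t(i, j)$. The names `reestimate` and `P_ARC` are hypothetical. The arc probabilities are copied from table (9.20) for the soft drink machine and $O$ = (lem, ice-t, cola). One assumption departs from (9.19) as written: emissions are pooled over the destination state $j$, because in the soft drink machine the output depends only on the state being left.

```python
STATES = ["CP", "IP"]
OBS = ["lem", "ice-t", "cola"]
# P_ARC[t][(i, j)] holds p_t(i, j); values copied from table (9.20).
P_ARC = [
    {("CP", "CP"): 0.3, ("CP", "IP"): 0.7, ("IP", "CP"): 0.0, ("IP", "IP"): 0.0},
    {("CP", "CP"): 0.28, ("CP", "IP"): 0.02, ("IP", "CP"): 0.6, ("IP", "IP"): 0.1},
    {("CP", "CP"): 0.616, ("CP", "IP"): 0.264, ("IP", "CP"): 0.06, ("IP", "IP"): 0.06},
]


def reestimate(p_arc, obs, states):
    """One Baum-Welch update from arc posteriors p_arc[t][(i, j)]."""
    # gamma_i(t) = sum_j p_t(i, j)
    gamma = [{i: sum(p[(i, j)] for j in states) for i in states} for p in p_arc]
    pi_hat = dict(gamma[0])                                   # (9.17)
    a_hat, b_hat = {}, {}
    for i in states:
        leaving_i = sum(g[i] for g in gamma)  # expected transitions out of i
        for j in states:
            i_to_j = sum(p[(i, j)] for p in p_arc)
            a_hat[(i, j)] = i_to_j / leaving_i if leaving_i else 0.0  # (9.18)
        for k in sorted(set(obs)):
            # Assumption: (9.19) pooled over j, i.e. emissions tied to state i.
            with_k = sum(g[i] for g, o in zip(gamma, obs) if o == k)
            b_hat[(i, k)] = with_k / leaving_i if leaving_i else 0.0
    return pi_hat, a_hat, b_hat


pi_hat, a_hat, b_hat = reestimate(P_ARC, OBS, STATES)
print(round(a_hat[("CP", "CP")], 4), round(a_hat[("IP", "CP")], 4))
```

With these inputs the transition estimates come out at about 0.5486 for CP to CP and 0.8049 for IP to CP. A full training loop would recompute the arc probabilities from the updated model before each further update.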
If we let the initial model be the model that we have been using so far, then training on the observation sequence (lem, ice-t, cola) will yield the following values for $p_t(i, j)$:

(9.20)

    Time (and j)     1                        2                        3
                     CP     IP     γ_1        CP     IP     γ_2        CP      IP      γ_3
    i   CP           0.3    0.7    1.0        0.28   0.02   0.3        0.616   0.264   0.88
        IP           0.0    0.0    0.0        0.6    0.1    0.7        0.06    0.06    0.12

and so the parameters will be reestimated as follows:

                 Original                   Reestimated
    Π    CP      1.0                        1.0
         IP      0.0                        0.0

                 CP     IP                  CP       IP
    A    CP      0.7    0.3                 0.5486   0.4514
         IP      0.5    0.5                 0.8049   0.1951

                 cola   ice-t  lem          cola     ice-t    lem
    B    CP      0.6    0.1    0.3          0.4037   0.1376   0.4587
         IP      0.1    0.7    0.2          0.1463   0.8537   0.0

Exercise 9.4 [*]
If one continued running the Baum-Welch algorithm on this HMM and this training sequence, what value would each parameter reach in the limit? Why?

The reason why the Baum-Welch algorithm is performing so strangely here should be apparent: the training sequence is far too short to accurately represent the behavior of the crazy soft drink machine.

Exercise 9.5 [*]
Note that the parameter that is zero in $\Pi$ stays zero. Is that a chance occurrence? What would be the value of the parameter that becomes zero in $B$ if we did another iteration of Baum-Welch reestimation? What generalization can one make about Baum-Welch reestimation of zero parameters?

9.4 HMMs: Implementation, Properties, and Variants

9.4.1 Implementation

Beyond the theory discussed above, there are a number of practical issues in the implementation of HMMs. Care has to be taken to make the implementation of HMM tagging efficient and accurate. The most obvious issue is that the probabilities we are calculating are products of many very small numbers. Such calculations will rapidly underflow the range of floating point numbers on a computer (even if you store them as 'double'!). The Viterbi algorithm only involves multiplications and choosing the largest element.
Thus we can perform the entire Viterbi algorithm working with logarithms. This not only solves the problem with floating point underflow, but it also speeds up the computation, since additions are much quicker than multiplications. In practice, a speedy implementation of the Viterbi algorithm is particularly important because this is the runtime algorithm, whereas training can usually proceed slowly offline.

However, in the Forward-Backward algorithm as well, something still has to be done to prevent floating point underflow. The need to perform summations makes it difficult to use logs. A common solution is to employ auxiliary scaling coefficients, whose values grow with the time $t$ so that the probabilities multiplied by the scaling coefficient remain within the floating point range of the computer. At the end of each iteration, when the parameter values are reestimated, these scaling factors cancel out. Detailed discussion of this and other implementation issues can be found in (Levinson et al. 1983), (Rabiner and Juang 1993: 365-368), (Cutting et al. 1991), and (Dermatas and Kokkinakis 1995). The main alternative is to just use logs anyway, despite the fact that one needs to sum. Effectively then one is calculating an appropriate scaling factor at the time of each addition:

(9.21)
    funct log_add =
        if (y - x > log big)
        then y
        elsif (x - y > log big)
        then x
        else min(x, y) + log(exp(x - min(x, y)) + exp(y - min(x, y)))
        fi.

where big is a suitable large constant like $10^{30}$. For an algorithm like this where one is doing a large number of numerical computations, one also has to be careful about round-off errors, but such concerns are well outside the scope of this chapter.

9.4.2 Variants

There are many variant forms of HMMs that can be made without fundamentally changing them, just as with finite state machines. One is to al- [...]...
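Returning to the two log-space ideas of section 9.4.1, the following Python sketch shows one way they might look in practice: a `log_add` that follows (9.21), and a Viterbi pass in which products of probabilities become sums of logs. The names `LOG_BIG`, `safe_log`, and `viterbi_log` are hypothetical, the guard for $\log 0$ inputs is an addition not present in (9.21), and emissions are again assumed to depend only on the state being left, as in the soft drink machine.

```python
import math

LOG_BIG = math.log(1e30)  # log of "big"; 1e30 follows the constant suggested in the text


def log_add(x, y):
    """Return log(exp(x) + exp(y)) for log-probabilities x and y, as in (9.21)."""
    if x == -math.inf:  # extra guard, not part of (9.21): x represents log(0)
        return y
    if y == -math.inf:
        return x
    if y - x > LOG_BIG:
        return y
    if x - y > LOG_BIG:
        return x
    m = min(x, y)
    return m + math.log(math.exp(x - m) + math.exp(y - m))


def safe_log(p):
    """log(p), with log(0) represented as -inf."""
    return math.log(p) if p > 0 else -math.inf


def viterbi_log(obs, states, pi, a, b):
    """Most likely state path and its log-probability, computed in log space."""
    delta = [{i: safe_log(pi[i]) for i in states}]
    backptr = []
    for o in obs:
        prev, cur, back = delta[-1], {}, {}
        for j in states:
            scores = {i: prev[i] + safe_log(a[i][j]) + safe_log(b[i][o])
                      for i in states}
            back[j] = max(scores, key=scores.get)  # ties go to the first state
            cur[j] = scores[back[j]]
        delta.append(cur)
        backptr.append(back)
    last = max(states, key=lambda i: delta[-1][i])
    path = [last]
    for back in reversed(backptr):  # follow backpointers from right to left
        path.insert(0, back[path[0]])
    return path, delta[-1][last]


states = ["CP", "IP"]
pi = {"CP": 1.0, "IP": 0.0}
a = {"CP": {"CP": 0.7, "IP": 0.3}, "IP": {"CP": 0.5, "IP": 0.5}}
b = {"CP": {"cola": 0.6, "ice-t": 0.1, "lem": 0.3},
     "IP": {"cola": 0.1, "ice-t": 0.7, "lem": 0.2}}
path, log_prob = viterbi_log(["lem", "ice-t", "cola"], states, pi, a, b)
print(path)
```

On the soft drink machine this recovers the state sequence CP, IP, CP, CP listed in Table 9.2. Ties are broken deterministically here rather than randomly, which is a simplification.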
10.2 Markov Model Taggers

                     Second tag
    First tag        AT       BEZ      IN       NN       VB       PERIOD
    AT               0        0        0        48636    0        19
    BEZ              1973     0        426      187      0        38
    IN               43322    0        1325     17314    0        185
    NN               1067     3720     42470    11773    614      21392
    VB               6072     42       4758     1476     129      1522
    PERIOD           8016     75       4656     1329     954      0

Table 10.3 Idealized counts of some tag transitions in the Brown Corpus. For example, NN occurs 48636 times after AT.

The algorithm for training a Markov Model tagger is summarized in figure 10.1. The...

    for all tags t^j do
        for all tags t^k do
            P(t^k | t^j) = C(t^j, t^k) / C(t^j)
        end
    end
    for all tags t^j do
        for all words w^l do
            P(w^l | t^j) = C(w^l, t^j) / C(t^j)
        end
    end

Figure 10.1 Algorithm for training a Visible Markov Model Tagger. In most implementations, a smoothing method is applied for estimating the $P(t^k \mid t^j)$ and $P(w^l \mid t^j)$.

... (e.g., 1973 + 426 + 187 for BEZ).

Exercise 10.5 [*]
Given the data in table 10.4, compute maximum likelihood estimates as shown in figure 10.1 for $P(\text{bear} \mid t^k)$, $P(\text{is} \mid t^k)$, $P(\text{move} \mid t^k)$, $P(\text{president} \mid t^k)$, $P(\text{progress} \mid t^k)$, and $P(\text{the} \mid t^k)$. Take the total number of occurrences of tags from table 10.3.

                     AT       BEZ      IN       NN       VB       PERIOD
    bear             0        0        0        10       43       0
    is               0        10065    0        0        0        0
    move             0        0        0        36       133      0
    on               0        0        5484     0        0        0
    president        0        0        0        382      0        0
    progress         0        0        0        108      4        0
    the              69016    0        0        0        0        0
    .                0        0        0        0        0        48809

Table 10.4 Idealized counts for the tags that some words occur with in the Brown Corpus. For example, 36 occurrences of move are with the tag NN.

Exercise 10.6 [*]
Compute the following two probabilities: $P(\text{AT NN BEZ IN AT NN} \mid \text{The bear is on the move.})$...
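The training loop of figure 10.1 amounts to counting and dividing, which the Python sketch below illustrates without any smoothing. The function name `train_visible_markov_tagger` and the toy corpus are hypothetical and are not Brown Corpus data. Tag pairs and word-tag pairs that never occur are simply absent from the returned dictionaries, which corresponds to an unsmoothed estimate of zero.

```python
from collections import Counter


def train_visible_markov_tagger(tagged_words):
    """Unsmoothed maximum likelihood estimates from a list of (word, tag) pairs."""
    tags = [tag for _, tag in tagged_words]
    tag_count = Counter(tags)                    # C(t^j)
    bigram_count = Counter(zip(tags, tags[1:]))  # C(t^j, t^k)
    word_tag_count = Counter(tagged_words)       # C(w^l, t^j)
    # P(t^k | t^j) = C(t^j, t^k) / C(t^j)
    transition = {(tj, tk): c / tag_count[tj]
                  for (tj, tk), c in bigram_count.items()}
    # P(w^l | t^j) = C(w^l, t^j) / C(t^j)
    emission = {(w, tj): c / tag_count[tj]
                for (w, tj), c in word_tag_count.items()}
    return transition, emission


# Hypothetical toy corpus, with a leading period acting as sentence delimiter.
toy = [(".", "PERIOD"), ("the", "AT"), ("bear", "NN"), ("is", "BEZ"),
       ("on", "IN"), ("the", "AT"), ("move", "NN"), (".", "PERIOD")]
transition, emission = train_visible_markov_tagger(toy)
print(transition[("AT", "NN")])  # P(NN | AT) in the toy corpus
print(emission[("bear", "NN")])  # P(bear | NN) in the toy corpus
```

In a realistic setting the zero estimates for unseen events are the reason the figure caption mentions smoothing.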
10 Part-of-Speech Tagging

to the tag PERIOD:

$$\delta_1(\text{PERIOD}) = 1.0$$
$$\delta_1(t) = 0.0 \text{ for } t \ne \text{PERIOD}$$

That is, we assume that sentences are delimited by periods and we prepend a period in front of the first sentence in our text for convenience.

    comment: Given: a sentence of length n
    comment: Initialization
    δ_1(PERIOD) = 1.0
    δ_1(t) = 0.0 for t ≠ PERIOD
    comment: Induction
    for i := 1 to n step 1 do
        for all tags ...
    comment: Termination and path-readout
    X_{n+1} = argmax_{1 ≤ j ≤ T} δ_{n+1}(j)
    for j := n to 1 step -1 do
        X_j = ψ_{j+1}(X_{j+1})
    end
    P(X_1, ..., X_n) = max_{1 ≤ j ≤ T} δ_{n+1}(t^j)

Figure 10.2 Algorithm for tagging with a Visible Markov Model Tagger.

The induction step is based on equation (10.7), where $a_{jk} = P(t^k \mid t^j)$ and $b_{jkw^l} = P(w^l \mid t^j)$:

$$\delta_{i+1}(t^j) = \max_{1 \le k \le T} \left[ \delta_i(t^k) \times P(w_{i+1} \mid t^j) \times P(t^j \mid t^k) \right], \quad 1 \le j \le T$$

$$\psi_{i+1}(t^j) = \arg\max_{1 \le k \le T} \left[ \delta_i(t^k) \ldots \right.$$

... Between 96% and 97% of tokens are disambiguated correctly by the most successful approaches. However, it is important to realize that this impressive accuracy figure is not quite as good as it looks, because it is evaluated on a per-word basis. For instance, in many genres such as newspapers, the average sentence is over twenty words, and on such sentences, even with a tagging accuracy of 96% this means...

... investigated in detail by Elworthy (1994). He trained HMMs from the different starting conditions in table 10.6. The combination of D0 and T0 corresponds to Visible Markov Model training as we described it at the beginning of this chapter. D1 orders the lexical probabilities correctly (for example, the fact that the tag VB is more likely for...
... tagging is an intermediate layer of representation that is useful and more tractable than full parsing is due to the corpus linguistics work that was led by Francis and Kučera at Brown University in the 1960s and 70s (Francis and Kučera 1982). The following sections deal with Markov Model taggers, Hidden Markov Model taggers and transformation-based tagging. At the end of the chapter, we discuss levels of...

... follows:

$$P(t_{i+1} \mid t_{1,i}) = P(t_{i+1} \mid t_i)$$

We use a training set of manually tagged text to learn the regularities of tag sequences. The maximum likelihood estimate of tag $t^k$ following ...

Notation (symbols listed): $w_i$, $t_i$, $w_{i,i+m}$, $t_{i,i+m}$, $w^l$, $t^j$, $C(w^l)$, $C(t^j)$, $C(t^j, t^k)$, $C(w^l : t^j)$, $T$, $W$, $n$

    $w_i$          the word at position $i$ in the corpus
    $t_i$          the tag of $w_i$
    $w_{i,i+m}$    the words occurring at positions $i$ through $i+m$ (alternative notations: $w_i \cdots w_{i+m}$, ...
