probability partial parse tree spanning a certain substring that is rooted with a certain nonterminal. We will retain the name $\delta$ and use accumulators:

$\delta_i(p,q)$ = the highest inside probability parse of a subtree $N^i_{pq}$

Using dynamic programming, we can then calculate the most probable parse for a sentence as follows. The initialization step assigns to each unary production at a leaf node its probability. For the inductive step, we again know that the first rule applying must be a binary rule, but this time we find the most probable one instead of summing over all such rules, and record that most probable one in the $\psi$ variables, whose values are a list of three integers recording the form of the rule application which had the highest probability.

1. Initialization

   $\delta_i(p,p) = P(N^i \rightarrow w_p)$

2. Induction

   $\delta_i(p,q) = \max_{1 \le j,k \le n,\ p \le r < q} P(N^i \rightarrow N^j N^k)\, \delta_j(p,r)\, \delta_k(r+1,q)$

   Store backtrace:

   $\psi_i(p,q) = \arg\max_{(j,k,r)} P(N^i \rightarrow N^j N^k)\, \delta_j(p,r)\, \delta_k(r+1,q)$

3. Termination and path readout (by backtracking). Since our grammar has a start symbol $N^1$, then by construction, the probability of the most likely parse rooted in the start symbol is:

(11.23)  $P(\hat{t}) = \delta_1(1,m)$

We want to reconstruct this maximum probability tree $\hat{t}$. We do this by regarding $\hat{t}$ as a set of nodes and showing how to construct this set. Since the grammar has a start symbol, the root node of the tree must be $N^1_{1m}$. We then show in general how to construct the left and right daughter nodes of a nonterminal node, and applying this process recursively will allow us to reconstruct the entire tree. If a node $N^i_{pq}$ is in the Viterbi parse, and $\psi_i(p,q) = (j,k,r)$, then:

left($N^i_{pq}$) = $N^j_{pr}$
right($N^i_{pq}$) = $N^k_{(r+1)q}$

Note that where we have written 'argmax' above, it is possible for there not to be a unique maximum. We assume that in such cases the parser just chooses one maximal parse at random. It actually makes things considerably more complex to preserve all ties.

2. We could alternatively find the highest probability node of any category that dominates the entire sentence as $\max_i \delta_i(1,m)$.
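To make this dynamic program concrete, here is a minimal sketch of such a Viterbi-style parser for a PCFG in Chomsky Normal Form. The dictionary-based grammar representation and all function and variable names are illustrative assumptions, not notation from the text, and no attention is paid to efficiency.

```python
# A minimal sketch of Viterbi parsing for a PCFG in Chomsky Normal Form.
# Hypothetical grammar representation: binary_rules[(A, B, C)] = P(A -> B C),
# unary_rules[(A, w)] = P(A -> w). Spans (p, q) are 1-based and inclusive,
# following the delta_i(p, q) notation in the text.

def viterbi_parse(words, binary_rules, unary_rules, start="S"):
    m = len(words)
    delta = {}  # delta[(A, p, q)] = highest probability of a parse of w_p..w_q rooted in A
    psi = {}    # psi[(A, p, q)] = (B, C, r) backtrace recording the best rule application

    # Initialization: delta_i(p, p) = P(N^i -> w_p)
    for p, w in enumerate(words, start=1):
        for (A, word), prob in unary_rules.items():
            if word == w:
                delta[(A, p, p)] = prob

    # Induction: maximize instead of summing over binary rules and split points
    for span in range(2, m + 1):
        for p in range(1, m - span + 2):
            q = p + span - 1
            for (A, B, C), rule_prob in binary_rules.items():
                for r in range(p, q):
                    prob = (rule_prob * delta.get((B, p, r), 0.0)
                            * delta.get((C, r + 1, q), 0.0))
                    if prob > delta.get((A, p, q), 0.0):
                        delta[(A, p, q)] = prob
                        psi[(A, p, q)] = (B, C, r)

    # Termination and path readout by backtracking from the start symbol
    def build(A, p, q):
        if p == q:
            return (A, words[p - 1])
        B, C, r = psi[(A, p, q)]
        return (A, build(B, p, r), build(C, r + 1, q))

    best_prob = delta.get((start, 1, m), 0.0)
    return best_prob, (build(start, 1, m) if best_prob > 0.0 else None)
```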
11.3.4 Training a PCFG

The idea of training a PCFG is grammar learning or grammar induction, but only in a certain limited sense. We assume that the structure of the grammar, in terms of the number of terminals and nonterminals and the name of the start symbol, is given in advance. We also assume that the set of rules is given in advance. Often one assumes that all possible rewriting rules exist, but one can alternatively assume some pre-given structure in the grammar, such as making some of the nonterminals dedicated preterminals that may only be rewritten as a terminal node. Training the grammar comprises simply a process that tries to find the optimal probabilities to assign to different grammar rules within this architecture.

As in the case of HMMs, we construct an EM training algorithm, the Inside-Outside algorithm, which allows us to train the parameters of a PCFG on unannotated sentences of the language. The basic assumption is that a good grammar is one that makes the sentences in the training corpus likely to occur, and hence we seek the grammar that maximizes the likelihood of the training data. We will present training first on the basis of a single sentence, and then show how it is extended to the more realistic situation of a large training corpus of many sentences, by assuming independence between sentences.

To determine the probability of rules, what we would like to calculate is:

$\hat{P}(N^j \rightarrow \zeta) = \dfrac{C(N^j \rightarrow \zeta)}{\sum_{\gamma} C(N^j \rightarrow \gamma)}$

where $C(\cdot)$ is the count of the number of times that a particular rule is used. If parsed corpora are available, we can calculate these probabilities directly (as discussed in chapter 12).
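As an illustration of this relative-frequency estimate, the sketch below counts rule uses in a tiny collection of bracketed trees and normalizes the counts for each left-hand-side nonterminal. The nested-tuple tree format and all names are hypothetical; a real treebank such as the Penn Treebank would need its own reader.

```python
from collections import defaultdict

# A minimal sketch of relative-frequency estimation of rule probabilities from
# a parsed corpus. Trees are assumed to be nested tuples (label, child, ...),
# where a leaf child is a plain string. This representation is illustrative.

def count_rules(tree, counts):
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        counts[(label, children[0])] += 1          # preterminal rule N^j -> w
    else:
        rhs = tuple(child[0] for child in children)
        counts[(label,) + rhs] += 1                # rule N^j -> N^r N^s ...
        for child in children:
            count_rules(child, counts)

def estimate_pcfg(treebank):
    counts = defaultdict(int)
    for tree in treebank:
        count_rules(tree, counts)
    lhs_totals = defaultdict(int)
    for rule, c in counts.items():
        lhs_totals[rule[0]] += c
    # P(N^j -> zeta) = C(N^j -> zeta) / sum_gamma C(N^j -> gamma)
    return {rule: c / lhs_totals[rule[0]] for rule, c in counts.items()}

# Example (toy tree in the assumed format):
treebank = [("S", ("NP", "astronomers"), ("VP", ("V", "saw"), ("NP", "stars")))]
print(estimate_pcfg(treebank))
```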
If, as is more common, a parsed training corpus is not available, then we have a hidden data problem: we wish to determine the probability functions on rules, but can only directly see the probabilities of sentences. As we don't know the rule probabilities, we cannot compute relative frequencies, so we instead use an iterative algorithm to determine improving estimates. We begin with a certain grammar topology, which specifies how many terminals and nonterminals there are, and some initial probability estimates for rules (perhaps just randomly chosen). We use the probability of each parse of a training sentence according to this grammar as our confidence in it, and then sum the probabilities of each rule being used in each place to give an expectation of how often each rule was used. These expectations are then used to refine our probability estimates on rules, so that the likelihood of the training corpus given the grammar is increased.

Consider:

$\alpha_j(p,q)\, \beta_j(p,q) = P(N^1 \stackrel{*}{\Rightarrow} w_{1m},\ N^j \stackrel{*}{\Rightarrow} w_{pq} \mid G)$

We have already solved how to calculate $P(w_{1m} \mid G)$; let us call this probability $\pi$. Then the probability that the nonterminal $N^j$ spans the words $w_{pq}$ in the derivation is $\alpha_j(p,q)\beta_j(p,q)/\pi$, and the estimate for how many times the nonterminal $N^j$ is used in the derivation is:

(11.24)  $E(N^j \text{ is used in the derivation}) = \sum_{p=1}^{m} \sum_{q=p}^{m} \dfrac{\alpha_j(p,q)\, \beta_j(p,q)}{\pi}$

In the case where we are not dealing with a preterminal, we substitute the inductive definition of the inside probability into the above probability. Therefore, the estimate for how many times this particular rule is used in the derivation can be found by summing over all ranges of words that the node could dominate:

(11.25)  $E(N^j \rightarrow N^r N^s,\ N^j \text{ used}) = \sum_{p=1}^{m-1} \sum_{q=p+1}^{m} \sum_{d=p}^{q-1} \dfrac{\alpha_j(p,q)\, P(N^j \rightarrow N^r N^s)\, \beta_r(p,d)\, \beta_s(d+1,q)}{\pi}$

Now for the maximization step, we want:

$\hat{P}(N^j \rightarrow N^r N^s) = \dfrac{E(N^j \rightarrow N^r N^s,\ N^j \text{ used})}{E(N^j \text{ used})}$

So, the reestimation formula is:

(11.26)  $\hat{P}(N^j \rightarrow N^r N^s) = \dfrac{\sum_{p=1}^{m-1} \sum_{q=p+1}^{m} \sum_{d=p}^{q-1} \alpha_j(p,q)\, P(N^j \rightarrow N^r N^s)\, \beta_r(p,d)\, \beta_s(d+1,q)}{\sum_{p=1}^{m} \sum_{q=p}^{m} \alpha_j(p,q)\, \beta_j(p,q)}$

Similarly, for preterminals,

(11.27)  $E(N^j \rightarrow w^k,\ N^j \text{ used}) = \sum_{h=1}^{m} \dfrac{\alpha_j(h,h)\, P(N^j \rightarrow w^k)\, P(w_h = w^k)}{\pi}$

The $P(w_h = w^k)$ above is, of course, either 0 or 1, but we express things in this form to show maximal similarity with the preceding case. Therefore,

(11.28)  $\hat{P}(N^j \rightarrow w^k) = \dfrac{\sum_{h=1}^{m} \alpha_j(h,h)\, P(N^j \rightarrow w^k)\, P(w_h = w^k)}{\sum_{p=1}^{m} \sum_{q=p}^{m} \alpha_j(p,q)\, \beta_j(p,q)}$

Unlike the case of HMMs, this time we cannot possibly avoid the problem of dealing with multiple training instances: one cannot use concatenation as in the HMM case. Let us assume that we have a set of training sentences $W = (W_1, \ldots, W_\omega)$, with $W_i = w_1 \cdots w_{m_i}$. Let $f_i(p,q,j,r,s)$, $g_i(h,j,k)$, and $h_i(p,q,j)$ be the common subterms from before for the use of a nonterminal at a branching node, at a preterminal node, and anywhere, respectively, now calculated from sentence $W_i$ (with $\alpha$, $\beta$, and $\pi_i$ computed from $W_i$):

$f_i(p,q,j,r,s) = \dfrac{1}{\pi_i} \sum_{d=p}^{q-1} \alpha_j(p,q)\, P(N^j \rightarrow N^r N^s)\, \beta_r(p,d)\, \beta_s(d+1,q)$

$g_i(h,j,k) = \dfrac{1}{\pi_i}\, \alpha_j(h,h)\, P(N^j \rightarrow w^k)\, P(w_h = w^k)$

$h_i(p,q,j) = \dfrac{1}{\pi_i}\, \alpha_j(p,q)\, \beta_j(p,q)$

If we assume that the sentences in the training corpus are independent, then the likelihood of the training corpus is just the product of the probabilities of the sentences in it according to the grammar. Therefore, in the reestimation process, we can sum the contributions from multiple sentences to give the following reestimation formulas. Note that the denominators consider all expansions of the nonterminal, as terminals or nonterminals, to satisfy the stochastic constraint in equation (11.3) that a nonterminal's expansions sum to 1.

(11.29)  $\hat{P}(N^j \rightarrow N^r N^s) = \dfrac{\sum_{i=1}^{\omega} \sum_{p=1}^{m_i - 1} \sum_{q=p+1}^{m_i} f_i(p,q,j,r,s)}{\sum_{i=1}^{\omega} \sum_{p=1}^{m_i} \sum_{q=p}^{m_i} h_i(p,q,j)}$

and

(11.30)  $\hat{P}(N^j \rightarrow w^k) = \dfrac{\sum_{i=1}^{\omega} \sum_{h=1}^{m_i} g_i(h,j,k)}{\sum_{i=1}^{\omega} \sum_{p=1}^{m_i} \sum_{q=p}^{m_i} h_i(p,q,j)}$

The Inside-Outside algorithm is to repeat this process of parameter reestimation until the change in the estimated probability of the training corpus is small. If $G_i$ is the grammar (including rule probabilities) in the $i$th iteration of training, then we are guaranteed that the probability of the corpus according to the model will improve or at least get no worse:

$P(W \mid G_{i+1}) \geq P(W \mid G_i)$
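The reestimation formulas above translate fairly directly into code. The following is a minimal sketch of one Inside-Outside iteration for a grammar in Chomsky Normal Form, reusing the hypothetical dictionary-based grammar representation of the earlier sketches. It is meant only to show the flow of the computation; an efficient implementation would index rules by category instead of looping over the whole rule set, and would rescale or work in log space to avoid underflow.

```python
from collections import defaultdict

def inside(words, binary, unary):
    # beta[(A, p, q)] = inside probability of A spanning words p..q (1-based)
    m = len(words)
    beta = defaultdict(float)
    for h, w in enumerate(words, 1):
        for (A, word), prob in unary.items():
            if word == w:
                beta[(A, h, h)] = prob
    for span in range(2, m + 1):
        for p in range(1, m - span + 2):
            q = p + span - 1
            for (A, B, C), rp in binary.items():
                for d in range(p, q):
                    beta[(A, p, q)] += rp * beta[(B, p, d)] * beta[(C, d + 1, q)]
    return beta

def outside(words, binary, beta, start):
    # alpha[(A, p, q)] = outside probability of A spanning words p..q
    m = len(words)
    alpha = defaultdict(float)
    alpha[(start, 1, m)] = 1.0           # only the start symbol may span the whole sentence
    for span in range(m, 1, -1):         # parents before children: larger spans first
        for p in range(1, m - span + 2):
            q = p + span - 1
            for (A, B, C), rp in binary.items():
                out = alpha[(A, p, q)]
                if out == 0.0:
                    continue
                for d in range(p, q):
                    alpha[(B, p, d)] += out * rp * beta[(C, d + 1, q)]
                    alpha[(C, d + 1, q)] += out * rp * beta[(B, p, d)]
    return alpha

def reestimate(corpus, binary, unary, start="S"):
    num_bin, num_un, denom = defaultdict(float), defaultdict(float), defaultdict(float)
    for words in corpus:
        m = len(words)
        beta = inside(words, binary, unary)
        alpha = outside(words, binary, beta, start)
        pi = beta[(start, 1, m)]         # sentence probability P(w_1m | G)
        if pi == 0.0:
            continue
        # Expected uses of each nonterminal anywhere: the shared denominator
        for (A, p, q), b in list(beta.items()):
            denom[A] += alpha[(A, p, q)] * b / pi
        # Expected uses of each binary rule N^j -> N^r N^s, as in (11.25)
        for (A, B, C), rp in binary.items():
            for p in range(1, m):
                for q in range(p + 1, m + 1):
                    for d in range(p, q):
                        num_bin[(A, B, C)] += (alpha[(A, p, q)] * rp *
                                               beta[(B, p, d)] * beta[(C, d + 1, q)]) / pi
        # Expected uses of each preterminal rule N^j -> w^k, as in (11.27)
        for h, w in enumerate(words, 1):
            for (A, word), rp in unary.items():
                if word == w:
                    num_un[(A, word)] += alpha[(A, h, h)] * rp / pi
    new_binary = {r: c / denom[r[0]] for r, c in num_bin.items() if denom[r[0]] > 0}
    new_unary = {r: c / denom[r[0]] for r, c in num_un.items() if denom[r[0]] > 0}
    return new_binary, new_unary
```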
11.4 Problems with the Inside-Outside Algorithm

However, the PCFG learning algorithm is not without problems:

1. Compared with linear models like HMMs, it is slow. For each sentence, each iteration of training is $O(m^3 n^3)$, where $m$ is the length of the sentence, and $n$ is the number of nonterminals in the grammar.

2. Local maxima are much more of a problem. Charniak (1993) reports that on each of 300 trials of PCFG induction (from randomly initialized parameters, using artificial data generated from a simple English-like PCFG) a different local maximum was found. In other words, the algorithm is very sensitive to the initialization of the parameters. This might perhaps be a good place to try another learning method. (For instance, the process of simulated annealing has been used with some success with neural nets to avoid problems of getting stuck in local maxima (Kirkpatrick et al. 1983; Ackley et al. 1985), but it is still perhaps too computationally expensive for large-scale PCFGs.) Other partial solutions are restricting rules by initializing some parameters to zero, performing grammar minimization, or reallocating nonterminals away from "greedy" terminals. Such approaches are discussed in Lari and Young (1990).

3. Based on experiments on artificial languages, Lari and Young (1990) suggest that satisfactory grammar learning requires many more nonterminals than are theoretically needed to describe the language at hand. In their experiments one typically needed about 3n nonterminals to satisfactorily learn a grammar from a training text generated by a grammar with n nonterminals. This compounds the first problem.

4. While the algorithm is guaranteed to increase the probability of the training corpus, there is no guarantee that the nonterminals that the algorithm learns will have any satisfactory resemblance to the kinds of nonterminals normally motivated in linguistic analysis (NP, VP, etc.). Even if one initializes training with a grammar of the sort familiar to linguists, the training regime may completely change the meaning of nonterminal categories as it thinks best. As we have set things up, the only hard constraint is that $N^1$ must remain the start symbol. One option is to impose further constraints on the nature of the grammar. For instance, one could specialize the nonterminals so that they each only generate either terminals or nonterminals. Using this form of grammar would actually also simplify the reestimation equations we presented above.

Thus, while grammar induction from unannotated corpora is possible in principle with PCFGs, in practice it is extremely difficult. In different ways, many of the approaches of the next chapter address various of the limitations of using vanilla PCFGs.

11.5 Further Reading

A comprehensive discussion of topics like weak and strong equivalence, Chomsky Normal Form, and algorithms for changing arbitrary CFGs into various normal forms can be found in (Hopcroft and Ullman 1979). Standard techniques for parsing with CFGs in NLP can be found in most AI and NLP textbooks, such as (Allen 1995).

Probabilistic CFGs were first studied in the late 1960s and early 1970s, and initially there was an outpouring of work. Booth and Thompson (1973), following on from Booth (1969), define a PCFG as in this chapter (modulo notation). Among other results, they show that there are probability distributions on the strings of context-free languages which cannot be generated by a PCFG, and derive necessary and sufficient conditions for a PCFG to define a proper probability distribution. Other work from this period includes work by Grenander, Suppes, and Huang and Fu, and several theses (Horning 1969; Ellis 1969; Hutchins 1970). Tree structures in probability theory are normally referred to as branching processes, and are discussed in such work as (Harris 1963) and (Sankoff 1971).
During the following years, work on stochastic formal languages largely died out, and PCFGs were really only kept alive by the speech community, as an occasionally tried variant model. The Inside-Outside algorithm was introduced, and its convergence properties formally proved, by Baker (1979). Our presentation essentially follows (Lari and Young 1990). This paper includes a proof of the algorithmic complexity of the Inside-Outside algorithm. Their work is further developed in (Lari and Young 1991). For the extension of the algorithms presented here to arbitrary PCFGs, see (Charniak 1993) or (Kupiec 1991). Jelinek et al. (1990) and Jelinek et al. (1992a) provide a thorough introduction to PCFGs. In particular, these reports, and also Jelinek and Lafferty (1991) and Stolcke (1995), present incremental left-to-right versions of the Inside and Viterbi algorithms, which are very useful in contexts such as language models for speech recognition.

In the section on training a PCFG, we assumed a fixed grammar architecture. This naturally raises the question of how one should determine this architecture, and how one would learn it automatically. There has been a little work on automatically determining a suitable architecture using model merging, a Minimum Description Length approach (Stolcke and Omohundro 1994b; Chen 1995), but at present this task is still normally carried out by using the intuitions of a linguist.

3. For anyone familiar with chart parsing, the extension is fairly straightforward: in a chart we always build maximally binary 'traversals' as we move the dot through rules. We can use this virtual grammar, with appropriate probabilities, to parse arbitrary PCFGs (the rule that completes a constituent can have the same probability as the original rule, while all others have probability 1).

PCFGs have also been used in bioinformatics (e.g., Sakakibara et al. 1994), but not nearly as much as HMMs.

11.6 Exercises

Exercise 11.1
Consider the probability of a (partial) syntactic parse tree giving part of the structure of a sentence:

[tree diagram: an NP expanding to Det and N', where N' expands to Adj and N]

In general, as the tree gets large, we cannot accurately estimate the probability of such trees from any existing training corpus (a data sparseness problem). As we saw, PCFGs approach this problem by estimating the probability of a tree like the one above from the joint probabilities of the local subtrees it contains. However, how reasonable is it to assume independence between the probability distributions of these local subtrees (which is the assumption that licenses us to estimate the probability of a tree as the product of the probability of each local tree it contains)? Use a parsed corpus (e.g., the Penn Treebank) and find for some common subtrees whether the independence assumption seems justified or not. If it is not, see if you can find a method of combining the probabilities of local subtrees in such a way that it results in an empirically better estimate of the probability of a larger tree.

Exercise 11.2
Using a parse triangle as in figure 11.3, calculate the outside probabilities for the sentence astronomers saw stars with ears according to the grammar in table 11.2. Start at the top righthand corner and work towards the diagonal.

Exercise 11.3
Using the inside and outside probabilities for the sentence astronomers saw stars with ears worked out in figure 11.3 and exercise 11.2, reestimate the probabilities of the grammar in table 11.2 by working through one iteration of the Inside-Outside algorithm.
It is helpful to first link up the inside probabilities shown in figure 11.3 with the particular rules and constituents used to obtain them. What would the rule probabilities converge to with continued iterations of the Inside-Outside algorithm? Why?

Exercise 11.4
Recording possible spans of nodes in a parse triangle such as the one in figure 11.3 is the essence of the Cocke-Kasami-Younger (CKY) algorithm for parsing (Younger 1967; Hopcroft and Ullman 1979). Writing a CKY PCFG parser is quite straightforward, and a good exercise. One might then want to extend the parser from Chomsky Normal Form grammars to the more general case of context-free grammars. One way is to work out the general case oneself, or to consult the appropriate papers in the Further Reading. Another way is to write a grammar transformation that will take a CFG and convert it into a Chomsky Normal Form CFG by introducing specially-marked additional nodes where necessary, which can then be removed on output to display parse trees as given by the original grammar (a minimal sketch of such a transformation is given after exercise 11.5). This task is quite easy if one restricts the input CFG to one that does not contain any empty nodes (nonterminals that expand to give nothing).

Exercise 11.5
Rather than simply parsing a sequence of words, if interfacing a parser to a speech recognizer, one often wants to be able to parse a word lattice, of the sort shown in figure 12.1. Extend a PCFG parser so it works with word lattices. (Because the running time of a PCFG parser is dependent on the number of words in the word lattice, a PCFG parser can be impractical when dealing with large speech lattices, but our computers keep getting faster every year!)
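As a companion to exercise 11.4 and footnote 3, the following is a minimal sketch of such a binarizing transformation. The rule and tree representations are the same hypothetical nested-tuple and dictionary formats used in the earlier sketches, and the '@'-prefixed naming scheme for the specially-marked nodes is an invented convention. This sketch places the original rule's probability on the first rule of the introduced chain rather than on the completing rule as footnote 3 suggests; either choice leaves the probability of a derivation unchanged, since all other introduced rules get probability 1. Handling of unary chains and empty nodes is omitted.

```python
# A minimal sketch of binarizing a PCFG so that long right-hand sides become
# chains of specially marked intermediate nonterminals. Rules are assumed to be
# a dict mapping tuples (lhs, rhs_1, ..., rhs_n) to probabilities.

def binarize(rules):
    binary = {}
    for rule, prob in rules.items():
        lhs, rhs = rule[0], list(rule[1:])
        if len(rhs) <= 2:
            binary[rule] = prob
            continue
        # A -> X1 X2 ... Xn  becomes  A -> X1 @A:X2_..._Xn (same probability),
        # with every subsequent marked rule getting probability 1.
        current_lhs, current_prob = lhs, prob
        while len(rhs) > 2:
            first, rest = rhs[0], rhs[1:]
            marker = "@%s:%s" % (lhs, "_".join(rest))
            binary[(current_lhs, first, marker)] = current_prob
            current_lhs, current_prob, rhs = marker, 1.0, rest
        binary[(current_lhs, rhs[0], rhs[1])] = current_prob
    return binary

def unbinarize(tree):
    # Remove marked nodes on output so that parse trees look like the original grammar.
    label, children = tree[0], [unbinarize(c) if isinstance(c, tuple) else c
                                for c in tree[1:]]
    flat = []
    for c in children:
        if isinstance(c, tuple) and c[0].startswith("@"):
            flat.extend(c[1:])   # splice the marked node's children into its parent
        else:
            flat.append(c)
    return (label,) + tuple(flat)
```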
12 Probabilistic Parsing

The practice of parsing can be considered as a straightforward implementation of the idea of chunking: recognizing higher level units of structure that allow us to compress our description of a sentence. One way to capture the regularity of chunks over different sentences is to learn a grammar that explains the structure of the chunks one finds. This is the problem of grammar induction. There has been considerable work on grammar induction, because it is exploring the empiricist question of how to learn structure from unannotated textual input, but we will not cover it here. Suffice it to say that grammar induction techniques are reasonably well understood for finite state languages, but that induction is very difficult for context-free or more complex languages of the scale needed to handle a decent proportion of the complexity of human languages. It is not hard to induce some form of structure over a corpus of text. Any algorithm for making chunks, such as recognizing common subsequences, will produce some form of representation of sentences, which we might interpret as a phrase structure tree. However, most often the representations one finds bear little resemblance to the kind of phrase structure that is normally proposed in linguistics and NLP. Now, there is enough argument and disagreement within the field of syntax that one might find someone who has proposed syntactic structures similar to the ones that the grammar induction procedure which you have sweated over happens to produce. This can and has been taken as evidence for that model of syntactic structure. However, such an approach has more than a whiff of circularity to it. The structures found depend on the implicit inductive bias of the learning program. This suggests another tack. We need to get straight what structure we expect our