UNSUPERVISED INDUCTION OF LATENT SEMANTIC GRAMMARS WITH APPLICATION TO PARSING
A Dissertation Presented for the Doctor of Philosophy Degree
The University of Memphis
Copyright 2006 by Andrew McGregor Olney. All rights reserved.
To the Graduate Council:
I am submitting herewith a dissertation written by Andrew McGregor Olney entitled "Unsupervised Induction of Latent Semantic Grammars with Application to Parsing." I have examined the final copy of this dissertation for form and
content and recommend that it be accepted in partial fulfillment of the
requirements for the degree of Doctor of Philosophy with a major in
Mathematical Sciences.
Thomas Lee McCauley, Ph.D.
Major Professor
DEDICATION
I say unto you:
You must have chaos in your heart to give birth to a dancing star.
I say unto you:
You have chaos in you yet.
ACKNOWLEDGEMENTS
While many people have helped to bring me to this point, I can only acknowledge a core few. First and foremost, my family has been extraordinary in supporting my education. To try to say more in such a short space would be ultimately to say less; suffice to say that I feel privileged to be family and friend with such excellent people.
Intellectually, I have some reservation in naming people and institutions, because it seems it was often the case that they did not have the effect upon me that they intended. Instead of accepting theory, I tend to challenge it and generally ask embarrassing questions. However, I would be remiss if I did not mention the great debt I feel I owe the Department of Phonetics and Linguistics at University College London, and particularly Hans van der Koot, Neil Smith, and John Campbell. Likewise at the University of Sussex, Phil Husbands, Inman Harvey, Shimon Edelman, and Rudi Lutz.
And of course at the University of Memphis, where I had the great blind fortune of discovering the Institute for Intelligent Systems. What a strange twist of fate to study abroad and then find world class research in your own back yard! The past few years working in Art Graesser's lab have been invaluable, allowing me to develop the research chops necessary for my own further work in language and human computer interaction. Everyone in the lab has helped me greatly, and I thank them all.
ABSTRACT
Olney, Andrew McGregor. Ph.D. The University of Memphis. August, 2006. Unsupervised Induction of Latent Semantic Grammars with Application to Parsing. Major Professor: Thomas Lee McCauley, Ph.D.
CONTENTS

List of Tables
List of Figures
List of Abbreviations

1 Introduction
1.1 Overview
1.2 Problem Statement
1.3 Purpose of the Study
1.3.1 Theoretical Implications
1.3.2 Practical Implications
1.3.3 Research Questions
1.4 Delimitations
1.5 Operational Definitions
1.6 Looking Ahead

2 Previous Work
2.1 Introduction
2.2 Theoretical Foundations
2.2.1 Induction
2.2.2 Grammars
2.2.3 Grammar Induction
2.2.4 Latent Semantic Information
2.3 Unsupervised English Grammar Induction
2.3.1 Part of Speech Induction
2.3.2 English Grammar Induction
3.3.1 Latent Semantic Analysis as Distributional Analysis
3.3.2 Word Order in Latent Semantic Analysis
3.3.3 Latent Semantic Contexts
3.3.4 Computational Complexity of Singular Value Decomposition
3.3.5 Building Hierarchical Structures
3.3.6 Comparative Meaning
3.4 Data Analysis
3.5 Summary of Dissertation Overview

4 Findings
4.1 Introduction
4.2 Experiment 1
4.2.1 Method
4.2.2 Results
4.2.3 Discussion
4.3 Experiment 2
4.3.1 Method
4.3.2 Results
4.3.3 Discussion
4.4 Experiment 3
4.4.1 Method
4.4.2 Results
4.4.3 Discussion
4.5 Conclusion

5 Conclusion
5.1 Overview
5.2 Interpretation of Findings
5.2.1 Non-orthogonal Singular Value Decomposition
5.2.2 Latent Semantic Parsing
5.2.3 Latent Semantic Grammars and Comparative Meaning
5.3 Implications for Social Change
5.4 Recommendation for Action
5.5 Recommendation for Further Study
5.6 Concluding Statement
LIST OF TABLES

2.1 Evaluation outcomes from Klein & Manning (2004)
4.1 Dependency Extraction Rules
4.2 Nearest Neighbors for Context… at 50 Dimensions
4.3 Nearest Neighbors for Context… at 50 Dimensions
4.4 WSJ10 Dependency Parsing Results
4.5 WSJ10 Dependency Parsing Results by Type
4.6 Microsoft Research Paraphrase Corpus Testing Results
LIST OF FIGURES

2.1 Kinds of Induction
2.2 The Chomsky Hierarchy
2.3 Regular Grammar Branching
2.4 Context Free Grammar Branching
2.5 A Lexicalized CFG Tree
2.6 A Dependency Graph
2.7 A Non-Projective Dependency Graph
2.8 Substitutability Tag Tree
LIST OF ABBREVIATIONS

20K  "20,000 Leagues Under the Sea" Paraphrase Corpus
ABL  Alignment Based Learning
ADIOS  Automatic Distillation of Structure
ATIS  Air Travel Information System
CFG  Context Free Grammar
EM  Expectation Maximization
LHS  Left Hand Side
LSA  Latent Semantic Analysis
LSI  Latent Semantic Indexing
MRPC  Microsoft Research Paraphrase Corpus
PARSEVAL  Parsing Evaluation
RG  Regular Grammar
RHS  Right Hand Side
SVD  Singular Value Decomposition
TASA  Touchstone Applied Science Associates Corpus
UG  Universal Grammar
WSJ10  Wall Street Journal corpus, sentences of at most 10 words
CHAPTER 1
INTRODUCTION

You shall know a word by the company it keeps. Firth (1957)

1.1 Overview
Grammar induction finds structure in data and expresses that structure in the form of a grammar. As such, grammar induction has wide applicability across the sciences for both the parsimonious representation of data and the construction of predictive models. For example, one ubiquitous notion of grammar is the grammar of human language. In this sense, grammar induction is the process by which humans come to speak and understand language. However, all conceptualizations of grammar are theoretically connected. Thus results from one field are often applied in another, whether it be linguistics (Chomsky 1957), computer science (Hopcroft, Motwani, and Ullman 2001), or genetics (Gusfield 1997). Therefore, while the emphasis of this dissertation is grammar induction in human language, the following discussion draws from and contributes to theoretical results in a wide range of areas.
Throughout this history, a recurring finding is that different approaches yield drastically different results. A classic example of this phenomenon is the proof that learning a context free grammar, which is simpler than human grammar, is impossible even if the learner has an infinite amount of positive examples (Gold 1967). The implication is that since human languages are formally more complex (Partee, ter Meulen, and Wall 1993), human languages are unlearnable unless some knowledge is innate. This result is a formal cornerstone of "poverty of the stimulus" arguments that underlie the theory of Universal Grammar, which is a widely held concept in modern linguistics (Chomsky 1980). However, it has been shown (Horning 1969) that a probabilistic context free grammar can be learned by only positive examples. Thus a drastically different result is obtained from a slightly different approach, i.e. from associating each rule in the grammar with a probability distribution.
… of words. Thus, supervised training of probabilistic grammars has an expense that is both ongoing and never ending.
Unsupervised learning, on the other hand, removes the need for labeled data. In this paradigm, machine learning techniques are used to induce a grammar from an unlabeled corpus. This is somewhat the ultimate task in NLP, and a "largely unsolved problem" (Manning and Schütze 1999, page 387). In this kind of problem, there is no distinction between grammatical and ungrammatical for the learner to use, as exists in Chomskyan linguistics. Indeed, the notions of grammatical and ungrammatical have no meaning for a learner without innate knowledge of a grammar and a finite corpus (consider a human to be a source of infinite corpora). To approach the problem of unsupervised language learning, many researchers have turned to ideas in linguistics that predate the "Chomskyan Revolution."
One such notion is distributional analysis, championed by Harris (1954). The key idea here is that the function of a word may be known if it can be substituted for another word. If so, both words have the same function. Perhaps the earliest computational implementation of this idea is Kiss (1973), a psychologist who attempted unsupervised part of speech induction, which is essentially the task of inducing the terminal categories of a grammar. This work was not pursued until the relative abundance of cheap computers and statistical methods in the computational linguistics community in the 1990's. During that time, part of speech induction was attempted by Finch and Chater (1992), Redington, Chater, and Finch (1993), Schütze (1993), and Schütze (1995), amongst others.
… contexts. For this reason, usually the presence of the most frequent words was used as a context feature vector for defining substitutability. The other drawback is lexical ambiguity, i.e. that the same word can be used in different contexts with different meanings. In fact, the lexical ambiguity problem is even more difficult because of Zipf's law: the most frequent words have the greatest number of meanings, and thus the greatest ambiguity (Manning and Schütze 1999). It was not until Schütze (1995) that these drawbacks began to be satisfactorily addressed. At that time, it was uncommon to try to move beyond part of speech to probabilistic grammars, the main exception being Brill and Marcus (1992).
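To make the distributional strategy concrete, the sketch below builds context feature vectors from counts of the most frequent words appearing immediately to the left and right of each target word; words with similar vectors become candidates for the same induced category. This is an illustrative reconstruction rather than the code of any cited system, and the toy corpus and the number of context features are assumptions.

```python
from collections import Counter, defaultdict

corpus = "the dog chased the cat . the cat saw the dog . a dog bit a man .".split()

# The most frequent words serve as context features (left and right neighbors).
NUM_FEATURES = 5
frequent = [w for w, _ in Counter(corpus).most_common(NUM_FEATURES)]
feature_index = {w: i for i, w in enumerate(frequent)}

# Count how often each frequent word occurs to the left/right of each target word.
vectors = defaultdict(lambda: [0] * (2 * NUM_FEATURES))
for i, word in enumerate(corpus):
    if i > 0 and corpus[i - 1] in feature_index:
        vectors[word][feature_index[corpus[i - 1]]] += 1
    if i + 1 < len(corpus) and corpus[i + 1] in feature_index:
        vectors[word][NUM_FEATURES + feature_index[corpus[i + 1]]] += 1

# Words with similar context vectors are candidates for the same induced category.
for word in ("dog", "cat", "man"):
    print(word, vectors[word])
```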
Despite these early difficulties, researchers in the past five years have attempted unsupervised induction of full probabilistic grammars from part of speech tags (Clark 2001; Klein and Manning 2002; Klein and Manning 2004). These, like Brill and Marcus (1992), attempt to extend distributional analysis to the induction of syntactic structure. Also during this time, novel approaches have appeared, such as the graph theoretic approach of Solan et al. (2005) and the suffix tree approach of van Zaanen (2000). What distinguishes these approaches is not only their technique, but also their application. Indeed, all but Solan et al. (2005) are examples of strong induction, meaning that the models are transferable to other data sets. Solan et al. (2005), on the other hand, is an example of weak induction, where the only concern is to parsimoniously represent the given data. Induction and its strong and weak variations are covered in Section 2.2.1.
In all of these unsupervised grammar induction models, an important …
… addition of semantics improves performance (Coccaro and Jurafsky 1998; Bellegarda 2000; Deng and Khudanpur 2003). The commonality among these models is the use of latent semantic analysis (LSA) (Deerwester et al. 1990; Landauer, Foltz, and Laham 1998) to create a semantic representation of both words and collections of words in a vector space. These models have typically incorporated this information into their language models using a "factored model" approach, whereby information from separate semantic and syntactic models is combined into a single probability function.
However, most of these semantic models are extremely simplistic and derivative of the original model (Bellegarda 1998; Bellegarda 2000), where the similarity between a word and all the text that came before it (the context) is the basis of the semantic model. This approach is simplistic for a number of reasons. First, it does not account for word order in the preceding material. Therefore "John called Mary …" and "Mary called John …" are equivalent contexts and do not allow us to choose between alternate conclusions like "today" and "yesterday." Second, larger and larger contexts created by LSA converge to a single value (Hu et al. In press). This implies that the meanings of words in the context are lost as the context increases. Despite these weaknesses in previous semantic modeling, the researchers above reported improvements using factored semantic models over the basic probabilistic grammar. This convergence of results shows that semantic information can improve supervised probabilistic grammars, and it suggests that semantic information can improve unsupervised probabilistic grammars as well.
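The word-order criticism can be made concrete with a minimal sketch of the kind of semantic model just described: a context is represented as the sum of the vectors of its words, and a candidate next word is scored by cosine similarity to that sum. The three-dimensional "semantic space" below is invented purely for illustration; a real model would use vectors produced by LSA.

```python
import numpy as np

# Toy word vectors standing in for an LSA space (assumed, for illustration only).
space = {
    "john": np.array([0.9, 0.1, 0.0]),
    "mary": np.array([0.8, 0.2, 0.1]),
    "called": np.array([0.1, 0.9, 0.2]),
    "today": np.array([0.0, 0.3, 0.9]),
    "yesterday": np.array([0.1, 0.2, 0.8]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def context_vector(words):
    # Bag-of-words sum: word order is lost by construction.
    return sum(space[w] for w in words)

for context in (["john", "called", "mary"], ["mary", "called", "john"]):
    ctx = context_vector(context)
    scores = {w: round(cosine(ctx, space[w]), 3) for w in ("today", "yesterday")}
    print(" ".join(context), "->", scores)
# Both contexts produce identical vectors, so the candidate continuations
# "today" and "yesterday" receive exactly the same scores either way.
```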
… before, although there have been attempts to apply word order post-hoc to LSA (Wiemer-Hastings and Zipitria 2001). A straightforward notion of incorporating word order into LSA is to use n-grams (see Section 1.5) instead of individual words. In this way a bigram, unigram, and trigram would each have an atomic vector representation and be directly comparable.
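A minimal sketch of this n-gram scheme: rows of the matrix are unigrams and bigrams, columns are sentences, and each cell is an occurrence count, so that a bigram such as "called mary" receives its own row vector directly comparable to unigram rows. The toy sentences are assumptions for illustration; a real space would be built from a large corpus before applying SVD.

```python
import numpy as np

sentences = [
    "john called mary today",
    "mary called john yesterday",
    "john saw mary",
]

def terms(sentence):
    words = sentence.split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams

vocabulary = sorted({t for s in sentences for t in terms(s)})
row = {t: i for i, t in enumerate(vocabulary)}

# Term (unigram and bigram) by sentence count matrix.
matrix = np.zeros((len(vocabulary), len(sentences)))
for j, s in enumerate(sentences):
    for t in terms(s):
        matrix[row[t], j] += 1

print(matrix.shape)                            # (unigram + bigram types, sentences)
print(row["called mary"], row["mary called"])  # distinct rows, so order is preserved
```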
It may seem counterintuitive that such an n-gram scheme has never been used in conjunction with LSA. Simple as this scheme may be, it quickly falls prey to memory limitations of modern day computers for computing singular value decomposition (SVD), the key step in LSA. The key reference for SVD is Berry (1992)'s SVDPACK, whose single vector Lanczos recursion method with re-orthogonalization was incorporated into the BellCore LSI tools. Subsequently, either SVDPACK or the LSI tools were used by several researchers (Schütze 1995; Landauer and Dumais 1997; Landauer, Foltz, and Laham 1998; Coccaro and Jurafsky 1998; Foltz, Kintsch, and Landauer 1998; Bellegarda 2000; Deng and Khudanpur 2003; Olney and Cai 2005b; Olney and Cai 2005a). All of the previous research described in Chapter 2 derives from this common lineage. Using the equation reported in Larsen (1998), a standard orthogonal SVD of a unigram/bigram by sentence matrix of the LSA Touchstone Applied Science Associates Corpus (Landauer, Foltz, and Laham 1998) requires over 60 gigabytes of memory. This estimate is prohibitive for all but current supercomputers.
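Larsen's (1998) formula is not reproduced here, so the figure below is only a rough illustration of why the estimate lands in the tens of gigabytes: an orthogonal Lanczos run with re-orthogonalization must keep dense vectors over the full row and column dimensions. The matrix dimensions and vector count are assumptions chosen to be of the same order as a unigram/bigram-by-sentence TASA matrix, not the actual counts.

```python
# Rough memory estimate for storing dense vectors during an orthogonal
# Lanczos SVD with re-orthogonalization (illustrative assumptions only).
rows = 5_000_000        # assumed number of unigram + bigram types
cols = 750_000          # assumed number of sentences
kept_vectors = 1_000    # Lanczos/singular vectors retained for re-orthogonalization
bytes_per_float = 8

memory_bytes = (rows + cols) * kept_vectors * bytes_per_float
print(memory_bytes / 2**30, "GiB")   # roughly 43 GiB under these assumptions
```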
1.2 Problem Statement
This dissertation examines how latent semantic information can enrich unsupervised grammar induction. It is conjectured that by incorporating latent semantic information and making it sensitive to word order, performance on parsing and meaning judgments will improve.
1.3 Purpose of the Study
The purpose of this dissertation is to investigate the utility of latent semantic information for unsupervised grammar induction. Utility can be directly measured by the relative performance of induced latent semantic grammars in parsing sentences and by the relative performance of induced grammars on making meaning judgments. Success can be measured by the extent to which such latent semantic grammars outperform respective baselines in parsing and comparative meaning tasks. However, success is not limited to a simple engineering task of increasing performance. Successful unsupervised latent semantic grammar induction has a number of important theoretical and practical implications, which we briefly outline in turn.
1.3.1 Theoretical Implications
… simulation to provide evidence for or against a linguistic theory is commonly accepted in the field of cognitive science (Green 1996). Note, however, that the argument is not that latent semantic grammar induction is a theory of human cognition, but rather that latent semantic grammar induction may provide empirical evidence relevant to linguistic theory.
1.3.1.1 Poverty of the Stimulus Argument. This argument asserts that much knowledge of language is innate, because the linguistic information humans are exposed to is insufficient for learning language. Poverty of the stimulus arguments are a cornerstone supporting both notions of innate grammar and universal grammar. In turn, innateness and universal grammar are central to Chomskyan linguistics, a prominent school of linguistics during the past fifty years. The relative success of unsupervised latent semantic grammar induction could provide evidence on what knowledge of language is innate. In the extreme case, successful unsupervised grammar induction that incorporates meaning on a computer would imply that humans need no innate knowledge. Intermediate levels of success would provide evidence that some linguistic knowledge is innate. Moreover, such intermediate success could help clarify the kinds of innate knowledge that are most effective for grammar induction. These notions are further discussed in Section 2.2.3.
1.3.1.2 The utility of "grammatical" as a formal linguistic notion. Linguists have argued a competence/performance distinction, by which grammaticality judgments are more important than language use (Haegeman 1991).
J.K. Rowling has written many books.
*J.K. Rowling many books written has.
However, this notion is highly dubious when given further consideration. Consider the following "grammatical" sentences:
Once that that Joe had quit was obvious, we decided to leave.
Mike I think Phil said Sam guessed Sally kissed.
Colorless green ideas sleep furiously.
For Haegeman (1991) and similarly minded linguists, all of the above sentences are grammatical, and their unacceptability is interpreted as evidence for the competence/performance distinction. However, while the first two are difficult to process, the third is easy to process but semantically incoherent. This distinction is preserved in Roger Brown's work, where parents give feedback on semantic flaws but not grammatical flaws in children's sentences (Pinker 1998). Brown's work suggests that forcing all three sentences into the same category, "grammatical," is incorrect.
1.3.1.3 The reality of hidden syntactic structure. It is common for hidden syntactic structures to be posited in traditional linguistics. For example, parts of speech form equivalence classes such that every word belongs to one or more parts of speech, and higher order units such as phrases combine parts of speech. The latent semantic grammars created in this dissertation have no hidden structure. Therefore the relative success of these grammars may provide evidence for or against hidden syntactic structure. For example, if a latent semantic grammar could correctly parse a sentence without having a notion of part of speech, then it would provide some evidence against the theoretical reality of part of speech. Likewise, if a latent semantic grammar could not parse a sentence without the notion of part of speech, then it would provide some evidence for the theoretical reality of part of speech.
1.3.1.4 Syntax as independent from semantics. The separation between syntax and semantics has been theorized by some linguists (Chomsky 1993). In such theories, semantics is applied after a syntactic structure is created. Since this dissertation will fuse syntax and semantics into a single grammar, improved parsing performance by that grammar would suggest at the very least that lexical semantics is likely to be an integral part of syntax.
1.3.2 Practical Implications
… we can guarantee improved performance of current approaches just by creating more labeled data for supervised training. However, this practical problem has enormous time constraints. For example, the Penn Treebank took six years to build parse trees for 4.5 million words. Also, every new language domain has a different distribution of words, which means there must be a specialized language model. If, as this dissertation proposes, a computer can successfully and without supervision induce a probabilistic grammar, then the need for labeled creation of data disappears. Even marginally successful results are worthwhile, because it takes significantly less time for a human to check a labeled parse tree than it does to create it from scratch (Marcus, Marcinkiewicz, and Santorini 1993).
1.3.2.2 The development of latent semantic grammars for future research. Latent semantic analysis has been an extremely useful tool for computational approaches to meaning (Landauer et al. In press). However, LSA's insensitivity to word order has long been recognized as a weakness (Wiemer-Hastings and Zipitria 2001). The development of latent semantic grammars that put word order back into LSA would extend the applicability of LSA to new problem areas and widen the research effort that uses LSA. It is possible that latent semantic grammar induction will have applicability to bioinformatics in much the same way that bioinformatics led to the natural language grammar induction scheme Alignment Based Learning (van Zaanen 2000). The two are highly analogous problems.
1.3.2.3 The successful use of non-orthogonal Lanczos recursion for
computational linguistics tasks. The computational complexity of latent …
… recursion can be shown to be effective for the grammar induction task outlined in this dissertation, then it is likely that non-orthogonal methods will be useful in other areas as well, since all are reliant on the representation of "meaning." Moreover, the research software products resulting from this dissertation could be used widely in future studies, much in the way that Berry (1992) was widely used for LSA research and incorporated into the BellCore LSI tools.
1.3.3 Research Questions
This dissertation examines two questions:
1. Can latent semantic analysis enhance unsupervised grammar induction, as measured by parsing performance?
2. Can latent semantic parse trees enhance performance on comparative meaning tasks?
1.4 Delimitations
There are three delimitations that narrow the scope of tests in this dissertation:
1. Generalizing meaning judgment performance on the Microsoft Research Paraphrase Corpus (Dolan, Quirk, and Brockett 2004) to all other corpora. This corpus does not have the size or the coverage of topics to be statistically representative of all meaning-judgment corpora. However, since the corpus is drawn from newswire text, it is as general as can be practically expected.
2. Generalizing parsing performance on the Penn Treebank to all other …
… representative of all treebank corpora. Again, since the corpus is drawn from newswire text, it is as general as can be practically expected.
3. The application of non-orthogonal Lanczos recursion (Cullum and Willoughby 2002) to the creation of large latent semantic spaces. Non-orthogonality may alter the properties of latent semantic analysis in an unknown way.
1.5 Operational Definitions
Induction. The assignment of structure to data.
Grammar. A formal representation of structure, often defined by a set of rules that rewrite symbols.
Probabilistic grammar. A grammar in which each rewrite rule is given a probability, such that all probabilities for the expansion of a symbol sum to one.
Parse. An assignment of structure to a sentence, according to a grammar.
Corpus. A collection of textual data.
Treebank. A corpus containing parse trees of sentences.
N-gram. A unit of analysis consisting of n words in order, often a unigram, bigram, or trigram.
Unsupervised learning. Learning without examples or feedback.
Singular value decomposition. The computation of an optimal reduced representation of a matrix and the key step of latent semantic analysis (a short sketch follows these definitions).
Lanczos recursion. An algorithm for computing singular value decomposition that optionally enforces orthogonality amongst the singular vectors.
Distributional analysis. A method for determining the function of a word or phrase by examining substitutable words or phrases.
Factored model. A model composed of multiple submodels whose outputs are blended to give a single prediction.
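As a concrete illustration of the singular value decomposition referenced above, the sketch below computes a rank-2 truncated SVD of a small random term-by-document matrix with SciPy. It illustrates only the operation itself, not the non-orthogonal Lanczos variant used later in this dissertation; the matrix and the choice of k are assumptions.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
matrix = rng.random((20, 8))   # toy term-by-document counts

# Truncated SVD: keep the k largest singular values/vectors (the LSA step).
k = 2
U, S, Vt = svds(matrix, k=k)

reduced_terms = U * S          # k-dimensional term vectors
reduced_docs = Vt.T * S        # k-dimensional document vectors
print(reduced_terms.shape, reduced_docs.shape, S)
```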
1.6 Looking Ahead
This chapter has presented an overview of unsupervised latent semantic grammar induction. The history of grammar induction, with respect to human language, was briefly summarized. Likewise, previous attempts at incorporating latent semantic information into factored supervised models were described. Based on the improved performance of factored supervised models, it is conjectured that latent semantic information will enhance unsupervised models of grammar induction. The incorporation of latent semantic information requires new applications of non-orthogonal Lanczos recursions to bring word order into latent semantic analysis. The purpose of this dissertation is to investigate the utility of latent semantic information for unsupervised grammar induction. Utility can be directly measured by the relative performance of induced grammars in parsing sentences and by the relative performance of induced grammars on making meaning judgments.
CHAPTER 2
PREVIOUS WORK
The grand aim of all science is to cover the greatest number of empirical facts by logical deduction from the smallest possible number of hypotheses or axioms.
Einstein
2.1 Introduction
This chapter reviews previous work that motivates this dissertation of latent semantic grammar induction. This previous work consists of theoretical foundations of induction, grammar, and grammar induction. Once those …
2.2 Theoretical Foundations
2.2.1 Induction
The problem of grammar induction is a particular formulation of a central problem in science. That central problem is to find a reduced representation that characterizes a body of data. This is the definition of induction that will be used throughout this work.
In the terminology of set theory, the problem may be stated as a means of defining all elements of a set without enumerating them. The value of a reduced representation over enumeration quickly becomes apparent when one considers infinite sets. To enumerate the infinity of members in such a set would take an infinite amount of time.
Not all reduced descriptions are equally useful. For communication purposes, it may be acceptable to use a reduced description that is not constructive. For example, the set defined in Equation 2.1 is infinite, and is clearly communicated as such, yet there is no constructive way given of creating more elements.
{x : 1, 2, 3, …}   (2.1)
Likewise, in Equation 2.2, although the set is finite, there is no method presented for finding the specific primes. Both of these examples show that although mathematicians can (and often do) use such representations for definitional purposes, these representations are not particularly useful for …
{x : 0 < x < 2^n, x is prime}   (2.2)
A reduced representation may be used in different ways, each giving it a different ontological status. For example, an engineer working in data compression might desire to create a reduced representation of a data set for the purpose of storage or transport. In this instance, the reduced description is a closed system: it is not intended to describe anything outside of the given data set. Contrast this with a more common situation in science, where the data set consists of a limited number of observations. In this instance, the scientist usually desires to make a claim about an entire (perhaps infinite) set of elements based on a limited sample.
The principal difference between these two scenarios is that in the first, the engineer is supplied with the entire data set, and in the second, the scientist is given (or acquires) an incomplete data set. Therefore the scientist cannot create a reduced representation with the certainty that the engineer can, because it is always possible that a reduced description based on the current set of observations fails to generalize to unobserved cases. This is the more classic notion of induction, which is characterized by uncertainty (Thagard 1999). We introduce the terms weak induction and strong induction to refer to the cases where the data set is complete and where the data set is limited, respectively. This choice of names parallels the distinction made in Chomskyan linguistics between strong and weak generative capacity (Chomsky 1965). Weak generative capacity describes the ability to generate a set of sentences, and so it closely parallels weak
induction. Strong generative capacity, on the other hand, describes the ability to assign structure to a set of sentences, and so it closely parallels strong induction. Previous work on human language grammar induction has used different metrics to test strong and weak induction, respectively. From the standpoint of strong induction, a natural test is to test parsing performance on a set of treebanked sentences, which have been manually labeled by experts. Thus the evaluation standard is to match human performance in assigning structure to data. Parsing performance, however, is not always straightforward to evaluate. For context free grammars, discussed in Section 2.2.2, the PARSEVAL measures (Manning and Schütze 1999) have been widely used to evaluate the hierarchical bracketing of parse trees. The three measures that PARSEVAL uses are:
Precision. The ratio of correct brackets to the number of brackets in the candidate parse.
Recall. The ratio of correct brackets to the number of brackets in the gold standard parse.
Crossing Brackets. The average number of brackets in the candidate parse that overlap constituent boundaries in the gold parse.
An illustration of these measures is given below.
• A gold standard parse:
( ( (Champagne) (and) (dessert)) ( (followed)))
• A candidate parse:
( (Champagne (and dessert)) (followed))
Precision: 3/4
Recall: 3/7
Crossing Brackets: 1
Parsing performance is much easier to measure for dependency grammars, which are further discussed in Section 2.2.2. A dependency parse consists of a set of ordered pairs of words such that a tree is formed. In this case parsing performance is easily measured by counting the proportion of ordered pairs in common between the parse tree and the correct tree. Lastly, weak induction has traditionally been measured using cross entropy or perplexity. These measure a model's uncertainty in predicting the next word, where uncertainty is expressed in the average number of bits required to specify that word. Members of the speech recognition community, where perplexity and its logarithmic analogue cross entropy are regularly used to evaluate performance, may object to the implication here that these measures do not adequately measure generalization.² It is true that these measures can be correlated with generalization; however, it is also the case that a model can have a perfect cross entropy score and no ability to generalize (Charniak 1993). A concrete example is the status given to "in the" by a model. Because "the" often follows "in", a model may lower its cross entropy by chunking "in the" as a phrase. However, parsing "in the" as a complete phrase can never be correct. Because cross entropy and perplexity most closely measure the match between the induced model and the training data, they can only be taken as evidence for weak induction. Because parsing performance is a direct measure of generalization, it can be taken as evidence for strong induction.
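The bracketing example above can be reproduced mechanically: represent each parse as a multiset of spans and compute precision and recall as ratios of matched brackets. The span indices below are my hand-coding of the Champagne example, so this is an illustration of the measures rather than a reference implementation of PARSEVAL.

```python
from collections import Counter

# Brackets encoded as (start, end) word spans over "Champagne and dessert followed".
gold = Counter([(0, 1), (1, 2), (2, 3), (0, 3), (3, 4), (3, 4), (0, 4)])
candidate = Counter([(1, 3), (0, 3), (3, 4), (0, 4)])

matched = sum((gold & candidate).values())   # multiset intersection of brackets
precision = matched / sum(candidate.values())
recall = matched / sum(gold.values())
print(f"precision = {matched}/{sum(candidate.values())} = {precision:.2f}")
print(f"recall    = {matched}/{sum(gold.values())} = {recall:.2f}")
```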
2. The speech recognition community uses these measures to compare different systems because it decouples the performance of the language model and the acoustic model. Additionally, since perplexity and word error rate are …
There are at least two more useful distinctions that can be made within strong induction. Given a set of observations and a reduced description of them, a scientist will naturally wish to use that reduced description to relate a new observation to that set. In set theoretic terms, we may describe the process as follows:
1. The scientist collects a set of observations X.
2. The scientist creates a reduced description of the set.
3. The reduced description defines a set Y such that X ⊆ Y.
4. Given a new observation z, the scientist desires to know:
Hard Membership. Is z ∈ Y?
Soft Membership. The distance between z and Y (the sketch below illustrates both).
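As indicated above, the distinction can be sketched in a few lines: hard membership is a yes/no test against the set defined by the reduced description, while soft membership returns a graded distance. The reduced description used here (multiples of three) is an arbitrary stand-in for a real induced model.

```python
# Reduced description of the observations: "multiples of three" (arbitrary example).
def in_set(z):
    return z % 3 == 0                      # hard membership: z is in Y or it is not

def distance_to_set(z):
    return min(z % 3, 3 - z % 3) / 3       # soft membership: scaled distance to Y

for z in (9, 10, 11):
    print(z, in_set(z), round(distance_to_set(z), 2))
```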
Hard membership is known as a decision problem in computability theory because it has only two possible outcomes (Dunne 1991). Using this approach, a scientist would only be concerned with two mutually exclusive possibilities, z ∈ Y and z ∉ Y. Historically, hard membership has been more strongly associated with symbolic induction. Furthermore, hard membership has a clear parallel with the binary Chomskyan notion of grammaticality: either a sentence belongs to the set of grammatical sentences or it does not. Soft membership, on the other hand, introduces a notion of distance. Given an interval [0, 1], one could define the extreme values to be equivalent to hard membership, leaving the middle range to represent varying degrees of partial membership. Soft membership is common in statistical approaches to induction. These two kinds of membership, though seemingly quite similar, have drastically different …
Figure 2.1. Kinds of Induction: induction divides into strong induction and weak induction, and strong induction divides into hard membership and soft membership.
2.2.1.1 Summary. In this section we have defined induction as the process of finding a reduced description of data. We argued that computer science is most interested in constructive reduced representations that can be found with an algorithm. We further refined the notion of induction in terms of strong and weak induction and showed how our definitions parallel Chomsky's definitions of strong and weak generative capacity. Since strong induction may be measured in terms of parsing performance, we described parsing performance metrics. Finally, we refined the notion of strong induction to distinguish between hard and soft membership. We recap the discussion of induction with the depiction in Figure 2.1. Further distinctions within weak induction are outside the scope of discussion: the focus of this work is strong induction.
2.2.2 Grammars
Induction has been defined in Section 2.2.1 in terms of reduced descriptions. Formal language grammars have been a historically popular form of reduced representation in computer science, largely because of the close relationship between formal languages and models of computation, such as the Turing Machine (Grune and Jacobs 1990; Hopcroft, Motwani, and Ullman 2001; Revesz 1991). Formal language grammars are therefore constructive reduced …
Formal languages and their grammars are related to a broader class of representational systems known as rewriting systems. Thus formal languages are of a similar vintage to Post Systems, Lindenmayer grammars, and production systems, amongst others (Salomaa 1985). A brief description of rewriting systems, of which formal languages are a particular form, will facilitate later discussion of the varieties of formal language grammars that are widely used in computational linguistics.
Definition
A rewriting system R is a two-tuple R = (V, F) where
• V is a finite set of elements a_i
• F is a finite set of ordered pairs (P, Q) such that each P and Q is composed of a sequence of a_i
In a rewriting system, V is frequently referred to as an alphabet, and each P and Q is referred to as a word composed of one or more a_i (compare a_i to letters of the alphabet V, and P and Q to words composed of letters). The set F of ordered pairs of words (P, Q) is known as the set of rewriting rules or productions, such that a given word P may be replaced by word Q by using production (P, Q). Intuitively, the application of a sequence of productions to a
given word results in a derivation that transforms some initial word Q_initial through a number of intermediate words to Q_final.
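A derivation in this sense can be sketched directly: repeatedly replace an occurrence of P with Q using productions (P, Q) until no production applies. The toy alphabet and rules below are assumptions chosen purely for illustration.

```python
# A toy rewriting system: V = {a, b, c}; F given as (P, Q) pairs over words in V.
productions = [("ab", "ba"), ("ba", "c")]

def derive(word, max_steps=10):
    steps = [word]
    for _ in range(max_steps):
        for p, q in productions:
            if p in word:
                word = word.replace(p, q, 1)   # apply one production
                steps.append(word)
                break
        else:
            break                              # no production applies: derivation ends
    return steps

print(derive("aabb"))   # the initial word is rewritten step by step to a final word
```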
Definition
Trang 36e V, is a finite set of nonterminal letters of the alphabet e V, is a finite set of terminal letters of the alphabet e Sis a distinguished start letter of the alphabet
e F isa finite set of ordered pairs, (P,Q) such that each P and Q is composed of a sequence of V,, U V4
In the definition of a formal language grammar, one can see that the rewriting system is R = (V_n ∪ V_t, F), and so the only new refinements being made in a formal language grammar are the distinction between terminal and nonterminal alphabets and a distinguished start letter, S. When the rewriting rules are associated with a probability Prob(P, Q), such that all probabilities of rewriting a distinct P_i sum to one, the formal language grammar is probabilistic. This constraint is more precisely expressed in Equation 2.3.
∀i  Σ_j Prob(P_i, Q_j) = 1   (2.3)
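Equation 2.3 simply says that, for each left-hand side, the probabilities of its alternative rewrites form a distribution. A minimal check of that constraint is sketched below; the toy grammar is an assumption for illustration.

```python
from collections import defaultdict

# Rewrite rules as (P, Q, probability); probabilities for each P must sum to one.
rules = [
    ("S", ("NP", "VP"), 1.0),
    ("NP", ("dog",), 0.6),
    ("NP", ("cat",), 0.4),
    ("VP", ("barks",), 1.0),
]

totals = defaultdict(float)
for lhs, _, prob in rules:
    totals[lhs] += prob

assert all(abs(total - 1.0) < 1e-9 for total in totals.values()), totals
print(dict(totals))   # every left-hand side sums to 1, satisfying Equation 2.3
```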
Figure 2.2. The Chomsky Hierarchy: regular (Type 3) grammars are contained in context free (Type 2) grammars, which are contained in context sensitive (Type 1) grammars, which are contained in recursively enumerable (Type 0) grammars; the corresponding automata are finite state automata, pushdown automata, linear bounded automata, and Turing machines.
… state automaton, which has a finite memory. Recursively enumerable grammars, on the other hand, require an infinite amount of memory in the form of a bidirectional tape. Since the finite state automaton is less complex than the other automata, it can be simulated by those automata. Thus, in Figure 2.2, less complex grammars can be seen as restricted instances of their enclosing grammars, just as regular grammars (RG) can be seen as a restricted instance of context free grammars (CFG).
Trang 38of English, CFG are actually a very good match, and efforts to show English is not context free have not been successful (Partee, ter Meulen, and Wall 1993) The
reason that English requires a CFG rather than a RG is easily seen by the existence
of long distance dependencies Consider the long distance, subject-verb agreement dependency below:
The man that Bill saw is Phil.
The man that Bill who won the lottery saw is Phil.
Chomskyan linguistics asserts that both these sentences are grammatical, and thus sentences of English. In fact, Chomskyan linguistics asserts that an infinite number of embedded clauses between a subject and its verb are permissible. In formal grammar terms, that means that an automaton must remember the subject so that it can check that its verb agrees in number, to rule out sentences like this:
*The man that Bill who won the lottery saw are Phil
Trang 39of the model When RG are used in this way, they usually do so under the name of n-gram, where n is analogous to the amount of memory used in words, i.e a bigram model uses a history of two words
CFG are particularly desirable for assigning hierarchical structure to a sentence, as is done in parsing. To better understand why, it is instructive to look at the differences in rewriting rules between a RG and a CFG. These differences are most apparent when one considers the Chomsky Normal Form of the rewriting rules, a notational convenience that does not alter the expressive power of formal language grammars (Revesz 1991).
Definition
A formal language grammar is in Chomsky Normal Form when its rewriting rules F have the following form:
• Regular Grammars
  – X → a where X ∈ V_n, a ∈ V_t
  – X → aZ where X, Z ∈ V_n, a ∈ V_t
• Context Free Grammars
  – X → a where X ∈ V_n, a ∈ V_t
  – X → YZ where X, Y, Z ∈ V_n
Trang 40Figure 2.3 Regular Grammar Branching a oO Z x yY P q Figure 2.4
Context Free Grammar Branching
… different branching properties. RG are uniformly right branching, as shown in Figure 2.3. CFG, on the other hand, are capable of multiple branching at each level of the tree, because CFG have two nonterminals on the right hand side of their branching rewriting rule, as shown in Figure 2.4. Because human languages can intuitively have multiple branchings at each level of structure, CFG are preferred to RG for describing parses of sentences.
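The two branching patterns can be seen directly by writing derivations as nested tuples: regular rules of the form X → aZ only ever extend the structure to the right, while context free rules X → YZ can branch on both sides. The tiny trees below are assumptions for illustration.

```python
# Right-branching structure produced by regular rules X -> a Z:
regular_tree = ("a", ("b", ("c", ("d",))))

# Multiple branching produced by context free rules X -> Y Z:
cfg_tree = (("the", "dog"), ("chased", ("the", "cat")))

def depth(tree):
    if isinstance(tree, str):
        return 0
    return 1 + max(depth(child) for child in tree)

# The regular tree is deeper for the same number of words and is uniformly
# right-branching, while the CFG tree balances structure on both sides.
print(depth(regular_tree), depth(cfg_tree))
```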
A common augmentation of CFGs in computational linguistics is
lexicalization. In a lexicalized CFG, each nonterminal node is labeled with its