
manning schuetze statisticalnlp part 1 pdf




DOCUMENT INFORMATION

Foundations of Statistical Natural Language Processing

Christopher D. Manning
Hinrich Schütze

The MIT Press
Cambridge, Massachusetts
London, England

Second printing, 1999
© 1999 Massachusetts Institute of Technology
Second printing with corrections, 2000

All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

Typeset in 10/13 Lucida Bright by the authors using LaTeX 2e. Printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Information

Manning, Christopher D.
Foundations of statistical natural language processing / Christopher D. Manning, Hinrich Schütze.
p. cm. Includes bibliographical references (p. ) and index.
ISBN 0-262-13360-1
1. Computational linguistics - Statistical methods. I. Schütze, Hinrich. II. Title.
P98.5.S83M36 1999
410'.285-dc21 99-21137 CIP

Brief Contents

I Preliminaries
  1 Introduction
  2 Mathematical Foundations
  3 Linguistic Essentials
  4 Corpus-Based Work

II Words
  5 Collocations
  6 Statistical Inference: n-gram Models over Sparse Data
  7 Word Sense Disambiguation
  8 Lexical Acquisition

III Grammar
  9 Markov Models
  10 Part-of-Speech Tagging
  11 Probabilistic Context Free Grammars
  12 Probabilistic Parsing

IV Applications and Techniques
  13 Statistical Alignment and Machine Translation
  14 Clustering
  15 Topics in Information Retrieval
  16 Text Categorization

Contents

List of Tables
List of Figures
Table of Notations
Preface
Road Map

I Preliminaries

1 Introduction
  1.1 Rationalist and Empiricist Approaches to Language
  1.2 Scientific Content
    1.2.1 Questions that linguistics should answer
    1.2.2 Non-categorical phenomena in language
    1.2.3 Language and cognition as probabilistic phenomena
  1.3 The Ambiguity of Language: Why NLP Is Difficult
  1.4 Dirty Hands
    1.4.1 Lexical resources
    1.4.2 Word counts
    1.4.3 Zipf's laws
    1.4.4 Collocations
    1.4.5 Concordances
  1.5 Further Reading
  1.6 Exercises

2 Mathematical Foundations
  2.1 Elementary Probability Theory
    2.1.1 Probability spaces
    2.1.2 Conditional probability and independence
    2.1.3 Bayes' theorem
    2.1.4 Random variables
    2.1.5 Expectation and variance
    2.1.6 Notation
    2.1.7 Joint and conditional distributions
    2.1.8 Determining P
    2.1.9 Standard distributions
    2.1.10 Bayesian statistics
    2.1.11 Exercises
  2.2 Essential Information Theory
    2.2.1 Entropy
    2.2.2 Joint entropy and conditional entropy
    2.2.3 Mutual information
    2.2.4 The noisy channel model
    2.2.5 Relative entropy or Kullback-Leibler divergence
    2.2.6 The relation to language: Cross entropy
    2.2.7 The entropy of English
    2.2.8 Perplexity
    2.2.9 Exercises
  2.3 Further Reading

3 Linguistic Essentials
  3.1 Parts of Speech and Morphology
    3.1.1 Nouns and pronouns
    3.1.2 Words that accompany nouns: Determiners and adjectives
    3.1.3 Verbs
    3.1.4 Other parts of speech
  3.2 Phrase Structure
    3.2.1 Phrase structure grammars
    3.2.2 Dependency: Arguments and adjuncts
    3.2.3 X' theory
    3.2.4 Phrase structure ambiguity
  3.3 Semantics and Pragmatics
  3.4 Other Areas
  3.5 Further Reading
  3.6 Exercises

4 Corpus-Based Work
  4.1 Getting Set Up
    4.1.1 Computers
    4.1.2 Corpora
    4.1.3 Software
  4.2 Looking at Text
    4.2.1 Low-level formatting issues
    4.2.2 Tokenization: What is a word?
    4.2.3 Morphology
    4.2.4 Sentences
  4.3 Marked-up Data
    4.3.1 Markup schemes
    4.3.2 Grammatical tagging
  4.4 Further Reading
  4.5 Exercises

II Words

5 Collocations
  5.1 Frequency
  5.2 Mean and Variance
  5.3 Hypothesis Testing
    5.3.1 The t test
    5.3.2 Hypothesis testing of differences
    5.3.3 Pearson's chi-square test
    5.3.4 Likelihood ratios
  5.4 Mutual Information
  5.5 The Notion of Collocation
  5.6 Further Reading

6 Statistical Inference: n-gram Models over Sparse Data
  6.1 Bins: Forming Equivalence Classes
    6.1.1 Reliability vs. discrimination
    6.1.2 n-gram models
    6.1.3 Building n-gram models
  6.2 Statistical Estimators
    6.2.1 Maximum Likelihood Estimation (MLE)
    6.2.2 Laplace's law, Lidstone's law and the Jeffreys-Perks law
    6.2.3 Held out estimation
    6.2.4 Cross-validation (deleted estimation)
    6.2.5 Good-Turing estimation
    6.2.6 Briefly noted
  6.3 Combining Estimators
    6.3.1 Simple linear interpolation
    6.3.2 Katz's backing-off
    6.3.3 General linear interpolation
    6.3.4 Briefly noted
    6.3.5 Language models for Austen
  6.4 Conclusions
  6.5 Further Reading
  6.6 Exercises

7 Word Sense Disambiguation
  7.1 Methodological Preliminaries
    7.1.1 Supervised and unsupervised learning
    7.1.2 Pseudowords
    7.1.3 Upper and lower bounds on performance
  7.2 Supervised Disambiguation
    7.2.1 Bayesian classification
    7.2.2 An information-theoretic approach
  7.3 Dictionary-Based Disambiguation
    7.3.1 Disambiguation based on sense definitions
    7.3.2 Thesaurus-based disambiguation
    7.3.3 Disambiguation based on translations in a second-language corpus
    7.3.4 One sense per discourse, one sense per collocation
  7.4 Unsupervised Disambiguation
  7.5 What Is a Word Sense?
  7.6 Further Reading
  7.7 Exercises

8 Lexical Acquisition
  8.1 Evaluation Measures
  8.2 Verb Subcategorization
  8.3 Attachment Ambiguity
    8.3.1 Hindle and Rooth (1993)
    8.3.2 General remarks on PP attachment
  8.4 Selectional Preferences
  8.5 Semantic Similarity
    8.5.1 Vector space measures
    8.5.2 Probabilistic measures
  8.6 The Role of Lexical Acquisition in Statistical NLP
  8.7 Further Reading

III Grammar

9 Markov Models
  9.1 Markov Models
  9.2 Hidden Markov Models
    9.2.1 Why use HMMs?
    9.2.2 General form of an HMM
  9.3 The Three Fundamental Questions for HMMs
    9.3.1 Finding the probability of an observation
    9.3.2 Finding the best state sequence
    9.3.3 The third problem: Parameter estimation
  9.4 HMMs: Implementation, Properties, and Variants
    9.4.1 Implementation
    9.4.2 Variants
    9.4.3 Multiple input observations
    9.4.4 Initialization of parameter values
  9.5 Further Reading

10 Part-of-Speech Tagging
  10.1 The Information Sources in Tagging
  10.2 Markov Model Taggers
    10.2.1 The probabilistic model
    10.2.2 The Viterbi algorithm
    10.2.3 Variations
  10.3 Hidden Markov Model Taggers
    10.3.1 Applying HMMs to POS tagging
    10.3.2 The effect of initialization on HMM training
  10.4 Transformation-Based Learning of Tags
    10.4.1 Transformations
    10.4.2 The learning algorithm
    10.4.3 Relation to other models
    10.4.4 Automata
    10.4.5 Summary
  10.5 Other Methods, Other Languages
    10.5.1 Other approaches to tagging
    10.5.2 Languages other than English
  10.6 Tagging Accuracy and Uses of Taggers
    10.6.1 Tagging accuracy
    10.6.2 Applications of tagging
  10.7 Further Reading
  10.8 Exercises

11 Probabilistic Context Free Grammars
  11.1 Some Features of PCFGs
  11.2 Questions for PCFGs
  11.3 The Probability of a String
    11.3.1 Using inside probabilities
    11.3.2 Using outside probabilities
    11.3.3 Finding the most likely parse for a sentence
    11.3.4 Training a PCFG
  11.4 Problems with the Inside-Outside Algorithm
  11.5 Further Reading
  11.6 Exercises

12 Probabilistic Parsing
  12.1 Some Concepts
    12.1.1 Parsing for disambiguation
    12.1.2 Treebanks
    12.1.3 Parsing models vs. language models
    12.1.4 Weakening the independence assumptions of PCFGs
    12.1.5 Tree probabilities and derivational probabilities
    12.1.6 There's more than one way to do it
    12.1.7 Phrase structure grammars and dependency grammars
    12.1.8 Evaluation
    12.1.9 Equivalent models
    12.1.10 Building parsers: Search methods
    12.1.11 Use of the geometric mean
  12.2 Some Approaches
    12.2.1 Non-lexicalized treebank grammars
    12.2.2 Lexicalized models using derivational histories
    12.2.3 Dependency-based models
    12.2.4 Discussion
  12.3 Further Reading
  12.4 Exercises

IV Applications and Techniques

13 Statistical Alignment and Machine Translation
  13.1 Text Alignment
    13.1.1 Aligning sentences and paragraphs
    13.1.2 Length-based methods
    13.1.3 Offset alignment by signal processing techniques
    13.1.4 Lexical methods of sentence alignment
    13.1.5 Summary
    13.1.6 Exercises
  13.2 Word Alignment
  13.3 Statistical Machine Translation
  13.4 Further Reading

14 Clustering
  14.1 Hierarchical Clustering
    14.1.1 Single-link and complete-link clustering
    14.1.2 Group-average agglomerative clustering
    14.1.3 An application: Improving a language model
    14.1.4 Top-down clustering
  14.2 Non-Hierarchical Clustering
    14.2.1 K-means
    14.2.2 The EM algorithm
  14.3 Further Reading
  14.4 Exercises

15 Topics in Information Retrieval
  15.1 Some Background on Information Retrieval
    15.1.1 Common design features of IR systems
    15.1.2 Evaluation measures
    15.1.3 The probability ranking principle (PRP)
  15.2 The Vector Space Model
    15.2.1 Vector similarity
    15.2.2 Term weighting
  15.3 Term Distribution Models
    15.3.1 The Poisson distribution
    15.3.2 The two-Poisson model
    15.3.3 The K mixture
    15.3.4 Inverse document frequency
    15.3.5 Residual inverse document frequency
    15.3.6 Usage of term distribution models
  15.4 Latent Semantic Indexing
    15.4.1 Least-squares methods
    15.4.2 Singular Value Decomposition
    15.4.3 Latent Semantic Indexing in IR
  15.5 Discourse Segmentation
    15.5.1 TextTiling
  15.6 Further Reading
  15.7 Exercises

16 Text Categorization
  16.1 Decision Trees
  16.2 Maximum Entropy Modeling
    16.2.1 Generalized iterative scaling
    16.2.2 Application to text categorization
  16.3 Perceptrons
  16.4 k Nearest Neighbor Classification
  16.5 Further Reading

Tiny Statistical Tables
Bibliography
Index

List of Tables (excerpt)

  1.1 Common words in Tom Sawyer ...
  ... preposition, punctuation and symbol tags
  5.1 Finding Collocations: Raw Frequency
  5.2 Part of speech tag patterns for collocation filtering
  5.3 Finding Collocations: Justeson and Katz' part-of-speech filter
  5.4 The nouns w occurring most often ...
  6.7 Extracts from the frequencies of frequencies distribution for bigrams and trigrams in the Austen corpus
  6.8 Good-Turing estimates for bigrams: Adjusted frequencies and probabilities
  6.9 Good-Turing bigram frequency ...
  ... selected verbs
  12.3 Selected common expansions of NP as Subject vs Object, ordered by log odds ratio
  12.4 Selected common expansions of NP as first and second object inside VP
  12.5 Precision and recall evaluation results for PP attachment errors for different styles of phrase structure
  12.6 Comparison of some statistical parsing systems
  13.1 Sentence alignment ...
  15.3 Three quantities that ...
  ... co-occurrence in computing content similarity
  15.9 The matrix of document correlations B^T B
  16.1 Some examples of classification tasks in NLP
  16.2 Contingency table for evaluating a binary classifier
  16.3 The representation of document 11, shown in figure 16.3
  16.4 An example of information gain as a splitting criterion
  16.5 Contingency table for a decision ...

List of Figures (excerpt)

  ... traversing an arc
  10.1 Algorithm for training a Visible Markov Model Tagger
  10.2 Algorithm for tagging with a Visible Markov Model Tagger
  10.3 The learning algorithm for transformation-based tagging
  11.1 The two parse trees, their probabilities, and the sentence probability
  11.2 A Probabilistic Regular Grammar (PRG)
  11.3 Inside and outside probabilities in PCFGs
  12.1 A word lattice (simplified)
  12.2 A Penn Treebank tree
  12.3 Two CFG derivations of the same tree
  12.4 An LC stack parser
  12.5 Decomposing a local tree into dependencies
  12.6 An example of the PARSEVAL measures
  12.7 The idea of crossing brackets
  12.8 Penn trees versus other trees
  13.1 Different strategies ...
  ... The matrix of singular values of the SVD decomposition of the matrix in figure 15.5
  15.10 The matrix D of the SVD decomposition of the matrix in figure 15.5
  15.11 The matrix B̂ = S2×2 D2×n of documents after rescaling with singular values and reduction to two dimensions
  15.12 Three constellations ...

Date posted: 14/08/2014, 08:22

