
Manning and Schütze, Statistical NLP, part 4


[...]

r       r*        P_GT(·)
0       0.0007    1.058 × 10^-9
1       0.3663    5.982 × 10^-7
2       1.228     2.004 × 10^-6
3       2.122     3.465 × 10^-6
4       3.058     4.993 × 10^-6
5       4.015     6.555 × 10^-6
6       4.984     8.138 × 10^-6
7       5.96      9.733 × 10^-6
8       6.942     1.134 × 10^-5
9       7.928     1.294 × 10^-5
10      8.916     1.456 × 10^-5
[...]   26.84     4.383 × 10^-5
[...]   27.84     4.546 × 10^-5
[...]   28.84     4.709 × 10^-5
[...]   29.84     4.872 × 10^-5
[...]   30.84     5.035 × 10^-5
[...]   1263      0.002062
[...]   1365      0.002228
[...]   1916      0.003128
[...]   2232      0.003644
[...]   2506      0.004092

[...]

Bigrams
r     N_r        r     N_r      r       N_r
1     138741     28    90       1264    1
2     25413      29    120      1366    1
3     10531      30    86       1917    1
4     5997       31    98       2233    1
5     3565       32    99       2507    1
6     2486
7     1754
8     1342
9     1106
10    896

Trigrams
r     N_r        r     N_r      r       N_r
1     404211     28    35       189     1
2     32514      29    32       202     1
3     10056      30    25       214     1
4     4780       31    18       366     1
5     2491       32    19       378     1
6     1571
7     1088
8     749
9     582
10    432

Table 6.7 Extracts [...]

[...]

6.2 Statistical Estimators

                                 System 1                                      System 2
scores                           71, 61, 55, 60, 68, 49, 42, 72, 76, 55, 64    42, 55, 75, 45, 54, 51, 55, 36, 58, 55, 67
total                            609                                           526
n                                11                                            11
mean x̄_i                         55.4                                          47.8
s_i^2 = Σ_j (x_ij − x̄_i)^2       1,375.4                                       1,228.8
df                               10                                            10

Pooled $s^2 = \frac{1375.4 + 1228.8}{10 + 10} \approx 130.2$

$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{2s^2/n}} = \frac{55.4 - 47.8}{\sqrt{2 \cdot 130.2 / 11}} \approx 1.56$

Table 6.6 Using the t test for comparing the performance of two systems.

Since we calculate the [...]

[...]

Rank     Word       MLE      ELE
1        not        0.065    0.036
2        a          0.052    0.030
3        the        0.033    0.019
4        to         0.031    0.017
=1482    inferior   0        0.00003

Table 6.5 Expected Likelihood Estimation estimates for the word following was.

[...] appeared 9409 times, not appeared 608 times in the training corpus, which overall contained 14589 word types. So our new estimate for P(not|was) is (608 + 0.5)/(9409 + 14589 × 0.5) [...]

[...] curve, as discussed in section 1.4.3.

6 Statistical Inference: n-gram Models over Sparse Data

Word-by-word estimates for the clause "In person she was inferior to both sisters" (table partly unrecoverable from the preview):

1-gram, P(·), the same at every position:
Rank    Word       P(·)
1       the        0.034
2       to         0.032
3       and        0.030
4       of         0.029
8       was        0.015
13      she        0.011
254     both       0.0005
435     sisters    0.0003
1701    inferior   0.00005

2-gram, top-ranked continuations:
P(·|person):    and 0.099, who 0.099, to 0.076, in 0.045
P(·|she):       had 0.141, was 0.122
P(·|was):       not 0.065, a 0.052, the 0.033, to 0.031
P(·|inferior):  to 0.212
P(·|to):        be 0.111, the 0.057, her 0.048, have 0.027
P(·|both):      of 0.066, to 0.041, in 0.038, and 0.025
[lower-ranked 2-gram entries for she, Mrs, what, both, sisters and inferior are too garbled to assign reliably]

3-gram:
P(·|In,person):    UNSEEN
P(·|was,inf.):     UNSEEN
P(·|inferior,to):  the 0.286, Maria 0.143, cherries 0.143, her 0.143
P(·|to,both):      [...] 0.222, Chapter 0.111, Hour 0.111, Twice 0.111
[a further 3-gram column lists not 0.057, very 0.038, in 0.030, to 0.026, and another lists two continuations at 0.5 each]

4-gram:
[mostly UNSEEN; one context has the single continuation in with probability 1.0]

[...] space to unseen events. Consider some data discussed by Church and Gale (1991a) in the context of their discussion of various estimators for bigrams. Their corpus of 44 million words of Associated Press (AP) newswire yielded a vocabulary of 400,653 words (maintaining case distinctions, splitting on hyphens, etc.). Note that this vocabulary size means that there is a space of 1.6 × 10^11 possible bigrams, [...]
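The pooled variance and t statistic reported for table 6.6 can be re-derived from the table's own summary rows. The sketch below is only a check of that arithmetic: the function name `pooled_t_statistic` and its argument names are illustrative, and it assumes two samples of equal size n, working from the means and sums of squared deviations as printed rather than from the raw scores.

```python
import math


def pooled_t_statistic(mean1, mean2, ss1, ss2, n):
    """Two-sample t statistic with pooled variance, for two samples of size n.

    ss1 and ss2 are the sums of squared deviations from each sample mean.
    (Illustrative helper; names are assumptions, not from the source.)
    """
    df = 2 * (n - 1)                      # 10 + 10 degrees of freedom
    pooled_var = (ss1 + ss2) / df         # pooled s^2
    t = (mean1 - mean2) / math.sqrt(2 * pooled_var / n)
    return pooled_var, t, df


# Summary values as printed in table 6.6.
pooled_var, t, df = pooled_t_statistic(55.4, 47.8, 1375.4, 1228.8, 11)
print(f"pooled s^2 = {pooled_var:.1f}, t = {t:.2f}, df = {df}")
# prints: pooled s^2 = 130.2, t = 1.56, df = 20
```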
[...] set data. Doing this empirically measures how often n-grams that were seen r times in the training data actually do occur in the test text. The empirical estimates f_empirical in table 6.4 were found by randomly dividing the 44 million bigrams in the whole AP corpus into equal-sized training and test sets, counting frequencies in the 22 million word training set and then doing held out estimation using the [...]

r    f_empirical   f_Lap      f_del      f_GT       N_r              T_r
0    0.000027      0.000137   0.000037   0.000027   74 671 100 000   2 019 1[...]
1    0.448         0.000274   0.396      0.446      2 018 046        903 206
2    1.25          0.000411   1.24       1.26       449 721          564 153
3    2.24          0.000548   2.23       2.24       188 933          424 015
4    3.23          0.000685   3.22       3.24       105 668          341 099
5    4.21          0.000822   4.22       4.22       68 379           287 776
6    5.23          0.000959   5.20       5.19       48 190           251 951
7    6.21          0.00109    6.21       6.21       35 709           [...]
8    7.21          0.00123    7.18       7.24       27 710           [...]
9    8.26          0.00137    8.18       8.25       22 280           [...]
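Both the Expected Likelihood Estimation figure for P(not|was) and the f_Lap column are instances of add-λ (Lidstone) smoothing, with λ = 0.5 and λ = 1 respectively. The following is a minimal sketch, not the book's code: the helper name `lidstone` is hypothetical, and the Laplace part assumes N = 22 million training bigrams and B = 400,653² bigram bins, taken from the figures quoted in the text, with the expected frequency computed as N times the smoothed probability.

```python
def lidstone(count, total, bins, lam):
    """Add-lambda estimate: (count + lam) / (total + bins * lam).

    lam = 0.5 gives Expected Likelihood Estimation, lam = 1 gives Laplace.
    (Hypothetical helper name, chosen for this illustration.)
    """
    return (count + lam) / (total + bins * lam)


# ELE for P(not | was): 'was' seen 9409 times, 'was not' 608 times,
# 14589 word types in the training corpus.
p_not_given_was = lidstone(608, 9409, 14589, 0.5)
p_inferior_given_was = lidstone(0, 9409, 14589, 0.5)   # unseen bigram
print(f"{p_not_given_was:.3f} {p_inferior_given_was:.5f}")

# Laplace expected frequencies for bigrams seen r times in training.
# Assumed: N = 22 million bigram tokens, B = vocabulary squared.
N = 22_000_000
B = 400_653 ** 2
f_lap = [N * lidstone(r, N, B, 1.0) for r in range(10)]
for r, f in enumerate(f_lap):
    print(r, f"{f:.6f}")
```

Under these assumptions the first rows come out close to the printed f_Lap values (about 0.000137 for r = 0 and 0.000274 for r = 1); small differences in later rows may reflect rounding or a slightly different N in the original computation.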

Posted: 14/08/2014, 08:22


Contents

    Statistical Inference: n-gram Models over Sparse Data
