2011 Third International Conference on Knowledge and Systems Engineering

Two Entropy-based Methods for Detecting Errors in POS-Tagged Treebank

Phuong-Thai Nguyen, University of Engineering and Technology, Vietnam National University, Hanoi, thainp@vnu.edu.vn
Tu-Bao Ho, Japan Advanced Institute of Science and Technology, bao@jaist.ac.jp
Thi-Thanh-Tam Do, University of Engineering and Technology, Vietnam National University, Hanoi, dotam85@gmail.com
Anh-Cuong Le, University of Engineering and Technology, Vietnam National University, Hanoi, cuongla@vnu.edu.vn

978-0-7695-4567-7/11 $26.00 © 2011 IEEE. DOI 10.1109/KSE.2011.30

Abstract—This paper proposes two methods of employing conditional entropy to find errors and inconsistencies in treebank corpora. These methods are based on two principles: that high entropy implies a high possibility of error, and that entropy is reduced after error correction. The first method ranks error candidates using a scoring function based on conditional entropy. The second method uses beam search to find a subset of error candidates in which the change of labels leads to a decrease of conditional entropy. We carried out experiments with the Vietnamese treebank corpus at two levels of annotation: word segmentation and part-of-speech tagging. Our experiments showed that these methods detected high-error-density subsets of the original error candidate sets. The size of these subsets is only one third the size of the whole sets, while they contain 80%-90% of the errors in the whole sets. Moreover, entropy was significantly reduced after error correction.

Keywords—corpus, treebank, part of speech (POS) tagging, word segmentation, error detection, entropy

I. INTRODUCTION

Currently, natural language processing research is dominated by corpus-based approaches. However, building annotated corpora is a costly and labor-intensive task. There are errors even in released data, as shown by the fact that complex data such as treebanks are often released in several versions. (There are several purposes for multi-version treebank publishing: error correction, annotation scheme modification, and data addition. For example, major changes in the Penn English Treebank (PTB) [5] upgrade from version I to version II include POS tagging error correction and predicate-argument structure labelling; in the PTB upgrade from version II to version III, more data was appended.) In order to speed up annotation and increase the reliability of labelled corpora, various kinds of software tools have been built for format conversion, automatic annotation, tree editing [9], etc. In this paper we focus on methods for checking errors and inconsistencies in annotated treebanks.

Three techniques to detect part-of-speech tagging errors have been proposed by Dickinson and Meurers [2]. The main idea of their first technique is to consider variation ngrams, those which occur more than once in the corpus and include at least one difference in their annotation. For example, "centennial year" is a variation bi-gram which occurs in the Wall Street Journal (WSJ), a part of the Penn Treebank corpus [5], with two possible taggings, "centennial/JJ year/NN" and "centennial/NN year/NN" (JJ: adjective, NN: noun). Among them, the second tagging is correct. Dickinson found that a large percentage of variation ngrams in the WSJ have at least one instance (occurrence) with an incorrect label. However, using this variation-ngram method, linguists have to check all instances of variation ngrams to find errors. The other two techniques take into account more linguistic information, including tagging-guide patterns and functional words.

Dickinson [3] reported a method to detect ad-hoc treebank structures. He used a number of linguistically-motivated heuristics to group context-free grammar (CFG) rules into equivalence classes by comparing the right-hand sides (RHS) of rules. An example of such a heuristic is that CFG rules of the same category should have the same head tag and similar modifiers, but can differ in the number of modifiers. By applying these heuristics, the RHS sequences ADVP RB ADVP and ADVP , RB ADVP (ADVP: adverbial phrase, RB: adverb) will be grouped into the same class. Classes with only one rule, or rules which do not belong to any class, are problematic. He evaluated the proposed method by analysing several types of errors in the Penn treebank [5]. However, similarly to [2], this study proposed a method to determine candidates of problematic patterns (ad-hoc CFG rules instead of variation ngrams) but not problematic instances of those patterns.

Yates et al. [10] reported a study on detecting parser errors using semantic filters. First, syntactic trees, the output of a parser, are converted into an intermediate representation called a relational conjunction (RC). Then, using the Web as a corpus, RCs are checked using various techniques including point-wise mutual information, a verb arity sampling test, a TextRunner filter, and a question answering (QA) filter. In evaluation, error rate reductions of 20% and 67% were reported when tested on the Penn treebank and TREC, respectively. The interesting point of their paper is that information from the Web was utilized to check for errors.

Novak and Razimova [8] used Apriori, an association rule mining algorithm, to find annotation rules and then to search for violations of these rules in corpora. They found that violations are often annotation errors. They reported an evaluation of this technique on the Prague Dependency Treebank 2.0, presenting an error analysis which showed that among the first 100 detected nodes, 20 contained an annotation error. However, this was not an intensive evaluation.

In this paper, in order to overcome the drawback of previous works such as those of Dickinson and colleagues, we introduce two learning methods based on conditional entropy for detecting errors in treebanks. Our methods, named ranking and beam search, can detect erroneous instances of variation ngrams in treebank data (Figure 1). (In our paper, the term "variation ngram" has the same meaning as the term "variation nuclei" in [2]: an ngram which varies in labels because of ambiguity or annotation error. Contextual information, for example surrounding words, is not included in an ngram.) These methods are based on the entropy of labels given their contexts. Our experiments showed that conditional entropy was reduced after error correction, and that by using ranking and beam search the number of checked instances can be reduced drastically. We used the Vietnamese treebank [7] for our experiments.

Figure 1: Conceptual sets. S1: the whole treebank data; S2: the data set of variation ngrams; S3: the error set.

The rest of our paper is organized as follows: the error detection methods are presented in Section II, experimental results and discussion are reported in Section III, and finally conclusions are drawn and future work is proposed in Section IV.

II. ERROR DETECTION METHOD

A. A Motivating Example

First, we consider a motivating example. The following 25-gram is a complete sentence that appears 14 times, four times with centennial tagged as JJ and ten times with centennial tagged as NN, the latter being correct according to the tagging guide (Santorini, 1990):

- During its centennial year , The Wall Street Journal will report events of the past century that stand as milestones of American business history .

Given the Penn treebank data, and given the surrounding context (two words before and twenty-two words after), the distribution of centennial's tag over the tag set {JJ, NN} is (4/14, 10/14). This distribution has a positive entropy value. If all instances of centennial were tagged correctly, the distribution of its tag would be (0, 1), and this distribution has an entropy value of zero. This simple analysis suggests that there is a relation between entropy and errors in data, and that high entropy signals a problem. Note that labelled data are often used for training statistical classifiers such as word segmenters, POS taggers, and syntactic parsers; error-free or reduced-error training data will result in a better classifier. Entropy is a measure of uncertainty. Does an explicit mathematical relation between entropy and classification error exist?

B. A Probabilistic Relation between Entropy and Classification Error

Suppose that X is a random variable representing the information that we know, and Y is another random variable whose value we have to guess. The relation between X and Y is p(y|x). From X, we calculate a classification function g(X) = Ŷ. We define the probability of error Pe = P(Y ≠ Ŷ). Fano's inequality [1] relates Pe to H(Y|X) as follows:

    Pe ≥ (H(Y|X) − H(Pe)) / log(M − 1)    (1)

where M is the number of possible values of Y. The inequality gives a lower bound on the classification-error probability: if H(Y|X) is small, we have more chance to estimate Y with a low probability of error. If H(Y|X) > 0, there can be a number of reasons:

- ambiguity: for example, the word can is ambiguous between being an auxiliary, a main verb, or a noun, and thus there is variation in the way can would be tagged in "I can play the piano" and "Pass me a can of beer, please";
- the choice of X (feature selection): in decision tree learning [6], H(Y) − H(Y|X) is called information gain;
- error: for example, the tagging of a word may be inconsistent across comparable occurrences.

In this paper we focus on the relation between H(Y|X) and the correctness of training data. We make two working assumptions:

- there is a strong correlation between high conditional entropy and errors in annotated data;
- conditional entropy is reduced when errors are corrected.

These assumptions suggest that error correction can be considered as an entropy reduction process. Now we consider a more realistic classification configuration, using K features rather than only one. Our objective is to reduce the conditional entropy H(Y|X1, X2, ..., XK). Since conditioning reduces entropy, it is easy to derive:

    H(Y|X1, X2, ..., XK) ≤ (1/K) Σ_{i=1..K} H(Y|Xi)    (2)

To simplify calculations, we can try to reduce the upper bound (1/K) Σ_{i=1..K} H(Y|Xi) instead of directly handling H(Y|X1, X2, ..., XK). Later, through our experiments, we will show that this simplification works well. Equation (2) can be straightforwardly proved. Since conditioning reduces entropy [1], we have H(Y|X) ≤ H(Y); this inequality implies that, on average, the more information we have, the greater the reduction in uncertainty. By applying this inequality K times, we obtain H(Y|X1, X2, ..., XK) ≤ H(Y|Xi) for 1 ≤ i ≤ K. Summing these inequalities and dividing both sides by K, we obtain (2).

The entropy H(Y|X1, X2, ..., XK) can be computed as:

    H(Y|X1, X2, ..., XK) = Σ_{(x1, x2, ..., xK)} p(x1, x2, ..., xK) × H(Y|X1 = x1, X2 = x2, ..., XK = xK)

where the sum is taken over the set A1 × A2 × ... × AK, the Ai being the sets of possible values of the Xi, and

    H(Y|X1 = x1, X2 = x2, ..., XK = xK) = Σ_y −p(y|x1, x2, ..., xK) × log(p(y|x1, x2, ..., xK))    (3)

Using the Bayes formula and making independence assumptions between the Xi, we can decompose p(y|x1, x2, ..., xK) into:

    p(y|x1, x2, ..., xK) = (Π_{i=1..K} p(xi|y)) × p(y) / Π_{i=1..K} p(xi)

where p(xi|y) = Freq(y, xi)/Freq(y), p(y) = Freq(y)/L, and p(xi) = Freq(xi)/L, with L the number of examples in our data set.

C. Empirical Entropy

When K is a large number, it is difficult to compute the true value of H(Y|X1, X2, ..., XK), since there are |A1| × |A2| × ... × |AK| possible combinations of the Xi's values. A practical approach to overcome this problem is to compute empirical entropy on a data set. More specifically, the entropy sum is taken over those (x1, x2, ..., xK) for which ((x1, x2, ..., xK), y) exists in our data set. Empirical entropy was not used for our error detection methods; it was used only for computing entropy reduction over the data sets in Section III-F.

D. Error Detection by Ranking

Based on the first working assumption stated in Section II-B, we rank training examples (x, y) = ((x1, x2, ..., xK), y) in decreasing order using the following scoring function:

    Score(x, y) = (1/K) Σ_{i=1..K} H(Y|Xi = xi) + ΔH

where the first term does not depend on y, and the second term ΔH is the maximal reduction of the first term when y is changed. Suppose that B is the set of possible values of Y and M = |B|. Without loss of generality, suppose that B = {1, 2, ..., M}. Given Xi = xi, the discrete conditional distribution of Y is P(Y|Xi = xi) = (p1, p2, ..., pM), where pj ≥ 0 (1 ≤ j ≤ M) and Σ_{j=1..M} pj = 1. Also, pj can be computed by pj = Freq(j, xi)/Freq(xi), where Freq(j, xi) is the co-occurrence frequency of j and xi, and Freq(xi) is the frequency of xi, both easily calculated from a corpus. The conditional entropy can be computed by:

    H(Y|Xi = xi) = − Σ_{j=1..M} pj × log(pj)

When the label of x = (x1, x2, ..., xK) changes from y to y′, for each xi, P(Y|Xi = xi) changes to P′(Y|Xi = xi) = (p′1, p′2, ..., p′M), in which p′j = pj for j ≠ y and j ≠ y′, p′y = (Freq(y, xi) − 1)/Freq(xi), and p′y′ = (Freq(y′, xi) + 1)/Freq(xi). The entropy H(Y|Xi = xi) becomes H′(Y|Xi = xi), and it is simple to compute ΔH by the formula:

    ΔH = (1/K) Σ_{i=1..K} max_{y′} [H(Y|Xi = xi) − H′(Y|Xi = xi)]
       = (1/K) Σ_{i=1..K} max_{y′} [−py × log(py) − py′ × log(py′) + p′y × log(p′y) + p′y′ × log(p′y′)]

The idea behind the use of ΔH is that correcting an error should lead to a decrease in entropy. We consider the word nổi (as a verb, V: float; as an adverb, R: impossibly; as an adjective, A: famous), occurring 75 times in the Vietnamese treebank; among these occurrences there are erroneous instances, with three possible POS tags A, R, and V. The following are the scores of the first ten instances: (1) 4.92 + 1.11, (2) 4.35 + 1.55, (3) 3.60 + 1.27, (4) 4.48 − 0.21, (5) 3.36 + 0.89, (6) 4.18 − 0.31, (7) 2.96 + 0.86, (8) 4.23 − 0.47, (9) 3.98 − 0.30, (10) 4.40 − 0.87. Each score is represented as a sum of two numbers, in which the second is ΔH; in the original paper the erroneous instances are marked in bold. If ΔH were omitted from the scoring formula, the order of the examples would be: (1), (4), (10), (2), (8), (6), (9), (3), (5), (7).

E. Error Detection by Using Beam Search

In the ranking method, a change in the label of one example does not affect the scores of other examples. Based on the second working assumption stated in Section II-B, in this section we propose a beam search method for error detection. A subset of the data in which relabelling leads to a decrease in entropy is searched for. The objective function is the upper bound (1/K) Σ_{i=1..K} H(Y|Xi). The subset size is limited to N, which is about tens of percent of the whole data set. We used a multi-stack beam search algorithm, described as follows:

Algorithm: A beam-search algorithm for error detection
    create the initial state, put it into stack[0]
    for i = 1 to N
        for each state s in stack[i − 1]   {expand s}
            for each example e in the data set
                relabel e
                create a new state s_new and score s_new
                add s_new to stack[i]
                prune stack[i]
            end for
        end for
    end for
    choose the lowest-score state from the stacks

Table 1: Features for the word-segmentation and POS tagging error detection tasks. S1: word-segmented sentence; S2: POS-tagged sentence; E: English translation.
    S1: Nguyện_vọng1 về2 vấn_đề3 nước dùng đã4 được5 xem_xét6
    E: Proposal1 for2 clean water supply3 has4 been5 considered6
    Features: về2, vấn_đề3, đã4, được5
    S2: Ông1/N chỉ2/R muốn3/V chui4/V xuống/E đất5/N khi6/N chủ_nợ7/N đến8/V
    E: He1 just2 wanted3 to disappear4,5 when6 creditors7 came8
    Features: muốn3, chui4, đất5, khi6, V3, V4, N5, N6

III. EXPERIMENTS

A. Corpus Description

We used the word-segmented and POS-tagged data sets of the Vietnamese treebank [7] for our experiments. There are several phenomena specific to Vietnamese words. The first is word segmentation. Like a number of other Asian languages such as Chinese, Japanese, and Thai, Vietnamese has no word delimiter. The smallest unit in the construction of Vietnamese words is the syllable. A Vietnamese word can be a single word (one syllable) or a compound word (more than one syllable). A space is a syllable delimiter but not a word delimiter in Vietnamese. A Vietnamese sentence can often be segmented in many ways.
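The ranking score described in Section II-D can be sketched in a few lines of code. The following is our own illustrative sketch, not the authors' implementation: the function names, the toy corpus, and the two-feature context are all invented for the example; the score is the mean per-feature conditional entropy plus the maximal entropy drop ΔH obtainable by relabelling.

```python
import math
from collections import Counter, defaultdict

def cond_entropy(label_counts):
    """Entropy of P(Y | X_i = x_i) from a {label: count} table."""
    total = sum(label_counts.values())
    h = 0.0
    for c in label_counts.values():
        if c > 0:
            p = c / total
            h -= p * math.log(p)
    return h

def score(features, y, counts, labels):
    """Ranking score: mean of H(Y|X_i = x_i) plus the maximal entropy
    reduction Delta-H obtained by relabelling y to some other label y2."""
    k = len(features)
    base = sum(cond_entropy(counts[x]) for x in features) / k
    best_drop = 0.0
    for y2 in labels:
        if y2 == y:
            continue
        drop = 0.0
        for x in features:
            # simulate moving one count from y to y2 for this feature value
            new_counts = dict(counts[x])
            new_counts[y] = new_counts.get(y, 0) - 1
            new_counts[y2] = new_counts.get(y2, 0) + 1
            drop += cond_entropy(counts[x]) - cond_entropy(new_counts)
        best_drop = max(best_drop, drop / k)
    return base + best_drop

# Toy corpus of (features, label) pairs; features are surrounding words.
data = [(("its", "year"), "JJ"), (("its", "year"), "NN"),
        (("its", "year"), "NN"), (("its", "year"), "NN")]
counts = defaultdict(Counter)
for feats, y in data:
    for x in feats:
        counts[x][y] += 1
labels = {"JJ", "NN"}

ranked = sorted(data, key=lambda ex: score(ex[0], ex[1], counts, labels),
                reverse=True)
print(ranked[0])  # the lone JJ-tagged instance ranks first
```

In this toy setting the minority JJ tag receives the highest score: its base entropy term equals that of the NN instances, but relabelling it to NN drives the conditional entropy to zero, so its ΔH is large and positive.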
Obviously, Vietnamese word segmentation is a non-trivial problem. The second phenomenon is that Vietnamese is an isolating language: functional words, instead of word inflection, are used to express number, tense, etc.

The Vietnamese treebank was developed in a two-year national project (http://vlsp.vietlp.org:8080/demo/). For each data set, there were several phases in its development: labelling using tools, manual revision, a second manual revision, and manual revision driven by specific linguistic phenomena. Therefore each sentence was checked by at least two annotators. After each phase, the data sets became cleaner. Of course, the revisions were carried out with the use of guidelines, which were themselves also modified during the development of the corpus.

We cannot directly use the treebank data for the evaluation of the error-checking task. Dickinson and Meurers [2] manually checked all instances of variation ngrams to find erroneous instances; however, we did not use their method. Instead, we compared different versions of the data sets to find which sentences were modified and at which positions (words or phrases). Table 2 shows the description of the data sets used in our experiments. For each data set, two versions were used to extract evaluation data: one version resulting from the manual revision, and the other resulting from the second manual revision.

Table 2: The Vietnamese treebank data sets used in the experiments.
    Data set 1 (word-segmented): Sentences: 68,850; Words: 1,553,235; Voc: 45,403
    Data set 2 (POS-tagged): Sentences: 10,120; Words: 217,111; Voc: 17,105

In the beam-search algorithm of Section II-E, the states and stacks are defined as follows:

- a state (or a hypothesis) is a relabelled subset of the data set; states with the same number of examples are put into a stack, and stacks are numbered by the number of examples in their states;
- a state in stack[i − 1] is expanded by adding a new relabelled example into the state's example set, resulting in a new state, and the new state is added to stack[i];
- given a state and a new example, the example is relabelled by choosing the label which minimizes the objective function;
- the size of a stack is limited to O (in practice this number is set to one hundred or several hundred), which means that only the O lowest-score states are kept;
- the lowest-score state is chosen as the set of error candidates; if there is more than one such state, the one with the smallest number of examples is chosen.

F. Application to Word-Segmented and POS-Tagged Data Sets

In this paper, we focus on checking word-segmented and POS-tagged corpora. For word-segmented data, syllable ngrams which have multiple word segmentations are considered (as the random variable Y); the features are the two preceding words and the two following words (a total of four features, as random variables Xi). For POS-tagged data, words with multiple tags are considered; the feature set includes the surrounding words and their POS tags (a total of eight features). Table 1 shows two examples, including the labelled sentences, the variation ngrams (in italics in the original), subscripts mapping Vietnamese to English words, and the features.

B. Data Extraction

Comparisons were carried out sentence by sentence using minimum edit distance (MED), a dynamic programming algorithm [4], in which three operations are used: insertion, deletion, and replacement. The MED algorithm is followed by a post-processing procedure to combine operations on adjacent words of the original sentence. Table 3 shows an example of word-segmented sentence comparison using the MED algorithm. The underscore character is used to connect the syllables of the same word. The syllable sequence trả giá is a variation bigram. The MED algorithm found that trả (pay) was deleted and giá (price) was replaced by trả_giá (pay). Since trả and giá were two adjacent words in the original sentence, the deletion and replacement operations were combined, resulting in the replacement (modification) of trả giá by trả_giá.

Table 3: Example of word-segmented sentence comparison using the MED algorithm. S1: erroneous sentence; S2: corrected sentence; E: English translation.
    S1: Thủ_môn1 trả giá vì2 sai_lầm3 ngớ_ngẩn4
    S2: Thủ_môn1 trả_giá vì2 sai_lầm3 ngớ_ngẩn4
    E: The goalkeeper1 pays for2 his blunder3,4

The extraction results on the treebank's two data sets are reported in Table 4. A variation ngram can be a sequence of syllables with multiple word segmentations in the corpus, or a word with multiple tags in the corpus. An instance (or example) is an occurrence of an ngram. An error variation ngram is one with at least one error instance (incorrectly labelled). The table thus shows the ambiguous core of the corpus. The percentage of error variation ngrams is high; however, the percentage of error instances is much lower. Reducing the number of instances to be checked is therefore meaningful.

Table 4: Data extraction statistics.
    Data set 1 (word segmentation): Variation ngrams: 1,565; Error variation ngrams: 1,248; Instances: 48,752; Error instances: 5,227
    Data set 2 (POS tagging): Variation ngrams: 1,685; Error variation ngrams: 968; Instances: 108,455; Error instances: 8,734

C. Error Types and Distributions

As shown in Table 4, not all instances of variation ngrams are erroneous. Figure 2 displays error distribution curves, which show the likelihood of the number of error instances of a variation ngram. These curves look like a Poisson distribution, known as the distribution of rare events. For the word-segmented data set, on average each variation ngram has 31.15 instances in total and 3.34 erroneous instances. For the POS-tagged data set, on average each variation ngram has 64.36 instances in total and 5.18 erroneous instances. The maximum points are close to the vertical axis (the two points nearest to the vertical axis are the numbers of variation ngrams which have no erroneous instances). It is clear that most variation ngrams have zero, one, two, or only several errors.

Figure 2: Error distribution curves. The horizontal axis represents the error count; the vertical axis represents the variation ngram count. The red curve corresponds to the word segmentation data set; the blue curve corresponds to the POS-tagged data set.

In the word-segmented data set, about 60% of the erroneous instances require correction by combining single words to form a compound word; about 40% require a change by splitting a compound word into single words. A number of typical corrections are listed here: subordinated compounds (khu phố → khu_phố (quarter), kim khâu → kim_khâu (needle)), coordinated compounds (thu đông → thu_đông (autumn and winter), xinh đẹp → xinh_đẹp (beautiful)), another kind of subordinated compound (nhà khoa_học → nhà_khoa_học (scientist), nguyên bộ_trưởng → nguyên_bộ_trưởng (former minister)), and proper nouns (Công_ty_FPT → Công_ty FPT (FPT company), Hà Nội → Hà_Nội).

Figure 3 shows the percentage of each modified POS tag. For example, the first column shows that among the 8,734 erroneous POS-tagged instances (Table 4), 20.87% were changed from the noun tag N to other POS tags. Among the 18 columns, the ones corresponding to noun, verb, adverb, and adjective have the largest percentages.

Figure 3: The percentage of each modified POS tag.

D. Error Detection Results for Word Segmentation

Figure 4 shows the error detection results for word segmentation. The blue curve represents the number of error examples discovered if annotators check the data set with the examples in their original order. The red curve represents the number of error examples discovered if annotators check the data set with the examples sorted in decreasing order of entropy. It is obvious that most errors, about 89.92% (4,700/5,227), have been detected after checking one third of the data set. The yellow curve shows the case using beam search; it is better than entropy ranking to a certain degree.

Figure 4: Error detection results for word segmentation. The horizontal axis represents the number of examples annotators have to check; the vertical axis represents the number of detected error examples.

E. Error Detection Results for POS Tagging

Figure 5 reports the error detection results for POS tagging. If annotators check the data with the examples in their original order, the number of detected errors goes up linearly (blue curve). If the data is sorted in decreasing order of entropy, the number of detected errors goes up very fast (red curve): about 81.34% (7,104/8,734) of the errors are found after checking one third of the data set. The detection efficiency rises even faster if the beam search technique is used (yellow curve).

Figure 5: Error detection results for POS tagging. The horizontal axis represents the number of examples annotators have to check; the vertical axis represents the number of detected error examples.

F. Entropy Reduction

Entropy plays a central role in our detection methods: high entropy corresponds to a high possibility of error. Table 5 shows that on both data sets, the total empirical entropy of all variation ngrams is reduced after error correction (EntDec Total). The total entropy upper bound has also decreased (EntBDec Total). For the word-segmented data set, a majority of the erroneous ngrams (92.90%) show less entropy after error correction, a very small number (0.97%) show no change in entropy, and 6.13% show increasing entropy. For the POS-tagged data set, the percentage of increased-entropy erroneous ngrams is higher. According to our observations on specific erroneous ngrams, there are a number of reasons for the increase of entropy. The first is the sparse data problem: for ngrams with a small number of instances and few errors, the correction of errors leads to an entropy increase in some cases. The second is that some words are highly ambiguous, and after revision there are still errors. Within the set of 95 erroneous ngrams whose number of erroneous instances is greater than 15, there are 39 ngrams (41.05%) whose entropy increased; though this is a small set, the ratio is high in comparison with the 22.71% average. It is logical that the entropy upper bound is reduced more than the empirical entropy; however, the difference between these values is rather large. Note that the empirical entropy is summed over a subset of the whole space, and it is therefore smaller than the true entropy value. If p(x1, x2, ..., xK) is normalized, the calculation of the empirical entropy reduction results in a higher value. (Using p(x1, x2, ..., xK) = Freq(x1, x2, ..., xK)/L, the value of the empirical entropy reduction was 173.49 on the word-segmented data set.)

Table 5: Entropy changes on the data sets (DS). EntDec/EntUnc/EntInc Ngram: the percentage of erroneous ngrams for which entropy decreased/remained unchanged/increased; EntDec Total: total entropy reduction of ngrams; EntBDec Total: total entropy bound reduction of ngrams.
    DS 1 (word segmentation): EntDec Ngram: 92.90%; EntUnc Ngram: 0.97%; EntInc Ngram: 6.13%; EntDec Total: 11.62; EntBDec Total: 82.90
    DS 2 (POS tagging): EntDec Ngram: 76.96%; EntUnc Ngram: 0.33%; EntInc Ngram: 22.71%; EntDec Total: 13.22; EntBDec Total: 69.17

IV. CONCLUSION

We have investigated two entropy-based methods for detecting errors and inconsistencies in treebank corpora. Our experiments on Vietnamese treebank data showed that these methods are effective. More specifically, these methods can reduce by two thirds the size of the error candidate sets, and conditional entropy is really reduced after the correction of errors. We are applying the entropy-based approach to detecting syntax tree errors in treebanks. In the future, we intend to use extra resources such as word clusters to improve the error detection results. We also intend to use this approach for checking other kinds of data.

ACKNOWLEDGMENT

This work is partially supported by the TRIG project at the University of Engineering and Technology, VNU Hanoi. It is also partially supported by Vietnam's National Foundation for Science and Technology Development (NAFOSTED), project code 102.99.35.09.

REFERENCES

[1] Cover, Thomas M. and Joy A. Thomas. 2006. Elements of Information Theory. John Wiley & Sons, Inc.
[2] Dickinson, Markus and W. Detmar Meurers. 2003. Detecting Errors in Part-of-Speech Annotation. In Proceedings of EACL.
[3] Dickinson, Markus. 2008. Ad Hoc Treebank Structures. In Proceedings of ACL.
[4] Jurafsky, Daniel and James H. Martin. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall.
[5] Marcus, Mitchell P., Mary A. Marcinkiewicz, and Beatrice Santorini. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics.
[6] Mitchell, Tom M. 1997. Machine Learning. The McGraw-Hill Companies, Inc.
[7] Nguyen, Phuong-Thai, Vu Xuan Luong, Nguyen Thi Minh Huyen, Nguyen Van Hiep, and Le Hong Phuong. 2009. Building a Large Syntactically-Annotated Corpus of Vietnamese. In Proceedings of LAW-3, ACL-IJCNLP.
[8] Novak, Vaclav and Magda Razimova. 2009. Unsupervised Detection of Annotation Inconsistencies Using Apriori Algorithm. In Proceedings of LAW-3, ACL-IJCNLP.
[9] Pajas, Petr and Jan Stepanek. 2008. Recent Advances in a Feature-rich Framework for Treebank Annotation. In Proceedings of COLING.
[10] Yates, Alexander, Stefan Schoenmackers, and Oren Etzioni. 2006. Detecting Parser Errors Using Web-based Semantic Filters. In Proceedings of EMNLP.
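As a closing illustration, the multi-stack beam search of Section II-E can be sketched as follows. This is our own illustrative code, not the authors' implementation: the toy data, the feature encoding, and all names are invented, and the objective is the empirical version of the entropy upper bound (1/K) Σ H(Y|Xi) described in the paper.

```python
import math
from collections import Counter, defaultdict

def upper_bound(data):
    """Objective: (1/K) * sum_i H(Y | X_i = x_i), summed over the data set."""
    counts = defaultdict(Counter)
    for feats, y in data:
        for i, x in enumerate(feats):
            counts[(i, x)][y] += 1
    total = 0.0
    for feats, _ in data:
        for i, x in enumerate(feats):
            c = counts[(i, x)]
            n = sum(c.values())
            total -= sum((v / n) * math.log(v / n) for v in c.values() if v)
    k = len(data[0][0])
    return total / k

def beam_search(data, labels, max_changes, beam=100):
    """Multi-stack beam search: stack[i] holds states with i relabelled examples."""
    stacks = [[(upper_bound(data), data)]]
    for i in range(1, max_changes + 1):
        candidates = []
        for _, d in stacks[i - 1]:
            for j, (feats, y) in enumerate(d):
                # relabel example j, trying every alternative label
                for y2 in labels - {y}:
                    d2 = list(d)
                    d2[j] = (feats, y2)
                    candidates.append((upper_bound(d2), d2))
        candidates.sort(key=lambda s: s[0])
        stacks.append(candidates[:beam])  # keep only the O lowest-score states
    # the lowest-score state over all stacks is the error-candidate set
    return min((s for st in stacks for s in st), key=lambda s: s[0])

labels = {"JJ", "NN"}
data = [(("its", "year"), "JJ")] + [(("its", "year"), "NN")] * 3
score0 = upper_bound(data)
score1, relabelled = beam_search(data, labels, max_changes=1, beam=10)
print(score1 < score0)  # relabelling the lone JJ instance lowers the bound
```

On this toy data, relabelling the single JJ-tagged instance makes all labels agree, driving the entropy upper bound to zero, so the search selects exactly that change.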
