automatic annotation of data extracted from large websites

Scientific report document: "Automatic Collection of Related Terms from the Web" pptx

... (20%) out of 210 terms were collected by the system. This low recall primarily comes from the failure of automatic term recognition (case A in the above classification). Improvement of this ... term of the original seed term by hand. The result is shown in the left half (Evaluation I) of Table 2. In this evaluation, 519 terms out of 610 terms were correct: the precision is 85%. From ... terms that should be collected from each seed word, and then checked whether each of the target terms was included in the system output. We counted the number of target terms in the following...

Uploaded: 20/02/2014, 16:20

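The precision figure quoted in the snippet above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the helper name `precision` is hypothetical and does not come from the paper, and only the two counts (519 and 610) are taken from the snippet.

```python
def precision(correct: int, extracted: int) -> float:
    """Fraction of extracted items that were judged correct."""
    if extracted <= 0:
        raise ValueError("extracted must be a positive count")
    return correct / extracted


# Counts quoted in the snippet: 519 correct terms out of 610 collected terms.
p = precision(519, 610)
print(f"precision = {p:.1%}")
```

Rounded to a whole percentage, the result agrees with the 85% stated in the snippet.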


... The completeness of the output list increases monotonically with the total number of occurrences of each verb in the corpus. False positive rates are one to three percent of observations. ... architecture of the system, and that of this paper, directly reflects the three challenges described above. The system consists of three modules: 1. Verb detection: Finds some occurrences of verbs ... is evaluated in terms of efficiency and accuracy. The most useful estimate of efficiency is simply the density of observations in the corpus, shown in the first column of Table 3. The SF...

Uploaded: 20/02/2014, 21:20

Scientific report: "Automatic Compilation of Travel Information from Automatically Identified Travel Blogs" doc

... calculated Precision values from the top 5 to the top 100 at intervals of 5. Precision = the number of correctly extracted location-name / local-product pairs, divided by the number of extracted location-name ... in a list of products from the Google N-gram database. As shown in the table, 41 local products were newly extracted from travel blogs, while 15 and 7 were extracted from generic blogs and ... newly extracted. 5 Conclusion. In this paper, we proposed a method for identifying travel blogs from a blog database, and extracting travel information from them. In the identification of travel...

Uploaded: 08/03/2014, 01:20

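The evaluation described in the snippet above (precision measured from the top 5 to the top 100 at intervals of 5) can be sketched as follows. This is a minimal illustration under assumptions: the function names and the representation of judgments as a ranked list of booleans are hypothetical, not taken from the paper.

```python
from typing import Dict, List


def precision_at_k(is_correct: List[bool], k: int) -> float:
    """Precision over the k highest-ranked extracted pairs.

    is_correct[i] is True when the pair at rank i + 1 was judged correct.
    If fewer than k pairs were extracted, all of them are used.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = is_correct[:k]
    if not top:
        raise ValueError("no extracted pairs to evaluate")
    return sum(top) / len(top)


def precision_curve(is_correct: List[bool], start: int = 5,
                    stop: int = 100, step: int = 5) -> Dict[int, float]:
    """Precision at k = start, start + step, ... up to stop (or the list length)."""
    last = min(stop, len(is_correct))
    return {k: precision_at_k(is_correct, k) for k in range(start, last + 1, step)}


# Made-up judgments for ten ranked pairs, for illustration only.
judgments = [True, True, True, True, True, False, False, False, False, False]
print(precision_curve(judgments))
```

Capping the curve at the list length avoids reporting the same value repeatedly once the ranked list is exhausted.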
Scientific report: "Automatic Acquisition of Adjectival Subcategorization from Corpora" docx

... Introduction Research into automatic acquisition of lexical information from large repositories of unannotated text (such as the web, corpora of published text, etc.) is starting to produce large scale lexical ... recall rate. A new tool for linguistic annotation of SCFs in corpus data is also introduced which can considerably alleviate the process of obtaining training and test data for subcategorization acquisition. 1 ... Proceedings of the 43rd Annual Meeting of the ACL, pages 614–621, Ann Arbor, June 2005. © 2005 Association for Computational Linguistics Automatic Acquisition of Adjectival Subcategorization from Corpora Jeremy...

Uploaded: 08/03/2014, 04:22

Scientific report: "A Practical Solution to the Problem of Automatic Part-of-Speech Induction from Text" pdf

... problem of automatic word sense induction. Proceedings of ACL (Companion Volume), Barcelona, 195-198. Schütze, Hinrich (1993). Part-of-speech induction from scratch. Proceedings of ACL, Columbus, ... vector of each word from the centroid of its closest cluster, and to assign the differential vector to the most appropriate other cluster. This process can be repeated until the length of the ... strong negative effect on the results of the vector comparisons. Fortunately, the problem of data sparseness can be minimized by reducing the dimensionality of the matrix. An appropriate algebraic...

Uploaded: 08/03/2014, 04:22

Scientific report: "Automatic Identification of Word Translations from Unrelated English and German Corpora" pot

... in terms of corpus frequencies: $k_{11}$ = frequency of common occurrence of word A and word B; $k_{12}$ = corpus frequency of word A - $k_{11}$; $k_{21}$ = corpus frequency of word B - $k_{11}$; $k_{22}$ = size of corpus ... accuracy of our system we counted the number of times where an acceptable translation of the source word is ranked first. This was true for 72 of the 100 test words, which gives us an accuracy of ... more often than expected by chance in a corpus of English, then the German translations of teacher and school, Lehrer and Schule, should also co-occur more often than expected in a corpus of...

Uploaded: 08/03/2014, 06:20

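The four cell counts defined in the snippet above form a 2x2 co-occurrence table. The sketch below builds that table and computes a log-likelihood (G²) association score from it. Two points are assumptions rather than statements from the snippet: the definition of the fourth cell is truncated in the excerpt, so the code simply chooses it so that the four cells sum to the corpus size, and the excerpt does not show which association measure is applied, so G² is used here only as one common choice. All function and parameter names are hypothetical.

```python
import math
from typing import Tuple


def contingency(freq_ab: int, freq_a: int, freq_b: int,
                corpus_size: int) -> Tuple[int, int, int, int]:
    """Build the 2x2 co-occurrence table (k11, k12, k21, k22)."""
    k11 = freq_ab              # A and B occur together
    k12 = freq_a - k11         # A without B
    k21 = freq_b - k11         # B without A
    # ASSUMPTION: the snippet truncates this definition; here the last cell
    # is whatever remains so that the four cells sum to corpus_size.
    k22 = corpus_size - freq_a - freq_b + k11
    return k11, k12, k21, k22


def log_likelihood(k11: int, k12: int, k21: int, k22: int) -> float:
    """G^2 = 2 * sum over cells of k_ij * ln(k_ij * N / (row_i * col_j))."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for k, i, j in cells:
        if k > 0:  # an empty cell contributes nothing to the sum
            total += k * math.log(k * n / (rows[i] * cols[j]))
    return 2.0 * total


# Made-up frequencies, for illustration only.
table = contingency(freq_ab=10, freq_a=30, freq_b=20, corpus_size=1000)
print(table, log_likelihood(*table))
```

A score of zero means the two words co-occur exactly as often as independence would predict; larger values indicate a stronger departure from independence.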
Scientific report document: "Automatic Construction of Polarity-tagged Corpus from HTML Documents" docx

... polarity of words There are some works that discuss learning the polarity of words instead of sentences. Hatzivassiloglou and McKeown proposed a method of learning the polarity of adjectives from corpus ... of reviews are not available. In addition, the corpus created from reviews is often noisy as we discuss in Section 2. This paper proposes a novel method of building polarity-tagged corpus from ... subjective adjectives from a set of seed adjectives. The idea is to automatically identify the synonyms of the seed and to add them to the seed adjectives (Wiebe, 2000). Riloff et al. proposed...

Uploaded: 20/02/2014, 12:20

Scientific report: "Learning Condensed Feature Representations from Large Unsupervised Data Sets for Supervised Learning" docx

... and used in place of F in supervised learning. The largest contribution of our method is that it offers an architecture that can drastically reduce the number of features, i.e., from 10M features in ... [Figure: plots for δ = 1e+00, 1e+01, 1e+02, 1e+04 and the proposed method; legend: iCWR+COFER, COFER, iCWR and Sup., each with ostPA and ostL1FOBOS; x-axis: # of active features (log scale)] ... utilize a large amount of unsupervised data to supplement supervised data. Specifically, an approach that involves incorporating ‘clustering-based word representations (CWR)’ induced from unsupervised...

Uploaded: 07/03/2014, 22:20

Scientific report: "Automatic Acquisition of Ranked Qualia Structures from the Web" potx

... a fixed number of basic components”, “data mining comprises a range of data analysis techniques”, “books consist of a series of dots”, or “a conversation is made up of a series of observable ... NP'_C
Plural:
  “p(x) are made up of”: NP_QT is made up of NP'_C
  “p(x) are made of”: NP_QT are made of NP'_C
  “p(x) comprise”: NP_QT comprise (of)? NP'_C
  “p(x) consist of”: NP_QT consist of NP'_C
Table 2: Clues ... Pattern
Singular:
  “a(x) x is made up of”: NP_QT is made up of NP'_C
  “a(x) x is made of”: NP_QT is made of NP'_C
  “a(x) x comprises”: NP_QT comprises (of)? NP'_C
  “a(x) x consists of”: NP_QT consists of NP'_C
Plural: “p(x)...

Uploaded: 08/03/2014, 02:21

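The singular clue patterns listed in the snippet above ("is made up of", "is made of", "comprises (of)?", "consists of") can be approximated with a regular expression. The sketch below is a deliberate simplification and an assumption throughout: it uses plain string matching where the snippet's patterns refer to noun phrases (NP), it returns everything up to the next punctuation mark as the candidate, and the names `SINGULAR_CLUES` and `constitutive_candidates` are hypothetical.

```python
import re
from typing import List

# Regex fragments for the singular clues quoted in the snippet.
SINGULAR_CLUES = [
    r"is\s+made\s+up\s+of",
    r"is\s+made\s+of",
    r"comprises(?:\s+of)?",
    r"consists\s+of",
]


def constitutive_candidates(term: str, text: str) -> List[str]:
    """Return the text that follows '<term> <clue>' up to the next punctuation.

    ASSUMPTION: no noun-phrase chunking is done, so the returned strings are
    rough candidates rather than cleaned-up noun phrases.
    """
    clue_alternatives = "|".join(SINGULAR_CLUES)
    pattern = re.compile(
        rf"\b{re.escape(term)}\s+(?:{clue_alternatives})\s+([^.,;!?]+)",
        re.IGNORECASE,
    )
    return [match.group(1).strip() for match in pattern.finditer(text)]


# Made-up sentences, for illustration only.
print(constitutive_candidates(
    "conversation", "A conversation is made up of a series of turns."))
print(constitutive_candidates(
    "book", "The book comprises ten chapters, each quite short."))
```

Plural clues would need their own verb forms ("are made of", "consist of"), and a real system would replace the punctuation-based cut-off with proper noun-phrase detection.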
Scientific report: "Automatic Acquisition of Named Entity Tagged Corpus from World Wide Web" pot

... different processes: separation of functional words, segmentation of compound nouns, and verification of the usefulness of the extracted sentences. An NE is often concatenated with more than ... tagged corpus. [Figure 1: Automatic generation of NE tagged corpus from the web] ... considerations in this marking process because of the word ambiguity and boundary ambiguity of NE instances. To overcome ... Automatic Acquisition of Named Entity Tagged Corpus from World Wide Web Joohui An Dept. of CSE POSTECH Pohang, Korea 790-784 Seungwoo Lee Dept. of CSE POSTECH Pohang,...

Uploaded: 08/03/2014, 04:22

Scientific report: "Automatic construction of a hypernym-labeled noun hierarchy from text" docx

... cluster of cities that because of sparse data was assigned a poor hypernym. Some of the suggestions in the following section might correct this problem. Of the 50 noise words, a few of them ... up of multiple words, rather than just using the head nouns of the noun phrases. Automatic construction of a hypernym-labeled noun hierarchy from text Sharon A. Caraballo Dept. of ... Electronic Lexical Database. MIT Press. Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the Fourteenth International Conference on Computational...

Uploaded: 08/03/2014, 06:20

Scientific report: "Acquisition of Conceptual Data Models from Natural Language Descriptions" doc

... treatment of a variety of natural language quantifiers. To adapt the form of lexical entries in the McCord parser from the database query task to the present one, generic definitions of word ... attributes of the relational database predicates in the parse tree. The existence of a relationship between two database relations is indicated by the sharing of attributes. If the identifier of ... theoretical problem of what a semantics of natural language should consist of by an operational approach in which the propositional content of a sentence is represented by a database tuple, and...

Uploaded: 09/03/2014, 01:20

Integrated Analysis of Data from MRC Fisheries Monitoring Programmes in the Lower Mekong Basin


... coefcients of the linear dependence of relative abundance of spawning stock size at locations in the LMB 87 Table 17 Regression coefcients of the linear dependence of mean body weight of species ... (1999–2010). Analyses of much of the data generated by these programmes had been undertaken, some of which had been published. However, only a limited amount of work has been done to construct time series of the ... TS-GL system 88 Table 18 Results of the GLM to test the dependence of dai catch rates on the quantity of fence 89 Table 19 Statistics of conscation and destruction of illegal shing gears (2000–2009)....

Uploaded: 14/03/2014, 08:47

Scientific report: "Automatic Adaptation of Annotation Standards: Chinese Word Segmentation and POS Tagging – A Case Study" potx

... Experiments of annotation adaptation from PD to CTB 5.0 for word segmentation and POS tagging show that this strategy can make effective use of the knowledge from the corpus with different annotations. ... Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pages 522–530, Suntec, Singapore, 2-7 August 2009. © 2009 ACL and AFNLP Automatic Adaptation of Annotation Standards: Chinese ... suffers from the limited size of training data (Chiang, 2007; Bikel and Chiang, 2000). We believe this is also a reason why state-of-the-art accuracy for Chinese parsing is much lower than that of...

Uploaded: 17/03/2014, 01:20

Scientific report: "Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II Treebank" potx

... grammars (CFGs extracted from treebanks) are very large and grow with the size of the treebank. We were interested in discovering whether the acquisition of lexical material on the same data displays ... Thesis, Stanford University, CA. C. Manning. 1993. Automatic Acquisition of a Large Subcategorisation Dictionary from Corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational ... induction of lexical resources is part of a larger project on the acquisition of wide-coverage, robust, probabilistic, deep unification grammar resources from treebanks. We are already using the extracted...

Uploaded: 17/03/2014, 06:20
