large linguistically-processed web corpora

Document: Scientific report: "Large linguistically-processed Web corpora for multiple languages" doc

... model, and thus identifying it with automated techniques is far from trivial. 88 Large linguistically-processed Web corpora for multiple languages Marco Baroni SSLMIT University of Bologna Italy baroni@sslmit.unibo.it Adam ... sizes of over 1 billion words in each case. We provide Web access to the corpora in our query tool, the Sketch Engine. 1 Introduction The Web contains vast amounts of linguistic data for many ... contain connected text. Second, the TreeTagger was not trained on Web data, and thus its performance on texts that are heavy on Web-like usage (e.g., texts all in lowercase, colloquial forms...

Uploaded: 22/02/2014, 02:20

Scientific report: "An Efficient Indexer for Large N-Gram Corpora" docx

... tool that implements efficient indexing and retrieval of large N-gram datasets, such as the Web1T 5-gram corpus. Our tool indexes the entire Web1T dataset with an index size of only 100 MB and performs ... Language Processing (NLP), the models give a much better performance with larger data sets. However, the large data sets, such as the Web1T 5-gram corpus (Brants and Franz, 2006), present a major challenge. ... language models and applying them to various problems, they are not designed for very large corpora, such as the Web1T 5-gram corpus (Brants and Franz, 2006), hence they do not provide efficient implementations...

Uploaded: 07/03/2014, 22:20

Scientific report: "Constructing Transliteration Lexicons from Web Corpora" docx

... existing dictionaries. Regularly exploring Web corpora is a good way to update dictionaries. Transliterated-term extraction using non-parallel corpora has also been conducted (Kuo, 2003). ... into Chinese syllables using the trained cross- Constructing Transliteration Lexicons from Web Corpora Jin-Shea Kuo 1, 2 Ying-Kuei Yang 2 1 Chung-Hwa Telecommunication Laboratories, ... Internet is one of the largest distributed databases in the world. It comprises various kinds of data and at the same time is growing rapidly. Though the World Wide Web is not systematically...

Uploaded: 17/03/2014, 06:20

Scientific report: "Creating Multilingual Translation Lexicons with Regional Variations Using Web Corpora" potx

Uploaded: 31/03/2014, 03:20

Document: The Anatomy of a Large-Scale Hypertextual Web Search Engine ppt

... in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine ... World Wide Web Worm (WWWW) [McBryan 94] had an index of 110,000 web pages and web accessible documents. As of November, 1997, the top search engines claim to index from 2 million (WebCrawler) ... PageRank: Bringing Order to the Web The citation (link) graph of the web is an important resource that has largely gone unused in existing web search engines. We have created maps containing as many as...

Uploaded: 24/01/2014, 20:20

Document: Large debt financing syndicated Loans versus corporate bonds docx

... financing for large firms. Since the introduction of the euro syndicated loans and corporate bonds have become the main sources for large debt financing: in both markets, firms can raise large amounts ... find that larger firms are more likely to issue debt in the syndicated loan markets than the corporate bond market. Secondly, when including a larger sample with smaller firms from the larger ... direct corporate bond financing: In both markets, firms can tap the financial markets to raise large amounts of funds with medium and long-term maturities. Today, many of Europe’s largest...

Uploaded: 16/02/2014, 02:20

Document: Scientific report: "Finding Parts in Very Large Corpora" pdf

... the machines at our disposal, so still larger corpora would not be out of the question. Finally, as noted above, Hearst [2] tried to find parts in corpora but did not achieve good results. ... Lexicography 3 (1990), 235-245. [2] Marti Hearst, "Automatic acquisition of hyponyms from large text corpora," in Proceedings of the Fourteenth International Conference on Computational ... 3* 4* 2 0 4* 3* 1 0 4* 3* 3* 3* 4* 1 2 1 0 2 4* 64 Finding Parts in Very Large Corpora Matthew Berland, Eugene Charniak rob, ec@cs.brown.edu Department of Computer Science...

Uploaded: 20/02/2014, 19:20

Document: Scientific report: "Creating a Multilingual Collocation Dictionary from Large Text Corpora" docx

... Extraction with Fips Collocations are extracted from syntactically analysed corpora. The analysis is performed by Fips, a large-scale parser based on an adaptation of Chomsky's "Principles ... returns chunks of partial analyses. If 132 Creating a Multilingual Collocation Dictionary from Large Text Corpora Luka Nerima, Violeta Seretan, Eric Wehrli Language Technology Laboratory (LATL), Dept. ... Linguistics, 19(1):61-74. Gale W. and Church K. (1991). A program for aligning sentences in bilingual corpora. Computational Linguistics, 19(1):75-102. Gross, G. (1996). Les expressions figées en...

Uploaded: 22/02/2014, 02:20

Scientific report: "Creating a Multilingual Collocation Dictionary from Large Text Corpora" ppt

... coherent text spans found in the corpora resources. At the same time, we intend to provide a quite precise and delimited context, that's why we do not consider a larger context (such as the whole ... using mark-up from text encoding. 133 Creating a Multilingual Collocation Dictionary from Large Text Corpora Luka Nerima, Violeta Seretan, Eric Wehrli Language Technology Laboratory (LATL), Dept. ... collocation's keys occur on the same sentence, as they are in a syntactical relation). When parallel corpora are available, also the translation equivalents of the collocation context are displayed,...

Uploaded: 08/03/2014, 21:20

Scientific report: "CSNIPER Annotation-by-query for non-canonical constructions in large corpora" pdf

... annotation tasks that require manual analysis over large corpora. The approach is generalizable to any kind of linguistic phenomena that can be located in corpora on the basis of queries and require manual ... (Corpus Sniper), a tool that implements (i) a web-based multi-user scenario for identifying and annotating non-canonical grammatical constructions in large corpora based on linguistic queries and (ii) ... for the annotation of linguistic phenomena whose investigation requires the analysis of large corpora due to a relatively low frequency of instances and whose identification requires expert...

Uploaded: 16/03/2014, 20:20

Scientific report: "Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation" ppt

... Linguistically motivated large-scale NLP with C&C and Boxer. In Proc. ACL Demo and Poster Sessions, pages 33–36. Ido Dagan and Alan Itai. 1990. Automatic processing of large corpora for the resolution ... parallel text. 5 Data Web-scale text data is used for monolingual feature counts, parallel text is used for classifier co-training, and labeled data is used for training and evaluation. Web-scale N-gram ... Mary Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330. Preslav Nakov and Marti Hearst. 2005. Using the web as an implicit training...

Uploaded: 17/03/2014, 00:20

Scientific report: "A System for Large-Scale Acquisition of Verbal, Nominal and Adjectival Subcategorization Frames from Corpora" pot

... acquisition of lexical information from large repositories of unannotated text (such as the web, corpora of published text, etc.) is starting to produce large scale lexical resources which include ... for large-scale acquisition of subcategorization frames (SCFs) from English corpus data which can be used to acquire comprehensive lexicons for verbs, nouns and adjectives. The system incorporates ... Association for Computational Linguistics A System for Large-Scale Acquisition of Verbal, Nominal and Adjectival Subcategorization Frames from Corpora Judita Preiss, Ted Briscoe, and Anna Korhonen Computer...

Uploaded: 17/03/2014, 04:20

Scientific report: "Discovering Relations among Named Entities from Large Corpora" pot

... discovery, however, needed large annotated corpora which cost a great deal of time and effort. We propose an unsupervised method for relation discovery from large corpora. The key idea is clustering ... discovering relations among various entities from large text corpora. Our method does not need the richly annotated corpora required for supervised learning — corpora which take great time and effort ... frequently mentioned in large corpora. Conversely, relations mentioned once or twice are not likely to be important. Our basic idea is as follows: 1. tagging named entities in text corpora 2. getting...

Uploaded: 17/03/2014, 06:20

Developing Large Web Applications pdf

... contribute to the complexity of many large web applications. Typically, large web applications have the following characteristics: Continuous availability Most large web applications must be running ... load times. Web developers need to write code that is especially robust. Large user base Large web applications usually have large numbers of users. This necessitates management of a large number ... information architecture of the module. 28 | Chapter 3: Large-Scale HTML www.it-ebooks.info CHAPTER 1 The Tenets As applications on the Web become larger and larger, how can web developers manage the complexity?...

Uploaded: 23/03/2014, 04:20

Scientific report: "Scaling to Very Very Large Corpora for Natural Language Disambiguation" potx

... unsupervised learning with large training corpora, in hopes of being able to obtain the benefits that come from significantly larger training corpora without incurring too large a cost. 2 Confusion ... exploiting very large corpora when labeled data comes at a cost. 1 Introduction Machine learning techniques, which automatically learn linguistic information from online text corpora, have ... a large corpus. Computers and the Humanities, 26:415-439. Golding, A. R. (1995). A Bayesian hybrid method for context-sensitive spelling correction. In Proc. 3rd Workshop on Very Large Corpora, ...

Uploaded: 23/03/2014, 19:20

Scientific report: "AUTOMATIC ACQUISITION OF A LARGE SUBCATEGORIZATION DICTIONARY FROM CORPORA" doc

... (ed.). 1977. Webster's seventh new collegiate dictionary. Springfield, MA: G. & C. Merriam. Hearst, Marti. 1992. Automatic Acquisition of Hyponyms from Large Text Corpora. In Pro- ... frames were determined are in Webster's (Gove 1977) (the only noticed exceptions being certain instances of prefixing, such as overcook and repurchase), but a larger number of the verbs ... AUTOMATIC ACQUISITION OF A LARGE SUBCATEGORIZATION DICTIONARY FROM CORPORA Christopher D. Manning Xerox PARC and Stanford University Stanford...

Uploaded: 23/03/2014, 20:20
