... actually function as a single word, and we often condense them into the virtual words “UK” and “w.r.t.”. In order to extract “words” from text streams, unsupervised word segmentation is an important research ... word boundary between two neighboring words, they can leverage only up to bigram word dependencies. In this paper, we extend this work to propose a more efficient and accurate unsupervised word ... probabilities over words? If a lexicon is finite, we can use a uniform prior G_0(w) = 1/|V| for every word w in lexicon V. However, with word segmentation every substring could be a word, thus the...
Uploaded: 17/03/2014, 01:20
Uploaded: 23/03/2014, 18:20
Scientific report: "Fully Unsupervised Word Segmentation with BVE and MDL" pdf
Uploaded: 30/03/2014, 21:20
Scientific report: "Using Rejuvenation to Improve Particle Filtering for Bayesian Word Segmentation" doc
Uploaded: 30/03/2014, 17:20
Document: Word Segmentation for Vietnamese Text Categorization: An online corpus approach pptx
... Vietnamese word segmentation is very problematic, especially without a manual segmentation test corpus. Therefore, we perform two experiments, one is done by human judgment for word segmentation ... ways of segmentation, i.e. the important words are segmented correctly while less important words may be segmented incorrectly. Table 6 represents the human judgment for our word segmentation ... inhomogeneous phenomenon in judgment word segmentation. However, the acceptable segmentation percentage is satisfactory. Nearly eighty percent of word segmentation outcome does not make the...
Uploaded: 12/12/2013, 11:15
Document: Scientific report: "Unsupervized Word Segmentation: the case for Mandarin Chinese" doc
... len(w_i), where W is the segmentation corresponding to the sequence of words w_0 w_1 ... w_m, and len(w_i) is the length of a word w_i used here to be able to compare segmentations resulting ... redefine the sentence segmentation problem as the maximization of the autonomy measure of its words. For a character sequence s, if we call Seg(s) the set of all the possible segmentations, then ... against the corpora from the Second International Chinese Word Segmentation Bakeoff (Emerson, 2005). These corpora cover 4 different segmentation guidelines from various origins: Academia Sinica...
Uploaded: 19/02/2014, 19:20
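The excerpt above frames segmentation as picking, from Seg(s), the segmentation whose words maximize an autonomy measure, with len(w_i) used to make segmentations of different word counts comparable. The exact formula is elided in the excerpt, so the following is only a minimal sketch under a stated assumption: the score of a segmentation is taken to be the sum of autonomy(w_i) * len(w_i). The names `best_segmentation`, `toy_autonomy`, `TOY_SCORES` and `max_len` are hypothetical and do not come from the paper.

```python
from typing import Callable, List


def best_segmentation(
    s: str, autonomy: Callable[[str], float], max_len: int = 4
) -> List[str]:
    """Return the segmentation of s with the highest total score.

    ASSUMPTION: score(W) = sum(autonomy(w) * len(w) for w in W).
    The excerpt does not show the real objective in full.
    """
    n = len(s)
    best = [float("-inf")] * (n + 1)  # best[i]: best score for s[:i]
    back = [0] * (n + 1)              # back[i]: start of the last word in s[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = s[start:end]
            score = best[start] + autonomy(word) * len(word)
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back-pointers from the end of the string to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(s[start:end])
        end = start
    return words[::-1]


# Hypothetical autonomy scores, for illustration only.
TOY_SCORES = {"ab": 1.0, "cd": 0.8, "abc": 0.5}


def toy_autonomy(word: str) -> float:
    return TOY_SCORES.get(word, 0.1)


print(best_segmentation("abcd", toy_autonomy))
```

Dynamic programming avoids enumerating every member of Seg(s), which grows exponentially with the length of s; the `max_len` cap is a convenience of this sketch, not something the excerpt specifies.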
Document: Scientific report: "Unsupervised Discourse Segmentation of Documents with Inherently Parallel Structure" pdf
... 151–155, Uppsala, Sweden, 11-16 July 2010. © 2010 Association for Computational Linguistics. Unsupervised Discourse Segmentation of Documents with Inherently Parallel Structure. Minwoo Jeong and Ivan Titov, Saarland ... friendlier user interfaces. To address this problem, we propose an unsupervised Bayesian model for joint discourse segmentation and alignment. We apply our method to the “English as a second ... systems. Discourse segmentation of the documents composed of parallel parts is a novel and challenging problem, as previous research has mostly focused on the linear segmentation of isolated...
Uploaded: 20/02/2014, 04:20
Document: Scientific report: "Joint Word Segmentation and POS Tagging using a Single Perceptron" docx
... pattern “number word” + “number word” can help to prevent segmenting a long number word into two words. In order to avoid error propagation and make use of POS information for word segmentation, ... last word can be a complete word or a partial word. A problem arises in whether to give POS tags to incomplete words. If partial words are given POS tags, it is likely that some partial words are ... Chinese, are shown in Table 2. The word segmentation features are extracted from word bigrams, capturing word, word length and character information in the context. The word length features are normalized,...
Uploaded: 20/02/2014, 09:20
Document: Scientific report: "Rethinking Chinese Word Segmentation: Tokenization, Character Classification, or Wordbreak Identification" pdf
... co-occurrence. Word based model. In this model, statistical data about word boundary frequencies for each character is retrieved word-wise. For example, in the case of a monosyllabic word only two word ... International Chinese Word Segmentation Bakeoff. Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, Sapporo, Japan, July 2003. Xue, N. 2003. Chinese Word Segmentation as Character ... that we introduce is that Chinese word segmentation is the classification of a string of character-boundaries (CB’s) into either word-boundaries (WB’s) and non-word-boundaries. In Chinese, CB’s are...
Uploaded: 20/02/2014, 12:20
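The excerpt above treats segmentation as labelling each character-boundary (CB) as a word-boundary (WB) or a non-word-boundary. As a rough illustration of that representation only (not of the paper's classifier), the sketch below converts between a word list and one binary label per internal character boundary. The function names and the 0/1 encoding are assumptions made for this example.

```python
from typing import List


def words_to_boundaries(words: List[str]) -> List[int]:
    """Map non-empty words to one label per internal character boundary.

    1 marks a word-boundary, 0 a non-word-boundary (assumed encoding).
    A text of n characters has n - 1 internal boundaries.
    """
    labels: List[int] = []
    for word in words:
        labels.extend([0] * (len(word) - 1))  # boundaries inside the word
        labels.append(1)                      # boundary after the word
    return labels[:-1]  # the boundary after the last word is not internal


def boundaries_to_words(text: str, labels: List[int]) -> List[str]:
    """Rebuild the word list from the text and its boundary labels."""
    if len(labels) != max(len(text) - 1, 0):
        raise ValueError("expected one label per internal character boundary")
    words: List[str] = []
    start = 0
    for position, is_word_boundary in enumerate(labels, start=1):
        if is_word_boundary:
            words.append(text[start:position])
            start = position
    if text:
        words.append(text[start:])
    return words


labels = words_to_boundaries(["ab", "c", "de"])
print(labels)
print(boundaries_to_words("abcde", labels))
```

Under this view any binary classifier over boundary positions yields a segmentation, since every label sequence of the right length decodes to a valid word list.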
Document: Scientific report: "Chinese Word Segmentation without Using Lexicon and Hand-crafted Training Data" pdf
... Chinese word is composed of either single or multiple characters. Chinese texts are explicitly concatenations of characters, words are not delimited by spaces as that in English. Chinese word segmentation ... Finite-State Word Segmentation Algorithm for Chinese", Proc. of the 32nd Annual Meeting of ACL, New Mexico, 1994 [9] Palmer D.D., "A Trainable Rule-based Algorithm for Word Segmentation", ... Automatic Word Segmentation System for Written Chinese Texts", Journal of Chinese Information Processing, Vol. 1, No. 2, 1987 (in Chinese) [2] Fan C.K., Tsai W.H., "Automatic Word Identification...
Uploaded: 20/02/2014, 18:20
Document: Scientific report: "Bilingually Motivated Domain-Adapted Word Segmentation for Statistical Machine Translation" pptx
... iterations). 4 Word Lattice Decoding 4.1 Word Lattices In the decoding stage, the various segmentation alternatives can be encoded into a compact representation of word lattices. A word lattice ... languages marked with word boundaries to construct bilingually motivated “words”. 6.5 Using different word aligners The above experiments rely on GIZA++ to perform word alignment. We next ... 29.75 BS-WordLattice 21.76 31.75 Table 11: BS on IWSLT data sets using MTTK 7 Related Work (Xu et al., 2004) were the first to question the use of word segmentation in SMT and showed that the segmentation...
Uploaded: 22/02/2014, 02:20
Scientific report: "Incremental Joint Approach to Word Segmentation, POS Tagging, and Dependency Parsing in Chinese" potx
... 1996. A stochastic finite-state word-segmentation algorithm for Chinese. Computational Linguistics, 22. Weiwei Sun. 2011. A stacked sub-word model for joint Chinese word segmentation and part-of-speech ... existence of the last word; however, whether or not this word exists changes the whole syntactic structure and segmentation of the sentence. This is an example in which word segmentation cannot ... improve the segmentation of out-of-vocabulary (OOV) words. Unlike languages such as Japanese that use a distinct character set (i.e. katakana) for foreign words, the transliterated words in Chinese,...
Uploaded: 07/03/2014, 18:20
Scientific report: "Exploring Deterministic Constraints: From a Constrained English POS Tagger to an Efficient ILP Solution to Chinese Word Segmentation" ppt
... Character- and word-based features As studied in previous work, word-based feature templates usually include the word itself, sub-words contained in the word, contextual characters/words and so ... constraints as straightforwardly as in English POS tagging. 3.2 Word-based word segmentation A word-based CWS decoder finds the highest scoring segmentation sequence ŵ that is composed by the input ... are incorporated into word-based CWS models, some word-based features are no longer of interest, such as the starting character of a word, sub-words contained in the word, contextual characters...
Uploaded: 07/03/2014, 18:20
Scientific report: "A Cascaded Linear Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging" pdf
... end of the word • s: a single-character word We can extract segmentation result by splitting the labelled result into subsequences of pattern s or bm*e which denote single-character word and ... probabilities in HMM. 4.2 Word-POS Co-occurrence Model Given a training corpus with POS tags, we can train a word-POS co-occurrence model to approximate the probability that the word sequence of the ... 3-gram word language model measuring the fluency of the segmentation result, a 4-gram POS language model functioning as the product of state-transition probabilities in HMM, and a word-POS co-occurrence...
Uploaded: 08/03/2014, 01:20
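The excerpt above says a segmentation can be read off a labelled character sequence by splitting it into subsequences matching s or bm*e. A minimal sketch of that decoding step follows, assuming one tag character per input character, given as a string over the letters b, m, e and s. The function name `decode_tags` and the choice to reject ill-formed tag strings are assumptions of this example, not details from the paper.

```python
import re
from typing import List

# A well-formed tag string is a run of words, each tagged "s" or "bm*e".
WORD_PATTERN = re.compile(r"s|bm*e")
VALID_SEQUENCE = re.compile(r"(?:s|bm*e)*")


def decode_tags(chars: str, tags: str) -> List[str]:
    """Split chars into words according to a b/m/e/s tag string."""
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    if VALID_SEQUENCE.fullmatch(tags) is None:
        # Assumed policy: refuse sequences such as "bb" or "me" rather than
        # silently dropping the characters they cover.
        raise ValueError(f"ill-formed tag sequence: {tags!r}")
    # Each match covers exactly one word; reuse its span on the characters.
    return [chars[m.start():m.end()] for m in WORD_PATTERN.finditer(tags)]


print(decode_tags("abcdef", "bmesbe"))
```

A real tagger may emit ill-formed sequences, in which case a repair strategy would be needed in place of the exception used here.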
Scientific report: "Language Model Based Arabic Word Segmentation" pdf
... 5 & 6. Step 3: Keep the top N highest scored segmentations. 3.2.1 Possible Segmentations of a Word Possible segmentations of a word token are restricted to those derivable from a ... We have presented a robust word segmentation algorithm which segments a word into a prefix*-stem-suffix* sequence, along with experimental results. Our Arabic word segmentation system implementing ... Errors 10 K Words 1,844 (76.9%) 98 (4.1%) 455 (19.0%) 2,397 20 K Words 1,174 (71.1%) 82 (5.0%) 395 (23.9%) 1,651 40 K Words 1,005 (69.9%) 81 (5.6%) 351 (24.4%) 1,437 110 K Words 333 (39.6%)...
Uploaded: 08/03/2014, 04:22
Scientific report: "Unsupervised Word Alignment with Arbitrary Features" potx
... multiple source words can be computed linearly in the number of source words considered (since the source string is always observable), features that look at multiple target words require exponential ... (0.8M words) Chinese-English corpus from the tourism and travel domain (Takezawa et al., 2002), a corpus of Czech-English news commentary (3.1M words), and an Urdu-English corpus (2M words) ... models trained to maximize likelihood: infrequent source words act as “garbage collectors”, with many target words aligned to them (the word dislike in the Model 4 alignment in Figure 2 is an...
Uploaded: 17/03/2014, 00:20
Scientific report: "A Stacked Sub-Word Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging" potx
... stacked sub-word model. Given multiple word segmentations of one sentence, we formally define a sub-word structure that maximizes the agreement of non-word-break positions. Based on the sub-word structure, ... predicted words and their POS information as clues to find a new word. After one word is found and classified, solvers move on and search for the next possible word. This word-by-word method ... data for sub-word tagging. 3 Method 3.1 Architecture In our stacked sub-word model, joint word segmentation and POS tagging is decomposed into two steps: (1) coarse-grained word segmentation...
Uploaded: 17/03/2014, 00:20
Scientific report: "Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation" doc
... Generation of Words with Internal Structures Words with rich internal structures can be described using a context-free grammar formalism as word → root (3), word → word suffix (4), word → prefix word (5). Here ... out-of-vocabulary word 戽䊂 䠽吼 ‘English People’. Had there been only a few words with internal structures, current Chinese word segmentation paradigm would be sufficient. We could simply recover word structures ... always word segmented or even part-of-speech tagged. That is, the bracketing in our case is around characters instead of words. Another observation is we can still evaluate Chinese word segmentation...
Uploaded: 17/03/2014, 00:20
Scientific report: "An Error-Driven Word-Character Hybrid Model for Joint Chinese Word Segmentation and POS Tagging" docx
Uploaded: 17/03/2014, 01:20
Scientific report: "Automatic Adaptation of Annotation Standards: Chinese Word Segmentation and POS Tagging – A Case Study" potx
Uploaded: 17/03/2014, 01:20