language model training data

Scientific report: "Intelligent Selection of Language Model Training Data" ppt

... non-domain-specific language models, for each sentence of the text source used to produce the latter language model. We show that this produces better language models, trained on less data, than both random data ... available data as language model training data. This not only produces a language model better matched to the domain of interest (as measured in terms of perplexity on held-out in-domain data), ... universal truth that output quality can always be improved by using more language model training data, but only if the training data is reasonably well-matched to the desired output. This presents a...

Uploaded: 07/03/2014, 22:20

5 348 0
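
The excerpt above describes scoring each sentence of a large text source with an in-domain and a non-domain-specific language model. A minimal sketch of that idea follows, assuming the score is the difference of per-word cross-entropies and that sentences scoring below a threshold are kept. The add-one smoothed unigram models, the function names, and the threshold value are illustrative assumptions, not details taken from the excerpt.

```python
import math
from collections import Counter

def train_unigram(sentences):
    """Return a log-probability function for an add-one smoothed unigram model."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen words

    def logprob(word):
        return math.log((counts[word] + 1) / (total + vocab))

    return logprob

def cross_entropy(logprob, sentence):
    """Per-word cross-entropy (in nats) of a non-empty sentence."""
    words = sentence.split()
    return -sum(logprob(w) for w in words) / len(words)

def select(in_domain, general, pool, threshold=0.0):
    """Keep pool sentences whose in-domain cross-entropy is lower than
    their general cross-entropy by more than the (assumed) threshold."""
    lp_in = train_unigram(in_domain)
    lp_gen = train_unigram(general)
    scored = [(cross_entropy(lp_in, s) - cross_entropy(lp_gen, s), s) for s in pool]
    return [s for score, s in sorted(scored) if score < threshold]

# Toy data, invented for illustration only.
in_domain = ["the patient has a fever", "the doctor treats the patient"]
general = ["the market fell today", "stocks rose in the market",
           "the patient investor waits"]
pool = ["the doctor has a patient", "stocks fell in the market"]
selected = select(in_domain, general, pool)
```

In practice the two models would be stronger n-gram models and the threshold would be tuned on held-out in-domain data; the unigram version only shows the shape of the computation.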

Document: Scientific report: "A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation" doc

... 5-gram/2-SLM+2-gram/4-SLM+5-gram/PLSA language model improves both significantly. Bear in mind that Charniak et al. (2003) integrated Charniak’s language model with the syntax-based translation model Yamada and ... composite language model, both the data and the parameters can’t be stored in a single machine, so we have to resort to distributed computing. The topic of large scale distributed language models ... Large language models in machine translation. The 2007 Conference on Empirical Methods in Natural Language Processing (EMNLP), 858-867. E. Charniak. 2001. Immediate-head parsing for language models....

Uploaded: 20/02/2014, 04:20

10 568 0

Document: Scientific report: "Smoothing a Tera-word Language Model" doc

... The interpolated models always incorporate the lower order distribution Pr(c|b) whereas the back-off models consider it only when the n-gram abc has not been observed in the training data. 3 Data and ... Goodman. 2001. A bit of progress in language modeling. Computer Speech and Language. R. Kneser and H. Ney. 1995. Improved backing-off for m-gram language modeling. In International Conference ... words that precede the n-gram in the training data. Unfortunately this number is not exactly equal to the N(∗bc) value given in the Web 1T dataset because the dataset does not include low count...

Uploaded: 20/02/2014, 09:20

4 425 1
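
The distinction drawn in the excerpt above, between interpolated models that always mix in Pr(c|b) and back-off models that consult it only for unseen n-grams abc, can be illustrated with a small sketch. It uses absolute discounting with a fixed discount `D`, which is an assumption chosen for brevity and not necessarily the smoothing studied in the paper. The `lower` function, the vocabulary, and the counts are likewise hypothetical.

```python
from collections import Counter

D = 0.5  # absolute discount; an assumed value for illustration

def make_models(trigrams, lower, vocab):
    """Build interpolated and back-off estimators of Pr(c | a b).

    `lower(b, c)` is any normalized lower-order distribution Pr(c | b).
    Both estimators assume the context (a, b) was seen and c is in vocab.
    """
    tri = Counter(trigrams)
    ctx = Counter((a, b) for a, b, _ in trigrams)

    def discounted(a, b, c):
        return max(tri[(a, b, c)] - D, 0.0) / ctx[(a, b)]

    def reserved(a, b):
        # Probability mass freed by discounting every seen continuation.
        seen = sum(1 for w in vocab if tri[(a, b, w)] > 0)
        return D * seen / ctx[(a, b)]

    def interpolated(a, b, c):
        # The lower-order term is always added, seen or not.
        return discounted(a, b, c) + reserved(a, b) * lower(b, c)

    def backoff(a, b, c):
        # The lower-order term is used only when abc was never observed.
        if tri[(a, b, c)] > 0:
            return discounted(a, b, c)
        unseen_mass = sum(lower(b, w) for w in vocab if tri[(a, b, w)] == 0)
        return reserved(a, b) * lower(b, c) / unseen_mass

    return interpolated, backoff

# Toy data, invented for illustration only.
vocab = ["x", "y", "z"]
trigrams = [("a", "b", "x"), ("a", "b", "x"), ("a", "b", "y")]
uniform = lambda b, c: 1.0 / len(vocab)
interpolated, backoff = make_models(trigrams, uniform, vocab)
```

Both estimators sum to one over the vocabulary; they differ in how the reserved mass is spread, which is why the seen trigram "a b x" gets a higher probability under interpolation than under back-off here.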

Document: Scientific report: "A Succinct N-gram Language Model" ppt

... -gram language models are compressed into 10 GB, which is comparable to a lossy representation (Talbot and Brants, 2008). 2 N-gram Language Model We assume a back-off N-gram language model ... language model structure and word identifiers. In Proc. of ICASSP 2003, volume 1. A. Stolcke. 1998. Entropy-based pruning of backoff language models. In Proc. of the ARPA Workshop on Human Language ... representation with block compression. N-gram language models of 42.65GB were compressed to 18.37GB. Finally, the 8-bit quantized N-gram language models are represented by 9.83GB of space. Table...

Uploaded: 20/02/2014, 09:20

4 458 0

Document: Scientific report: "A Phonotactic Language Model for Spoken Language Identification" pptx

... Language Recognition Evaluation (LRE) data. The database was intended to establish a baseline of performance capability for language recognition of conversational telephone speech. The database ... statistical language modeling, and language identification. A typical LID system is illustrated in Figure 1 (Zissman, 1996), where language dependent voice tokenizers (VT) and language models ... the 1996 NIST Language Recognition Evaluation database. 1 Introduction Spoken language and written language are similar in many ways. Therefore, much of the research in spoken language identification,...

Uploaded: 20/02/2014, 15:20

8 437 0

Document: Scientific report: "Japanese OCR Error Correction using Character Shape Similarity and Statistical Language Model" pptx

... very efficient. 5 Experiments 5.1 Training Data for the Language Model We used the EDR Japanese Corpus Version 1.0 (EDR, 1991) to train the language model. It is a corpus of approximately ... Corpus for training. The first column of Table 1 shows the number of sentences, words, and characters of the training set. Table 1: The amount of the training data and the test data for handwritten ... $P(C|X) = \frac{P(X|C)P(C)}{P(X)}$ (2) $P(C)$ is called the language model. It is computed from the training corpus. Let us call $P(X|C)$ the OCR model. It can be computed from the a priori likelihood...

Uploaded: 20/02/2014, 18:20

7 472 0
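
Equation (2) in the excerpt above combines an OCR model P(X|C) with a language model P(C). Since P(X) is the same for every candidate C, choosing the best correction only needs the product P(X|C)P(C). The following is a minimal sketch of that selection step; the function name, the candidate list, and all probability values are hypothetical and are not taken from the paper.

```python
import math

def best_correction(observed, candidates, ocr_model, language_model):
    """Return the candidate C maximizing P(X|C) * P(C) for observed X.

    P(X) is constant across candidates, so it is omitted. Log space
    avoids underflow when the probabilities are small.
    """
    def score(c):
        return math.log(ocr_model(observed, c)) + math.log(language_model(c))

    return max(candidates, key=score)

# Toy probability tables, invented for illustration only.
lm_table = {"cat": 0.7, "cot": 0.3}                          # P(C)
ocr_table = {("c0t", "cat"): 0.1, ("c0t", "cot"): 0.4}       # P(X|C)

choice = best_correction(
    "c0t",
    ["cat", "cot"],
    ocr_model=lambda x, c: ocr_table[(x, c)],
    language_model=lambda c: lm_table[c],
)
```

Here "cot" wins (0.4 × 0.3 = 0.12 against 0.1 × 0.7 = 0.07) even though the language model alone prefers "cat", which shows how the two models trade off.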

Document: Scientific report: "Chinese Word Segmentation without Using Lexicon and Hand-crafted Training Data" pdf

... 'vxyzw' is a Chinese Chinese Word Segmentation without Using Lexicon and Hand-crafted Training Data Sun Maosong, Shen Dayang*, Benjamin K Tsou** State Key Laboratory of Intelligent Technology ... statistical data required by calculating mi and dts, in fact it is character bigram, is automatically derived from a news corpus of about 20M Chinese characters. The testing texts and training ... texts without making use of any lexicon and hand-crafted linguistic resource. The statistical data required by the algorithm, that is, mutual information and the difference of t-score between...

Uploaded: 20/02/2014, 18:20

7 396 0

Document: Scientific report: "Supervised Grammar Induction using Training Data with Limited Constituent Information *" docx

... held-out data, and the rest for training. Before proceeding with the main discussion on training from the ATIS, we briefly describe the pretraining stage of the adaptive strategy. 5.1.1 Pretraining ... existing labeled data. We hope that pretraining the grammars on these data might place them in a better position to learn from the new, sparsely labeled data. In the pretraining stage for ... ing with the ATIS data. 5.1.2 Partially Supervised Training on ATIS We now return to the main focus of this experi- ment: learning from sparsely annotated ATIS training data. To verify whether...

Uploaded: 20/02/2014, 19:20

7 423 0

Document: Scientific report: "NATURAL-LANGUAGE ACCESS TO DATABASES--THEORETICAL/TECHNICAL ISSUES" docx

... questions. The mapping between a natural-language question and the corresponding database query, however, can differ dramatically according to the way the database is organized. For instance, ... 94025 I INTRODUCTION Although there have been many experimental systems for natural-language access to databases, with some now going into actual use, many problems in this area remain to ... motivation stems partly from the fact that, too often in the past, discussion of natural-language access to databases has focused, at the expense of the underlying issues, on what particular systems...

Uploaded: 21/02/2014, 20:20

2 227 0

Document: Scientific report: "THEORETICAL/TECHNICAL ISSUES IN NATURAL LANGUAGE ACCESS TO DATABASES" pdf

... expression above into a formal query language such as SQL. The relations ZONEF and LUCF are existing database relations, but there is no relation GEOBASE* in the database, giving the subplanning ... department?" must be mapped into three radically different database query language expressions depending on how the database is set up. It may be appropriate to retrieve a pre-stored ... SBLOCK) '('410 XlII) '(==) ) ) (3) Make the data base administrator (DBA) responsible for providing a formal query language definition of the virtual relations produced. In...

Uploaded: 21/02/2014, 20:20

6 375 0

Document: Scientific report: "ISSUES IN NATURAL LANGUAGE ACCESS TO DATABASES FROM A LOGIC PROGRAMMING PERSPECTIVE" doc

... relational database query language. Moreover, the efficiency of the DEC-10 Prolog implementation is comparable both with compiled Lisp [9] and with current relational database systems [6] (for databases ... realised in the language Prolog, has a great deal in common with the relational approach to databases, which can be seen as the result of a "bottom-up" effort to make database languages ... make database languages more like natural language. However Prolog is much more general than relational database formalisms, in that it permits data to be defined by general rules having...

Uploaded: 21/02/2014, 20:20

4 446 0

Document: Scientific report: "A Structured Language Model" ppt

... Derivational Model. In Proceedings of the Human Language Technology Workshop, 272-277. ARPA. Raymond Lau, Ronald Rosenfeld, and Salim Roukos. 1993. Trigger-based language models: a maximum ... 2 $P(w_k / W_{k-1}T_{k-1}) = P(w_k / h_0)$ models, referred to as H and h, respectively; $h_0$ is the previous exposed (headword, POS/non-term tag) pair; the parses used in this model were those assigned man- ... word $w_k$ and they were implemented using the Maximum Entropy Modeling Toolkit 1 (Ristad97). The constraint templates in the {W,H} models were: 4 <= <*>_<*> <7>; P- <=...

Uploaded: 22/02/2014, 03:20

3 342 0

Document: occupational projections and training data doc

Uploaded: 23/02/2014, 19:20

208 172 0

Scientific report: "Optimizing Language Model Information Retrieval System with Expectation Maximization Algorithm" doc

... its original parameters are given by the basic language modeling approach calculation. Figure 2. HMM model for EM IR We define our HMM model as a four-tuple, {S,A,B,π}, where S is a ... calculated with the simple language modeling approach. Even if the query term is not in the document, it will be assigned a small value according to the basic language modeling method. The rest ... and HMM training procedure After establishing the HMM model, the observation sequence is another necessary part for our HMM training procedure. The observation sequence used in HMM training...

Uploaded: 08/03/2014, 01:20

9 317 1

Scientific report: "A Discriminative Language Model with Pseudo-Negative Samples" pptx

... propose a novel discriminative language model, which can be applied quite generally. Compared to the well known N-gram language models, discriminative language models can achieve more accurate ... such training sets, and we treat correct or incorrect examples independently in training. 3 Discriminative Language Model with Pseudo-Negative samples We propose a novel discriminative language ... correctly. 2 Previous work Probabilistic language models (PLMs) estimate the probability of word strings or sentences. Among these models, N-gram language models (NLMs) are widely used. NLMs approximate...

Uploaded: 08/03/2014, 02:21

8 315 0

Scientific report: "Language Model Based Arabic Word Segmentation" pdf

Uploaded: 08/03/2014, 04:22

8 189 0

Scientific report: "Automatic Acquisition of Language Model based on Head-Dependent Relation between Words" pdf

Uploaded: 08/03/2014, 05:21

5 334 0

Scientific report: "A Preference-first Language Processor Integrating the Unification Grammar and Markov Language Model for Speech Recognition Applications" potx

Uploaded: 08/03/2014, 07:20

6 393 0

Scientific report: "Semantic Information Preprocessing for Natural Language Interfaces to Databases" docx

Uploaded: 08/03/2014, 07:20

3 328 0

Scientific report: "EVALUATION OF NATURAL LANGUAGE INTERFACES TO DATABASE SYSTEMS: A PANEL DISCUSSION" ppt

Uploaded: 08/03/2014, 18:20

2 371 0