
language model training data

Scientific report: "Intelligent Selection of Language Model Training Data" ppt

Scientific report

... non-domain-specific language models, for each sentence of the text source used to produce the latter language model. We show that this produces better language models, trained on less data, than both random data ... available data as language model training data. This not only produces a language model better matched to the domain of interest (as measured in terms of perplexity on held-out in-domain data), ... universal truth that output quality can always be improved by using more language model training data, but only if the training data is reasonably well-matched to the desired output. This presents a...
  • 5
  • 348
  • 0
Scientific report document: "A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation" doc

Scientific report

... 5-gram/2-SLM+2-gram/4-SLM+5-gram/PLSA language model improves both significantly. Bear in mind that Charniak et al. (2003) integrated Charniak’s language model with the syntax-based translation model Yamada and ... composite language model, both the data and the parameters can’t be stored in a single machine, so we have to resort to distributed computing. The topic of large scale distributed language models ... Large language models in machine translation. The 2007 Conference on Empirical Methods in Natural Language Processing (EMNLP), 858-867. E. Charniak. 2001. Immediate-head parsing for language models....
  • 10
  • 567
  • 0
Scientific report document: "Smoothing a Tera-word Language Model" doc

Scientific report

... The interpolated models always incorporate the lower order distribution Pr(c|b) whereas the back-off models consider it only when the n-gram abc has not been observed in the training data. 3 Data and ... Goodman. 2001. A bit of progress in language modeling. Computer Speech and Language. R. Kneser and H. Ney. 1995. Improved backing-off for m-gram language modeling. In International Conference ... words that precede the n-gram in the training data. Unfortunately this number is not exactly equal to the N(∗bc) value given in the Web 1T dataset because the dataset does not include low count...
  • 4
  • 425
  • 1
Scientific report document: "A Succinct N-gram Language Model" ppt

Scientific report

... -gram language models are compressed into 10 GB, which is comparable to a lossy representation (Talbot and Brants, 2008). 2 N-gram Language Model We assume a back-off N-gram language model ... language model structure and word identifiers. In Proc. of ICASSP 2003, volume 1. A. Stolcke. 1998. Entropy-based pruning of backoff language models. In Proc. of the ARPA Workshop on Human Language ... representation with block compression. N-gram language models of 42.65GB were compressed to 18.37GB. Finally, the 8-bit quantized N-gram language models are represented by 9.83GB of space. Table...
  • 4
  • 457
  • 0
Scientific report document: "A Phonotactic Language Model for Spoken Language Identification" pptx

Scientific report

... Language Recognition Evaluation (LRE) data. The database was intended to establish a baseline of performance capability for language recognition of conversational telephone speech. The database ... statistical language modeling, and language identification. A typical LID system is illustrated in Figure 1 (Zissman, 1996), where language dependent voice tokenizers (VT) and language models ... the 1996 NIST Language Recognition Evaluation database. 1 Introduction Spoken language and written language are similar in many ways. Therefore, much of the research in spoken language identification,...
  • 8
  • 436
  • 0
Scientific report document: "Japanese OCR Error Correction using Character Shape Similarity and Statistical Language Model" pptx

Scientific report

... very efficient. 5 Experiments 5.1 Training Data for the Language Model We used the EDR Japanese Corpus Version 1.0 (EDR, 1991) to train the language model. It is a corpus of approximately ... Corpus for training. The first column of Table 1 shows the number of sentences, words, and characters of the training set. Table 1: The amount of the training data and the test data for handwritten ... P(C|X) = P(X|C)P(C) / P(X) (2) P(C) is called the language model. It is computed from the training corpus. Let us call P(X|C) the OCR model. It can be computed from the a priori likelihood...
  • 7
  • 472
  • 0
Scientific report document: "Chinese Word Segmentation without Using Lexicon and Hand-crafted Training Data" pdf

Scientific report

... 'vxyzw' is a Chinese 1268 Chinese Word Segmentation without Using Lexicon and Hand-crafted Training Data Sun Maosong, Shen Dayang*, Benjamin K Tsou** State Key Laboratory of Intelligent Technology ... statistical data required by calculating mi and dts, in fact it is character bigram, is automatically derived from a news corpus of about 20M Chinese characters. The testing texts and training ... texts without making use of any lexicon and hand-crafted linguistic resource. The statistical data required by the algorithm, that is, mutual information and the difference of t-score between...
  • 7
  • 396
  • 0
Scientific report document: "Supervised Grammar Induction using Training Data with Limited Constituent Information" docx

Scientific report

... held-out data, and the rest for training. Before proceeding with the main discussion on training from the ATIS, we briefly describe the pretraining stage of the adaptive strategy. 5.1.1 Pretraining ... existing labeled data. We hope that pretraining the grammars on these data might place them in a better position to learn from the new, sparsely labeled data. In the pretraining stage for ... ing with the ATIS data. 5.1.2 Partially Supervised Training on ATIS We now return to the main focus of this experiment: learning from sparsely annotated ATIS training data. To verify whether...
  • 7
  • 423
  • 0
Scientific report document: "NATURAL-LANGUAGE ACCESS TO DATABASES--THEORETICAL/TECHNICAL ISSUES" docx

Scientific report

... questions. The mapping between a natural-language question and the corresponding database query, however, can differ dramatically according to the way the database is organized. For instance, ... 94025 I INTRODUCTION Although there have been many experimental systems for natural-language access to databases, with some now going into actual use, many problems in this area remain to ... motivation stems partly from the fact that, too often in the past, discussion of natural-language access to databases has focused, at the expense of the underlying issues, on what particular systems...
  • 2
  • 227
  • 0
Scientific report document: "THEORETICAL/TECHNICAL ISSUES IN NATURAL LANGUAGE ACCESS TO DATABASES" pdf

Scientific report

... expression above into a formal query language such as SQL. The relations ZONEF and LUCF are existing database relations, but there is no relation GEOBASE* in the database, giving the subplanning ... department?" must be mapped into three radically different database query language expressions depending on how the database is set up. It may be appropriate to retrieve a pre-stored ... SBLOCK) '('410 XlII) '(==) ) ) (3) Make the data base administrator (DBA) responsible for providing a formal query language definition of the virtual relations produced. In...
  • 6
  • 374
  • 0
Scientific report document: "ISSUES IN NATURAL LANGUAGE ACCESS TO DATABASES FROM A LOGIC PROGRAMMING PERSPECTIVE" doc

Scientific report

... relational database query language. Moreover, the efficiency of the DEC-10 Prolog implementation is comparable both with compiled Lisp [9] and with current relational database systems [6] (for databases ... realised in the language Prolog, has a great deal in common with the relational approach to databases, which can be seen as the result of a "bottom-up" effort to make database languages ... make database languages more like natural language. However Prolog is much more general than relational database formalisms, in that it permits data to be defined by general rules having...
  • 4
  • 445
  • 0
Scientific report document: "A Structured Language Model" ppt

Scientific report

... Derivational Model. In Proceedings of the Human Language Technology Workshop, 272-277. ARPA. Raymond Lau, Ronald Rosenfeld, and Salim Roukos. 1993. Trigger-based language models: a maximum ... 2 P(w_k/W_{k-1}T_{k-1}) = P(w_k/h_0) models, referred to as H and h, respectively; h_0 is the previous exposed (headword, POS/non-term tag) pair; the parses used in this model were those assigned man- ... word w_k and they were implemented using the Maximum Entropy Modeling Toolkit 1 (Ristad97). The constraint templates in the {W,H} models were: 4 <= <*>_<*> <7>; P- <=...
  • 3
  • 342
  • 0
Scientific report: "Optimizing Language Model Information Retrieval System with Expectation Maximization Algorithm" doc

Scientific report

... its original parameters are given by the basic language modeling approach calculation. Figure 2. HMM model for EM IR We define our HMM model as a four-tuple, {S,A,B,π}, where S is a ... calculated with the simple language modeling approach. Even if the query term is not in the document, it will be assigned a small value according to the basic language modeling method. The rest ... and HMM training procedure After establishing the HMM model, the observation sequence is another necessary part for our HMM training procedure. The observation sequence used in HMM training...
  • 9
  • 317
  • 1
Scientific report: "A Discriminative Language Model with Pseudo-Negative Samples" pptx

Scientific report

... propose a novel discriminative language model, which can be applied quite generally. Compared to the well known N-gram language models, discriminative language models can achieve more accurate ... such training sets, and we treat correct or incorrect examples independently in training. 3 Discriminative Language Model with Pseudo-Negative samples We propose a novel discriminative language ... correctly. 2 Previous work Probabilistic language models (PLMs) estimate the probability of word strings or sentences. Among these models, N-gram language models (NLMs) are widely used. NLMs approximate...
  • 8
  • 315
  • 0
