Extraction of Vietnamese Collocations from Text Corpora


TABLE OF CONTENTS

1 Introduction
1.1 Definitions
1.2 Related works and motivation
1.3 Contribution of the thesis
2 Collocation: concept, roles and applications
2.1 Collocations’ characteristics
2.1.1 Recurrent
2.1.2 Arbitrary
2.1.3 Domain-dependent
2.1.4 Non-substitutability (closely linked in terms of vocabulary)
2.2 Classification of collocations
2.2.1 Idiomatic Phrases
2.2.2 Support Verb Construction
2.2.3 Fixed Phrases
2.3 Applications
2.4 Vietnamese collocations
3 Basic methods in Collocation extraction
3.1 Frequency
3.2 Hypothesis testing
3.2.1 T-Test
3.2.2 Chi-Square
3.3 Point-wise Mutual Information (PMI)
4 Our proposal for extracting Vietnamese collocation
4.1 Patterns for Vietnamese collocation
4.2 The Linguistic Measure
4.3 Designed model
5 Experiments
5.1 Data preparation
5.1.1 Collecting corpora
5.1.2 Extracting bi-grams
5.1.3 Adding syntactic information to bi-grams
5.2 The test models
5.3 Experimental results with statistical methods
5.3.1 Bi-grams with syntactic information
5.4 The experiments of our proposal


2.2 The collocation has Support Verb Construction
2.3 Fixed noun phrase
2.4 Some types of Vietnamese fixed phrases
3.1 Sample label type for filter of Vietnamese
3.2 Some collocations extracted by frequency
3.3 Some collocations extracted by the method T-Test
3.4 Some collocations extracted by the method Chi-Square
3.5 Some collocations extracted by the method PMI
4.1 A number of bi-grams and information about the main noun/verb and frequency of appearance
5.1 The label used by vnTagger
5.2 Output from the four methods on data without syntactic information
5.3 Output from the four methods on data that has been labelled and parsed
5.4 Results of experimental runs on all models with two input data sets
5.5 Output from the four methods and our combined method
5.6 Some bi-grams extracted after phase 2 of combined method


a combination of fixed and repeated words. Thus, Firth paid attention to the lexical aspect of collocation, while Choueka tends to study the syntactic function of collocations in text. Benson’s definition is one of the most commonly used, but it ignores a number of features and attributes that matter when collocations are applied in machine translation. For example, a collocation cannot be translated from English into Vietnamese word by word.

A collocation is an expression of two or more words that corresponds to a conventional way of saying things. Collocations are also known as a class of word groups that lie between idioms and free word combinations [5]. However, it is usual to draw a line between an idiomatic phrase and a collocation. Idioms and idiomatic phrases may be defined as expressions that are peculiar to the language, either grammatically or, especially, in having a meaning that cannot be derived from the sum of the meanings of their elements. It is often impossible to guess the meaning of an idiom from the words it contains. Moreover, the meanings that idioms carry are often stronger than those of non-idiomatic phrases.

Many studies of collocation in English have been conducted, but no standard definition of collocation has emerged; the definition depends on the viewpoint and purpose of each study.

In this thesis, we accept the following definition: a collocation is a combination of words that often appear together within a normal span of text, and whose positions and grammatical relations are relatively fixed.

1.2 Related works and motivation

A good example of this type of problem is Halliday’s example of strong vs. powerful tea (Halliday 1966: p150). It is a convention in English to talk about strong tea, not powerful tea, although any speaker of English would also understand the latter, unconventional expression. On this view, a collocation is a combination of words that does not follow from any rule of grammar or semantics. From some points of view, collocations are fixed and inflexible. The meaning of a collocation usually cannot be inferred from the meanings of its parts, and replacing one component with a synonym can completely change the meaning of the collocation. Collocations are also understood as idiosyncratic syntagmatic combinations of lexical items (Fontenelle, 1992, p222): heavy rain, light breeze, great difficulty, grow steadily, meet requirement, reach consensus, pay attention, ask a question. Unlike idioms (kick the bucket, lend a hand, pull someone’s leg), their meaning is fairly transparent and easy to decode. Unlike regular productions (big house, cultural activity, read a book), collocational expressions are highly idiosyncratic, since the lexical items a headword combines with in order to express a given meaning are contingent upon that word (Mel’cuk, 2003).

As has been pointed out by many researchers (Cruse, 1986; Benson, 1990; McKeown and Radev, 2000), collocations cannot be described by means of general syntactic and semantic rules. They are arbitrary and unpredictable, and therefore need to be memorized. They constitute the so-called semi-finished products of language (Hausmann, 1985), or the islands of reliability (Lewis, 2000), on which speakers build their utterances.

In addition, collocation is a special problem in linguistics. Syntax imposes constraints on word order or on the occurrence of particular phrasal types such as PPs or NPs, and lexical semantics imposes constraints of its own. Joachim Wermter and Udo Hahn [37] introduced a linguistic measure for identifying PP-verb collocations in German, which is based on the property of non-modifiability or limited modifiability.

Because collocations are so widely studied, there is a large body of collocation extraction work, most of which concerns the English language (Choueka, 1988; Church et al., 1989; Church and Hanks, 1990; Smadja, 1993; Justeson and Katz, 1995; Kjellmer, 1994; Sinclair, 1995; Lin, 1998), among many others. Choueka (1988) detects (consecutive) n-grams simply by calculating their co-occurrence frequency. Justeson and Katz (1995) apply a POS filter to the pairs they extract (Kjellmer, 1994 [16]). Smadja (1993) uses the z-score together with several heuristics (e.g., the systematic presence of two lexical items at the same distance in the text) and extracts predicative collocations, rigid noun phrases and phrasal templates. He then uses a parser to validate the results; parsing is shown to increase accuracy from 40% to 80%. Church et al. (1989) and Church and Hanks (1990) use POS information and parsing to extract verb-object pairs, which are then ranked according to the mutual information (MI) measure. Lin (1998) [18, 20, 19] also proposes a hybrid approach based on a dependency parser: candidates are extracted and then compared against the MI result.

Collocations have also proved important in text production tasks such as machine translation [1, 24, 34] and in natural language processing [5, 13, 32, 38, 23]. Furthermore, they are useful in a variety of other applications, such as word sense disambiguation (Brown et al., 1991) and parsing (Alshawi and Carter, 1994). Collocations are particularly important because of their prevalence in language, in all areas and categories. According to Jackendoff (1997, 156) and Mel’Cuk (1998, 24), a large number of collocations appear in the vocabulary of a language. The past decade has witnessed considerable development of collocation extraction techniques for both monolingual and (parallel) multilingual corpora. We can mention here only part of this work: (Berry-Rogghe, 1973; Church et al., 1989; Smadja, 1993; Lin, 1998; Krenn and Evert, 2001) for monolingual extraction, and (Kupiec, 1993; Wu, 1994; Smadja et al., 1996; Kitamura and Matsumoto, 1996; Melamed, 1997) for bilingual extraction via alignment.

In the first paper on fuzzy decision making for collocations, Raj Kishor Bisht and H. S. Dhami [2] suggest a way to check whether a word combination can be considered a collocation or not. Fuzzy logic allows a logic-based model to be formed from the reasoning behind the existing methods. The resulting model has the simplicity of a logic-based model and performs better than the existing statistical models.

In the study of collocation, German is the second most investigated language. The earliest work is that of Breidt (1993), followed more recently by Krenn and Evert (Krenn and Evert, 2001; Evert and Krenn, 2001; Evert, 2004). Breidt uses MI and the t-score, and compares accuracy as different parameters vary, such as window size, the presence or absence of lemmatization, corpus size, and the presence or absence of POS and syntactic information. Krenn and Evert (2001) then used a German chunker to extract syntactic pairs of the form P-N-V. Their work laid the basis for formal methods and systematic evaluation in collocation extraction. Zinsmeister and Heid (2003, 2004) focused on N-V and A-N-V combinations identified using a stochastic parser.

Thanks to the outstanding work of Gross on lexicon-grammar (1984), French is one of the languages most studied with respect to the distributional and transformational potential of words. That work was done before the computer era and the advent of corpus linguistics; automatic extraction was performed later, for example in (Lafon, 1984; Daille, 1994; Bourigault, 1992; Goldman et al., 2001).

There are also a number of collocation extraction studies for other languages [25, 3, 27, 30]. Over the past 20 years, the field of natural language processing has achieved many accomplishments (such as labelling, topic detection, or information retrieval) [26, 33, 29, 21]. However, most of these were made for Western languages, and their value is lost when they are applied to other languages. Only recently have Vietnamese researchers been attracted to linguistics and to standards for Vietnamese. The necessary corpora have not been built to any common standard, and so far almost no resources are public. It is therefore difficult for newcomers to learn or do research in this field.

Nguyen Cam Tu [35], in a work on a discovery scheme for classifying and clustering Vietnamese web documents, assigns labels by using N-gram testing to extract meaningful phrases (or collocations) from n-grams on the basis of test statistics. That paper names a few statistical methods for determining collocations, such as mutual information and hypothesis testing, where the null hypothesis states that the words of the n-gram are independent. The author applies hypothesis testing to n-grams (n <= 2), using the Chi-Square test to find collocations. Chi-Square values are calculated from a large data set (Vnexpress (199MB) and Wikipedia (270MB), covering about 200 subjects), and a threshold value (which the author calls coloThreshold) is used to decide what counts as a collocation.


1.3 Contribution of the thesis

The extraction of collocations for English has been investigated extensively; the extraction of collocations in Vietnamese, however, is still a relatively new field. Not much research has been conducted and the results are still very limited. This thesis focuses on applying some statistical methods to extract collocations in Vietnamese, studying the effect of text preprocessing on extraction, and comparing the accuracy of the test models; we also propose a combination of methods to improve the accuracy of the program. Our goals are as follows:

• To investigate Vietnamese collocations: their definitions, characteristics and classification, and some applications of collocations in machine translation and other natural language processing problems.

• To present some statistical methods for extracting collocations. More specifically, within the limits of this thesis, we examine four methods: the frequency-based method, two hypothesis-testing methods, and the method based on mutual information. For each method, after presenting the relevant theoretical basis, we show how to apply it to Vietnamese collocation extraction, and we describe some experimental models, their results, and an evaluation of the four methods applied to Vietnamese.

• To propose a method that combines statistics, syntactic information and a linguistic measure for identifying PP-verb and NP collocations. From this theoretical basis, we develop empirical models and evaluate the results and accuracy of the program based on this method.

The remaining chapters are organized as follows:

Chapter 2 presents an overview of collocation, including its characteristics, classification and applications, and introduces the concept of collocation in Vietnamese. Chapter 3 explores classical statistical methods. Chapter 4 presents the collocation extraction method that we propose; it covers the construction of empirical models based on the statistical methods for Vietnamese and a combined method using syntactic information and a linguistic measure. Chapter 5 presents the results of the proposed experimental models. Chapter 6 describes our conclusions.


• What are collocations?

• What are the characteristics of a collocation? How many types of collocations are there?

• Why must collocations be extracted?

• What is the concept of Vietnamese collocation, and how are Vietnamese collocations extracted?

2.1 Collocations’ characteristics

According to the definition outlined earlier, a collocation has four main features:


2.1.1 Recurrent

The words that make up a collocation do not co-occur in a document by chance; they are used together repeatedly in a certain context. Phrases such as to make a decision, to hit a record and to perform an operation are common collocations in English text, and HIV/AIDS, chuyển dịch cơ cấu and học hỏi kinh nghiệm are common collocations in Vietnamese text, whereas phrases such as to buy short, to ease the jib, vaccine or kiểm thử phần mềm are collocations specific to particular areas of expertise. Both types of collocation are used repeatedly within their contexts.

2.1.2 Arbitrary

In a sense, a collocation resembles an idiom or a fixed phrase: its overall meaning cannot be inferred directly from the meanings of the words that constitute it. In most cases, a collocation cannot be translated word by word from one language into another. For example, we can easily translate the Vietnamese phrase for open door into English or German, but we cannot translate the phrase cạnh tranh gay gắt word by word from Vietnamese into English or German. A learner of Vietnamese could not easily use the phrase cạnh tranh gay gắt without already knowing its meaning. Translating a text from one language into another therefore requires not only knowledge of the rules of grammar and semantics but also knowledge of such rigid collocations, so a bilingual corpus of collocations is essential for an effective machine translation application.

2.1.3 Domain-dependent

Professional writing contains many collocations. Its terminology is often unfamiliar to those who do not research or study in that field. In addition, there are words that are familiar to the reader but have a completely different meaning in specialized text. For example, information technology terms such as kỹ nghệ phần mềm, xử lý bundle and tài nguyên hệ thống are entirely new to those who study social sciences, economics or other fields. Besides, there are many phrases that contain no specialized terminology but whose meaning is unfamiliar to people outside the field. For example, in English text, a dry suit is not simply a suit that is dry, but a special type of clothing that keeps sailors from getting wet in extreme weather conditions. Native speakers are often unaware of the rigidity of collocations in ordinary text; the rigidity of collocations in specialized text, however, can cause major difficulties even for them.

2.1.4 Non-substitutability (closely linked in terms of vocabulary)

We usually cannot replace a component of a collocation with one of its synonyms, because the substitution can completely change the original meaning of the phrase. This property is often used by practitioners and lexicographers when compiling dictionaries of collocations (Cowie [10]; Benson [1]). They rely on the language intuitions of other speakers to decide which phrases are collocations and which are not, collecting information through questionnaires in which one word has been removed from each item. The missing word can easily be supplied by native speakers, while for language learners this is not simple. Collocations therefore have their own probability distribution (Halliday [6]; Cruse [11]). For example, the probability that red herring appears as a consecutive phrase in text is greater than the product of the probability of occurrence of red and the probability of occurrence of herring; in other words, the two words cannot be regarded as independent random variables. Based on this idea, we develop a set of statistical methods to select and identify collocations extracted from a large corpus.
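As a hedged illustration of the independence argument above (the notation is ours and is not taken from the cited works), the claim about red herring can be written as an inequality between the observed joint probability and the value expected if the two words were independent:

```latex
P(\text{red herring}) \;>\; P(\text{red}) \cdot P(\text{herring})
```

If the two words were independent random variables, the two sides would be equal; the size of the gap is what the statistical measures in the next chapter try to quantify.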

2.2 Classification of collocations

Linguists and lexicographers have conducted many studies to provide a classification system for collocations. One classification system is based on the relationship between the components. Accordingly, there are two types of collocation: collocations based on grammatical relations and collocations based on semantic relations. Collocations based on grammatical relations often include prepositions, with structures such as verb + preposition (for example come to, put on), adjective + preposition (such as afraid of, fond of) and noun + preposition (e.g., by accident, witness to). Collocations based on semantic relations are pairs of words that are lexically restricted.

Another classification system is based on the structure of the collocation. Accordingly, there are two types of collocations: compounds and collocations with a more flexible structure. A compound collocation consists of words that appear consecutively in the text and has a fixed syntactic function; noun + noun phrases are examples of this type. A flexible collocation is a pair of words, such as a subject and a verb, that may appear at a distance from each other (that is, with other words intervening).

We favour an approach that draws the line between collocations and free word combinations on the semantic layer [37], that is, according to the compositionality between the components of a linguistic expression. On this basis, there are three classes of collocation, distinguished by the degree of semantic compositionality of the basic lexical entities involved:

2.2.1 Idiomatic Phrases

In this case, none of the lexical components involved contributes to the overall meaning in a semantically transparent way. The meaning of the expression is metaphorical or figurative. An idiomatic phrase may contain one, several, or no empty slots. If slots exist, the phrase pattern determines the labels of the words that may fill them.

2.2.2 Support Verb Construction

The second class contains expressions in which at least one component contributes to the overall meaning in a semantically transparent way and thus constitutes its semantic core. This type of collocation has the most flexible structure. Its components often appear together repeatedly in a certain number of grammatical structures, for example hostile-takeover and make-decision. Figure 2.2 illustrates some predicate-related collocations in Vietnamese.

2.2.3 Fixed Phrases

The collocations in this class include noun-phrase terms from specific fields. The meaning of such a noun phrase cannot be inferred from the meanings of its component words. For example: stock market, foreign exchange, New York Stock Exchange, the Dow Jones average of 30 industrials.

Figure 2.2: The collocation has Support Verb Construction

Figure 2.3 illustrates some collocations that form fixed noun phrases in Vietnamese.

Figure 2.3: Fixed noun phrase

2.3 Applications

Collocations occur in a great deal of text. The concept of collocation covers not only adjacent phrases in text but also idiomatic phrases and terminology. Two main issues arise: the rigidity of collocations and the inseparability of their meaning. There are phrases that contain no grammatical errors and violate no rules, yet are not considered correct or acceptable simply because native speakers do not say them that way. This is a source of the difficulties that beginners encounter when they learn a language. The extraction of collocations can therefore help language learners get used to the words and word combinations that native speakers use. The second issue concerns the meaning of a collocation. As mentioned above, the meaning of a collocation usually cannot be derived directly from the meanings of its component words. This characteristic has an important influence on machine translation systems. Users expect a machine translation system to produce target text that is as precise and as fluent as possible. Translating a collocation word by word from one language into another not only reduces the accuracy of the system but also harms the fluency of the target text. A program that can identify collocations and update bilingual collocation dictionaries therefore increases not only the accuracy of translation but also the naturalness of the text. In addition, a bilingual corpus of collocations benefits programs in sociolinguistics and many other applications.

2.4 Vietnamese collocations

Like other languages, Vietnamese has many collocations. For instance, we must use rửa rau to describe the action wash vegetables, but we cannot use rửa gạo to describe the action wash rice before cooking; the right phrase is vo gạo. According to English-Vietnamese dictionaries, collocation means "an arrangement in place, the placement order." In linguistics, collocation can be understood as "(a) use of words, (a) combination of words".

In Vietnamese, there is a concept very close in meaning to collocation: the fixed phrase [4]. A fixed phrase is a combination of words that exists as a ready-made unit, like a word, and has a semantic structure and stability comparable to those of a word. The meaning of a fixed phrase is built and organized in the way the meaning of a phrase is organized, and is generally figurative. Therefore, from the surface alone, that is, from the meaning of each constituent, one generally cannot understand the whole phrase. For example: anh hùng rơm, đồng không mông quạnh, tiếng bấc tiếng chì. Furthermore, the meaning of a fixed phrase is a whole that corresponds to the whole of its material structure. This makes it highly expressive; in fixed phrases such as rán sành ra mỡ, méo miệng đòi ăn xôi vò and say như điếu đổ, expressiveness is at its fullest. Fixed phrases should be distinguished from neighbouring units, since they are easily confused with compound words and free phrases. If we accept the names provisionally, without yet defining their conceptual content, the classification of Vietnamese fixed phrases can be summarized as follows [8]:

Figure 2.4: Some types of Vietnamese fixed phrases

The classification of Vietnamese fixed phrases above does not draw absolute boundaries between the categories, and not every unit in a category shows the pure properties of its type. There are intermediate units that are formed in the manner of free phrases and are less stable, yet still concise. There are others that are highly expressive but whose stability and structural integrity are low.

The concepts of collocation and Vietnamese fixed phrase are very close, but for the problem of extracting Vietnamese collocations, collocation is understood more broadly than the fixed phrase. Given the characteristics of collocations (phrases of two or more words that frequently appear together), the problem of extracting Vietnamese collocations becomes the problem of extracting n-grams whose words frequently appear with each other. In this problem, collocations include compound words, phrases, fixed phrases, and even free phrases if they occur with high frequency in the corpus.


3 Basic methods in Collocation extraction

The classical approach to the study of collocation is that of practitioners and dictionary compilers. According to Benson and Morton [1], the components of a collocation cannot be separated and handled independently. Consequently, no pattern exists for extracting collocations; they must be selected manually and added to the dictionary. In recent years, statistical approaches have been applied to collocation extraction, as more and more large machine-readable corpora have become available. Choueka [7] developed programs that automatically extract collocations from text using n-grams of 2 to 6 words. A simple way to find collocations in a corpus is based on frequency of occurrence: if two or more words often appear together, they may well form a collocation. However, the n-grams with the highest frequency are sometimes not collocations; consider, for example, bigrams such as of the, in the and to the. To solve this problem, Justeson and Katz [14] give a heuristic method that improves the accuracy of the program by passing the bigrams through a filter based on part-of-speech labels. This filter passes only n-grams with certain structures. Some of the patterns used are AN, NN, AAN and ANN, where A stands for an adjective and N for a noun. Although these heuristic methods are rather simple, they significantly improve the accuracy of the program. Frequency-based extraction methods have been used quite effectively for fixed noun phrases. However, they do not work well for collocations that have a more flexible structure or that contain separated components. Methods based on hypothesis testing and on mutual information are used to improve this situation. Each method has its strengths and weaknesses, and its performance depends on the data used. In the rest of this chapter, we go into detail about four classical statistical methods used in extracting collocations: the frequency-based method, the t-test, Chi-Square, and mutual information.

3.1 Frequency

This method is based on the assumption that a collocation is a combination of words that often appear together in text. If two words appear together more often than a certain threshold, they can be regarded as related to each other and may be treated as a collocation. However, the precision of this method is limited. We can improve it by passing the candidate bi-grams through a filter. The filter is based on the labels of the words in the input phrase, and only word sequences that are likely to form a phrase pass through. Justeson and Katz [14] provide such templates for English phrases; Table 3.1 illustrates the labels for English proposed by Justeson and Katz [15]. In Vietnamese, however, adjectives usually follow the noun they modify, and the positions of verbs, adjectives and prepositions in sentences differ from English.

We propose a label model for Vietnamese, shown in Figure 3.1. In this model, A represents an adjective, P a preposition and N a noun. In comparative experiments, extracting bi-grams that follow the available patterns significantly improved the accuracy of frequency-based extraction.

Table 3.1: Sample label type for filter of English

Tag pattern    Example
A N            Linear function
N N            Regression coefficients
A A N          Gaussian random variable
A N N          Cumulative distribution function
N A N          Mean squared error
N N N          Class probability function
N P N          Degree of freedom

Here A stands for an adjective, N for a noun and P for a preposition. This is the simplest method for extracting collocations from text. However, it requires a large data set, and the accuracy of the program depends on the size of the corpus. In addition, it only extracts collocations that form fixed pairs.

Figure 3.1: Sample label type for filter of Vietnamese

Figure 3.2: Some collocations extracted by frequency
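The frequency method with a label filter described in this section can be sketched as follows. This is a minimal illustration, not the program used in the thesis: the tagged tokens are a hypothetical toy input, the function name `frequent_bigrams` is ours, and only the two-word English patterns A N and N N from Table 3.1 are used.

```python
from collections import Counter

# Hypothetical toy input: (word, label) pairs, with A = adjective,
# N = noun, P = preposition; D and C are extra labels that never pass the filter.
TAGGED = [
    ("linear", "A"), ("function", "N"), ("of", "P"), ("the", "D"),
    ("regression", "N"), ("coefficients", "N"), ("and", "C"),
    ("linear", "A"), ("function", "N"),
]

# Two-word label patterns taken from Table 3.1 (English).
ALLOWED_PATTERNS = {("A", "N"), ("N", "N")}


def frequent_bigrams(tagged, patterns, min_count=1):
    """Count adjacent word pairs whose labels match an allowed pattern."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if (t1, t2) in patterns:
            counts[(w1, w2)] += 1
    # Keep only pairs that reach the frequency threshold, most frequent first.
    return [(bg, c) for bg, c in counts.most_common() if c >= min_count]


candidates = frequent_bigrams(TAGGED, ALLOWED_PATTERNS)
for bigram, count in candidates:
    print(" ".join(bigram), count)
```

For Vietnamese, the set of allowed patterns would be replaced by the patterns of Figure 3.1; the counting step stays the same.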

3.2 Hypothesis testing

In many cases, two words occur together merely by chance and do not form a collocation. For such cases the frequency-based approach cannot be applied, and hypothesis testing methods are used instead. Hypothesis testing is used to accept or reject a null hypothesis. In collocation extraction, hypothesis testing helps us determine whether two words appear together by chance or form a collocation. The null hypothesis H0 states that there is no association between the occurrences of these words. Under this null hypothesis, we compute the probability P that the observed event would occur if H0 were true; we reject H0 if P is too low (usually P < 0.05, 0.01, 0.005 or 0.001) and retain H0 otherwise.

3.2.1 T-Test

The t-test is a method commonly used in hypothesis testing. In the t-test, the probability distribution of the words wi surrounding the root w is assumed to be normal. The null hypothesis is that the sample is drawn from a distribution with mean $\mu$; the t-test considers the difference between the mean of the sample and the mean of that distribution. If t is greater than a certain threshold t0, the null hypothesis H0 is rejected; otherwise, H0 is retained. The value t is calculated using the formula:

$$t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}}$$

where $\bar{x}$ is the sample mean, $s^2$ is the sample variance, $N$ is the sample size and $\mu$ is the mean of the distribution.

After computing the value of t, we look up the table of the t distribution at the corresponding significance level $\alpha$. If t is larger than the value t0 corresponding to the chosen $\alpha$, we can reject the hypothesis H0 with confidence $(1 - \alpha)$.

For example, applying the t-test: our null hypothesis is that the average height of men is 158 cm. We examine a sample of the heights of 200 men, with $\bar{x} = 169$ and $s^2 = 2600$, and we want to determine whether the sample was drawn from that population, in other words whether it is consistent with the null hypothesis. The value of t is calculated as follows:

$$t = \frac{169 - 158}{\sqrt{2600 / 200}} \approx 3.05$$
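The height example above can be checked with a few lines of code. This is a minimal sketch: the function name `t_statistic` is ours, and the critical value 2.576 (the commonly tabulated value for α = 0.005 with a large number of degrees of freedom) is an assumption chosen for illustration, since the text does not fix a particular α.

```python
import math


def t_statistic(sample_mean, population_mean, sample_variance, n):
    """One-sample t statistic: (mean difference) / sqrt(variance / n)."""
    return (sample_mean - population_mean) / math.sqrt(sample_variance / n)


# Values from the height example: H0 mean 158 cm, sample of 200 men
# with mean 169 and variance 2600.
t_height = t_statistic(169, 158, 2600, 200)

# Assumed critical value for alpha = 0.005 (large-sample approximation).
T_CRITICAL = 2.576
reject_h0 = t_height > T_CRITICAL

print(f"t = {t_height:.2f}, reject H0: {reject_h0}")
```

Under this assumed threshold the statistic exceeds the critical value, so the null hypothesis that the sample comes from a population with mean 158 cm would be rejected.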

To illustrate the use of the t-test in collocation extraction, we calculate t for new companies. We treat the corpus as a sequence of N bi-grams, and the sample as a set of indicator random variables, one per bi-gram, taking the value 1 when the bi-gram is the one of interest (here new companies) and 0 otherwise.

In our corpus, new appears 15,828 times, companies appears 4,675 times, and there are 14,307,668 bi-grams in total. The probabilities of new and companies are calculated as follows:

$$P(\text{new}) = \frac{15828}{14307668}, \qquad P(\text{companies}) = \frac{4675}{14307668}$$
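A minimal sketch of this computation follows, using the word counts above together with the bi-gram count C(new companies) = 8 that the chapter reports for the same corpus. Two modelling choices are assumptions on our part: the null-hypothesis mean is taken to be P(new) · P(companies), and the sample variance is taken to be the Bernoulli variance p(1 − p) of the indicator variable. The variable names are ours.

```python
import math

# Counts reported for the corpus.
N = 14_307_668            # total number of bi-grams
c_new = 15_828            # occurrences of "new"
c_companies = 4_675       # occurrences of "companies"
c_new_companies = 8       # occurrences of the bi-gram "new companies"

# Word probabilities estimated by relative frequency.
p_new = c_new / N
p_companies = c_companies / N

# Under H0 the two words are independent, so the expected
# bi-gram probability is the product of the word probabilities.
mu = p_new * p_companies

# Observed bi-gram probability (mean of the 0/1 indicator variable).
x_bar = c_new_companies / N

# Variance of a Bernoulli indicator with success probability x_bar.
s2 = x_bar * (1 - x_bar)

t_new_companies = (x_bar - mu) / math.sqrt(s2 / N)
print(f"t = {t_new_companies:.4f}")
```

With the assumed critical value of 2.576 for α = 0.005, a t value this close to 1 would not allow the independence hypothesis to be rejected, so new companies would not be accepted as a collocation by this test.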

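Table 3.2.2 shows the frequency of new and companies in the corpus: C(new) = 15,828, C(companies) = 4,675, C(new companies) = 8, out of 14,307,668 bi-grams. The Chi-Square statistic is calculated by summing, over all cells (i, j) of the table, the squared difference between the observed value of the cell and its expected value, divided by the expected value. Specifically,

$$X^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where $O_{ij}$ is the observed value of cell (i, j) and $E_{ij}$ is its expected value.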
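The description above can be sketched for the new companies counts as follows. This is an illustration under stated assumptions rather than the program used in the thesis: the function name `chi_square_2x2` is ours, the 2x2 table is built from the word and bi-gram counts, and the expected value of each cell is taken to be (row total × column total) / N, the usual estimate under independence.

```python
def chi_square_2x2(c_w1, c_w2, c_w1w2, n):
    """Chi-square statistic for a bi-gram (w1 w2) from a 2x2 contingency table."""
    # Observed counts: rows = first word is w1 / is not w1,
    # columns = second word is w2 / is not w2.
    observed = [
        [c_w1w2, c_w1 - c_w1w2],
        [c_w2 - c_w1w2, n - c_w1 - c_w2 + c_w1w2],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count of cell (i, j) if the two words were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


chi2_new_companies = chi_square_2x2(15_828, 4_675, 8, 14_307_668)

# 3.841 is the tabulated critical value for alpha = 0.05 with one degree of freedom.
is_collocation = chi2_new_companies > 3.841
print(f"chi-square = {chi2_new_companies:.2f}, collocation: {is_collocation}")
```

A statistic below the critical value means the independence hypothesis cannot be rejected at that level, which agrees with the outcome of the t-test for the same pair.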
