Semantic similarity in Vietnamese

VIETNAM NATIONAL UNIVERSITY, HA NOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

Nguyen Tien Dat

FINDING THE SEMANTIC SIMILARITY IN VIETNAMESE

GRADUATION THESIS
Major Field: Computer Science
Supervisor: Phạm Bảo Sơn, Ph.D.

Ha Noi – 2010

Abstract

This thesis examines the quality of semantic vector representations built with random projection and the Hyperspace Analogue to Language (HAL) model when applied to Vietnamese. The main goal is to find semantic similarity, and in particular to study synonyms, in Vietnamese. We are also interested in the stability of our approach, which uses Random Indexing and HAL to represent the semantics of words or documents. We build a system for finding synonyms in Vietnamese, called the Semantic Similarity Finding System, and we evaluate the synonyms it returns.

Keywords: Semantic vector, Word space model, Random projection, Apache Lucene

Acknowledgments

First of all, I wish to express my respect and my deepest thanks to my advisor, Pham Bao Son, University of Engineering and Technology, Viet Nam National University, Ha Noi, for his enthusiastic guidance, warm encouragement and useful research experience.

I would like to gratefully thank all the teachers of the University of Engineering and Technology, VNU, for the invaluable knowledge they have given me during the past four academic years.

I would also like to send my special thanks to my friends in the K51CA class and the HMI Lab.

Last, but not least, my family is my biggest motivation. My parents and my brother always encourage me when I face stress and difficulty. I would like to send them my great love and gratitude.
Ha Noi, May 19, 2010
Nguyen Tien Dat

Chapter 1
Introduction

Finding semantic similarity is an interesting problem in Natural Language Processing (NLP). Determining the semantic similarity of a pair of words is important in many NLP applications, such as web mining [18] (search and recommendation systems), targeted advertising and other domains that need semantic content matching, word sense disambiguation, and text categorization [28][30]. Little research has been done on semantic similarity for Vietnamese, even though semantic similarity plays a crucial role in human categorization [11] and reasoning, and computational similarity measures have been applied in many fields, such as semantics-based information retrieval [4][29], information filtering [9] and ontology engineering [19].

The word space model is often used in current research on semantic similarity. There are several well-known approaches for representing the context vectors of words, such as Latent Semantic Indexing (LSI) [17], Hyperspace Analogue to Language (HAL) [21] and Random Indexing (RI) [26]. These approaches have proven useful in implementing the word space model [27].

In this thesis, we adopt the word space model and implement it to compute semantic similarity. We studied each method and investigated its advantages and disadvantages in order to select a suitable technique for Vietnamese text data. We then built a complete system for finding synonyms in Vietnamese, called the Semantic Similarity Finding System. Our system combines several processes and approaches to return the synonyms of a given word.
Our experimental results on the task of finding synonyms are promising.

Our thesis is organized as follows. Chapter 2 introduces background knowledge about the word space model and reviews some of the solutions that have been proposed for implementing it. Chapter 3 describes our Semantic Similarity Finding System for finding synonyms in Vietnamese. Chapter 4 describes the experiment we carried out to evaluate the quality of our approach. Finally, Chapter 5 presents our conclusions and future work.

Chapter 2
Background Knowledge

2.1 Lexical relations

In this first section, we describe lexical relations in order to clarify the concepts of synonymy and hyponymy. Lexical relations are difficult to define in a general way. We use the definition given by Cruse (1986) [35]: a lexical relation is a culturally recognized pattern of association that exists between lexical units in a language.

2.1.1 Synonym and Hyponymy

Synonymy is the sameness, or at least the close similarity, of meaning between different linguistic expressions. Two words are synonymous if they have the same meaning [15]. Words that are synonyms are said to be synonymous, and the state of being a synonym is called synonymy. For example, in English the words “car” and “automobile” are synonyms. In the figurative sense, two words are often said to be synonyms if they have the same extended meaning or connotation.

Synonyms can be any part of speech (e.g. noun, verb, adjective or pronoun), as long as both words of a pair are the same part of speech. More examples of Vietnamese synonyms:
độc giả - bạn đọc (noun)
chung quanh – xung quanh (pronoun)
bồi thường – đền bù (verb)
an toàn – đảm bảo (adjective)

[...]
... and the main function of Semantic Vector Package applied in our system. There are two main functions for training data from a Lucene index: (i) building semantic models, also called indexing, and (ii) searching models (querying), which performs a number of different searches. The searching model is introduced in the next part, 3.4. Building a model means indexing all documents and terms in the free text ...

... to use our system for finding synonyms in Vietnamese.

3.2 System Processes Flow

Figure 3.1: The processes of Semantic Similarity Finding System

After introducing the components and the description of our system, we show how to operate the system in the following steps:
• First, we need to collect data for training. Data used in this system is only ...

... a beginning, becoming more and more information in order to manage Web by using the content categories.

Chapter 3
Semantic Similarity Finding System

We built a complete system to find synonyms in Vietnamese. Our system operates the word-space model based on the Random Indexing (RI) approach ...

... file formats.

There are two kinds of model building in Semantic Vector Package. The first is BuildIndex, which builds semantic vector indexes of all terms in all documents based on the number of documents in the free text corpus. The second is BuildPositionalIndex, which indexes each term according to the words close to it in the document; so this is the reason ...
... of words in the vocabulary.

2.2.2 Semantic similarity

As we have seen in the definition, the word-space model is a model of semantic similarity. On the other hand, the geometric metaphor of meaning is: meanings are locations in a semantic space, and semantic similarity is proximity between the locations. The term-document vector represents the context of a term at low granularity. Besides, creating term vector ...

... from the index
• reconstruct the original document fields, edit them and re-insert them into the index
• optimize indexes
• And much more

Figure 3.2: Lucene Index Toolbox - Luke

3.3 Semantic Vector Package

Many software packages and open-source projects can be used to create a semantic vector space. We chose a popular open-source package, Semantic ...

... term vector according to some of the surrounding words to compute the semantic vector [21]. It is a kind of semantic vector model. To compare semantic similarity in a semantic vector model, we use the cosine distance:

Figure 2.2: Cosine distance

In practice, it is easier to calculate the cosine of the angle between the vectors than the angle itself. For two vectors A and B, cos θ = (A · B) / (‖A‖ ‖B‖). A cosine value of zero ...

... identify or index terms in the text documents. The model is useful in information retrieval, information filtering and indexing. The idea can be traced back to Salton's introduction of the vector space model for information retrieval [29]. The term is due to Hinrich Schütze (1993): “Vector similarity is the only information present in Word Space: semantically ...
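The cosine comparison described in this chapter can be illustrated with a minimal Python sketch. It is not taken from the thesis or from any package it uses: the function name cosine_similarity, the three-dimensional toy vectors, and the choice to return 0.0 for a zero vector are all assumptions made for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption for this sketch: a zero vector is treated as
        # having no similarity to anything.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy context vectors, invented for illustration only.
car = [2.0, 1.0, 0.0]
automobile = [4.0, 2.0, 0.0]  # same direction as `car`
banana = [0.0, 0.0, 3.0]      # orthogonal to `car`

similar = cosine_similarity(car, automobile)
unrelated = cosine_similarity(car, banana)
```

Vectors that point in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0, which is why the cosine is convenient for ranking candidate synonyms by their context vectors.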
... meaning is included in that of another word [14]. Some examples in English: “scarlet”, “vermilion” and “crimson” are hyponyms of “red”. In Vietnamese, “vàng cốm”, “vàng choé” and “vàng lụi” are hyponyms of “vàng”, where “vàng” denotes the color. In our thesis, we do not distinguish clearly between synonyms and hyponyms; we treat a hyponym as a kind of synonym.

2.1.2 Antonym and Opposites

In lexical semantics, ...

... implementing RI creates the context vectors of terms according to the documents in which they occur. On the other hand, RI produces a HAL-style model when it is used to build the co-occurrence matrix of all terms in the free text corpus within a narrow window. Hence, the context vectors are built from the words that immediately surround the target word.

3.1 System Description

The Semantic Similarity Finding System contains three ...
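The window-based use of Random Indexing mentioned above can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the implementation used in the thesis or in Semantic Vector Package: the function names, the vector dimension, the number of non-zero entries, the window size and the two-sentence corpus are all hypothetical.

```python
import random


def index_vector(dim, nonzeros, rng):
    """Sparse ternary random vector: a few +1/-1 entries, the rest 0."""
    vec = [0] * dim
    positions = rng.sample(range(dim), nonzeros)
    for i, pos in enumerate(positions):
        vec[pos] = 1 if i % 2 == 0 else -1
    return vec


def train_context_vectors(sentences, dim=64, nonzeros=4, window=2, seed=0):
    """Add the index vectors of neighbouring words to each word's context vector."""
    rng = random.Random(seed)
    index = {}    # fixed random index vector per word
    context = {}  # accumulated context vector per word
    for tokens in sentences:
        for tok in tokens:
            if tok not in index:
                index[tok] = index_vector(dim, nonzeros, rng)
                context[tok] = [0] * dim
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue  # a word is not its own context
                neighbour = index[tokens[j]]
                ctx = context[target]
                for k in range(dim):
                    ctx[k] += neighbour[k]
    return context


# Hypothetical pre-tokenized corpus; underscores join the syllables of one word.
corpus = [
    ["độc_giả", "đọc", "sách"],
    ["bạn_đọc", "đọc", "sách"],
]
vectors = train_context_vectors(corpus)
```

In this toy corpus, “độc_giả” and “bạn_đọc” share the same neighbours, so their context vectors come out identical; on real data, words used in similar contexts would end up with similar, not identical, vectors that can then be compared with the cosine measure.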
