
Semantic similarity in Vietnamese



VIETNAM NATIONAL UNIVERSITY, HA NOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

Nguyen Tien Dat

FINDING THE SEMANTIC SIMILARITY IN VIETNAMESE

GRADUATION THESIS
Major Field: Computer Science
Supervisor: Dr. Phạm Bảo Sơn

Ha Noi – 2010

Abstract

This thesis examines the quality of semantic vector representations built with random projection and the Hyperspace Analogue to Language (HAL) model when applied to Vietnamese. The main goal is to find semantic similarity between words, and in particular to find synonyms in Vietnamese. We are also interested in the stability of our approach, which uses Random Indexing and HAL to represent the semantics of words and documents. We build a system for finding synonyms in Vietnamese, called the Semantic Similarity Finding System, and we evaluate the synonyms it returns.

Keywords: Semantic vector, Word space model, Random projection, Apache Lucene

Acknowledgments

First of all, I wish to express my respect and my deepest thanks to my advisor, Pham Bao Son, University of Engineering and Technology, Viet Nam National University, Ha Noi, for his enthusiastic guidance, warm encouragement and useful research experience. I would like to thank all the teachers of the University of Engineering and Technology, VNU, for the invaluable knowledge they have given me during the past four academic years. I would also like to send my special thanks to my friends in the K51CA class and the HMI Lab. Last, but not least, my family is my biggest motivation. My parents and my brother always encourage me when I face stress and difficulty. I would like to send them my love and gratitude.
Ha Noi, May 19, 2010
Nguyen Tien Dat

Chapter 1
Introduction

Finding semantic similarity is an interesting problem in Natural Language Processing (NLP). Determining the semantic similarity of a pair of words is important in many NLP applications, such as web mining [18] (search and recommendation systems), targeted advertising and other domains that need semantic content matching, word sense disambiguation, and text categorization [28][30]. Little research has been done on semantic similarity for Vietnamese, even though semantic similarity plays a crucial role in human categorization [11] and reasoning, and computational similarity measures have been applied in many fields, such as semantics-based information retrieval [4][29], information filtering [9] and ontology engineering [19].

The word space model is widely used in current research on semantic similarity. There are several well-known approaches to representing the context vectors of words, such as Latent Semantic Indexing (LSI) [17], Hyperspace Analogue to Language (HAL) [21] and Random Indexing (RI) [26]. These approaches have proven useful for implementing the word space model [27].

In this thesis, we implement the word space model to compute semantic similarity. We studied each method and investigated its advantages and disadvantages in order to select a suitable technique for Vietnamese text data. We then built a complete system for finding synonyms in Vietnamese, called the Semantic Similarity Finding System. The system combines several processes and approaches to return the synonyms of a given word.
Our experimental results on the task of finding synonyms are promising.

This thesis is organized as follows. Chapter 2 introduces background knowledge about the word space model and reviews some of the solutions that have been proposed for implementing it. Chapter 3 describes our Semantic Similarity Finding System for finding synonyms in Vietnamese. Chapter 4 describes the experiments we carried out to evaluate the quality of our approach. Finally, Chapter 5 presents our conclusions and future work.

Chapter 2
Background Knowledge

2.1 Lexical relations

In this first section, we describe lexical relations in order to clarify the concepts of synonymy and hyponymy. Relations between lexical concepts are difficult to define in a general way. One definition is given by Cruse (1986) [35]: a lexical relation is a culturally recognized pattern of association that exists between lexical units in a language.

2.1.1 Synonym and Hyponymy

Synonymy is the equality, or at least similarity, of meaning between different linguistic units. Two words are synonymous if they have the same meaning [15]. Words that are synonyms are said to be synonymous, and the state of being a synonym is called synonymy. For example, in English the words “car” and “automobile” are synonyms. In a figurative sense, two words are often said to be synonyms if they have the same extended meaning or connotation. Synonyms can be of any part of speech (e.g. noun, verb, adjective or pronoun), as long as both words of a pair belong to the same part of speech. More examples of Vietnamese synonyms:

độc giả – bạn đọc (noun)
chung quanh – xung quanh (pronoun)
bồi thường – đền bù (verb)
an toàn – đảm bảo (adjective)

[...]
... and the main functions of the Semantic Vector Package applied in our system. There are two main functions for training on data from a Lucene index: (i) building semantic models, also called indexing, and (ii) searching models (querying), which supports a number of different kinds of search. Searching is introduced in the next part, 3.4. Building a model means indexing all documents and terms in free text ...

... to use our system for finding synonyms in Vietnamese.

3.2 System Processes Flow

Figure 3.1: The processes of the Semantic Similarity Finding System

After introducing the components and description of our system, we show how to operate the system in the following steps:

• First, we need to collect data for training. Data used in this system is only ...

... a beginning, becoming more and more information in order to manage the Web by using content categories.

Chapter 3
Semantic Similarity Finding System

We built a complete system to find synonyms in Vietnamese. Our system implements the word-space model based on the Random Indexing (RI) approach ...

... file formats.

There are two kinds of model building in the Semantic Vector Package. The first is BuildIndex, which builds semantic vector indexes of all terms based on the documents of the free-text corpus in which they occur. The second is BuildPositionalIndex, which indexes each term according to the words close to it in a document; so this is the reason ...
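The document-based kind of model building described above can be illustrated with a toy sketch of Random Indexing. This is not the Semantic Vector Package's code or API: the function names, the dimension, the number of non-zero entries and the pre-segmented tokens (joined with underscores) are all assumptions made for illustration. Each document receives a sparse random vector, and a term's vector is the sum of the vectors of the documents it occurs in.

```python
import random


def random_index_vector(dim, nonzeros, rng):
    """Sparse ternary vector: `nonzeros` entries are +1 or -1, the rest are 0."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzeros):
        vec[pos] = rng.choice((-1, 1))
    return vec


def build_term_vectors(documents, dim=200, nonzeros=10, seed=0):
    """Document-based Random Indexing over a list of tokenized documents.

    Hypothetical helper: every document gets one random index vector, and
    that vector is added to the accumulator of each term occurrence in it.
    """
    rng = random.Random(seed)
    term_vectors = {}
    for tokens in documents:
        doc_vector = random_index_vector(dim, nonzeros, rng)
        for term in tokens:  # a repeated term adds the document vector again
            acc = term_vectors.setdefault(term, [0] * dim)
            for i, value in enumerate(doc_vector):
                acc[i] += value
    return term_vectors


# Toy corpus; the underscore-joined tokens assume prior word segmentation.
docs = [
    ["độc_giả", "mua", "sách"],
    ["bạn_đọc", "mua", "sách"],
    ["trời", "mưa"],
]
vectors = build_term_vectors(docs, dim=50, nonzeros=4, seed=1)
```

In this toy corpus, "mua" and "sách" occur in exactly the same documents, so they receive identical vectors, while "độc_giả" and "bạn_đọc" never share a document and are not brought together by document-level contexts alone. That limitation is one motivation for the positional, window-based variant.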
... of words in the vocabulary.

2.2.2 Semantic similarity

As we have seen in the definition, the word-space model is a model of semantic similarity. In the geometric metaphor of meaning, meanings are locations in a semantic space, and semantic similarity is proximity between those locations. The term-document vector represents the context of a term at a coarse granularity. Besides, creating term vector ...

... from the index
• reconstruct the original document fields, edit them and re-insert them into the index
• optimize indexes
• and much more

Figure 3.2: Lucene Index Toolbox - Luke

3.3 Semantic Vector Package

Many software packages and open-source projects could be used to create a semantic vector space. We chose a popular open-source package, Semantic ...

... term vector according to some of the surrounding words to compute the semantic vector [21]. It is a kind of semantic vector model. To compare semantic similarity in the semantic vector model, we use the cosine distance:

Figure 2.2: Cosine distance

In practice, it is easier to calculate the cosine of the angle between the vectors than the angle itself: A cosine value of zero ...

... identify or index terms in the text documents. The model is useful in information retrieval, information filtering and indexing. Its origin can be traced to Salton's introduction of the vector space model for information retrieval [29]. The term is due to Hinrich Schütze (1993): “Vector similarity is the only information present in Word Space: semantically ...
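For the cosine measure mentioned above, the standard definition for two vectors A and B is cos(θ) = (A · B) / (‖A‖ ‖B‖): a value of 1 means the vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions. The following is a minimal sketch in plain Python; the function name and the choice to return 0.0 for a zero vector are assumptions for illustration, not taken from the thesis or from any package it uses.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention assumed here: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


orthogonal = cosine_similarity([1, 0], [0, 1])        # perpendicular vectors
parallel = cosine_similarity([1, 2, 3], [2, 4, 6])    # same direction
opposite = cosine_similarity([1, 0], [-1, 0])         # opposite direction
```

Because the measure depends only on direction, scaling a vector (for example, a term that is simply more frequent) does not change its similarity to other vectors.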
... meaning is included in that of another word [14]. Some examples in English: “scarlet”, “vermilion” and “crimson” are hyponyms of “red”. And in Vietnamese: “vàng cốm”, “vàng choé” and “vàng lụi” are hyponyms of “vàng”, in the case where “vàng” denotes the color. In this thesis, we do not distinguish clearly between synonyms and hyponyms; we treat a hyponym as a kind of synonym.

2.1.2 Antonym and Opposites

In lexical semantics, ...

... implementing RI creates the context vectors of terms according to the documents in which they occur. On the other hand, RI produces a HAL-style model when it is used to build the co-occurrence matrix of all terms in the free-text corpus within a narrow window. Hence, the context vectors are built from the words that immediately surround the target word.

3.1 System Description

The Semantic Similarity Finding System contains three ...
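The narrow-window idea described above can be sketched as follows. This is a simplified illustration and not the thesis's implementation: the function names, window size and vector parameters are hypothetical, and unlike HAL proper the sketch neither weights neighbours by distance nor separates left and right contexts. Each term gets a sparse random index vector, and a term's context vector is the sum of the index vectors of the words that surround it.

```python
import random


def random_index_vector(dim, nonzeros, rng):
    """Sparse ternary vector: `nonzeros` entries are +1 or -1, the rest are 0."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzeros):
        vec[pos] = rng.choice((-1, 1))
    return vec


def build_context_vectors(tokens, window=2, dim=200, nonzeros=10, seed=0):
    """Window-based Random Indexing over one tokenized text (hypothetical helper)."""
    rng = random.Random(seed)
    index_vectors = {}
    for term in tokens:
        if term not in index_vectors:
            index_vectors[term] = random_index_vector(dim, nonzeros, rng)
    context_vectors = {term: [0] * dim for term in index_vectors}
    for pos, term in enumerate(tokens):
        start = max(0, pos - window)
        end = min(len(tokens), pos + window + 1)
        for neighbour_pos in range(start, end):
            if neighbour_pos == pos:
                continue  # a word is not its own context
            neighbour = index_vectors[tokens[neighbour_pos]]
            acc = context_vectors[term]
            for i, value in enumerate(neighbour):
                acc[i] += value
    return context_vectors, index_vectors


# With window=1, each word only sees its immediate left and right neighbours.
context, index = build_context_vectors(["a", "b", "c"], window=1, dim=20, nonzeros=4)
```

Two words that tend to appear between the same neighbours accumulate similar context vectors, which is what allows a similarity measure over these vectors to surface synonym candidates.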

Posted: 13/07/2014, 17:15

References

[1] D. Appelt. 1999. An Introduction to Information Extraction. Artificial Intelligence Communications, 12.
[2] David M. Blei, Andrew Y. Ng, Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022.
[3] Thorsten Brants, Francine Chen, Ioannis Tsochantaridis. 2002. Topic-based document segmentation with probabilistic latent semantic analysis. In Conference on Information and Knowledge Management (CIKM), pages 211–218.
[4] M. W. Berry, S. T. Dumais, G. W. O'Brien. 1994. Using linear algebra for intelligent information retrieval. Computer Science Department.
[5] Cowie and W. Lehnert. 1996. Information Extraction. Communications of the ACM, 39.
[6] H. Cunningham. 1999. Information Extraction: a User Guide (revised version). Research Memorandum CS-99-07, Department of Computer Science, University of Sheffield, May 1999.
[7] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, R. Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.
[8] Dang Duc Pham, Giang Binh Tran, Son Pham Bao. 2009. A hybrid approach to Vietnamese Word Segmentation using Part of Speech tags. International Conference on Knowledge and Systems Engineering.
[9] Mohammad Emtiyaz Khan. Matrix Inversion Lemma and Information Filter. Honeywell Technology Solutions Lab, Bangalore, India.
[10] Edel Garcia. 2006. Singular Value Decomposition (SVD): A Fast Track Tutorial. First published September 11, 2006; last update September 12, 2006.
[11] Katherine Heller, Adam Sanborn, Nick Chater. Hierarchical Learning of Dimensional Biases in Human Categorization. Department of Engineering, University of Cambridge, Cambridge CB2 1PZ.
[16] C. Khoo, J. C. Na. 2006. Semantic Relations in Information Science. Annual Review of Information Science and Technology, 40, 157–228.
[17] Thomas K. Landauer. 1998. An Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259–284.
[18] Raymond Kosala, Hendrik Blockeel. 2001. Web Mining Research: A Survey. Department of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, B-3001 Heverlee, Belgium.
[19] Sergei Nirenburg, Victor Raskin, Svetlana Sheremetyeva. Lexical Acquisition. Computing Research Laboratory, New Mexico State University.
[20] Claes Neuefeind, Fabian Steeg. 2009. Information-Retrieval: Vektorraum-Model. Text-Engineering I, Information-Retrieval, Wintersemester 2009/2010, Informationsverarbeitung, Universität zu Köln.
[21] Ulrik Petersen. 2009. Emdros HAL example (Hyperspace Analogue to Language).
[22] Kevin Lund, Curt Burgess. 1996. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments and Computers, 28(2), 203–208.
[23] S. Robertson, K. Spärck Jones. 1997. Simple, proven approaches to text retrieval (Technical Report No. 356). Computer Laboratory, University of Cambridge.
[24] James Richard Curran. 2004. From Distributional to Semantic Similarity. Doctor of Philosophy thesis, Institute for Communicating and Collaborative Systems.
