VIET NAM NATIONAL UNIVERSITY, HANOI
COLLEGE OF TECHNOLOGY

NGUYEN CAM TU

HIDDEN TOPIC DISCOVERY TOWARD CLASSIFICATION AND CLUSTERING IN VIETNAMESE WEB DOCUMENTS

Major: Information Technology
Specialization: Information Systems
Code: 60 48 05

MASTER THESIS

SUPERVISOR: Prof. Dr. Ha Quang Thuy

HANOI - 2008

Acknowledgements

My deepest thanks must first go to my research advisor, Prof. Dr. Ha Quang Thuy, who has been an endless source of inspiration in scientific research and who led me to this research area. I particularly appreciate his unconditional support and advice, in both the academic environment and daily life, during the last four years.

Many thanks go to Dr. Phan Xuan Hieu, who has given me much advice and many comments. This work would not have been possible without his support. I would also like to thank him for being my friend and older brother, who has taught me many lessons in both scientific research and daily life.

My thanks also go to all members of the “data mining” seminar group. In particular, I would like to thank BSc. Nguyen Thu Trang for helping me a great deal in collecting data and running experiments.
I gratefully acknowledge the invaluable support and advice, in both technical matters and daily life, of my teachers and colleagues in the Department of Information Systems, Faculty of Technology, Vietnam National University, Hanoi.

I also want to acknowledge the support of the Project QC.06.07 “Vietnamese Named Entity Resolution and Tracking crossover Web Documents”, Vietnam National University, Hanoi; the Project 203906 “Information Extraction Models for finding Entities and Semantic Relations in Vietnamese Web Pages” of the Ministry of Science and Technology, Vietnam; and the National Project 02/2006/HĐ - ĐTCT-KC.01/06-10 “Developing content filter systems to support management and implementation public security – ensure policy”.

Finally, from the bottom of my heart, I would especially like to thank all the members of my family and all my friends. They are an endless source of encouragement in my life.

Nguyen Cam Tu

Assurance

I certify that the achievements in this thesis are my own and are not copied from anyone else's results. Throughout the dissertation, all content is either my own proposal or summarized from multiple sources. All references have clear origins and are properly cited. I take responsibility for this statement.

Hanoi, November 15, 2007
Nguyen Cam Tu

Table of Contents

Introduction
Chapter 1. The Problem of Modeling Text Corpora and Hidden Topic Analysis
  1.1. Introduction
  1.2. The Early Methods
    1.2.1. Latent Semantic Analysis
    1.2.2. Probabilistic Latent Semantic Analysis
  1.3. Latent Dirichlet Allocation
    1.3.1. Generative Model in LDA
    1.3.2. Likelihood
    1.3.3. Parameter Estimation and Inference via Gibbs Sampling
    1.3.4. Applications
  1.4. Summary
Chapter 2. Frameworks of Learning with Hidden Topics
  2.1. Learning with External Resources: Related Works
  2.2. General Learning Frameworks
    2.2.1. Frameworks for Learning with Hidden Topics
    2.2.2. Large-Scale Web Collections as Universal Dataset
  2.3. Advantages of the Frameworks
  2.4. Summary
Chapter 3. Topics Analysis of Large-Scale Web Dataset
  3.1. Some Characteristics of Vietnamese
    3.1.1. Sound
    3.1.2. Syllable Structure
    3.1.3. Vietnamese Word
  3.2. Preprocessing and Transformation
    3.2.1. Sentence Segmentation
    3.2.2. Sentence Tokenization
    3.2.3. Word Segmentation
    3.2.4. Filters
    3.2.5. Remove Non Topic-Oriented Words
  3.3. Topic Analysis for VnExpress Dataset
  3.4. Topic Analysis for Vietnamese Wikipedia Dataset
  3.5. Discussion
  3.6. Summary
Chapter 4. Deployments of General Frameworks
  4.1. Classification with Hidden Topics
    4.1.1. Classification Method
    4.1.2. Experiments
  4.2. Clustering with Hidden Topics
    4.2.1. Clustering Method
    4.2.2. Experiments
  4.3. Summary
Conclusion
  Achievements throughout the thesis
  Future Works
References
  Vietnamese References
  English References
Appendix: Some Clustering Results

List of Figures

Figure 1.1. Graphical model representation of the aspect model in the asymmetric (a) and symmetric (b) parameterization ([55])
Figure 1.2. Sketch of the probability sub-simplex spanned by the aspect model ([55])
Figure 1.3. Graphical model representation of LDA. The boxes are “plates” representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document [20]
Figure 1.4. Generative model for Latent Dirichlet Allocation. Here, Dir, Poiss and Mult stand for the Dirichlet, Poisson and Multinomial distributions respectively
Figure 1.5. Quantities in the model of Latent Dirichlet Allocation
Figure 1.6. Gibbs sampling algorithm for Latent Dirichlet Allocation
Figure 2.1. Classification with Hidden Topics
Figure 2.2. Clustering with Hidden Topics
Figure 3.1. Pipeline of Data Preprocessing and Transformation
Figure 4.1. Classification with VnExpress topics
Figure 4.2. Combination of one snippet with its topics: an example
Figure 4.3. Learning with different topic models of VnExpress dataset, and the baseline (without topics)
Figure 4.4. Test-out-of-train with increasing numbers of training examples. Here, the number of topics is set at 60
Figure 4.5. F1-Measure for classes and average (over all classes) in learning with 60 topics
Figure 4.6. Clustering with Hidden Topics
Figure 4.7. Dendrogram in Agglomerative Hierarchical Clustering
Figure 4.8. Precision of top 5 (and 10, 20) in best clusters for each query
Figure 4.9. Coverage of the top 5 (and 10) good clusters for each query

List of Tables

Table 3.1. Vowels in Vietnamese
Table 3.2. Tones in Vietnamese
Table 3.3. Consonants of the Hanoi variety
Table 3.4. Structure of Vietnamese syllables
Table 3.5. Functional words in Vietnamese
Table 3.6. Statistics of topics assigned by humans in VnExpress Dataset
Table 3.7. Statistics of VnExpress dataset
Table 3.8. Most likely words for sample topics. Here, we conduct topic analysis with 100 topics
Table 3.9. Statistics of Vietnamese Wikipedia Dataset
Table 3.10. Most likely words for sample topics. Here, we conduct topic analysis with 200 topics
Table 4.1. Google search results as training and testing dataset. The search phrases for training and test data are designed to be exclusive
Table 4.2. Experimental results of baseline (learning without topics)
Table 4.3. Experimental results of learning with 60 topics of VnExpress dataset
Table 4.4. Some collocations with highest values of chi-square statistic
Table 4.5. Queries submitted to Google
Table 4.6. Parameters for clustering web search results

Notations & Abbreviations

Word or phrase: Abbreviation
Information Retrieval: IR
Latent Semantic Analysis: LSA
Probabilistic Latent Semantic Analysis: PLSA
Latent Dirichlet Allocation: LDA
Dynamic Topic Models: DTM
Correlated Topic Models: CTM
Singular Value Decomposition: SVD

Introduction

The World Wide Web has influenced many aspects of our lives, changing the way we communicate, conduct business, shop, entertain ourselves, and so on. However, a large portion of Web data is not organized in systematic, well-structured forms, which poses great challenges to those seeking information on the Web. Consequently, many tasks that enable users to search, navigate and organize web pages more effectively have been posed in the last decade, such as searching, page ranking, web clustering and text classification. To this end, there have been many success stories, such as Google, Yahoo, Open Directory Project (Dmoz) and Clusty, to name but a few.

Inspired by this trend, the aim of this thesis is to develop efficient systems that are able to overcome the difficulties of dealing with sparse data. The main motivation is that, while being overwhelmed by a huge amount of online data, we sometimes lack the data needed to search or learn effectively. Let us take web search clustering as an example. In order to meet the real-time condition, that is, the response time must be short enough, most online clustering systems work only with the small pieces of text returned by search engines. Unfortunately, those pieces are not long and rich enough to build a good clustering system. A similar situation occurs when searching for images based only on their captions. Because image captions are very short and sparse chunks of text, most current image retrieval systems still fail to achieve high accuracy.
As a result, much effort has been made recently to take advantage of external resources, for example learning with knowledge-base support or semi-supervised learning, in order to improve accuracy. These approaches, however, have some difficulties: (1) constructing a knowledge base is very time-consuming and labor-intensive, and (2) the results of semi-supervised learning in one application cannot be reused in another one, even in the same domain.

In the thesis, we introduce two general frameworks for learning with hidden topics discovered from large-scale data collections: one for clustering and another for classification. Unlike semi-supervised learning, we approach this issue from the point of view of text/web data analysis based on recently successful topic analysis models, such as Latent Semantic Analysis, Probabilistic Latent Semantic Analysis, or Latent Dirichlet Allocation. The underlying idea of the frameworks is that, for a domain, we collect a very large external data collection called the “universal dataset”, and then build the learner on both the original data (like snippets or image captions) and a rich set of hidden topics discovered from the universal data collection. The general frameworks are flexible [...]

... whole documents as input, since downloading the original documents is time-consuming. The question here is how to enhance the quality of clustering for such document snippets in online web clustering. Inspired by the fact that those snippets are only small pieces of text (and thus poor in content), we propose the framework to enrich them with hidden topics for clustering (Figure 2.2). This framework and topic ...
... particularly in the literary ...

3.2 Preprocessing and Transformation

Data preprocessing and transformation are necessary steps for any data mining process in general, and for hidden topic mining in particular. After these steps, the data are clean, complete, reduced, partially free of noise, and ready to be mined. The main steps of our preprocessing and transformation are described in the subsequent sections and shown ...

... grows linearly with the number of training documents. The parameters for a K-topic pLSI model are K multinomial distributions of size V and M mixtures over the K hidden topics. This gives KV + KM parameters and therefore linear growth in M. The linear growth in parameters suggests that the model is prone to overfitting and, empirically, overfitting is indeed a serious problem. In practice, a tempering heuristic ...

... of text clustering specific to web pages, can be offline or online. Offline clustering clusters the whole store of available web documents and has no response-time constraint. In online clustering, the algorithms need to meet the “real-time condition”, i.e. the system needs to perform clustering as fast as possible. For example, the algorithm should take the document snippets instead ...

... for analyzing topics for new documents (in the case of LDA), we have a natural way to map documents from term space into topic space. This is a real advantage over the heuristic-based mappings in the previous approaches [16][3][10].

2.4 Summary

This chapter described two general frameworks for learning with hidden topics, and their advantages: one for classification and one for clustering. The main advantages ...
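The pLSI parameter count quoted above, KV + KM, can be made concrete with a small calculation. The sketch below is illustrative only: the function name and the values chosen for K, V and M are assumptions, not figures from the thesis.

```python
def plsi_parameter_count(num_topics, vocab_size, num_docs):
    """Parameters of a K-topic pLSI model: K word distributions of size V
    plus one K-dimensional topic mixture for each of the M training documents."""
    return num_topics * vocab_size + num_topics * num_docs


# Hypothetical sizes: K = 100 topics, V = 50,000 vocabulary terms.
# Only the number of training documents M changes between runs.
counts = [plsi_parameter_count(100, 50_000, m) for m in (1_000, 2_000, 3_000)]
print(counts)
```

Each additional 1,000 training documents adds K x 1,000 parameters, which is the linear growth in M that the text links to overfitting.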
... document j and each concept. We write this approximation as $X_k = U_k \Sigma_k V_k^T$. Based on this approximation, we can now do the following:
- See how related documents j and q are in the concept space by comparing the vectors $\hat{d}_j$ and $\hat{d}_q$ (usually by cosine similarity). This gives us a clustering of the documents.
- Compare terms i and p by comparing the vectors $\hat{t}_i$ and $\hat{t}_p$, giving us a clustering of ...

... of hidden topics. The main motivation is how to benefit from huge sources of online data in order to enhance the quality of text/Web clustering and classification. Unlike previous studies of learning with external resources, we approach this issue from the point of view of text/Web data analysis based on recently successful latent topic analysis models like LSA, pLSA, and LDA. The underlying ...

... analyzed in the estimation phase, given the document. The topic distributions of the training dataset are then combined with the training dataset itself for learning the classifier. In a similar way, new documents that need to be classified are combined with their topic distributions to create the so-called “new data with hidden topics” before being passed to the learned classifier.

b. Framework for Clustering

Figure ...

... machine-learning methods have been applied to text classification, including decision trees, neural networks, support vector machines, etc. In typical applications of machine-learning methods, the training data is passed to a learning phase. The result of the learning step is an appropriate classifier capable of categorizing new documents. However, in cases where the training data is not as plentiful as expected ...
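The rank-k approximation and the cosine comparison of document vectors described for LSA above can be sketched as follows. This is a minimal illustration, assuming NumPy is available; the function names and the toy term-document matrix are hypothetical and do not come from the thesis.

```python
import numpy as np


def lsa_document_vectors(X, k):
    """Map documents into a k-dimensional concept space.

    X is a term-document matrix (rows: terms, columns: documents).
    Keeping the k largest singular values gives X_k = U_k Sigma_k V_k^T;
    column j of Sigma_k V_k^T is used here as the vector for document j.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (s[:k, None] * Vt[:k, :]).T  # shape: (num_docs, k)


def cosine(a, b):
    """Cosine similarity between two concept-space vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy matrix (assumed data): documents 0 and 1 share terms 0-1,
# documents 2 and 3 share terms 2-3.
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 2],
], dtype=float)

docs = lsa_document_vectors(X, k=2)
sim_same = cosine(docs[0], docs[1])  # documents with overlapping vocabulary
sim_diff = cosine(docs[0], docs[2])  # documents with disjoint vocabulary
```

In this toy case the two documents with overlapping vocabulary end up nearly parallel in concept space, while documents with disjoint vocabulary are nearly orthogonal, which is the signal a clustering step would use.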
... a topic analysis technique such as pLSA or LDA is applied to the dataset. The result of this step is an estimated topic model, which consists of hidden topics and the probability distributions of words over these topics. Based on this model, we can do topic inference for the training dataset and for new data. For each document, the output of topic inference is a probability distribution over hidden topics, the topics ...
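One simple way to combine a document with its inferred topic distribution, as the frameworks above describe, is to append a pseudo-word for every sufficiently probable topic. The sketch below is an assumption made for illustration: the function name, the `topic:<id>` token format, and the 0.1 threshold are hypothetical and are not taken from the thesis.

```python
def add_topic_features(tokens, topic_dist, threshold=0.1):
    """Return the document's tokens plus one pseudo-word per salient topic.

    tokens     -- list of word tokens for a snippet or document
    topic_dist -- inferred probability of each hidden topic for this document
    threshold  -- assumed cut-off below which a topic is ignored
    """
    enriched = list(tokens)
    for topic_id, prob in enumerate(topic_dist):
        if prob >= threshold:
            enriched.append(f"topic:{topic_id}")
    return enriched


# Hypothetical snippet and topic distribution over three hidden topics.
snippet = ["bong_da", "viet_nam", "thang"]
theta = [0.05, 0.70, 0.25]
enriched = add_topic_features(snippet, theta)
print(enriched)
```

The enriched token list can then be passed to any classifier or clustering algorithm that accepts bag-of-words input, so sparse snippets that share no surface words can still match through shared topics.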