
MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF TECHNOLOGY

THÂN QUANG KHOÁT

TOPIC MODELING AND ITS APPLICATIONS

MAJOR: INFORMATION TECHNOLOGY
THESIS FOR THE DEGREE OF MASTER OF SCIENCE
SUPERVISOR: Prof. HỒ TÚ BẢO

HANOI, 2009

PLEDGE

I pledge that the content of this thesis was written solely by me. All of the content is based on reliable references, such as papers published in distinguished international conferences and journals and books from widely known publishers. Many parts and discussions of the thesis are new and have not previously been published by any other authors.

ACKNOWLEDGEMENT

First and foremost, I would like to express my gratitude to my supervisor, Professor Ho Tu Bao, for introducing me to this attractive research area, for his willingness to promptly support me in completing the thesis, and for much invaluable advice from the starting point of my thesis. I would like to sincerely thank Nguyen Phuong Thai and Nguyen Cam Tu for sharing some data sets and for pointing me to sources on the network where I could find the implementations of some topic models. Thanks also go to Phung Trung Nghia for spending his valuable time helping me load the data for my experiments. Finally, I would like to thank David Blei and Thomas Griffiths for their insightful discussions on Topic Modeling and for providing the C implementation of one of their topic models.

TABLE OF CONTENTS

List of Phrases
List of Tables
List of Figures
Chapter 1 INTRODUCTION
Chapter 2 MODERN PROGRESS IN TOPIC MODELING
  2.1 Linear algebra based models
  2.2 Statistical topic models
  2.3 Discussion and notes
Chapter 3 LINEAR ALGEBRA BASED TOPIC MODELS
  3.1 An overview
  3.2 Latent Semantic Analysis
  3.3 QR factorization
  3.4 Discussion
Chapter 4 PROBABILISTIC TOPIC MODELS
  4.1 An overview
  4.2 Probabilistic Latent Semantic Analysis
  4.3 Latent Dirichlet Allocation
  4.4 Hierarchical Latent Dirichlet Allocation
  4.5 Bigram Topic Model
Chapter 5 SOME APPLICATIONS OF TOPIC MODELS
  5.1 Classification
  5.2 Analyzing research trends over time
  5.3 Semantic representation
  5.4 Information retrieval
  5.5 More applications
  5.6 Experimenting with some topic models
CONCLUSION
REFERENCES

LIST OF PHRASES

Abbreviation    Full name
AI              Artificial Intelligence
ART             Author-Recipient-Topic Model
AT              Author-Topic Model
BTM             Bigram Topic Model
cDTM            Continuous Dynamic Topic Model
CTM             Correlated Topic Model
dDTM            Discrete Dynamic Topic Model
DELSA           Dirichlet Enhanced LSA
DiscLDA         Discriminative LDA
EM              Expectation Maximization
HDP             Hierarchical Dirichlet Processes
HDP-RE          Hierarchical Dirichlet Processes with random effects
hLDA            Hierarchical Latent Dirichlet Allocation
HMM-LDA         Hidden Markov Model LDA
HTMM            Hidden Topic Markov Model
IG-LDA          Incremental Gibbs LDA
IR              Information Retrieval
LDA             Latent Dirichlet Allocation
LSA             Latent Semantic Analysis
MBTM            Memory Bounded Topic Model
MCMC            Markov Chain Monte Carlo
nCRP            Nested Chinese restaurant process
NetSTM          Network Regularized Statistical Topic Model
PF-LDA          Particle Filter LDA
pLSA            Probabilistic Latent Semantic Analysis
PLSV            Probabilistic Latent Semantic Visualization
sLDA            Supervised Latent Dirichlet Allocation
Spatial LDA     Spatial Latent Dirichlet Allocation
STM             Syntactic Topic Model
SVD             Singular Value Decomposition
TEM             Tempered EM algorithm
LIST OF TABLES

Table 2.1 Some selected probabilistic topic models
Table 5.1 DiscLDA for classification
Table 5.2 Comparison of query likelihood retrieval (QL), cluster-based retrieval (CBDM) and retrieval with LDA-based document models (LBDM)
Table 5.3 The most probable topics from the NIPS and VnExpress collections
Table 5.4 Finding the topics of a document
Table 5.5 Finding the topics of a report
Table 5.6 Selected topics found by HMM-LDA
Table 5.7 Classes of function words found by HMM-LDA

LIST OF FIGURES

Figure 1.1 Some approaches to representing knowledge
Figure 2.1 A general view on Topic Modeling
Figure 2.2 Probabilistic topic models in view of the bag-of-words assumption
Figure 2.3 Viewing generative models in terms of topics
Figure 2.4 A parametric view on generative models
Figure 3.1 A corpus consisting of documents
Figure 3.2 An illustration of finding topics by LSA using cosine
Figure 3.3 A geometric illustration of representing items in 2-dimensional space
Figure 3.4 Finding relevant documents using the QR-based method
Figure 4.1 Graphical model representation of pLSA
Figure 4.2 A geometric interpretation of pLSA
Figure 4.3 Graphical model representation of LDA
Figure 4.4 A geometric interpretation of LDA
Figure 4.5 A variational inference algorithm for LDA
Figure 4.6 A geometric illustration of the document generation process
Figure 4.7 An example of a hierarchy of topics [8]
Figure 4.8 A graphical model representation of BTM
Figure 5.1 LDA for classification
Figure 5.2 The dynamics of the three hottest and three coldest topics
Figure 5.3 Evolution of topics through decades

Chapter 1 INTRODUCTION

Information Retrieval (IR) is a very active area with a long history. The development of IR is often associated with increasingly huge corpora, such as collections of Web pages or collections of scientific papers accumulated over the years. These corpora pose many hard questions that have received much attention from researchers. One of the most famous questions, which seems never-ending, is how to automatically index the documents of a given corpus or database. Another substantial question is how to find, from the Internet or a given corpus, the documents that are semantically most relevant to a given user query.

Finding and ranking are usually important tasks in IR. Many tools supporting these tasks are available now, for example Google and Yahoo. However, most of these tools are only able to search for documents via word matching instead of semantic matching. Semantics is well known to be complicated, so finding and ranking documents in the presence of semantics is extremely hard. Despite this fact, these tasks potentially have many important applications, which in my opinion are future web service technologies, for instance semantic searching, semantic advertising, academic recommending, and intelligent controlling.

Semantics is a hot topic not only in the IR community but also in the Artificial Intelligence (AI) community. In particular, in the field of knowledge representation it is crucial to know how to effectively represent natural knowledge gathered from the surrounding environment so that reusing it or integrating new knowledge is easy and efficient. To obtain a good
knowledge database, semantics cannot be absent, since any word has its own meanings and semantic relations to other words. As we know, a word may have multiple senses and play different roles in

[...]

... that LDA-based retrieval often outperforms other methods. Table 5.2 shows their comparison of LDA-based retrieval with some other methods on various data sets. The measure used for the comparison was average precision.

Table 5.2 Comparison of query likelihood retrieval (QL), cluster-based retrieval (CBDM) and retrieval with LDA-based document models (LBDM)

Collection   QL       CBDM     LBDM
AP           0.2179   0.2326   0.2651
FT           0.2589   0.2713   0.2807
SJMN         0.2032   0.2171   0.2307
LA           0.2468   0.2590   0.2666
WSJ          0.2958   0.2984   0.3253

5.5 More applications

The number of important applications of Topic Modeling has been increasing day by day. Due to the limited time and space of this thesis, we cannot describe them all, so we only discuss some typical applications in the previous sections. Many interesting applications of Topic Modeling outside text corpora have also been introduced. Some authors have demonstrated that topic models can be employed to find objects appearing in collections of images [69], [41]. Biro et al. [7] have used topic models to learn spam and non-spam web pages for the spam filtering task. Dietz et al. [21] proposed a topic model that is able to learn the correlations among scientific papers for predicting citation influences. Similar research can be found in [19] and [49]. Some authors exploit the ability of topic models to collect topics from a data set in order to classify texts [29] or to visualize documents [32], [38]. Learning concepts with topic models has also been addressed by many authors [75], [23]. For more interesting applications, we refer to [12], [18], [22], [35], [43], [44], [45], [52], [53], [56], [65], [70], [73], and [76].

5.6 Experimenting with some topic models

This section reports the results of the author's experiments with some probabilistic topic models. These experiments used the following data sets:

- A collection of 1,740 papers from the NIPS conferences, up to volume 12. This collection yields a vocabulary of 13,649 unique English words and 2,301,375 word tokens in total.
- A collection of 12,525 articles from VnExpress, an electronic Vietnamese newspaper. It comprises a vocabulary of 20,171 unique Vietnamese words and 1,427,482 word tokens in total.

The first data set, which was preprocessed to remove stop words and other special characters, can be found at http://psiexp.ss.uci.edu/research/programs_data; see http://books.nips.cc/ for the original papers. The second collection was provided by Nguyen Cam Tu.

The generative models that we experimented with include LDA and HMM-LDA. LDA was previously described in Chapter 4; however, the implementation here uses a Gibbs sampler [25] instead of the variational EM algorithm for the inference of parameters. HMM-LDA is a non-bag-of-words model proposed by Griffiths et al. [26]. It is a composite model that takes into account both long- and short-range dependencies. More specifically, it treats the syntactic component as a Hidden Markov Model (HMM) and the semantic component as a topic model; thus, a document is a sample from a mixture of an HMM and a topic model. This assumption makes HMM-LDA well suited to discovering function words and content words separately, as we shall see later.

5.6.1 Finding topics

To demonstrate the power of topic models in uncovering the topics of a given corpus or document, we applied LDA to the two data sets with the following settings. (Note that some experiments with linear algebra based models were presented in Chapter 3; nonetheless, they only dealt with some toy data sets.)

- The number of topics: T = 50
- The number of iterations of the Gibbs sampler: N = 1000
- α was a scalar: α = 50/T
- β was a scalar: β = 0.01

We remark that the scalar β here should be interpreted as the hyperparameter η of LDA described in Section 4.3.
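As an illustration of these settings, the sketch below implements a minimal collapsed Gibbs sampler for LDA in Python. It is only a schematic reconstruction of the sampling scheme of [25], not the implementation used in these experiments; the corpus representation, the function name, and the variable names are illustrative assumptions.

    import numpy as np

    def lda_gibbs(docs, V, T=50, n_iter=1000, beta=0.01, seed=0):
        """Collapsed Gibbs sampling for LDA.

        docs : list of documents, each a list of word ids in [0, V)
        V    : vocabulary size
        Returns the word-topic and document-topic count matrices.
        """
        alpha = 50.0 / T                      # symmetric document-topic prior, as above
        rng = np.random.default_rng(seed)

        n_wt = np.zeros((V, T))               # word-topic counts
        n_dt = np.zeros((len(docs), T))       # document-topic counts
        n_t = np.zeros(T)                     # tokens assigned to each topic
        z = []                                # topic assignment of every token

        # random initialization of topic assignments
        for d, doc in enumerate(docs):
            z_d = rng.integers(T, size=len(doc))
            z.append(z_d)
            for w, t in zip(doc, z_d):
                n_wt[w, t] += 1
                n_dt[d, t] += 1
                n_t[t] += 1

        for _ in range(n_iter):
            for d, doc in enumerate(docs):
                for i, w in enumerate(doc):
                    t = z[d][i]
                    # remove the current assignment of this token
                    n_wt[w, t] -= 1; n_dt[d, t] -= 1; n_t[t] -= 1
                    # full conditional Pr(z_i = t | all other assignments)
                    p = (n_wt[w] + beta) / (n_t + V * beta) * (n_dt[d] + alpha)
                    t = rng.choice(T, p=p / p.sum())
                    z[d][i] = t
                    n_wt[w, t] += 1; n_dt[d, t] += 1; n_t[t] += 1

        return n_wt, n_dt

The per-topic word distributions of the kind reported in Table 5.3 can then be read from the smoothed, normalized columns of the word-topic counts, φ(w | t) = (n_wt[w, t] + β) / (n_t[t] + Vβ).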
Table 5.3 The most probable topics from the NIPS and VnExpress collections

Topics from the NIPS collection (each topic is listed with its probability, followed by its 10 most probable words and their probabilities):

TOPIC_10 (0.0406): order 0.0284, approach 0.0199, case 0.0174, results 0.0170, number 0.0118, general 0.0106, method 0.0102, work 0.0095, terms 0.0090, problem 0.0088
TOPIC_1 (0.0329): units 0.0898, hidden 0.0624, unit 0.0527, layer 0.0518, network 0.0363, input 0.0360, weights 0.0312, output 0.0298, net 0.0190, training 0.0187
TOPIC_21 (0.0313): network 0.2152, neural 0.1134, networks 0.0988, input 0.0743, output 0.0564, inputs 0.0203, architecture 0.0185, outputs 0.0136, net 0.0113, layer 0.0103
TOPIC_47 (0.0279): training 0.1233, set 0.0697, error 0.0631, test 0.0326, data 0.0319, generalization 0.0271, sets 0.0207, performance 0.0201, examples 0.0183, trained 0.0146

Topics from the VnExpress collection:

TOPIC_46 (0.0340): trận_đấu 0.0328, trận 0.0323, chiến_thắng 0.0285, tỷ_số 0.0255, thắng 0.0229, Vòng 0.0228, giành 0.0219, mở_rộng 0.0217, đối_thủ 0.0197, set 0.0194
TOPIC_49 (0.0308): cổ_phiếu 0.0903, phiên 0.0662, giảm 0.0539, đạt 0.0532, điểm 0.0478, lệnh 0.0471, giao_dịch 0.0466, khớp 0.0404, giá_trị 0.0376, khối_lượng 0.0350
TOPIC_34 (0.0301): trường 0.0829, sinh_viên 0.0565, đại_học 0.0519, học_sinh 0.0425, lớp 0.0216, đào_tạo 0.0213, giáo_dục 0.0207, tiếng 0.0183, tốt_nghiệp 0.0171, học_bổng 0.0160
TOPIC_24 (0.0296): cầu_thủ 0.0453, đội 0.0441, bóng_đá 0.0287, cup 0.0262, trận 0.0247, sân 0.0209, mùa 0.0208, bóng 0.0201, đội_bóng 0.0194, đội_tuyển 0.0191

The topics obtained from these experiments are easily understandable. Table 5.3 presents the most probable topics (with their corresponding probabilities) from each collection, NIPS and VnExpress. Each topic is represented by its 10 most probable words, accompanied by their corresponding probabilities. For the NIPS corpus, the 10th topic plays the most important role with the largest probability among the 50 topics, and makes us think about "methods" for some "problems". The other topics, with smaller probabilities, show that many papers in the NIPS corpus discuss "neural networks". For the VnExpress corpus, it is easily observed that the main topics on which the articles concentrate often relate to "sport", "stock", and "education".

To find the topics of a document that appeared in the corpus, we can use the quantities Pr(z_k | θ_i). More concretely, given a document d_i, we inspect all Pr(z_1:T | θ_i) to find the most probable topics for the document. Table 5.4 shows the most probable topics of the first NIPS paper by Abu-Mostafa in 1988, entitled "Connectivity versus Entropy".

Table 5.4 Finding the topics of a document

Topics of the paper "Connectivity versus Entropy":

TOPIC_41 (0.2155): probability 0.0634, distribution 0.0490, information 0.0308, density 0.0160, random 0.0153, stochastic 0.0149, entropy 0.0139, log 0.0130, distributions 0.0128, statistical 0.0101
TOPIC_36 (0.2111): theorem 0.0263, bound 0.0188, threshold 0.0179, number 0.0162, proof 0.0149, size 0.0137, bounds 0.0134, dimension 0.0130, neural 0.0129, networks 0.0118
TOPIC_8 (0.0967): neurons 0.0746, neuron 0.0540, activity 0.0247, connections 0.0238, phase 0.0205, network 0.0180, inhibitory 0.0120, excitatory 0.0103, fig 0.0100, inhibition 0.0091
TOPIC_26 (0.0571): learning 0.3049, learn 0.0350, learned 0.0294, task 0.0264, rule 0.0243, tasks 0.0180, based 0.0113, examples 0.0106, space 0.0096, learns 0.0092
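Rankings like those in Table 5.4, and in Table 5.5 below, can be read off the sampler's document-topic counts. A minimal sketch, reusing the n_dt matrix and alpha from the sampler sketched earlier (these names are assumptions, not the thesis code):

    def document_topics(n_dt, d, alpha, top=4):
        """Estimate theta_d = Pr(z | d_i) from the document-topic counts of
        document d and return the `top` most probable topics together with
        their probabilities."""
        T = n_dt.shape[1]
        theta = (n_dt[d] + alpha) / (n_dt[d].sum() + T * alpha)
        best = theta.argsort()[::-1][:top]
        return [(int(t), float(theta[t])) for t in best]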
Table 5.5 Finding the topics of a report

Topics of the report "Phái đoàn Triều Tiên viếng cố tổng thống Hàn Quốc" ("North Korean delegation pays respects to the late South Korean president"):

TOPIC_18 (0.1022): đàm_phán 0.2484, thành_viên 0.2314, tổ_chức 0.0914, tự_do 0.0805, hiệp_định 0.0780, kinh_tế 0.0647, thỏa_thuận 0.0563, đoàn 0.0247, quốc_hội 0.0242, mở_cửa 0.0115
TOPIC_47 (0.0916): tổng_thống 0.3516, hạt_nhân 0.1072, mối 0.0925, tuyên_bố 0.0677, quan_chức 0.0630, chiến_tranh 0.0625, chuyến 0.0381, khẳng_định 0.0328, nỗ_lực 0.0293, căng_thẳng 0.0250
TOPIC_42 (0.0554): kinh_tế 0.8690, tình_trạng 0.0419, gia_tăng 0.0270, thống_đốc 0.0068, dừng 0.0055, tín_hiệu 0.0025, lý_thuyết 0.0006, quốc_hội 0.0002, đoàn 0.0000, quan_chức 0.0000
TOPIC_ (0.0553): xây_dựng 0.5419, dự_án 0.3218, tòa_nhà 0.0848, liên 0.0050, đoàn 0.0000, quan_chức 0.0000, trang_phục 0.0000, màu 0.0000, người 0.0000, tổng_thống 0.0000

A new document that did not previously appear in the corpus can also easily be inspected to find its topics; see the treatment of a new document in Section 4.3.2. Table 5.5 shows the most probable topics of a recent report, dated 21/08/2009, about North Korea ("Triều Tiên").

5.6.2 Finding classes of words

As mentioned earlier, HMM-LDA is able to classify words into classes without any prior knowledge other than a text corpus. This ability derives from a clever combination of an HMM and a topic model. It implies that the model is much more complex than some other topic models such as LDA, BTM, and CTM; interested readers may refer to [26] for more details. To demonstrate the ability of the model, we experimented with the NIPS corpus using the same settings as in the above experiments, except that the number of iterations was N = 400, the number of states was NS = 16, and the hyperparameter was γ = 0.1. Tables 5.6 and 5.7 respectively show the most probable topics and the classes of function words obtained by training on the corpus.
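To make the composite structure that produces these two kinds of output concrete, the following sketch simulates the generative process of HMM-LDA for a single document. It is a rough schematic of the model of Griffiths et al. [26], not their implementation; the parameter names and the convention that class 0 is the semantic class are assumptions.

    import numpy as np

    def generate_hmm_lda_doc(length, theta, phi, pi, trans, class_word, rng=None):
        """Schematic generative process of HMM-LDA for one document.

        theta      : (T,)   topic proportions of this document
        phi        : (T, V) topic-word distributions (the semantic component)
        pi         : (C,)   initial distribution over word classes
        trans      : (C, C) HMM transition matrix over classes
        class_word : (C, V) word distributions of the syntactic classes;
                     class 0 is routed through the topic model instead
        """
        if rng is None:
            rng = np.random.default_rng()
        T, V = phi.shape
        C = trans.shape[0]
        words = []
        c = rng.choice(C, p=pi)                     # class of the first word
        for i in range(length):
            if i > 0:
                c = rng.choice(C, p=trans[c])       # c_i depends on c_{i-1} (short-range syntax)
            if c == 0:
                z = rng.choice(T, p=theta)          # semantic class: draw a topic (long-range content)
                w = rng.choice(V, p=phi[z])
            else:
                w = rng.choice(V, p=class_word[c])  # syntactic class: function-word distribution
            words.append(int(w))
        return words

Inference in the actual model runs Gibbs sampling over both the topic and the class assignments, which is what yields the content-word topics in Table 5.6 and the function-word classes in Table 5.7.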
Table 5.6 Selected topics found by HMM-LDA

Topics of the NIPS corpus:

TOPIC_49 (0.0358): * 0.9435, behavior 0.0039, detail 0.0015, work 0.0011, air 0.0010, complex 0.0010, others 0.0010, ways 0.0009, mass 0.0009, pressure 0.0009
TOPIC_15 (0.0319): units 0.2029, unit 0.1110, hidden 0.1107, layer 0.0748, weights 0.0580, network 0.0330, activation 0.0301, connections 0.0209, layers 0.0207, training 0.0206
TOPIC_21 (0.0309): cells 0.0776, cell 0.0472, stimulus 0.0394, response 0.0361, visual 0.0306, stimuli 0.0253, responses 0.0205, spatial 0.0194, receptive 0.0170, input 0.0158
TOPIC_10 (0.0294): state 0.0508, policy 0.0391, action 0.0350, value 0.0317, reinforcement 0.0302, actions 0.0268, control 0.0216, function 0.0175, reward 0.0162, time 0.0157

Note that the topics found by HMM-LDA are at least as interpretable as the ones found by LDA. The intriguing characteristic of the model is its capability of discovering classes of function words. As we can see, some of the word classes in Table 5.7 are representative of classes of function words, such as "determiner", "preposition", "verb", and "adjective".

Table 5.7 Classes of function words found by HMM-LDA

Classes of function words (each class is listed with its probability, followed by its 10 most probable words):

CLASS_14 (0.1856): the 0.5731, a 0.1720, an 0.0330, this 0.0263, each 0.0190, these 0.0138, our 0.0110, its 0.0100, two 0.0078, all 0.0077
CLASS_1 (0.0944): in 0.2797, for 0.1631, with 0.1038, on 0.0755, from 0.0621, as 0.0598, at 0.0444, by 0.0246, using 0.0241, over 0.0127
CLASS_6 (0.0814): * 0.0393, same 0.0181, two 0.0154, different 0.0139, first 0.0139, single 0.0136, new 0.0123, neural 0.0116, large 0.0110, local 0.0110
CLASS_10 (0.0803): NUMBER 0.4463, * 0.1182, i 0.0254, a 0.0228, x 0.0194, t 0.0159, n 0.0151, c 0.0132, b 0.0124, r 0.0104
CLASS_7 (0.0647): used 0.0397, shown 0.0242, based 0.0150, obtained 0.0116, described 0.0115, trained 0.0110, * 0.0105, given 0.0098, presented 0.0092, well 0.0092
CLASS_11 (0.0594): is 0.3466, are 0.1621, can 0.0986, was 0.0544, will 0.0386, have 0.0368, were 0.0367, has 0.0270, may 0.0245, would 0.0168
CLASS_2 (0.0577): of 0.8971, between 0.0312, in 0.0160, for 0.0110, to 0.0059, on 0.0057, over 0.0054, that 0.0030, and 0.0014, where 0.0013
CLASS_8 (0.0551): be 0.1718, not 0.0627, been 0.0299, also 0.0254, more 0.0186, only 0.0137, then 0.0126, very 0.0117, have 0.0116, * 0.0080

CONCLUSION

In this thesis the author has presented an extensive survey of Topic Modeling. Many appealing characteristics and potential applications of Topic Modeling have been revealed. The author attempted to classify topic models into various classes to make it easier to quickly gain an overview of and understand the field. To our knowledge, this is the first, though rough, classification of topic models, not previously discussed by other Topic Modeling authors. Apart from these main results, the author also presented some topic models, each of which is most typical of a certain class of models. The advantages and disadvantages of each model were discussed after presenting it, and some possible extensions to some of the models were discussed as well. Finally, the thesis reports the author's experiments with some probabilistic topic models on a collection of NIPS papers and a collection of VnExpress articles. The goal of those experiments is to take an initial step toward a clear interpretation and further study of the field in the future.

Even though the aim of the thesis is to make an extensive survey of Topic Modeling, the results are somewhat incomplete. The partial views presented in Chapter 2 may not cover all aspects of Topic Modeling, and some probabilistic topic models were discussed in detail without any practical test. Because of these facts, the thesis may fail to persuade readers to agree completely with the author's arguments.

REFERENCES

1. Aldous D (1985), “Exchangeability and Related Topics”, in École d'Été de Probabilités de Saint-Flour XIII–1983, Springer, Berlin, pp 1–198
2. Andrieu C., Freitas N D., Doucet A., Jordan M I (2003), “An Introduction to MCMC for Machine Learning”, Machine Learning, 50, pp 5–43
3. Asuncion A., Smyth P., Welling M (2008), “Asynchronous Distributed Learning of Topic Models”, Advances in Neural Information Processing Systems, 20, pp 81–88
4. Beal M J., Ghahramani Z., Rasmussen C E (2002), “The infinite hidden Markov model”, Advances in Neural Information Processing Systems, 14
5. Berry M W., Dumais S T., O'Brien G W (1994), “Using Linear Algebra for Intelligent Information Retrieval”, SIAM Review, 37, pp 573–595
6. Bishop C (2006), Pattern Recognition and Machine Learning, Springer
7. Biro I., Szabo J., Benczur A (2008), “Latent Dirichlet Allocation in Web Spam Filtering”, In Proceedings of the Fourth International Workshop on Adversarial Information Retrieval on the Web -WWW, pp 29–32
8. Blei D M., Griffiths T L., Jordan M I (2007), “The nested Chinese restaurant process and Bayesian inference of topic hierarchies”, http://arxiv.org/abs/0710.0845. A shorter version appears in NIPS, 16, pp 17–24
9. Blei D M., Jordan M I (2006), “Variational inference for Dirichlet process mixtures”, Bayesian Analysis, 1(1), pp 121–144
10. Blei D M., Lafferty J (2007), “A correlated topic model of Science”, The Annals of Applied Statistics, 1(1), pp 17–35
11. Blei D M., Ng A Y., Jordan M I (2003), “Latent Dirichlet allocation”, Journal of Machine Learning Research, 3, pp 993–1022
12. Blei D., Jordan M (2003), “Modeling annotated data”, In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp 127–134
13. Blei D., Lafferty J (2006), “Dynamic Topic Models”, In Proceedings of the 23rd International Conference on Machine Learning -ICML, pp 113–120
14. Blei D., McAuliffe J (2007), “Supervised topic models”, Advances in Neural Information Processing Systems, 19
15. Boyd-Graber J., Blei D (2008), “Syntactic topic models”, Advances in Neural Information Processing Systems, 20
16. Canini K R., Shi L., Griffiths T (2009), “Online Inference of Topics with Latent Dirichlet Allocation”, In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics -AISTATS, 5, pp 65–72
17. Chemudugunta C., Holloway A., Smyth P., Steyvers M (2008), “Modeling Documents by Combining Semantic Concepts with Unsupervised Statistical Learning”, In Proceedings of the International Semantic Web Conference
18. Chemudugunta C., Smyth P., Steyvers M (2006), “Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model”, Advances in Neural Information Processing Systems, 18
19. Cohn D., Hofmann T (2000), “The missing link - a probabilistic model of document content and hypertext connectivity”, Advances in Neural Information Processing Systems, 12
20. Deerwester S., Dumais S T., Furnas G W., Landauer T K., Harshman R (1990), “Indexing by Latent Semantic Analysis”, Journal of the American Society for Information Science, 41, pp 391–407
21. Dietz L., Bickel S., Scheffer T (2007), “Unsupervised Prediction of Citation Influences”, In Proceedings of the 24th International Conference on Machine Learning -ICML, pp 233–240
22. Fei-Fei L., Perona P (2005), “A Bayesian hierarchical model for learning natural scene categories”, In Proceedings of the International Conference on Computer Vision and Pattern Recognition -CVPR, 2, pp 524–531
23. Foltz P W., Kintsch W., Landauer T K (1998), “The measurement of textual coherence with Latent Semantic Analysis”, Discourse Processes, 25, pp 285–307
24. Gomes R., Welling M., Perona P (2008), “Memory Bounded Inference in Topic Models”, In Proceedings of the 25th International Conference on Machine Learning -ICML, pp 344–351
25. Griffiths T L., Steyvers M (2004), “Finding scientific topics”, Proceedings of the National Academy of Sciences, USA, 101, pp 5228–5235
26. Griffiths T L., Steyvers M., Blei D M., Tenenbaum J B (2005), “Integrating topics and syntax”, Advances in Neural Information Processing Systems, 17, pp 537–544
27. Griffiths T L., Steyvers M., Tenenbaum J (2007), “Topics in Semantic Representation”, Psychological Review, 114(2), pp 211–244
28. Gruber A., Rosen-Zvi M., Weiss Y (2007), “Hidden topic Markov models”, In Proceedings of Artificial Intelligence and Statistics -AISTATS, 2, pp 163–170
29. Hieu P X., Minh N L., Horiguchi S (2008), “Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections”, In Proceedings of the 17th International World Wide Web Conference -WWW, pp 91–100
30. Hofmann T (1999), “Probabilistic latent semantic indexing”, In Proceedings of the 22nd Annual International SIGIR Conference, pp 50–57
31. Hofmann T (2001), “Unsupervised Learning by Probabilistic Latent Semantic Analysis”, Machine Learning, 42(1), pp 177–196
32. Iwata T., Yamada T., Ueda N (2008), “Probabilistic Latent Semantic Visualization: Topic Model for Visualizing Documents”, In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining -KDD
33. Jordan M., Ghahramani Z., Jaakkola T., Saul L (1999), “Introduction to variational methods for graphical models”, Machine Learning, 37, pp 183–233
34. Kim S., Smyth P (2007), “Hierarchical Dirichlet Processes with random effects”, Advances in Neural Information Processing Systems, 19
35. Kintsch W (2001), “Predication”, Cognitive Science, 25, pp 173–202
36. Kurihara K., Welling M., Teh Y W (2007), “Collapsed Variational Dirichlet Process Mixture Models”, In Proceedings of the 21st Joint Conference on Artificial Intelligence -IJCAI, pp 2796–2801
37. Kurihara K., Welling M., Vlassis N (2007), “Accelerated variational DP mixture models”, Advances in Neural Information Processing Systems, 19
38. Lacoste-Julien S., Sha F., Jordan M I (2008), “DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification”, Advances in Neural Information Processing Systems, 20
39. Landauer T K., Dumais S T (1997), “A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge”, Psychological Review, 104, pp 211–240
40. Landauer T K., Foltz P W., Laham D (1998), “Introduction to latent semantic analysis”, Discourse Processes, 25, pp 259–284
41. Lavrenko V., Manmatha R., Jeon J (2003), “A Model for Learning the Semantics of Pictures”, Advances in Neural Information Processing Systems, 15
42. Lee D D., Seung H S (1999), “Learning the parts of objects by non-negative matrix factorization”, Nature, 401, pp 788–791
43. León J A., Olmos R., Escudero I., Cañas J., Salmerón L (2005), “Assessing Short Summaries With Human Judgments Procedure and LSA in narrative and expository texts”, Behavioral Research Methods, 38(4), pp 616–627
44. McCallum A., Corrada-Emmanuel A., Wang X (2005), “Topic and Role Discovery in Social Networks”, In Proceedings of the 19th Joint Conference on Artificial Intelligence -IJCAI, pp 786–791
45. Mei Q., Cai D., Zhang D., Zhai C (2008), “Topic Modeling with Network Regularization”, In Proceedings of the 17th International World Wide Web Conference -WWW, pp 101–110
46. Berry M W., Drmac Z., Jessup E R (1999), “Matrices, Vector Spaces, and Information Retrieval”, SIAM Review, 41(2), pp 335–362
47. Minka T., Lafferty J (2002), “Expectation–propagation for the generative aspect model”, In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence -UAI, pp 352–359
48. Mukherjee I., Blei D M (2008), “Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation”, Advances in Neural Information Processing Systems, 20
49. Nallapati R., Ahmed A., Xing E P., Cohen W W (2008), “Joint Latent Topic Models for Text and Citations”, In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining -KDD
50. Navarro D J., Griffiths T (2008), “Latent Features in Similarity Judgments: A Nonparametric Bayesian Approach”, Neural Computation, 20, pp 2597–2628
51. Newman D., Asuncion A., Smyth P., Welling M (2007), “Distributed inference for latent Dirichlet allocation”, Advances in Neural Information Processing Systems, 19
52. Newman D., Chemudugunta C., Smyth P., Steyvers M (2006), “Analyzing Entities and Topics in News Articles Using Statistical Topic Models”, In Proceedings of Intelligence and Security Informatics, LNCS 3975, Springer-Verlag, pp 93–104
53. Newman D., Chemudugunta C., Smyth P., Steyvers M (2006), “Statistical Entity-Topic Models”, In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining -KDD, pp 680–686
54. Papadimitriou C., Tamaki H., Raghavan P., Vempala S (1998), “Latent semantic indexing: A probabilistic analysis”, In Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp 159–168
55. Porteous I., Newman D., Ihler A., Asuncion A., Smyth P., Welling M (2008), “Fast Collapsed Gibbs Sampling For Latent Dirichlet Allocation”, In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining -KDD, pp 569–577
56. Rosen-Zvi M., Griffiths T., Steyvers M., Smyth P (2006), “Learning Author Topic Models from Text Corpora”, ACM Transactions on Information Systems
57. Roweis S T., Saul L K (2000), “Nonlinear Dimensionality Reduction by Locally Linear Embedding”, Science, 290, pp 2323–2326
58. Steyvers M., Griffiths T L., Dennis S (2006), “Probabilistic inference in human semantic memory”, TRENDS in Cognitive Sciences, 10(7), pp 327–334
59. Tam Y C., Schultz T (2008), “Correlated Bigram LSA for Unsupervised Language Model Adaptation”, Advances in Neural Information Processing Systems, 20
60. Teh Y W., Kurihara K., Welling M (2008), “Collapsed variational inference for HDP”, Advances in Neural Information Processing Systems, 20
61. Teh Y W., Newman D., Welling M (2007), “A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation”, Advances in Neural Information Processing Systems, 19
62. Teh Y., Jordan M., Beal M., Blei D (2006), “Hierarchical Dirichlet Processes”, Journal of the American Statistical Association, 101(476), pp 1566–1581
63. Tenenbaum J B., Griffiths T L (2001), “Generalization, similarity, and Bayesian inference”, Behavioral and Brain Sciences, 24, pp 629–640
64. Tenenbaum J B., Silva V., Langford J (2000), “A Global Geometric Framework for Nonlinear Dimensionality Reduction”, Science, 290, pp 2319–2322
65. Toutanova K., Johnson M (2008), “A Bayesian LDA-based model for semi-supervised part-of-speech tagging”, Advances in Neural Information Processing Systems, 20
66. Wallach H (2006), “Topic Modeling: Beyond Bag-of-Words”, In Proceedings of the 23rd International Conference on Machine Learning -ICML, pp 977–984
67. Wang C., Blei D., Heckerman D (2008), “Continuous Time Dynamic Topic Models”, Advances in Neural Information Processing Systems, 20
68. Wang L., Dunson D B (2007), “Fast Bayesian Inference in Dirichlet Process Mixture Models”, Technical Report
69. Wang X., Grimson E (2007), “Spatial Latent Dirichlet Allocation”, Advances in Neural Information Processing Systems, 19
70. Wang X., Mohanty N., McCallum A (2005), “Group and Topic Discovery from Relations and Text”, In Proceedings of the 3rd International Workshop on Link Discovery -LinkKDD, pp 28–35
71. Wainwright M J., Jordan M I (2008), “Graphical Models, Exponential Families, and Variational Inference”, Foundations and Trends in Machine Learning, 1(1–2), pp 1–305
72. Wei X., Croft B (2006), “LDA-based document models for ad-hoc retrieval”, In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pp 178–185
73. Wolfe M B., Schreiner M E., Rehder B., Laham D., Foltz P W., Kintsch W., Landauer T (1998), “Learning from text: Matching readers and text by latent semantic analysis”, Discourse Processes, 25, pp 309–336
74. Yu K., Yu S., Tresp V (2005), “Dirichlet Enhanced Latent Semantic Analysis”, Advances in Neural Information Processing Systems, 17
75. Zheng B., McLean D C., Lu X (2006), “Identifying biological concepts from a protein-related corpus with a probabilistic topic model”, BMC Bioinformatics, 7(58), pp 1–10
76. Zhu J., Ahmed A., Xing E (2009), “MedLDA: Maximum Margin Supervised Topic Models for Regression and Classification”, In Proceedings of the 26th International Conference on Machine Learning -ICML

ABSTRACT

Topic Modeling has been an attractive research direction in Artificial Intelligence. Many important applications of Topic Modeling have been reported, including automatically indexing the documents of a given corpus, finding topical communities in collections of scientific papers, supporting the spam filtering task, revealing the development of science over the years, discovering hot and cold topics in the research community, discovering different groups and their corresponding roles solely from text corpora, and statistically explaining the inference process in human memory. Hence, this thesis is devoted to surveying the modern development of the field. It attempts to reveal the most appealing characteristics of the field and the main directions from which new topic models were or will be developed. The author also attempts to reveal the advantages and disadvantages of each considered model. Possible extensions to some topic models are discussed in detail after presenting them. Finally, the author reports some experiments with some topic models on a collection of papers from the NIPS conferences and a collection of articles from VnExpress, an electronic Vietnamese newspaper.

Keywords: topic modeling, topic models, semantic representation, graphical models, knowledge discovery

TÓM TẮT (Abstract in Vietnamese)

Topic Modeling is an attractive and active research direction in the Artificial Intelligence community. Many important applications have been discovered, for example automatically indexing data repositories, supporting spam filtering, visualizing the development of science over many years, discovering active research topics, discovering real-world entities and their roles from text corpora, and explaining the human learning process. This thesis therefore focuses on surveying the development of the field as fully as possible. The author tries to describe the salient characteristics and the development directions of the models. The author also describes some representative topic models and discusses the advantages and disadvantages of each model. Some possible extensions of the models are discussed in detail after they are described. Finally, the thesis describes the author's experimental results with some models on two data sets: a collection of 1,740 papers from the NIPS conferences and a collection of 12,525 news articles from the electronic newspaper VnExpress.

Keywords: topic modeling, topic models, semantic representation, graphical models, knowledge discovery
