LNCS 10366

Lei Chen · Christian S. Jensen · Cyrus Shahabi · Xiaochun Yang · Xiang Lian (Eds.)

Web and Big Data
First International Joint Conference, APWeb-WAIM 2017
Beijing, China, July 7–9, 2017
Proceedings, Part I

Lecture Notes in Computer Science, Volume 10366
Commenced publication in 1973
Founding and former series editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board:
David Hutchison, Lancaster University, Lancaster, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Friedemann Mattern, ETH Zurich, Zurich, Switzerland
John C. Mitchell, Stanford University, Stanford, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Dortmund, Germany
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbrücken, Germany

More information about this series at http://www.springer.com/series/7409
Editors:
Lei Chen, Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China
Christian S. Jensen, Computer Science, Aarhus University, Aarhus N, Denmark
Cyrus Shahabi, Computer Science, University of Southern California, Los Angeles, CA, USA
Xiaochun Yang, Northeastern University, Shenyang, China
Xiang Lian, Kent State University, Kent, OH, USA

ISSN 0302-9743; ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-319-63578-1; ISBN 978-3-319-63579-8 (eBook)
DOI 10.1007/978-3-319-63579-8
Library of Congress Control Number: 2017947034
LNCS Sublibrary: SL3 – Information Systems and Applications, incl. Internet/Web, and HCI

© Springer International Publishing AG 2017

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper. This Springer imprint is published by Springer Nature. The registered company is Springer International Publishing AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Preface

This volume (LNCS 10366) and its companion volume (LNCS 10367) contain the proceedings of the first Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint Conference on Web and Big Data, called APWeb-WAIM. This new joint conference aims to attract participants from different scientific communities as well as from industry, and not merely from the Asia-Pacific region, but also from other continents. The objective is to enable the sharing and exchange of ideas, experiences, and results in the areas of the World Wide Web and big data, thus covering Web technologies, database systems, information management, software engineering, and big data.

The first APWeb-WAIM conference was held in Beijing during July 7–9, 2017. As a new Asia-Pacific flagship conference focusing on research, development, and applications in relation to Web information management, APWeb-WAIM builds on the successes of APWeb and WAIM: APWeb was previously held in Beijing (1998), Hong Kong (1999), Xi’an (2000), Changsha (2001), Xi’an (2003), Hangzhou (2004), Shanghai (2005), Harbin (2006), Huangshan (2007), Shenyang (2008), Suzhou (2009), Busan (2010), Beijing (2011), Kunming (2012), Sydney (2013), Changsha (2014), Guangzhou (2015), and Suzhou (2016); and WAIM was held in Shanghai (2000), Xi’an (2001), Beijing (2002), Chengdu (2003), Dalian (2004), Hangzhou (2005), Hong Kong (2006), Huangshan (2007), Zhangjiajie (2008), Suzhou (2009), Jiuzhaigou (2010), Wuhan (2011), Harbin (2012), Beidaihe (2013), Macau (2014), Qingdao (2015), and Nanchang (2016). With the fast development of Web-related technologies, we expect that APWeb-WAIM will become an increasingly popular forum that brings together
outstanding researchers and developers in the field of Web and big data from around the world. The high-quality program documented in these proceedings would not have been possible without the authors who chose APWeb-WAIM for disseminating their findings. Out of 240 submissions to the research track and 19 to the demonstration track, the conference accepted 44 regular research papers (18%), 32 short research papers, and ten demonstrations. The contributed papers address a wide range of topics, such as spatial data processing and data quality, graph data processing, data mining, privacy and semantic analysis, text and log data management, social networks, data streams, query processing and optimization, topic modeling, machine learning, recommender systems, and distributed data processing. The technical program also included keynotes by Profs. Sihem Amer-Yahia (National Center for Scientific Research, CNRS, France), Masaru Kitsuregawa (National Institute of Informatics, NII, Japan), and Mohamed Mokbel (University of Minnesota, Twin Cities, USA) as well as tutorials by Prof. Reynold Cheng (The University of Hong Kong, SAR China), Prof. Guoliang Li (Tsinghua University, China), Prof. Arijit Khan (Nanyang Technological University, Singapore), and Prof. Yu Zheng (Microsoft Research Asia, China). We are grateful to these distinguished scientists for their invaluable contributions to the conference program.

As a new joint conference, teamwork is particularly important for the success of APWeb-WAIM. We are deeply thankful to the Program Committee members and the external reviewers for lending their time and expertise to the conference. Special thanks go to the local Organizing Committee led by Jun He, Yongxin Tong, and Shimin Chen. Thanks also go to the workshop co-chairs (Matthias Renz, Shaoxu Song, and Yang-Sae Moon), demo co-chairs (Sebastian Link, Shuo Shang, and Yoshiharu Ishikawa), industry co-chairs (Chen Wang and Weining Qian), tutorial co-chairs (Andreas Züfle and Muhammad Aamir
Cheema), sponsorship chair (Junjie Yao), proceedings co-chairs (Xiang Lian and Xiaochun Yang), and publicity co-chairs (Hongzhi Yin, Lei Zou, and Ce Zhang). Their efforts were essential to the success of the conference. Last but not least, we wish to express our gratitude to the webmaster (Zhao Cao) for all the hard work and to our sponsors who generously supported the smooth running of the conference.

We hope you enjoy the exciting program of APWeb-WAIM 2017 as documented in these proceedings.

June 2017
Xiaoyong Du, Beng Chin Ooi, M. Tamer Özsu, Bin Cui, Lei Chen, Christian S. Jensen, Cyrus Shahabi

Organization

Organizing Committee

General Co-chairs:
Xiaoyong Du, Renmin University of China, China
Beng Chin Ooi, National University of Singapore, Singapore
M. Tamer Özsu, University of Waterloo, Canada

Program Co-chairs:
Lei Chen, Hong Kong University of Science and Technology, China
Christian S. Jensen, Aalborg University, Denmark
Cyrus Shahabi, University of Southern California, USA

Workshop Co-chairs:
Matthias Renz, George Mason University, USA
Shaoxu Song, Tsinghua University, China
Yang-Sae Moon, Kangwon National University, South Korea

Demo Co-chairs:
Sebastian Link, The University of Auckland, New Zealand
Shuo Shang, King Abdullah University of Science and Technology, Saudi Arabia
Yoshiharu Ishikawa, Nagoya University, Japan

Industrial Co-chairs:
Chen Wang, Innovation Center for Beijing Industrial Big Data, China
Weining Qian, East China Normal University, China

Proceedings Co-chairs:
Xiang Lian, Kent State University, USA
Xiaochun Yang, Northeastern University, China

Tutorial Co-chairs:
Andreas Züfle, George Mason University, USA
Muhammad Aamir Cheema, Monash University, Australia

ACM SIGMOD China Lectures Co-chairs:
Guoliang Li, Tsinghua University, China
Hongzhi Wang, Harbin Institute of Technology, China

Publicity Co-chairs:
Hongzhi Yin, The University of Queensland, Australia
Lei Zou, Peking University, China
Ce Zhang, Eidgenössische Technische Hochschule (ETH) Zurich, Switzerland

Local
Organization Co-chairs:
Jun He, Renmin University of China, China
Yongxin Tong, Beihang University, China
Shimin Chen, Chinese Academy of Sciences, China

Sponsorship Chair:
Junjie Yao, East China Normal University, China

Web Chair:
Zhao Cao, Beijing Institute of Technology, China

Steering Committee Liaison:
Yanchun Zhang, Victoria University, Australia

Senior Program Committee:
Dieter Pfoser, George Mason University, USA
Ilaria Bartolini, University of Bologna, Italy
Jianliang Xu, Hong Kong Baptist University, SAR China
Mario Nascimento, University of Alberta, Canada
Matthias Renz, George Mason University, USA
Mohamed Mokbel, University of Minnesota, USA
Ralf Hartmut Güting, Fernuniversität in Hagen, Germany
Seungwon Hwang, Yonsei University, South Korea
Sourav S. Bhowmick, Nanyang Technological University, Singapore
Tingjian Ge, University of Massachusetts Lowell, USA
Vincent Oria, New Jersey Institute of Technology, USA
Walid Aref, Purdue University, USA
Wook-Shin Han, Pohang University of Science and Technology, Korea
Yoshiharu Ishikawa, Nagoya University, Japan

Program Committee:
Alex Delis, University of Athens, Greece
Alex Thomo, University of Victoria, Canada
Aviv Segev, Korea Advanced Institute of Science and Technology, South Korea
Baoning Niu, Taiyuan University of Technology, China
Bin Cui, Peking University, China
Bin Yang, Aalborg University, Denmark
Carson Leung, University of Manitoba, Canada
Chih-Hua Tai, National Taipei University, China
Cuiping Li, Renmin University of China, China
Daniele Riboni, University of Cagliari, Italy
Defu Lian, University of Electronic Science and Technology of China, China
Dejing Dou, University of Oregon, USA
Demetris Zeinalipour, Max Planck Institute for Informatics, Germany, and University of Cyprus, Cyprus
Dhaval Patel, Indian Institute of Technology Roorkee, India
Dimitris Sacharidis, Technische Universität Wien, Vienna, Austria
Fei Chiang, McMaster University, Canada
Ganzhao Yuan, South China University of Technology, China
Giovanna Guerrini, Università di Genova, Italy
Guoliang Li, Tsinghua University, China
Guoqiong Liao, Jiangxi University of Finance and Economics, China
Hailong Sun, Beihang University, China
Han Su, University of Southern California, USA
Hiroaki Ohshima, Kyoto University, Japan
Hong Chen, Renmin University of China, China
Hongyan Liu, Tsinghua University, China
Hongzhi Wang, Harbin Institute of Technology, China
Hongzhi Yin, The University of Queensland, Australia
Hua Li, Aalborg University, Denmark
Hua Lu, Aalborg University, Denmark
Hua Wang, Victoria University, Melbourne, Australia
Hua Yuan, University of Electronic Science and Technology of China, China
Iulian Sandu Popa, Inria and PRiSM Lab, University of Versailles Saint-Quentin, France
James Cheng, Chinese University of Hong Kong, SAR China
Jeffrey Xu Yu, Chinese University of Hong Kong, SAR China
Jiaheng Lu, University of Helsinki, Finland
Jiajun Liu, Renmin University of China, China
Jialong Han, Nanyang Technological University, Singapore
Jian Yin, Zhongshan University, China
Jianliang Xu, Hong Kong Baptist University, SAR China
Jianmin Wang, Tsinghua University, China
Jiannan Wang, Simon Fraser University, Canada
Jianting Zhang, City College of New York, USA
Jianzhong Qi, University of Melbourne, Australia

Using Word Triangles in Topic Discovery for Short Texts

model (WTTM), which finds the word triangles in the word co-occurrence network of the corpus. By contrast, WTTM weighs the relationship of each pattern through the word triangles and excludes some weakly related ones. We conducted experiments, and the results show that WTTM performs better than the baseline models. Considering the outstanding performance of WTTM, we can say that WTTM is a good choice for topic inference on short texts. However, there are still many improvements we can make to perfect the model: how to exploit the weight of the triangles is a natural next step, and applying our method to real-world situations is also a promising direction to explore.

Acknowledgments. This
paper is supported by the National Key Research and Development Program of China (Grant No. 2016YFB1001102) and the National Natural Science Foundation of China (Grant Nos. 61375069, 61403156, 61502227). This research is also supported by the Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing University.

References

1. Blei, D.M.: Probabilistic topic models. Commun. ACM 55(4), 77–84 (2012)
2. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3(Jan), 993–1022 (2003)
3. Durak, N., Pinar, A., Kolda, T.G., Seshadhri, C.: Degree relations of triangles in real-world networks and graph models. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pp. 1712–1716. ACM (2012)
4. Griffiths, T.L., Steyvers, M.: Finding scientific topics. Proc. Natl. Acad. Sci. 101(suppl. 1), 5228–5235 (2004)
5. Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50–57. ACM (1999)
6. Hong, L., Davison, B.D.: Empirical study of topic modeling in Twitter. In: Proceedings of the First Workshop on Social Media Analytics, pp. 80–88. ACM (2010)
7. Lu, H.Y., Xie, L.Y., Kang, N., Wang, C.J., Xie, J.Y.: Don't forget the quantifiable relationship between words: using recurrent neural network for short text topic discovery. In: Thirty-First AAAI Conference on Artificial Intelligence (2017)
8. Mimno, D., Wallach, H.M., Talley, E., Leenders, M., McCallum, A.: Optimizing semantic coherence in topic models. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 262–272. Association for Computational Linguistics (2011)
9. Weng, J., Lim, E.P., Jiang, J., He, Q.: TwitterRank: finding topic-sensitive influential twitterers. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining, pp. 261–270. ACM (2010)
10. Yan, X., Guo, J., Lan, Y., Cheng, X.: A biterm topic model for
short texts. In: Proceedings of the 22nd International Conference on World Wide Web, pp. 1445–1456. ACM (2013)
11. Zuo, Y., Zhao, J., Xu, K.: Word network topic model: a simple but general solution for short and imbalanced texts. Knowl. Inf. Syst. 48(2), 379–398 (2016)

Context-Aware Topic Modeling for Content Tracking in Social Media

Jinjing Zhang (1), Jing Wang (2), and Li Li (1)
(1) School of Computer and Information Science, Southwest University, Chongqing, China; 1476509610@qq.com, lily@swu.edu.cn
(2) Economy and Technology Developing District, Zhengzhou, Henan, China; 382876766@qq.com

Abstract. Content in social media is difficult to analyse because of its short and informal nature. Fortunately, some social media data, such as tweets, carry rich hashtag information, which can help identify meaningful topics. More importantly, hashtags can better express the context of a tweet. To exploit the effect of hashtags via topic variables, in this paper we propose a context-aware topic model, named hashtag-supervised Topic over Time (hsToT), that detects and tracks the evolution of content in social media by integrating hashtag and time information. In hsToT, a document is generated jointly by its words and hashtags (the hashtags are treated as topic indicators of the tweet). Experiments on real data show that hsToT captures hashtag distributions over topics and topic changes over time simultaneously. The model can detect crucial information and successfully track meaningful content and topics.

Keywords: Topic model · Content evolution · Topic over Time · Social media

1 Introduction

In recent years, conventional topic models such as LDA [1] and PLSA [2] have been successfully applied to mining topics from a diverse range of document genres. However, for tweet data they often fail to find high-quality underlying topics, because of the short and informal nature of tweets. Fortunately, several types of metadata can help identify the contents of
tweets, such as the associated short URL, picture, and #hashtag [3,4]. Among these metadata types, hashtags play a crucial role in content analysis. Hashtags can not only express the context of a tweet to the fullest, but can also act as weakly supervised information when sampling topics for certain tweets. Meanwhile, hashtags enrich the expressiveness of topics. Motivated by the above, we propose a context-aware topic model, named hashtag-supervised

© Springer International Publishing AG 2017. L. Chen et al. (Eds.): APWeb-WAIM 2017, Part I, LNCS 10366, pp. 650–658, 2017. DOI: 10.1007/978-3-319-63579-8_49

Topic over Time (hsToT), to identify and track the evolution of content in social media. This model extends classical LDA [1] by integrating hashtag and time information. In hsToT, the distribution of hashtags over topics directly affects the topic sampling for a document. A topic is defined as a set of words that are highly correlated. In addition, to capture topic evolution over time, we model each topic with a multinomial distribution over timestamps, defined over a time span covering all the data.

The remainder of the paper is organized as follows. Section 2 reviews several representative works. Section 3 presents our approach in detail. We describe the data preparation and discuss the experimental results in Sect. 4. The final section concludes the work.

2 Related Work

Topic Models in Social Media. As a powerful text-mining tool, topic models have been successfully applied to text analysis. Unfortunately, the traditional topic models LDA [1] and PLSI [2] do not work well with the messy form of data in Twitter. To overcome the noise in tweets, [5] merges a user's tweets into a single document; however, this ignores content detection within topics. Thus, [6] proposes a probabilistic model in which a topic depends not only on a user's own preference but also on the preferences of related users. TCAM [7,8]
focuses on analyzing user behaviors by combining users' intrinsic interests and temporal context. Beyond user features, mLDA [3] utilizes multiple contexts, such as hashtags and time, to discover consensus topics. These works, however, focus far more on user interests than on content mining. Besides, some works take advantage of semi-structured information, such as TWDA [9] and MA-LDA [10], but these methods ignore the dynamic nature of content in social media.

Topic over Time. To capture how topics change over time, qualitative evolution and quantitative evolution are the two main analysis patterns. Qualitative evolution focuses on aspects of a topic such as its word distribution, inter-topic correlations, and vocabulary; [11] uses a state-space model to model the time variation, and DTM [12] and TTM [13] are two typical models. However, time must be discretized, and the length of the time intervals must be determined. Quantitative evolution focuses on the amount of data related to a topic at each timestamp and models the time variation as an attribute of topics. The pioneering works in this line are TOT [14], COT [4], and [15], where each topic is associated with a beta distribution over time. In this paper, we prefer quantitative evolution and replace the beta distribution with a Dirichlet distribution, so that the parameters can be estimated simply by Gibbs sampling.

3 Modeling Content Evolution in Social Media

In this section, we first introduce some preliminaries and interpret the parameters used in the method. We then explain the details of our hashtag-based topic modeling solution, namely hsToT.

3.1 Preliminaries

As the most popular topic model, Latent Dirichlet Allocation (LDA) has seen numerous successful extensions. hsToT is an LDA-based probabilistic generative model. Different from LDA, hsToT includes two additional variables, namely hashtags and timestamps. We can discover the content through a cluster of hashtags
that frequently occur with a topic; the content over time can then be observed through the topic distribution over timestamps.

Formally, we define a set of tweets as W = {d}, d = 1, ..., M. Each tweet is regarded as a document d and has a timestamp t. Suppose that document d is associated with a word sequence w_d = {w_1, w_2, ..., w_i, ..., w_N} and a hashtag sequence h_d = {h_1, h_2, ..., h_i, ..., h_H}, where N and H are the numbers of words and hashtags in document d. The remaining notation used in this paper is listed in Table 1. For a given corpus, K and T are fixed constants (T is set manually), while N and H vary across documents.

Table 1. Notation in hsToT
M, K — numbers of documents and topics, respectively
N, H, T — numbers of words, hashtags, and timestamps, respectively
z, w, h, t, d — topic, word, hashtag, timestamp, and document, respectively
θ — multinomial distribution over topics for a hashtag
φ — multinomial distribution over words for a topic
ψ — multinomial distribution over timestamps for a topic
α, β, μ — Dirichlet prior parameters for θ, φ, and ψ, respectively

3.2 Hashtag-Supervised Topic over Time

In this subsection, we describe hashtag-supervised Topic over Time (hsToT), which directly uncovers the latent relationships among topics, hashtags, and time. In hsToT, hashtags act as weakly supervised information during topic sampling. Figure 1 shows the graphical model of hsToT. Each topic is represented by a distribution over words φ with Dirichlet prior β; hsToT also includes two further per-topic distributions, over hashtags and over timestamps. hsToT does not directly sample a distribution over topics for a document d; instead, it samples a hashtag's distribution over topics from the K × H matrix as the topic distribution of the document. Furthermore, the time feature is first discretized, and each tweet is annotated with a discrete timestamp label (e.g., day, month, or year). The time modality is captured by the variable t, and consequently topic evolution over time is
obtained using a multinomial distribution. In particular, each hashtag is characterized by a distribution over topics θ with Dirichlet prior α. We allocate a topic assignment z_i and a hashtag assignment h_i to each word w_i in document d; thus, in hsToT, each word is associated with a "hashtag-topic" assignment pair and a "topic-timestamp" pair. The generative process of hsToT is given as follows, where z_i, h_i, and t_i denote the topic, hashtag, and timestamp associated with word w_i:

Fig. 1. Graphical model representation of hsToT

1. For each hashtag h = 1 : H, draw the mixture of topics θ_h ∼ Dir(α).
2. For each topic z = 1 : K, draw the mixture of words φ_z ∼ Dir(β).
3. For each topic z = 1 : K, draw the mixture of timestamps ψ_z ∼ Dir(μ).
4. For each document d = 1 : M, draw its word length N and take its hashtag set h_d as given:
   (a) For each word w_i, i = 1 : N_d:
       i. Draw a hashtag h_i ∼ Uniform(h_d).
       ii. Draw a topic z_i ∼ Mult(θ_{h_i}).
       iii. Draw a word w_i ∼ Mult(φ_{z_i}).
       iv. Draw a timestamp t_i ∼ Mult(ψ_{z_i}).

In hsToT there are three posterior distributions: the hashtag-topic distribution θ, the topic-word distribution φ, and the topic-timestamp distribution ψ. We assume that the "topic-word" and "hashtag-topic" distributions are conditionally independent. To efficiently estimate the posterior distributions, we employ Gibbs sampling [16]. The joint probability of words, topics, hashtags, and timestamps is given in Eq. 1:

$$p(\mathbf{w}, \mathbf{h}, \mathbf{t}, \mathbf{z} \mid \alpha, \beta, \mu, h_d) = p(\mathbf{w} \mid \mathbf{z}, \beta)\, p(\mathbf{h} \mid h_d)\, p(\mathbf{t} \mid \mathbf{z}, \mu)\, p(\mathbf{z} \mid h_d, \alpha) \quad (1)$$

To infer the hidden variables, we compute their posterior distribution. The likelihood of a document d is given in Eq. 2:

$$p(w_d \mid \theta, \varphi, \psi, h_d) = \prod_{i=1}^{N_d} p(w_i \mid \theta, \varphi, \psi, h_d) = \prod_{i=1}^{N_d} \sum_{j=1}^{L_d} \sum_{k=1}^{K} \sum_{s=1}^{T} p(w_i, z_i = k, h_i = j, t_i = s \mid \theta, \varphi, \psi, h_d)$$
$$= \prod_{i=1}^{N_d} \sum_{j=1}^{L_d} \sum_{k=1}^{K} \sum_{s=1}^{T} p(w_i \mid z_i = k, \varphi)\, p(z_i = k \mid h_i = j, \theta)\, p(t_i = s \mid z_i = k, \psi)\, p_{j h_i} = \prod_{i=1}^{N_d} \sum_{j=1}^{L_d} \sum_{k=1}^{K} \sum_{s=1}^{T} \varphi_{w_i,k}\, \theta_{k,j}\, \psi_{s,k}\, p_{j h_i} \quad (2)$$
where $p_{j h_i}$ denotes the probability of $h_i = j$ when sampling a hashtag $h_i$ from $h_d$. The generating probability of the corpus is:

$$p(\mathbf{W} \mid \theta, \varphi, \psi, \mathbf{h}) = \prod_{d=1}^{M} p(w_d \mid \theta, \varphi, \psi, h_d) \quad (3)$$

The posterior distributions in hsToT are estimated via the collapsed update in Eq. 4:

$$p(z_i = k, h_i = j \mid \mathbf{w}, \mathbf{h}_{-i}, \mathbf{t}, \mathbf{z}_{-i}) \propto \frac{n^{k}_{w_i,-i} + \beta}{n^{k}_{\cdot,-i} + V\beta} \times \frac{n^{j}_{k,-i} + \alpha}{n^{j}_{\cdot,-i} + K\alpha} \times \frac{n^{k}_{t_i,-i} + \mu}{n^{k}_{\cdot,-i} + T\mu} \quad (4)$$

where the subscript $-i$ excludes the assignment of the current word in the current document d. In Eq. 4, $n^{k}_{w_i,-i}$ is the number of times word $w_i$ is assigned to topic k, and $n^{k}_{\cdot,-i}$ in the first factor is the total number of word tokens assigned to topic k; $n^{j}_{k,-i}$ is the number of words associated with hashtag j and assigned to topic k, and $n^{j}_{\cdot,-i}$ is the total number of words associated with hashtag j; $n^{k}_{t_i,-i}$ is the number of tokens with timestamp $t_i$ assigned to topic k, and $n^{k}_{\cdot,-i}$ in the last factor is again the total count of tokens assigned to topic k. Finally, when the sampling process converges, we obtain φ, θ, and ψ as:

$$\varphi_{w,k} = \frac{n^{k}_{w} + \beta}{n^{k}_{\cdot} + V\beta}, \qquad \theta_{k,j} = \frac{n^{j}_{k} + \alpha}{n^{j}_{\cdot} + K\alpha}, \qquad \psi_{s,k} = \frac{n^{k}_{t} + \mu}{n^{k}_{\cdot} + T\mu} \quad (5)$$

In the generative process of hsToT, the time modality takes part in topic discovery. However, this may harm the homogeneity of topics, because the time modality is assumed to carry the same "weight" as the word modality, which is not the case in practice. To address this issue, we adopt the same strategy as in TOT [14], where a balancing hyperparameter is introduced to balance the word and time contributions in topic discovery. Naturally, we set this hyperparameter to the inverse of the number of words $n_d$.

4 Experiments

The experiments evaluate both topic detection and topic evolution over time. We take three topic models as baselines: TOT [14], COT [4], and hgToT, another model of ours. (hgToT is generally similar to COT in its usage of hashtags, but its time modeling replaces the beta distribution with a multinomial distribution.)
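To make the inference concrete, the following is a minimal sketch of the collapsed Gibbs update (Eq. 4) and the point estimates (Eq. 5) on a toy corpus. The corpus, the sizes V, H, T, K, and the hyperparameter values here are illustrative assumptions, not the paper's experimental setup; since the hashtag prior Uniform(h_d) is constant over a document's hashtag set, it cancels in the normalized joint (hashtag, topic) resampling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: each document is (word ids, hashtag ids, timestamp id).
docs = [
    ([0, 1, 2, 1], [0], 0),
    ([2, 3, 4, 3], [1], 1),
    ([0, 1, 4, 2], [0, 1], 1),
]
V, H, T, K = 5, 2, 2, 2            # vocabulary, hashtags, timestamps, topics
alpha, beta, mu = 0.5, 0.1, 0.1    # Dirichlet priors for theta, phi, psi

# Count matrices for the collapsed sampler.
n_kw = np.zeros((K, V))  # word w assigned to topic k
n_jk = np.zeros((H, K))  # words with hashtag j assigned to topic k
n_kt = np.zeros((K, T))  # tokens with timestamp t assigned to topic k

# Random initialisation of the (hashtag, topic) assignment of every token.
assign = []
for w_ids, h_set, t in docs:
    a = []
    for w in w_ids:
        j, k = rng.choice(h_set), rng.integers(K)
        n_kw[k, w] += 1; n_jk[j, k] += 1; n_kt[k, t] += 1
        a.append([j, k])
    assign.append(a)

def gibbs_sweep():
    """One pass of the collapsed update (Eq. 4): jointly resample the
    (hashtag, topic) pair of every word token, excluding its own counts."""
    for d, (w_ids, h_set, t) in enumerate(docs):
        for i, w in enumerate(w_ids):
            j, k = assign[d][i]
            n_kw[k, w] -= 1; n_jk[j, k] -= 1; n_kt[k, t] -= 1
            # Unnormalised probability of every candidate (j', k') pair.
            p = np.zeros((len(h_set), K))
            for jj, j2 in enumerate(h_set):
                p[jj] = ((n_kw[:, w] + beta) / (n_kw.sum(1) + V * beta)
                         * (n_jk[j2] + alpha) / (n_jk[j2].sum() + K * alpha)
                         * (n_kt[:, t] + mu) / (n_kt.sum(1) + T * mu))
            flat = p.ravel() / p.sum()
            idx = rng.choice(len(flat), p=flat)
            j, k = h_set[idx // K], idx % K
            n_kw[k, w] += 1; n_jk[j, k] += 1; n_kt[k, t] += 1
            assign[d][i] = [j, k]

for _ in range(50):
    gibbs_sweep()

# Point estimates of the posterior distributions (Eq. 5).
phi = (n_kw + beta) / (n_kw.sum(1, keepdims=True) + V * beta)    # topic-word
theta = (n_jk + alpha) / (n_jk.sum(1, keepdims=True) + K * alpha)  # hashtag-topic
psi = (n_kt + mu) / (n_kt.sum(1, keepdims=True) + T * mu)        # topic-timestamp
```

Note that theta is stored here as theta[j, k] (one topic distribution per hashtag), matching the paper's K × H hashtag-topic matrix read column-wise.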
TOT extends LDA by adding a beta distribution over timestamps for each topic; COT extends TOT by adding a multinomial distribution over hashtags for each topic.

4.1 Data Preparation

The experiments are conducted on a Twitter data set named "TREC2011" (http://trec.nist.gov/data/microblog2011.html). The original data contains nearly 16 million tweets posted from January 23rd to February 8th, 2011. Each tweet includes a user id and a timestamp. The preprocessing of the raw data is similar to the steps in [17]. The properties of the dataset are given in Table 2. To guarantee the convergence of Gibbs sampling, all results were obtained after 1000 iterations. Timestamps are discretized by day. The hyperparameters of the generative models (hsToT, hgToT, TOT) are set to 50/K for α, and to 0.04, 0.04, and 0.01 for β, γ, and μ, respectively [11].

Table 2. Dataset properties
Tweets: 304,480
Unique words: 12,160
Hashtags: 98,649
Average words per tweet: 5.11
Average hashtags per tweet: 1.42

4.2 Evaluation of Topic Detection

To assess the effectiveness of topic models, a typical metric is perplexity on a held-out test set [17], which reflects document modeling performance by jointly evaluating p(z|d) and p(w|z). Another automatic evaluation metric, the coherence score [18], measures the quality of topics from the perspective of topic visualization and semantic coherence. In this paper, we choose perplexity and the coherence score as evaluation criteria.

Perplexity. Perplexity indicates the uncertainty in predicting a single word; the lower the perplexity score, the better the performance. We compute this metric following [1]. To accommodate the different usage of hashtags across methods, the computation of $p(w_d)$ differs: for TOT it is the same as in [1], while for hsToT, hgToT, and COT it combines word and hashtag probabilities, $p(w_d) = \prod_{i=1}^{N_d} p(w_i) \prod_{i=1}^{L_d} p(h_i)$. In this step, we hold out 10% of
data for testing and train the methods on the remaining 90%. Figure 2 shows the perplexity results for topic numbers k = 20, 30, 40, 50, and 60.

Fig. 2. Perplexity results with different topic numbers K

From Fig. 2, hsToT clearly outperforms the other methods. This indicates that document modeling can indeed be improved by exploiting hashtags, especially by treating them as weakly supervised information. In addition, for hsToT the perplexity decreases gradually as the number of topics grows and then stabilizes for k ≥ 50, while TOT runs into over-fitting; our method is thus more stable. Based on this observation, the number of topics K is fixed at 50 in the remaining experiments.

Coherence Score. To intuitively investigate the quality of topics, we first analyze them from a visualization perspective. For each topic, we take the top words or hashtags, ordered by p(w|z) or p(h|z), as its semantic representation. Observing the 50 topics, there are two major kinds: "common topics", which are related to users' daily lives, and "time-sensitive topics", which correspond to emergencies or hot news events. Overall, compared with TOT, hgToT and hsToT discover more meaningful hashtags and words highly related to a topic. Besides, hsToT can detect content within a topic that TOT cannot. Table 3 lists an example of a common topic, "EGYPT", learned by the hgToT, hsToT, and TOT methods.

Table 3. A sample of the semantic representation of topic "EGYPT" (hashtag-word pairs for hgToT and hsToT; single terms for TOT)
hgToT: #egypt, egypt; #election, obama; #25-Jan, #egypt; #tcot, mubarak; #sotu, egyptian
hsToT: #egypt, egypt; #mubarak, people; #election, obama; #turbulence, mubarak; #news, egyptian
TOT: obama; mubarak; #mubarak; turbulence; egypt

For TOT, both words and hashtags occur in the "topic-word" distribution, while in hgToT and hsToT, words and hashtags occur only in the "topic-word" and "topic-hashtag" distributions, respectively. In addition, topic "EGYPT" shows that hsToT outperforms TOT in
discovering meaningful hashtags, i.e., the semantics of a common topic. For example, TOT only discovers one hashtag, "#mubarak", whereas hsToT discovers several more meaningful hashtags.

To quantitatively evaluate the topic quality of all tested methods, we further utilize an automated metric, the coherence score, based on the observation that words belonging to a single concept tend to co-occur within the same documents [19]. A larger coherence score means the topics are more coherent. Given a topic z and its top n words $V^{(z)} = (v_1^{(z)}, \ldots, v_n^{(z)})$ ordered by p(w|z), the coherence score is defined as:

$$C(z; V^{(z)}) = \sum_{t=2}^{n} \sum_{l=1}^{t-1} \log \frac{D(v_t^{(z)}, v_l^{(z)}) + 1}{D(v_l^{(z)})} \quad (6)$$

where D(v) is the document frequency of word v, and D(v, v') is the number of documents in which words v and v' co-occur. For TOT, the final result is the average coherence score over topics, $\frac{1}{K}\sum_{k} C(z_k; V^{(z_k)})$. For the other three methods, a topic is jointly associated with a multinomial distribution over words and a multinomial distribution over hashtags; we therefore also consider the top n hashtags when computing the coherence score of a topic, and the final average coherence score for these three methods is $\frac{1}{2K}\sum_{k}\big(C(z_k; V^{(z_k)}) + C(z_k; H^{(z_k)})\big)$.

The results are shown in Table 4, where the number of top words ranges up to 20. From Table 4, hsToT achieves the best performance (with p-value
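The coherence computation in Eq. 6 can be sketched as follows; the toy documents and word choices below are illustrative assumptions, not the paper's data.

```python
from collections import Counter
from itertools import combinations
import math

def coherence(top_words, docs):
    """Topic coherence per Eq. 6: sum over ordered pairs (t > l) of the
    top words of log((co-document frequency + 1) / document frequency)."""
    df = Counter()    # D(v): number of documents containing word v
    codf = Counter()  # D(v, v'): number of documents containing both words
    for doc in docs:
        words = set(doc)
        df.update(words)
        codf.update(frozenset(p) for p in combinations(sorted(words), 2))
    score = 0.0
    for t in range(1, len(top_words)):
        for l in range(t):
            vt, vl = top_words[t], top_words[l]
            score += math.log((codf[frozenset((vt, vl))] + 1) / df[vl])
    return score

# Toy corpus of tokenized documents (illustrative).
docs = [
    {"egypt", "mubarak", "protest"},
    {"egypt", "mubarak"},
    {"nfl", "superbowl"},
]
print(coherence(["egypt", "mubarak"], docs))
```

Words that co-occur in many documents push the score toward zero from below, while unrelated top words contribute large negative terms, so more coherent topics score higher.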