Predicting the Popularity of Social Curation

Binh Thanh Kieu, Ryutaro Ichise, and Son Bao Pham

Binh Thanh Kieu · Son Bao Pham
Faculty of Information Technology, University of Engineering and Technology, Vietnam National University, Hanoi, Vietnam
e-mail: binhkt.vnu@gmail.com, sonpb@vnu.edu.vn

Ryutaro Ichise
National Institute of Informatics, Tokyo, Japan
e-mail: ichise@nii.ac.jp

© Springer International Publishing Switzerland 2015. V.-H. Nguyen et al. (eds.), Knowledge and Systems Engineering, Advances in Intelligent Systems and Computing 326, DOI: 10.1007/978-3-319-11680-8_33

Abstract The amount and variety of social media content such as status updates, images, movies, and music are increasing rapidly. Accordingly, social curation services are emerging as a new way to connect, select, and organize information on a massive scale. One noticeable feature of social curation services is that they are loosely supervised: the content that users create in the service is manually collected, selected, and maintained. A large proportion of this content is arbitrarily created by inexperienced users. In this paper, we look into social curation, particularly the Storify website¹. It is the most popular social curation service for creating stories that draw on various domains such as Twitter, Flickr, and YouTube. We propose a machine learning method with feature extraction to filter this content and to predict the popularity of social curation data.

¹ http://www.storify.com/

Keywords: curation, social curation, social network service, prediction, popularity

1 Introduction

Along with the rapid growth of the Internet, social networks are attracting more and more users, young people in particular. The study of social networks is therefore receiving increasing attention. Social network services such as Facebook, Myspace, and Twitter have become viable sources of information for many online users. These websites are increasingly used for communicating breaking news, sharing eyewitness accounts, and organizing groups of people. At the most basic level, a curation service offers the ability to manually collect, select, and maintain this social media information. This is very different from other social information sources, and we can utilize this characteristic for efficient content mining.

The emergence of Web 2.0 and online social networking services, such as Digg, YouTube, Facebook, and Twitter, has changed how users generate and consume online content. For example, YouTube, well known for its fast-growing user-generated content, reports 100 hours' worth of video uploaded every minute [22]. Online social networking services, augmented with support for multimedia content, sharing, and commenting on other users' content, constitute a significant part of the web experience of Internet users. The question is how users find interesting content, or how certain content rises in popularity. If we can answer these questions, we can predict which content is most likely to become popular and filter out the rest. Moreover, once unpopular content that gets little attention is filtered out, the remaining good content can be used to build an automatic system for curating social content.

However, predicting the popularity of content is a difficult task for many reasons. Among these, the effects of external phenomena (e.g., media, natural, and geo-political) are difficult to incorporate into models [16], and cascades of information are difficult to forecast [3]. Finally, the underlying contexts, such as locality, relevance to users, resonance, and impact, are not easy to decipher [2].

The rest of the paper is organized as follows. In the second section, we explain the social curation service, our target data source, and details of the dataset specifications. In the third section, we review related work. The fourth section is devoted to the formulation of predicting view counts of a curation list. The fifth section describes experiments and the evaluation of our results.
The last section concludes this paper with a discussion about future work.

2 Social Curation

The word "curate" is defined as selecting, organizing, and looking after the items in a collection or exhibition. The word is derived from the Latin root "curare" ("to cure"), which means "to care". Curation involves assembling, managing, and presenting some type of collection. For example, curators of art galleries and museums research, select, and acquire pieces for their institutions' collections and oversee interpretation, displays, and exhibits. Social curation is the collaborative oversight of collections of web content organized around types of content, on sites such as Pinterest (a site for sharing and organizing images) and Storify (a site for collecting and publishing stories).

2.1 Social Curation Service

Social networks are spaces for dialog and conversation that have grown into ubiquitous information exchanges. Youth today turn to social networks, aggregators, and mobile apps for most of their information instead of singling out specific media for news, politics, personal communication, and leisure. In turn, social networks have provided new functions that help users curate information in meaningful and productive ways. Social curation involves aggregating, organizing, and sharing the content created by others to add context, narrative, and meaning. Artists, changemakers, and organizations use social curation to showcase the full range of conversations around a topic, add more nuance to their own original content, and crowdsource content from their community members.

The rise of social curation can be attributed to three broad trends. Firstly, people are creating a constant stream of social media content, including updates, location check-ins, blog posts, photos, and videos. Secondly, people are using their social networks to filter relevant content by following others who share similar interests. Thirdly, social media platforms are
also curating content by giving curation tools to users (YouTube playlists, Flickr galleries, Amazon lists, Foodspotting guides), using editors and volunteers (YouTube Politics, Tumblr Tags), or using algorithms (YouTube Trends, auto-generated YouTube channels, LinkedIn Today). As a result, a number of niche social curation platforms have emerged to enable people to curate different types of content, including links, photos, sounds, and videos.

We should emphasize that each curation list is a kind of loosely supervised but organized social dataset. This means that social media items in the same curation list are expected to share the same context to a certain degree: a curation list is manually generated to fully convey one idea to the consumer. This characteristic clearly distinguishes curation lists from other social media, which are unorganized in many cases.

2.2 Storify

Storify is the best-known site for telling stories by curating social media. Storify was launched in September 2010, and accounts were invitation-only until April 2011. The site is now open to everyone, and users only need a Twitter account. Storify provides a function to filter out poor content and unreliable sources. If social media changes or misinterprets context, Storify can help curators put it back together again [13]. Storify allows curators to embed dynamic images, text, tweets, and even Facebook status updates, and then knit these all together with background and context provided by the storyteller. It is an engaging way for us to learn how to work out what is true and what is speculation. We have also found that using Twitter has taught us how to look for sources and news, and Storify has helped teach us how to think about and write context and narrative.

Each story is a curation list that shares some characteristics: it is manually collected (bundling a collection of content from diverse sources), manually selected (re-organizing them to give one's own perspective), and manually maintained (publishing
the resulting story for consumers). The Storify data is in the form of lists of Twitter messages. An example of a list is shown in Figure 1. A list of tweets corresponds to what we call a story, which represents a manually filtered and organized bundle. Lists in Storify draw on Twitter as their source. The lists may be created individually in private or collaboratively in public, as determined by the initial curator. In the Storify curation interface, the curator begins the list curation process by looking through his or her Twitter timeline (tweets from users that he or she follows), or by directly searching tweets via relevant words or hashtags. The curator can drag and drop these tweets into a list, reorder them freely, and also add annotations such as a list header and in-place comments.

Fig. 1 Example of a Storify list

We first provide some data statistics to get a feel for the curation data. We collected all the data from 2010 to April 2013, which amounted to 63,419 users and 352,540 stories. This corresponds to a total of 11,283,815 elements from various domains. Table 1 describes the various domains used in the stories. Twitter is the largest domain source with more than 75% of the elements, and Flickr is the smallest specific source with 1.2% of the elements. The statistics of the element types are shown in Table 2.

Table 1 Statistics of curated domains

Domain       Number of Elements   Proportion
Twitter               8,514,006        75.5%
Storify               1,206,794        10.7%
YouTube                 190,611         1.7%
Facebook                169,361         1.5%
Instagram               155,762         1.4%
Flickr                  127,192         1.2%
Others                  920,089           8%

Table 2 Element types

Type        Number   Proportion
Quote    7,715,616        68.4%
Text     1,195,625        10.6%
Image    1,436,673        12.7%
Video      206,265         1.8%
Link       732,096         6.5%

Table 3 Storify action statistics

Action                  Number   Average
Views              642,666,347   1823 per story
Comments                21,306   0.06 per story
Element comments        21,133   0.002 per element
Likes                  206,265   0.12 per story

The five types of elements in stories are quote, text,
image, video, and link. Because Storify users use a huge number of tweets, quote content accounts for a large share of nearly 70%. Media content such as images and videos makes up approximately 15%. Text content is written by the curator to add more information, explain, or link elements. The Storify API provides the four main actions shown in Table 3. The Storify website allows users to comment on each element or on a story as a whole. However, the average numbers of comments, element comments, and similar actions are quite small. Therefore, approaches that rely on user comments and actions are not suitable for this dataset.

3 Related Work

Several studies have investigated social curation as a new source for data mining. Pinterest² is the most popular website for sharing images and video, and the third most popular social network in the US behind Facebook and Twitter. The website is built around the activity of collecting digital images and videos and pinning them to a pin board. Each pin is essentially a visual bookmark, and the pin boards are thematic collections of the bookmarks, where context is added to the collected information. Hall and Zarro described some of the user actions on Pinterest and created a dataset to find the pin content of Pinterest users across a wide variety of subject areas [8] [23].

² http://www.pinterest.com/

Besides sites that curate only images or video, other sites curate status updates, comments, and news sources to write blogs and stories. Storyful³ is a social media news agency established in 2010 with the aim of filtering news, or newsworthy content, from the vast quantities of noisy data on social networks such as Twitter and YouTube. Storyful invests considerable time in the manual curation of content on these networks. Its goal sounds more or less the same as Storify's, but there is one important difference: Storyful aims to deliver content for news organizations, whereas Storify is more of a tool for journalists. It
allows journalists to use its template to write stories that include relevant tweets and Facebook posts without losing the original formatting or links. Journalists can create interactive stories with clear links to original pictures or tweets. Greene et al. proposed a variety of criteria for generating user list recommendations based on content analysis, network analysis, and the "crowdsourcing" of existing user lists [7].

In addition, the Togetter website⁴ is a rapidly growing social curation website in Japan. Togetter averaged more than million user-views per month in 2011. The Togetter curation data mainly exist in the form of lists of Twitter messages. Ishiguro et al. used Togetter data for the automatic understanding and mining of images [11] and created a system [6] that suggests new tweets to increase the curator's productivity and breadth of perspective. Our research examines another social curation website, Storify. The structure of a Storify list is quite similar to that of a Togetter list. The only difference is the language: the common language of Togetter is Japanese, whereas that of Storify is English. However, we are interested in another aspect: the quality of the curation lists made by users.

The problem of predicting the popularity of online content centers on how much attention the content will ultimately receive. Research shows that user attention is allocated in a rather asymmetric way, with most content getting only a few views and downloads, whereas a small fraction receives a significant amount of user attention; thus, filtering this content will help save viewers much time. There are different ways to quantify how much attention content receives. Many researchers use the number of views as the popularity of online content, for example on YouTube (the number of views [20]), Vimeo (the number of views [1]), and Flickr (the number of views [24]). Alternatively, popularity is represented by users' actions, as on Digg (the number of user votes [12]) and Twitter (the number of retweets [10]). Moreover, others
formulate the problem as the change in the number of views that content receives over time.

Predicting the popularity of news articles is a complex and difficult task, and different prediction methods and strategies have been proposed in several recent studies [20] [21]. The common point of all these methods is that they focus on predicting the exact attention that an article will generate in the near future. First, some researchers have studied features that describe the underlying social network of the users and contents that can be leveraged to predict popularity [9] [12] [18] [21]. The authors in [14] [16] [17] [21] studied features that take into account the comments found in blogs to predict popularity. However, few other works forecast a value for the actual popularity of individual content. Lee et al. used survival analysis to evaluate the probability that a given content receives more than some number x of hits [16] [17]. Hong et al. developed a coarse multi-class classifier-based approach to determine whether given Twitter hashtags are retweeted x ≤ (0; 100; 10000; ∞) times [10]. Similarly, Lakkaraju and Ajmera used support vector machines (SVMs) to predict whether a given content falls into a group that attracts x ≤ (10%; 25%; 50%; 75%; 100%) of the attention in a system [15], while Jamali and Rangwala predicted the popularity of content by using an entropy measure [12]. Finally, Szabo and Huberman presented a linear regression model based on the number of views [20]; this method has since been used to build popularity predictors by applying regression to different feature spaces [2] [9] [18] [21].

³ http://www.storyful.com/
⁴ http://www.togetter.com/

In this work, the popularity of social curation is measured by the number of views that the content will receive in the near future. We propose three groups for categorizing the popularity level of social curation. We build a predictor based on a machine learning method, SVM, with feature
selection to classify stories into these groups.

4 Predicting the Popularity of Social Curation

4.1 Problem Formulation

As with ordinary content, the popularity of social curation is defined by the number of user views. We predict how many views stories will receive in the near future. However, it is difficult to predict the exact amount of attention, and people are mostly interested in whether content is popular; thus, instead of predicting the exact number, we cast the task as a multi-class classification problem that predicts the popularity that a curation list will reach after three months, based on the number of views. Although our system cannot predict the exact amount of attention, it helps users distinguish popular content from unpopular content. We divide the number of views into three classes: class 1 (not popular), with fewer than 10 views; class 2 (less popular), with between 10 and 1000 views; and class 3 (very popular), with more than 1000 views.

We used an SVM to classify stories into these classes. LibSVM [4] with a radial basis function (RBF) kernel and default parameters, together with the feature selection tool [5], was used to optimize the result. We extracted two types of features, namely curation features and curator features. Curator features are features of the users who collect and organize elements from some domains and create curation lists. Curation features are features related to the content of the curation lists.

4.2 Features

Social curation lists contain many kinds of information that are useful for classification. For example, if the curation list includes many Twitter contents, the view count of the contents is expected to increase; or, if elements match the context of the curation list, the content will attract much more attention. In this study, as the social curation lists included a large number of Twitter messages, we used features applicable to predicting the
number of retweets and microblogging popularity. We divided the features into the two distinct sets mentioned above: curator features (which are related to the author of the story) and curation features (which encompass various statistics of the content in the story).

4.2.1 Curator Features

The following are the five curator features:

- The number of users who follow the curator of the content
- The number of users whom the curator of the content follows
- The number of stories written by the curator
- The user's language (English or not)
- When the curator of the content started using Storify

These features were selected from the content creator features proposed by Ishiguro et al. [11]. We implemented these features as our baseline system. The number of followers and friends has been consistently shown to be a good indicator of retweetability, whereas the number of stories has not been found to have a significant impact [19]. Our prior analysis also showed that stories written in English are more likely to be viewed, so we used a binary feature indicating whether the user's language is English. The date when a curator started using Storify reflects their experience. Normally, long-time users have more experience and produce more popular stories than new users. We are not aware of any prior work that analyzes the effect of language or date on content popularity.

4.2.2 Curation Features

The following are the seven curation features:

- The number of hashtags
- The number of versions
- The number of embeds
- The story's language (English or not)
- The ratio of popular tweet elements to total elements (tweets with more than 100 retweets)
- The ratio of popular image and video elements to total elements (images and videos with more than 1000 views)
- The total number of elements

As a large proportion of the elements in a curation list come from the Twitter domain, hashtags are an important feature for predicting popularity. One paper showed that hashtags, URLs, and mentions have a
high correlation with the popularity of Twitter messages [19]. Although the Storify API provides the hashtags, URLs, and mentions of each story, URLs and mentions have an insignificant impact on the result. The version feature reflects that users who modify their story can improve its quality and get more attention. The embed feature reflects that stories shared more widely are more popular. English is the most widely known language in the world, so stories written in English are read by more people than those in any other language. Although this feature is quite similar to the curator's language feature, not all curators use their main language to write stories. In our experiments, tests using this feature achieved higher results. Finally, a higher proportion of Twitter elements and media elements also increases the accuracy. Moreover, stories with many elements draw more attention than stories with fewer elements.

Table 4 Prediction accuracies for two feature types

Type of feature            Curation features   Curator features   Combined features
No. of features                            7                  5                  12
Classification (10-fold)              75.08%             80.20%              82.62%

To the best of our knowledge, no prior work has analyzed the effect of these features on content popularity. Therefore, the features we propose are based on experiments and on the feature selection tool, chosen to achieve the highest result. The feature selection tool, combined with libSVM, uses the F-score for selecting features [5]. The F-score is a simple technique that measures the discrimination of two sets of real numbers. The larger the F-score, the more discriminative the feature is likely to be. Therefore, this score is used as a feature selection criterion. Moreover, libSVM also provides a feature scaling function that absorbs the scale differences among feature values; we used it to rescale the features to [0, 1]. Finally, the twelve features above gave the highest result for predicting the popularity of Storify data.
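The class definition of Section 4.1 and the F-score and scaling steps of this section can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, the boundary values 10 and 1000 are assumed to fall in class 2, and the F-score is written in the usual two-class form (squared distances of the two class means from the overall mean, divided by the sum of the two sample variances), which is our reading of the criterion used by the feature selection tool [5].

```python
from statistics import mean, variance


def popularity_class(views):
    """Map a view count to the three classes of Section 4.1.

    Assumption: the boundary values 10 and 1000 both fall in class 2.
    """
    if views < 10:
        return 1  # not popular
    if views <= 1000:
        return 2  # less popular
    return 3      # very popular


def min_max_scale(values):
    """Rescale one feature column to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def f_score(positive, negative):
    """Two-class F-score of a single feature.

    `positive` and `negative` hold the feature values of the two sets.
    Each set needs at least two values, and at least one set must vary,
    otherwise the denominator is zero.
    """
    overall = mean(list(positive) + list(negative))
    between = (mean(positive) - overall) ** 2 + (mean(negative) - overall) ** 2
    within = variance(positive) + variance(negative)  # sample variances (n - 1)
    return between / within
```

Because the F-score compares two sets, applying it to three popularity classes needs a convention such as scoring each class against the rest; the paper does not state which convention was used.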
5 Experimental Results

5.1 The Experimental Dataset

We used Storify's streaming API to collect a random sample of public stories created from March 1, 2013 to March 31, 2013, which yielded 34,810 curation lists. We assume that these stories have the same publication time. We crawled them in June 2013, so we predict how much attention this content receives three months later. Finally, we divided the dataset into 10 groups and ran 10-fold cross-validation.

5.2 Results

The popularity levels are given by the three classes described in Section 4.1. Statistically, nearly half of the stories are in class 1, nearly 20% are in class 2, and the rest are in class 3. The prediction accuracy for the two types of features is shown in Table 4. Curation features (7 features) give the worst result at 75.08%, curator features (5 features) reach 80.02%, and the best result is obtained with combined features (curator and curation features together, 12 features in total) at 82.62%. Therefore, both types of features are necessary for high prediction accuracy.

Table 5 Accuracy of 10 tests

Test   Curator features   Combined features
1                83.61%              87.42%
2                82.26%              85.58%
3                79.88%              83.91%
4                78.55%              80.38%
5                82.23%              85.49%
6                73.56%              71.19%
7                76.31%              78.92%
8                87.60%              89.83%
9                68.44%              70.38%
10               88.78%              93.22%

Table 5 shows more detailed results for the 10 tests. The curator features (as baseline features) and combined features show different results. Most tests using combined features are more accurate than tests using only curator features, except for test 6. We analyzed test 6 and found that the percentage of class 2 is approximately 40%, which is double the normal percentage. This suggests that combined features cannot perform well for class 2. In addition, most tests using combined features attain roughly 83% accuracy, except for tests 6 and 9 (lower accuracy, barely over 70%) and tests 8 and 10 (high accuracy, around or above 90%). Although the distribution ratio of classes in these tests is quite different from
the others, the difference is irregular and not significant. This is an open problem in our research; answering this question would improve the result.

5.3 T-Test Evaluation

The (Student's) t-test is a statistical test for comparing two population means. In simple terms, the t-test assesses whether the means of two groups are statistically different from each other. It is commonly used when the variances of two normal distributions are unknown and when an experiment uses a small sample size. In our case, we used the t-test to compare the two groups of results from the above 10 tests (a small sample size). The 95% confidence interval of the difference is from −3.8929 to −1.1271, and we calculated t = −4.1059. Therefore, we conclude that this difference is very significant. It indicates that our proposal to use both types of features is effective for predicting the popularity of social curation data.

6 Conclusion

In this paper, we presented a method to predict the popularity of social curation content as the first step toward mining social curation. A key insight is that a curation list, unlike other social data, is manually collected, selected, and maintained by curators. We used a machine learning approach and selected key features. Analyzing the features, we found that social features (curator features) perform very well, but the system can be improved by combining them with content features (curation features). A comparison using the t-test showed that the improvement is significant. On the other hand, the paper investigated only a specific curation dataset for a specific task. We are aware that there are many open problems. We have to investigate social features in a larger dataset or in other domains. In addition, analyzing and explaining the effect of features for predicting the popularity of social curation could improve the result. Finally, our research is a first step in mining social curation data. Based on
this research, we could consider future tasks such as an automatic system or a recommendation system for curating social data.

Acknowledgment This work is partially supported by the Research Grant from Vietnam National University, Hanoi, No. QG.14.04.

References

[1] Ahmed, M., Spagna, S., Huici, F., Niccolini, S.: A peek into the future: Predicting the evolution of popularity in user generated content. In: Proc. WSDM 2013 (2013)
[2] Bandari, R., Asur, S., Huberman, B.A.: The pulse of news in social media: Forecasting popularity. CoRR, abs/1202.0332 (2012)
[3] Cha, M., Mislove, A., Gummadi, K.P.: A measurement-driven analysis of information propagation in the Flickr social network. In: Proc. WWW 2009 (2009)
[4] Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM Trans. Intelligent Systems and Technology (2011)
[5] Chen, Y.-W., Lin, C.-J.: Combining SVMs with various feature selection strategies. In: Guyon, I., Nikravesh, M., Gunn, S., Zadeh, L.A. (eds.) Feature Extraction. STUDFUZZ, vol. 207, pp. 315–324. Springer, Heidelberg (2006)
[6] Duh, K., Hirao, T., Kimura, A., Ishiguro, K., Iwata, T., Au Yeung, C.-M.: Creating stories: Social curation on Twitter messages. In: Proc. ICWSM (2012)
[7] Greene, D., Sheridan, G., Smyth, B., Cunningham, P.: Aggregating content and network information to curate Twitter user lists. In: Proc. ACM RecSys workshop, RSWeb (2012)
[8] Hall, C., Zarro, M.: Social curation on the website Pinterest.com. In: Proc. ASIST 2012 (2012)
[9] Hogg, T., Lerman, K.: Social dynamics of Digg. CoRR, abs/1202.0031 (2012)
[10] Hong, L., Dan, O., Davison, B.D.: Predicting popular messages in Twitter. In: Proc. WWW 2011 (2011)
[11] Ishiguro, K., Kimura, A., Takeuchi, K.: Towards automatic image understanding and mining via social curation. In: Proc. ICDM 2012 (2012)
[12] Jamali, S., Rangwala, H.: Digging Digg: Comment mining, popularity prediction, and social network analysis. In: Proc. WISM 2009 (2009)
[13] Fincham, K.: Storify. The National Association for Media Literacy Education's Journal of Media Literacy Education (2011)
[14] Kim, S.-D., Kim, S.-H., Cho, H.-G.: Predicting the virtual temperature of web-blog articles as a measurement tool for online popularity. In: Proc. CIT 2011 (2011)
[15] Lakkaraju, H., Ajmera, J.: Attention prediction on social media brand pages. In: Proc. CIKM 2011 (2011)
[16] Lee, J.G., Moon, S., Salamatian, K.: An approach to model and predict the popularity of online contents with explanatory factors. In: Proc. WI-IAT (2010)
[17] Lee, J.G., Moon, S., Salamatian, K.: Modeling and predicting the popularity of online contents with Cox proportional hazard regression model. Neurocomputing (2012)
[18] Lerman, K., Hogg, T.: Using a model of social dynamics to predict popularity of news. CoRR, abs/1004.5354 (2010)
[19] Suh, B., Hong, L., Pirolli, P., Chi, E.H.: Want to be retweeted? Large scale analytics on factors impacting retweet in Twitter network. In: Social Computing, SocialCom (2010)
[20] Szabo, G., Huberman, B.A.: Predicting the popularity of online content. Communications of the ACM 53, 80–88 (2010)
[21] Tsagkias, M., Weerkamp, W., de Rijke, M.: News comments: Exploring, modeling, and online prediction. In: Gurrin, C., He, Y., Kazai, G., Kruschwitz, U., Little, S., Roelleke, T., Rüger, S., van Rijsbergen, K. (eds.) ECIR 2010. LNCS, vol. 5993, pp. 191–203. Springer, Heidelberg (2010)
[22] YouTube Statistics, http://www.youtube.com/yt/press/statistics.html
[23] Zarro, M., Hall, C.: Exploring social curation. D-Lib Magazine 18(11/12) (November/December 2012)
[24] Van Zwol, R., Rae, A., Pueyo, L.G.: Prediction of favourite photos using social, visual, and textual signals. In: Proc. ACMMM (2010)
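The t-test figures reported in Section 5.3 can be re-derived from the per-test accuracies in Table 5. The sketch below is a reader's check, not the authors' procedure: it assumes a paired t-test over the 10 tests (the paper does not name the variant), it hard-codes the two-sided 95% critical value of Student's t for 9 degrees of freedom, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Per-test accuracies (%) copied from Table 5.
CURATOR = [83.61, 82.26, 79.88, 78.55, 82.23, 73.56, 76.31, 87.60, 68.44, 88.78]
COMBINED = [87.42, 85.58, 83.91, 80.38, 85.49, 71.19, 78.92, 89.83, 70.38, 93.22]

# Two-sided 95% critical value of Student's t with 9 degrees of freedom.
T_CRIT_DF9 = 2.262157


def paired_t_test(a, b, t_crit):
    """Return (t statistic, CI lower bound, CI upper bound) for mean(a - b)."""
    diffs = [x - y for x, y in zip(a, b)]
    d_mean = mean(diffs)
    std_err = stdev(diffs) / sqrt(len(diffs))  # stdev divides by n - 1
    margin = t_crit * std_err
    return d_mean / std_err, d_mean - margin, d_mean + margin


t_stat, ci_low, ci_high = paired_t_test(CURATOR, COMBINED, T_CRIT_DF9)
print(f"t = {t_stat:.4f}, 95% CI = ({ci_low:.4f}, {ci_high:.4f})")
```

Under the paired-test assumption, the statistic and interval agree with the reported t = −4.1059 and the interval from −3.8929 to −1.1271 to about three decimal places, which suggests that a paired comparison of curator features minus combined features is what Section 5.3 describes.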