Volume 16, Issue 5–6, November 2013
World Wide Web (2013) 16:541–544
DOI 10.1007/s11280-013-0254-0

Guest editorial: social networks and social Web mining

Guandong Xu & Jeffrey Yu & Wookey Lee

Received: 14 August 2013 / Accepted: September 2013 / Published online: 27 September 2013
© Springer Science+Business Media New York 2013

G. Xu (*), University of Technology Sydney, Sydney, Australia. e-mail: Guandong.Xu@uts.edu.au
J. Yu, Chinese University of Hong Kong, Hong Kong, China. e-mail: yu@se.cuhk.edu.hk
W. Lee, Inha University, Incheon, Korea. e-mail: trinity@inha.ac.kr

Nowadays, the emergence of web-based communities and hosted services, such as social networking sites (Facebook, LinkedIn), wikis (Wikipedia), microblogging (Twitter), and folksonomies (Delicious, Flickr), brings tremendous freedom of Web autonomy and facilitates collaboration and sharing between users. Along with the interactions between users and computers, the social Web is rapidly becoming an important part of our digital experience, ranging from digital textual information to rich multimedia formats. Social networks have played an important role in different domains for about a decade, particularly in a broad range of social activities such as user interaction, establishing friendship relationships, sharing and recommending resources, suggesting friends, creating groups and communities, and commenting on friends' activities and opinions. Recent years have witnessed rapid progress in the study of social networks for diverse applications, such as user profiling in Facebook and group recommendation via Flickr. These aspects and characteristics form the most active and challenging parts of Web 2.0.

Many challenges and opportunities have arisen with the propagation and popularity of new applications and technologies. A prominent challenge lies in modeling and mining this vast volume of data to extract, represent, and exploit meaningful knowledge, and to leverage the structures and dynamics of emerging social networks residing in the social Web, especially social media. Social networks and social Web mining combines data mining with social computing as a promising direction and offers unique opportunities for developing novel algorithms and tools, ranging from text and content mining to link mining and community detection.

This special issue gained overwhelming attention and received 52 submissions from researchers and practitioners working on social network analysis and social media mining. After an initial examination of all submissions, 42 papers were selected for the regular rigorous review process, and each submission was reviewed by at least three reviewers. After two to three rounds of review, ten quality papers were eventually recommended for inclusion in this special issue; they are summarized below.

The paper titled "Mesoscopic Analysis of Networks with Genetic Algorithms" presents a genetic-algorithm-based approach to discovering communities in networks. The algorithm optimizes a simple but efficacious fitness function to identify densely connected groups of nodes with sparse connections between groups, thus sensibly reducing the search space of possible solutions. Experiments on synthetic and real-life networks show the ability of the method to successfully detect the network structure.

Complex networks have received increasing attention from the scientific community, in line with the increasing availability of real-world network data. Apart from the network analysis
that has focused on the characterization and measurement of local and global properties of graphs, such as diameter, degree distribution, and centrality, the multidimensional nature of real-world networks has been discovered, i.e., many networks containing multiple connections between any pair of nodes have been analyzed. The paper "Multidimensional Networks: Foundations of Structural Analysis" discusses the basis for multidimensional network analysis by presenting a solid repertoire of basic concepts and analytical measures that take into account the general structure of multidimensional networks. The framework has been tested on different real-world multidimensional networks, showing the validity and meaningfulness of the introduced measures, which are able to extract important and nonrandom information about complex phenomena in such networks.

In "A Time Decoupling Approach for Studying Forum Dynamics", the authors propose an approach that decouples temporal information about users into sequences of user events and inter-event times. Online forums are rich sources of information about user communication activity over time, and finding temporal patterns in online forum communication threads can advance our understanding of the dynamics of conversations. The main challenge of temporal analysis in this context is the complexity of forum data: there can be thousands of interacting users, who can be numerically described in many different ways, and user characteristics can evolve over time. The authors develop a new feature space to represent the event sequences as paths, and model the distribution of the inter-event times. They study over 30,000 users across four Internet forums and discover novel patterns in user communication.

The paper titled "Who Blogs What: Understanding the Publishing Behavior of Bloggers" investigates bloggers' publishing style and impact by grouping bloggers based on an analysis of topical coverage and comparing their publishing behaviors. From a blog website with more than 370,000 posts, two types of bloggers are first identified: specialists and generalists. The authors then study and compare the respective publishing behaviors in the blogosphere, finding that bloggers with different topical coverage behave in different ways. Specialists generally make more contributions than generalists and tend to publish more on weekdays, during business hours, and on a more regular basis. Moreover, the authors observe that specialists themselves have different publishing behaviors, with only a small fraction creating a large "buzz" or producing a voluminous output.

Online discussion threads are conversational cascades in the form of posted messages that can generally be found in social systems comprising many-to-many interaction, such as blogs, news aggregators, or bulletin board systems. The paper "A likelihood-based framework for the analysis of discussion threads" proposes a framework based on generative models of growing trees to analyze the structure and evolution of discussion threads. The authors consider the growth of a discussion to be determined by an interplay between popularity, novelty, and a trend (or bias) to reply to the thread originator. The relevance of these features is estimated using a full likelihood approach and allows one to characterize the habits and communication patterns of a given platform and/or community. They apply the proposed framework to four popular websites: Slashdot, Barrapunto (a Spanish version of Slashdot), Meneame (a
Spanish Digg clone), and the article discussion pages of the English Wikipedia, to evaluate their model.

Social recommender systems largely rely on user-contributed data to infer users' preferences, which might introduce unreliability to recommenders, as users are allowed to insert data freely. Although detecting malicious attacks from social spammers has been studied for years, detecting Noisy but Non-Malicious Users (NNMUs), i.e., genuine users who may provide some untruthful data due to their imperfect behavior, remains an open research question. The paper "Noisy but Non-Malicious User Detection in Social Recommender Systems" studies how to detect NNMUs in social recommender systems. Based on the assumption that the ratings provided by the same user on closely correlated items should have similar scores, the authors propose an effective method for NNMU detection by capturing and accumulating a user's "self-contradictions", i.e., the cases in which a user provides very different rating scores on closely correlated items. They show that self-contradiction capturing can be formulated as a constrained quadratic optimization problem w.r.t. a set of slack variables, which can be further used to quantify the underlying noise in each test user.

The paper titled "SocialSearch+: Enriching Social Network with Web Evidences" addresses the problem of searching for social network accounts, e.g., Twitter accounts, with the rich information available on the Web, e.g., people's names, attributes, and relationships to other people. Existing solutions building upon naive textual matching inevitably suffer low precision due to false positives (e.g., fake impersonator accounts) and false negatives (e.g., accounts using nicknames). To overcome these limitations, the authors leverage "relational" evidence extracted from the Web corpus, namely web-scale entity-relationship graphs extracted from name co-occurrences on the Web, and web-scale relational repositories, such as Freebase, with complementary strengths. Using both textual and relational features obtained from these resources, a ranking function is learned to aggregate the features for the accurate ordering of candidate matches. Another key contribution of this paper is to formulate confidence scoring as a separate problem from relevance ranking. The proposed system is evaluated using real-life internet-scale entity-relationship and social network graphs.

Recommender systems utilizing Collaborative Filtering (CF) as the key algorithm are vulnerable to shilling attacks, which insert malicious user profiles into the systems to push or nuke the reputations of targeted items. There are only a small number of labeled users in most practical recommender systems, while a large number of users are unlabeled, because it is expensive to obtain their identities. In "Shilling Attack Detection Utilizing Semi-supervised Learning Method for Collaborative Recommender System", a new semi-supervised-learning-based shilling attack detection algorithm, named Semi-SAD, is proposed to take advantage of both types of data. It first trains a naive Bayes classifier on a small set of labeled users, and then incorporates unlabeled users via EM to improve the initial naive Bayes classifier. Experiments on MovieLens datasets indicate that Semi-SAD can detect various kinds of shilling attacks better than other methods, especially obfuscated and hybrid shilling attacks.

Mobile and pervasive computing technologies enable us to obtain real-world sensing data for sociological studies,
such as exploring human behaviors and relationships. In "Understanding Social Relationship Evolution by Using Real-World Sensing Data", the authors present a study of understanding social relationship evolution by using real-life anonymized mobile phone data. Through the study, the authors show that social relationships (not only reciprocal friends and non-friends, but also non-reciprocal friends) can likely be predicted using real-world sensing data. In terms of friendship evolution, they verify that the principles of reciprocity and transitivity play an important role in social relation evolution.

The paper titled "Can Predicate-Argument Structures be used for Contextual Opinion Retrieval from Blogs?" presents the use of predicate-argument structures for contextual opinion retrieval. Different from keyword-based opinion retrieval approaches, which use the frequency of certain keywords, their solution is based on the frequency of contextually relevant and subjective sentences. They use a linear relevance model that leverages semantic similarities among the predicate-argument structures of sentences. The model features a linear combination of a popular relevance model, the proposed transformed term similarity model, and the absolute value of a sentence subjectivity scoring scheme. The predicate-argument structures are derived from the grammatical derivations of natural language query topics and the well-formed sentences of blog documents. Evaluation and experimental results demonstrate the feasibility of using predicate-argument structures for contextual opinion retrieval.

Finally, we would like to thank all the authors who submitted manuscripts for consideration, and the over 120 anonymous, dedicated reviewers for their criticism and the time they spent helping us reach the final decisions. Without their valuable and strong support, we could not have made this special issue successful. Our sincere gratitude also goes to the WWWJ Editor-in-Chief, Prof. Yanchun Zhang, and to Ms. Jennylyn Roseiento and Mr. Hector Nazario from the Springer Journal Editorial Office for helping us present this special issue to readers.

World Wide Web (2013) 16:545–565
DOI 10.1007/s11280-012-0174-4

Mesoscopic analysis of networks with genetic algorithms

Clara Pizzuti

Received: 15 July 2011 / Revised: 17 May 2012 / Accepted: 22 May 2012 / Published online: June 2012
© Springer Science+Business Media, LLC 2012

C. Pizzuti, Institute for High Performance Computing and Networking (ICAR), Italian National Research Council (CNR), Via P. Bucci 41/C, 87036 Rende (CS), Italy. e-mail: pizzuti@icar.cnr.it

Abstract The detection of communities is an important problem, intensively investigated in recent years, to uncover the complex interconnections hidden in networks. In this paper a genetic-based approach to discover communities in networks is proposed. The algorithm optimizes a simple but efficacious fitness function able to identify densely connected groups of nodes with sparse connections between groups. The method is efficient because the variation operators are modified to take into consideration only the actual correlations among the nodes, thus sensibly reducing the search space of possible solutions. Experiments on synthetic and real-life networks show the ability of the method to successfully detect the network structure.

Keywords genetic algorithms · data mining · clustering · community detection · networks

1 Introduction

The suitability of networks to represent many real-world systems has given an impressive spur to the recent research area of complex networks. Collaboration networks, biological networks, communication and transport networks, the Internet, and the World-Wide-Web [25] are just some examples. Networks,
in general, are constituted by a set of objects and by a set of interconnections among these objects. In social networks, for example, the objects are people and the connections represent social relations, such as common interests, friendship, religion, and so on. Members of networks and the relationships between them can be modeled as a graph of nodes and edges: each participant is denoted by a distinct node, and interactions are represented by edges connecting two objects.

Complex networks can be analyzed at different levels of granularity. The node level is the smallest scale to study; at this level, the node degree can give valuable information on the role played by the objects participating in the network. More interestingly, the community or sub-graph level investigates the division of a network into groups (also called clusters or modules) having dense intra-connections and sparse inter-connections, thus delivering a mesoscopic description of a network whose elements are the communities rather than the nodes. This kind of partitioning is typical of many networks; thus the study of community structure can give important information and useful insights into how the structure of ties affects individuals and their relationships. In fact, members of a community interact with each other, share information, and can have a remarkable influence on the behavior of the other objects of the community. The problem of community detection has been receiving a lot of attention in the last few years, and many different approaches have been proposed [1, 3, 4, 10, 17, 22, 23, 26, 29, 31–33, 37, 39].

In this paper an algorithm, named GA-Net, to discover communities in networks by employing Genetic Algorithms (GAs) [14] is proposed. The approach introduces the concept of community score to measure the quality of a network partitioning into communities, and tries to optimize this quantity by running the genetic algorithm. All the dense communities present in the network structure are obtained at the end of the algorithm by selectively exploring the search space, without the need to know the exact number of groups in advance. Specialized variation operators allow the space of possible solutions to be reduced, thus improving the convergence of the algorithm. The method requires an input parameter that biases the search towards a different number of communities; the number of communities found is determined by the optimal value of the community score. Experiments on synthetic and real-life networks show the capability of the genetic approach to correctly detect communities, with results comparable to state-of-the-art approaches.

The paper is organized as follows. In the next section an overview of the main proposals of community detection algorithms is given. Section 3 provides the necessary background to formalize the problem and defines the quality metric employed to detect communities. In Section 4 a description of the method, along with the representation adopted and the variation operators used, is provided. In Section 5 the results of the method on synthetic and real-life data sets are presented. Section 6 discusses the advantages of using GA-Net. Finally, Section 7 concludes the paper.

2 Related work

Many different algorithms have been proposed to detect communities in complex networks [1, 3, 4, 7, 11, 13, 17, 22, 23,
26, 27, 29, 31–33, 35, 37, 39]. In the following we review some of the best-known algorithms; overviews of community identification methods in complex networks can be found in [6, 8, 10].

One of the most famous algorithms was presented by Newman and Girvan in [11, 29]. The method is a divisive hierarchical clustering method based on an iterative removal of edges from the network; the edge removal splits the network into communities. An agglomerative, instead of divisive, hierarchical algorithm that optimizes the concept of modularity, introduced in [29], is presented by Newman in [26]. The modularity is the fraction of edges inside communities minus the expected value of the fraction of edges if edges fell at random without regard to the community structure; values approaching 1 indicate strong community structure. Thus the algorithm computes the modularity of all the clusters obtained by applying the hierarchical approach, and returns as result the clustering having the highest value of modularity. A faster version of the method, based on the same strategy, is described in [4]. Recently, some studies [9] have indicated that the optimization of modularity has a main disadvantage: it can fail to find communities smaller than a fixed scale, even if these modules are well defined. The scale depends on the total size of the network and the interconnection degree of the modules. This resolution limit can constitute a weakness for all those methods whose objective to optimize is modularity.

Wakita and Tsurumi [37] improved the method of [4] by identifying the cause of inefficiency of this latter agglomerative method in the strategy adopted to merge communities. To this end they introduced three metrics that try to balance the size of the communities to be merged. The modularity criterion enriched with these metrics allows for a sensible improvement of the algorithm's efficiency.

Radicchi et al. [32] proposed a divisive hierarchical algorithm to identify communities based on the concept of edge-clustering coefficient, defined in analogy with the node clustering coefficient.¹ The edge-clustering coefficient is the number of triangles an edge participates in, divided by the number of triangles it might belong to, given the degrees of the adjacent nodes. Their algorithm works like that of Newman and Girvan, but it is faster. The main difference is that instead of choosing to remove the edge with the highest edge betweenness, the removed edges are those having the smallest value of edge-clustering coefficient. However, a quantitative measure for the evaluation of the dendrograms generated by the hierarchical approach is not defined; thus the choice of one solution with respect to another must rely on the intuitive concept of community that a user has.

Pons and Latapy [31] introduced an agglomerative hierarchical algorithm to compute the community structure of a network. The algorithm starts from a partition of the graph in which each node is a community, and then merges the two adjacent communities (i.e., having at least one common edge) that minimize the mean of the square distances between each vertex and its community. The distances between communities are recomputed and the previous step is repeated until all the nodes belong to the same community. In order to decide the best partitioning, the modularity criterion of Girvan and Newman is adopted.

¹ The clustering coefficient was defined by [38]. Given a node i, let n_i be the number of links connecting the k_i neighbors of i to each other. The clustering coefficient of i is C_i = 2n_i / (k_i(k_i − 1)): n_i represents the number of triangles passing through i, and k_i(k_i − 1)/2 the number of possible triangles that could pass through node i. The clustering coefficient of a graph is the average of the clustering coefficients of the nodes it contains.
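To make these quantities concrete, the following minimal sketch computes the node clustering coefficient and the edge-clustering coefficient exactly as described above; the adjacency-set representation and the toy graph are our own assumptions, not part of the paper.

```python
from itertools import combinations

def clustering_coefficient(adj, i):
    """C_i = 2*n_i / (k_i*(k_i - 1)), where n_i is the number of links
    connecting the k_i neighbors of node i to each other."""
    neighbors = adj[i]
    k = len(neighbors)
    if k < 2:
        return 0.0
    n_i = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * n_i / (k * (k - 1))

def edge_clustering_coefficient(adj, u, v):
    """Triangles the edge (u, v) participates in, divided by the number of
    triangles it might belong to, given the degrees of its endpoints."""
    triangles = len(adj[u] & adj[v])
    possible = min(len(adj[u]) - 1, len(adj[v]) - 1)
    return triangles / possible if possible > 0 else 0.0

# Toy graph: a triangle {0, 1, 2} with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(clustering_coefficient(adj, 2))          # 2*1/(3*2) = 0.333...
print(edge_clustering_coefficient(adj, 0, 1))  # 1 triangle / 1 possible = 1.0
```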
Blondel et al. [3] presented a method that partitions large networks based on modularity optimization. The algorithm consists of two phases that are repeated iteratively until no further improvement can be obtained. At the beginning each node of the network is considered a community. Then, for each node i, all its neighbors j are considered, and the gain in modularity of removing i from its community and adding it to j's community is computed. The node is placed in the community for which the gain is positive and maximum; if no community has positive gain, i remains in its original group. This first phase is repeated until no node move can improve the modularity. The second phase builds a network where the communities obtained are considered as the new nodes, and a link between two communities a and b exists if there is an edge between a node belonging to a and a node belonging to b. The network can be weighted; in such a case, the weight of the edge between a and b is the sum of the weights of the links between nodes of the corresponding communities. At this point the method can be reiterated until no more changes can be made to improve modularity. The method is very accurate; however, it is unable to detect modules at a particular scale.

Approaches to community detection based on Genetic Algorithms can be found in [7, 13, 22, 35]. In [35] the authors present a genetic algorithm that uses as fitness function the network modularity proposed by Newman and Girvan. An individual is constituted by N genes, where N is the number of objects; the ith gene corresponds to the ith node, and its value is the community identifier of node i. They use a non-standard one-way crossover operation in which, given two individuals A and B, a community identifier j is chosen at random, and the identifier j of the nodes j_1, ..., j_h of A is transferred to the same nodes of B. Gog et al. [13] proposed a collaborative evolutionary algorithm that also uses modularity as the fitness function to optimize. The main novelty of this approach is that each individual is endowed with the knowledge about the best potential solution already obtained during the search process, and the value of its best ancestor. The sharing of this information helps the method find significant community structure. Both of the above methods could fail to uncover community structure when the network contains modules satisfying the conditions of the resolution limit property stated in [9].

A different approach is described in [7], where a random-walk distance measure between nodes is integrated into a genetic algorithm to cluster networks. The representation used is the k-medoids, where each cluster center is represented by one of the nodes of the network. The fitness function tries to minimize the sum of all the pair-wise distances between nodes. The main limitation of this approach is that the number k of clusters must be known in advance.

An agglomerative clustering method based on Genetic Algorithms has been proposed by Lipczak et al. [22]. In this approach each individual represents a single community, instead of the whole clustering solution. Two fitness functions are considered.
The former considers the normalized cut, i.e., it assumes that a graph is divided into two disjoint sets A and B, and defines the score of this division as the fraction of all the connections between A and B with respect to the number of connections involving A and B separately. The other fitness function is essentially the modularity of Girvan and Newman. The authors compared their approach with UPGMA [34], a well-known hierarchical method, and showed the good performance of their approach. A main difference of this approach with respect to the other GA-based methods is the representation used. In fact, Lipczak et al. proposed to represent each cluster with a chromosome; thus a solution is represented by the whole population. The motivation for this choice, as stated by the authors, was to reduce the size of an individual and the fitness computational cost. This kind of representation implies that the method, in order to obtain a partitioning of the network into k clusters, needs to use a population of k individuals. Thus the method must be executed for an increasing number of clusters, and thus a population of increasing size, to find the best result. Another drawback comes from the variable length of the individuals. In order to perform crossover, a mapping to a fixed-length representation of the two individuals involved in the crossover operation is needed. The mapping of a parent adds null genes in place of genes present in the other parent. This strategy partially defeats the objective of reducing the size of individuals.

Recently, the problem of community detection has been tackled by means of particle swarm optimization (PSO) [40]. In this approach a fixed number of particles are deployed onto the search space and move according to their velocity vectors. Each particle has size equal to the number of nodes of the network and represents a partitioning. At each iteration, the fitness of the particles is computed, and the one having the best fitness is stored as the current best solution. The fitness function adopted is the modularity. The particles then update their position and velocity vectors, and repeat the same steps until the stop condition is reached.

3 Community detection problem

A network N can be modeled as a graph G = (V, E), where V is a set of n = |V| objects, called nodes or vertices, and E is a set of m = |E| links, called edges, that connect two elements of V. In the following, without loss of generality, the graph modeling a network is assumed to be undirected. A community in a network is a group of vertices (i.e., a sub-graph) having a high density of edges within them, and a lower density of edges between groups. In [8] it is observed that a formal definition of community does not exist, because this definition often depends on the application domain. In this paper we assume the intuitive definition of weak community given by Radicchi et al. [32]: a weak community is interpreted as a set of nodes having a total number of intra-connections higher than the number of inter-connections to different communities.

The partitioning of the graph G, modeling a network N, into k weak communities {S_1, ..., S_k} can be transformed into that of partitioning the adjacency matrix A of G into k sub-matrices, such that the sum of the densities of the sub-matrices is maximized. A naive density measure for a sub-matrix of n rows/columns is the number of ones (i.e., interactions) it contains: the higher the number of ones, the more connected the n nodes. However, counting the number of interactions does not give any information about the interconnections among the nodes. A quality measure of a community S that maximizes the in-degree of the nodes belonging to S can be defined as follows:

$$score(S) = \frac{\sum_{i \in S} (\mu_i)^r}{|S|} \times \sum_{i,j \in S} A_{ij}, \qquad \mu_i = \frac{1}{|S|} \sum_{j \in S} A_{ij}$$

where |S| is the cardinality of S, μ_i is the fraction of edges connecting node i to the other nodes in S, r is a positive exponent, and Σ_{i,j∈S} A_ij is double the number of edges connecting vertices inside S, i.e., the number of entries in the adjacency sub-matrix of A corresponding to S. The community score of a clustering {S_1, ..., S_k} of a network is defined as

$$CS = \sum_{i=1}^{k} score(S_i)$$
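As a minimal sketch of this definition (our illustration, following the equations reconstructed above; the adjacency matrix, partition, and exponent r are hypothetical inputs):

```python
def community_score(A, communities, r=1.5):
    """CS of a partitioning: for each community S, multiply the mean of
    (mu_i)**r over i in S by the volume sum_{i,j in S} A[i][j]."""
    cs = 0.0
    for S in communities:
        size = len(S)
        if size == 0:
            continue
        mus = [sum(A[i][j] for j in S) / size for i in S]  # mu_i per node
        power_mean = sum(mu ** r for mu in mus) / size
        volume = sum(A[i][j] for i in S for j in S)  # 2 * (#edges inside S)
        cs += power_mean * volume
    return cs

# Two cliques, {0,1,2} and {3,4}, joined by the single edge (2, 3).
A = [[0, 1, 1, 0, 0],
     [1, 0, 1, 0, 0],
     [1, 1, 0, 1, 0],
     [0, 0, 1, 0, 1],
     [0, 0, 0, 1, 0]]
print(community_score(A, [{0, 1, 2}, {3, 4}]))
```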
Table: Term-term similarity matrix showing contextual relevance between the predicate-argument structures of a query and a sentence

Spredargs \ Qpredargs | Designed | Clothes | Oscar | Awards
made | 1 | 0 | 0 | 0
clothes | 0 | 1 | 0 | 0
Oscar | 0 | 0 | 1 | 1
awards | 0 | 0 | 1 | 1

… predicate-argument structure, while another sentence with much longer grammatical dependencies may have five or more terms in its predicate-argument structure. Thus, we use the relevance model proposed by Lavrenko and Croft [31] to estimate P(i | Qpredargs \ Spredargs); this model has shown outstanding performance in many information retrieval systems [36]. The relevance model P(i | Qpredargs \ Spredargs) is estimated in terms of the joint probability of observing a term i together with terms from Qpredargs \ Spredargs [31]. Consider, for example, the following scenario:

q: regulations that China proposed
Qpredargs: proposed\regulations/China
s1: China proposed regulations on tobacco
s1predargs: proposed\regulations/China/tobacco
s2: The proposed regulation on tobacco is a new development in China
s2predargs: proposed\regulation/China/tobacco/development

Above, the predicate-argument structures Qpredargs, S1predargs, and S2predargs are not squared: each predicate-argument structure yields a vector of different length. To solve this problem, we can make a pairwise independence assumption, i.e., that i and all terms from Qpredargs \ Spredargs are sampled independently and identically. In this way, the relevance model is appropriate for determining when the predicate-argument structure Qpredargs is independently present as part of the predicate-argument structure of some long-range sentences, as shown in S1predargs and S2predargs above. The relevance model is computed as follows:

$$P(i, j_1, \ldots, j_k) = P(i) \prod_{l=1}^{k} \sum_{M_l \in Z} P(M_l \mid i)\, P(j_l \mid M_l) \qquad (6)$$

where M_l is a unigram distribution from which i and j are sampled identically and independently, and Z denotes a universal set of unigram distributions, following [31].

Thus, for each blog document, our aim is to perform a linear combination of the result of TTS in Eq. 5 and the result of the relevance model in Eq. 6, and then add the summation of all subjective scores identified by the tagged subjective predicate-argument structures. Intuitively, the predicate-argument structure of a given query can be checked against predicate-argument structures of sentences with either short-range or long-range grammatical dependencies. Therefore, we believe the TTS approach solves the short-range dependencies efficiently, and the relevance model approach solves the long-range dependencies efficiently. We also believe that the subjective scoring scheme, by which subjective sentences are identified and tagged for compensation, ensures that the retrieved relevant blog documents are ranked according to their level of subjectivity. The linear relevance score for a given blog document is computed as follows:

$$R_L(q, d) = \frac{k}{N}\sum_{i=1}^{k} TTS_{score} + \frac{v}{N}\sum_{i=1}^{v} RM_{score} + \sum_{i=1}^{k+v} Subj_{score} \qquad (7)$$

where R_L(q, d) is the linear relevance function that takes as input a query q and a document d, satisfying a linear combination expression C = aX + bY + Z, where a and b can be empirical constants and X, Y, and Z are the values to be linearly combined; k is the number of sentences with short-range dependencies that satisfy TTS; N is the total number of sentences in document d; TTS_score is derived using Eq. 5; v is the number of sentences with long-range dependencies that satisfy RM_score; RM_score is derived using Eq. 6; and Subj_score is the subjective score, empirically set to 0.5 for each subjective sentence.
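A minimal sketch of the Eq. 7 combination (our illustration, not the authors' implementation): the per-sentence TTS and relevance-model scorers are assumed to return None for sentences that do not qualify, and `is_subjective` flags sentences tagged "subj".

```python
def linear_relevance(sentences, tts_score, rm_score, is_subjective,
                     subj_score=0.5):
    """Eq. 7 sketch: R_L = (k/N)*sum(TTS) + (v/N)*sum(RM) + sum(Subj)."""
    N = len(sentences)
    if N == 0:
        return 0.0
    tts_vals, rm_vals, subj_total = [], [], 0.0
    for s in sentences:
        t, r = tts_score(s), rm_score(s)  # None when s does not qualify
        if t is not None:                 # short-range dependency match
            tts_vals.append(t)
        elif r is not None:               # long-range dependency match
            rm_vals.append(r)
        # subjectivity compensation over the k+v matched sentences
        if (t is not None or r is not None) and is_subjective(s):
            subj_total += subj_score
    k, v = len(tts_vals), len(rm_vals)
    return (k / N) * sum(tts_vals) + (v / N) * sum(rm_vals) + subj_total
```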
5 Evaluation and experiment

In Section 4 we modeled opinion relevance and derived a linear relevance model (Eq. 7) that can retrieve contextual opinions using semantically related predicate-argument structures. In this section we evaluate the linear relevance model against some standard baselines using different evaluation measures. To begin with, we give an overview of the evaluations performed in this section.

First, to aid performance results, we filtered the TREC Blog datasets by developing a heuristic rule-based algorithm that retrieves only English blog documents. Documents written in foreign languages are removed by comparing English function words with each blog document: an empirical frequency threshold is set for the number of English function words that must be found within each blog document, and blog documents that fall below the threshold are considered non-English and discarded. For the purpose of this work, we set the threshold to 15, as it retrieved English blog documents more effectively than a lower threshold. Our heuristic rule-based algorithm also removes all markup tags and special formatting such as scripts, element attributes, comment lines, and META description tags. The heuristic rule-based algorithm (BlogTEX)⁹ worked with better performance on the TREC Blog 08 dataset compared to [41], and we make it freely available for research purposes. In order to prepare each blog document for sentence-level grammatical parsing, we use the LingPipe Sentence Model¹⁰ to identify sentence boundaries in blog documents; we modified the sentence model to detect omitted sentences and sentences without boundary punctuation marks.

Second, in Section 5.1, we describe a prior pre-evaluation process on TREC query topics in order to obtain explicit-type natural language queries. The explicit-type queries are assumed to be the ideal natural language queries as would be entered by humans: they give unbiased grammatical parse trees for the grammatical tree derivation process, thus showing the original and intended underlying meaning of each query.

Third, in Section 5.2, query topics and blog documents are transformed into their equivalent grammatical parse trees. The TREC query topics used in this study are available as natural language sentences with an average of 10 words per query. Query topics and sentences that can be successfully transformed by the syntactic parser are assumed to be well formed. Although query formulation/expansion techniques [27] and spelling correction techniques [17] can be used to pre-process queries and sentences that are not well formed, the scope of this study does not cover such queries or sentences. Thus, we ignored document sentences and queries that could not be parsed; at least 90% of the query topics and the selected blog documents were successfully transformed into their equivalent grammatical parse trees.

⁹ http://sourceforge.net/projects/blogtex
¹⁰ http://alias-i.com/lingpipe/demos/tutorial/sentences/read-me.html
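A sketch of the language-filtering heuristic described above (the threshold of 15 follows the text; the function-word list and data layout are our assumptions):

```python
ENGLISH_FUNCTION_WORDS = {
    "the", "a", "an", "of", "in", "on", "and", "or", "to", "is",
    "are", "was", "were", "it", "that", "this", "with", "for", "as", "by",
}  # illustrative subset; a real list would be larger

def is_english(document_text, threshold=15):
    """Count English function words in the document; below the empirical
    threshold the document is considered non-English and discarded."""
    tokens = document_text.lower().split()
    hits = sum(1 for tok in tokens if tok in ENGLISH_FUNCTION_WORDS)
    return hits >= threshold

docs = ["...", "..."]  # raw blog posts after markup removal (placeholders)
english_docs = [d for d in docs if is_english(d)]
```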
Using the output parse trees of the query topics and the sentences in each blog document, predicate-argument structures are derived to represent the respective underlying semantic meanings.

Fourth, in Section 5.3, we perform the subjectivity identification of sentences in the blog documents. For each blog document, we verify whether each sentence contains a subjective expression by using the GATE¹¹ implementation of [57]. Each subjective sentence is tagged with the keyword "subj", implemented as a string constant concatenated with the equivalent predicate-argument structure of the sentence. Our model compensates subjective sentences with an empirical score of 0.5, unlike objective sentences.

Fifth, in Section 5.4, we perform topic-relevance evaluation based on cross-entropy. We compare the cross-entropy between an "ideal" relevance model (based on human judgments), our proposed linear relevance model, the standard BM25 [47] with empirical parameters k1 = 1.2 and b = 0.75, and the Language Model using Dirichlet prior smoothing [36] with μ = 2000. The idea is to identify the best approximation model, i.e., the one with minimum cross-entropy, by simply measuring the uncertainty of query topics shared by a given collection of blog documents. Both BM25 and the language model were used on the Terrier¹² open-source information retrieval platform [43], with an effective indexing technique.

Sixth, in Section 5.5, we show the performance of our opinion retrieval model in a more practical scenario: we perform experiments on the opinion retrieval task using TREC Blog 2006 and TREC Blog 2008 query topics on the TREC Blog08 dataset. The evaluation metrics used are based on Mean Average Precision (MAP), R-Precision (R-prec), and precision at ten (P@10).

Finally, in Section 5.6, we perform a fitness evaluation of the proposed linear relevance model by using a model selection (model comparison) technique. This ensures that the query-relevance evaluation results for the proposed linear relevance model have not resulted from overfitting.

¹¹ http://gate.ac.uk/download/
¹² http://ir.dcs.gla.ac.uk/wiki/Terrier

5.1 Pre-processing TREC queries

We use the description fields of the 2006 and 2008 TREC query topics. Initially, all the queries are available as descriptive natural language sentences. However, the descriptive representation of most of the queries as made available by TREC makes each query start with certain descriptive words or phrases. Thus, based on human judgments, we manually removed the descriptive words or phrases that precede some of the queries in order to make the queries more explicit and unbiased by irrelevant query words. Note that the pre-processing output of some of the queries differs from the usual "title" field provided by TREC. For example, the "title" field of the query Provide opinion of the film documentary "March of the Penguins" is "March of the Penguins", whereas our pre-processing output gives Film documentary "March of the Penguins". The query pre-processing enables specific inputs to the CCG parser in order to derive predicate-argument structures from the output parse trees. The table below shows examples of descriptive queries and their equivalent forms upon removing the preceding descriptive words or phrases; the human judges included the authors of this paper.

Table: Pre-processing of TREC query topics. Each query was modified from descriptive to explicit.

TREC Query # | Descriptive | Explicit
868 | Provide opinion concerning the aviation defense program 'Joint Strike Fighter' | The aviation defense program 'Joint Strike Fighter'
1019 | Find opinions about China's law that restricts families to only one child | China's law that restricts families to only one child
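A sketch of this kind of pre-processing (the prefix patterns here are illustrative; the paper's removal step was performed manually by the judges):

```python
import re

# illustrative descriptive prefixes observed in TREC description fields
DESCRIPTIVE_PREFIXES = [
    r"^provide opinion (?:of|concerning)\s+",
    r"^find opinions about\s+",
]

def make_explicit(query):
    """Strip a leading descriptive phrase, mimicking the manual step."""
    for pat in DESCRIPTIVE_PREFIXES:
        stripped = re.sub(pat, "", query, flags=re.IGNORECASE)
        if stripped != query:
            return stripped[:1].upper() + stripped[1:]
    return query

print(make_explicit("Find opinions about China's law that restricts "
                    "families to only one child"))
# -> "China's law that restricts families to only one child"
```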
5.2 Parsing of blog documents

We parsed the selected blog documents using the C&C tools.¹³ Offline parsing is an important idea that avoids parsing blog documents in real time during the search process: although the log-linear CCG parser has credible performance for parsing textual content in affordable time, we believe the access time to opinionated blog documents may be affected by combining the syntactic parse process with the opinion search process. The parsing process was done on a 2.66 GHz Intel Core Duo CPU with GB of RAM. The CPU time for a complete parse process for each blog document was seconds on average. Note that this performance may vary on hardware different from what we have used for our experiments. In terms of precision and recall, the C&C parser has also shown credible efficiency and performance, respectively [12].

The parsing process involved two phases, which we programmatically merged into one. In the first phase we use the C&C log-linear CCG parser to derive the grammatical output parse tree. To use the C&C parser, one can either train new log-linear models or use existing models¹⁴ provided by the C&C community; the existing models are enhanced models trained to take POS tags from the CCGbank¹⁵ gold-standard data. The second phase involves the derivation of predicate-argument structures using the C&C Boxer tool as an alternative to a predicate-argument extractor; as at the time of this study, we do not know of any predicate-argument extractor for CCG. The C&C Boxer tool takes as input a CCG output parse tree and produces Discourse Representation Structures (DRSs), which show the semantic representation of the given sentence. The DRSs are more or less predicate-argument structures, as they show the underlying structural meaning of the given sentences. Thus, in our offline parse process, selected blog documents are parsed and stored in a structural text format containing the predicate-argument structures that are equivalent to the original sentences. The predicate-argument structures of subjective sentences are then tagged accordingly. Our model can then be used on the structural format representing the blog documents.

5.3 Identification of subjective sentences

Since we can only identify subjectivity from the original natural language format of each sentence, we perform the subjectivity identification process on the original sentences from which the predicate-argument structures were derived. Again, we tag the predicate-argument structures of the equivalent subjective sentences with a string constant "subj".

¹³ http://svn.ask.it.usyd.edu.au/trac/candc/wiki
¹⁴ http://svn.ask.it.usyd.edu.au/trac/candc/wiki/Models
¹⁵ http://groups.inf.ed.ac.uk/ccg/ccgbank.html
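A minimal sketch of this tagging step (the exact concatenation format and data layout are our assumptions; the decision rule behind `is_subjective` is spelled out next):

```python
def tag_predargs(parsed_sentences, is_subjective):
    """Concatenate the string constant "subj" onto the predicate-argument
    structure of every sentence judged subjective."""
    tagged = []
    for sentence, predargs in parsed_sentences:
        if is_subjective(sentence):
            predargs = "subj:" + predargs  # hypothetical tag format
        tagged.append(predargs)
    return tagged

# e.g. [("Alice loves chocolate", r"love\Alice/chocolate")]
#   -> ["subj:love\Alice/chocolate"] when the sentence is subjective
```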
output contains a “direct-subjective” annotation component Also, the “intensity” attribute must be “high”, the “expression-intensity” attribute must be “high”, and the “insubstantial” attribute must be “false” A sentence is also considered subjective if the annotation output contains an “expressive-subjectivity" annotation component with all “intensity” attributes that must not be “low” It is also possible to use the presence of opinion words such as subjective adjectives or a subjective lexicon to determine the subjectivity of sentences [59] We observed the annotation technique to have better precision, but require some manual efforts either for identifying explicit and implicit subjectivity or for training specific a classifier that identifies subjective sentences 5.4 Evaluating topic-relevance using cross-entropy An effective opinion retrieval model is strongly dominated by good performance of the included topic-relevance component Since our model is a combination of topic-relevance and subjective components, we use the cross-entropy method to show the difference in expected performances of our model and two topic-relevance models The cross-entropy method or Kullback–Leibler divergence uses an information-theoretic approach that measures the distance between distributions The cross-entropy is measured in bits, and a minimized cross-entropy indicates better performance over other distributions The best model distribution that has minimum cut in terms of cross-entropy is believed to have better performance Thus, in terms of the relevance of a given document to a query topic, we are able to show the difference between the expected performances of our linear relevance model, the BM25 model, and the language model Since our linear relevance model is able to account for short and long-range grammatical dependencies for determining document relevance to a query topic, we expect it to have more pronounced and minimized crossentropy over the baselines Note that we not include the subjective score in this evaluation since we are only considering topic-relevance We plot the graph of measured cross-entropy as a function of the number of top-ranked blog documents, first from the “ideal” relevant documents, and then from the retrieved documents by the proposed linear relevance model Eq 7, BM25, and the language model with Dirichlet prior smoothing We select one document from the manually judged relevant documents to represent the relevant result of the ideal model Thereafter, for each approximation model, we limit the number of retrieved top-ranked documents from 1000 to 100 since we are more concerned about contextual opinionated documents at affordable browsing level We then compute their respective cross-entropy or KL-divergence by measuring the uncertainty of TREC query topics shared by the top-ranked 782 World Wide Web (2013) 16:763–791 document collection Note that we considered the synonyms and hyponyms of terms in comparing predicate-argument structures in our TTS technique (Section 4.6.1) Thus, we ensured the frequency count for calculating the probabilities for cross-entropy also include either synonyms or hyponyms of terms in query topics and the retrieved blog documents X CE ẳ PtjQị log w2V PtjQị Pc tị 8ị Cross-entropy of approximation models in bits where V is the entire vocabulary of the top-ranked collection, w is the word element of the vocabulary V, PðtjQÞ is the relative frequency of term t in each TREC query topic, and Pc(t) is the relative frequency of term t in the entire 
Note that since P_c(t) is a fixed reference distribution, it makes perfect sense to refer to cross-entropy and KL-divergence as identical in our case. Both cross-entropy and KL-divergence have minimal values when P(t|Q) = P_c(t) (Figure 3).

With the cross-entropy of a true relevant document shown as the horizontal line marked with "x", all the approximation models show minimum cross-entropy between 35 and 60 documents. Initially, all models compete within the top-10 documents, except the Language Model at around 12 documents. This implies that all the models have the tendency to retrieve query-relevant documents, but with slight performance variations in the top-ranked documents; the variations can also be observed upon calculating the Precision@K and the MAP. However, the linear relevance model has more pronounced cross-entropy values at the top-ranked documents. Moreover, it shows the absolute minimum cross-entropy compared to the other approximation models. It can also be observed that the cross-entropy of the linear relevance model flattens relatively more slowly than the others at around 70 to 90 documents. This is an indication that the linear relevance model approximation has more weight and more robust performance than the other approximations.

The evaluation shown above is query-dependent, but it is sufficient to show the general relationship shared by a blog document collection and a given query topic. However, to further demonstrate the effectiveness of our model, we need to show the relevance clarity C_R, which describes a strong relevance relationship between the blog documents retrieved by the three approximation models. Thus, using a document-dependent evaluation, we can assume that P(t|Q) and P(t|D) are the estimated query and document distributions, respectively (assuming P(t|D) has a linearly smoothed distribution).

Figure 3 Minimum cross-entropy between three approximation models and one ideal topic model

The relevance value of document D with respect to query Q can then be measured as the negative KL-divergence between the approximation models:

$$C_R = \sum_{w \in V} P(t \mid Q) \log P(t \mid D) - \sum_{w \in V} P(t \mid Q) \log P(t \mid Q) \qquad (9)$$

In order to show a strong relevance relationship among the approximation models, we select the top-20 relevant results of each model and compute their respective negative KL-divergence to indicate relevance clarity. The idea is to show whether each retrieved document (as per its rank) contains the same opinion target across the three models. Interestingly, the evaluation shows that the top-2 documents are likely to contain the same topic or opinion target. It also shows that the linear relevance model has more relevant documents at the top ranks. Although the proposed linear relevance model has a more desirable result compared to the other approximation models, we learned that document-dependent evaluation using negative KL-divergence gives more consistent results.

5.5 Evaluation on the opinion retrieval task

We perform opinion relevance experiments using TREC Blog 2006 and TREC Blog 2008 query topics on the TREC Blog 08 dataset, which is a continuation of the Blog 06 dataset with a wider scope of document collection. TREC Blog 2006 consists of 50 query topics (Topics 851–900), and TREC Blog 2008 consists of 150 query topics, which include the Blog 2006, Blog 2007 (Topics 901–950), and Blog 2008 (Topics 1001–1050) query topics. Generally, Blog 2006 query topics are good for showing the effectiveness of different models, while Blog 2008 query topics are used for performance testing.
We compared our results with the Blog06 and Blog08 best runs. We also compared our results with Blog08's KLEDocOpinTD and the Blog06 mixture relevance model proposed in [26]. We selected KLEDocOpinTD because it uses the description fields of the query topics for retrieval, which is also applicable to our model; the mixture relevance model was selected since it also tends to capture the information need using opinionated seed words.

Inspired by the evaluation results shown in Figure 4, we now focus our attention on producing context-dependent opinionated documents with high precision at the top-ranked documents. For this purpose, we use a re-ranking technique, initially retrieving the top 20 documents using the popular BM25 ranking model with only stop-word removal during indexing. Again, we use the empirical parameters k1 = 1.2 and b = 0.75 for BM25, as they have been reported to give acceptable retrieval results [47]. These top 20 documents are then re-ranked by our model. We use the general evaluation metrics for IR, which include MAP, R-Prec, and P@10.

A comparison of opinion MAPs with an increasing number of top-K blog documents is shown in Figure 5. Our linear relevance model shows improved performance over the Blog08 best run by more than 15% and significantly outperforms Blog08's KLEDocOpinTD. However, the performance of our model on Blog06 was significantly reduced; the performance is still encouraging given the improvement over the Blog06 best run, and it is also comparable to the mixture model proposed in [26]. We think the performance decrease may be due to the limited scope of the Blog06 documents.

The performance improvement in the precision and recall curves upon applying the re-ranking technique is shown in Figure 6. The Linear Relevance Model shows improved performance with our re-ranking technique. The result also complements our cross-entropy evaluation, with persistent improvements in the same order. This shows the effectiveness of our model, and hence suggests the importance of the re-ranking technique. Our model also shows improved performance without any re-ranking mechanism (i.e., when BM25 is not used at all). In Table 6 below, the best significant improvements are indicated with *. Our model also shows improvement in terms of R-prec in all cases. This shows the effectiveness of our model in retrieving contextually relevant opinions from blogs.

Figure 4 Negative KL-divergence between the three approximation models. Our model shows robust performance at the top-20 documents.

For both Blog06 and Blog08, all query topics were used, and more than 50% of the query topics received significant improvement in terms of MAP. Query topics that did not show improvements were observed to contain one or two terms in their respective predicate-argument structures. For example, topic 854 "What opinions readers have of Ann Coulter?" and topic 1050 "what are peoples' opinions of George Clooney?" performed poorly, since their predicate-argument structures derived only "Ann Coulter" and "George Clooney", respectively. We believe the search with such queries was limited to the distribution of only the two query terms that appear in subjective sentences. We also think this is why the mixture relevance model [26] has comparable performance with our model, since it used a query expansion technique to represent the information need.
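A sketch of the re-ranking step described above (our illustration; the BM25 retrieval itself is not shown, and `score_document` is assumed to wrap the Eq. 7 computation):

```python
def rerank(query, candidates, score_document):
    """Reorder BM25's top-20 candidates by the linear relevance score R_L.

    `candidates` is the BM25 top-20 list for `query`;
    `score_document(query, doc)` returns the Eq. 7 score for one document.
    """
    return sorted(candidates,
                  key=lambda doc: score_document(query, doc),
                  reverse=True)
```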
However, care must be taken to ensure that the query expansion process does not bias the original natural language query away from the intended opinionated information need; we think such a process could be very difficult to evaluate.

Figure 5 a MAP comparisons for increasing top-K documents on Blog08. b MAP comparisons for increasing top-K documents on Blog06.

Figure 6 a Precision and recall curves on Blog08. b Precision and recall curves on Blog06.

5.6 Evaluating model fitness using AIC model selection

IR models may suffer from overfitting¹⁶ if there is no proper evaluation to measure the fitness of such models. We evaluate the fitness of our linear relevance model in order to show sufficient evidence that the distribution of the model can minimize the KL information¹⁷ considerably, and that our topic-relevance experimental results are not due to overfitting. For this purpose, we used the Akaike Information Criterion (AIC) to perform model selection between BM25, the language model, and our linear relevance model. AIC is effective in selecting the best approximating model from finite samples by minimizing the KL information through parameter estimation [60–63]. Thus, we are able to understand the approximating capability of the linear relevance model by simply computing and comparing the AIC values for each of the participating models (Tables 7 and 8). Intuitively, the model with the least AIC value is believed to be the best-fitted model for the opinion relevance retrieval task [2, 8, 9]. Moreover, AIC has strong theoretical foundations, built over the KL distance [6], for selecting an appropriate approximating model for statistical inference from many types of empirical data [2, 3, 8, 9]. We show AIC's criterion for estimating the expected relative distance between models as follows:

$$AIC = -2\log\big(L(\hat{\theta} \mid y)\big) + 2K \qquad (10)$$

where log(L(θ̂|y)) is the numerical value of the log-likelihood at its maximum point, and K is the number of estimable parameters in each participating model [9].

Again, for consistency, we assume the empirical parameters k1 = 1.2 and b = 0.75 for BM25 and μ = 2000 for the Language Model with Dirichlet prior smoothing. Using BM25 and the Language Model on the Terrier platform, we retrieved blog documents for a single query topic, "opinions expressed about the rumor that Google is planning a desktop operating system named Goobuntu". In Table 7, the MAP values 0.4650 and 0.4371, computed using the standard BM25 and LM respectively, were used as the estimated likelihood values.

¹⁶ Error or bias that randomly amounts to significant performance of a model.
¹⁷ A measure of difference between two probability distributions.
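As a worked sketch of Eq. 10 and the Akaike weights used below (our illustration, not the authors' code: the parameter counts K are taken from Table 7's parameter lists, and base-10 logarithms reproduce the AIC values reported in Table 8):

```python
import math

# (estimated likelihood, number of estimable parameters K) per model
models = {
    "linear relevance": (0.6689, 0),
    "BM25":             (0.4650, 2),   # k1, b
    "LM":               (0.4371, 1),   # mu
}

aic = {m: -2 * math.log10(L) + 2 * K for m, (L, K) in models.items()}
aic_min = min(aic.values())
delta = {m: a - aic_min for m, a in aic.items()}

# Akaike weights: normalized relative likelihoods exp(-0.5 * delta_i)
rel = {m: math.exp(-0.5 * d) for m, d in delta.items()}
total = sum(rel.values())
weights = {m: r / total for m, r in rel.items()}

for m in models:
    print(f"{m}: AIC={aic[m]:.6f} delta={delta[m]:.6f} w={weights[m]:.6f}")
# evidence ratio of the best model over BM25:
print(weights["linear relevance"] / weights["BM25"])  # ~8.7
```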
Table 6 Performance and improvements on opinion retrieval

Dataset | Model | MAP | R-Prec | P@10
Blog08 | Best run at Blog08 | 0.5610 | 0.6140 | 0.8980
Blog08 | Linear Relevance Model re-ranked | 0.6485 * | 0.7324 * | 0.9231
Blog08 | Improvement over Best run at Blog08 | 15.6% | 19.3% | 2.8%
Blog08 | Linear Relevance Model without re-ranking | 0.6369 * | 0.6894 * | 0.9317
Blog08 | Improvement over Best run at Blog08 | 13.5% | 12.3% | 3.8%
Blog08 | Blog 08's KLEDocOpinTD | 0.3937 | 0.4231 | 0.6273
Blog08 | Improvement with Linear Relevance Model re-ranked | 64.7% | 73.1% | 47.2%
Blog06 | Best run at Blog06 | 0.2052 | 0.2881 | 0.4680
Blog06 | Linear Relevance Model re-ranked | 0.3152 * | 0.3564 * | 0.5639 *
Blog06 | Improvement over Best run at Blog06 | 53.6% | 23.7% | 20.5%
Blog06 | Linear Relevance Model without re-ranking | 0.2930 * | 0.3172 * | 0.4830
Blog06 | Improvement over Best run at Blog06 | 42.8% | 10.1% | 3.2%
Blog06 | Mixture relevance model [26] | 0.3147 | 0.3546 | 0.5670
Blog06 | Improvement with Linear Relevance Model re-ranked | 0.2% | 0.5% | 0%

The estimated likelihood for the linear relevance model is computed as 0.6689, based on 1000 relevant blog documents identified by human judges. Indeed, the 0.6689 estimated likelihood value shows the effectiveness of the proposed linear relevance model. In comparison to the MAP of the actual linear relevance model reported in Table 6, the MAP value of relevant blog documents based on human judgment only exceeds the actual linear relevance model by approximately 5%. This shows that our model can retrieve opinionated and relevant documents with almost the same effectiveness shown by the human judges.

In Table 8, we compute the strength of evidence, or evidence ratio w_i/w_j, as a rule of thumb [2, 8, 9]. This identifies the best approximating model, given the data and the candidate models. The strength of evidence for the linear relevance model against BM25 and then LM can be computed as 0.703542/0.081306 ≈ 8.7 and 0.703542/0.215152 ≈ 3.3, respectively. The strength of evidence for the linear relevance model is stronger against LM and BM25, although at a reasonable level. In addition, a candidate model with w_i less than 10% of the highest cannot be considered a good approximating model [2, 8, 9]; in this case, 10% of the highest w_i (i.e., 0.703542 × 0.10) is 0.0703542. Interestingly, all three participating models have w_i greater than 10% of w_1, but the proposed linear relevance model leads the 10% value by a considerable margin. Surprisingly, BM25 appears to have a fairly bad result in terms of model fitness compared to LM; we suggest this could be a result of the frequency of query words upon which it is built. We also noticed that parameterization has a huge impact on the results: a more highly parameterized model is likely to reduce the performance and robustness of the model. Therefore, with sufficient evidence, we conclude that the linear relevance model is indeed a good and fit model for the relevance and opinion retrieval task.

Table 7 Parameters and likelihood values for AIC candidate models

Model | Parameters | No. of parameters | Likelihood
Linear relevance model | none | 0 | 0.6689
BM25 | k1 = 1.2, b = 0.75 | 2 | 0.4650
LM | μ = 2000 | 1 | 0.4371

Table 8 AIC values for the candidate models, derived based on [8, 9] with the appropriate formulas shown in the footnotes below. The least AIC (or AICc) and the highest w_i values respectively denote the best approximating model [8, 9].

Model | K | AIC (a) | AICc (b) | Δi (c) | wi (d)
Linear relevance model | 0 | 0.349277 | 0.001350 | 0 | 0.703542
BM25 | 2 | 4.665094 | 0.020727 | 4.315817 | 0.081306
LM | 1 | 2.718838 | 0.008736 | 2.369561 | 0.215152

a The formula is shown in Eq. 10.
b This is optional; AIC requires a bias adjustment if n/K < 40 [9], thus AICc = AIC + (2K(K + 1))/(n − K − 1), where n = number of observations and K = number of parameters.
c Differences between AICs, that is, Δi = AICi − AICmin [9].
d The Akaike weight is the normalized relative likelihood, that is, wi = exp(−0.5Δi) / Σ_{r=1}^{R} exp(−0.5Δr); the denominator is the sum over all the differences [9].
Therefore, with sufficient evidence, we conclude that the linear relevance model is indeed a good and well-fitted model for the relevance and opinion retrieval task.

Discussion and limitations

It is worth mentioning that the performance of the linear relevance model may depend on the accuracy of the chosen syntactic parser from which the predicate-argument structures are derived. We might also expect the linear relevance model to perform better if the syntactic parser could be trained on blog documents. The log-linear syntactic parser on which we based our experiments was trained on CCGBank (a consistent conversion of the WSJ Penn Treebank newspaper texts). One thing we have learnt from our evaluations and experiments is that the improved performance over blog documents suggests the log-linear syntactic parser could just as well be trained on blog texts rather than the newspaper texts available in the Penn Treebank and CCGBank. Perhaps not surprisingly, newspaper articles share certain lexical features with blog documents in terms of vocabulary. On the other hand, the training process may be computationally intensive and more demanding, as the contents of blog documents can be very dynamic. In any case, using a syntactic parser to derive predicate-argument structures for opinion retrieval from blog documents gave sufficient accuracy in our experiments. Our feeling is that future improvements to this or similar syntactic parsers would also boost the performance of our linear relevance model in retrieving more accurate contextual opinions.

The linear relevance model also offers the flexibility to expand the scope of the opinion relevance function. In this study, we have focused only on the semantic similarity between pairs of predicate-argument structures for the query and the sentences. We believe other features can be introduced to boost the performance of the model. For example, as part of our ongoing work, we plan to integrate a background model that considers the influence of multiple sentences on the relevance of a single sentence [20]. We believe the relevance of a sentence to a query topic may not only be determined by pair-wise semantic comparison between the predicate-argument structures of the query topic and each given sentence; it may also involve multi-sentence dependencies, whereby the contextual relevance of a given sentence to the query topic is enriched by implicit relevance from other sentences in the document.

Conclusions and future work

We investigated the possibility of using predicate-argument structures for contextual opinion retrieval from blogs. We proposed using semantically related predicate-argument structures of natural language queries and well-formed sentences to identify relevant and subjective sentences and, consequently, to retrieve relevant and opinionated blog documents. The predicate-argument structures were derived from the output parse trees of a log-linear CCG parser. As part of our study, we developed a transformed term similarity model based on the semantic similarity between a pair of predicate-argument structures for a query topic and each given sentence. We computed the subjectivity score of each sentence using the popular MPQA subjectivity annotation scheme. We then linearly combined the transformed term similarity score, a long-range dependency relevance score, and the subjectivity score to form a more effective linear relevance model for contextual opinion retrieval from blogs.
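To make the shape of this scoring function concrete, here is a highly simplified, hypothetical sketch of such a linear combination. The component scorers below (Jaccard overlap between predicate-argument structures, a shared-predicate bonus, and a toy subjectivity lexicon) are illustrative stand-ins only; they are not the paper's transformed term similarity, long-range dependency, or MPQA-based components, and the weights are arbitrary placeholders.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PredArg:
    """A predicate-argument structure, e.g. derived from a CCG parse."""
    predicate: str
    arguments: tuple

# Toy subjectivity lexicon; the paper uses the MPQA annotation scheme instead.
SUBJECTIVE_WORDS = {"rumor", "great", "terrible", "love", "hate", "think"}

def term_similarity(q: PredArg, s: PredArg) -> float:
    # Illustrative stand-in: Jaccard overlap between the terms of the two
    # predicate-argument structures (the paper instead uses a transformed
    # term similarity built on semantic relatedness, e.g. synonyms/hyponyms).
    qa = {q.predicate, *q.arguments}
    sa = {s.predicate, *s.arguments}
    return len(qa & sa) / len(qa | sa)

def dependency_relevance(q: PredArg, s: PredArg) -> float:
    # Crude proxy for the paper's long-range dependency relevance score.
    return 1.0 if q.predicate == s.predicate else 0.0

def subjectivity(sentence: str) -> float:
    # Fraction of words found in the toy subjectivity lexicon.
    words = sentence.lower().split()
    return sum(w in SUBJECTIVE_WORDS for w in words) / max(len(words), 1)

def sentence_score(q: PredArg, s: PredArg, sentence: str,
                   alpha: float = 0.4, beta: float = 0.3,
                   gamma: float = 0.3) -> float:
    # Linear combination of the three component scores; the weights are
    # arbitrary placeholders, not values from the paper.
    return (alpha * term_similarity(q, s)
            + beta * dependency_relevance(q, s)
            + gamma * subjectivity(sentence))

query = PredArg("plan", ("Google", "desktop operating system"))
sent = PredArg("plan", ("Google", "Goobuntu"))
print(sentence_score(query, sent,
                     "I think the rumor that Google plans Goobuntu is great"))
```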
We also evaluated the topic relevance of our model against two other topic-relevance models. We performed opinion retrieval experiments using the TREC Blog 2006 and TREC Blog 2008 query topics on the TREC Blog08 dataset. We compared the linear relevance model against the standard TREC Blog08 best run and against Blog08's KLEDocOpinTD. Our model showed improvements over the TREC Blog08 and Blog06 best runs and over Blog08's KLEDocOpinTD, which does not consider the semantic similarity between grammatical derivations of the query topic and sentences. This improvement shows the significance of our approach and, indeed, the viability of using predicate-argument structures for contextual opinion retrieval from blogs.

There are some significant outcomes of our study. We observed that relevant documents contain not only query words but also sentences that lexically combine synonyms and hyponyms of query words, which are largely used to refer to an opinion target. This implies that the frequency of query words alone may not be helpful for contextual opinion retrieval. Our study also supports previous studies showing that less parameterized topic-relevance models perform better than highly parameterized models such as probabilistic and language models. According to the model fitness evaluation we performed, the linear relevance model, being a parameter-free model, showed substantial robustness compared to BM25 and LM.

For future work, we plan to integrate a background model, or multi-sentence contextual dependencies, into our model, whereby the underlying meaning derived from a particular sentence may depend on prior or preceding sentences in the same document. This could be intuitive for faceted opinion retrieval. We are also interested in developing an algorithm that can clearly identify and differentiate between explicit and implicit relevant sentences within an opinionated document. Such an algorithm would enable us to study the contextual relationship between sentences in the same document, and could be very useful for opinion retrieval and document summarization.
References

1. Agarwal, N., Liu, H.: Blogosphere: research issues, tools, and applications. SIGKDD Explor. Newsl. 10(1), 18–31 (2008)
2. Akaike, H.: Likelihood of a model and information criteria. J. Econometrics 16, 3–14 (1981)
3. Akaike, H.: Factor analysis and AIC. Psychometrika 52(3), 317–332 (1987)
4. Amati, G., Rijsbergen, C.J.V.: Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 20(4), 357–389 (2002)
5. Amati, G., Amodeo, G., Bianchi, M., Gaibisso, C., Gambosi, G.: A uniform theoretic approach to opinion and information retrieval. In: Armano, G., de Gemmis, M., Semeraro, G., Vargiu, E. (eds.) Intelligent Information Access, Studies in Computational Intelligence, vol. 301, pp. 83–108. Springer, Berlin/Heidelberg (2010)
6. Bermingham, A., Smeaton, A.F.: A study of inter-annotator agreement for opinion retrieval. In: Proc. of the 32nd international ACM SIGIR conference on Research and development in information retrieval, Boston, MA, USA (2009)
7. Boiy, E., Moens, M.-F.: A machine learning approach to sentiment analysis in multilingual Web texts. Inf. Retriev. 12(5), 526–558 (2009)
8. Bozdogan, H.: Model selection and Akaike's Information Criterion (AIC): the general theory and its analytical extensions. Psychometrika 52(3), 345–370 (1987)
9. Burnham, K.P., Anderson, D.R.: Model Selection and Multimodel Inference. Springer-Verlag, New York (2002)
10. Charniak, E.: A maximum-entropy-inspired parser. In: Proc. of the 1st North American chapter of the Association for Computational Linguistics conference, Seattle, Washington (2000)
11. Charniak, E.: Top-down nearly-context-sensitive parsing. In: Proc. of the 2010 Conference on Empirical Methods in Natural Language Processing, Cambridge, Massachusetts (2010)
12. Clark, S., Curran, J.R.: Wide-coverage efficient statistical parsing with CCG and log-linear models. Comput. Linguist. 33(4), 493–552 (2007)
13. Curran, J.R., Clark, S., Bos, J.: Linguistically motivated large-scale NLP with C&C and Boxer. In: Proc. of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, Prague, Czech Republic (2007)
14. Ding, X., Liu, B.: The utility of linguistic rules in opinion mining. In: Proc. of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, Amsterdam, The Netherlands (2007)
15. Ding, X., Liu, B., Yu, P.S.: A holistic lexicon-based approach to opinion mining. In: Proc. of the international conference on Web search and web data mining, Palo Alto, California, USA (2008)
16. Du, W., Tan, S.: An iterative reinforcement approach for fine-grained opinion mining. In: Proc. of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Boulder, Colorado (2009)
17. Duan, H., Hsu, B.-J.: Online spelling correction for query completion. In: Proc. of the 20th international conference on World Wide Web, Hyderabad, India (2011)
18. Dumais, S.T., Furnas, G.W., Landauer, T.K., Deerwester, S., Harshman, R.: Using latent semantic analysis to improve access to textual information. In: Proc. of the SIGCHI conference on Human factors in computing systems, Washington, D.C., USA (1988)
19. Esuli, A.: Automatic generation of lexical resources for opinion mining: models, algorithms and applications. SIGIR Forum 42(2), 105–106 (2008)
20. Fernández, R.T., Losada, D.E., Azzopardi, L.A.: Extending the language modeling framework for sentence retrieval to include local context. Information Retrieval, 1–35 (2010)
21. Gerani, S., Carman, M.J., Crestani, F.: Proximity-based opinion retrieval. In: Proc. of the 33rd annual international ACM SIGIR conference, Geneva, Switzerland (2010)
22. Gildea, D., Hockenmaier, J.: Identifying semantic roles using Combinatory Categorial Grammar. In: Proc. of the 2003 conference on Empirical methods in natural language processing, Sapporo, Japan (2003)
23. He, B., Macdonald, C., Ounis, I.: Ranking opinionated blog posts using OpinionFinder. In: Proc. of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, Singapore (2008)
24. Hiemstra, D.: Using language models for information retrieval. Centre for Telematics and Information Technology, The Netherlands (2000)
25. Hofmann, T.: Probabilistic latent semantic indexing. In: Proc. of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, Berkeley, California, USA (1999)
26. Huang, X., Croft, W.B.: A unified relevance model for opinion retrieval. In: Proc. of the 18th ACM conference on Information and knowledge management, Hong Kong, China (2009)
27. Huang, J., Efthimiadis, E.N.: Analyzing and evaluating query reformulation strategies in web search logs. In: Proc. of the 18th ACM conference on Information and knowledge management, Hong Kong, China (2009)
28. Javanmardi, S., Gao, J., Wang, K.: Optimizing two stage bigram language models for IR. In: Proc. of the 19th international conference on World Wide Web, Raleigh, North Carolina, USA (2010)
29. Jones, R., Rey, B., Madani, O., Greiner, W.: Generating query substitutions. In: Proc. of the 15th international conference on World Wide Web, Edinburgh, Scotland (2006)
30. Kanayama, H., Nasukawa, T.: Fully automatic lexicon expansion for domain-oriented sentiment analysis. In: Proc. of the 2006 Conference on Empirical Methods in Natural Language Processing, Sydney, Australia (2006)
31. Lavrenko, V., Croft, W.B.: Relevance based language models. In: Proc. of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, New Orleans, Louisiana, USA (2001)
32. Lee, Y., Jung, H.-y., Song, W., Lee, J.-H.: Mining the blogosphere for top news stories identification. In: Proc. of the 33rd international ACM SIGIR conference on Research and development in information retrieval, Geneva, Switzerland (2010)
33. Lee, S.-W., Lee, J.-T., Song, Y.-I., Rim, H.-C.: High precision opinion retrieval using sentiment-relevance flows. In: Proc. of the 33rd international ACM SIGIR conference on Research and development in information retrieval, Geneva, Switzerland (2010)
34. Leung, C., Chan, S., Chung, F.-l., Ngai, G.: A probabilistic rating inference framework for mining user preferences from reviews. World Wide Web 14(2), 187–215 (2011)
35. Liu, B.: Sentiment analysis and subjectivity. In: Handbook of Natural Language Processing, 2nd edn. (2010)
36. Lv, Y., Zhai, C.: A comparative study of methods for estimating query language models with pseudo feedback. In: Proc. of the 18th ACM conference on Information and knowledge management, Hong Kong, China (2009)
37. Macdonald, C., Santos, R.L.T., Ounis, I., Soboroff, I.: Blog track research at TREC. SIGIR Forum 44(1), 58–75 (2010)
38. Mukherjee, S., Ramakrishnan, I.V.: Automated semantic analysis of schematic data. World Wide Web 11(4), 427–464 (2008)
39. Müller, C., Gurevych, I.: Using Wikipedia and Wiktionary in domain-specific information retrieval. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G., Kurimo, M., Mandl, T., Peñas, A., Petras, V. (eds.) Evaluating Systems for Multilingual and Multimodal Information Access, Lecture Notes in Computer Science, vol. 5706, pp. 219–226. Springer, Berlin/Heidelberg (2009)
40. Munson, S.A., Resnick, P.: Presenting diverse political opinions: how and how much. In: Proc. of the 28th international conference on Human factors in computing systems, Atlanta, Georgia, USA (2010)
41. Nam, S.-H., Na, S.-H., Lee, Y., Lee, J.-H.: DiffPost: filtering non-relevant content based on content difference between two consecutive blog posts. In: Boughanem, M., Berrut, C., Mothe, J., Soule-Dupuy, C. (eds.) Advances in Information Retrieval, Lecture Notes in Computer Science, vol. 5478, pp. 791–795. Springer, Berlin/Heidelberg (2009)
42. Glance, N.S., Hurst, M., Tomokiyo, T.: BlogPulse: automated trend discovery for weblogs. In: WWW 2004 Workshop on the Weblogging Ecosystem (2004)
43. Ounis, I., Amati, G., Plachouras, V., He, B., Macdonald, C., Lioma, C.: Terrier: a high performance and scalable information retrieval platform. In: Proc. of the ACM SIGIR'06 Workshop on Open Source Information Retrieval (OSIR 2006), Seattle, Washington, USA (2006)
44. Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up?: sentiment classification using machine learning techniques. In: Proc. of the ACL-02 conference on Empirical methods in natural language processing, Philadelphia, USA (2002)
45. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Found. Trends Inf. Retr. 2(1–2), 1–135 (2008)
46. Rijsbergen, C.J.V.: A theoretical basis for the use of co-occurrence data in information retrieval. J. Doc. 33(2), 106–119 (1977)
47. Robertson, S., Zaragoza, H.: The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr. 3(4), 333–389 (2009)
48. Santos, R.L.T., He, B., Macdonald, C., Ounis, I.: Integrating proximity to subjective sentences for blog opinion retrieval. In: ECIR 2009, Advances in Information Retrieval, LNCS 5478, pp. 325–336 (2009)
49. Sarmento, L., Carvalho, P., Silva, M.J., de Oliveira, E.: Automatic creation of a reference corpus for political opinion mining in user-generated content. In: Proc. of the 1st international CIKM workshop on Topic-sentiment analysis for mass opinion, Hong Kong, China (2009)
50. Siersdorfer, S., Chelaru, S., San Pedro, J.: How useful are your comments? Analyzing and predicting YouTube comments and comment ratings. In: Proc. of the 19th International World Wide Web Conference, Raleigh, North Carolina, USA, pp. 891–900 (2010)
51. Steedman, M.: The Syntactic Process (Language, Speech, and Communication). The MIT Press (2000)
52. Surdeanu, M., Harabagiu, S., Williams, J., Aarseth, P.: Using predicate-argument structures for information extraction. In: Proc. of the 41st Annual Meeting on Association for Computational Linguistics, Sapporo, Japan (2003)
53. Tata, S., Patel, J.M.: Estimating the selectivity of tf-idf based cosine similarity predicates. SIGMOD Rec. 36(4), 75–80 (2007)
54. Thet, T.T., Na, J.-C., Khoo, C.S.G., Shakthikumar, S.: Sentiment analysis of movie reviews on discussion boards using a linguistic approach. In: Proc. of the 1st international CIKM workshop on Topic-sentiment analysis for mass opinion, Hong Kong, China (2009)
55. Tumasjan, A., Sprenger, T.O., Sandner, P.G., Welpe, I.M.: Predicting elections with Twitter: what 140 characters reveal about political sentiment. In: Proc. of the Fourth International AAAI Conference on Weblogs and Social Media (2010)
56. Zhang, W., Yu, C.: UIC at TREC 2006 Blog Track. In: TREC (2006)
57. Wiebe, J., Wilson, T., Cardie, C.: Annotating expressions of opinions and emotions in language. Lang. Res. Eval. 39(2/3), 165–210 (2005)
58. Wilson, T., Hoffmann, P., Somasundaran, S., Kessler, J., Wiebe, J., Choi, Y., Cardie, C., Riloff, E., Patwardhan, S.: OpinionFinder: a system for subjectivity analysis. In: Proc. of HLT/EMNLP on Interactive Demonstrations, Vancouver, British Columbia, Canada (2005)
59. Wilson, T., Wiebe, J., Hoffmann, P.: Recognizing contextual polarity: an exploration of features for phrase-level sentiment analysis. Comput. Linguist. 35(3), 399–433 (2009)
60. Xu, X., Liu, Y., Xu, H., Yu, X., Song, L., Guan, F., Peng, Z., Cheng, X.: ICTNET at Blog Track TREC 2009. In: TREC 2009 (2009)
61. Zafarani, R., Cole, W., Liu, H.: Sentiment propagation in social networks: a case study in LiveJournal. In: Chai, S.-K., Salerno, J., Mabry, P. (eds.) Advances in Social Computing, Lecture Notes in Computer Science, vol. 6007, pp. 413–420. Springer, Berlin (2010)
62. Zhai, C.: Statistical language models for information retrieval: a critical review. Found. Trends Inf. Retr. 2(3), 137–213 (2008)
63. Zhang, W., Yu, C., Meng, W.: Opinion retrieval from blogs. In: Proc. of the sixteenth ACM conference on Conference on information and knowledge management, Lisbon, Portugal (2007)
64. Zhang, R., Tran, T., Mao, Y.: Opinion helpfulness prediction in the presence of "words of few mouths". World Wide Web, 1–22 (2011)