Chapter 9

A SURVEY OF CLUSTERING ALGORITHMS FOR GRAPH DATA

Charu C. Aggarwal
IBM T. J. Watson Research Center, Hawthorne, NY 10532
charu@us.ibm.com

Haixun Wang
Microsoft Research Asia, Beijing, China 100190
haixunw@microsoft.com

From C.C. Aggarwal and H. Wang (eds.), Managing and Mining Graph Data, Advances in Database Systems 40, DOI 10.1007/978-1-4419-6045-0_9. © Springer Science+Business Media, LLC 2010.

Abstract: In this chapter, we will provide a survey of clustering algorithms for graph data. We will discuss the different categories of clustering algorithms and recent efforts to design clustering methods for various kinds of graphical data. Clustering algorithms are typically of two types. The first type consists of node clustering algorithms, in which we attempt to determine dense regions of the graph based on edge behavior. The second type consists of structural clustering algorithms, in which we attempt to cluster the different graphs based on overall structural behavior. We will also discuss the applicability of the approach to other kinds of data such as semi-structured data, and the utility of graph mining algorithms for such representations.

Keywords: Graph Clustering, Dense Subgraph Discovery

1. Introduction

Graph mining has been a popular area of research in recent years because of numerous applications in computational biology, software bug localization, and computer networking. In addition, many new kinds of data such as semi-structured data and XML [2] can typically be represented as graphs. In particular, XML is a popular representation for many different kinds of data sets. Since core graph-mining algorithms can be extended to this scenario, it follows that the extension of mining algorithms to graphs has tremendous applicability to a wide variety of data sets which are represented as semi-structured data. Many traditional algorithms such as clustering, classification, and frequent-pattern mining have been extended to the graph scenario. A detailed discussion of various kinds of graph mining algorithms may be found in [15].

In this chapter, we will study the clustering problem for the graph domain. The problem of clustering is defined as follows: for a given set of objects, we would like to divide it into groups of similar objects, where the similarity between objects is typically defined with the use of a mathematical objective function. Clustering is useful in a number of practical applications such as marketing, customer segmentation, and data summarization, and it is an important problem in a number of data domains. A detailed description of clustering algorithms may be found in [24].

Clustering algorithms have significant applications in a variety of graph scenarios such as congestion detection, facility location, and XML data integration [28]. Graph clustering problems typically fall into two categories:

Node Clustering Algorithms: Node-clustering algorithms are generalizations of multi-dimensional clustering algorithms, in which we use functions of the multi-dimensional data points in order to define the distances. In the case of graph clustering algorithms, we associate numerical values with the edges. These numerical values need not satisfy traditional properties of distance functions such as the triangle inequality. We use these distance values in order to create clusters of nodes.
We note that the numerical value associated with a given edge may either be a distance value or a similarity value. Correspondingly, the objective function associated with the partitioning is either minimized or maximized, respectively. We note that the problem of minimizing the inter-cluster similarity for a fixed number of clusters essentially reduces to the problem of graph partitioning, or the minimum multi-way cut problem. This is also referred to as the problem of mining dense graphs and pseudo-cliques. Recently, the problem has also been studied in the database literature as that of quasi-clique determination. In this problem, we determine groups of nodes which are "almost cliques"; in other words, an edge exists between any pair of nodes in the set with high probability. A closely related problem is that of determining shingles [5, 22]. Shingles are defined as those sub-graphs which have a large number of common links. This is particularly useful for massive graphs which contain a large number of nodes. In such cases, a min-hash approach [5] can be used in order to summarize the structural behavior of the underlying graph.

Graph Clustering Algorithms: In this case, we have a (possibly large) number of graphs which need to be clustered based on their underlying structural behavior. This problem is challenging because of the need to match the structures of the underlying graphs, and to use these structures for clustering purposes. Such algorithms are discussed both in the context of classical graph data sets as well as semi-structured data. In the case of semi-structured data, the problem arises in the context of a large number of documents which need to be clustered on the basis of the underlying structure and attributes. It has been shown in [2] that the use of the underlying document structure leads to significantly more effective algorithms.

This chapter is organized as follows. In the next section, we will discuss a variety of node clustering algorithms. Methods for clustering multiple graphs and XML records are discussed in section 3. Section 4 discusses numerous applications of graph clustering algorithms. Section 5 contains the conclusions and summary.

2. Node Clustering Algorithms

A number of algorithms for graph node clustering are discussed in [19], where the graph clustering problem is related to the minimum cut and graph partitioning problems. In this case, it is assumed that the underlying graphs have weights on the edges, and it is desired to partition the graph in such a way as to minimize the weights of the edges across the partitions. In general, we would like to partition the graph into 𝑘 groups of nodes. However, the special case 𝑘 = 2 deserves a separate discussion, since it is efficiently solvable: it is polynomially solvable as the mathematical dual of the maximum flow problem. This special case is also referred to as the minimum-cut problem.

2.1 The Minimum Cut Problem

The simplest case is the 2-way minimum cut problem, in which we wish to partition the graph into two clusters so as to minimize the weight of the edges across the partitions. This version of the problem is efficiently solvable, and can be resolved by use of the maximum flow problem [4]. The minimum-cut problem is defined as follows. Consider a graph 𝐺 = (𝑁, 𝐴) with node set 𝑁 and edge set 𝐴. The node set 𝑁 contains the source 𝑠 and sink 𝑡.
Each edge (𝑖, 𝑗) ∈ 𝐴 has a weight associated with it which is denoted by u_{ij}. We note that the edges may be either undirected or directed, though the undirected case is often much more relevant for connectivity applications. We would like to partition the node set 𝑁 into two groups 𝑆 and 𝑁 − 𝑆. The set of edges such that one end lies in 𝑆 and the other lies in 𝑁 − 𝑆 is denoted by 𝐶(𝑆, 𝑁 − 𝑆). We would like to partition the node set 𝑁 into two sets 𝑆 and 𝑁 − 𝑆, such that the sum of the weights in 𝐶(𝑆, 𝑁 − 𝑆) is minimized. In other words, we would like to minimize

    \sum_{(i,j) \in C(S, N-S)} u_{ij}

This is the unrestricted version of the minimum-cut problem. We will examine two variations of the minimum-cut problem:

- We wish to determine the global minimum cut, with no restrictions on the membership of nodes to the different partitions.

- We wish to determine the minimum 𝑠-𝑡 cut, in which one partition contains the source node 𝑠 and the other partition contains the sink node 𝑡.

It is easy to see that the former problem can be solved by repeated applications of the latter algorithm: by fixing 𝑠 and choosing different values of the sink 𝑡, it can be shown that the global minimum cut may be effectively determined.

It turns out that the maximum flow problem is the mathematical dual of the minimum cut problem. In the maximum-flow problem, we assume that the weight u_{ij} is a capacity of the edge (𝑖, 𝑗). Each edge is allowed to carry a flow x_{ij} which is at most equal to the capacity u_{ij}. Each node other than the source 𝑠 and sink 𝑡 is assumed to satisfy the flow conservation property. In other words, for each such node 𝑖 ∈ 𝑁 we have:

    \sum_{j:(i,j) \in A} x_{ij} = \sum_{j:(j,i) \in A} x_{ji}          (9.1)

We would like to maximize the total flow originating from the source and reaching the sink 𝑡, subject to the above constraints. The maximum flow problem is solved with the use of a variety of augmenting-path and preflow-push algorithms [4]. In augmenting-path methods, we pick a path from 𝑠 to 𝑡 which has unused capacity, and increase the flow on this path such that at least one edge on the path is filled to capacity. We repeat this process until no path with unfilled capacity exists from source 𝑠 to sink 𝑡. Many different variations of this technique exist in terms of the choice of the path used to augment the flow from source 𝑠 to sink 𝑡; examples include shortest-path and maximum-capacity augmenting paths. Different choices of augmenting paths will typically lead to different trade-offs in running time. These trade-offs are discussed in [4]. In general, the two-way cut problem can be solved quite efficiently in polynomial time with these different methods.

It can be shown that the minimum cut may be determined by finding all nodes 𝑆 which are reachable from 𝑠 by some path of unfilled capacity. We note that 𝑆 will not contain the sink node 𝑡 at maximum flow, since the sink is not reachable from the source with the use of a path of unfilled capacity. The set 𝐶(𝑆, 𝑁 − 𝑆) is the minimum 𝑠-𝑡 cut: every edge in this set is saturated, and the total flow across the cut is equal to the 𝑠-𝑡 maximum flow. We can then determine the global minimum cut by fixing the source 𝑠 and varying the sink node 𝑡. The minimum cut over all these different possibilities will provide us with the global minimum-cut value.
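The following is a minimal sketch (not from the chapter) of the augmenting-path idea in Python. It uses breadth-first search to find augmenting paths (the Edmonds-Karp variant) and then recovers the cut side 𝑆 as the set of nodes still reachable from 𝑠 along edges of unfilled capacity, exactly as described above. The function name `max_flow_min_cut` and the dictionary-based graph representation are illustrative choices, not from the source.

```python
from collections import deque, defaultdict

def max_flow_min_cut(capacity, s, t):
    """Edmonds-Karp: repeatedly augment along a shortest unsaturated path.
    capacity: dict mapping directed arc (i, j) -> u_ij.
    Returns (maximum flow value, cut side S)."""
    flow = defaultdict(int)
    adj = defaultdict(set)
    for (i, j) in capacity:
        adj[i].add(j)
        adj[j].add(i)              # residual arcs go both ways

    def residual(i, j):
        # unused forward capacity plus cancelable reverse flow
        return capacity.get((i, j), 0) - flow[(i, j)] + flow[(j, i)]

    def bfs_path():
        parent = {s: None}
        q = deque([s])
        while q:
            i = q.popleft()
            for j in adj[i]:
                if j not in parent and residual(i, j) > 0:
                    parent[j] = i
                    if j == t:
                        return parent
                    q.append(j)
        return None                # no augmenting path: flow is maximum

    value = 0
    while (parent := bfs_path()) is not None:
        path, j = [], t            # walk back from t to s
        while parent[j] is not None:
            path.append((parent[j], j))
            j = parent[j]
        delta = min(residual(i, j) for (i, j) in path)
        for (i, j) in path:
            back = min(delta, flow[(j, i)])   # cancel reverse flow first
            flow[(j, i)] -= back
            flow[(i, j)] += delta - back
        value += delta

    # S = nodes reachable from s through edges of unfilled capacity
    S, q = {s}, deque([s])
    while q:
        i = q.popleft()
        for j in adj[i]:
            if j not in S and residual(i, j) > 0:
                S.add(j)
                q.append(j)
    return value, S

# Example with hypothetical capacities u_ij: both source edges saturate,
# so the minimum s-t cut is C({s}, N - {s}) with weight 5.
cap = {('s', 'a'): 3, ('s', 'b'): 2, ('a', 'b'): 1,
       ('a', 't'): 2, ('b', 't'): 3}
value, S = max_flow_min_cut(cap, 's', 't')
print(value, S)                    # 5 {'s'}
```

By duality, the returned flow value equals the weight of the minimum 𝑠-𝑡 cut 𝐶(𝑆, 𝑁 − 𝑆).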
A particularly important variant of the augmenting-path method is the shortest augmenting-path approach. In this approach, we always augment the maximum possible amount of flow from the source to the sink along the current shortest path. It can be shown that for a network containing 𝑛 nodes and 𝑚 edges, the shortest-path length is guaranteed to increase by at least one after O(𝑚) augmentations. Since the shortest path cannot be longer than 𝑛, it follows that the maximum number of augmentations is O(𝑛 ⋅ 𝑚). It is possible to implement each augmentation in O(log(𝑛)) time with the use of dynamic data structures. This implies that the overall technique requires at most O(𝑛 ⋅ 𝑚 ⋅ log(𝑛)) time.

A second class of algorithms which are often used to solve the maximum flow problem are preflow-push algorithms, which do not maintain the flow conservation constraints in their intermediate solutions. Rather, an excess flow is maintained at each node, and we try to push as much of this flow as possible along any edge on the shortest path from the source to the sink. A detailed discussion of preflow-push methods is beyond the scope of this chapter, and may be found in [4]. Most maximum flow methods require at least Ω(𝑛 ⋅ 𝑚) time, where 𝑛 is the number of nodes and 𝑚 is the number of edges.

A problem closely related to the minimum 𝑠-𝑡 cut problem is that of determining a global minimum cut in an undirected graph. This particular case can be solved more efficiently than by reduction to the 𝑠-𝑡 minimum cut. One way of determining a minimum cut is by using a contraction-based edge-sampling approach. While the previous technique is applicable to both the directed and undirected versions of the problem, the contraction-based approach is applicable only to the undirected version, and furthermore only to the case in which the weight of each edge is u_{ij} = 1. While the method can easily be extended to the weighted version by varying the edge-sampling probability, the polynomial running time bounds discussed in [37] do not apply to that case.

The contraction approach is a probabilistic technique in which we successively sample edges in order to collapse nodes into larger sets of nodes. By successively sampling different sequences of edges and picking the optimum value [37], it is possible to determine a global minimum cut. The broad idea of the contraction-based approach is as follows. We pick an edge randomly in the graph, and contract its two end points into a single node. We remove all self-loops which are created as a result of the contraction. We may also create some parallel edges, which are allowed to remain, since they influence the sampling probability of contractions. (Alternatively, we may replace parallel edges by a single edge whose weight equals the number of parallel edges, and use this weight to bias the sampling process.) The process of contraction is repeated until we are left with two nodes. Each of this pair of "super-nodes" corresponds to a set of nodes in the original data, and these two sets of nodes provide us with the final minimum cut. We note that the minimum cut will survive in this approach if none of the edges in the minimum cut is sampled during the contraction. An immediate observation is that cuts with a smaller number of edges are more likely to survive this approach, because the edges in cuts which contain a large number of edges are much more likely to be sampled.
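A minimal sketch of one trial of the contraction procedure, assuming a connected, unweighted multigraph stored as a list of edges. The union-find bookkeeping and the names `karger_min_cut_trial` and `karger_min_cut` are illustrative, not from [37].

```python
import random

def karger_min_cut_trial(edges, n):
    """One randomized contraction trial on a connected multigraph.
    edges: list of (u, v) pairs with nodes 0..n-1. Returns the size of the
    cut found by this trial (the true minimum only with some probability)."""
    # union-find tracks which super-node each original node belongs to
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    remaining = n
    while remaining > 2:
        u, v = random.choice(edges)
        ru, rv = find(u), find(v)
        if ru != rv:                # contract: merge the two super-nodes
            parent[ru] = rv
            remaining -= 1
        # self-loops (ru == rv) are skipped; parallel edges stay in the
        # list, so they keep their proportional sampling probability

    # edges whose endpoints lie in different super-nodes form the cut
    return sum(1 for (u, v) in edges if find(u) != find(v))

def karger_min_cut(edges, n, trials):
    """Repeat the trial and keep the best cut; roughly n*(n-1)/2 trials
    give a constant success probability, as derived below."""
    return min(karger_min_cut_trial(edges, n) for _ in range(trials))

# Example: two triangles joined by one bridge; the minimum cut has size 1.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(karger_min_cut(edges, 6, trials=30))   # 1 with high probability
```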
One of the key observations in [37] is the following:

Lemma 9.1. When a graph containing 𝑛 nodes is contracted down to 𝑡 nodes, the probability that the minimum cut survives the contraction is at least 𝑡 ⋅ (𝑡 − 1)/(𝑛 ⋅ (𝑛 − 1)) = Ω(𝑡²/𝑛²).

Proof: Let the minimum cut have 𝑘 edges. Then, each vertex must have degree at least 𝑘, and therefore the graph must contain at least 𝑛 ⋅ 𝑘/2 edges. The probability that the minimum cut survives the first contraction is therefore 1 − 𝑘/(#Edges) ≥ 1 − 2/𝑛, where the inequality follows by substituting the lower bound of 𝑛 ⋅ 𝑘/2 for the number of edges. Similarly, in the second round of contractions, the probability of survival is at least 1 − 2/(𝑛 − 1). Therefore, the overall probability p_s of survival is given by:

    p_s = \prod_{i=0}^{n-t-1} \left( 1 - \frac{2}{n-i} \right) = \frac{t \cdot (t-1)}{n \cdot (n-1)}          (9.2)

This provides the result. □

Thus, if we contract down to two nodes, the probability that the minimum cut survives is 2/(𝑛 ⋅ (𝑛 − 1)). By repeating the process 𝑛 ⋅ (𝑛 − 1)/2 times, we can show that the probability that the minimum cut survives in at least one repetition is at least 1 − 1/𝑒. If we further scale up the number of repetitions by a constant factor 𝐶 > 1, the probability of survival becomes 1 − (1/𝑒)^𝐶. By picking 𝐶 = log(1/𝛿), we can ensure that the cut survives with probability at least 1 − 𝛿, where 𝛿 ≪ 1. This logarithmic relationship ensures that we can determine minimum cuts with very high probability at small additional cost. An additional implication of Lemma 9.1 is that the total number of distinct minimum cuts is bounded above by 𝑛 ⋅ (𝑛 − 1)/2: the probability that any particular minimum cut survives is at least 2/(𝑛 ⋅ (𝑛 − 1)), and the probability that some minimum cut survives cannot be greater than 1.

Another observation is that the probability of survival of the minimum cut is largest in the first iteration, and it reduces in successive iterations. For example, in the first iteration, the probability of survival is 1 − (2/𝑛), but the probability of survival in the last iteration is only 1/3. Thus, most of the errors are caused in the last few iterations. This is particularly reflected in the cumulative error across many iterations, since the probability of maintaining the correct cut when contracting down to 𝑡 nodes is Θ(𝑡²/𝑛²), whereas the probability of maintaining the correct cut in the remaining contractions (from 𝑡 nodes down to two) is only Θ(1/𝑡²).

Therefore, a natural solution is to use a two-phase approach. In the first phase, we do not contract down to two nodes, but only down to 𝑡 nodes. The probability of maintaining the correct cut in one such run is at least Ω(𝑡²/𝑛²), so O(𝑛²/𝑡²) runs of the contraction procedure are required to retain the minimum cut with constant probability. Since each contraction requires O(𝑛) time, the running time of the first phase is O(𝑛³/𝑡²). In the second phase, we use a standard maximum-flow-based method in order to determine the minimum cut of the contracted graph. This maximum flow problem needs to be solved 𝑡 times, for a fixed source and different sinks; however, the base graph on which this is performed is much smaller, containing only O(𝑡) nodes. Each maximum flow problem requires O(𝑡³) time by using the method discussed in [8], and therefore the total time for all 𝑡 problems is O(𝑡⁴). The total running time is therefore O(𝑛³/𝑡² + 𝑡⁴); by picking 𝑡 = √𝑛, we obtain a running time of O(𝑛²). Thus, a two-phase approach obtains a much better running time than a single-phase contraction approach. The key idea behind this improvement is that, since most of the error probability is concentrated in the last contractions, it is better to stop the contraction process when the underlying graph is "small enough", and then use conventional algorithms to determine the minimum cut on the small graph. This combination approach is theoretically more efficient than any other known algorithm.
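A sketch of the two-phase recipe with 𝑡 = √𝑛, reusing the illustrative `max_flow_min_cut` routine from the earlier sketch for the second phase; all names, the trial count, and the edge-list representation are assumptions for illustration, not the exact construction of [37].

```python
import math
import random

def contract_to(edges, n, t):
    """Randomly contract a connected unweighted multigraph from n down to
    t super-nodes; returns (edge list relabeled over 0..t-1, t)."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    remaining = n
    while remaining > t:
        u, v = random.choice(edges)
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            remaining -= 1
    roots = sorted({find(u) for u in range(n)})
    label = {r: i for i, r in enumerate(roots)}
    small = [(label[find(u)], label[find(v)])
             for (u, v) in edges if find(u) != find(v)]   # drop self-loops
    return small, len(roots)

def two_phase_min_cut(edges, n, trials=None):
    """Phase 1: contract to t = sqrt(n) nodes, repeated O(n^2/t^2) times.
    Phase 2: exact minimum cut of each small graph via t max-flow
    computations (fixed source 0, every other node as sink)."""
    t = max(2, math.isqrt(n))
    trials = trials if trials is not None else (n * n) // (t * t) + 1
    best = float("inf")
    for _ in range(trials):
        small, k = contract_to(list(edges), n, t)
        capacity = {}
        for (u, v) in small:       # parallel edges become integer capacities,
            capacity[(u, v)] = capacity.get((u, v), 0) + 1   # both directions
            capacity[(v, u)] = capacity.get((v, u), 0) + 1
        for sink in range(1, k):
            value, _ = max_flow_min_cut(capacity, 0, sink)
            best = min(best, value)
    return best
```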
2.2 Multi-way Graph Partitioning

The multi-way graph partitioning problem is significantly more difficult, and is NP-hard [21]. In this case, we wish to partition the graph into 𝑘 > 2 components, so that the total weight of the edges whose ends lie in different partitions is minimized. A well-known technique for graph partitioning is the Kernighan-Lin algorithm [26]. This classical algorithm is based on a hill-climbing (or, more generally, neighborhood-search) technique for determining the optimal graph partitioning. Initially, we start off with a random cut of the graph. In each iteration, we exchange a pair of vertices in two partitions to see whether the overall cut value is reduced. If the cut value is reduced, then the interchange is performed; otherwise, we pick another pair of vertices for the interchange. This process is repeated until we converge to an optimal solution. We note that this optimum may not be a global optimum, but only a local optimum for the underlying data.

The main variation among different versions of the Kernighan-Lin algorithm is the policy used for performing the interchanges of the vertices. Some examples of interchange strategies are as follows:

- We randomly pick a pair of vertices and perform the interchange, if it improves the underlying solution quality.

- We test all possible vertex-pair interchanges (or a sample of possible interchanges), and pick the interchange which improves the solution by the greatest amount.

- A 𝑘-interchange is one in which a sequence of 𝑘 interchanges is performed at one time. We can test any 𝑘-interchange and perform it, if it improves the underlying solution quality.

- We can pick the optimal 𝑘-interchange from a sample of possibilities.

We note that the use of more sophisticated strategies allows a better improvement in the objective function per interchange, but also requires more time per interchange. For example, the determination of an optimal 𝑘-interchange requires much more time than a straightforward single interchange. This is a natural tradeoff, which may work out differently depending upon the nature of the application at hand. Furthermore, the choice of policy also affects the likelihood of getting stuck at a local optimum. For example, 𝑘-interchange techniques are far less likely to result in a local optimum for larger values of 𝑘. In fact, by choosing the best interchange across all possible values of 𝑘, it is possible to ensure that a global optimum is always reached. On the other hand, it becomes increasingly difficult to implement the algorithm efficiently with increasing values of 𝑘, because the time-complexity of the interchange increases exponentially with 𝑘. A detailed survey of methods for optimal graph partitioning may be found in [18].
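A minimal sketch of the simplest interchange policy above (accept any random vertex swap that reduces the cut), assuming an undirected weighted graph stored as a dict of dicts with comparable node labels; this is the hill-climbing skeleton, not the full Kernighan-Lin gain-ordering refinement, and all names are illustrative.

```python
import random

def cut_weight(graph, side):
    """Total weight of edges crossing the two-way partition.
    graph: {u: {v: w}} (symmetric); side: {node: 0 or 1}."""
    return sum(w for u, nbrs in graph.items()
                 for v, w in nbrs.items()
                 if u < v and side[u] != side[v])   # count each edge once

def interchange_partition(graph, max_failures=1000, seed=None):
    """Hill-climbing two-way partitioning: start from a random balanced
    cut, then repeatedly swap a random vertex pair across the partition
    whenever the swap reduces the cut weight. Stops after max_failures
    consecutive non-improving swaps, i.e. at a local optimum."""
    rng = random.Random(seed)
    nodes = list(graph)
    rng.shuffle(nodes)
    half = len(nodes) // 2
    side = {u: (0 if i < half else 1) for i, u in enumerate(nodes)}

    best = cut_weight(graph, side)
    failures = 0
    while failures < max_failures:
        u = rng.choice([x for x in nodes if side[x] == 0])
        v = rng.choice([x for x in nodes if side[x] == 1])
        side[u], side[v] = 1, 0            # tentative interchange
        candidate = cut_weight(graph, side)
        if candidate < best:
            best = candidate               # keep the improving swap
            failures = 0
        else:
            side[u], side[v] = 0, 1        # revert and try another pair
            failures += 1
    return side, best
```

Recomputing the full cut weight for each candidate swap keeps the sketch short; the practical Kernighan-Lin algorithm instead maintains incremental gain values per vertex, which is precisely the kind of per-interchange bookkeeping cost discussed in the tradeoff above.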
2.3 Conventional Generalizations and Network Structure Indices

Two well-known (and related) techniques for clustering in the context of multi-dimensional data [24] are the 𝑘-medoid and 𝑘-means algorithms. In the 𝑘-medoid algorithm (for multi-dimensional data), we sample a small number of points from the original data as seeds, and assign every other data point to the closest of these seeds, where closeness may be defined based on a user-defined objective function. The objective function for the clustering is defined as the sum of the distances of the data points to their assigned seeds. In each subsequent iteration, the algorithm exchanges one of the seeds for a randomly selected point from the data, and checks whether the quality of the objective function improves upon performing the interchange. If so, the interchange is accepted; otherwise, we do not accept it and try another sampled interchange. This process is repeated until the objective function does not improve over a pre-defined number of interchanges. A closely related method is the 𝑘-means method. The main difference from the 𝑘-medoid method is that we do not use representative points from the original data after the first iteration of picking the original seeds. In subsequent iterations, we use the centroid of each cluster as the seed set for the next iteration, and this process is repeated until the cluster membership stabilizes.

A method has been proposed in [35] which uses characteristics of both the 𝑘-means and 𝑘-medoid algorithms. (The method is presented in [35] as a generalization of the 𝑘-medoid algorithm, but it uses centrality notions, in the spirit of 𝑘-means, to determine subsequent seeds.) As in the case of the conventional partitioning algorithms, it picks 𝑘 graph nodes as seeds. The main differences from the conventional algorithms lie in the computation of distances (for assignment purposes) and in the determination of subsequent seeds. A natural distance function for graphs is the geodesic distance, i.e., the smallest number of hops between a pair of nodes. In order to determine the seed set for the next iteration, we compute the local closeness centrality [20] within each cluster, and use the most central node as that cluster's next seed. Thus, while this algorithm continues to use seeds from the original data set (as in the 𝑘-medoid algorithm), it uses intuitive ideas from the 𝑘-means algorithm in order to determine the identity of these seeds.

There are some subtle challenges in the use of the graphical versions of distance-based clustering algorithms. One challenge is that, since distances are integers, it is possible for data points to be equidistant from several seeds. While ties can be resolved by randomly selecting one of the best assignments, this may result in clusterings which do not converge. In order to handle this instability, a more relaxed criterion is imposed on the number of medoids which may change from iteration to iteration. Specifically, a clustering is considered stable when the change between iterations falls below a certain threshold (say, 1 to 3%).

Another challenge is that the computation of geodesic distances can be very expensive. The computational complexity of the all-pairs shortest paths algorithm can be O(𝑛³), where 𝑛 is the number of nodes, and even pre-storing all-pairs shortest paths requires O(𝑛²) space. This is not feasible in most practical scenarios, especially when the underlying graphs are large; even the space requirement alone may be impractical for very large graphs.
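A minimal sketch of the graph-partitioning algorithm described above, assuming an unweighted adjacency-list graph: breadth-first search supplies geodesic (hop-count) distances, and each cluster's next seed is the member minimizing the total geodesic distance to its cluster, used here as a simple stand-in for the local closeness centrality of [20]. All function names are illustrative.

```python
import random
from collections import deque

def bfs_hops(adj, source):
    """Geodesic (hop-count) distances from source; adj: {u: [neighbors]}."""
    dist = {source: 0}
    q = deque([source])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def graph_k_clusters(adj, k, iterations=10, seed=None):
    """k-medoid-style node clustering with centrality-based seed updates."""
    rng = random.Random(seed)
    seeds = rng.sample(list(adj), k)
    clusters = {}
    for _ in range(iterations):
        dist_from = {s: bfs_hops(adj, s) for s in seeds}
        # assignment step: each node joins its closest seed (ties broken
        # deterministically by seed order)
        clusters = {s: [] for s in seeds}
        for u in adj:
            best = min(seeds, key=lambda s: dist_from[s].get(u, float("inf")))
            clusters[best].append(u)
        # seed-update step: the member with the smallest total geodesic
        # distance to its cluster acts as a closeness-centrality surrogate
        new_seeds = []
        for s, members in clusters.items():
            def spread(c):
                d = bfs_hops(adj, c)
                return sum(d.get(u, len(adj)) for u in members)
            new_seeds.append(min(members, key=spread))
        if set(new_seeds) == set(seeds):
            break                      # membership has stabilized
        seeds = new_seeds
    return clusters
```

Note that every iteration runs a fresh BFS per seed and per candidate medoid; this repeated shortest-path computation is exactly the cost that motivates the network-structure indices discussed next.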
In order to handle such cases, the method in [36] uses the concept of network-structure indices, which can summarize the behavior of the network by using a randomized division into zones. In this case, the graph is divided into multiple zones, which form a connected, mutually exclusive, and exhaustive partitioning of the graph. The partitioning of the graph into zones is accomplished with the use of a competitive flooding algorithm. In this algorithm, we start off with randomly selected seed nodes, each labeled with its own zone identifier; we then repeatedly select an unlabeled neighbor of some currently labeled node, and assign it the same zone label as that node. This approach is repeated until all nodes have been labeled. We note that while this approach is extremely fast, it may sometimes result in zones which do not reflect locality well. In order to deal with this situation, we use multiple sets of randomly selected partitions, each of which is considered a dimension. Note that when we use multiple such random partitions, each node becomes distinguishable from other nodes by virtue of its memberships.

The distance between a node 𝑖 and a zone containing node 𝑗 is denoted by ZoneDistance(𝑖, zone(𝑗)), and is defined as the shortest-path distance between node 𝑖 and any node in the zone containing 𝑗. The distance between 𝑖 and 𝑗 along a particular zone partitioning (or dimension) is approximated as ZoneDistance(𝑖, zone(𝑗)) + ZoneDistance(𝑗, zone(𝑖)). This value is then averaged over all the sets of randomized partitions in order to provide better robustness. It has been shown in [36] that this approach approximates pairwise distances quite well. The key observation is that the values of ZoneDistance(𝑖, zone(𝑗)) can be pre-computed in 𝑛 ⋅ 𝑞 space, where 𝑞 is the number of zones; for a small number of zones, this is quite efficient. Upon using 𝑟 different sets of partitions, the overall space requirement is 𝑛 ⋅ 𝑞 ⋅ 𝑟, which is much smaller than the Ω(𝑛²) space requirement of all-pairs pre-computation for typical values of 𝑞 and 𝑟 as suggested in [35].

2.4 The Girvan-Newman Algorithm

The Girvan-Newman algorithm [23] is a divisive clustering algorithm based on the concept of edge betweenness centrality. Betweenness centrality attempts to identify the edges which form critical bridges between different regions of the graph, and to delete them until a natural set of clusters remains. Formally, the betweenness centrality of an edge is the proportion of shortest paths between node pairs which pass through that edge. Therefore, for a given edge 𝑒 and node pair (𝑖, 𝑗), we define the betweenness centrality 𝐵(𝑒) as follows:

    B(e) = \frac{\text{NumConstrainedPaths}(e, i, j)}{\text{NumShortPaths}(i, j)}          (9.3)
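A compact sketch of the divisive procedure, using the networkx library as an implementation convenience (not a dependency of the chapter): repeatedly delete the edge of highest betweenness centrality until the graph splits into the desired number of connected components.

```python
import networkx as nx

def girvan_newman_clusters(G, num_clusters):
    """Divisive clustering: iteratively delete the highest-betweenness edge.
    G: an undirected networkx Graph; returns a list of node sets."""
    H = G.copy()
    components = list(nx.connected_components(H))
    while len(components) < num_clusters and H.number_of_edges() > 0:
        # recompute betweenness after every deletion, since removing a
        # bridge redistributes the shortest paths among remaining edges
        betweenness = nx.edge_betweenness_centrality(H)
        u, v = max(betweenness, key=betweenness.get)
        H.remove_edge(u, v)
        components = list(nx.connected_components(H))
    return components

# Example: two triangles joined by a single bridge split cleanly in two.
G = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])
print(girvan_newman_clusters(G, 2))   # [{0, 1, 2}, {3, 4, 5}] (order may vary)
```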
