A Survey of Clustering Algorithms for Graph Data

is $\Phi = \sum_{i=1}^{k} \Delta(\mathcal{M}_i, \mathcal{S}_i)/k$. Similarly, let the average sub-structural self-similarity at the end of the previous iteration be $\Phi'$. At the beginning of the next iteration, the algorithm computes the increase in the average sub-structural self-similarity, $\Phi - \Phi'$, and checks whether it is smaller than a user-specified threshold $\epsilon$. If not, the algorithm proceeds with another iteration. Otherwise, the algorithm terminates. In addition, an upper bound on the number of iterations is imposed, so that the algorithm still terminates when the threshold $\epsilon$ is chosen to be too small.

Two further issues need to be addressed in order to use the underlying algorithm effectively:

We need effective methods for determining the similarity between a given document and a group of other documents. Techniques for computing this similarity are discussed in [2].

We need to determine frequent structural patterns in the underlying documents. This can be a huge challenge in many applications, especially since structural data is far more challenging to mine than transactional data. It has been shown in [2] how sequential pattern mining algorithms can be adapted to the case of structural data. The broad idea is to flatten out the tree structure into a sequential pattern by using a pre-order traversal. The clustering is then performed on the resulting sequential patterns. It has been shown [2] that such an approach is able to retain most of the structural information in the data, while introducing some spurious relations. The overall approach has been shown in [2] to be quite effective experimentally, and far more effective than competing techniques such as those discussed in [10, 29].

4. Applications of Graph Clustering Algorithms

Graph clustering algorithms find numerous applications in the literature.
As discussed in this chapter, graph clustering algorithms fall into the categories of node clustering and, more generally, object-based clustering algorithms. Object-based clustering algorithms are similar to general clustering algorithms in the literature, except that the records are the underlying graphs rather than standard multi-dimensional attributes. Such algorithms are useful in a number of data domains such as molecular biology, chemical graphs, and XML data. In general, any data domain which can represent the underlying records in terms of compact graphs can benefit from such algorithms.

Node clustering algorithms can be used for a variety of real applications such as facility location. These algorithms can also be used for clustering with arbitrary distance functions between groups of objects. They are more general than those used for clustering records with the use of multi-dimensional distance functions.

Node clustering algorithms are closely related to the problem of graph partitioning. These methods are particularly useful for applications which need to determine dense regions of the graphs. The determination of dense regions of the graph is closely related to the problem of graph summarization and dimensionality reduction. Dimensionality reduction can be used to represent graphs in a small space, so that they can be used effectively for indexing and retrieval. Furthermore, compressed graphs can be used in a variety of applications in which it is desirable to use the summary behavior in order to estimate the approximate structural properties of the network. These estimates can then be refined for more exact results at a later stage.
Some specific applications for which clustering algorithms may be leveraged are as follows:

4.1 Community Detection in Web Applications and Social Networks

Many web applications and social networks can be represented as massive graphs. For example, the structure of the web is itself a graph [22, 30, 34], in which nodes represent web pages and hyperlinks represent the edges. Similarly, social networks are graphs in which nodes represent the members of the social network, and the friendship relationships between members represent the corresponding links. Node clustering algorithms are a natural fit for community detection in massive graphs. The communities have natural interpretations in the context of a variety of web applications:

For the case of web applications such as web sites, communities typically refer to groups of closely linked pages. Such pages are typically linked because of common material in terms of topic, or similar interests in terms of readership.

For the case of social networks, communities refer to groups of members who may know each other very well, and may therefore be closely linked with one another. This is useful in determining important associations in the underlying social network.

Blogging communities often behave like social networks, and contain links between related blogs. The techniques discussed in this chapter are also useful for determining closely related blogs with the use of community detection methods.

Many of the node clustering applications discussed in this chapter are used in the context of social networks [22, 30, 34]. The min-hash approach [5, 22] is commonly used when the underlying graph is massive in nature, as in the case of the web. This is because the min-hash approach is able to summarize the graph in a very small amount of space.
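To illustrate why a min-hash summary is so compact, the following is a minimal sketch of generic min-hashing applied to the out-link sets of pages. It is not the specific shingling procedure of [5, 22]; the function names, the choice of 64 hash functions, and the toy graph are assumptions made for illustration. Each page is reduced to a fixed-length signature regardless of how many links it has, and the fraction of matching signature positions estimates the Jaccard similarity of two link sets.

```python
import hashlib


def _hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of an item under a given seed."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(neighbors, num_hashes=64):
    """Summarize a non-empty neighbor set by its minimum hash under each seed."""
    if not neighbors:
        raise ValueError("neighbor set must be non-empty")
    return [min(_hash(seed, str(n)) for n in neighbors)
            for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions on which two signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Hypothetical toy web graph: page -> set of pages it links to.
out_links = {
    "p1": {"a", "b", "c", "d"},
    "p2": {"a", "b", "c", "e"},   # shares 3 of 5 distinct targets with p1
    "p3": {"x", "y", "z"},        # shares nothing with p1
}

signatures = {page: minhash_signature(links)
              for page, links in out_links.items()}

print("p1 vs p2:", estimated_jaccard(signatures["p1"], signatures["p2"]))
print("p1 vs p3:", estimated_jaccard(signatures["p1"], signatures["p3"]))
```

Pages whose signatures agree in many positions link to largely the same targets, which is the kind of evidence a community detection method can work from without holding the full adjacency lists.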
The small size of the min-hash summary is very useful for practical applications in which it may not be possible to store the entire graph on disk. For example, the web graph is so large that it may not even be possible to store it on disk without adding storage to standard desktop hardware. Such situations lead to further constraints during the mining process, which are handled quite well by min-hash style approaches. This is because the min-hash summary is extremely small compared to the size of the graph itself. This compressed representation can even be maintained in main memory and used to determine the underlying communities in the network directly. It has been shown in [5, 22] that such an approach is able to determine communities of very high quality.

4.2 Telecommunication Networks

Large telecommunication companies may have millions of customers who may make billions of phone calls to one another over a period of time. In this case, the individual phone numbers may be represented as nodes, and phone calls may be represented as edges. In such cases, it may be desirable to determine groups of customers who call each other frequently. This information can be very useful for target marketing purposes. Furthermore, we note that the graphs in a telecommunication network are represented in the form of edge streams, since the edges may be received continuously over time. This results in even greater challenges from the point of view of analysis, since the edges cannot be explicitly stored on disk. The methods discussed in [22] are particularly useful in such scenarios.

4.3 Email Analysis

An interesting application in the context of the Enron crisis was to determine important email interactions between groups of Enron employees. In this case, the individuals are represented as nodes, and the emails sent between them are represented as edges. Node clustering algorithms are very useful for isolating dense email interactions between different groups of individuals.
This approach can be used for a variety of intelligence applications, such as determining suspicious communities in groups of interactions.

5. Conclusions and Future Research

In this chapter, we presented a review of the commonly known algorithms for clustering graph data. The problem of clustering graphs has been widely studied in the literature, because of its application to a variety of data mining and data management problems. Graph clustering algorithms are of two types:

Node Clustering Algorithms: In this case, we attempt to partition the graph into clusters, so that each cluster contains a group of nodes which are densely connected. These densely connected groups of nodes may often provide significant information about how the entities in the underlying graph are inter-connected with one another.

Graph Clustering Algorithms: In this case, we have complete graphs available, and we wish to determine the clusters with the use of the structural information in the underlying graphs. Such cases are often encountered with XML data, which is common in many real domains.

We provided an overview of the different clustering algorithms available, and the tradeoffs with the use of different methods. The major challenges that remain in the area of graph clustering are as follows:

Clustering Massive Data Sets: In some cases, the data sets containing the graphs may be so large that they can be held only on disk. For example, if we have a dense graph containing $10^7$ nodes, then the number of edges may be as high as $10^{13}$. In such cases, it may not even be possible to store the graph effectively on disk. In cases in which the graph can be stored on disk, it is critical that the algorithm be designed to take the disk-resident behavior of the underlying data into account.
This is especially challenging in the case of graph data sets, because the structural behavior of the graph interferes with our ability to process the edges sequentially for many applications. In cases in which the graph is too large to store on disk, it is essential to design summary structures which can effectively store the underlying structural behavior of the graph. This stored summary can then be used effectively for graph clustering algorithms.

Clustering Graph Streams: In this case, we have large graphs which are received as edge streams. Such graphs are more challenging, since a given edge cannot be processed more than once during the computation process. In such cases, summary structures need to be designed in order to facilitate an effective clustering process. These summary structures may be utilized in order to determine effective clusters in the underlying data. This approach is similar to the case discussed above in which the graph is too large to store on disk.

In addition, techniques need to be designed for interfacing clustering algorithms with traditional database management techniques. In order to achieve this goal, effective representations and query languages need to be designed for graph data. This is a new and emerging area of research, and can be leveraged to further improve the effectiveness of graph algorithms.

References

[1] J. Abello, M. G. Resende, S. Sudarsky. Massive quasi-clique detection. Proceedings of the 5th Latin American Symposium on Theoretical Informatics (LATIN), pp. 598-612, 2002.
[2] C. Aggarwal, N. Ta, J. Feng, J. Wang, M. J. Zaki. XProj: A Framework for Projected Structural Clustering of XML Documents, KDD Conference, 2007.
[3] R. Agrawal, A. Borgida, H.V. Jagadish. Efficient Maintenance of transitive relationships in large data and knowledge bases, ACM SIGMOD Conference, 1989.
[4] R. Ahuja, J. Orlin, T. Magnanti.
Network Flows: Theory, Algorithms, and Applications, Prentice Hall, Englewood Cliffs, NJ, 1992.
[5] A. Z. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. Syntactic clustering of the web, WWW Conference, Computer Networks, 29(8-13):1157-1166, 1997.
[6] D. Chakrabarti, Y. Zhan, C. Faloutsos. R-MAT: A Recursive Model for Graph Mining. SDM Conference, 2004.
[7] S.S. Chawathe. Comparing Hierarchical data in external memory. Very Large Data Bases Conference, 1999.
[8] J. Cheriyan, T. Hagerup, K. Mehlhorn. An $O(n^3)$-time maximum-flow algorithm, SIAM Journal on Computing, Volume 25, Issue 6, pp. 1144-1170, 1996.
[9] F. Chung. Spectral graph theory. Washington: Conference Board of the Mathematical Sciences, 1997.
[10] T. Dalamagas, T. Cheng, K. Winkel, T. Sellis. Clustering XML Documents Using Structural Summaries. Information Systems, Elsevier, January 2005.
[11] J. Cheng, J. Xu Yu, X. Lin, H. Wang, and P. S. Yu. Fast Computing Reachability Labelings for Large Graphs with High Compression Rate, EDBT Conference, 2008.
[12] J. Cheng, J. Xu Yu, X. Lin, H. Wang, and P. S. Yu. Fast Computation of Reachability Labelings in Large Graphs, EDBT Conference, 2006.
[13] E. Cohen. Size-estimation framework with applications to transitive closure and reachability, Journal of Computer and System Sciences, v.55 n.3, pp. 441-453, Dec. 1997.
[14] E. Cohen, E. Halperin, H. Kaplan, and U. Zwick. Reachability and distance queries via 2-hop labels, ACM Symposium on Discrete Algorithms, 2002.
[15] D. Cook, L. Holder. Mining Graph Data, John Wiley & Sons Inc, 2007.
[16] E. W. Dijkstra. A note on two problems in connection with graphs. Numerische Mathematik, 1 (1959), pp. 269-271.
[17] M. Faloutsos, P. Faloutsos, C. Faloutsos. On Power Law Relationships of the Internet Topology. SIGCOMM Conference, 1999.
[18] P. O.
Fjallstrom. Algorithms for Graph Partitioning: A Survey, Linkoping Electronic Articles in Computer and Information Science, Vol 3, no 10, 1998.
[19] G. Flake, R. Tarjan, M. Tsioutsiouliklis. Graph Clustering and Minimum Cut Trees, Internet Mathematics, 1(4), 385-408, 2003.
[20] L. Freeman. Centrality in Social Networks, Social Networks, 1, 215-239, 1979.
[21] M. S. Garey, D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-completeness, W. H. Freeman, 1979.
[22] D. Gibson, R. Kumar, A. Tomkins. Discovering Large Dense Subgraphs in Massive Graphs, VLDB Conference, 2005.
[23] M. Girvan, M. Newman. Community Structure in Social and Biological Networks, Proceedings of the National Academy of Sciences, 99, 7821-7826, 2002.
[24] A. Jain and R. Dubes. Algorithms for Clustering Data, Prentice Hall, New Jersey, 1998.
[25] H. Kashima, K. Tsuda, A. Inokuchi. Marginalized Kernels between Labeled Graphs, ICML, 2003.
[26] B.W. Kernighan, S. Lin. An efficient heuristic procedure for partitioning graphs, Bell System Tech. Journal, vol. 49, Feb. 1970, pp. 291-307.
[27] T. Kudo, E. Maeda, Y. Matsumoto. An Application of Boosting to Graph Classification, NIPS Conf. 2004.
[28] M. Lee, W. Hsu, L. Yang, X. Yang. XClust: Clustering XML Schemas for Effective Integration. ACM Conference on Information and Knowledge Management, 2002.
[29] W. Lian, D.W. Cheung, N. Mamoulis, S. Yiu. An Efficient and Scalable Algorithm for Clustering XML Documents by Structure, IEEE Transactions on Knowledge and Data Engineering, Vol 16, No. 1, 2004.
[30] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, E. Upfal. The Web as a Graph. ACM PODS Conference, 2000.
[31] M. Matsuda et al. Classifying molecular sequences using a linkage graph with their pairwise similarities. Theoretical Computer Science, 210(2):305-325, 1999.
[32] J. Pei, D. Jiang, A. Zhang. On Mining Cross-Graph Quasi-Cliques, ACM KDD Conference, 2005.
[33] J.
Pei, D. Jiang, A. Zhang. Mining Cross-Graph Quasi-Cliques in Gene Expression and Protein Interaction Data, ICDE Conference, 2005.
[34] S. Raghavan, H. Garcia-Molina. Representing web graphs. ICDE Conference, pages 405-416, 2003.
[35] M. Rattigan, M. Maier, D. Jensen. Graph Clustering with Network Structure Indices. ICML, 2007.
[36] M. Rattigan, M. Maier, D. Jensen. Using structure indices for approximation of network properties. ACM KDD Conference, 2006.
[37] A. A. Tsay, W. S. Lovejoy, David R. Karger. Random Sampling in Cut, Flow, and Network Design Problems, Mathematics of Operations Research, 24(2):383-413, 1999.
[38] H. Wang, H. He, J. Yang, J. Xu-Yu, P. Yu. Dual Labeling: Answering Graph Reachability Queries in Constant Time. ICDE Conference, 2006.
[39] X. Yan, J. Han. CloseGraph: Mining Closed Frequent Graph Patterns, ACM KDD Conference, 2003.
[40] X. Yan, H. Cheng, J. Han, and P. S. Yu. Mining Significant Graph Patterns by Scalable Leap Search, SIGMOD Conference, 2008.
[41] X. Yan, P. S. Yu, and J. Han. Graph Indexing: A Frequent Structure-based Approach, SIGMOD Conference, 2004.
[42] M. J. Zaki, C. C. Aggarwal. XRules: An Effective Structural Classifier for XML Data, KDD Conference, 2003.
[43] Z. Zeng, J. Wang, L. Zhou, G. Karypis. Out-of-core Coherent Closed Quasi-Clique Mining from Large Dense Graph Databases, ACM Transactions on Database Systems, Vol 31(2), 2007.

Chapter 10

A SURVEY OF ALGORITHMS FOR DENSE SUBGRAPH DISCOVERY

Victor E. Lee
Department of Computer Science
Kent State University
Kent, OH 44242
vlee@cs.kent.edu

Ning Ruan
Department of Computer Science
Kent State University
Kent, OH 44242
nruan@cs.kent.edu

Ruoming Jin
Department of Computer Science
Kent State University
Kent, OH 44242
jin@cs.kent.edu

Charu Aggarwal
IBM T.J. Watson Research Center
Yorktown Heights, NY 10598
charu@us.ibm.com

Abstract

In this chapter, we present a survey of algorithms for dense subgraph discovery.
The problem of dense subgraph discovery is closely related to clustering, though the two problems also have a number of differences. For example, the problem of clustering is largely concerned with finding a fixed partition in the data, whereas the problem of dense subgraph discovery defines these dense components in a much more flexible way. The problem of dense subgraph discovery may either be defined over single or multiple graphs. We explore both cases. In the latter case, the problem is also closely related to the problem of frequent subgraph discovery. This chapter discusses and organizes the literature on this topic in order to make it more accessible to the reader.

Keywords: Dense subgraph discovery, graph clustering

C.C. Aggarwal and H. Wang (eds.), Managing and Mining Graph Data, Advances in Database Systems 40, DOI 10.1007/978-1-4419-6045-0_10, © Springer Science+Business Media, LLC 2010

1. Introduction

In almost any network, density is an indication of importance. Just as someone reading a road map is interested in knowing the location of the larger cities and towns, investigators who seek information from abstract graphs are often interested in the dense components of the graph. Depending on what properties are being modeled by the graph's vertices and edges, dense regions may indicate high degrees of interaction, mutual similarity and hence collective characteristics, attractive forces, favorable environments, or critical mass.

From a theoretical perspective, dense regions have many interesting properties. Dense components naturally have small diameters (the diameter being the worst-case shortest path between any two members). Routing within these components is rapid. A simple strategy also exists for global routing. If most vertices belong to a dense component, only a few selected inter-hub links are needed to have a short average distance between any two arbitrary vertices in the entire network.
Commercial airlines employ this hub-based routing scheme. Dense regions are also robust, in the sense that many connections can be broken without splitting the component. A less well-known but equally important property of dense subgraphs comes from percolation theory. If a graph is sufficiently dense, or equivalently, if messages are forwarded from one node to its neighbors with higher than a certain probability, then there is a very high probability of propagating a message across the diameter of the graph [20]. This fact is useful in everything from epidemiology to marketing.

Not all graphs have dense components, however. A sparse graph may have few or none. In order to understand this issue, we first need to define a formal notion of the words 'dense' and 'sparse'. We will address this issue shortly. A uniform graph is either entirely dense or not dense at all. Uniform graphs, however, are rare, usually limited to either small or artificially created ones. Due to the usefulness of dense components, it is generally accepted that their existence is the rule rather than the exception in nature and in human-planned networks [39].

Dense components have been identified in, and have enhanced understanding of, many types of networks; among the best-known are social networks [53, 44], the World Wide Web [30, 17, 11], financial markets [5], and biological systems [26]. Much of the early motivation, research, and nomenclature regarding dense components was in the field of social network analysis. Even before the advent of computers, sociologists turned to graph theory to formulate models for the concept of social cohesion. Clique, $K$-core, $K$-plex, and $K$-club are metrics originally devised to measure social cohesiveness [53]. It is not surprising that we also see dense components in the World Wide Web. In many ways, the Web is simply a virtual implementation of traditional direct human-human social networks.
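Of the cohesion measures named above, the $K$-core is among the simplest to compute. The following is a minimal sketch, assuming the standard definition (not spelled out in the text so far) of a $K$-core as the largest subgraph in which every vertex has at least $K$ neighbors inside the subgraph. The function name `k_core` and the toy friendship graph are illustrative only. The sketch repeatedly removes vertices of degree less than $K$ until none remain.

```python
def k_core(adj, k):
    """Return the vertex set of the k-core of an undirected graph.

    adj maps each vertex to the set of its neighbors (symmetric).
    """
    # Work on a copy so the caller's graph is left untouched.
    remaining = {u: set(vs) for u, vs in adj.items()}
    changed = True
    while changed:
        changed = False
        low_degree = [u for u, vs in remaining.items() if len(vs) < k]
        for u in low_degree:
            # Remove u and delete it from each surviving neighbor's set.
            for v in remaining.pop(u):
                remaining[v].discard(u)
            changed = True
    return set(remaining)


# Hypothetical toy graph: a 4-clique {a, b, c, d} with a tail a-e-f.
friends = {
    "a": {"b", "c", "d", "e"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c"},
    "e": {"a", "f"},
    "f": {"e"},
}

print(sorted(k_core(friends, 3)))
```

In this toy graph the tail vertices are peeled away and only the tightly knit group survives for $K = 3$, while no vertex survives for $K = 4$.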
Today, the natural sciences, the social sciences, and technological fields are all using network and graph analysis methods to better understand complex systems. Dense component discovery and analysis is one important aspect of network analysis. Therefore, readers from many different backgrounds will benefit from understanding more about the characteristics of dense components and some of the methods used to uncover them.

In the next section, we outline the graph terminology and define the fundamental measures of density to be used in the rest of the chapter. Section 3 categorizes the algorithmic approaches and presents representative implementations in more detail. Section 4 expands the topic to consider frequently-occurring dense components in a set of graphs. Section 5 provides examples of how these techniques have been applied in various scientific fields. Section 6 concludes the chapter with a look to the future.

2. Types of Dense Components

Different applications find different definitions of dense component to be useful. In this section, we outline the many ways to define a dense component, categorizing them by their important features. Understanding these features of the various types of components is valuable for deciding which type of component to pursue.

2.1 Absolute vs. Relative Density

We can divide density definitions into two classes, absolute density and relative density. An absolute density measure establishes rules and parameter values for what constitutes a dense component, independent of what is outside the component. For example, we could say that we are only interested in cliques, fully-connected subgraphs of maximum density. Absolute density measures take the form of relaxations of the pure clique measure. On the other hand, a relative density measure has no preset level for what is sufficiently dense. It compares the density of one region to another, with the goal of finding the densest regions.
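The distinction can be made concrete with a small sketch. It assumes an undirected graph stored as a dictionary of neighbor sets and uses edge density (the fraction of vertex pairs in a region that are joined by an edge) as the measure; the helper names, the threshold parameter, and the toy graph are hypothetical. An absolute test compares a region against a preset level, with a level of 1.0 corresponding to a clique, whereas a relative test only ranks candidate regions against each other.

```python
from itertools import combinations


def edge_density(adj, nodes):
    """Fraction of vertex pairs within `nodes` that are connected."""
    nodes = list(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    present = sum(1 for u, v in combinations(nodes, 2) if v in adj[u])
    return present / (n * (n - 1) / 2)


def is_dense_absolute(adj, nodes, threshold=1.0):
    """Absolute measure: dense if density reaches a preset level.

    threshold=1.0 accepts only cliques; lower values relax the clique test.
    """
    return edge_density(adj, nodes) >= threshold


def densest_region(adj, regions):
    """Relative measure: pick whichever candidate region is densest."""
    return max(regions, key=lambda r: edge_density(adj, r))


# Hypothetical toy graph: triangle a-b-c with a path c-d-e attached.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}

triangle = {"a", "b", "c"}
path = {"c", "d", "e"}

print(is_dense_absolute(adj, triangle))      # clique test on the triangle
print(is_dense_absolute(adj, path, 0.5))     # relaxed absolute test on the path
print(densest_region(adj, [path, triangle])) # relative comparison
```

Note that the path fails the strict clique test but passes the relaxed threshold of 0.5, while the relative comparison needs no threshold at all.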
To establish the boundaries of components, a metric typically looks to maximize the difference between intra-component connectedness and inter-component connectedness. Often but not necessarily,