Managing and Mining Graph Data, part 53

Chapter 17

SOFTWARE-BUG LOCALIZATION WITH GRAPH MINING

Frank Eichinger
Institute for Program Structures and Data Organization (IPD)
Universität Karlsruhe (TH), Germany
eichinger@ipd.uka.de

Klemens Böhm
Institute for Program Structures and Data Organization (IPD)
Universität Karlsruhe (TH), Germany
boehm@ipd.uka.de

Abstract: In the recent past, a number of frequent subgraph mining algorithms have been proposed. They allow for analyses in domains where data is naturally graph-structured. However, because of scalability problems when dealing with large graphs, the application of graph mining has been limited to only a few domains. In software engineering, debugging is an important issue. It is most challenging to localize bugs automatically, a task that is expensive when done manually. Several approaches have been investigated, some of which analyze traces of repeated program executions. These traces can be represented as call graphs. Such graphs describe the invocations of methods during an execution. This chapter is a survey of graph mining approaches for bug localization based on the analysis of dynamic call graphs. In particular, this chapter first introduces the subproblem of reducing the size of call graphs and then discusses the different approaches to localize bugs based on such reduced graphs. Finally, we compare selected techniques experimentally and provide an outlook on future issues.

Keywords: Software Bug Localization, Program Call Graphs

C.C. Aggarwal and H. Wang (eds.), Managing and Mining Graph Data, Advances in Database Systems 40, DOI 10.1007/978-1-4419-6045-0_17, © Springer Science+Business Media, LLC 2010

1. Introduction

Software quality is a huge concern in industry. Almost any software contains at least some minor bugs after being released. In order to avoid bugs, which incur significant costs, it is important to find and fix them before the release.
In general, this results in devoting more resources to quality assurance. Software developers usually try to find and fix bugs by means of in-depth code reviews, along with testing and classical debugging. Locating bugs is considered to be the most time-consuming and challenging activity in this context [6, 20, 24, 26], where the resources available are limited. Therefore, there is a need for semi-automated techniques guiding the debugging process [34]. If a developer obtains some hints as to where bugs might be located, debugging becomes more efficient.

Research in the field of software reliability has been extensive, and various techniques have been developed addressing the identification of defect-prone parts of software. This interest is not limited to software-engineering research. In the machine-learning community, automated debugging is considered to be one of the ten most challenging problems for the next years [11]. So far, no bug localization technique is perfect in the sense that it is capable of discovering any kind of bug. In this chapter, we look at a relatively new class of bug localization techniques, the analysis of call graphs with graph-mining techniques. It can be seen as an approach orthogonal to and complementing existing techniques.

Graph mining, or more specifically frequent subgraph mining, is a relatively young discipline in data mining. As described in the other chapters of this book, there are many different techniques as well as numerous applications for graph mining. Probably the most prominent application is the analysis of chemical molecules. As the NP-complete problem of subgraph isomorphism [16] is an inherent part of frequent subgraph mining algorithms, the analysis of molecules benefits from the relatively small size of most of them. Compared to molecular data, software-engineering artifacts are typically mapped to graphs that are much larger.
Consequently, common graph-mining algorithms do not scale for these graphs. In order to make use of call graphs, which reflect the invocation structure of specific program executions, it is key to deploy a suitable call-graph-reduction technique. Such techniques help to alleviate the scalability problems to some extent and make it possible to use graph-mining algorithms in a number of cases. As we will demonstrate, such approaches work well in certain cases, but some challenges remain. Besides scalability issues that are still unsolved, some call-graph-reduction techniques lead to another challenge: They introduce edge weights representing call frequencies. As graph-mining research has concentrated on structural and categorical domains, rather than on quantitative weights, we are not aware of any algorithm specialized in mining weighted graphs. Though this chapter presents a technique to analyze graphs with weighted edges, the technique is a composition of established algorithms rather than a universal weighted graph mining algorithm. Thus, besides mining large graphs, weighted graph mining is a further challenge for graph-mining research driven by the field of software engineering.

The remainder of this chapter is structured as follows: Section 2 introduces some basic principles of call graphs, bugs, graph mining and bug localization with such graphs. Section 3 gives an overview of related work in software engineering employing data-analysis techniques. Section 4 discusses different call-graph-reduction techniques. The different bug-localization approaches are presented and compared in Section 5, and Section 6 concludes.

2. Basics of Call Graph Based Bug Localization

This section introduces the concept of dynamic call graphs in Subsection 2.1. It presents some classes of bugs in Subsection 2.2, and Subsection 2.3 explains how bug localization with call graphs works in principle.
A brief overview of key aspects of graph and tree mining in the context of this chapter is given in Subsection 2.4.

2.1 Dynamic Call Graphs

Call graphs are either static or dynamic [17]. A static call graph [1] can be obtained from the source code. It represents all methods¹ of a program as nodes and all possible method invocations as edges. Dynamic call graphs are of importance in this chapter. They represent an execution of a particular program and reflect the actual invocation structure of the execution. Without any further treatment, a call graph is a rooted ordered tree. The main-method of a program usually is the root, and the methods invoked directly are its children. Figure 17.1a is an abstract example of such a call graph where the root Node 𝑎 represents the main-method.

¹ In this chapter, we use method interchangeably with function.

Unreduced call graphs typically become very large. The reason is that, in modern software development, dedicated methods typically encapsulate every single functionality. These methods call each other frequently. Furthermore, iterative programming is very common, and methods calling other methods occur within loops, executed thousands of times. Therefore, the execution of even a small program lasting some seconds often results in call graphs consisting of millions of edges.

The size of call graphs prohibits straightforward mining with state-of-the-art graph-mining algorithms. Hence, a reduction of the graphs is necessary, one which compresses them significantly but keeps the essential properties of an individual execution. Section 4 describes different reduction techniques.

2.2 Bugs in Software

In the software-engineering literature, there are a number of different definitions of bugs, defects, errors, failures, faults and the like. For the purpose of this chapter, we do not differentiate between them.
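
As a concrete aside on Subsection 2.1, the following minimal sketch shows how an unreduced dynamic call graph can be held as a rooted ordered tree. It assumes a hypothetical trace format of ("enter", name) and ("exit", name) events; the names `CallNode`, `build_call_tree` and `count_edges`, and the example trace, are illustrative assumptions and are not taken from the chapter.

```python
from dataclasses import dataclass, field


@dataclass
class CallNode:
    """One method invocation in an unreduced dynamic call graph."""
    method: str
    children: list["CallNode"] = field(default_factory=list)


def build_call_tree(trace):
    """Build a rooted ordered tree from ("enter", name) / ("exit", name) events.

    The event format is an assumption made for this sketch.
    """
    root = None
    stack = []  # currently active invocations, innermost last
    for event, method in trace:
        if event == "enter":
            node = CallNode(method)
            if stack:
                stack[-1].children.append(node)  # child order = call order
            elif root is None:
                root = node
            else:
                raise ValueError("trace has more than one root invocation")
            stack.append(node)
        elif event == "exit":
            if not stack or stack[-1].method != method:
                raise ValueError(f"unbalanced exit for {method!r}")
            stack.pop()
        else:
            raise ValueError(f"unknown event {event!r}")
    if stack:
        raise ValueError("trace ended with active invocations")
    return root


def count_edges(node):
    """Number of invocation edges in the tree rooted at node."""
    return sum(1 + count_edges(child) for child in node.children)


# Hypothetical trace: main-method a calls b once, then c; c calls b in a loop.
trace = [("enter", "a"), ("enter", "b"), ("exit", "b"), ("enter", "c")]
trace += [("enter", "b"), ("exit", "b")] * 3
trace += [("exit", "c"), ("exit", "a")]

tree = build_call_tree(trace)
print([child.method for child in tree.children])  # ['b', 'c']
print(count_edges(tree))  # 5
```

Every invocation becomes its own node, so a loop that runs thousands of times adds thousands of edges. This is the growth that makes a reduction step necessary before mining.
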
It is enough to know that a bug in a program execution manifests itself by producing results other than those specified or by leading to some unexpected runtime behavior such as crashes or non-terminating runs. In the following, we introduce some types of bugs which are particularly interesting in the context of call graph based bug localization.

[Figure 17.1. (a) An unreduced call graph, (b) a call graph with a structure affecting bug, and (c) a call graph with a frequency affecting bug.]

Crashing and non-crashing bugs: Crashing bugs lead to an unexpected termination of the program. Prominent examples include null pointer exceptions and divisions by zero. In many cases, e.g., depending on the programming language, such bugs are not hard to find: A stack trace is usually shown which gives hints as to where the bug occurred. Harder to cope with are non-crashing bugs, i.e., failures which lead to faulty results without any hint that something went wrong during the execution. As non-crashing bugs are hard to find, all approaches to discover bugs with call-graph mining focus on them and leave aside crashing bugs.

Occasional and non-occasional bugs: Occasional bugs are bugs which occur with some but not all input data. Finding occasional bugs is particularly difficult, as they are harder to reproduce, and more test cases are necessary for debugging. Furthermore, they occur more frequently, as non-occasional bugs are usually detected early, and occasional bugs might only be found by means of extensive testing. As all bug-localization techniques presented in this chapter rely on comparing call graphs of failing and correct program executions, they deal with occasional bugs only. In other words, besides examples of failing program executions, there needs to be a certain number of correct executions.
Structure and call frequency affecting bugs: This distinction is particularly useful when designing call graph based bug-localization techniques. Structure affecting bugs are bugs resulting in different shapes of the call graph where some parts are missing or occur additionally in faulty executions. An example is presented in Figure 17.1b, where Node 𝑏 called from 𝑎 is missing, compared to the original graph in Figure 17.1a. In this example, a faulty if-condition in Node 𝑎 could have caused the bug. In contrast, call frequency affecting bugs are bugs which lead to a change in the number of calls of a certain subtree in faulty executions, rather than to completely missing or new substructures. In the example in Figure 17.1c, a faulty loop condition or a faulty if-condition inside a loop in Method 𝑐 is a typical cause for the increased number of calls of Method 𝑏.

Like probably any bug-localization technique, call graph based bug localization is certainly not able to find all kinds of software bugs. For example, it is possible that bugs do not affect the call graph at all. For instance, if some mathematical expression calculates faulty results, this does not necessarily affect subsequent method calls, and call graph mining cannot detect this. Therefore, call graph based bug localization should be seen as a technique which complements other techniques, such as the ones we will describe in Section 3. In this chapter we concentrate on deterministic bugs of single-threaded programs and leave aside bugs which are specific to non-deterministic or multi-threaded situations. However, the techniques described in the following might locate such bugs as well.

2.3 Bug Localization with Call Graphs

So far, several approaches have been proposed to localize bugs by means of call-graph mining [9, 13, 14, 25]. We will present them in detail in the following sections. In a nutshell, the approaches consist of three steps:

1 Deduction of call graphs from program executions, assignment of labels correct or failing.
2 Reduction of call graphs.
3 Mining of call graphs, analysis of the resulting frequent subgraphs.

Step 1: Deriving call graphs is relatively simple. They can be obtained by tracing program executions while testing, which is assumed to be done anyway. Furthermore, a classification of program executions as correct or failing is
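
The three steps listed above, together with the distinction between structure affecting and call frequency affecting bugs, can be illustrated with a deliberately simplified sketch. It makes several assumptions that do not come from the chapter: each execution is given as a hypothetical list of (caller, callee) invocation pairs, the reduction simply collapses repeated pairs into one edge weighted with the call frequency, and step 3 is replaced by a plain comparison of edge presence and mean edge weight between the two classes. That comparison stands in for frequent subgraph mining and is not the method of any of the cited approaches.

```python
from collections import Counter


def reduce_calls(calls):
    """Step 2 (simplified): collapse repeated (caller, callee) invocations
    into one edge whose weight is the call frequency."""
    return Counter(calls)


def compare(correct_runs, failing_runs):
    """Step 3 (stand-in): report edges whose presence or mean weight differs
    between correct and failing executions. Not frequent subgraph mining."""
    if not correct_runs or not failing_runs:
        raise ValueError("need at least one correct and one failing run")
    correct = [reduce_calls(run) for run in correct_runs]
    failing = [reduce_calls(run) for run in failing_runs]
    edges = set().union(*correct, *failing)
    structural, frequency = [], []
    for edge in sorted(edges):
        # Fraction of executions of each class that contain the edge at all.
        in_correct = sum(edge in g for g in correct) / len(correct)
        in_failing = sum(edge in g for g in failing) / len(failing)
        if in_correct != in_failing:
            structural.append((edge, in_correct, in_failing))
            continue
        # Same presence in both classes: compare the mean call frequency.
        mean_correct = sum(g[edge] for g in correct) / len(correct)
        mean_failing = sum(g[edge] for g in failing) / len(failing)
        if mean_correct != mean_failing:
            frequency.append((edge, mean_correct, mean_failing))
    return structural, frequency


# Step 1 (assumed given): labeled executions as hypothetical call lists.
ok = [("a", "b"), ("a", "c")] + [("c", "b")] * 3
missing_call = [("a", "c")] + [("c", "b")] * 3            # a no longer calls b
extra_calls = [("a", "b"), ("a", "c")] + [("c", "b")] * 5  # loop runs too often

print(compare([ok, ok], [missing_call]))
# ([(('a', 'b'), 1.0, 0.0)], [])
print(compare([ok, ok], [extra_calls]))
# ([], [(('c', 'b'), 3.0, 5.0)])
```

The first comparison flags an edge that disappears in the failing run, which corresponds to a structure affecting bug. The second flags an edge whose weight changes, which corresponds to a call frequency affecting bug and shows why edge weights introduced by a reduction carry useful information.
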
