540 MANAGING AND MINING GRAPH DATA

In concrete terms, we compare the following five alternatives:

E_01m: The structural P_SN-scoring approach similar to [9] (cf. Subsection 5.1), but based on the unordered R_01m^unord reduction.
E_subtree: The frequency-based P_freq-scoring approach as in [13, 14] (cf. Subsection 5.2), based on the R_subtree reduction.
E_comb[13]: The combined approach from [13] (cf. Subsection 5.3), based on the R_01m^unord and R_subtree reductions.
E_comb[14]: The combined approach from [14] (cf. Subsection 5.3), based on the R_subtree reduction.
E_total: The combined approach as in [14] (cf. Subsection 5.3), but with the R_total^w reduction as in [25] (but with weights and without temporal edges, cf. Subsection 5.1).

Table 17.3 presents the results of the five experiments for all fourteen bugs. Each entry is the position in the method ranking at which the bug is first found. A bug that the respective approach does not discover is represented by '25', the total number of methods of the program. Note that the frequency-based and the combined method rankings usually also provide information on where a bug is located within a method and in the context of which subgraph it appears. The following comparisons leave aside this additional information.

Exp.\Bug      1   2   3   4   5   6   7   8   9  10  11  12  13  14
E_01m        25   3   1   3   2   4   3   1   1   6   4   4  25   4
E_subtree     3   3   1   1   1   3   3   1  25   2   3   3   3   3
E_comb[13]    1   3   1   2   2   1   2   1   3   1   2   4   8   5
E_comb[14]    3   2   1   1   1   2   2   1  18   2   2   3   3   3
E_total       1   5   1   4   3   5   5   2  25   2   5   4   6   3

Table 17.3. Experimental results.

Structural, Frequency-Based and Combined Approaches. Comparing the results from E_01m and E_subtree, the frequency-based approach (E_subtree) almost always performs as well as or better than the structural one (E_01m). This demonstrates that analyzing numerical call frequencies is adequate to locate bugs. Bugs 1, 9 and 13 illustrate that each approach on its own fails to find certain bugs.
Bug 9 cannot be found by comparing call frequencies (E_subtree). This is because Bug 9 is a modified condition which always leads to the invocation of a certain method. In consequence, the call frequency is always the same. Bugs 1 and 13 are not found with the purely structural approach (E_01m). Both are typical bugs that affect call frequencies: Bug 1 is in an if-condition inside a loop and leads to more invocations of a certain method. In Bug 13, a modified for-condition slightly changes the call frequency of a method inside the loop. With the R_01m^unord reduction technique used in E_01m, Bugs 1 and 13 have the same graph structure in both correct and failing executions. Thus, it is difficult or even impossible to identify structural differences.

The combined approaches in E_comb[13] and E_comb[14] are intended to take structural information into account as well, in order to improve the results from E_subtree. We do achieve this goal: when comparing E_subtree and E_comb[14], we retain the already good results from E_subtree in nine cases and improve them in five. When looking at the two combination strategies, it is hard to say which one is better. E_comb[13] turns out to be better in four cases, while E_comb[14] is better in six. Thus, the technique in E_comb[14] is slightly better, but not for every bug. Furthermore, the technique in E_comb[13] is less efficient, as it requires two graph-mining runs.

Reduction Techniques. Looking at the call-graph-reduction techniques, the results from the experiments discussed so far reveal that the subtree-reduction technique with edge weights (R_subtree), used in E_subtree as well as in both combined approaches, is superior to the zero-one-many reduction (R_01m^unord). Besides increasing the precision of the localization techniques based on the reduction, R_subtree also produces smaller graphs than R_01m^unord (cf. Subsection 4.5). E_total evaluates the total reduction technique.
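The pairwise counts quoted above can be re-derived directly from Table 17.3. The following Python sketch is not part of the original study; the dictionary keys and the helper name compare are illustrative assumptions, and a lower rank position is taken to mean a better result.

```python
# Rank positions copied from Table 17.3; 25 means the bug was not found.
# The dictionary keys and the helper name `compare` are illustrative assumptions.
RESULTS = {
    "E_01m":      [25, 3, 1, 3, 2, 4, 3, 1, 1, 6, 4, 4, 25, 4],
    "E_subtree":  [3, 3, 1, 1, 1, 3, 3, 1, 25, 2, 3, 3, 3, 3],
    "E_comb[13]": [1, 3, 1, 2, 2, 1, 2, 1, 3, 1, 2, 4, 8, 5],
    "E_comb[14]": [3, 2, 1, 1, 1, 2, 2, 1, 18, 2, 2, 3, 3, 3],
    "E_total":    [1, 5, 1, 4, 3, 5, 5, 2, 25, 2, 5, 4, 6, 3],
}


def compare(a, b):
    """Count the bugs for which experiment a ranks better than, equal to, or worse than b."""
    pairs = list(zip(RESULTS[a], RESULTS[b]))
    better = sum(x < y for x, y in pairs)  # a finds the bug at an earlier position
    ties = sum(x == y for x, y in pairs)
    worse = sum(x > y for x, y in pairs)
    return better, ties, worse


if __name__ == "__main__":
    for a, b in [("E_subtree", "E_01m"),
                 ("E_comb[14]", "E_subtree"),
                 ("E_comb[13]", "E_comb[14]")]:
        print(a, "vs", b, "->", compare(a, b))
```

Under these assumptions, compare("E_comb[14]", "E_subtree") yields five improvements and nine ties, and compare("E_comb[13]", "E_comb[14]") yields four wins for E_comb[13] and six for E_comb[14], in line with the counts given in the text.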
We use R_total^w as an instance of the total reduction family. The rationale is that this one can be used with E_comb[14]. In most cases, the total reduction (E_total) performs worse than the subtree reduction (E_comb[14]). This confirms that the subtree-reduction technique is reasonable, and that it is worth keeping more structural information than the total reduction does. However, in cases where the subtree reduction produces graphs which are too large for efficient mining, and the total reduction produces sufficiently small graphs, R_total^w can be an alternative to R_subtree.

Temporal Order. The experimental results listed in Table 17.3 do not shed any light on the influence of the temporal order. When applied to the buggy programs used in our comparisons, the total reduction with temporal edges (R_total^tmp) produces graphs of a size which cannot be mined in a reasonable time. This already shows that representing the temporal order with additional edges might lead to graphs whose size is no longer manageable. In our preliminary experiments, we have repeated E_01m with the R_01m^ord reduction and the FREQT [2] rooted ordered tree miner in order to evaluate the usefulness of the temporal order. Although we systematically varied the different mining parameters, the results of these experiments are in general not better than those in E_01m. For only two of the 14 bugs did the temporal-aware approach perform better than E_01m; in the other cases it performed worse. In a comparison with the R_subtree reduction and the gSpan algorithm [32], the R_01m^ord reduction with the ordered tree miner displayed a significantly increased runtime, by a factor of 4.8 on average (see footnote 4). Therefore, our preliminary result is that the incorporation of the temporal order does not increase the precision of bug localizations.
This is based on the bugs considered so far, and more comprehensive experiments would be needed for a more reliable statement.

Threats to Validity. The experiments carried out in this subsection, as well as in the respective publications [9, 13, 14, 25], illustrate the ability to locate bugs based on dynamic call graphs using graph mining techniques. From a software engineering point of view, three issues remain for further evaluations:

(1) All experiments are based on artificially seeded bugs. Although these bugs mimic typical bugs as they occur in reality, a further investigation with real bugs, e.g., from a real software project, would prove the validity of the proposed techniques.
(2) All experiments feature rather small programs containing the bugs. The programs rarely consist of more than one class and represent situations where bugs could be found relatively easily by a manual investigation as well. When solutions for the current scalability issues are found, localization techniques should be validated with larger software projects.
(3) None of the techniques considered has been directly compared to other techniques such as those discussed in Section 3. Such a comparison, based on a large number of bugs, would reveal the advantages and disadvantages of the different techniques.

The iBUGS project [7] provides real bug datasets from large software projects such as AspectJ. It might serve as a basis to tackle the issues mentioned.

6. Conclusions and Future Directions

This chapter has dealt with the problem of localizing software bugs, as a use case of graph mining. This localization is important as bugs are hard to detect manually. Graph mining based techniques identify structural patterns in trace data which are typical for failing executions but rare in correct ones. They serve as hints for bug localization. Respective techniques based on call graph mining first need to solve the subproblem of call graph reduction. In this chapter we have discussed both reduction techniques for dynamic call graphs and approaches analyzing such graphs. Experiments have demonstrated the usefulness of our techniques and have compared different approaches.

Footnote 4: In this comparison, FREQT was restricted as in [9] to find subtrees of a maximum size of four nodes. Such a restriction was not set in gSpan. Furthermore, we expect a further significant speedup when CloseGraph [33] is used instead of gSpan.

All techniques surveyed in this chapter work well when applied to relatively small software projects. Due to the NP-hard problem of subgraph isomorphism inherent to frequent subgraph mining, none of the techniques presented is directly applicable to large projects. One future challenge is to overcome this problem, be it with more sophisticated graph-mining algorithms, e.g., scalable approximate mining or discriminative techniques, or with smarter bug-localization frameworks, e.g., different graph representations or constraint-based mining. One starting point could be the granularity of call graphs. So far, call graphs represent method invocations. One can think of smaller graphs representing interactions at a coarser level, i.e., classes or packages. [12] presents encouraging results regarding the localization of bugs based on class-level call graphs. As future research, we will investigate how to turn these results into a scalable framework for locating bugs. Such a framework would first do bug localization on a coarse level before 'zooming in' and investigating more detailed call graphs.

Call graph reduction techniques introducing edge weights trigger another challenge for graph mining: weighted graphs. We have shown that the analysis of such weights is crucial to detect certain bugs. Graph-mining research has focused on structural issues so far, and we are not aware of any algorithm for explicit mining of weighted graphs.
Besides reduced call graphs, such algorithms could also mine other real-world graphs [3], e.g., in logistics [19] and image analysis [27].

Acknowledgments

We are indebted to Matthias Huber for his contributions. We further thank Andreas Zeller for fruitful discussions and Valentin Dallmeier for his comments on early versions of this chapter.

References

[1] F. E. Allen. Interprocedural Data Flow Analysis. In Proc. of the IFIP Congress, 1974.
[2] T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Sakamoto, and S. Arikawa. Efficient Substructure Discovery from Large Semi-structured Data. In Proc. of the 2nd SIAM Int. Conf. on Data Mining (SDM), 2002.
[3] D. Chakrabarti and C. Faloutsos. Graph Mining: Laws, Generators, and Algorithms. ACM Computing Surveys (CSUR), 38(1):2, 2006.
[4] R.-Y. Chang, A. Podgurski, and J. Yang. Discovering Neglected Conditions in Software by Mining Dependence Graphs. IEEE Transactions on Software Engineering, 34(5):579–596, 2008.
[5] Y. Chi, R. Muntz, S. Nijssen, and J. Kok. Frequent Subtree Mining – An Overview. Fundamenta Informaticae, 66(1–2):161–198, 2005.
[6] V. Dallmeier, C. Lindig, and A. Zeller. Lightweight Defect Localization for Java. In Proc. of the 19th European Conf. on Object-Oriented Programming (ECOOP), 2005.
[7] V. Dallmeier and T. Zimmermann. Extraction of Bug Localization Benchmarks from History. In Proc. of the 22nd IEEE/ACM Int. Conf. on Automated Software Engineering (ASE), 2007.
[8] I. F. Darwin. Java Cookbook. O'Reilly, 2004.
[9] G. Di Fatta, S. Leue, and E. Stegantova. Discriminative Pattern Mining in Software Fault Detection. In Proc. of the 3rd Int. Workshop on Software Quality Assurance (SOQUA), 2006.
[10] R. Diestel. Graph Theory. Springer, 2006.
[11] T. G. Dietterich, P. Domingos, L. Getoor, S. Muggleton, and P. Tadepalli. Structured Machine Learning: The Next Ten Years. Machine Learning, 73(1):3–23, 2008.
[12] F. Eichinger and K. Böhm. Towards Scalability of Graph-Mining Based Bug Localisation. In Proc. of the 7th Int. Workshop on Mining and Learning with Graphs (MLG), 2009.
[13] F. Eichinger, K. Böhm, and M. Huber. Improved Software Fault Detection with Graph Mining. In Proc. of the 6th Int. Workshop on Mining and Learning with Graphs (MLG), 2008.
[14] F. Eichinger, K. Böhm, and M. Huber. Mining Edge-Weighted Call Graphs to Localise Software Bugs. In Proc. of the European Conf. on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), 2008.
[15] M. D. Ernst, J. Cockrell, W. G. Griswold, and D. Notkin. Dynamically Discovering Likely Program Invariants to Support Program Evolution. IEEE Transactions on Software Engineering, 27(2):99–123, 2001.
[16] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.
[17] S. L. Graham, P. B. Kessler, and M. K. Mckusick. gprof: A Call Graph Execution Profiler. In Proc. of the ACM SIGPLAN Symposium on Compiler Construction, 1982.
[18] M. J. Harrold, R. Gupta, and M. L. Soffa. A Methodology for Controlling the Size of a Test Suite. ACM Transactions on Software Engineering and Methodology (TOSEM), 2(3):270–285, 1993.
[19] W. Jiang, J. Vaidya, Z. Balaporia, C. Clifton, and B. Banich. Knowledge Discovery from Transportation Network Data. In Proc. of the 21st Int. Conf. on Data Engineering (ICDE), 2005.
[20] J. A. Jones, M. J. Harrold, and J. Stasko. Visualization of Test Information to Assist Fault Localization. In Proc. of the 24th Int. Conf. on Software Engineering (ICSE), 2002.
[21] P. Knab, M. Pinzger, and A. Bernstein. Predicting Defect Densities in Source Code Files with Decision Tree Learners. In Proc. of the Int. Workshop on Mining Software Repositories (MSR), 2006.
[22] B. Korel and J. Laski. Dynamic Program Slicing. Information Processing Letters, 29(3):155–163, 1988.
[23] B. Liblit, A. Aiken, A. X. Zheng, and M. I. Jordan. Bug Isolation via Remote Program Sampling. ACM SIGPLAN Notices, 38(5):141–154, 2003.
[24] C. Liu, X. Yan, L. Fei, J. Han, and S. P. Midkiff. SOBER: Statistical Model-Based Bug Localization. SIGSOFT Software Engineering Notes, 30(5):286–295, 2005.
[25] C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu. Mining Behavior Graphs for "Backtrace" of Noncrashing Bugs. In Proc. of the 5th SIAM Int. Conf. on Data Mining (SDM), 2005.
[26] N. Nagappan, T. Ball, and A. Zeller. Mining Metrics to Predict Component Failures. In Proc. of the 28th Int. Conf. on Software Engineering (ICSE), 2006.
[27] S. Nowozin, K. Tsuda, T. Uno, T. Kudo, and G. Bakir. Weighted Substructure Mining for Image Analysis. In Proc. of the Conf. on Computer Vision and Pattern Recognition (CVPR), 2007.
[28] K. J. Ottenstein and L. M. Ottenstein. The Program Dependence Graph in a Software Development Environment. SIGSOFT Software Engineering Notes, 9(3):177–184, 1984.
[29] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
[30] A. Schröter, T. Zimmermann, and A. Zeller. Predicting Component Failures at Design Time. In Proc. of the 5th Int. Symposium on Empirical Software Engineering, 2006.
[31] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers, 2005.
[32] X. Yan and J. Han. gSpan: Graph-Based Substructure Pattern Mining. In Proc. of the 2nd IEEE Int. Conf. on Data Mining (ICDM), 2002.
[33] X. Yan and J. Han. CloseGraph: Mining Closed Frequent Graph Patterns. In Proc. of the 9th ACM Int. Conf. on Knowledge Discovery and Data Mining (KDD), 2003.
[34] T. Zimmermann, N. Nagappan, and A. Zeller. Predicting Bugs from History. In T. Mens and S. Demeyer, editors, Software Evolution, pages 69–88. Springer, 2008.

Chapter 18

A SURVEY OF GRAPH MINING TECHNIQUES FOR BIOLOGICAL DATASETS

S. Parthasarathy
The Ohio State University
2015 Neil Ave, DL395, Columbus, OH
srini@cse.ohio-state.edu

S. Tatikonda
The Ohio State University
2015 Neil Ave, DL395, Columbus, OH
tatikond@cse.ohio-state.edu

D. Ucar
The Ohio State University
2015 Neil Ave, DL395, Columbus, OH
ucar@cse.ohio-state.edu

Abstract

Mining structured information has been the source of much research in the data mining community over the last decade. The field of bioinformatics has emerged as an important application area in this context. Examples abound, ranging from the analysis of protein interaction networks to the analysis of phylogenetic data. In this article we survey the principal results in the field, examining them in terms of both their algorithmic contributions and their applicability in the domain in question. We conclude this article with a discussion of the key results and identify some interesting directions for future research.

Keywords: Graph Mining, Tree Mining, Biological Networks, Community Discovery

C.C. Aggarwal and H. Wang (eds.), Managing and Mining Graph Data, Advances in Database Systems 40, DOI 10.1007/978-1-4419-6045-0_18, © Springer Science+Business Media, LLC 2010

1. Introduction

Advances in data collection and storage technology have led to a proliferation of structured information available to organizations and individuals. This information is often also available to the user in a myriad of formats and across multiple media. This is especially true in the vibrant field of bioinformatics, where an increasingly large number of problems are represented in structured or semi-structured format. Examples abound, ranging from protein interaction networks (graphs) to phylogenetic datasets (trees), and from XML repositories of proteomic data (trees) to regulatory networks (graphs). The size and number of such data stores are growing rapidly. Such data may arise directly out of experimental observations (e.g.
PPI network complexes from mass spectrometry) or may be a convenient abstraction for housing relational information (e.g. Protein Data Bank). As another example, mRNA measurements from microarray studies can be used to infer pairwise gene relations that imply co-expression of two genes. Regulatory relations between DNA binding proteins and genes can also be identified via various experimental technologies such as ChIP-chip, ChIP-seq, or DamID. Learning a biological network structure from experimental data that reflects the real-world relations is a challenge in itself. Where data mining, in particular graph mining, can help is in the analysis of such structured data for the discovery of useful information, such as the identification of common or useful substructures and the detection of anomalous or unusual structures.

In this article we survey the use of graph mining for bioinformatics problems. This topic has been heavily researched over the last decade and we review the relevant material. We take a broad view of the term graph mining here. Since trees are simply connected acyclic graphs, we include approaches that leverage tree mining algorithms as well. Additionally, within the domain of graph mining there are approaches that focus on harvesting patterns from a single large graph or network and those that focus on extracting patterns from multiple graphs. We also cover other variants of graphs in our discussion, including different tree variants as well as directed and bipartite graphs.

The rest of this article is broadly divided into four sections. Section 2 discusses the use of tree mining algorithms for bioinformatics problems. For example, RNA secondary structures can be represented in the form of a tree. A forest of such RNA structure trees can be employed to characterize a newly sequenced novel RNA structure by identification of common topological patterns [93].
In particular we survey the role played by frequent tree mining algorithms, tree alignment, and statistical methods in this context.

In Section 3 we discuss algorithms that target the identification of frequent sub-patterns across multiple networks. For example, a recent study [53] showed how 39 co-expression networks of Budding Yeast can be analyzed for coherent dense subgraphs across many of these networks. The discovered subgraphs were then used to predict the functionality of unknown genes. In particular we survey the role played by frequent graph mining algorithms and motif discovery algorithms in this context.

In Section 4 we discuss approaches that mine single and large biological networks for the identification of important subnetwork structures, such as densely interacting communities in PPI networks or gene co-expression networks. In particular we discuss the role played by community discovery and graph clustering algorithms in the presence of uncertainty and noise in this context.

Finally, in Section 5 we conclude this survey with a discussion of some open problems in the field.

2. Mining Trees

Trees are widely used to represent various biological structures like glycans, RNAs, and phylogenies.

Glycans are carbohydrate sugar chains attached to some lipids or proteins, and they are considered the third class of information-encoding biological macromolecules, after DNA and proteins. The field of characterizing and studying glycans is known as glycomics, akin to genomics and proteomics. Glycans play a critical role in many biological processes, including embryonic development, cell-to-cell communication, coordination of immune functions, tumor progression, and protein regulations and interactions. Glycans are composed of monosaccharides (sugars) that are linked by glycosidic bonds.
Unlike the nucleotides and amino acids of DNA and proteins, which form simple strings, monosaccharides may be linked to one or more other sugars, thereby forming a branched tree structure. Glycans are therefore often represented as rooted ordered labeled trees. In rare cases, glycans may contain cycles due to cyclization of carbohydrate structures (e.g., cyclodextrins) [48]. There exist a number of representation schemes (KCF [5], LINUCS [13], GLYDE [87], GlycoCT [48], and GLYDE-II [83]) and database systems (CarbBank (footnote 1), SWEET-DB [75], KEGG/GLYCAN [45], EuroCarbDB (footnote 2), GlycoSuiteDB [26]) to store glycan data.

Ribonucleic acid (RNA) is a type of molecule that consists of a long chain of nucleotide units. RNA molecules play an important role in several key functionalities, which include translation, splicing, gene regulation, and synthesis of proteins. As with all biomolecules, the function of RNAs is intimately related to their structure. The secondary structure of RNAs is a list of base

Footnote 1: http://bssv01.lancs.ac.uk/gig/pages/gag/carbbank.htm
Footnote 2: http://www.eurocarbdb.org/