MINING, INDEXING AND SIMILARITY SEARCH IN LARGE GRAPH DATA SETS
BY

XIFENG YAN

B.S., Zhejiang University, 1997
M.S., State University of New York at Stony Brook, 2001
© 2006 by Xifeng Yan. All rights reserved.
CERTIFICATE OF COMMITTEE APPROVAL

University of Illinois at Urbana-Champaign

Director of Research: Jiawei Han
ABSTRACT

Scalable analytical algorithms and tools for large graph data sets are in great demand across domains from software engineering to computational biology, as it is very difficult, if not impossible, for human beings to manually analyze any reasonably large collection of graphs due to their high complexity. In this dissertation, we investigate two long-standing fundamental problems: Given a graph data set, what are the hidden structural patterns and how can we find them? And how can we index graphs and perform similarity search in large graph data sets?

Graph pattern mining is an expensive computational problem since subgraph isomorphism is NP-complete. Previous solutions generate inevitable overheads since they rely on joining two graphs to form larger candidates. We develop a graph canonical labeling system, gSpan, showing both theoretically and empirically that this kind of join operation is unnecessary. Graph indexing, the second problem addressed in this dissertation, may incur an exponential number of index entries if all of the substructures in a graph database are used for indexing. The solution, gIndex, proposes a novel, frequent and discriminative graph mining approach that leads to a compact but effective graph index structure that is orders of magnitude smaller in size but an order of magnitude faster in performance than traditional approaches.

Besides graph mining and search, this dissertation provides a thorough investigation of pattern summarization, pattern-based classification, constraint pattern mining, and graph similarity search, which could leverage the usage of graph patterns. It also explores several critical applications in bioinformatics, computer systems, and software engineering, including gene relevance network analysis for functional annotation and program flow analysis for automated software bug isolation.

The developed concepts, theories, and systems may significantly deepen the understanding of data mining principles in structural pattern discovery, interpretation, and search. The formulation of a general graph information system through this study could provide fundamental support to graph-intensive applications in multiple domains.
To my parents and sister
ACKNOWLEDGMENTS

There are no words to express my gratitude to my adviser, Prof. Jiawei Han. The research presented in this dissertation would not have happened without his support, guidance, and encouragement. Nearly every aspect of my research has been improved due to his mentoring. I was fortunate to spend two summers with Dr. Philip S. Yu at IBM Research, who helped me define an important part of my doctoral work. Thanks also to Dr. Jasmine Xianghong Zhou, who brought me into the fantastic field of computational biology; it was always inspiring and exciting to work with her. I also felt honored to be a member of the Database and Information System Lab, where I found many dedicated collaborators: Chao Liu for automated software bug isolation, Hong Cheng and Dong Xin for pattern summarization and interpretation, and Feida Zhu for complexity analysis.

It was my honor to have Dr. Christos Faloutsos, Dr. Marianne Winslett, and Dr. ChengXiang Zhai as my Ph.D. committee members. I am very grateful to them for providing insightful comments regarding this dissertation.

I am also greatly indebted to many teachers in the past who educated me and got me interested in scientific research. A special thank-you goes to my primary school teacher Jingzhi Sun and my middle school mathematics teacher Shanshan Wu.

I would like to thank my parents and sister for their love, trust, and encouragement through hard times and for their unconditional support, which enabled me to pursue my interests overseas. This research is funded in part by the U.S. National Science Foundation grants NSF IIS-0209199, IIS-0308215, CCR-0325603, and DBI-0515813.
Table of Contents

List of Figures
List of Tables
Glossary of Notation

Chapter

1 Introduction
2.4 Variant Graph Patterns
2.4.1 Contrast Graph Pattern
2.4.2 Coherent Graph Pattern
2.4.4 Dense Graph Pattern
2.4.5 Approximate Graph Pattern
2.5.1 Clustering
2.5.2 Relevance-Aware Top-K
2.6 Pattern-Based Classification
2.7 Automated Software Bug Isolation
2.7.1 Uncover "Backtrace" for Noncrashing Bugs
3 Graph Patterns with Constraints
3.1 Highly Connected Graph Patterns
3.1.1 CloseCut: A Pattern Growth Approach
3.1.2 SPLAT: A Pattern Reduction Approach
3.1.3 Experiments
4.5.2 Index Construction
4.5.5 Insert/Delete Maintenance
4.6 Experiments
5 Graph Similarity Search
5.1 Substructure Similarity Search
List of Figures

A Big Picture
Program Caller/Callee Graphs
Frequent Graph Patterns
Right-Most Extension
Lexicographic Search Tree
Extended Subgraph Isomorphism
Failure of Early Termination
Detect the Failure of Early Termination
Mining Performance in Class CA Compounds
Discovered Patterns in Class CA Compounds
Performance vs. Varying Parameters
Pattern Summarization: Top-k, Clustering, and Relevance-aware Top-k
Classification Accuracy Boost
Entrance Precision and Exit Precision
Pattern vs. Data Antimonotonicity
Pruning Properties of Graph Constraints
Number of Highly Connected Patterns
Genes Related with Subtelomerically Encoded Proteins
Genes Having Helicase Activity
Genes Involved in Ribosomal Biogenesis
Genes Involved in rRNA Processing
Size-increasing Support Functions
GraphGrep vs. gIndex: Index Size
GraphGrep vs. gIndex: Performance
gIndex: Sensitivity and Scalability
Index Incremental Maintenance
Sampling-based Index Construction
GraphGrep vs. gIndex: Performance on Synthetic Datasets
A Chemical Database
Feature-Graph Matrix Index
Edge-Feature Matrix
Frequency Difference
Geometric Interpretation
A Query Graph
Superposition
Overlapping-Relation Graph
Greedy Partition Selection
PIS: Performance on Chemical Datasets
PIS: Parameter Sensitivity
List of Tables

2.1 DFS code for Figures 2.5(b), 2.5(c), and 2.5(d)
2.2 Parameters of Synthetic Graph Generator
2.3 Bug-Relevant Functions with θ = 20%
3.1 Parameters of Synthetic Relational Graph Generator
4.1 Sufficient Sample Size Given ε, δ, and p
Glossary of Notation

set minus
vertex set of graph G
edge set of graph G
vertex label set
edge label set
data set
supporting data set of pattern a
support of a
minimum support
subpattern set of a
edge extension
Chapter 1
Introduction
Data mining, as well as database systems research, is facing a new challenge raised by the emergence of large volumes of network and graph data, which are pervasive in bioinformatics, chem-informatics, the Web, and many other applications. Due to their adaptive capability of modeling complicated structures, such as proteins, images, documents, and other schemaless data, graph representations of data are well accepted in domains ranging from software engineering to computational biology. In computer vision, graphs are used to represent the organization of features in images, where the interlinks between features are critical in the recognition of scenes and objects. In chemical informatics and bioinformatics, scientists use graphs to represent compounds and proteins. Systems for searching and registering chemical compounds have already been developed. Benefiting from such systems, researchers can do screening, designing, and knowledge discovery in large-scale compound and molecular data sets. Figure 1.1 shows three kinds of real graphs: program caller/callee flow, protein structure, and chemical compound. Figure 1.2 shows a protein-protein interaction network [46], where edges denote the interactions between proteins.
1.1 Motivation
Overall, the applications of knowledge discovery in graphs are numerous. In software engineering, it is observed that suspicious buggy code can be automatically identified through analyzing the classification accuracy change [71] in program traces (e.g., the program flow graph in Figure 1.1). In computational biology, studying the building principles of biological networks could potentially revolutionize our view of biology and disease pathologies [9]. For example, by aligning multiple protein-protein interaction networks, researchers found conserved interaction pathways and complexes of distantly related species. The discovery of conserved subnetworks could measure evolutionary distance at the level of network connectivity rather than at the
Figure 1.2: Protein-Protein Interaction Network
level of DNA or protein sequence [8] In the analysis of gene relevance networks, the recurrentdense subnetworks allow researchers to infer conceptual functional associations of genes undervarious conditions for many model organisms, thus providing valuable information to study thefunctions and the dynamics of biological systems [55] In the discovery of new drugs, based
on the activity of known chemical compounds, structure classification techniques could provide quantitative structure-activity relationship analysis (QSAR [50]). In contrast to local structures in graphs, the global properties of large graphs, such as diameter, degree, and density, are useful in characterizing their topology. The "small-world phenomenon" states that real graphs have surprisingly small diameter [80, 112, 5, 14, 64, 113, 25]. In [37], Faloutsos et al. discovered surprisingly simple power-laws of the Internet. The power law can be applied to detecting link
spams (see e.g., [51]) that aggressively interlink webpages in order to manipulate ranking results[41] Leskovec et al [60] observed that a wide range of real graphs, such as the Internet and the
Web, densify over time, and the phenomena of graph densification and shrinkage are explainable according to a "forest fire" spreading process.
The applications discussed so far are only the tip of the iceberg of what graph data miningcan achieve It was surprising that even very basic graph mining and search problems, described
as follows, have not been systematically examined in these fields, while their solutions are critical
to the success of many graph-related applications
1. Graph Mining: Given a graph data set, what are the interesting hidden structural patterns and how can we find them?

2. Graph Indexing and Similarity Searching: How can we index graphs and perform searching, either exactly or approximately, in large graph databases?
The first problem involves a data mining task that defines and discovers interesting structural patterns in a graph data set. One kind of common structural pattern is the frequent subgraph. Given a graph database D, |D_g| is the number of graphs in D where g is a subgraph; |D_g| is called the (absolute) support of g. A graph g is frequent if its support is no less than a minimum support threshold. Frequent subgraphs are useful in characterizing graphs, discriminating between different groups of graphs, classifying and clustering graphs, and building graph indices. Borgelt and Berthold [13] illustrated the discovery of active chemical structures in an
HIV-screening data set by contrasting the support of frequent graphs between different classes
Deshpande et al [31] used frequent structures as features to classify chemical compounds Huan
et al [56] successfully applied the frequent graph mining technique to study protein structural
families Koyuturk et al [65] proposed a method to detect frequent subgraphs in biologicalnetworks: they observed considerably large frequent sub-pathways in metabolic networks Due
to its essential role in graph intensive applications, scalable graph pattern mining algorithmsbecome greatly sought-after in multiple disciplines, although graph pattern mining itself is an
expensive computational problem since subgraph isomorphism is NP-complete [27]
The scalability of graph pattern mining is not the only issue a user has to face when graph patterns are put into practice. The huge number of redundant patterns, often in the millions or even billions, makes them difficult to explore, which prohibits their interpretation and eventual utilization in many fields that could potentially benefit from them. This issue exists not only for complex structural patterns, but also for simple patterns, such as frequent itemsets and sequential patterns. Furthermore, without pattern post-processing, it might be infeasible to construct advanced data mining tools, such as classification and clustering, on a large set of graph patterns. Therefore, a good solution is needed. Pattern summarization provides one such solution by summarizing or selecting patterns using K representatives, which should not only
be significant, but also distinctive
In addition to general purpose frequent graph pattern mining, sometimes a user may only be interested in patterns with specific structural constraints. For example, in computational biology, a highly connected subgraph could represent a set of genes within the same
functional module, i.e., a set of genes participating in the same biological pathways [19] In
chem-informatics, scientists are often interested in frequent graphs that contain a functionalfragment, e.g., a benzene ring In both examples, users have control on certain properties ofthe mining results When the mining of general frequent patterns becomes costly, a good way
of finding constrained patterns is to push constraints deeply into the mining process Thechallenge is at finding the optimal degree of integration which could vary dramatically for dif-ferent structural constraints A systematic investigation of possible integration mechanismscould revolutionize the constraint-based graph mining and outline a general constraint miningframework
The second problem, graph indexing and similarity search, involves a query processing taskthat defines and searches query graphs in a graph database In chemistry, the structures andproperties of newly discovered or synthesized chemical molecules are studied, classified, and
recorded for scientific and commercial purposes ChemIDplus [86], a free data service offered
by the National Library of Medicine (NLM), provides access to structure and nomenclature
information Users can query molecules by their names, structures, toxicity, and even weight in
a convenient way through its web interface. Given a query structure, ChemIDplus can quickly
identify a small subset of relevant molecules for further analysis [47, 114], thus shortening
the discovery cycle in drug design and other scientific activities Nevertheless, the usage of agraph database as well as its query system is not confined to chemical informatics only In
computer vision and pattern recognition [95, 78, 11], graphs are used to represent complex
structures, such as hand-drawn symbols, 3D objects, and medical images. Researchers extract graph models from objects and compare them in order to identify unknown objects and scenes. Developments in bioinformatics also call for efficient mechanisms for querying a large number of biological pathways and protein interaction networks. These networks are usually very complex
with embedded multi-level structures [61]
The classical graph query problem is formulated as follows: given a graph database D = {G1, G2, ..., Gn} and a graph query Q, find all the graphs in which Q is a subgraph. It is inefficient to perform a sequential scan on the graph database and check whether Q is a subgraph of each Gi. Sequential scan is costly because one has to not only access the whole graph database but also check subgraph isomorphism one by one. Furthermore, existing database infrastructures answer graph queries in an inefficient manner. For example, the indices built on the labels of vertices or edges are usually not selective enough to distinguish complicated, interconnected structures. Therefore, new indexing mechanisms have to be devised in order to support graph query processing in large graph databases.
Besides the exact search scenario mentioned above, a common problem in substructure search is: what if there is no match or very few matches for a given query graph? In this situation, a subsequent query refinement process has to be undertaken in order to find the structures of interest. Unfortunately, it is often too time-consuming for a user to perform manual refinements. One solution is to ask the system to find graphs that nearly contain the entire query graph. This similarity search strategy is appealing since a user only needs to vaguely define a query graph and then lets the system handle the rest of the refinement. The query could be relaxed progressively until a relaxation threshold is reached or a reasonable number of matches are found.
Graph pattern mining and search find a wealth of applications in diversified fields Inaddition to the two fundamental problems addressed in this dissertation, we also examined somepractical problems arising in bioinformatics, computer systems, and software engineering Inthe past decade, rapid advances in biological and medical research, such as functional genomicsand proteomics, have accumulated an overwhelming amount of bio-medical data In computersystems, the amount of useful data generated by various systems is ever increasing, such assystem logs, program execution traces, and click-streams These kinds of data actually provide
us with the chance to examine the techniques developed in our study We are going to examinetwo applications from bioinformatics and software engineering
1.2 Contributions
This dissertation is focused on techniques regarding graph pattern, graph search, and severalextrapolated topics, including pattern interpretation and summarization, pattern-based classi-fication, constraint pattern mining, and graph similarity search The analytical algorithms andtools developed for these topics could leverage the usage of graph patterns in a broad spectrum,
as demonstrated in the novel applications discovered in bioinformatics, computer systems andsoftware engineering This dissertation makes key contributions in the following areas
Frequent Graph Mining [120, 121]
Existing frequent graph mining solutions generate inevitable overheads since they rely on theApriori property [3] to join two existing patterns to form larger candidates [59, 66, 110] We
developed non-Apriori methods to avoid these overheads [120]. Our effort led to the discovery
of a new graph canonical labeling system, called DFS coding, which shows both theoreticallyand empirically that this kind of join operation is unnecessary The mining approach, gSpan,built on the DFS coding could reduce the computational cost dramatically, thus making themining efficient in practice In addition to frequent graph mining, we observed that most
frequent subgraphs actually deliver nothing but redundant information if they have the same
support This often makes further analysis on frequent graphs nearly impossible Therefore, weproposed to mine closed frequent patterns [121], which are much smaller in number, but conservethe same information as frequent graphs The significance of (closed) frequent graphs is due
to their fundamental role in supporting higher level mining and search algorithms, includingsummarization, classification, indexing and similarity search
Pattern Summarization and Pattern-Based Classification [119, 118, 117]
We examined how to summarize a collection of frequent patterns using only & representatives,which is a long standing problem that prohibits the application of frequent patterns The &representatives should not only cover most of the frequent patterns but also approximate their
supports A generative model and a clustering approach [119, 118, 117] were developed to
extract and profile these representatives, under which the patterns' support can be easily restored without accessing the original dataset. Based on the restoration error, a quality measure function was devised to determine the optimal value of the parameter k.

The application of frequent patterns in classification appeared in sporadic studies and achieved initial success in the classification of relational data, text documents, and graphs. However, there is a lack of theoretical analysis of their principles in classification. We built a connection between pattern frequency and discriminative measures, such as information gain and Fisher score, thus providing solid reasons supporting this methodology. Through our study,
it was also demonstrated that feature selection on frequent patterns is able to generate highquality features for classifiers
Pattern summarization, together with pattern-based classification, confirmed the importance of pattern post-processing that was first revealed in our pattern-based graph indexing method. With pattern post-processing, a series of pattern-based data mining and database tools were discovered.

Graph Pattern Mining with Constraints [126, 55]
We re-examined the constraint-based mining problem and explored how to push sophisticatedstructural constraints deeply into graph mining A new general framework was developed thatexplores constraints in both pattern space and data space Traditional constraint-based min-ing frameworks only explore the pruning in pattern space which is unfortunately not so usefulfor pushing many structural constraints We proposed new concepts of constraint pushing,including weak pattern antimonotonicity and data antimonotonicity, which could effectivelyprune both the pattern and data spaces The discovery of these antimonotonicities is a sig-
nificant extension to the known classification of constraints and deepens our understanding of
characterizing the pruning properties of structural constraints
Graph Indexing [123, 124]
A graph indexing model was designed to support graph search in large graph databases [123,
124] The model is based on discriminative frequent structures that are identified through agraph mining process Since discriminative frequent structures capture the shared character-
istics of data, they are relatively stable to database updates, thus facilitating sampling-based
feature extraction and incremental index maintenance It was shown that the compact indexbuilt under this model can achieve better performance in processing graph queries This modelnot only provides a solution to the graph indexing problem, but also demonstrates how databaseindexing and query processing can benefit from data mining, especially frequent pattern min-ing The concepts developed for graph indexing can be generalized and applied to indexingsequences, trees, and other complicated structures as well
Graph Similarity Search [125, 128, 127]
The issues of substructure similarity search using indexed features in graph databases werestudied By transforming the similarity ratio of a query graph into the maximum allowedfeature misses, a structural filtering algorithm was proposed, which can filter graphs without
performing pairwise similarity computation [125] It was also shown that using either too few
or too many features can result in poor filtering performance We proved that the complexity
of optimal feature set selection is O(2^m) in the worst case, where m is the number of features
for selection [128] In practice, we identified several criteria to build effective feature sets forfiltering, and demonstrated that combining features with similar size and selectivity can improvethe filtering and search performance significantly within a multi-filter framework The proposedfeature-based filtering concept can be generalized and applied to searching approximate non-consecutive sequences, trees, and other structured data as well
In addition to structural similarity search, we also explored the retrieval problem of structures with categorical or geometric distance constraints. A method called Partition-based Graph Index and Search (PIS) [127] was implemented to support similarity search on substructures
with superimposed distance constraints PIS selects discriminative fragments in a query graphand uses an index to prune the graphs that violate the distance constraints A feature selec-tion criterion was set up to distinguish the selectivity of fragments in multiple graphs, and apartition method was invented to obtain a set of highly selective fragments
Applications in Bioinformatics, Computer Systems and Software Engineering
[126, 55, 89, 71, 73, 72]
Two applications were well examined in our study: one is gene relevance analysis for functionalannotation; the other is program flow analysis for automated software bug isolation In our
recent collaboration with bioinformaticians at the University of Southern California, we were
able to discover biological modules across an arbitrary number of biological networks using scalable
frequent graph mining algorithms [126, 55] Together with computer system researchers at
the University of Illinois at Urbana-Champaign, we have successfully developed multiple datamining techniques to enhance the performance and reliability of computer systems, such asimproving the effectiveness of storage caching and isolating software bugs by mining source
code and runtime data [71, 73, 72] As demonstrated in our software bug isolation solution,
based on graph classification, the analysis of program execution data could disclose importantpatterns and outliers that may help the discovery of bugs These studies actually herald aprosperous future in an interdisciplinary study between data mining and other disciplines
1.3 Organization
Figure 1.3: A Big Picture
This dissertation work can be viewed from two different perspectives, in terms of graphs andpatterns From the graph perspective, the work has three major pieces: graph mining, graphsearch and their applications From the pattern perspective, the work is driven by exploration
of graph patterns and their post-processing methods for distilling the most useful patternsfor various applications Figure 1.3 illustrates this view globally Specifically, starting with
a general purpose graph mining algorithm, this dissertation investigates the usage of graph
patterns in classification, indexing, and similarity searching, and demonstrates the importance
of pattern post-processing, which could boost the performance of pattern-based methods. The rest of this dissertation is organized as follows. Chapter 2 discusses graph pattern mining. Preliminary results on pattern summarization and pattern-based classification are also included in Chapter 2. Constraint-based graph pattern mining is explored in Chapter 3. Chapter 4 introduces graph indexing and searching, followed by a detailed investigation of graph similarity searching in Chapter 5. The conclusions of this study are presented in Chapter 6.
Chapter 2
Graph Pattern Mining
Frequent pattern mining is a major research theme in data mining with many efficient and
scalable techniques developed for mining association rules [2, 3], frequent itemsets [10, 48, 18,132], sequential patterns [4, 76, 94], and trees [130, 7] With the increasing complexity of data,
many scientific and commercial applications demand for mining the hidden structural patterns
in large data sets, which go beyond sets, sequences, and trees into graphs This chapter presentsthe main ideas of frequent graph mining algorithms and explores potential issues arising fromapplications of frequent patterns
Definition 1 (Labeled Graph) Given two alphabets Σ_V and Σ_E, a labeled graph is a triple G = (V, E, l), where V is a finite set of nodes, E ⊆ V × V, and l is a function describing the labels of V and E, l : V → Σ_V, E → Σ_E.
As a general data structure, labeled graphs are used to model complicated relationships in data. A labeled graph is a graph with labels assigned to its nodes and edges. These assignments do not have to be unique, i.e., nodes or edges can have the same label. However, when the assignments of node labels are unique, such a graph is called a relational graph. This dissertation has for its subject matter the mining and search of general labeled graphs. An unlabeled graph can be regarded as a special case of a labeled graph.
Definition 2 (Relational Graph) A relational graph is a labeled graph G = (V, E, l), where l(v) ≠ l(u) for all v ≠ u, v, u ∈ V.
A graph G is a subgraph of another graph G′ if there exists a subgraph isomorphism from G to G′, written G ⊆ G′.
Definition 3 (Subgraph Isomorphism) A subgraph isomorphism is an injective function f : V(G) → V(G′), such that (1) ∀u ∈ V(G), f(u) ∈ V(G′) and l(u) = l′(f(u)), and (2) ∀(u, v) ∈ E(G), (f(u), f(v)) ∈ E(G′) and l(u, v) = l′(f(u), f(v)), where l and l′ are the label functions of G and G′, respectively. f is called an embedding of G in G′.
Definition 4 (Frequent Graph Pattern) Given a labeled graph data set D = {G1, G2, ..., Gn}, support(g) is the number of graphs in D where g is a subgraph. D_g = {G | g ⊆ G, G ∈ D} is the supporting data set of g. A graph is frequent if its support is no less than a minimum support threshold, min_support. The frequency of g is written θ(g) = |D_g|/|D|. The minimum frequency threshold is denoted by θ0 = min_support/|D|. A frequent graph is called a graph pattern.

Figure 2.1: Program Caller/Callee Graphs (nodes: 1 makepat, 2 esc, 3 addstr, 4 getccl, 5 dodash, 6 in_set_2, 7 stclose)

Figure 2.2: Frequent Graph Patterns

Example 1 Figure 2.1 shows segments of program caller/callee graphs derived from three different runs of a program "replace" from the Siemens Suite [58, 98] (http://www.cc.gatech.edu/aristotle/Tools/subjects), a regular expression matching and substitution utility. Each node represents a function (or a procedure) in "replace". Taking the run corresponding to the third graph for instance, getccl, addstr, esc, in_set_2, and stclose are subtasks of function makepat; they work together to complete the task associated with makepat. As to transitions, the dashed arrow from getccl to addstr means that addstr is called immediately after getccl returns. Figure 2.2 depicts two of the frequent subgraphs in the data set shown in Figure 2.1, assuming that the minimum support is equal to 2.
The discovery of frequent graphs usually consists of two steps. The first step generates frequent subgraph candidates, while the frequency of each candidate is checked in the second step. The second step involves a subgraph isomorphism test, which is NP-complete [27]. Many well-known pairwise isomorphism testing algorithms have been developed, e.g., J. R. Ullmann's backtracking [108] and B. D. McKay's Nauty [77]. Most studies of frequent subgraph discovery focus on the first step.
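To make the two steps concrete, the following minimal Python sketch counts support by brute force. It assumes small graphs represented as two label dictionaries (a hypothetical representation chosen for clarity, not part of any published system, and restricted to simple graphs without self-loops); the backtracking check is the naive search that Ullmann-style algorithms refine with pruning.

```python
def embeddings_exist(g, G):
    """Naive backtracking test for a subgraph isomorphism (embedding) of g into G.
    A graph is a pair ({vertex: label}, {frozenset({u, v}): edge_label})."""
    (gv, ge), (Gv, Ge) = g, G

    def extend(mapping):
        if len(mapping) == len(gv):
            return True                                    # every pattern vertex is mapped
        u = next(v for v in gv if v not in mapping)        # next pattern vertex to map
        for w in Gv:                                       # try every candidate image
            if w in mapping.values() or gv[u] != Gv[w]:
                continue                                   # injectivity and vertex-label match
            ok = True
            for e, lab in ge.items():
                if u in e:
                    other = next(iter(e - {u}))
                    if other in mapping:                   # both endpoints mapped: edge must exist
                        if Ge.get(frozenset({w, mapping[other]})) != lab:
                            ok = False
                            break
            if ok and extend({**mapping, u: w}):
                return True
        return False

    return extend({})

def support(g, database):
    """Absolute support of g: the number of database graphs that contain g (Definition 4)."""
    return sum(1 for G in database if embeddings_exist(g, G))
```

Trying every candidate image makes the check exponential in the worst case, which is exactly why the rest of this chapter is devoted to avoiding redundant candidates rather than to the test itself.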
2.1 Apriori-based Mining
The initial frequent graph mining algorithm, called AGM, was proposed by Inokuchi et al [59],which applies the Apriori property (Lemma 1) to generate larger pattern candidates by mergingtwo (or more) frequent graphs Its mining methodology is similar to the Apriori-based itemset
mining [3] This Apriori property is also used by other frequent graph mining algorithms such
as FSG [66] and the path-join algorithm [110] In these algorithms, frequent graphs are often
searched in a bottom-up manner by generating candidates having an extra vertex, edge, orpath
Lemma 1 (Apriori Property) All nonempty subgraphs of a frequent graph must be frequent
The general framework of Apriori-based methods is outlined in Algorithm 1 Let S; bethe frequent subgraph set of size k The definition of graph size will be given later withcorresponding implementations Algorithm 1 adopts a level-wise mining methodology At eachiteration, the size of newly discovered frequent subgraphs is increased by one These newsubgraphs are first generated by joining two similar frequent subgraphs that are discovered inthe last call of Algorithm 1 The newly formed graphs are then checked for their frequency.The frequent ones are used to generate larger candidates in the next round
The main design complexity of Apriori-based algorithms comes from the candidate generation step. Although candidate generation for frequent itemset mining is straightforward, the same problem in the context of graph mining is much harder, since there are many ways to merge two graphs.

AGM [59] proposed a vertex-based candidate generation method that increases the graph
size by one vertex at each iteration of Algorithm 1 Two size-k frequent graphs are joined only
when they share the same size-(k — 1) subgraph Here the size of a graph means the number
of vertices in a graph The newly formed candidate includes the common size-(k — 1) subgraphand the additional two vertices from the two size-k patterns Figure 2.3 depicts two subgraphsjoined by two chains
FSG proposed by Kuramochi and Karypis [66] adopts an edge-based method that increases
the graph size by one edge in each call of Algorithm 1. In FSG, two size-k graph patterns are joined only when they share a common subgraph with k − 1 edges; the newly formed candidate consists of this shared core together with the two additional edges, as illustrated in Figure 2.4.

Algorithm 1 Apriori(D, min_support, S_k)

Input: A graph data set D and min_support.
Output: The frequent graph set S_{k+1}.

1: S_{k+1} ← ∅;
2: for each frequent g_i ∈ S_k do
3:    for each frequent g_j ∈ S_k do
4:       for each size-(k + 1) graph g formed by the merge of g_i and g_j do
5:          if g is frequent in D and g ∉ S_{k+1} then
6:             insert g into S_{k+1};
7: if S_{k+1} ≠ ∅ then
8:    call Apriori(D, min_support, S_{k+1});
9: return;
Figure 2.4: FSG
Other Apriori-based methods such as the edge disjoint path method proposed by Vanetik et
al [110] use more complicated candidate generation procedures For example, in [110], graphs
are classified by the number of disjoint paths they have A graph pattern with k + 1 disjointpaths is generated by joining graphs with k disjoint paths
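Read as executable pseudocode, the level-wise framework of Algorithm 1 looks roughly as follows. This is a sketch of the control flow only: merge_candidates stands for an AGM- or FSG-style merge and support for the frequency check sketched earlier; both are caller-supplied placeholders rather than part of any published implementation, and the duplicate check itself already requires graph isomorphism testing, one of the overheads discussed above.

```python
def apriori_graph_mining(database, min_support, frequent_1, merge_candidates, support):
    """Level-wise candidate-generation-and-test loop (framework of Algorithm 1).
    frequent_1: the frequent size-1 patterns; merge_candidates and support are assumed helpers."""
    all_frequent = list(frequent_1)
    current = list(frequent_1)
    while current:
        next_level = []
        for gi in current:
            for gj in current:
                # merge two size-k patterns into size-(k+1) candidates
                for cand in merge_candidates(gi, gj):
                    if cand in next_level:                       # duplicate elimination
                        continue
                    if support(cand, database) >= min_support:   # frequency check
                        next_level.append(cand)
        all_frequent.extend(next_level)
        current = next_level                                     # descend one level
    return all_frequent
```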
Apriori-based algorithms have considerable overheads when joining two size-k frequent graphs to generate size-(k + 1) graph candidates. In order to avoid such overheads, non-Apriori-based algorithms such as gSpan [120], MoFa [13], FFSM [57], SPIN [96], and Gaston [84] have been developed recently. These algorithms are inspired by PrefixSpan [94], TreeMinerV [130], and FREQT [7] at mining sequences and trees, respectively. All of these algorithms adopt the pattern-growth methodology [48], which extends patterns from a single pattern directly.
2.2 Pattern Growth-based Mining
Algorithm 2 (PatternGrowth) illustrates a framework of pattern growth-based frequent graph mining algorithms. A graph g can be extended by adding a new edge e. The newly formed graph is denoted by g ◦x e. Edge e may or may not introduce a new vertex to g. For each discovered graph g, PatternGrowth performs extensions recursively until all the frequent graphs with g embedded are discovered. The recursion stops once no frequent graph can be generated any more.
Algorithm 2 PatternGrowth(g, D, min_support, S)
Input: A frequent graph g, a graph data set D, and min_support
Output: A frequent graph set S
1: if g ∈ S then return;
2: else insert g into S;
3: scan D once, find all the edges e such that g can be extended to g ◦x e;
4: for each frequent g ◦x e do
5:    Call PatternGrowth(g ◦x e, D, min_support, S);
6: return;
Algorithm 2 is simple, but not efficient The bottleneck is at the inefficiency of extending
a graph The same graph can be discovered many times For example, there may exist n
different (n — 1)-edge graphs which can be extended to the same n-edge graph The repeated
discovery of the same graph is computationally inefficient A graph that is discovered at thesecond time is called duplicate graph Although Line 1 of Algorithm 2 gets rid of duplicategraphs, the generation and detection of duplicate graphs may cause additional workloads Inorder to reduce the generation of duplicate graphs, each frequent graph should be extended asconservatively as possible in a sense that not every part of each graph should be extended Thisprinciple leads to the design of gSpan gSpan limits the extension of a graph only to the nodes
along its right-most path (See Section 2.2.2), while the completeness of its mining result is still
guaranteed. The completeness comes with a set of techniques including mapping a graph to
a DFS code (DFS Coding), building a lexicographic ordering among these codes, and miningDFS codes based on this lexicographic order
2.2.1 DFS Subscripting
Depth-first search is adopted by gSpan to traverse graphs. Initially, a starting vertex is randomly chosen and the vertices in a graph are marked so that one can tell which vertices have been visited. The visited vertex set is expanded repeatedly until a full DFS tree is built. One graph may have various DFS trees depending on how the depth-first search is performed, i.e., on the vertex visiting order. The darkened edges in Figures 2.5(b)-2.5(d) show three DFS trees for the same graph shown in Figure 2.5(a) (the vertex labels are X, Y, and Z; the edge labels are a and b; the alphabetic order is taken as the default order on the labels). When building a DFS tree, the visiting sequence of vertices forms a linear order, illustrated with subscripts: i < j means v_i is visited before v_j when the depth-first search is performed. A graph G subscripted with a DFS tree T is written as G_T; T is named a DFS subscripting of G.
Given a DFS tree T, the starting vertex in T, v_0, is called the root, and the last visited vertex, v_n, the right-most vertex. The straight path from v_0 to v_n is called the right-most path. In Figures 2.5(b)-2.5(d), three different subscriptings are generated based on the corresponding DFS trees. The right-most path is (v_0, v_1, v_3) in Figures 2.5(b) and 2.5(c), and (v_0, v_1, v_2, v_3) in Figure 2.5(d).
Figure 2.5: DFS Subscripting
Given a graph G with a DFS tree T, the forward edge (tree edge [29]) set contains all the edges in the DFS tree, denoted by E_f, and the backward edge (back edge [29]) set contains all the edges which are not in the DFS tree, denoted by E_b. For example, the darkened edges in Figures 2.5(b)-2.5(d) are forward edges while the undarkened ones are backward edges. From now on, an edge (v_i, v_j) (also written as (i, j)) is viewed as an ordered pair. If (v_i, v_j) ∈ E(G) and i < j, it is a forward edge; otherwise, a backward edge. A forward edge of v_i means there exists a forward edge (i, j) such that i < j; a backward edge of v_i means there exists a backward edge (i, j) such that i > j. In Figure 2.5(b), (1,3) is a forward edge of v_1, but not a forward edge of v_3; (2,0) is a backward edge of v_2.
2.2.2 Right-Most Extension

Given a graph G and a DFS tree T in G, a new edge e can
extension); or it can introduce a new vertex and connect to vertices on the right-most path
(forward extension) Both kinds of extensions are regarded as right-most extension, denoted by
Go, e (for brevity, 7 is omitted here)
Example 2 For the graph in Figure 2.5(b), the backward extension candidates can be (v3, 00)
The forward extension candidates can be edges extending from 0a, v1, or vg with a new vertexintroduced
Figures 2.6(b)-2.6(g) show all the potential right-most extensions of Figure 2.6(a) (the ened vertices consist the rightmost path) Among them, Figures 2.6(b)- 2.6(d) grow from the
dark-rightmost vertex while Figures 2.6(e)-2.6(g) grow from other vertices on the dark-rightmost path
Figures 2.6(b.0)-2.6(b.4) are children of Figure 2.6(b), and Figures 2.6(f.0)-2.6(f.3) are children
of Figure 2.6(f) In summary, backward extension only takes place on the rightmost vertexwhile forward extension introduces a new edge from vertices on the rightmost path
Since many DFS trees/subscriptings may exist for the same graph, one of them is chosen as
the base subscripting and right-most extension is only conducted on that DFS tree/subscripting
2.2.3 DFS Coding
Each subscripted graph is transformed to an edge sequence so that an order is built amongthese sequences The goal is to select the subscripting which generates the minimum sequence
Trang 32as its base subscripting There are two kinds of orders in this transformation process: (1) edgeorder, which maps edges in a subscripted graph into a sequence; and (2) sequence order, which
builds an order among edge sequences, i.e., graphs
Intuitively, DFS tree defines the discovery order of forward edges For the graph shown in
Figure 2.5(b), the forward edges are visited in the order of (0,1), (1,2), (1,3) Now backward
edges are put into the order as follows Given a vertex v, all of its backward edges shouldappear just before its forward edges If v does not have any forward edge, its backward edges
are put after the forward edge where v is the second vertex For vertex v2 in Figure 2.5(b),its backward edge (2,0) should appear after (1,2) since ve does not have any forward edge
Among the backward edges from the same vertex, an order is enforced Assume that a vertex
vj has two backward edges (i, 71) and (i, j2) If j1 < 7a, then edge (4,31) will appear beforeedge (72) So far, the ordering of the edges in a graph completes Based on this order, agraph can be transformed into an edge sequence A complete sequence for Figure 2.5(b) is(0, 1), (1, 2), (2,0), (1,3)
Formally define a linear order, <7, in N? such that e; <r e holds if and only if one of thefollowing statements is true (assume e = (#1, 71), e2 = (t2, j2)):
(i) e\,ey € Ef, and ƒt < fo or iy > ig A fy = jo.
(ii) e1,€2 € ES, and i) < iy orii = to A jt < jo.
(ii) e, € Eb, ca € BL, and iy < jo.
(iv) e1, € HỆ, eg € Tỳ, and 71 <2.
Example 3 According to the above definition, in Figure 2.5(b), (0,1) <7 (1,2) as case (i),(2,0) <r (1,3) as case (iii), (1,2) <r (2,0) as case (iv) Add a new backward edge between
vertices v3 and vo, then (2,0) <r (3,0) as case (ii) Note that in one graph, it is impossible to
have the second condition of case (i) However, this condition becomes useful when a sequenceorder (DFS lexicographic order, as illustrated in the next subsection) is built on this edge order
In that case, the edges from different graphs have to be compared, where the second condition
of case (i) may take place
Definition 5 (DFS Code) Given a subscripted graph Gr, an edge sequence (e;) can be structed based on relation <r, such that e; <7 e;41, tuhere ¡ =0, ,|E|[—1 The edge sequence
con-(e;) is a DFS code, written as code(G,T)
Example 4 For simplicity, an edge is encoded by a ð-tuple, (#, 7, li,l(j),;), where 1; and 1;are the labels of v; and v,; respectively and [(, „) is the label of the edge connecting them Table2.1 shows three different DFS codes, yo, 71 and y2, generated by DFS subscriptings in Figures2.5(b), 2.5(c) and 2.5(d) respectively
Trang 33edge | Yo al %
eo | (0,1,X,a,X) | (0,1,X,a,X) | (0,1,Y,6, ( X)
€1 (1,2,X,a,Z) (1, 2,.X,6, Y) (1,2, X,a,X)
( ( )
e3 | (1,3,X,b,Y)
Table 2.1: DFS code for Figures 2.5(b), 2.5(c), and 2.5(d)
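The three codes of Table 2.1 can be written down directly as lists of 5-tuples, as in the small sketch below (which simply transcribes the table). Because their first points of difference share the same subscripts, the comparison reduces to the label tie-break, so plain tuple comparison already reproduces γ0 < γ1 < γ2 here; the general order needs the four cases of <_T and is handled in the sketch given after Theorem 1.

```python
# DFS codes from Table 2.1: one 5-tuple (i, j, l_i, l_(i,j), l_j) per edge
gamma0 = [(0, 1, 'X', 'a', 'X'), (1, 2, 'X', 'a', 'Z'), (2, 0, 'Z', 'b', 'X'), (1, 3, 'X', 'b', 'Y')]
gamma1 = [(0, 1, 'X', 'a', 'X'), (1, 2, 'X', 'b', 'Y'), (1, 3, 'X', 'a', 'Z'), (3, 0, 'Z', 'b', 'X')]
gamma2 = [(0, 1, 'Y', 'b', 'X'), (1, 2, 'X', 'a', 'X'), (2, 3, 'X', 'b', 'Z'), (3, 1, 'Z', 'a', 'X')]

# gamma0 and gamma1 first differ at e1 (same subscripts, label a < b);
# gamma1 and gamma2 first differ at e0 (same subscripts, label X < Y)
assert sorted([gamma2, gamma0, gamma1]) == [gamma0, gamma1, gamma2]
```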
Through DFS coding, a one-to-one mapping is built between a subscripted graph and a DFS
code (a one-to-many mapping between a graph and DFS codes) When the context is clear, a
subscripted graph and its DFS code are regarded as the same All the notations on subscriptedgraphs can also be applied to DFS codes The graph represented by a DFS code a is written
as Gq
2.2.4 DFS Lexicographic Order
The previous discussion has illustrated how to transform a graph into a DFS code. For labeled graphs, the label information should be considered as one of the ordering factors. The labels of vertices and edges are used to break the tie when two edges have exactly the same subscripts but different labels. Let relation <_T take the first priority, the vertex label l_i take the second priority, the edge label l_(i,j) take the third, and the vertex label l_j take the fourth to determine the order of two edges. For example, the first edge of the three DFS codes in Table 2.1 is (0,1,X,a,X), (0,1,X,a,X), and (0,1,Y,b,X), respectively. All of them share the same subscripts (0,1), so relation <_T cannot tell the difference among them. With label information, following the order of first vertex label, edge label, and second vertex label, (0,1,X,a,X) <_e (0,1,Y,b,X). Suppose there is a linear order <_l in the label set L. The lexicographic combination of <_T and <_l forms the linear order <_e in the edge space N² × L × L × L (written as E). Based on this order, two DFS codes α = (a_0, a_1, ..., a_m) and β = (b_0, b_1, ..., b_n) have the relation α ≤ β if a_0 = b_0, ..., a_{t−1} = b_{t−1} and a_t <_e b_t (t ≤ min(m, n)). According to this order definition, we have γ0 < γ1 < γ2 for the DFS codes listed in Table 2.1.
Through the above discussion, an order is built in the DFS codes of one graph, which canfurther be applied to DFS codes derived from different graphs
Definition 6 (DFS Lexicographic Order) Let Z be the set of DFS codes of all graphs. Two DFS codes α and β have the relation α ≤ β (DFS lexicographic order in Z) if and only if one of the following conditions is true. Let α = code(G_α, T_α) = (a_0, a_1, ..., a_m) and β = code(G_β, T_β) = (b_0, b_1, ..., b_n).

(i) ∃t, 0 ≤ t ≤ min(m, n), a_k = b_k for all k s.t. k < t, and a_t <_e b_t;
(ii) a_k = b_k for all k s.t. 0 ≤ k ≤ m, and m ≤ n.
Definition 7 (Minimum DFS Code) Let Z(G) be the set of all DFS codes for a given graph G. The minimum DFS code of G, written as dfs(G), is a DFS code in Z(G) such that for each γ ∈ Z(G), dfs(G) ≤ γ.
Code γ0 in Table 2.1 is the minimum DFS code of the graph in Figure 2.5(a). The subscripting which generates the minimum DFS code is called the base subscripting. The DFS tree in Figure 2.5(b) shows the base subscripting of the graph in Figure 2.5(a). Let dfs(α) denote the minimum DFS code of the graph represented by code α.
Theorem 1 Given two graphs G and G′, G is isomorphic to G′ if and only if dfs(G) = dfs(G′).
Proof If G is isomorphic to G’, then there is an isomorphic function f : V(G) + V(G’) Given
a DFS subscripting of G, by assigning the subscript of v for each v € V(G) to f(v), a DFS
subscripting is thus built in G’ The DFS code produced by these two subscriptings of G andG’ must be the same, otherwise, f is not an isomorphic function between G and G’ Therefore,
Z(G) C Z(G’) Similarly, Z(G) D Z(G’) Hence, Z(G) = Z(G’) and dfs(G) = dfs(G’).
Conversely, if dfs(G) = dfs(G’), a function is derived by mapping vertices which have the
same subscript This function is an isomorphic function between G and G’ Hence, G isisomorphic to G’ "
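Theorem 1 turns isomorphism testing into a comparison of canonical forms. The brute-force Python sketch below makes the definitions concrete: it enumerates every DFS code of a small connected simple labeled graph (no self-loops or parallel edges), compares codes with the order of Definition 6 (the four cases of <_T plus the label tie-break), and takes the minimum. It is an illustration only; gSpan never materializes all codes of a graph.

```python
from functools import cmp_to_key

def dfs_codes(vlab, elab):
    """Enumerate every DFS code of a small connected labeled graph.
    vlab: {vertex: label}; elab: {frozenset({u, v}): edge label}."""
    adj = {v: [] for v in vlab}
    for e in elab:
        u, v = tuple(e)
        adj[u].append(v)
        adj[v].append(u)
    m, codes = len(elab), []

    def grow(code, sub, order, stack):
        if len(code) == m:                        # every edge emitted: one complete DFS code
            codes.append(list(code))
            return
        v, extended = stack[-1], False
        for u in adj[v]:
            if u in sub:
                continue
            extended = True
            i, j = sub[v], len(order)
            new = code + [(i, j, vlab[v], elab[frozenset((v, u))], vlab[u])]
            # backward edges of the new vertex follow its forward edge, smaller targets first
            for t in sorted(sub[w] for w in adj[u] if w in sub and w != v):
                new.append((j, t, vlab[u], elab[frozenset((u, order[t]))], vlab[order[t]]))
            grow(new, {**sub, u: j}, order + [u], stack + [u])
        if not extended and len(stack) > 1:       # dead end: backtrack along the DFS stack
            grow(code, sub, order, stack[:-1])

    for root in vlab:                             # any vertex may serve as v0
        grow([], {root: 0}, [root], [root])
    return codes

def edge_cmp(e1, e2):
    """Order on edge 5-tuples: the four structural cases, then (l_i, l_(i,j), l_j)."""
    (i1, j1), (i2, j2) = e1[:2], e2[:2]
    if (i1, j1) != (i2, j2):
        f1, f2 = i1 < j1, i2 < j2                 # forward iff i < j
        if f1 and f2:
            less = j1 < j2 or (j1 == j2 and i1 > i2)    # case (i)
        elif not f1 and not f2:
            less = i1 < i2 or (i1 == i2 and j1 < j2)    # case (ii)
        elif not f1:
            less = i1 < j2                              # case (iii)
        else:
            less = j1 <= i2                             # case (iv)
        return -1 if less else 1
    return (e1[2:] > e2[2:]) - (e1[2:] < e2[2:])        # same subscripts: labels decide

def code_cmp(c1, c2):
    """Definition 6: the first differing edge decides; otherwise the shorter prefix is smaller."""
    for a, b in zip(c1, c2):
        r = edge_cmp(a, b)
        if r:
            return r
    return (len(c1) > len(c2)) - (len(c1) < len(c2))

def min_dfs_code(vlab, elab):
    return min(dfs_codes(vlab, elab), key=cmp_to_key(code_cmp))

def isomorphic(g1, g2):
    """Theorem 1: two labeled graphs are isomorphic iff their minimum DFS codes agree."""
    return min_dfs_code(*g1) == min_dfs_code(*g2)
```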
Figure 2.7: Lexicographic Search Tree
As to the right-most extension for a given graph, all other subscriptings except its base subscripting are ignored. In the following discussion, the right-most extension of G specifically means the right-most extension on the base subscripting of G. Figure 2.7 shows how to arrange all DFS codes in a search tree through right-most extensions. The root is an empty code. Each node is a DFS code encoding a graph. Each edge represents a right-most extension from a (k − 1)-length DFS code to a k-length DFS code. The tree itself is ordered: left siblings are smaller than right siblings in the sense of the DFS lexicographic order. Since any graph has at least one DFS code, the search tree can enumerate all possible subgraphs in a graph data set. However, one graph may have several DFS codes, minimum and non-minimum. The search of non-minimum DFS codes does not produce useful results. Is it necessary to perform right-most extension on non-minimum DFS codes? The answer is "no": if codes s and s′ in Figure 2.7 encode the same graph, the search space under s′ can be safely pruned.
Definition 8 (k Right-Most Extension Set) Let C_α^k be the set of DFS codes generated from a DFS code α through k right-most extensions. That is, C_α^k = {β | ∃ b_1, ..., b_k, β = α ◦r b_1 ◦r ... ◦r b_k, ∀i, 0 < i ≤ k, b_i ∈ E, and α, β ∈ Z}. C_α^k is called the k right-most extension set of α.

Let O_γ be the set of all DFS codes which are less than a DFS code γ (γ ∈ Z), O_γ = {η | η < γ, η ∈ Z}, where η and γ are not necessarily from the same graph.
Lemma 2 (DFS Code Extension) Let α be the minimum DFS code of a graph G and β be a non-minimum DFS code of G. For any DFS code δ generated from β by one right-most extension, i.e., δ ∈ C_β^1,

(i) δ is not a minimum DFS code,
(ii) dfs(δ) cannot be extended from β, and
(iii) dfs(δ) is either less than α or can be extended from α, i.e., dfs(δ) ∈ O_α ∪ C_α^1 and dfs(δ) < β.
Proof. First, statement (i) can be derived from statement (ii): assume to the contrary that δ is a minimum DFS code; this contradicts statement (ii). Secondly, if statement (iii) is true, statement (ii) must be true. The reason is as follows. Since α and β have the same size and α < β, O_α ⊆ O_β and C_α^1 ⊆ O_β. If dfs(δ) ∈ O_α ∪ C_α^1, then dfs(δ) < β, which means that dfs(δ) cannot be extended from β.
Now we prove that statement (iii) is true Let the edge set of G be {eo,ei, e„}, the edgesequence of œ be eg, @¡¡, , €i,, Where 0 < im <n DFS code 6 is extended from đ by adding
a new edge, e, i.e., 6 = Go, e Let Gs be the graph represented by 6
Gs is a new graph built from Gg with a new edge e There are two situations: (1) e introduces
a new vertex; or (2) e connects two existing vertices in G Consider the first situation Weconstruct an alternative DFS code of Gs based on œ as follows Let vz be the vertex of e in
G If vz is on the right-most path of a, then (e;,, ,e;,,¢) forms an alternative DFS code
for Gs Otherwise, there must exist a forward edge (vw,, Uz) in œ such that w; < z by DFSsubscripting Since vz is not on the right-most path, there exist forward edges (vw, Uw) in œ 8.t
Trang 36ta < tị and tị < w3 Let e;,, be the smallest edge among them according to the linear order
(<r) By inserting the new edge e right before e;,, in the edge sequence, €j, €i,,.- +s Cims +++ 1 Cins
the new sequence forms a DFS code of Gs This code is less than a Therefore, there is analternative DFS code existing for Gs, which should be in one of the following two formats:(1) (€i9,.++)€in€), Which belongs C} ; (2) (®¡s, ,„_¡,€;6„, ) and the code formed bythis sequence is less than a Similarly, the same conclusion holds for the second situation Insummary, an alternative DFS code 6’ of G5 exists such that 6’ € Og UCL Since dfs(d) < &, dfs(d) € Og UCh and dƒs(ô) < 8 "
Theorem 2 (Completeness) Performing only the right-most extensions on the minimum
DFS codes guarantees the completeness of mining results
Theorem 2 is equivalent to the following statement: any DFS code extended from non-minimum DFS codes is not minimum. Thus, it is not necessary to extend non-minimum DFS codes at all. Formally, given two DFS codes α and β, if α = dfs(G) and α ≠ β, then for any DFS code δ extended from β, i.e., δ ∈ ∪_{k≥1} C_β^k, dfs(δ) < β.
Proof. Assume that the following proposition is true:

∀ρ ∈ ∪_{k≥1} C_β^k, if dfs(ρ) < β, then ∀q ∈ C_ρ^1, dfs(q) < β (Proposition 1).

By Lemma 2, ∀ρ ∈ C_β^1, dfs(ρ) < β since β ≠ dfs(β) (initial step). Using the above proposition, by induction, ∀ρ ∈ ∪_{k≥1} C_β^k, dfs(ρ) < β. That means any k right-most extension from a non-minimum DFS code must not be a minimum DFS code; furthermore, its minimum DFS code is less than β.

For any ρ ∈ ∪_{k≥1} C_β^k, if dfs(ρ) < β, then for any q ∈ C_ρ^1, by Lemma 2, dfs(q) ∈ O_{dfs(ρ)} ∪ C_{dfs(ρ)}^1. Since dfs(ρ) < β and the length of dfs(ρ) is greater than that of β, according to the DFS lexicographic ordering, ∀δ ∈ C_{dfs(ρ)}^1, δ < β. Therefore, dfs(q) < β.
Lemma 3 (Anti-monotonicity of Frequent Patterns) If a graph G is frequent, then any subgraph of G is frequent. If G is not frequent, then any supergraph of G is not frequent. That is, ∀β ∈ ∪_{k≥1} C_α^k, if a DFS code α is infrequent, β is infrequent too.
These lemmas and theorems set the foundation of our mining algorithm By pre-ordersearching of the lexicographic search tree shown in Figure 2.7, one can guarantee that all thepotential frequent graphs are enumerated The pruning of non-minimum DFS codes in the treeensures that the search is complete while the anti-monotonicity property can be used to prune
a large portion of the search space
Trang 372.2.5 gSpan
For space efficiency, a sparse adjacency list is used to store graphs in gSpan. Algorithm 3 (MainLoop) outlines the main loop, which iterates gSpan until all frequent subgraphs are discovered. Let D_s be the set of graphs where s is a subgraph (i.e., a minimum DFS code). Usually, we only maintain graph identifiers in D_s and use them to index graphs in D.
Algorithm 3 MainLoop(D, min_support, S)
Input: A graph data set D, and min_support
Output: A frequent graph set S
1: remove infrequent vertices and edges in D;
2: S1 ← all frequent 1-edge graphs in D;
3: sort S1 in the increasing DFS lexicographic order;
4: S ← S1;
5: for each edge e ∈ S1 do
6:    initialize s with e, set D_s = {g | g ∈ D and e ∈ E(g)}; (only graph identifiers are recorded)
7:    gSpan(s, D_s, min_support, S);
8:    D ← D − e;
9:    if |D| < min_support then
10:      break;

Once all the frequent graphs containing the first edge e1 have been discovered, e1 is removed from D; the next round then mines the frequent graphs that contain e2 but not any e1. This procedure repeats until all the frequent graphs are discovered.
The details of gSpan are depicted in Algorithm 4. gSpan is called recursively to extend a graph pattern until the support of the newly formed graph is lower than min_support or its code is not minimum any more. The difference between gSpan and PatternGrowth is in the right-most extension and the extension termination of non-minimum DFS codes (Algorithm 4, Lines 1-2). We replace the existence condition in Algorithm 2, Lines 1-2, with the inequality s ≠ dfs(s). Actually, s ≠ dfs(s) is more efficient to calculate. Line 5 requires exhaustive enumeration of s in D in order to count the frequency of all the possible right-most extensions of s. Algorithm 4 implements a depth-first search version of gSpan; gSpan can easily adopt breadth-first search too.
Trang 38Algorithm 4 gSpan(s, D, min support, S)
Input: A DFS code s, a graph data set D, and min_support
Output: A frequent graph set S
1: if s ≠ dfs(s) then
2:    return;
3: insert s into S;
4: set C to Ø;
5: scan D once, find all the edges e such that s can be rightmost extended to s ◦r e;
   insert s ◦r e into C and count its frequency;
6: sort C in DFS lexicographic order;
7: for each frequent s ◦r e in C do
8:    Call gSpan(s ◦r e, D, min_support, S);
9: return;
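Put together, Algorithms 3 and 4 translate into the following recursive skeleton. The helpers (is_min_dfs_code, rightmost_extensions, project) are placeholders for the machinery described above — minimality testing against Definition 7, occurrence-based extension enumeration, and projected-database bookkeeping — so this is only an outline of the control flow under those assumptions, not a tuned gSpan implementation; the database-shrinking step of Algorithm 3 is also omitted.

```python
def gspan(code, projected, min_support, results,
          rightmost_extensions, is_min_dfs_code):
    """Recursive pattern growth over minimum DFS codes (Algorithm 4, simplified).
    projected: list of (graph id, embedding) occurrences of `code` in the database."""
    if not is_min_dfs_code(code):                        # Lines 1-2: prune non-minimum codes
        return
    results.append(code)                                 # Line 3
    extensions = rightmost_extensions(code, projected)   # Line 5: edge -> its occurrences
    # plain tuple order stands in for the DFS lexicographic order of Line 6
    for edge in sorted(extensions):
        occurrences = extensions[edge]
        if len({gid for gid, _ in occurrences}) >= min_support:
            gspan(code + [edge], occurrences, min_support, results,
                  rightmost_extensions, is_min_dfs_code)

def mine(database, min_support, frequent_edges, project, **helpers):
    """Main loop in the spirit of Algorithm 3: one gSpan call per frequent 1-edge graph."""
    results = []
    for edge in sorted(frequent_edges):
        gspan([edge], project(database, edge), min_support, results, **helpers)
    return results
```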
2.3 Closed Graph Pattern
According to the Apriori property, all the subgraphs of a frequent graph must be frequent
A large graph pattern may generate an exponential number of frequent subgraphs. For example, among 423 confirmed active chemical compounds in an AIDS antiviral screen data set (http://dtp.nci.nih.gov/docs/aids/aids_data.html), there are nearly 1,000,000 frequent graph
patterns whose support is at least 5% This renders the further analysis on frequent graphsnearly impossible
The same issue also exists in frequent itemset mining and sequence mining. To reduce the huge set of frequent patterns generated in data mining while maintaining the high quality
of patterns, recent studies have been focusing on mining a compressed or approximate set offrequent patterns In general, pattern compression can be divided into two categories: losslesscompression and lossy compression, in terms of the information that the result set contains,
compared with the whole set of frequent patterns Mining closed patterns [90, 93, 18, 132, 122],
described as follows, is a lossless compression of frequent patterns Mining all non-derivablefrequent sets proposed by Calders and Goethals [21] belongs to this category as well sincethe set of result patterns and their support information generated from these methods can
be used to derive the whole set of frequent patterns Lossy compression is adopted in mostother compressed patterns, such as maximal patterns [10, 69, 18, 44], top-k most frequentclosed patterns [111], condensed pattern bases [91], k-cover patterns [1] or pattern profiles [119](Section 2.5), and clustering-based compression [118] (Section 2.5)
A frequent pattern is closed if and only if there does not exist a super-pattern that has the same support. A frequent pattern is maximal if and only if it does not have a frequent super-pattern. For the AIDS antiviral data set mentioned above, among the one million frequent graphs, only about 2,000 are closed frequent graphs.
Since the maximal pattern set is a subset of the closed pattern set, it is usually more compact than the closed pattern set. However, it cannot reconstruct the whole set of frequent patterns and their supports, while the closed frequent pattern set can. This study is focused on closed graph pattern mining. The proposed pruning techniques can also be applied to maximal pattern mining.
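Given the full set of frequent patterns with their supports, closedness and maximality are simple post-filters. The quadratic sketch below only assumes a caller-supplied subgraph-relation predicate (such as the naive embedding test sketched earlier in this chapter, here passed in as is_subgraph); it illustrates the two definitions, not the pruning used by an efficient closed-pattern miner.

```python
def closed_patterns(frequent, is_subgraph):
    """Keep (g, support) iff no strictly larger frequent pattern has the same support.
    frequent: list of (pattern, support) pairs over the same database."""
    def proper(g, h):
        return is_subgraph(g, h) and not is_subgraph(h, g)
    return [(g, s) for g, s in frequent
            if not any(s == s2 and proper(g, h) for h, s2 in frequent)]

def maximal_patterns(frequent, is_subgraph):
    """Keep (g, support) iff g has no frequent proper super-pattern at all."""
    def proper(g, h):
        return is_subgraph(g, h) and not is_subgraph(h, g)
    return [(g, s) for g, s in frequent
            if not any(proper(g, h) for h, _ in frequent)]
```

By construction, maximal_patterns always returns a subset of closed_patterns, matching the containment noted above.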
2.3.1 Equivalent Occurrence
Given two graphs g and G, where g is a subgraph of G, the number of embeddings (the number of subgraph isomorphisms) of g in G is written as φ(g, G).

Definition 9 (Occurrence) Given a graph g and a graph data set D = {G1, G2, ..., Gn}, the occurrence of g in D is the number of embeddings of g in D, i.e., Σ_{i=1}^{n} φ(g, G_i), written as I(g, D).
Figure 2.8: Extended Subgraph Isomorphism
Let g’ be the graph formed by a graph g with a new edge and p be a subgraph isomorphic
function between g and g’ Assume both graphs g and g’ are subgraphs of a graph G It is
possible to transform an embedding of g in G to an embedding of g’ in G Let f be a subgraph
isomorphism of g in G, and f’ be a subgraph isomorphism of g’ in G If f(v) = f’(p(v)) for
each v in V(g), f is called an extendable subgraph isomorphism and f’ an eztended subgraph
isomorphism of f Intuitively, if a subgraph isomorphism f was already built between g and
G, f can extend to a subgraph isomorphism between g’ and G The number of extendable
isomorphisms is written as ¢(g,9',G) Figure 2.8 illustrates a picture of this transformation
Trang 40procedure Obviously, not every embedding of g in G can be transformed to an embedding of
The extended occurrence is the number of embeddings of g that can be transformed to the
embeddings of g’ in a graph data set
Definition 11 (Equivalent Occurrence) Let g’ be the graph formed by a graph g with a newedge Given a graph database D, g and g’ have equivalent occurrence if 1(g, D) = L(g, 9’, D)
Let e be the new edge added to g such that g′ = g ◦x e. Given a graph G, if g ⊆ G and g′ ⊆ G, the inequality φ(g, G) ≥ L(g, g′, G) always holds. Hence, I(g, D) ≥ L(g, g′, D). When I(g, D) is equal to L(g, g′, D), it means that wherever g occurs in G, g′ also occurs in the same place. Let h be a supergraph of g. If h does not have the edge e, then h will not be closed, since support(h) = support(h ◦x e), where h ◦x e is constructed by adding e into an embedding of g in h such that g′ becomes a subgraph of h ◦x e. Therefore only g′ needs to be extended instead of g. This search strategy is named Early Termination.
2.3.2 Failure of Early Termination
Unfortunately, Early Termination does not work in one case The following example shows thesituation where Early Termination fails
Figure 2.9: Failure of Early Termination
Failure Case Suppose there is a data set with two graphs shown in Figures 2.9(1) and
2.9(2) and the mining task is to find closed frequent graphs whose minimum support is 2 Let
g be z-# and g’ be z-#-*z For brevity, ry represents a graph with one edge, which
1 tị
has “a” as its edge label and “x” and “y” as its vertex labels As one can see, edge y—-z is