142 MANAGING AND MINING GRAPH DATA

for subgraph isomorphism. Procedure Search($i$) iterates on the $i$-th node to find feasible mappings for that node. Procedure Check($u_i$, $v$) examines whether $u_i$ can be mapped to $v$ by considering their edges. Line 12 maps $u_i$ to $v$. Lines 13–16 continue the search with the next node or, if $u_i$ is the last node, evaluate the graph-wide predicate. If the predicate is true, then a feasible mapping $\phi : V(\mathcal{P}) \to V(G)$ has been found and is reported (line 15). Line 16 stops the search immediately if only one mapping is required.

The graph pattern and the graph are each represented as a vertex set and an edge set. In addition, adjacency lists of the graph pattern are used to support line 21. For line 22, edges of graph $G$ can be stored in a hashtable whose keys are pairs of end points. To avoid repeated evaluation of edge predicates (line 22), another hashtable can be used to store evaluated pairs of edges.

The worst-case time complexity of Algorithm 4.1 is $O(n^k)$, where $n$ and $k$ are the sizes of graph $G$ and graph pattern $\mathcal{P}$, respectively. This complexity is a consequence of subgraph isomorphism, which is known to be NP-hard. In practice, the running time depends on the size of the search space.

We now consider possible ways to accelerate Algorithm 4.1:

1. How to reduce the size of $\Phi(u_i)$ for each node $u_i$? How to efficiently retrieve $\Phi(u_i)$?
2. How to reduce the overall search space $\Phi(u_1) \times \cdots \times \Phi(u_k)$?
3. How to optimize the search order?

We present three techniques that respectively address the above questions. The first technique prunes each $\Phi(u_i)$ individually and retrieves it efficiently through indexing. The second technique prunes the overall search space by considering all nodes in the pattern simultaneously. The third technique applies ideas from traditional query optimization to find the right search order.
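To make the baseline concrete before turning to the three techniques, the following is a minimal Python sketch of a backtracking search in the spirit of the Search/Check procedures described above. It is not a transcription of Algorithm 4.1: the function name `find_mappings`, the toy graph, and the requirement that each graph node is used at most once are assumptions made for illustration, and edge predicates and the graph-wide predicate are omitted.

```python
def find_mappings(pattern_nodes, pattern_edges, graph_edges, phi, first_only=False):
    """Backtracking search over phi[u_1] x ... x phi[u_k] (illustrative sketch)."""
    # Hash-based edge lookup keyed by unordered pairs of end points.
    edge_set = {frozenset(e) for e in graph_edges}
    # Adjacency lists of the pattern, used by check().
    padj = {u: set() for u in pattern_nodes}
    for a, b in pattern_edges:
        padj[a].add(b)
        padj[b].add(a)

    mapping, used, results = {}, set(), []

    def check(u, v):
        # u may map to v only if every already-mapped pattern neighbor w of u
        # is mapped to a graph node that is adjacent to v.
        return all(frozenset((mapping[w], v)) in edge_set
                   for w in padj[u] if w in mapping)

    def search(i):
        if i == len(pattern_nodes):      # every pattern node is mapped
            results.append(dict(mapping))
            return first_only            # True means "stop searching"
        u = pattern_nodes[i]
        for v in phi[u]:
            if v in used or not check(u, v):
                continue
            mapping[u] = v
            used.add(v)
            stop = search(i + 1)
            del mapping[u]
            used.discard(v)
            if stop:
                return True
        return False

    search(0)
    return results


# Toy data (hypothetical): a triangle pattern and a six-node labeled graph.
pattern_nodes = ["A", "B", "C"]
pattern_edges = [("A", "B"), ("B", "C"), ("A", "C")]
graph_edges = [("A1", "B1"), ("A1", "C2"), ("B1", "C1"),
               ("B1", "C2"), ("B2", "A2"), ("B2", "C2")]
phi = {"A": ["A1", "A2"], "B": ["B1", "B2"], "C": ["C1", "C2"]}

matches = find_mappings(pattern_nodes, pattern_edges, graph_edges, phi)
```

On this toy input the search visits at most 2 × 2 × 2 candidate combinations, which is why shrinking each $\Phi(u_i)$ and choosing a good node order both matter for larger patterns.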
4.2 Local Pruning and Retrieval of Feasible Mates

Node attributes can be indexed directly using traditional index structures such as B-trees. This allows for fast retrieval of feasible mates and avoids a full scan of all nodes. To reduce the size of the feasible-mate sets $\Phi(u_i)$ even further, we can go beyond nodes and consider neighborhood subgraphs of the nodes. The neighborhood information can be exploited to prune infeasible mates at an early stage.

Definition 4.10. (Neighborhood Subgraph) Given graph $G$, node $v$ and radius $r$, the neighborhood subgraph of node $v$ consists of all nodes within distance $r$ (number of hops) from $v$ and all edges between the nodes.

Query Language and Access Methods for Graph Databases 143

Node $v$ is a feasible mate of node $u_i$ only if the neighborhood subgraph of $u_i$ is sub-isomorphic to that of $v$ (with $u_i$ mapped to $v$). Note that if the radius is 0, then the neighborhood subgraphs degenerate to nodes.

Although neighborhood subgraphs have high pruning power, they incur a large computation overhead. This overhead can be reduced by representing neighborhood subgraphs by their light-weight profiles. For instance, one can define the profile as a sequence of the node labels in lexicographic order. The pruning condition then becomes whether one profile is a subsequence of the other.

[Figure 4.16. A sample graph pattern and graph. The pattern $\mathcal{P}$ has nodes A, B, C; the graph $G$ has nodes A1, A2, B1, B2, C1, C2.]

Node of G    Nodes of neighborhood subgraph (radius 1)    Profile
A1           A1, B1, C2                                   ABC
B1           A1, B1, C1, C2                               ABCC
B2           A2, B2, C2                                   ABC
C1           B1, C1                                       BC
C2           A1, B1, B2, C2                               ABBC
A2           A2, B2                                       AB

Search space
Retrieve by nodes: {A1, A2} × {B1, B2} × {C1, C2}
Retrieve by neighborhood subgraphs: {A1} × {B1} × {C2}
Retrieve by profiles of neighborhood subgraphs: {A1} × {B1, B2} × {C2}

Figure 4.17. Feasible mates using neighborhood subgraphs and profiles. The resulting search spaces are also shown for different pruning techniques.
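The profile-based pruning condition can be sketched in a few lines of Python. This is an illustration only: the helper names `profile` and `is_subsequence` are hypothetical, the edge list of $G$ is read off the radius-1 neighborhoods listed in Figure 4.17, and the pattern is assumed to be a triangle on A, B, C.

```python
def profile(adj, label, v, radius=1):
    """Sorted label string of all nodes within `radius` hops of v (v included)."""
    seen, frontier = {v}, [v]
    for _ in range(radius):
        nxt = []
        for x in frontier:
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    nxt.append(y)
        frontier = nxt
    return "".join(sorted(label[x] for x in seen))


def is_subsequence(small, big):
    """True if `small` can be obtained from `big` by deleting characters."""
    it = iter(big)
    return all(ch in it for ch in small)   # `in` advances the iterator


def build_adj(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


# Assumed edge list for G, inferred from the neighborhoods in Figure 4.17.
g_adj = build_adj([("A1", "B1"), ("A1", "C2"), ("B1", "C1"),
                   ("B1", "C2"), ("B2", "A2"), ("B2", "C2")])
g_label = {v: v[0] for v in g_adj}          # "A1" -> "A", etc.

# Assumed pattern: a triangle whose nodes are named after their labels.
p_adj = build_adj([("A", "B"), ("B", "C"), ("A", "C")])
p_label = {u: u for u in p_adj}

g_profiles = {v: profile(g_adj, g_label, v) for v in g_adj}
mates = {
    u: sorted(v for v in g_adj
              if g_label[v] == p_label[u]
              and is_subsequence(profile(p_adj, p_label, u), g_profiles[v]))
    for u in p_adj
}
```

Comparing two short strings is much cheaper than a sub-isomorphism test on the neighborhood subgraphs, at the price of weaker pruning: here B2 survives the profile check although it is pruned by the full neighborhood subgraph test.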
Figure 4.16 shows the sample graph pattern $\mathcal{P}$ and the database graph $G$ again for convenience. Figure 4.17 shows the neighborhood subgraphs of radius 1 and their profiles for nodes of $G$. If the feasible mates are retrieved using node attributes, then the search space is $\{A_1, A_2\} \times \{B_1, B_2\} \times \{C_1, C_2\}$. If the feasible mates are retrieved using neighborhood subgraphs, then the search space is $\{A_1\} \times \{B_1\} \times \{C_2\}$. Finally, if the feasible mates are retrieved using profiles, then the search space is $\{A_1\} \times \{B_1, B_2\} \times \{C_2\}$. These are shown on the right side of Figure 4.17.

If the node attributes are selective, e.g., there are many unique attribute values, then one can index the node attributes using a B-tree or hashtable, and store the neighborhood subgraphs or profiles as well. Retrieval is done by indexed access to the node attributes, followed by pruning using neighborhood subgraphs or profiles. Otherwise, if the node attributes are not selective, one may have to index the neighborhood subgraphs or profiles. Recent graph indexing techniques [9, 17, 23, 34, 36, 39–42] or multi-dimensional indexing methods such as R-trees can be used for this purpose.

4.3 Joint Reduction of Search Space

We reduce the overall search space iteratively by an approximation algorithm called Pseudo Subgraph Isomorphism [17]. This prunes the search space by considering the whole pattern and the space $\Phi(u_1) \times \cdots \times \Phi(u_k)$ simultaneously. Essentially, this technique checks for each node $u$ in pattern $\mathcal{P}$ and its feasible mate $v$ in graph $G$ whether the adjacent subtree of $u$ is sub-isomorphic to that of $v$. The check can be defined recursively on the depth of the adjacent subtrees: the level $l$ subtree of $u$ is sub-isomorphic to that of $v$ only if the level $l-1$ subtrees of $u$'s neighbors can all be matched to those of $v$'s neighbors. To avoid subtree isomorphism tests, a bipartite graph $\mathcal{B}_{u,v}$ is defined between neighbors of $u$ and $v$.
If the bipartite graph has a semi-perfect matching, i.e., all neighbors of $u$ are matched, then $u$ is level $l$ sub-isomorphic to $v$. In the bipartite graph, an edge is present between two nodes $u'$ and $v'$ only if the level $l-1$ subtree of $u'$ is sub-isomorphic to that of $v'$, or equivalently the bipartite graph $\mathcal{B}_{u',v'}$ at level $l-1$ has a semi-perfect matching. A more detailed description can be found in [17].

Algorithm 4.2 outlines the refinement procedure. At each iteration (lines 3–20), a bipartite graph $\mathcal{B}_{u,v}$ is constructed for each $u$ and its feasible mate $v$ (lines 5–9). If $\mathcal{B}_{u,v}$ has no semi-perfect matching, then $v$ is removed from $\Phi(u)$, thus reducing the search space (line 13).

Algorithm 4.2: Refine Search Space
Input: Graph pattern 𝒫, graph G, search space Φ(u_1) × ⋯ × Φ(u_k), level l
Output: Reduced search space Φ′(u_1) × ⋯ × Φ′(u_k)
 1  begin
 2    foreach u ∈ 𝒫, v ∈ Φ(u) do Mark ⟨u, v⟩;
 3    for i ← 1 to l do
 4      foreach u ∈ 𝒫, v ∈ Φ(u), ⟨u, v⟩ is marked do
 5        // Construct bipartite graph ℬ_{u,v}
 6        N_𝒫(u), N_G(v): neighbors of u, v;
 7        foreach u′ ∈ N_𝒫(u), v′ ∈ N_G(v) do
 8          ℬ_{u,v}(u′, v′) ← 1 if v′ ∈ Φ(u′); 0 otherwise;
 9        end
10        if ℬ_{u,v} has a semi-perfect matching then
11          Unmark ⟨u, v⟩;
12        else
13          Remove v from Φ(u);
14          foreach u′ ∈ N_𝒫(u), v′ ∈ N_G(v), v′ ∈ Φ(u′) do
15            Mark ⟨u′, v′⟩;
16          end
17        end
18      end
19      if there is no marked ⟨u, v⟩ then break;
20    end
21  end

The algorithm has two implementation improvements on the refinement procedure discussed in [17]. First, it avoids unnecessary bipartite matchings. A pair $\langle u, v \rangle$ is marked if it needs to be checked for semi-perfect matching (lines 2, 4). If the semi-perfect matching exists, then the pair is unmarked (lines 10–11). Otherwise, the removal of $v$ from $\Phi(u)$ (line 13) may affect the existence of semi-perfect matchings of the neighboring $\langle u', v' \rangle$ pairs. As a result, these pairs are marked and checked again (line 14).
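The marking scheme can be illustrated with a small Python sketch that follows the structure of Algorithm 4.2. Several choices here are assumptions rather than the authors' implementation: the function names are hypothetical, marked pairs are kept in a hash-based set, the semi-perfect matching test uses simple augmenting paths instead of Hopcroft and Karp's algorithm, and the example graph is the six-node graph whose edges are inferred from Figure 4.17 with a triangle pattern.

```python
def has_semi_perfect_matching(left, candidates):
    """True if every node in `left` can be matched to a distinct candidate."""
    match = {}                                   # right node -> left node

    def augment(u, seen):
        for v in candidates[u]:
            if v in seen:
                continue
            seen.add(v)
            if v not in match or augment(match[v], seen):
                match[v] = u
                return True
        return False

    return all(augment(u, set()) for u in left)


def refine(pattern_adj, graph_adj, phi, level):
    phi = {u: set(vs) for u, vs in phi.items()}            # work on a copy
    marked = {(u, v) for u in phi for v in phi[u]}         # line 2
    for _ in range(level):                                 # line 3
        for u, v in sorted(marked):                        # line 4 (snapshot)
            # lines 5-9: neighbors v' of v that are still feasible for u'
            candidates = {up: {vp for vp in graph_adj[v] if vp in phi[up]}
                          for up in pattern_adj[u]}
            if has_semi_perfect_matching(pattern_adj[u], candidates):
                marked.discard((u, v))                     # lines 10-11
            else:
                phi[u].discard(v)                          # line 13
                marked.discard((u, v))
                for up in pattern_adj[u]:                  # lines 14-15
                    for vp in graph_adj[v]:
                        if vp in phi[up]:
                            marked.add((up, vp))
        if not marked:                                     # line 19
            break
    return phi


def build_adj(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


# Assumed example data: triangle pattern, graph edges inferred from Figure 4.17.
p_adj = build_adj([("A", "B"), ("B", "C"), ("A", "C")])
g_adj = build_adj([("A1", "B1"), ("A1", "C2"), ("B1", "C1"),
                   ("B1", "C2"), ("B2", "A2"), ("B2", "C2")])
phi0 = {"A": {"A1", "A2"}, "B": {"B1", "B2"}, "C": {"C1", "C2"}}

refined = refine(p_adj, g_adj, phi0, level=2)
```

Because removals take effect immediately in this sketch, the iteration in which a particular mate is pruned can differ from a level-by-level trace; on this example the final search space is nevertheless {A1} × {B1} × {C2}.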
Second, the $\langle u, v \rangle$ pairs are stored and manipulated using a hashtable instead of a matrix. This reduces the space and time complexity from $O(k \cdot n)$ to $O(\sum_{i=1}^{k} |\Phi(u_i)|)$. The overall time complexity is $O(l \cdot \sum_{i=1}^{k} |\Phi(u_i)| \cdot (d_1 d_2 + M(d_1, d_2)))$, where $l$ is the refinement level, $d_1$ and $d_2$ are the maximum degrees of $\mathcal{P}$ and $G$ respectively, and $M()$ is the time complexity of maximum bipartite matching ($O(n^{2.5})$ for Hopcroft and Karp's algorithm [19]).

Figure 4.18 shows an execution of Algorithm 4.2 on the example in Figure 4.16. At level 1, $A_2$ and $C_1$ are removed from $\Phi(A)$ and $\Phi(C)$, respectively. At level 2, $B_2$ is removed from $\Phi(B)$ since the bipartite graph $\mathcal{B}_{B,B_2}$ has no semi-perfect matching (note that $A_2$ was already removed from $\Phi(A)$).

[Figure 4.18. Refinement of the search space. Panels: Level-0, Level-1, Level-2. Input search space: {A1, A2} × {B1, B2} × {C1, C2}. Output search space: {A1} × {B1} × {C2}.]

Whereas the neighborhood subgraphs discussed in Section 4.2 prune infeasible mates by using local information, the refinement procedure in Algorithm 4.2 prunes the search space globally. The global pruning has a larger overhead and is dependent on the output of the local pruning. Therefore, both pruning methods are indispensable and should be used together.

4.4 Optimization of Search Order

Next, we consider the search order of Algorithm 4.1. The goal here is to find a good search order for the nodes. Since the search procedure is equivalent to multiple joins, it is similar to a typical query optimization problem [7]. Two principal issues need to be considered. One is the cost model for a given search order. The other is the algorithm for finding a good search order.
The cost model is used as the objective function of the search algorithm. Since the search algorithm is relatively standard (e.g., dynamic programming, greedy algorithm), we focus on the cost model and illustrate that it can be customized in the domain of graphs.

Cost Model. A search order (a.k.a. a query plan) can be represented as a rooted binary tree whose leaves are nodes of the graph pattern and whose internal nodes are join operations. Figure 4.19 shows two examples of search orders.

[Figure 4.19. Two examples of search orders: (a) leaves in the order A, B, C; (b) leaves in the order A, C, B.]

We estimate the cost of a join (a node in the query plan tree) as the product of the cardinalities of the collections to be joined. The cardinality of a leaf node is the number of feasible mates. The cardinality of an internal node can be estimated as the product of the cardinalities of the joined collections, reduced by a factor $\gamma$.

Definition 4.11. (Result size of a join) The result size of join $i$ is estimated by

$$Size(i) = Size(i.left) \times Size(i.right) \times \gamma(i)$$

where $i.left$ and $i.right$ are the left and right child nodes of $i$ respectively, and $\gamma(i)$ is the reduction factor.

A simple way to estimate the reduction factor $\gamma(i)$ is to approximate it by a constant. A more elaborate way is to consider the probabilities of edges in the join: let $\mathcal{E}(i)$ be the set of edges involved in join $i$, then

$$\gamma(i) = \prod_{e(u,v) \in \mathcal{E}(i)} P(e(u,v))$$

where $P(e(u,v))$ is the probability of edge $e(u,v)$ conditioned on $u$ and $v$. This probability can be estimated as

$$P(e(u,v)) = \frac{freq(e(u,v))}{freq(u) \cdot freq(v)}$$

where $freq()$ denotes the frequency of the edge or node in the large graph.

Definition 4.12. (Cost of a join) The cost of join $i$ is estimated by

$$Cost(i) = Size(i.left) \times Size(i.right)$$

Definition 4.13. (Cost of a search order) The total cost of a search order $\Gamma$ is estimated by

$$Cost(\Gamma) = \sum_{i \in \Gamma} Cost(i)$$

For example, let the input search space be $\{A_1\} \times \{B_1, B_2\} \times \{C_2\}$.
If we use a constant reduction factor $\gamma$, then $Cost(A \bowtie B) = 1 \times 2 = 2$, $Size(A \bowtie B) = 2\gamma$, and $Cost((A \bowtie B) \bowtie C) = 2\gamma \times 1 = 2\gamma$. The total cost is $2 + 2\gamma$. Similarly, $Cost(A \bowtie C) = 1 \times 1 = 1$, $Size(A \bowtie C) = \gamma$, and $Cost((A \bowtie C) \bowtie B) = \gamma \times 2 = 2\gamma$, so the total cost of $(A \bowtie C) \bowtie B$ is $1 + 2\gamma$. Thus, the search order $(A \bowtie C) \bowtie B$ is better than $(A \bowtie B) \bowtie C$.

Search Order. The number of all possible search orders is exponential in the number of nodes. It is expensive to enumerate all of them. As in many query optimization techniques, we consider only left-deep query plans, i.e., the outer node of each join is always a leaf node. Traditional dynamic programming would take $O(2^k)$ time for a graph pattern of size $k$. This is not scalable to large graph patterns. Therefore, we adopt a simple greedy approach in our implementation: at join $i$, choose a leaf node that minimizes the estimated cost of the join.

5. Experimental Study

In this section, we evaluate the performance of the presented graph pattern matching algorithms on large real and synthetic graphs. The graph-specific optimizations are compared with an SQL-based implementation as described in Figure 4.2. MySQL server 5.0.45 is used and configured as: storage engine = MyISAM (non-transactional), key buffer size = 256M. Other parameters are set as default. For each large graph, two tables V(vid, label) and E(vid1, vid2) are created as in Figure 4.2. B-tree indices are built for each field of the tables.

The presented graph pattern matching algorithms were written in Java and compiled with Sun JDK 1.6. All the experiments were run on an AMD Athlon 64 X2 4200+ 2.2GHz machine with 2GB memory running MS Win XP Pro.

5.1 Biological Network

The real dataset is a yeast protein interaction network [2]. This graph consists of 3112 nodes and 12519 edges. Each node represents a unique protein and each edge represents an interaction between proteins. To allow for meaningful queries, we add Gene Ontology (GO) [14] terms to the proteins.
The Gene Ontology is a hierarchy of categories that describes cellular components, biological processes, and molecular functions of genes and their products (proteins). Each GO term is a node in the hierarchy and has one or more parent GO terms. Each protein has one or more GO terms. We use high-level GO terms as labels of the proteins (183 distinct labels in total). We index the node labels using a hashtable, and store the neighborhood subgraphs and profiles with radius 1 as well.

Clique Queries. The clique queries are generated with sizes (number of nodes) between 2 and 7 (sizes greater than 7 have no answers). For each size, a complete graph is generated with each node assigned a random label. The random label is selected from the top 40 most frequent labels. A total of 1000 clique queries are generated and the results are averaged. The queries are divided into two groups according to the number of answers returned: low hits (fewer than 100 answers) and high hits (more than 100 answers). Queries having no answers are not counted in the statistics. Queries having too many hits (more than 1000) are terminated immediately and counted in the group of high hits.

To evaluate the pruning power of the local pruning (Section 4.2) and the global pruning (Section 4.3), we define the reduction ratio of search space as

$$\gamma(\Phi, \Phi_0) = \frac{|\Phi(u_1)| \times \cdots \times |\Phi(u_k)|}{|\Phi_0(u_1)| \times \cdots \times |\Phi_0(u_k)|}$$

where $\Phi_0$ refers to the baseline search space.

[Figure 4.20. Search space for clique queries. Panels: (a) Low hits; (b) High hits. x-axis: clique size; y-axis: reduction ratio (log scale). Series: Retrieve by profiles, Retrieve by subgraphs, Refined search space.]

[Figure 4.21. Running time for clique queries (low hits). Panel (a) Individual steps: Retrieve by profiles, Retrieve by subgraphs, Refine search space, Search w/ opt. order, Search w/o opt. order. Panel (b) Total query processing: Optimized, Baseline, SQL-based. x-axis: clique size; y-axis: time (msec).]

Figure 4.20 shows the reduction ratios of search space by different methods. “Retrieve by profiles” finds feasible mates by checking profiles and “Retrieve by subgraphs” finds feasible mates by checking neighborhood subgraphs (Section 4.2). “Refined search space” refers to the global pruning discussed in Section 4.3, where the input search space is generated by “Retrieve by profiles”. The maximum refinement level $l$ is set as the size of the query. As can be seen from the figure, the refinement procedure always reduces the search space retrieved by profiles. Retrieval by subgraphs results in the smallest search space. This is because the neighborhood subgraph of each node in a clique query is actually the entire clique.

Figure 4.21(a) shows the average processing time for individual steps under varying clique sizes. The individual steps include retrieval by profiles, retrieval by subgraphs, refinement, search with the optimized order (Section 4.4), and search without the optimized order. The time for finding the optimized order is negligible since we take a greedy approach in our implementation. As shown in the figure, retrieval by subgraphs has a large overhead although it produces a smaller search space than retrieval by profiles. Another observation is that the optimized order improves the search time.

Figure 4.21(b) shows the average total query processing time in comparison to the SQL-based approach on low-hits queries. The “Optimized” processing consists of retrieval by profiles, refinement, optimization of search order, and search with the optimized order. The “Baseline” processing consists of retrieval by node attributes and search without the optimized order on the baseline space. The query processing time in the “Optimized” case is improved greatly due to the reduced search space.

The SQL-based approach takes much longer and does not scale to large clique queries. This is due to the unpruned search space and the large number of joins involved. Whereas our graph pattern matching algorithm (Section 4.1) is exponential in the number of nodes, the SQL-based approach is exponential in the number of edges. For instance, a clique of size 5 has 10 edges. This requires 20 joins between nodes and edges (as illustrated in Figure 4.2).

5.2 Synthetic Graphs

The synthetic graphs are generated using a simple Erdős-Rényi [13] random graph model: generate $n$ nodes, and then generate $m$ edges by randomly choosing two end nodes. Each node is assigned a label (100 distinct labels in total). The distribution of the labels follows Zipf's law, i.e., the probability of the $x$-th label $p(x)$ is proportional to $x^{-1}$. The queries are generated by randomly extracting a connected subgraph from the synthetic graph.

We first fix the size of the synthetic graphs at $n = 10K$, $m = 5n$, and vary the query size between 4 and 20. Figure 4.22 shows the search space and processing time for individual steps. Unlike for clique queries, the global pruning produces the smallest search space, outperforming the local pruning by full neighborhood subgraphs.

[Figure 4.22. Search space and running time for individual steps (synthetic graphs, low hits). Panel (a) Search space: x-axis query size; y-axis reduction ratio (log scale); series Retrieve by profiles, Retrieve by subgraphs, Refined search space. Panel (b) Time for individual steps: x-axis query size; y-axis time (msec); series Retrieve by profiles, Retrieve by subgraphs, Refine search space, Search w/ opt. order, Search w/o opt. order.]

[Figure 4.23. Running time (synthetic graphs, low hits). Panel (a) Varying query sizes (graph size: 10K). Panel (b) Varying graph sizes (query size: 4), x-axis graph size (x1000). y-axis: time (msec, log scale). Series: Optimized, Baseline, SQL-based.]

Figure 4.23 shows the total time with varying query sizes and graph sizes. As can be seen, the SQL-based approach is not scalable to large queries, though it scales to large graphs with small queries. In either case, the “Optimized” processing produces the smallest running time.

To summarize the experimental results, retrieval by profiles has much less overhead than retrieval by subgraphs. The refinement step (Section 4.3) greatly reduces the search space. The overhead of the search step is well compensated by the extensive reduction of search space. A practical combination would be retrieval by profiles, followed by refinement, and then search with an optimized order. This combination scales well with various query sizes and graph sizes. SQL-based processing is not scalable to large queries. Overall, the optimized processing performs orders of magnitude better than the SQL-based approach. While small improvements in SQL-based implementations can be