3.4 DISTINCT ROOT-BASED KEYWORD SEARCH

In this section, we show approaches to find Q-subtrees using the distinct root semantics, where the weight of a tree is defined as the sum of the shortest distances from the root to each keyword node. As shown in the previous section, keyword search under the directed Steiner tree semantics is, in general, a hard problem. Using the distinct root semantics, there can be at most n Q-subtrees for a keyword query, and in the worst case, all the Q-subtrees can be found in time O(l(n log n + m)), i.e., with one shortest-path computation per keyword. The approaches introduced in this section deal with very large graphs in general, and they propose search strategies or indexing schemes to reduce the search time for an online keyword query.

3.4.1 BIDIRECTIONAL SEARCH

The BackwardSearch algorithm, as proposed in the previous section, can be directly applied to the distinct root semantics by modifying Line 3 to iterate over the l keyword nodes, i.e., {k_1, ···, k_l}. It would explore an unnecessarily large number of nodes in the following scenarios:

• The query contains a frequently occurring keyword. In BackwardSearch, one iterator is associated with each keyword node. The algorithm would generate a large number of iterators if a keyword matches a large number of nodes.
• An iterator reaches a node with large fan-in (incoming edges). An iterator may need to explore a large number of nodes if it hits a node with a very large fan-in.

Bidirectional search [Kacholia et al., 2005] can be used to overcome the drawbacks of BackwardSearch. The main idea of bidirectional search is to start forward searches from potential roots. The main differences between bidirectional search and BackwardSearch are as follows:

• All the single-source shortest path iterators from the BackwardSearch algorithm are merged into a single iterator, called the incoming iterator.
• An outgoing iterator runs concurrently, which follows forward edges starting from all the nodes explored by the incoming iterator.
• A spreading activation is proposed to prioritize the search, which chooses whether the incoming iterator or the outgoing iterator is called next. It also chooses the next node to be visited in the incoming iterator or outgoing iterator.

A high-level pseudocode for the bidirectional search algorithm is shown in Algorithm 25 (BidirectionalSearch). Q_in (Q_out) is a priority queue of nodes in the backward (forward) expanding fringe. X_in (X_out) is the set of nodes expanded for incoming (outgoing) paths. Fringe nodes are the nodes that have already been hit by an iterator but whose neighbors are yet to be explored. The set of fringe nodes of an iterator is called the iterator frontier. For every node u explored so far, either in outgoing or incoming search, the algorithm keeps track of the best known path from u to any node in S_i. Specifically, for every keyword term k_i, it maintains the child node sp_{u,i} that u follows to reach a node in S_i in the best known path. dist_{u,i} stores the length of the best known path from u to a node in S_i. depth_u stores the number of edges between node u and the nearest keyword node. A depth cutoff value d_max is used to prevent generation of answers that would be unintuitive due to excessive path lengths, and to ensure termination.

Algorithm 25 BidirectionalSearch (G_D, Q)
Input: a data graph G_D, and an l-keyword query Q = {k_1, k_2, ···, k_l}.
Output: Q-subtrees in approximately increasing weight order.
 1: Find the keyword node sets: {S_1, ···, S_l}, S ← ∪_{i=1}^{l} S_i
 2: Q_in ← S; Q_out ← ∅; X_in ← ∅; X_out ← ∅
 3: ∀u ∈ S: P_u ← ∅, depth_u ← 0; ∀i, ∀u ∈ S: sp_{u,i} ← ∞
 4: ∀i, ∀u ∈ S: if u ∈ S_i, dist_{u,i} ← 0 else dist_{u,i} ← ∞
 5: while Q_in or Q_out is non-empty do
 6:   if Q_in has node with highest activation then
 7:     Pop best v from Q_in and insert into X_in
 8:     if iscomplete(v) then emit(v)
 9:     if depth_v < d_max then
10:       ∀u ∈ incoming[v]: exploreEdge(u, v), if u ∉ X_in insert it into Q_in with depth depth_v + 1
11:       if v ∉ X_out insert it into Q_out
12:   else if Q_out has node with highest activation then
13:     Pop best u from Q_out and insert into X_out
14:     if iscomplete(u) then emit(u)
15:     if depth_u < d_max then
16:       ∀v ∈ outgoing[u]: exploreEdge(u, v), if v ∉ X_out insert it into Q_out with depth depth_u + 1

At each iteration of the algorithm (line 5), between the incoming and outgoing iterators, the one having the node with the highest priority is scheduled for exploration. Exploring a node v in Q_in (resp. Q_out) is done as follows: incoming (resp. outgoing) edges are traversed to propagate keyword-distance information and activation information from v to adjacent nodes,¹ and the node is moved from Q_in to X_in (resp. Q_out to X_out). Additionally, if the node is found to have been reached from all keywords, it is “emitted” to an output queue. Answers are output from the output queue when the algorithm decides that no answers with lower cost will be generated in the future.
¹ The activation of a node v is defined for every keyword k_i. Let a_{v,i} be the activation of v with respect to keyword k_i. The activation of v is then defined as the sum of its activations from each keyword, i.e., a_v = Σ_{i=1}^{l} a_{v,i}. a_{v,i} is first initialized for each node v that contains keyword k_i, and it then spreads to other nodes u to reflect the path length from u to keyword node k_i and how close u is to the potential root.

3.4.2 BI-LEVEL INDEXING

He et al. [2007] propose a bi-level index to speed up BidirectionalSearch, as no index (except the keyword-node index) is used in the original algorithm. A naive index precomputes and indexes all the distances from the nodes to keywords, but this incurs a very large index size, as the number of distinct keywords is on the order of the size of the data graph G_D. A bi-level index can be built by first partitioning the graph, and then building an intra-block index and a block index.

Two node-based partitioning methods are proposed to partition a graph into blocks, namely, BFS-Based Partitioning and METIS-Based Partitioning. Before introducing the data structure of the index, we first introduce the concept of a portal. In a node-based partitioning of a graph, a node separator is called a portal node (or portal for short). A block consists of all nodes in a partition as well as all portals incident to the partition. For a block, a portal can be either an “in-portal”, an “out-portal”, or both. A portal is called an in-portal if it has at least one incoming edge from another block and at least one outgoing edge in this block. A portal is called an out-portal if it has at least one outgoing edge to another block and at least one incoming edge from this block.
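To make the portal definitions concrete, the following Python sketch labels each portal as an in-portal and/or out-portal of every block it is incident to. It is only an illustration of the definitions above, not the partitioning procedure of He et al. [2007]: the function name classify_portals and its input format are hypothetical, the partitioning is assumed to be given, and edges whose two endpoints are both portals are ignored for simplicity.

```python
from collections import defaultdict


def classify_portals(edges, block_of, portals):
    """Classify portals per block (illustrative sketch).

    edges: iterable of directed edges (u, v).
    block_of: dict mapping every non-portal node to its block id.
    portals: set of separator (portal) nodes.
    Returns {(portal, block): (is_in_portal, is_out_portal)}.
    """
    in_blocks = defaultdict(set)   # portal -> blocks with an edge INTO the portal
    out_blocks = defaultdict(set)  # portal -> blocks the portal has an edge INTO
    for u, v in edges:
        if v in portals and u not in portals:
            in_blocks[v].add(block_of[u])
        if u in portals and v not in portals:
            out_blocks[u].add(block_of[v])

    roles = {}
    for p in portals:
        for b in in_blocks[p] | out_blocks[p]:
            # In-portal of b: enters from another block, continues inside b.
            is_in = b in out_blocks[p] and bool(in_blocks[p] - {b})
            # Out-portal of b: reached from inside b, leaves to another block.
            is_out = b in in_blocks[p] and bool(out_blocks[p] - {b})
            roles[(p, b)] = (is_in, is_out)
    return roles
```

For a path a → p → b with a in block 0 and b in block 1, this labels p as an out-portal of block 0 and an in-portal of block 1, which is the direction in which a path leaves the first block and enters the second.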
For each block b, the intra-block index (IB-index) consists of the following data structures:

• Intra-block keyword-node lists: For each keyword k, L_KN(b, k) denotes the list of nodes in block b that can reach k without leaving b, sorted according to their shortest distances (within b) to k (or more precisely, to any node in b containing k).
• Intra-block node-keyword map: Looking up a node u ∈ b together with a keyword k in this hash map returns M_NK(b, u, k), the shortest distance (within b) from u to k (∞ if u cannot reach k in b).
• Intra-block portal-node lists: For each out-portal p of b, L_PN(b, p) denotes the list of nodes in b that can reach p without leaving b, sorted according to their shortest distances (within b) to p.
• Intra-block node-portal distance map: Looking up a node u ∈ b in this hash map returns D_NP(b, u), the shortest distance (within b) from u to the closest out-portal of b (∞ if u cannot reach any out-portal of b).

The primary purpose of L_PN is to support cross-block backward expansion in an efficient manner, as an answer may span multiple blocks through portals. The D_NP map gives the shortest distance between a node and its closest out-portal within a block.

The block index is a simple data structure consisting of:

• keyword-block lists: For each keyword k, L_KB(k) denotes the list of blocks containing keyword k, i.e., the blocks b in which at least one node contains k;
• portal-block lists: For each portal p, L_PB(p) denotes the list of blocks with p as an out-portal.

BLINKS [He et al., 2007], shown in Algorithm 26, works as follows. A priority queue Q_i of cursors is created for each keyword term k_i to simulate Dijkstra’s algorithm by utilizing the distance information stored in the IB-index. Initially, for each keyword k_i, all the blocks that contain it are found through the keyword-block list, and a cursor is created to scan each intra-block keyword-node list and is put in queue Q_i (line 2). The main part of the algorithm performs backward search, and it only conducts a forward check at line 19. When an in-portal u is visited, all the blocks that have u as their out-portal need to be expanded (line 8) because a shorter path may cross several blocks. The function pickKeyword(Q_1, ···, Q_l) chooses the next keyword (queue) to expand.

Algorithm 26 BLINKS (G_D, Q)
Input: a data graph G_D, and an l-keyword query Q = {k_1, k_2, ···, k_l}.
Output: directed rooted trees in increasing weight order.
 1: for each i ∈ [1, l] do
 2:   Q_i ← new Queue(); ∀b ∈ L_KB(k_i): Q_i.insert(new Cursor(L_KN(b, k_i), 0))
 3: while ∃j ∈ [1, l]: Q_j ≠ ∅ do
 4:   i ← pickKeyword(Q_1, ···, Q_l)
 5:   c ← Q_i.pop(); ⟨u, d⟩ ← c.next()
 6:   visitNode(i, u, d)
 7:   if ¬crossed(i, u) and L_PB(u) ≠ ∅ then
 8:     ∀b ∈ L_PB(u): Q_i.insert(new Cursor(L_PN(b, u), d))
 9:     crossed(i, u) ← true
10:   Q_i.insert(c), if c.peekDist() ≠ ∞
11:   if |A| ≥ K and Σ_j Q_j.top().peekDist() > τ_prune and ∀v ∈ R − A: sumLBDist(v) > τ_prune then
12:     exit and output the top K answers in A
13: output up to top K answers in A

14: Procedure visitNode(i, u, d)
15: if R[u] = ⊥ then
16:   R[u] ← ⟨u, ⊥, ···, ⊥⟩; R[u].dist_i ← d
17:   b ← the block containing u
18:   for each j ∈ [1, i) ∪ (i, l] do
19:     R[u].dist_j ← M_NK(b, u, k_j), if D_NP(b, u) ≥ M_NK(b, u, k_j)
20: else if sumLBDist(u) > τ_prune then
21:   return
22: R[u].dist_i ← d
23: if sumDist(u) < ∞ then
24:   A.add(R[u])
25:   τ_prune ← the K-th largest of {sumDist(v) | v ∈ A}

When a keyword k_i visits node u for the first time (Procedure visitNode()), the distance d is guaranteed to be the shortest distance between u and k_i. The intra-block index entries D_NP(b, u) and M_NK(b, u, k_j) can be used to lower-bound the shortest distance between u and each other keyword k_j: if D_NP(b, u) ≥ M_NK(b, u, k_j), then the shortest distance is guaranteed to be M_NK(b, u, k_j) (line 19), because any path that leaves b must first reach an out-portal; otherwise, it is lower-bounded by D_NP(b, u). One important optimization is that each keyword can cross a portal at most once (lines 7 and 9), i.e., any path that crosses the same portal more than once cannot be a shortest path. This reduces the search space dramatically.

Another index, called the structure-aware index, is proposed to find answers to keyword queries efficiently [Li et al., 2009b]. A different answer semantics is defined, called the compact Steiner tree. The compact Steiner tree is similar to the distinct root-based semantics. There is at most one answer tree, rooted at each node, for a keyword query. The tree, rooted at node t, is chosen as follows: for each keyword k_i, a node v containing k_i and dominating t (i.e., v is the node with the shortest distance among all the nodes containing k_i that t can reach) is chosen, and the shortest path from t to v is added to the tree. Based on such a definition, the path from t to k_i becomes query independent, and it can therefore be precomputed and stored on disk. When a keyword query arrives, the algorithm selects all the relevant paths and joins them to form compact Steiner trees.

3.4.3 EXTERNAL MEMORY DATA GRAPH

Dalvi et al. [2008] study keyword search on graphs where the graph G_D cannot fit into main memory. They build a much smaller supernode graph on top of G_D that can reside in main memory. The supernode graph is defined as follows:

• SuperNode: The graph G_D is partitioned into components by a clustering algorithm, and each cluster is represented by a node called the supernode in the top-level graph. Each supernode thus contains a subset of V(G_D), and the contained nodes (nodes in G_D) are called innernodes.
• SuperEdge: The edges between the supernodes, called superedges, are constructed as follows: if there is at least one edge from an innernode of supernode s_1 to an innernode of supernode s_2, then there exists a superedge from s_1 to s_2.

During supernode graph construction, the parameters are chosen such that the supernode graph fits into the available amount of main memory. Each supernode has a fixed number of innernodes and is stored on disk.

A multi-granular graph is used to exploit information present in lower-level nodes (innernodes) that are cache-resident at the time a query is executed. A multi-granular graph is a hybrid graph that contains both supernodes and innernodes. A supernode is present either in expanded form, i.e., all its innernodes along with their adjacency lists are present in the cache, or in unexpanded form, i.e., its innernodes are not in the cache. The innernodes and their adjacency lists are handled in the unit of supernodes, i.e., either all or none of the innernodes of a supernode are present in the cache.

Since supernodes and innernodes coexist in the multi-granular graph, several types of edges can be present. Among these, the edges between supernodes and between innernodes need to be stored; the other edges can be inferred, i.e., the edges between supernodes are stored in main