Keyword Search in Databases - P16

3. GRAPH-BASED KEYWORD SEARCH

memory, and the edges between innernodes are stored in cache or on disk in the form of adjacency lists; the edges between a supernode and an innernode do not need to be stored explicitly. The weights of the different kinds of edges are defined as follows.

• supernode → supernode (S → S): The weight of edge $s_1 \to s_2$ is defined as the minimum weight of the edges between the innernodes of $s_1$ and those of $s_2$, i.e., $w_e((s_1, s_2)) = \min_{v_1 \in s_1, v_2 \in s_2} w_e((v_1, v_2))$, where the weight of edge $(v_1, v_2)$ is defined to be $\infty$ if the edge does not exist.
• supernode → innernode (S → I): The weight of edge $s_1 \to v_2$ is defined as $w_e((s_1, v_2)) = \min_{v_1 \in s_1} w_e((v_1, v_2))$. These edges need not be represented explicitly. During the graph traversal, if $s_1$ is an unexpanded supernode, there is a supernode $s_2$ in the adjacency list of supernode $s_1$, and $s_2$ is expanded, such edges can be enumerated by locating all innernodes $\{v_2 \in s_2 \mid \text{the adjacency list of } v_2 \text{ contains some innernode in } s_1\}$.
• innernode → supernode (I → S): The edge weight in this case is defined in an analogous fashion to the previous case.
• innernode → innernode (I → I): The edge weight is the same as in the original graph.

When searching the multi-granular graph, the answers generated may contain supernodes; such an answer is called a supernode answer. If an answer does not contain any supernodes, it is called a pure answer. The final answers returned to users must be pure answers.

The Iterative Expansion Search algorithm (IES) [Dalvi et al., 2008] is a multi-stage algorithm that is applicable to multi-granular graphs, as shown in Algorithm 27. Each iteration of IES can be broken up into two phases.

• Explore phase: Run an in-memory search algorithm on the current state of the multi-granular graph. The multi-granular graph is held entirely in memory: the supernode graph is stored in main memory, and the details of expanded supernodes are stored in cache.
When the search reaches an expanded supernode, it searches on the corresponding innernodes in cache.
• Expand phase: Expand the supernodes found in the top-n (n > k) results of the previous phase and add them to the input graph to produce an expanded multi-granular graph, by loading all the corresponding innernodes into cache.

The graph produced at the end of the Expand phase of iteration i acts as the graph for iteration i + 1. Any in-memory graph search algorithm that treats all nodes (unexpanded supernodes and innernodes) in the same way can be used in the Explore phase. The multi-granular graph is maintained as a “virtual memory view”, i.e., when visiting an expanded supernode, the algorithm looks up its expansion in the cache, and loads it into the cache if it is not already there. The algorithm stops when all top-k results are pure. Other termination heuristics can be used to reduce the time taken for query execution, at the potential cost of missed results.

Algorithm 27 Iterative Expansion Search(G, Q)
Input: a multi-granular graph G, and an l-keyword query $Q = \{k_1, k_2, \cdots, k_l\}$.
Output: top-k pure results.
    while stopping criteria not satisfied do
        /* Explore phase */
        Run any in-memory search algorithm on G to generate the top-n results
        /* Expand phase */
        for each result R in top-n results do
            SNodeSet ← SNodeSet ∪ {all supernodes from R}
        Expand all supernodes in SNodeSet and add them to G
    output top-k pure results

Algorithm 28 Incremental Expansion Backward Search(G, Q)
Input: a multi-granular graph G, and an l-keyword query $Q = \{k_1, k_2, \cdots, k_l\}$.
Output: top-k pure results.
    while less than k pure results generated do
        Result ← BackwardSearch.GetNextResult()
        if Result contains a supernode then
            Expand one or more supernodes in Result and update the SPI trees that contain those expanded supernodes
    output top-k pure results

Algorithm 27 restarts the search (Explore phase) from scratch in every iteration, which can significantly increase CPU time. Dalvi et al. [2008] propose an alternative approach, called incremental expansion. When a supernode answer is generated, one or more supernodes in the answer are expanded. However, instead of restarting each time supernodes are expanded, incremental expansion updates the state of the search algorithm. Once the state is updated, the search continues from where it left off, on the modified graph.

Algorithm 28 shows the Incremental Expansion Backward search (IEB), where the in-memory search is implemented by a backward search algorithm. There is one shortest path iterator (SPI) tree per keyword $k_i$, which contains all nodes “touched” by Dijkstra’s algorithm, including explored nodes and fringe nodes, starting from $k_i$ (or, more precisely, $S_i$). More accurately, the SPI tree does not contain graph nodes; rather, each tree-node of an SPI tree contains a pointer to a graph node. From the SPI tree, the shortest path from an explored node to a keyword node can be identified. The backward search algorithm expands each SPI tree using Dijkstra’s algorithm. When the backward search algorithm produces an answer that contains any supernode, one or more supernodes from the answer are expanded; otherwise, the answer is output. When a supernode is expanded, the SPI trees that contain this supernode should be updated to include all of its innernodes and exclude the supernode itself.

3.5 SUBGRAPH-BASED KEYWORD SEARCH

The previous sections define the answer of a keyword query as a Q-subtree, which is a directed subtree. In the following, we show two subgraph-based notions of answer for a keyword query, namely, the r-radius steiner graph and the multi-center induced graph.

3.5.1 r-RADIUS STEINER GRAPH

Li et al.
[2008a] define the result of an l-keyword query as an r-radius steiner subgraph. The graph is unweighted and undirected, and the length of a path is defined as the number of edges in it. The definition of r-radius steiner graph is based on the following concepts.

Definition 3.11 Centric Distance. Given a graph G and any node $v \in V(G)$, the centric distance of v in G, denoted as CD(v), is the maximum among the shortest distances between v and any node $u \in V(G)$, i.e., $CD(v) = \max_{u \in V(G)} dist(u, v)$.

Definition 3.12 Radius. The radius of a graph G, denoted as R(G), is the minimum value among the centric distances of every node in G, i.e., $R(G) = \min_{v \in V(G)} CD(v)$. G is called an r-radius graph if its radius is exactly r.

Definition 3.13 r-Radius Steiner Graph. Given an r-radius graph G and a keyword query Q, a node v in G is called a content node if it contains some of the input keywords. A node s is called a steiner node if there exist two content nodes, u and v, such that s is on a simple path between u and v. The subgraph of G composed of the steiner nodes and associated edges is called an r-radius steiner graph (SG). The radius of an r-radius steiner graph can be smaller than r.

Example 3.14 Figure 3.9(a) shows two subgraphs, $SG_1$ and $SG_2$, of the data graph shown in Figure 3.1(e). In $SG_1$, the centric distances of $t_1$ and $t_8$ are $CD(t_1) = 2$ and $CD(t_8) = 3$, respectively. In $SG_2$, the centric distances of $t_1$ and $t_8$ are $CD'(t_1) = 3$ and $CD'(t_8) = 3$, respectively. The radii of $SG_1$ and $SG_2$ are $R(SG_1) = 2$ and $R(SG_2) = 3$, respectively. For a keyword query Q = {Brussels, EU}, one 2-radius steiner graph is shown in Figure 3.9(b), where $t_6$ contains keyword “Brussels” and $t_3$ contains keyword “EU”; it is obtained by removing the non-steiner nodes from $SG_1$.

Note that the definition of r-radius steiner graph is based on r-radius subgraph.
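Definitions 3.11 and 3.12 can be made concrete with a small sketch. It assumes an unweighted, undirected graph stored as a dictionary of adjacency lists; the function names and the four-node path graph are illustrative only and are not taken from Li et al. [2008a] or from Figure 3.9.

```python
from collections import deque


def bfs_distances(adj, source):
    """Return hop distances from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def centric_distance(adj, v):
    """CD(v): the largest shortest-path distance from v to any node."""
    dist = bfs_distances(adj, v)
    if len(dist) < len(adj):
        # Assumption of this sketch: unreachable nodes give infinite distance.
        return float("inf")
    return max(dist.values())


def radius(adj):
    """R(G): the smallest centric distance over all nodes of G."""
    return min(centric_distance(adj, v) for v in adj)


# Hypothetical example graph: the path a - b - c - d.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(centric_distance(path, "a"), centric_distance(path, "b"), radius(path))
```

In this path graph the end node a has centric distance 3 and the inner node b has centric distance 2, so the graph is a 2-radius graph in the sense of Definition 3.12.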
A more general definition of r-radius steiner graph would be any induced subgraph satisfying the following two properties: (1) its radius is no more than r, and (2) every node is either a content node or a steiner node. The actual problem of a keyword query in this setting is to find r-radius subgraphs; the corresponding r-radius steiner graph is then obtained in a post-processing step, as described by the definition.

Figure 3.9: 2-radius steiner graph for Q = {Brussels, EU}. (a) Two subgraphs, $SG_1$ and $SG_2$; (b) 2-radius steiner graph.

The approaches to find r-radius subgraphs are based on the adjacency matrix $M = (m_{ij})_{n \times n}$ with respect to $G_D$, which is an $n \times n$ Boolean matrix. An element $m_{ij}$ is 1 if and only if there is an edge between $v_i$ and $v_j$; $m_{ii}$ is 1 for all i. $M^r = M \times M \times \cdots \times M = (m_{ij}^r)_{n \times n}$ is the r-th power of the adjacency matrix M. An element $m_{ij}^r$ is 1 if and only if the length of the shortest path between $v_i$ and $v_j$ is less than or equal to r. $N_i^r = \{v_j \mid m_{ij}^r = 1\}$ is the set of nodes that have a path to $v_i$ with distance no larger than r. $G_i^r$ denotes the subgraph induced by the node set $N_i^r$. $G_{v_i}^r$ ($N_{v_i}^r$) can be used interchangeably with $G_i^r$ ($N_i^r$). We use $G_i \subseteq G_j$ to denote that $G_i$ is a subgraph of $G_j$. The r-radius subgraph is defined based on the $G_i^r$'s. The following lemma is used to find all the r-radius subgraphs [Li et al., 2008a].

Lemma 3.15 [Li et al., 2008a] Given a graph G with $R(G) \geq r > 1$, for every i, $1 \leq i \leq |V(G)|$, $G_i^r$ is an r-radius subgraph if, for every $v_k \in N_i^r$, $N_i^r \not\subseteq N_k^{r-1}$.

Note that the above lemma is a sufficient condition for identifying r-radius subgraphs, but it is not a necessary condition. In principle, there can be exponentially many r-radius subgraphs of G. Li et al. [2008a] only consider $n = |V(G)|$ subgraphs, each uniquely determined by one node in G, although other r-radius subgraphs are possible.
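The matrix-based construction of the neighborhoods $N_i^r$ and the test of Lemma 3.15 can be sketched as follows. This is a minimal illustration, not the implementation of Li et al. [2008a]: it uses naive Boolean matrix multiplication on nested lists, it reads the lemma's condition as "$N_i^r$ is not a subset of $N_k^{r-1}$" (an assumption), and all function names and the five-node path graph are hypothetical.

```python
def bool_matmul(a, b):
    """Boolean product of two square matrices given as lists of lists."""
    n = len(a)
    return [[any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def bool_power(m, r):
    """M^r for r >= 1; with a unit diagonal, entry (i, j) means dist <= r."""
    result = m
    for _ in range(r - 1):
        result = bool_matmul(result, m)
    return result


def neighborhoods(m, r):
    """Return the list of sets N_i^r, one per node index i."""
    return [{j for j, bit in enumerate(row) if bit} for row in bool_power(m, r)]


def satisfies_lemma(m, i, r):
    """Sufficient test (r > 1): N_i^r is not contained in any N_k^(r-1)."""
    n_r = neighborhoods(m, r)
    n_prev = neighborhoods(m, r - 1)
    return all(not n_r[i] <= n_prev[k] for k in n_r[i])


def path_matrix(n):
    """Adjacency matrix (with unit diagonal) of the path 0 - 1 - ... - (n-1)."""
    return [[abs(i - j) <= 1 for j in range(n)] for i in range(n)]


m = path_matrix(5)
print(neighborhoods(m, 2)[2])    # all five nodes lie within distance 2 of node 2
print(satisfies_lemma(m, 2, 2))  # the whole path has radius exactly 2
print(satisfies_lemma(m, 0, 2))  # nodes {0, 1, 2} induce a path of radius 1
```

For node 0 the test fails because $N_0^2 = \{0, 1, 2\}$ is contained in $N_1^1$, which matches the fact that the induced path 0 - 1 - 2 has radius 1 rather than 2.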
An r-radius subgraph $G_i^r$ is maximal if and only if there is no other r-radius subgraph $G_j^r$ that is a supergraph of $G_i^r$, i.e., $G_i^r \subseteq G_j^r$. Li et al. [2008a] consider the maximal r-radius subgraphs $G_i^r$ as the subgraphs that will generate r-radius steiner subgraphs. All of these maximal r-radius subgraphs can be pre-computed and indexed on disk, because they are query independent.

The objective here is to find the top-k r-radius steiner subgraphs, and ranking functions are introduced to rank the r-radius steiner subgraphs. Each keyword term $k_i$ has an IR-style score:

$$Score_{IR}(k_i, SG) = \frac{ntf(k_i, G) \times idf(k_i)}{ndl(G)}$$

which is a normal TF-IDF score, where $idf(k_i)$ indicates the relative importance of keyword $k_i$ and $ntf(k_i, G)$ measures the relevance of G to keyword $k_i$. Here G is the subgraph from which the r-radius steiner subgraph is generated. Each keyword pair $(k_i, k_j)$ has a structural score, which measures the compactness of the two keywords in SG:

$$Sim(k_i, k_j \mid SG) = \frac{1}{|C_{k_i} \cup C_{k_j}|} \sum_{v_i \in C_{k_i},\, v_j \in C_{k_j}} Sim(v_i, v_j \mid SG)$$

where $C_{k_i}$ ($C_{k_j}$) is the set of keyword nodes in SG that contain $k_i$ ($k_j$), and

$$Sim(v_i, v_j \mid SG) = \sum_{p \in path(v_i, v_j \mid SG)} \frac{1}{(len(p) + 1)^2}$$

where $path(v_i, v_j \mid SG)$ denotes the set of all the paths between $v_i$ and $v_j$ in SG and $len(p)$ is the length of path p. Intuitively, $Sim(k_i, k_j \mid SG)$ measures how closely the two keywords, $k_i$ and $k_j$, are connected to each other. The final score of SG is defined as follows:

$$Score(\{k_1, \cdots, k_l\}, SG) = \sum_{1 \leq i < j \leq l} Score(k_i, k_j \mid SG)$$

where

$$Score(k_i, k_j \mid SG) = Sim(k_i, k_j \mid SG) \times (Score_{IR}(k_i, SG) + Score_{IR}(k_j, SG))$$

According to the definition of r-radius steiner subgraph, $Score(k_i, k_j \mid SG)$ can be directly computed on G. Then $Score(k_i, k_j \mid SG)$ can be pre-computed for each possible keyword pair and each maximal r-radius subgraph $G_i^r$.
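The structural score and the pairwise score above can be illustrated with a short sketch. Several points are assumptions of this sketch rather than statements from the source: paths are taken to be simple paths enumerated by depth-first search (exponential in the worst case, so only suitable for small subgraphs), the IR scores are passed in as precomputed numbers instead of being derived from ntf, idf and ndl, and the function names and the triangle graph are hypothetical.

```python
def simple_path_lengths(adj, src, dst):
    """Yield the edge count of every simple path from src to dst."""
    stack = [(src, {src}, 0)]
    while stack:
        node, seen, length = stack.pop()
        if node == dst:
            yield length
            continue
        for nxt in adj[node]:
            if nxt not in seen:
                stack.append((nxt, seen | {nxt}, length + 1))


def node_sim(adj, u, v):
    """Sim(u, v | SG): sum of 1 / (len(p) + 1)^2 over the paths p from u to v."""
    return sum(1.0 / (length + 1) ** 2
               for length in simple_path_lengths(adj, u, v))


def keyword_sim(adj, c_i, c_j):
    """Sim(k_i, k_j | SG) for the keyword-node sets C_ki and C_kj."""
    total = sum(node_sim(adj, u, v) for u in c_i for v in c_j)
    return total / len(c_i | c_j)


def pair_score(adj, c_i, c_j, ir_i, ir_j):
    """Score(k_i, k_j | SG) = Sim(k_i, k_j | SG) * (IR score of k_i + IR score of k_j)."""
    return keyword_sim(adj, c_i, c_j) * (ir_i + ir_j)


# Hypothetical steiner graph: a triangle in which a holds k_i and c holds k_j.
triangle = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
print(keyword_sim(triangle, {"a"}, {"c"}))
print(pair_score(triangle, {"a"}, {"c"}, 1.0, 2.0))
```

In the triangle there are two simple paths from a to c, of lengths 1 and 2, so $Sim(a, c) = 1/4 + 1/9 = 13/36$; dividing by $|C_{k_i} \cup C_{k_j}| = 2$ gives $13/72$, and with the assumed IR scores 1.0 and 2.0 the pairwise score is $13/24$.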
An index can be built by storing, for each possible keyword pair, a list of maximal r-radius subgraphs G in decreasing order of $Score(k_i, k_j \mid SG)$. When a keyword query arrives, these lists can be used directly: by applying the Threshold Algorithm [Fagin, 1998], the top-k maximal r-radius subgraphs can be obtained, and the top-k r-radius steiner subgraphs can then be computed by refining the corresponding r-radius subgraphs.

3.5.2 MULTI-CENTER INDUCED GRAPH

In contrast to tree-based results, which are single-center (root) induced trees, in this section we consider query answers that are multi-centered induced subgraphs of $G_D$. These are referred to as communities [Qin et al., 2009b]. The vertex set of a community R(V, E), V(R), is the union of three subsets, $V = V_c \cup V_l \cup V_p$, where $V_l$ represents a set of keyword nodes (knode), $V_c$ represents a set of center nodes (cnode) (for every cnode $v_c \in V_c$, there exists at least a single path such that $dist(v_c, v_l) \leq R_{max}$ for any $v_l \in V_l$, where $R_{max}$ is introduced to control the size of a community), and $V_p$ represents a …
