Keyword Search in Databases

3.5. SUBGRAPH-BASED KEYWORD SEARCH

Algorithm 29 GetCommunity(G_D, C, R_max)
Input: a data graph G_D, a core C = [c_1, ..., c_l], and a radius threshold R_max.
Output: a community uniquely determined by C.
1: Find the set of cnodes, V_c, by running |C| copies of Dijkstra's single-source shortest path algorithm
2: Run a single copy of Dijkstra's algorithm to find the shortest distance to the nearest knode for each node v ∈ V(G_D), i.e., dist_k(v) = min_{c ∈ C} dist(v, c)
3: Run a single copy of Dijkstra's algorithm to find the shortest distance from the nearest cnode for each node v ∈ V(G_D), i.e., dist_c(v) = min_{v_c ∈ V_c} dist(v_c, v)
4: V ← {u ∈ V(G_D) | dist_c(u) + dist_k(u) ≤ R_max}
5: Construct the subgraph R of G_D induced by V and return it

set path nodes (pnode) that include all the nodes that appear on any path from a cnode v_c ∈ V_c to a knode v_l ∈ V_l with dist(v_c, v_l) ≤ R_max. E(R) is the set of edges induced by V(R). A community, R, is uniquely determined by the set of knodes, V_l, which is called the core of the community and denoted as core(R). The weight of a community R, w(R), is defined as the minimum, over all cnodes, of the total edge weight from a cnode to every knode; more precisely,

$$w(R) = \min_{v_c \in V_c} \sum_{v_l \in V_l} \mathrm{dist}(v_c, v_l). \qquad (3.8)$$

For simplicity, we use C to represent a core as a list of l nodes, C = [c_1, c_2, ..., c_l], and we may use C[i] to denote c_i ∈ C, where c_i contains the keyword term k_i. Based on the definition of community, once the core C is provided, the community is uniquely determined, and it can be found by Algorithm 29, which is self-explanatory.

Qin et al. [2009b] enumerate all (or the top-k) communities in polynomial delay by adopting Lawler's procedure [Lawler, 1972]. The general idea is the same as EnumTreePD (Algorithm 19). But it is much easier here, because EnumTreePD enumerates trees, which have structure, while in this case only the cores are enumerated, where each core is just a set of l keyword nodes.
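The five steps of Algorithm 29 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes a directed graph stored as a dictionary mapping each node to a list of (neighbor, weight) pairs with non-negative weights. The function names dijkstra, reverse, and get_community are hypothetical. Distances towards the knodes are computed on the reversed graph, which is one way to obtain dist(v, c) for every node v from a single run per knode.

```python
import heapq

INF = float("inf")

def dijkstra(adj, sources):
    """Multi-source Dijkstra: distance from the nearest source to each reachable node."""
    dist = {s: 0.0 for s in sources}
    heap = [(0.0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, INF):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, INF):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def reverse(adj):
    """Reversed copy of the graph; every node of the graph appears as a key."""
    rev = {}
    for u, edges in adj.items():
        rev.setdefault(u, [])
        for v, w in edges:
            rev.setdefault(v, []).append((u, w))
    return rev

def get_community(adj, core, r_max):
    """Return (node set, induced edge list) of the community determined by core."""
    rev = reverse(adj)
    nodes = set(rev)
    # Step 1: one run per knode; a cnode reaches every knode within r_max.
    to_knode = [dijkstra(rev, [c]) for c in core]
    cnodes = {v for v in nodes if all(d.get(v, INF) <= r_max for d in to_knode)}
    # Step 2: distance from each node to its nearest knode (single run).
    dist_k = dijkstra(rev, core)
    # Step 3: distance from the nearest cnode to each node (single run).
    dist_c = dijkstra(adj, cnodes)
    # Step 4: keep nodes lying on a short enough cnode-to-knode path.
    members = {u for u in nodes if dist_c.get(u, INF) + dist_k.get(u, INF) <= r_max}
    # Step 5: induced subgraph.
    edges = [(u, v, w) for u in members for v, w in adj.get(u, []) if v in members]
    return members, edges
```

Calling get_community(adj, ["k1", "k2"], 3) on a small graph returns the node set of the community and the edges induced by it. If no cnode exists for the given radius, both results are empty.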
In this problem, the answer space is S_1 × S_2 × ... × S_l, where each S_i is the set of nodes in G_D that contain keyword k_i. A subspace is described by V_1 × V_2 × ... × V_l, where V_i ⊆ S_i, and it can also be compactly described by a set of inclusion constraints and exclusion constraints. Based on Lawler's procedure, in order to enumerate the communities in increasing cost order, it is straightforward to obtain an algorithm whose delay is O(l · c(l)), where c(l) is the time complexity of computing the best community. Two algorithms are proposed that enumerate communities with delay O(c(l)): one enumerates all communities in arbitrary order with polynomial delay, and the other enumerates the top-k communities in increasing weight order with polynomial delay. In the following, we discuss the second algorithm.

Algorithm 30 COMM-K(G_D, Q, R_max)
Input: a data graph G_D, a keyword set Q = {k_1, ..., k_l}, and a radius threshold R_max.
Output: enumerate the top-K communities in increasing weight order.
1: Find the sets of knodes {S_1, ..., S_l} and their corresponding neighborhood nodes {N_1, ..., N_l}
2: Find the best core (with the lowest weight) and the corresponding weight from {N_1, ..., N_l}, denoted (C, weight)
3: Initialize H ← ∅; H.insert(C, weight, 1, ∅)
4: while H ≠ ∅ and fewer than K communities have been output do
5:     g ← H.pop(); {g = (C, weight, pos, prev)}
6:     R ← GetCommunity(G_D, g.C, R_max), and output R
7:     ∀i ∈ [1, l]: update N_i to be the neighborhood nodes of g.C[i], V_i ← S_i
8:     update {V_1, ..., V_l} by following the links g.prev recursively
9:     for i = l downto g.pos do
10:        V_i ← V_i − {g.C[i]}, update N_i to be the neighborhood nodes of V_i
11:        Find the best core from the current {N_1, ..., N_l}, denoted (C', weight')
12:        H.insert(C', weight', i, g) if C' exists
13:        V_i ← V_i ∪ {g.C[i]}, update N_i to be the neighborhood nodes of V_i

Algorithm 30 shows the high-level pseudocode.
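The way Lawler's procedure splits the answer space can be illustrated with a simplified Python sketch. It is not Algorithm 30 itself. Each heap entry here stores its subspace explicitly instead of reconstructing it from pos and prev links, and the best core of a subspace is found by brute force over the Cartesian product rather than through neighborhood sets. The weight function is a caller-supplied stand-in for the community weight, and the names brute_force_best_core and top_k_cores are hypothetical. Positions are 0-based, so pos 0 corresponds to pos = 1 in the pseudocode.

```python
import heapq
from itertools import product

def brute_force_best_core(subspace, weight):
    """Cheapest core in V_1 x ... x V_l, or None if some V_i is empty."""
    if any(len(v) == 0 for v in subspace):
        return None
    # Ties are broken by the core itself so that results are deterministic.
    return min(product(*subspace), key=lambda core: (weight(core), core))

def top_k_cores(keyword_sets, weight, k):
    """Enumerate up to k cores in increasing weight order, each exactly once."""
    space = tuple(tuple(sorted(s)) for s in keyword_sets)
    heap = []
    first = brute_force_best_core(space, weight)
    if first is not None:
        heap.append((weight(first), first, 0, space))
    results = []
    while heap and len(results) < k:
        w, core, pos, sub = heapq.heappop(heap)
        results.append((core, w))
        # Partition the remainder of the subspace, as in lines 9-13:
        # dimensions before i are fixed to the popped core, dimension i
        # excludes core[i], and later dimensions stay unchanged.
        for i in range(len(sub) - 1, pos - 1, -1):
            child = tuple(
                (core[j],) if j < i
                else tuple(x for x in sub[j] if x != core[j]) if j == i
                else sub[j]
                for j in range(len(sub))
            )
            best = brute_force_best_core(child, weight)
            if best is not None:
                heapq.heappush(heap, (weight(best), best, i, child))
    return results
```

Because the child subspaces are pairwise disjoint and, together with the popped core, cover the parent subspace, no core is reported twice and none is missed. Storing subspaces explicitly costs more memory than the pos and prev fields of Algorithm 30; the sketch trades that efficiency for readability.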
H is a priority heap, used to store the intermediate and potential cores with additional information. The general idea is to consider the entire set of potential cores as an l-dimensional space S_1 × S_2 × ... × S_l and, at each step, to divide a subspace into smaller subspaces and find the best core in each newly generated subspace. At any intermediate step, the subspaces are pairwise disjoint, and their union is guaranteed to cover the whole space. Each time a core with the lowest weight is removed from H, it is guaranteed to be the next community in order (line 5).

The best core of a subspace V_1 × V_2 × ... × V_l, where V_i ⊆ S_i, is found as follows (lines 2, 11). First, a neighborhood node set N_i is found for each set V_i; it consists of all the nodes whose shortest distance to at least one of the nodes in V_i is no greater than R_max. This can be done by running a shortest path algorithm. Second, a linear scan of the nodes finds the best core together with its best center and weight. When the next best core g.C is found, the subspace from which g.C was found is partitioned into several subspaces (lines 9-13); the best core of each newly generated subspace is found (line 11) and inserted into H (line 12). Each entry in H consists of four fields, (C, weight, pos, prev), where C is the core and weight is the corresponding weight; pos and prev are used to efficiently reconstruct the subspace from which C was computed, without storing the description of the subspace explicitly.

Algorithm 30 enumerates the top-k communities in increasing weight order with delay O(l(n log n + m)), using space O(l^2 · k + l · n + m) [Qin et al., 2009b]. Note that finding the best core in a subspace (under inclusion constraints and exclusion constraints) also takes time c(l) = O(l(n log n + m)). Following the discussion of EnumTreePD, it is easy to obtain an enumeration algorithm with delay l · c(l).
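The two-step search for the best core of a subspace can be sketched as follows. This is an illustrative reading of the description, under stated assumptions: the graph is passed in reversed form (rev_adj maps a node to its incoming neighbors with edge weights), so that one multi-source Dijkstra run per V_i yields, for each node within R_max, its distance to the nearest node of V_i. The function names nearest_in_set and best_core_in_subspace are hypothetical, and the sketch assumes l ≥ 1.

```python
import heapq

INF = float("inf")

def nearest_in_set(rev_adj, sources, r_max):
    """Map each node within r_max of `sources` to (distance, nearest source).

    This is the neighborhood set N_i of one V_i, computed by a multi-source
    Dijkstra on the reversed graph that is cut off at r_max.
    """
    best = {s: (0.0, s) for s in set(sources)}
    heap = [(0.0, s, s) for s in best]
    heapq.heapify(heap)
    while heap:
        d, u, src = heapq.heappop(heap)
        if d > best[u][0]:
            continue  # stale heap entry
        for v, w in rev_adj.get(u, []):
            nd = d + w
            if nd <= r_max and nd < best.get(v, (INF, None))[0]:
                best[v] = (nd, src)
                heapq.heappush(heap, (nd, v, src))
    return best

def best_core_in_subspace(rev_adj, subspace, r_max):
    """Return (core, weight, center) with the lowest weight, or None."""
    hoods = [nearest_in_set(rev_adj, v_i, r_max) for v_i in subspace]
    # A candidate center must lie in every neighborhood set N_i.
    candidates = set(hoods[0]).intersection(*(set(h) for h in hoods[1:]))
    best = None
    for center in sorted(candidates):  # linear scan over candidate centers
        weight = sum(h[center][0] for h in hoods)
        if best is None or weight < best[1]:
            best = ([h[center][1] for h in hoods], weight, center)
    return best
```

For each candidate center, the cheapest core is obtained by picking the nearest node of every V_i, so the scan minimizes the sum in Equation (3.8) over both the center and the core. Removing a node from some V_i, as in line 10 of Algorithm 30, simply changes the sources of one Dijkstra run.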
However, because information can be shared during consecutive executions of Line 11 of EnumTreePD, Algorithm 30 can enumerate communities with delay c(l) rather than l · c(l).

CHAPTER 4
Keyword Search in XML Databases

In this chapter, we focus on keyword search in XML databases, where an XML database is treated as a large data tree. We introduce various semantics for answering a keyword query on an XML tree, and we discuss efficient algorithms that find the answers under these semantics. A main difference between this chapter and the previous chapters is that the underlying data structure is a large tree instead of a large graph.

In Section 4.1, we introduce several important concepts and definitions such as Lowest Common Ancestor (LCA), Smallest LCA (SLCA), Exclusive LCA (ELCA), and Compact LCA (CLCA). Their properties and the relationships among LCA, SLCA, and ELCA will be discussed. In Section 4.2, we discuss the algorithms that find answers based on SLCA. In Section 4.3, we discuss the algorithms that focus on identifying meaningful return information. We discuss algorithms that find answers based on ELCA in Section 4.4. In Section 4.5, we briefly present several approaches based on meaningful LCA, interconnection, and relevance-oriented ranking.

4.1 XML AND PROBLEM DEFINITION

XML is modeled as a rooted and labeled tree, such as the one shown in Figure 4.1. Each internal node v in the tree corresponds to an XML element, called an element node, and is labeled with a tag/label tag(v). Each leaf node of the tree corresponds to a data value, called a value node. For example, in Figure 4.1, "Dean" and "Title" are element nodes, and "John" and "Ben" are value nodes. In this model, the attribute nodes are modeled as children of the associated element node, and we do not distinguish them from element nodes. Each node (element node or value node) in the XML tree is assigned a unique Dewey ID.
The Dewey IDs of nodes are assigned in the following way: the relative position of each node among its siblings is recorded, and the concatenation of these relative positions, separated by dots ('.') and starting from the root, composes the Dewey ID of the node. For example, the node with Dewey ID 0.1.2.1 (Students) is the second child of its parent node 0.1.2 (Class). We denote the Dewey ID of a node v as pre(v), as it is compatible with the preorder numbering, i.e., a node v_1 precedes another node v_2 in the preorder left-to-right depth-first traversal of the tree if and only if pre(v_1) < pre(v_2). The < relationship between two Dewey IDs is the same as the component-by-component comparison of two sequences. Besides preserving the order information, the Dewey ID can also be used to detect sibling and ancestor-descendant relationships between nodes.
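The comparisons described above can be sketched in a few lines of Python. This is only an illustration. It assumes Dewey IDs are given as dot-separated strings of non-negative integers, and the helper names parse_dewey, precedes, is_ancestor, is_sibling, and lca are hypothetical. Parsing the components as integers matters, because comparing the raw strings would place "0.1.10" before "0.1.2".

```python
def parse_dewey(dewey_id):
    """Turn '0.1.2.1' into the tuple (0, 1, 2, 1)."""
    return tuple(int(part) for part in dewey_id.split("."))

def precedes(a, b):
    """True if node a comes before node b in preorder, i.e., pre(a) < pre(b)."""
    # Python compares tuples component by component, like two sequences.
    return parse_dewey(a) < parse_dewey(b)

def is_ancestor(a, b):
    """True if a is a proper ancestor of b: pre(a) is a proper prefix of pre(b)."""
    pa, pb = parse_dewey(a), parse_dewey(b)
    return len(pa) < len(pb) and pb[:len(pa)] == pa

def is_sibling(a, b):
    """True if a and b are distinct nodes with the same parent."""
    pa, pb = parse_dewey(a), parse_dewey(b)
    return pa != pb and len(pa) == len(pb) and pa[:-1] == pb[:-1]

def lca(a, b):
    """Dewey ID of the lowest common ancestor: the longest common prefix."""
    prefix = []
    for x, y in zip(parse_dewey(a), parse_dewey(b)):
        if x != y:
            break
        prefix.append(x)
    return ".".join(str(x) for x in prefix)
```

For instance, is_ancestor("0.1.2", "0.1.2.1") holds for the Class and Students nodes of the example, and an ancestor always precedes its descendants because a proper prefix compares as smaller.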
