Keyword Search in Databases

4. KEYWORD SEARCH IN XML DATABASES

4.4. ELCA-BASED SEMANTICS

Figure 4.9: v and child_elcacan(v) [Xu and Papakonstantinou, 2008]

Algorithm 37 isELCA(v, ch)
Input: a node v and ch = child_elcacan(v).
Output: return true if v is an ELCA, false otherwise
 1: for i ← 1 to l do
 2:   x ← v
 3:   for j ← 1 to |ch| do
 4:     x ← rm(x, S_i)
 5:     if x = ⊥ or pre(x) < pre(ch[j]) then
 6:       break
 7:     else
 8:       x ← nextSibling(ch[j])
 9:   if j = |ch| + 1 then
10:     return false, if v ⊀ rm(x, S_i)
11: return true

Once the first step has produced the ELCA_CANs, if we can find child_elcacan(v) efficiently for each ELCA_CAN v, then we can find the ELCAs in time O(|S_1| l d log |S|). If we assign each ELCA_CAN u to be the child of its ancestor ELCA_CAN node v with the largest Dewey ID, then u corresponds to exactly one node in child_elcacan(v), and the node in child_elcacan(v) corresponding to u can be found in O(d) time from the Dewey ID. In the following, we use child_elcacan(v) to denote the set of ELCA_CAN nodes u that are descendants of v such that no ELCA_CAN node x satisfies v ≺ x ≺ u, i.e.,

$$\mathrm{child\_elcacan}(v) = \{\, u \in \mathrm{elca\_can}(S_1, \ldots, S_l) \mid v \prec u \,\wedge\, \nexists x \in \mathrm{elca\_can}(S_1, \ldots, S_l)\,(v \prec x \prec u) \,\}$$

There is a one-to-one correspondence between the two definitions of child_elcacan(v). It is easy to see that

$$\sum_{v \in \mathrm{elca\_can}(S_1, \ldots, S_l)} |\mathrm{child\_elcacan}(v)| = O(|\mathrm{elca\_can}(S_1, \ldots, S_l)|) = O(|S_1|).$$

Now the problem becomes how to compute child_elcacan(v) efficiently for all v ∈ elca_can(S_1, ..., S_l). Note that the nodes in elca_can(S_1, ..., S_l), as computed by ∪_{v_1 ∈ S_1} elca_can(v_1), are not sorted in Dewey ID order. Similar to DeweyInvertedList, a stack-based algorithm is used to compute child_elcacan(v), but it works on the set elca_can(S_1, ..., S_l), while DeweyInvertedList works on the set S_1 ∪ S_2 ∪ ··· ∪ S_l.
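The definition of child_elcacan(v) can be illustrated with a small brute-force sketch. It is not the stack-based algorithm described in the text; it only spells out the definition. It assumes that Dewey IDs are represented as Python tuples of integers, so that "a ≺ b" means that a is a strict prefix of b. The names `is_ancestor` and `child_elcacan` are hypothetical helpers introduced for this illustration.

```python
from typing import Dict, List, Tuple

Dewey = Tuple[int, ...]  # assumption: a Dewey ID is a tuple such as (0, 1, 4)


def is_ancestor(a: Dewey, b: Dewey) -> bool:
    """Return True if a is a proper ancestor of b, i.e. a strict prefix of b."""
    return len(a) < len(b) and b[:len(a)] == a


def child_elcacan(cands: List[Dewey]) -> Dict[Dewey, List[Dewey]]:
    """Brute-force reading of the definition (quadratic, for illustration only).

    Each candidate u is assigned to its candidate ancestor with the largest
    Dewey ID. All ancestors of u are prefixes of u, so the largest one under
    tuple comparison is also the deepest one.
    """
    result: Dict[Dewey, List[Dewey]] = {v: [] for v in cands}
    for u in cands:
        ancestors = [v for v in cands if is_ancestor(v, u)]
        if ancestors:
            result[max(ancestors)].append(u)
    return result
```

For the candidates (0,), (0, 1), (0, 1, 2), (0, 3) and (0, 1, 4, 0), this assigns (0, 1) and (0, 3) to (0,), and (0, 1, 2) and (0, 1, 4, 0) to (0, 1). Every candidate has at most one parent, which is why the sizes of all child_elcacan sets sum to at most the number of candidates.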
Each stack entry created for a node v_1 ∈ S_1 has the following three components:
• elca_can is elca_can(v_1)
• CH is child_elcacan(v_1)
• SIB is the list of ELCA_CANs before elca_can, which is used to compute CH

IndexedStack [Xu and Papakonstantinou, 2007, 2008] is shown in Algorithm 38. For each node v_1 ∈ S_1, it computes elca_can_v1 = elca_can(v_1) (line 3) and creates a stack entry en for elca_can_v1 (line 4). If the stack is empty (line 5), en is simply pushed onto the stack (line 6). Otherwise, different operations are applied based on the relationship between elca_can_v1 and elca_can_v2, the node at the top of the stack.
• elca_can_v1 = elca_can_v2: en is discarded (lines 8-9).
• elca_can_v2 ≺ elca_can_v1: en is simply pushed onto the stack (lines 10-11).
• elca_can_v2 < elca_can_v1, but elca_can_v2 ⊀ elca_can_v1: the entries in the stack that are not ancestors of elca_can_v1 are popped, and each popped node is checked to see whether it is an ELCA (procedure popStack, lines 23-30). This check is possible because all of its descendant match nodes have been read and its child_elcacan information has been stored in popEntry.CH (lines 27-28). After the non-ancestor nodes have been popped (line 13), it may be necessary to store the sibling nodes of en in en.SIB. Note that, in this case, there may exist a potential ELCA that is an ancestor of en and a descendant of the top entry of the stack (or of the root of the XML tree if the stack is empty). If this is possible (line 15), then the sibling information is stored in en.SIB (line 16).
• elca_can_v1 ≺ elca_can_v2: the entries in the stack that are not ancestors of elca_can_v1 are popped, each popped node is checked to see whether it is an ELCA (line 19), and en.CH is stored (line 20). Note that there are no more potential ELCA nodes that are descendants of the popped entries.

Note that these are the only four possible cases of the relationship between elca_can_v1 and elca_can_v2.
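The four cases can be told apart with two prefix tests and one order comparison on Dewey IDs. The following sketch only classifies the relationship and does not implement the stack operations. It assumes that Dewey IDs are Python tuples of integers, so that tuple comparison gives Dewey ID order. The function names and the returned labels are hypothetical, chosen for this illustration.

```python
from typing import Tuple

Dewey = Tuple[int, ...]  # assumption: a Dewey ID is a tuple such as (0, 1, 4)


def is_ancestor(a: Dewey, b: Dewey) -> bool:
    """Return True if a is a proper ancestor of b, i.e. a strict prefix of b."""
    return len(a) < len(b) and b[:len(a)] == a


def classify(new_can: Dewey, top_can: Dewey) -> str:
    """Classify the new candidate (elca_can_v1) against the stack top (elca_can_v2)."""
    if new_can == top_can:
        return "equal"              # the new entry is discarded
    if is_ancestor(top_can, new_can):
        return "top_is_ancestor"    # the new entry is pushed
    if is_ancestor(new_can, top_can):
        return "new_is_ancestor"    # pop, then store the children in en.CH
    if top_can < new_can:
        return "top_precedes"       # pop, then possibly store siblings in en.SIB
    # The remaining combination (new_can before top_can without being its
    # ancestor) is not among the four cases listed in the text.
    raise ValueError("relationship outside the four listed cases")
```

For example, `classify((0, 2), (0, 1, 5))` falls into the third case of the list above, while `classify((0,), (0, 1, 5))` falls into the fourth.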
IndexedStack outputs all the ELCA nodes, i.e., elca(S_1, ..., S_l), in time $O(|S_1| \sum_{i=2}^{l} d \log |S_i|)$, or O(|S_1| l d log |S|) [Xu and Papakonstantinou, 2008].

Algorithm 38 IndexedStack(S_1, ..., S_l)
Input: l lists of Dewey IDs; S_i is the list of Dewey IDs of the nodes containing keyword k_i.
Output: output all ELCAs
 1: stack ← ∅
 2: for each node v_1 ∈ S_1, in increasing Dewey ID order do
 3:   elca_can_v1 ← slca({v_1}, S_2, ..., S_l)
 4:   en ← [elca_can ← elca_can_v1; SIB ← []; CH ← []]
 5:   if stack = ∅ then
 6:     stack.push(en); continue
 7:   topEntry ← stack.top(); elca_can_v2 ← topEntry.elca_can
 8:   if elca_can_v1 = elca_can_v2 then
 9:     ⊥
10:   else if elca_can_v2 ≺ elca_can_v1 then
11:     stack.push(en)
12:   else if elca_can_v2 < elca_can_v1 then
13:     popEntry ← popStack(elca_can_v1)
14:     top_elcacan ← stack.top().elca_can
15:     if stack ≠ ∅ and top_elcacan ≺ lca(elca_can_v1, popEntry.elca_can) then
16:       en.SIB ← [popEntry.SIB, popEntry.elca_can]
17:     stack.push(en)
18:   else if elca_can_v1 ≺ elca_can_v2 then
19:     popEntry ← popStack(elca_can_v1)
20:     en.CH ← [popEntry.SIB, popEntry.elca_can]
21:     stack.push(en)
22: popStack(0)
23: Procedure popStack(elca_can_v1)
24: popEntry ← ⊥
25: while stack ≠ ∅ and stack.top().elca_can ⊀ elca_can_v1 do
26:   popEntry ← stack.pop()
27:   if isELCA(popEntry.elca_can, toChild_elcacan(popEntry.elca_can, popEntry.CH)) then
28:     output popEntry.elca_can as an ELCA
29:   stack.top().CH ← stack.top().CH + popEntry.elca_can
30: return popEntry

4.4.2 IDENTIFYING MEANINGFUL ELCAS

Kong et al. [2009] extend the definition of contributor [Liu and Chen, 2008b] to valid contributor, and they propose an algorithm similar to MaxMatch to compute relevant matches based on ELCA semantics, i.e., the root t can be an ELCA node.

Definition 4.35 Valid Contributor. Given an XML data T and a keyword query Q, a node v in T is called a valid contributor to Q if either one of the following two conditions holds:
1. v has a unique label tag(v) among its sibling nodes
2. v has several siblings v_1, ..., v_m (m ≥ 1) with the same label as tag(v), but the following conditions hold:
   • ∄v_i, dMatch(v) ⊂ dMatch(v_i)
   • ∀v_i > v, if dMatch(v) = dMatch(v_i), then TC_v ≠ TC_{v_i}, where TC_v denotes the set of words (among the match nodes in M) that appear in the subtree rooted at v

A node is compared only with its sibling nodes that have the same label. If a node v has a unique label among its sibling nodes, then it is a valid contributor. Otherwise, only those nodes whose dMatch is not subsumed by that of any sibling node with the same label are valid contributors. Also, if the subtrees rooted at two sibling nodes contain exactly the same set of words (TC_v), then only one of them is a valid contributor.

Definition 4.36 Relevant Match. For an XML tree T and a query Q, a match node v in T is relevant if v is a witness node of u ∈ elca(Q), and all the nodes on the path from u to v are valid contributors.

Based on this definition of valid contributor and relevant match, all the subtrees formed by an ELCA node and its corresponding relevant match nodes satisfy the four properties discussed earlier [Liu and Chen, 2008b], namely, data monotonicity, data consistency, query monotonicity, and query consistency. Kong et al. [2009] give an algorithm to find the relevant matches for each ELCA node. It consists of three steps: (1) find all ELCAs using DeweyInvertedList or IndexedStack, (2) group match nodes to each ELCA node, and (3) prune irrelevant matches from each group. The algorithm uses ideas similar to MaxMatch to find relevant matches according to the definition of valid contributor.
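The two conditions of Definition 4.35 can be checked for a list of sibling nodes as in the sketch below. It rests on several assumptions that are not stated in this section: dMatch(v) is taken to be the set of query keywords matched in the subtree rooted at v, siblings are given in document order, and when two same-label siblings agree on both dMatch and TC, the later one is kept. The `Sibling` record and the function `valid_contributors` are hypothetical names introduced for this illustration, not the algorithm of Kong et al.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Sibling:
    label: str              # tag(v)
    dmatch: FrozenSet[str]  # assumed: query keywords matched under v
    tc: FrozenSet[str]      # TC_v: words of the match nodes under v


def valid_contributors(siblings: List[Sibling]) -> List[bool]:
    """Return one flag per sibling: True if it is a valid contributor."""
    flags = []
    for i, v in enumerate(siblings):
        same = [(j, s) for j, s in enumerate(siblings)
                if j != i and s.label == v.label]
        if not same:
            flags.append(True)  # condition 1: unique label among siblings
            continue
        # Condition 2, first bullet: dMatch(v) must not be a proper subset
        # of the dMatch of any same-label sibling.
        subsumed = any(v.dmatch < s.dmatch for _, s in same)
        # Condition 2, second bullet: a later same-label sibling with equal
        # dMatch must differ in TC (assumed tie-break direction).
        duplicate = any(j > i and s.dmatch == v.dmatch and s.tc == v.tc
                        for j, s in same)
        flags.append(not subsumed and not duplicate)
    return flags
```

With four siblings labelled a, a, b, a, where the first a matches only one keyword and the other two a nodes match the same two keywords with the same words, the sketch keeps the b node and the last a node. The first a node is subsumed and the second is a duplicate of the last.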
4.5 OTHER APPROACHES

There exist several semantics other than SLCA and ELCA for keyword search on XML databases, namely, meaningful LCA (MLCA) [Li et al., 2004, 2008b], interconnection [Cohen et al., 2003], Compact Valuable LCA (CVLCA) [Li et al., 2007a], and relevance oriented ranking [Bao et al., 2009]. The difference between MLCA and interconnection is that MLCA is based on SLCA, whereas interconnection is not, i.e., the root nodes of the subtrees returned by interconnection may not be SLCA nodes. CVLCA is a combination of the ELCA semantics and the interconnection semantics.

Another approach to keyword search on XML databases is to make use of the schema information, where results are minimal connected trees of XML fragments that contain all the keywords [Balmin et al., 2003; Hristidis et al., 2003b]. Hristidis et al. study keyword search on XML trees and propose efficient algorithms to find minimum connecting trees [Hristidis et al., 2006]. Al-Khalifa et al. integrate an IR-style ranking function into XQuery, and they propose a bulk algebra that is the basis for integrating information retrieval techniques into a standard pipelined database query evaluation engine [Al-Khalifa et al., 2003]. NaLIX (Natural Language Interface to XML) is a system in which an arbitrary English language sentence is translated into an XQuery expression that can be evaluated against an XML database [Li et al., 2007b]. The problem of keyword search on XML using a minimal number of materialized views has also been studied, where the answer definition is based on SLCA semantics [Liu and Chen, 2008a]. Some works study the problem of keyword search over virtual (unmaterialized) XML views [Shao et al., 2007, 2009a]. eXtract is a system that generates snippets for tree results of queries on an XML database, highlighting the most dominant features [Huang et al., 2008a,b].
Answer differentiation is studied to find a limited number of valid features in a result so that they can maximally differentiate this result from the others [Liu et al., 2009a].
