4. KEYWORD SEARCH IN XML DATABASES
4.2. SLCA-BASED SEMANTICS

Algorithm 32 IndexedLookupEager(S_1, ..., S_l)
Input: l lists of Dewey IDs; S_i is the list of Dewey IDs of the nodes containing keyword k_i.
Output: All the SLCAs
 1: v ← ⊥
 2: while there are more nodes in S_1 do
 3:     Read p nodes of S_1 into buffer B
 4:     for i ← 2 to l do
 5:         B ← getSLCA(B, S_i)
 6:     removeFirstNode(B), if v ≠ ⊥ and getFirstNode(B) ⪯ v
 7:     output v as a SLCA, if v ≠ ⊥, B ≠ ∅ and v ⊀ getFirstNode(B)
 8:     if B ≠ ∅ then
 9:         v ← removeLastNode(B)
10:         output B as SLCA nodes
11: output v as a SLCA

12: Procedure getSLCA(S_1, S_2)
13: Result ← ∅; u ← root with Dewey ID 0
14: for each node v ∈ S_1 in increasing Dewey ID order do
15:     x ← lca(v, closest(v, S_2))
16:     if pre(u) < pre(x) then
17:         Result ← Result ∪ {u}, if u ⊀ x
18:         u ← x
19: return Result ∪ {u}

ScanEager, p at Line (3) must be no smaller than |S_1|, i.e., it must first compute slca(S_1, S_2), then slca(slca(S_1, S_2), S_3), and so on.

ScanEager directly follows from Property 4.9, Property 4.10, and Property 4.11. ScanEager outputs all the SLCA nodes, i.e., slca(S_1, ..., S_l), in time O(ld|S_1| + d Σ_{i=2}^{l} |S_i|), or O(ld|S|) [Xu and Papakonstantinou, 2005]. Although ScanEager has the same time complexity as StackAlgorithm, it has two advantages. First, ScanEager starts from the smallest keyword list, so it does not have to scan to the end of every keyword list and may terminate much earlier. Second, the number of lca operations of ScanEager is O(l|S_1|), which is usually much smaller than that of StackAlgorithm, which needs O(Σ_{i=1}^{l} |S_i|) lca operations.

Multiway SLCA: MultiwaySLCA [Sun et al., 2007] further improves the performance of IndexedLookupEager, but has the same worst-case time complexity. The motivation and general idea of MultiwaySLCA are shown by the following example.

Algorithm 33 ScanEager(S_1, ..., S_l)
Input: l lists of Dewey IDs; S_i is the list of Dewey IDs of the nodes containing keyword k_i.
Output: All the SLCAs
 1: u ← root with Dewey ID 0
 2: for each node v_1 ∈ S_1 in increasing Dewey ID order do
 3:     move the cursor in each list S_i to closest(v_1, S_i), for 1 ≤ i ≤ l
 4:     v ← lca(v_1, closest(v_1, S_2), ..., closest(v_1, S_l))
 5:     if pre(u) < pre(v) then
 6:         if u ⊀ v then
 7:             output u as a SLCA
 8:         u ← v
 9: output u as a SLCA

Figure 4.4: An Example XML Tree to Illustrate MultiwaySLCA [Sun et al., 2007] (node labels: r; x_1, ..., x_10; a_1, ..., a_1000; b_1, ..., b_1001)

Example 4.15 Consider a keyword query Q = {a, b} on the XML tree shown in Figure 4.4. Here S_a = {a_1, ..., a_1000}, S_b = {b_1, ..., b_1001}, and slca(S_a, S_b) = {x_1, ..., x_10}. Since |S_a| < |S_b|, IndexedLookupEager enumerates each of the "a" nodes in S_a in increasing Dewey ID order to compute a potential SLCA. This results in a total of 1000 slca computations to produce a result of size 10. Many of these computations are redundant, e.g., the SLCA of a_i and S_b gives the same result x_1 for every 1 ≤ i ≤ 100. Conceptually, each potential SLCA computed by IndexedLookupEager can be thought of as being driven by some node from S_a (or S_1 in general). In contrast, MultiwaySLCA picks an "anchor" node among the l keyword lists to drive the multiway SLCA computation at each individual step. In this example, MultiwaySLCA first considers the first node in each keyword list and selects the one with the largest Dewey ID as the anchor node. Thus, between a_1 ∈ S_a and b_1 ∈ S_b, it chooses b_1 as the anchor node. Next, using b_1 as an anchor, it selects the closest node from each other keyword list, i.e., a_100 ∈ S_a, and computes the lca of those chosen nodes, i.e., lca(a_100, b_1) = x_1. The next anchor node is selected in the same way after removing all those nodes with Dewey ID smaller than pre(b_1) from each keyword list. Then b_2 is selected, and slca(b_2, S_a) = x_2.
Clearly, MultiwaySLCA is able to skip many unnecessary computations.

Definition 4.16 Anchor Node. Given l lists S_1, ..., S_l, a sequence of nodes L = ⟨v_1, ..., v_l⟩, where v_i ∈ S_i, is said to be anchored by a node v_a ∈ L if, for each v_i ∈ L, v_i = closest(v_a, S_i). We refer to v_a as the anchor node of L.

Lemma 4.17 If lca(L) is a SLCA and v ∈ L, then lca(L) = lca(L′), where L′ is the set of nodes anchored by v in each S_i.

Thus, to compute potential SLCAs, it suffices to consider anchored sets, where a set is called anchored if it is anchored by some node. In fact, from the definition of Compact LCA and its equivalence to ELCA, if a node u is a SLCA, then there must exist a set {v_1, ..., v_l}, where v_i ∈ S_i for 1 ≤ i ≤ l, such that u = lca(v_1, ..., v_l) and every v_i is an anchor node.

Lemma 4.18 Consider two matches L = ⟨v_1, ..., v_l⟩ and L′ = ⟨u_1, ..., u_l⟩, where L < L′, i.e., v_i ≤ u_i for 1 ≤ i ≤ l, and L is anchored by some node v_i. If L′ contains some node u_j with pre(u_j) ≤ pre(v_i), then lca(L′) is either equal to lca(L) or an ancestor of lca(L).

Lemma 4.18 provides a useful property for finding the next anchor node. Specifically, if we have considered a match L that is anchored by a node v_a, then we can skip all the nodes v ≤ v_a.

Lemma 4.19 Let L and L′ be two matches. If L′ contains two nodes, where one is a descendant of lca(L) while the other is not, then lca(L′) ⪯ lca(L).

Lemma 4.19 provides another useful property for optimizing the choice of the next anchor node. Specifically, if we have considered a match L and lca(L) is guaranteed to be a SLCA, then we can skip all the nodes that are descendants of lca(L).

Lemma 4.20 Let L be a list of nodes. Then lca(L) = lca(first(L), last(L)).

Note that, if the nodes in L are not in order, then computing first(L) and last(L) takes time O(ld), while directly using the definition, i.e., lca(v_1, ..., v_l) = lca(lca(v_1, ..., v_{l−1}), v_l), also takes time O(ld), where l is the number of nodes in L.
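Lemma 4.20 can be checked on a small case. The following is a minimal Python sketch, not the authors' implementation, under these assumptions: a Dewey ID is represented as a tuple of integers, first(L) and last(L) are the nodes of L with the smallest and largest Dewey ID, and Dewey ID order coincides with Python's lexicographic tuple order. The function names and the sample match are hypothetical.

```python
from functools import reduce


def lca(u, v):
    """LCA of two nodes given as Dewey IDs: their longest common prefix."""
    prefix = []
    for a, b in zip(u, v):
        if a != b:
            break
        prefix.append(a)
    return tuple(prefix)


def lca_by_definition(nodes):
    """Pairwise fold: lca(v_1..v_l) = lca(lca(v_1..v_{l-1}), v_l)."""
    return reduce(lca, nodes)


def lca_first_last(nodes):
    """Lemma 4.20: only the smallest and largest Dewey IDs are needed.

    Assumption: tuple comparison (lexicographic, prefix first) models
    the Dewey ID order used in the text.
    """
    return lca(min(nodes), max(nodes))


# Hypothetical unordered match: every node lies below the node (0, 1).
L = [(0, 1, 2, 0), (0, 1, 0, 3), (0, 1, 0, 3, 1), (0, 1, 2)]

# Both ways of computing lca(L) agree.
assert lca_by_definition(L) == lca_first_last(L) == (0, 1)
```

Both variants scan all l nodes of an unordered L once, which matches the O(ld) remark above; the first/last form only saves lca calls when L is already kept in Dewey ID order.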
Two algorithms, namely Basic Multiway-SLCA (BMS) and Incremental Multiway-SLCA (IMS), are proposed in [Sun et al., 2007] to compute all the SLCA nodes. The BMS algorithm implements the general idea above. IMS introduces one further optimization that aims to reduce the number of lca computations of BMS. However, lca takes the same time as comparing two Dewey IDs, and BMS needs to retrieve nodes in order from an unordered set, which incurs extra time. So in the following, we show the BMS algorithm, denoted as MultiwaySLCA, and only outline the further optimization of IMS.

Algorithm 34 MultiwaySLCA(S_1, ..., S_l)
Input: l lists of Dewey IDs; S_i is the list of Dewey IDs of the nodes containing keyword k_i.
Output: All the SLCAs
 1: v_m ← last({first(S_i) | 1 ≤ i ≤ l}), where the index m is also recorded
 2: u ← root with Dewey ID 0
 3: while v_m ≠ ⊥ do
 4:     if m ≠ 1 then
 5:         v_1 ← closest(v_m, S_1)
 6:         v_m ← v_1, if v_m < v_1
 7:     v_i ← closest(v_m, S_i), for each 1 ≤ i ≤ l, i ≠ m
 8:     x ← lca(first(v_1, ..., v_l), last(v_1, ..., v_l))
 9:     if u ≤ x then
10:         output u as a SLCA, if u ⋠ x
11:         u ← x
12:     v_m ← last({rm(v_m, S_i) | 1 ≤ i ≤ l, v_i ≤ v_m})
13:     if v_m ≠ ⊥ and u ⋠ v_m then
14:         v_m ← last({v_m} ∪ {out(u, S_i) | 1 ≤ i ≤ l, i ≠ m})
15: output u as a SLCA

MultiwaySLCA is shown in Algorithm 34. It computes the SLCAs iteratively. At each iteration, an anchor node v_m is selected, and the match anchored by v_m and its LCA are computed; the index m is also stored, and v_m is initialized at line 1. Let u denote the potential SLCA node that was most recently computed; it is initialized to be the root node with Dewey ID 0 (line 2). While v_m is not ⊥, more potential SLCAs can be found (lines 3-14). Lines 4-6 further optimize the anchor node to be a node with a larger Dewey ID, if one exists. After an anchor node v_m is chosen, line 7 finds the match anchored by v_m, and line 8 computes the LCA x of this match. If x ⪯ u (line 9), then x is ignored.
Line 10 outputs u as a SLCA if it is not an ancestor-or-self of x. Then u is updated to be the most recently computed potential SLCA. Lines 12-14 select the next anchor node by choosing the furthest possible node, which maximizes the number of skipped nodes; line 12 corresponds to Lemma 4.18, and lines 13-14 correspond to Lemma 4.19.

Theorem 4.21 Let u and x be the two variables in MultiwaySLCA. If u ≥ x, then x ⪯ u. Otherwise, either u ≺ x or u <⊀ x. [1] If u <⊀ x, then u is guaranteed to be a SLCA.

[1] u <⊀ x means that u < x but u ⊀ x.

IMS [Sun et al., 2007] further optimizes lines 7-8. Let L denote the match anchored by v_m, i.e., L = ⟨v_1, ..., v_l⟩. Note that each call of closest requires two LCA computations. IMS reduces the number of LCA computations by enumerating all the possible L's whose LCA can be a potential SLCA; there are at most l such choices. By the definition of a match L anchored by v_m, L must satisfy the following three conditions:
• L ⊆ {v_m} ∪ P ∪ N, where P = {lm(v_m, S_i) | i ∈ [1, l], i ≠ m, lm(v_m, S_i) ≠ ⊥} and N = {rm(v_m, S_i) | i ∈ [1, l], i ≠ m, rm(v_m, S_i) ≠ ⊥}
• L ∩ S_i ≠ ∅, ∀i ∈ [1, l]
• v_m ∈ L

Without loss of generality, we assume that all lm(v_m, S_i) and rm(v_m, S_i) are not equal to ⊥, that P = ⟨u_1, ..., u_{l−1}⟩, where pre(u_i) ≤ pre(u_{i+1}) ∀i ∈ [1, l−2], that N = ⟨u′_1, ..., u′_{l−1}⟩ is the list corresponding to P, and that v_m ∈ S_l. Then all the possible L's whose LCA can be a potential SLCA are of the form ⟨u_i, ..., u_{l−1}, v_m, u′_1, ..., u′_{i−1}⟩, denoted as L_i. This is because, if first(L) = u_i, then u′_1, ..., u′_{i−1} must be in L, and L_i is the match with the smallest last(L) among all those matches with first(L) = u_i, which results in the largest LCA. Note that all the LCAs are on the path from the root to v_m, as v_m must be in L. Then we can enumerate L in the order L_1, ..., L_l, where first(L_i) ≤ first(L_{i+1}) and last(L_i) ≤ last(L_{i+1}).
Therefore, if lca(L_i) ⪯ lca(L_{i+1}) ∀i < j, and lca(L_j) ⋠ lca(L_{j+1}), then L_j is the match anchored by v_m. Note that the above discussion assumes that the nodes in P are in increasing Dewey ID order; usually this is not the case, so we have to sort P first.

BMS (MultiwaySLCA) and IMS correctly output all the SLCA nodes, i.e., slca(S_1, ..., S_l), in time O(|S_1| ld log |S|) [Sun et al., 2007].

4.3 IDENTIFY MEANINGFUL RETURN INFORMATION

The algorithms shown in the previous section study the efficiency aspect of keyword search. They can find and output all the SLCA nodes (or the whole subtree rooted at each SLCA node) efficiently, but they do not consider the user's intention behind a keyword query. The information returned is either too little (only SLCAs are returned) or too much (the whole subtree rooted at each SLCA is returned). Two approaches have been proposed to identify meaningful return information for a keyword query. One alternative is to represent the whole subtree rooted at a SLCA node compactly and present it to users, so that it will not overwhelm them [Liu and Chen, 2007]. Another alternative is to return only those subtrees that satisfy two novel properties, which capture desirable changes to a query result upon a change to the query or data in a general framework [Liu and Chen, 2008b]. Both works are based on the following definition of query result.

Definition 4.22 Keyword Query Results. Processing keyword query Q on XML tree T returns a set of query results, denoted as R(T, Q), where each query result is a subtree (defined by a pair