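4.2. SLCA-BASED SEMANTICS

4.2.1 PROPERTIES OF LCA AND SLCA

Property 4.9 Given a set $S$ and two nodes $v_i$ and $v_j$ with $v_i < v_j$, then $\mathit{closest}(v_i, S) \le \mathit{closest}(v_j, S)$.

Proof. We prove it by contradiction, assuming that $\mathit{closest}(v_i, S) > \mathit{closest}(v_j, S)$. Then $\mathit{closest}(v_i, S) = \mathit{rm}(v_i, S)$, $\mathit{closest}(v_j, S) = \mathit{lm}(v_j, S)$, and $\mathit{rm}(v_i, S) > \mathit{lm}(v_j, S)$. Recall that $\mathit{closest}(v, S)$ is chosen from $\mathit{lm}(v, S)$ and $\mathit{rm}(v, S)$, and that $\mathit{lm}(v_i, S) \le \mathit{lm}(v_j, S)$ and $\mathit{rm}(v_i, S) \le \mathit{rm}(v_j, S)$ if they all exist. Since $\mathit{lm}(v_j, S) < \mathit{rm}(v_i, S)$, we have $\mathit{lm}(v_j, S) \le \mathit{lm}(v_i, S)$, and therefore $\mathit{lm}(v_i, S) = \mathit{lm}(v_j, S)$ by the fact that $\mathit{lm}(v_i, S) \le \mathit{lm}(v_j, S)$. Similarly, $\mathit{rm}(v_i, S) = \mathit{rm}(v_j, S)$. Also, $\mathit{lm}(v_i, S) \ne \mathit{rm}(v_i, S)$; otherwise $\mathit{closest}(v_i, S) = \mathit{lm}(v_i, S)$. Let $\mathit{lm}$ denote $\mathit{lm}(v_i, S)$ and $\mathit{rm}$ denote $\mathit{rm}(v_i, S)$. It holds that $\mathit{lm} < v_i < v_j < \mathit{rm}$. According to Property 4.2, $\mathit{lca}(\mathit{lm}, v_j) \preceq \mathit{lca}(\mathit{lm}, v_i)$ and $\mathit{lca}(\mathit{rm}, v_i) \preceq \mathit{lca}(\mathit{rm}, v_j)$. According to the definition of closest, $\mathit{lca}(\mathit{lm}, v_i) \prec \mathit{lca}(\mathit{rm}, v_i)$ and $\mathit{lca}(\mathit{rm}, v_j) \preceq \mathit{lca}(\mathit{lm}, v_j)$. Chaining these four relations makes $\mathit{lca}(\mathit{lm}, v_j)$ a proper ancestor of itself, which is a contradiction. ✷

Property 4.10 Let $V$ and $U$ be lists of nodes, e.g., $V = \{v_1, \dots, v_l\}$ and $U = \{u_1, \dots, u_l\}$, such that $V \le U$, i.e., $v_i \le u_i$ for $1 \le i \le l$. Let $\mathit{lca}(V)$ and $\mathit{lca}(U)$ be the LCA of the nodes in $V$ and $U$, respectively. Then,
1. if $\mathit{lca}(V) \ge \mathit{lca}(U)$, then $\mathit{lca}(U) \preceq \mathit{lca}(V)$,
2. if $\mathit{lca}(V) < \mathit{lca}(U)$, then
   • either $\mathit{lca}(V) \prec \mathit{lca}(U)$,
   • or $\mathit{lca}(V) \not\prec \mathit{lca}(U)$, in which case for any $W$ with $U \le W$, $\mathit{lca}(V) \not\prec \mathit{lca}(W)$.

Proof. This is an extension of Property 4.3 to more than two nodes. The proof is by induction. When $V$ and $U$ contain only two nodes, the claim is proven in Property 4.3. Assume that it is true for $V$, $U$ and $W$; we prove that it is true for $V'$, $U'$ and $W'$, where $V' = V \cup \{v_l\}$ and $U' = U \cup \{u_l\}$, with $v_l \le u_l$. One important property of lca is that $\mathit{lca}(V') = \mathit{lca}(\mathit{lca}(V), v_l)$. If $\mathit{lca}(U) \preceq \mathit{lca}(V)$, then either $\mathit{lca}(U') \preceq \mathit{lca}(V')$ or $\mathit{lca}(V') \prec \mathit{lca}(U')$.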
Otherwise, $\mathit{lca}(V) < \mathit{lca}(U)$. According to Property 4.3, there are three cases for $\mathit{lca}(V')$ and $\mathit{lca}(U')$, and we only need to prove the last case, i.e., the case that $\mathit{lca}(V') < \mathit{lca}(U')$ and $\mathit{lca}(V') \not\prec \mathit{lca}(U')$. Then for any $W' = W \cup \{w_l\}$, if $\mathit{lca}(U) \le \mathit{lca}(W)$, then we are done; otherwise $\mathit{lca}(W) \prec \mathit{lca}(U)$, and then $\mathit{lca}(V') \not\prec \mathit{lca}(W')$, because $\mathit{lca}(W') \preceq \mathit{lca}(W)$. ✷

4. KEYWORD SEARCH IN XML DATABASES

[Figure 4.3: Stack Data Structure. Columns: id, $k_1$, $k_2$, ..., $k_l$; rows from top to bottom: $id_m$, ..., $id_2$, $id_1$.]

4.2.2 EFFICIENT ALGORITHMS FOR SLCAS

In this section, we consider three algorithms, namely StackAlgorithm, IndexedLookupEager, and ScanEager [Xu and Papakonstantinou, 2005], that find all the $\mathit{slca}(S_1, \dots, S_l)$ efficiently. Each algorithm has different characteristics and works efficiently in some situations. MultiwaySLCA further improves the performance of IndexedLookupEager by proposing some heuristics, but it has the same worst-case time complexity as IndexedLookupEager. Note that these algorithms only find the SLCAs; they do not keep the match nodes for the SLCAs. Finding the match nodes for all the SLCAs can be done efficiently by one scan of the SLCAs and one scan of $S_1, \dots, S_l$, provided that the SLCA nodes are in increasing Dewey ID order.

Stack Algorithm: This is an adaptation of the stack-based sort-merge algorithm [Guo et al., 2003] to compute all the SLCAs. It uses a stack in which each entry has a pair of components (id, keyword), as shown in Figure 4.3. Assume that the id components from the bottom entry up to a stack entry en are $id_1, \dots, id_m$, respectively; then the stack entry en denotes the node with the Dewey ID $id_1.id_2.\cdots.id_m$. The keyword component is an array of $l$ Boolean values, where keyword[i] = true means that the subtree rooted at the node denoted by the entry contains keyword $k_i$ directly or indirectly.
The general idea of StackAlgorithm is to use a stack to simulate the postorder traversal of a virtual XML tree formed by the union of the paths from the root to each node in $S_1, \dots, S_l$, while the nodes are read in preorder. When an entry en is popped, all the descendant-or-self nodes of en in $S_1, \dots, S_l$ have been visited, so it is known whether or not each keyword appears in its subtree. StackAlgorithm merges all keyword lists and computes the longest common prefix of the node with the smallest Dewey ID from the input lists and the node denoted by the top entry of the stack. Then it pops top entries until the longest common prefix is reached. If the keyword component of a popped entry en contains all the keywords, then the node denoted by en is an SLCA node. By the definition of SLCA, no ancestor of an SLCA node can be an SLCA, so this information is recorded. Otherwise, the keyword containment information of en is used to update its parent entry's keyword array. Also, a stack entry is created for each Dewey component of the currently visited node that is not part of the common prefix, where each new entry corresponds to one node on the path from the longest common prefix to the current node.

Algorithm 31 StackAlgorithm(S_1, ···, S_l)
Input: l lists of Dewey IDs; S_i is the list of Dewey IDs of the nodes containing keyword k_i.
Output: All the SLCAs
 1: stack ← ∅
 2: while has not reached the end of all Dewey lists do
 3:     v ← getSmallestNode()
 4:     p ← lca(stack, v)
 5:     while stack.size > p do
 6:         en ← stack.pop()
 7:         if en.keyword[i] = true, ∀i (1 ≤ i ≤ l) then
 8:             output en as a SLCA
 9:             mark all the entries in stack so that they can never be SLCA nodes
10:         else
11:             ∀i (1 ≤ i ≤ l): stack.top().keyword[i] ← true, if en.keyword[i] = true
12:     ∀i (p < i ≤ v.length): stack.push(v[i], [])
13:     stack.top().keyword[i] ← true, where v ∈ S_i
14: check entries of the stack and return any SLCA node if exists
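Algorithm 31 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: Dewey IDs are represented as tuples of integers (so tuple comparison gives preorder), each input list is sorted, and the name `stack_slca` and the three-field stack entry (id component, keyword flags, and a flag recording that a descendant is already an SLCA) are hypothetical choices made here.

```python
import heapq

def stack_slca(*lists):
    """Return all SLCAs of the keyword lists (a sketch of Algorithm 31).

    Assumption: each list is sorted and holds Dewey IDs as tuples of ints.
    """
    l = len(lists)
    # Merge the lists into one preorder stream of (dewey_id, keyword_index).
    merged = heapq.merge(*[[(v, i) for v in s] for i, s in enumerate(lists)])
    stack = []    # each entry: [id_component, keyword_flags, blocked]
    result = []

    def pop():
        comp, kw, blocked = stack.pop()
        node = tuple(e[0] for e in stack) + (comp,)
        if blocked:
            return                       # a descendant is already an SLCA
        if all(kw):
            result.append(node)          # line 8: en is an SLCA
            for e in stack:              # line 9: ancestors can never be SLCAs
                e[2] = True
        elif stack:
            parent_kw = stack[-1][1]     # line 11: update the parent entry
            for i in range(l):
                parent_kw[i] = parent_kw[i] or kw[i]

    for v, i in merged:
        # Line 4: length of the common prefix of the stack path and v.
        p = 0
        while p < len(stack) and p < len(v) and stack[p][0] == v[p]:
            p += 1
        while len(stack) > p:            # lines 5-11
            pop()
        for comp in v[p:]:               # line 12
            stack.append([comp, [False] * l, False])
        stack[-1][1][i] = True           # line 13
    while stack:                         # line 14
        pop()
    return result

S1 = [(0, 0, 0), (0, 1, 0), (0, 2, 0)]   # nodes containing keyword k1
S2 = [(0, 0, 1), (0, 1, 0, 0)]           # nodes containing keyword k2
print(stack_slca(S1, S2))                # [(0, 0), (0, 1, 0)]
```

In this example the root (0,) and the node (0, 1) also contain both keywords, but each has a descendant that does, so they are suppressed by the marking step.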
Essentially, the node represented by the top entry of the stack is the node currently visited in the preorder traversal. StackAlgorithm is shown in Algorithm 31. It first initializes the stack stack to be empty (line 1). As long as there are Dewey lists that have not been fully visited (line 2), it reads the next node with the smallest Dewey ID (line 3) and performs the necessary operations. Reading nodes in this order is equivalent to a preorder traversal of the original XML tree that ignores irrelevant nodes. Let stack[i] denote the node represented by the i-th entry of stack counting from the bottom, and let v[i] denote the i-th component of the Dewey ID of v. After getting v, the algorithm computes the LCA of v and the node represented by the top of stack (line 4), which is stack[p]. This means that all the keyword nodes that are descendants of stack[p + 1], if they exist, have been read, and the keyword containment information has been stored in the corresponding stack entries. Then all the nodes represented by stack[i] (p < i ≤ stack.size) are popped (lines 5-11). For each popped entry en (line 6), the algorithm first checks whether it is an SLCA node (line 7); if en is indeed an SLCA node, then it is output (line 8) and the fact that none of its ancestors can be an SLCA is recorded (line 9). Otherwise, the keyword containment information of its parent node is updated (line 11). After popping all the non-ancestor nodes from stack, v and its ancestors that are not yet on the stack are pushed onto stack (line 12), and the keyword containment information is stored (line 13). At this moment, the node represented by the top entry of stack is v, the whole stack represents all the nodes on the path from the root to v, and the keyword containment information is stored compactly. After all the Dewey lists have been read, all the entries need to be popped from stack, and a check is performed to see if there exists any SLCA node (line 14). StackAlgorithm outputs all the SLCA nodes, i.e.,
$\mathit{slca}(S_1, \dots, S_l)$, in time $O(d \sum_{i=1}^{l} |S_i|)$, or $O(ld|S|)$ [Xu and Papakonstantinou, 2005]. Note that the above time complexity does not take into account the time to merge $S_1, \dots, S_l$, which takes time $O(d \log l \cdot \sum_{i=1}^{l} |S_i|)$. getSmallestNode (line 3) just retrieves the next node with the smallest Dewey ID from the merged list.

Indexed Lookup Eager: StackAlgorithm treats all the Dewey lists $S_1, \dots, S_l$ equally, but sometimes $|S_1|, \dots, |S_l|$ vary dramatically. Xu and Papakonstantinou [2005] propose IndexedLookupEager to compute all the SLCA nodes in the situation where $|S_1|$ is much smaller than $|S|$. It is based on the following properties of the slca function.

Property 4.11 $\mathit{slca}(\{v\}, S) = \mathit{lca}(v, \mathit{closest}(v, S))$, and $\mathit{slca}(\{v\}, S_2, \dots, S_l) = \mathit{slca}(\mathit{slca}(\{v\}, S_2, \dots, S_{l-1}), S_l) = \mathit{lca}(v, \mathit{closest}(v, S_2), \dots, \mathit{closest}(v, S_l))$ for $l > 2$.

Property 4.11 suggests that we can find the SLCA node of a node $v$ and a set of nodes $S$ by first finding the node of $S$ closest to $v$ and then finding the LCA of $v$ and that closest node. The definition of closest is given in Section 4.1.2. Based on Property 4.11, we can compute $\mathit{slca}(\{v_1\}, S_2, \dots, S_l)$ by first finding the closest point to $v_1$ in each set $S_i$, denoted as $\mathit{closest}(v_1, S_i)$; the slca then consists of the single node $\mathit{lca}(v_1, \mathit{closest}(v_1, S_2), \dots, \mathit{closest}(v_1, S_l))$. The computation of $\mathit{slca}(\{v_1\}, S_2, \dots, S_l)$ takes time $O(d \sum_{i=2}^{l} \log |S_i|)$. For arbitrary $S_1, \dots, S_l$, we have the following property.

Property 4.12 $\mathit{slca}(S_1, \dots, S_l) = \mathit{removeAncestor}\left(\bigcup_{v_1 \in S_1} \mathit{slca}(\{v_1\}, S_2, \dots, S_l)\right)$.

Property 4.12 shows that in order to find the SLCA nodes of $S_1, \dots, S_l$, we can first find $\mathit{slca}(\{v_1\}, S_2, \dots, S_l)$ for each $v_1 \in S_1$, and then remove all the ancestor nodes. Its correctness follows from the fact that $\mathit{slca}(S_1, \dots, S_l) = \mathit{removeAncestor}(\mathit{lca}(S_1, \dots, S_l))$. The definition of removeAncestor is given in Section 4.1.2.
The above two properties directly lead to an algorithm to compute $\mathit{slca}(S_1, \dots, S_l)$: (1) first compute $\{x_i\} = \mathit{slca}(\{v_i\}, S_2, \dots, S_l)$ for each $v_i \in S_1$ ($1 \le i \le |S_1|$); (2) $\mathit{removeAncestor}(\{x_1, \dots, x_{|S_1|}\})$ is the answer. The time complexity of the algorithm is $O(|S_1| \sum_{i=2}^{l} d \log |S_i| + |S_1| d \log |S_1|)$, or $O(|S_1| l d \log |S|)$. The first step of computing $\mathit{slca}(\{v_i\}, S_2, \dots, S_l)$ for each $v_i \in S_1$ takes time $O(|S_1| \sum_{i=2}^{l} d \log |S_i|)$. The second step takes time $O(|S_1| d \log |S_1|)$; it can be implemented by first sorting $\{x_1, \dots, x_{|S_1|}\}$ in increasing Dewey ID order and then finding the SLCA nodes by a linear scan. Note that this time complexity differs from that given by Xu and Papakonstantinou [2005], which is $O(|S_1| \sum_{i=2}^{l} d \log |S_i| + |S_1|^2)$. Although it has the same time complexity as IndexedLookupEager, the above algorithm is a blocking algorithm, while IndexedLookupEager is non-blocking.

Lemma 4.13 Given any two nodes $v_i$ and $v_j$, with $\mathit{pre}(v_i) < \mathit{pre}(v_j)$, and a set $S$ of Dewey IDs:
1. if $\mathit{slca}(\{v_i\}, S) \ge \mathit{slca}(\{v_j\}, S)$, then $\mathit{slca}(\{v_j\}, S) \preceq \mathit{slca}(\{v_i\}, S)$.
2. if $\mathit{slca}(\{v_i\}, S) < \mathit{slca}(\{v_j\}, S)$,
   • either $\mathit{slca}(\{v_i\}, S)$ is an ancestor of $\mathit{slca}(\{v_j\}, S)$,
   • or $\mathit{slca}(\{v_i\}, S)$ is not an ancestor of $\mathit{slca}(\{v_j\}, S)$, in which case for any $v$ such that $\mathit{pre}(v) > \mathit{pre}(v_j)$, $\mathit{slca}(\{v_i\}, S) \not\prec \mathit{slca}(\{v\}, S)$.

The correctness of the above lemma follows directly from Property 4.3 and Property 4.11. It leads straightforwardly to a non-blocking algorithm to compute $\mathit{slca}(S_1, S_2)$ by removing ancestor nodes on the fly, which is shown as the subroutine getSLCA in IndexedLookupEager. The above lemma can be directly applied to multiple sets with the first set as a singleton, i.e., by replacing $S$ with $S_2, \dots, S_l$ in the lemma. The correctness follows directly from Property 4.10, Property 4.9, and Property 4.11.

Property 4.14 $\mathit{slca}(S_1, \dots, S_l) = \mathit{slca}(\mathit{slca}(S_1, \dots, S_{l-1}), S_l)$ for $l > 2$.
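The two-step blocking procedure derived from Property 4.11 and Property 4.12 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the algorithm as published: Dewey IDs are tuples of integers, every list is sorted and non-empty, and closest picks the right match only when it shares a strictly longer prefix with v (ties go to the left match, following the usage in the proof of Property 4.9, since the definition in Section 4.1.2 is not reproduced here). All function names are hypothetical.

```python
from bisect import bisect_left, bisect_right

def lca(u, v):
    """LCA of two Dewey IDs, i.e., their longest common prefix."""
    n = 0
    for a, b in zip(u, v):
        if a != b:
            break
        n += 1
    return u[:n]

def closest(v, s):
    """Return lm(v, s) or rm(v, s), whichever has the deeper LCA with v.

    Assumption: s is a sorted, non-empty list; ties go to the left match.
    """
    i = bisect_right(s, v)               # s[i - 1] is the largest node <= v
    j = bisect_left(s, v)                # s[j] is the smallest node >= v
    lm = s[i - 1] if i > 0 else None
    rm = s[j] if j < len(s) else None
    if lm is None:
        return rm
    if rm is None:
        return lm
    return rm if len(lca(v, rm)) > len(lca(v, lm)) else lm

def slca_single(v, others):
    """Property 4.11: lca(v, closest(v, S_2), ..., closest(v, S_l))."""
    x = v
    for s in others:
        x = lca(x, closest(v, s))
    return x

def remove_ancestors(nodes):
    """Sort, then drop every node that is a prefix of its successor."""
    xs = sorted(set(nodes))
    return [x for k, x in enumerate(xs)
            if k + 1 == len(xs) or xs[k + 1][:len(x)] != x]

def slca_blocking(s1, *others):
    """Property 4.12: per-node SLCAs for S_1, then ancestor removal."""
    return remove_ancestors(slca_single(v, others) for v in s1)

S1 = [(0, 0, 0), (0, 1, 0), (0, 2, 0)]
S2 = [(0, 0, 1), (0, 1, 0, 0)]
print(slca_blocking(S1, S2))             # [(0, 0), (0, 1, 0)]
```

The sketch is blocking because `remove_ancestors` needs all candidates before it can emit anything. The linear scan works because, in sorted Dewey order, the descendants of a node immediately follow it.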
IndexedLookupEager, as shown in Algorithm 32, follows directly from Lemma 4.13, Property 4.11, Property 4.12, and Property 4.14. The value p in line 3 is the buffer size; it can be any value from 1 to $|S_1|$, and the smaller p is, the sooner the algorithm produces the first SLCA. The algorithm first computes $X_2 = \mathit{slca}(X_1, S_2)$, where $X_1$ is the next p nodes from $S_1$ (line 3). Then it computes $X_3 = \mathit{slca}(X_2, S_3)$ and so on, until it computes $X_l = \mathit{slca}(X_{l-1}, S_l)$ (lines 4-5). Note that at any step, the nodes in $X_i$ are in increasing Dewey ID order, and there is no ancestor-descendant relationship between any two nodes in $X_i$. All nodes in $X_l$ except the first and the last one are guaranteed to be SLCA nodes (line 9). The first node of $X_l$ is checked at line 6. The last node of $X_l$ is carried over to the next iteration (line 9), where it is determined whether or not it is an SLCA (line 7). IndexedLookupEager outputs all the SLCA nodes, i.e., $\mathit{slca}(S_1, \dots, S_l)$, in time $O(|S_1| \sum_{i=2}^{l} d \log |S_i|)$, or $O(|S_1| l d \log |S|)$ [Xu and Papakonstantinou, 2005].

Scan Eager: When the keyword frequencies, i.e., $|S_1|, \dots, |S_l|$, do not differ significantly, the total cost of finding matches by lookups using binary search may exceed the total cost of finding the matches by scanning the keyword lists, i.e., $O(|S_1| l d \log |S|) > O(ld|S|)$. ScanEager (Algorithm 33) [Xu and Papakonstantinou, 2005] modifies line 15 of IndexedLookupEager by using a linear scan to find lm() and rm(). It takes advantage of the fact that the accesses to any keyword list are strictly in increasing order in IndexedLookupEager. Consider the getSLCA($S_1$, $S_2$) subroutine in IndexedLookupEager: in order to find $\mathit{lm}(v, S_2)$ and $\mathit{rm}(v, S_2)$, ScanEager maintains a cursor for each keyword list, and it advances the cursor of $S_2$ until it finds the node that is closest to $v$ from the left or the right side.
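The cursor idea can be illustrated with a short Python sketch. It is not Algorithm 33 itself, only the lookup step that ScanEager substitutes for binary search; it assumes Dewey IDs are tuples of integers, both lists are sorted, and `None` stands for ⊥. The function name `scan_matches` is hypothetical.

```python
def scan_matches(s1, s2):
    """For each v in s1, find lm(v, s2) and rm(v, s2) with one forward cursor.

    Assumption: s1 and s2 are sorted lists of Dewey IDs (tuples of ints).
    Returns a list of (v, lm, rm) triples, with None standing for ⊥.
    """
    out = []
    cur = 0                              # s2[:cur] are all <= the current v
    for v in s1:
        # Advance only; the cursor never moves backwards across nodes of s1.
        while cur < len(s2) and s2[cur] <= v:
            cur += 1
        lm = s2[cur - 1] if cur > 0 else None
        if lm == v:
            rm = v                       # v itself occurs in s2
        else:
            rm = s2[cur] if cur < len(s2) else None
        out.append((v, lm, rm))
    return out

S1 = [(0, 0, 0), (0, 1, 0), (0, 2, 0)]
S2 = [(0, 0, 1), (0, 1, 0, 0)]
for v, lm, rm in scan_matches(S1, S2):
    print(v, lm, rm)
```

Because the cursor only advances, the whole pass makes at most $|S_1| + |S_2|$ cursor steps, each costing one Dewey ID comparison, instead of one binary search per node of $S_1$.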
Note that if $\mathit{rm}(v, S_2)$ exists, then it should be the node that follows $\mathit{lm}(v, S_2)$ in $S_2$, or the first node in $S_2$ if $\mathit{lm}(v, S_2) = \bot$. The main idea is based on the fact that, for any $v_i$ and $v_j$ in $S_1$ with $\mathit{pre}(v_i) < \mathit{pre}(v_j)$, $\mathit{lm}(v_i, S_2) \le \mathit{lm}(v_j, S_2)$ and $\mathit{rm}(v_i, S_2) \le \mathit{rm}(v_j, S_2)$, assuming that all lm() and rm() are not equal to $\bot$. Note that, in order to ensure the correctness of