2. SCHEMA-BASED KEYWORD SEARCH ON RELATIONAL DATABASES

Algorithm 10 Block-Pipelined (the keyword query Q, the top-k value k, the CN C)
 1: topk ← ∅; Q ← ∅
 2: c ← (1, 1, ···, 1); c.status = USCORE
 3: Q.push(c, uscore(c))
 4: while Q.max-uscore > score(topk[k], Q) do
 5:     c ← Q.popmax()
 6:     if c.status = USCORE then
 7:         c.status = BSCORE
 8:         Q.push(c, bscore(c))
 9:         for i = 1 to s do
10:             c′ ← c; c′.status = USCORE
11:             c′[i] ← c′[i] + 1
12:             Q.push(c′, uscore(c′))
13:             if c[i] > 1 then
14:                 break
15:     else
16:         update topk using eval(c)
17: output topk

2.4 OTHER KEYWORD SEARCH SEMANTICS

In the above discussions, for an l-keyword query on a relational database, each result is an MTJNT. This is referred to as the connected tree semantics. There are two other semantics for answering an l-keyword query on a relational database, namely the distinct root semantics and the distinct core semantics. In this section, we focus on how to answer keyword queries under these two semantics using an rdbms, given the schema graph. In the next chapter, we further discuss how to answer keyword queries under different semantics on a schema-free database graph.

Distinct Root Semantics: An l-keyword query finds a collection of tuples that contain all the keywords and that are reachable from a root tuple (center) within a user-given distance (Dmax). The distinct root semantics implies that the same root tuple determines the tuples uniquely [Dalvi et al., 2008; He et al., 2007; Hristidis et al., 2008; Li et al., 2008a; Qin et al., 2009a]. Suppose that there is a result rooted at tuple t_r. For each of the l keywords, say k_i, there is a tuple t in the result that satisfies the following conditions: (1) t contains keyword k_i, (2) among all tuples that contain k_i, the distance between t and t_r is minimum (see footnote 3), and (3) this minimum distance between t and t_r is less than or equal to the user-given parameter Dmax.

Reconsider the DBLP database in Example 2.1 with the same 2-keyword query Q = {Michelle, XML}, and let Dmax = 2.
The 10 results are shown in Figure 2.16(a). The root nodes are the nodes shown at the top, and all root nodes are distinct. For example, the rightmost result in Figure 2.16(a) shows that two nodes, a3 (containing "Michelle") and p2 (containing "XML"), are reachable from the root node p4 within Dmax = 2. Under the distinct root semantics, the rightmost result can be output as a set (p4, a3, p2), where the connections from the root node (p4) to the two nodes can be ignored, as discussed in BLINKS [He et al., 2007].

Footnote 3: If there is a tie, then a tuple is selected with a predefined order among tuples in practice.

[Figure 2.16: Distinct Root/Core Semantics. (a) Distinct Root (Q = {Michelle, XML}, Dmax = 2). (b) Distinct Core (Q = {Michelle, XML}, Dmax = 2).]

Distinct Core Semantics: An l-keyword query finds multi-center subgraphs, called communities [Qin et al., 2009a,b]. A community, C_i(V, E), is specified as follows. V is a union of three subsets of tuples, V = V_c ∪ V_k ∪ V_p, where V_k is a set of keyword-tuples, in which a keyword-tuple v_k ∈ V_k contains at least one keyword, and all l keywords in the given l-keyword query must appear in at least one keyword-tuple in V_k; V_c is a set of center-tuples, in which there exists at least one sequence of connections between v_c ∈ V_c and every v_k ∈ V_k such that dis(v_c, v_k) ≤ Dmax; and V_p is a set of path-tuples that appear on a shortest sequence of connections from a center-tuple v_c ∈ V_c to a keyword-tuple v_k ∈ V_k if dis(v_c, v_k) ≤ Dmax.
Note that a tuple may serve several roles as keyword/center/path tuples in a community. E is a set of connections for every pair of tuples in V if they are connected over shortest paths from nodes in V_c to nodes in V_k. A community, C_i, is uniquely determined by the set of keyword tuples, V_k, which is called the core of the community and denoted core(C_i).

Reconsider the DBLP database in Example 2.1 with the same 2-keyword query Q = {Michelle, XML} and Dmax = 2. The four communities are shown in Figure 2.16(b), and the four unique cores are (a3, p2), (a3, p3), (p1, p2), and (p1, p3), for the four communities from left to right, respectively. The multi-centers for each of the communities are shown at the top. For example, for the rightmost community, the two centers are p2 and c2.

It is important to note that the parameter Dmax used in the distinct core/root semantics is different from the parameter Tmax used in the connected tree semantics. Dmax specifies a range from a center (root tuple) within which a tuple containing a keyword can be included in a result, and Tmax specifies the maximum number of nodes to be included in a result.

Distinct Core/Root in rdbms: We outline the approach to process l-keyword queries with a radius (Dmax) based on the distinct core/root semantics. In the first step, for each keyword k_i, we compute a temporal relation, Pair_i(tid_i, dis_i, TID), with three attributes, where both TID and tid_i are TIDs and dis_i is the shortest distance between TID and tid_i (dis(TID, tid_i)), which is less than or equal to Dmax. A tuple in Pair_i indicates that the TID tuple is at shortest distance dis_i from the tid_i tuple, which contains keyword k_i.
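The first step described above can be sketched outside the rdbms as a breadth-first search of bounded radius from every keyword tuple. This is only a minimal illustration under stated assumptions: the edge list is a hypothetical fragment that reuses tuple names from the figures and is not the database of Figure 2.2, and the names build_adjacency and compute_pair are introduced here for illustration.

```python
from collections import deque

# Hypothetical undirected tuple graph (NOT the database of Figure 2.2):
# an edge links two tuples connected by a foreign-key reference.
EDGES = [
    ("a3", "w4"), ("w4", "p2"), ("a3", "w6"), ("w6", "p3"),
    ("p1", "c1"), ("c1", "p2"), ("p1", "c2"), ("c2", "p3"),
    ("p2", "c3"), ("c3", "p3"),
]

def build_adjacency(edges):
    """Build an undirected adjacency map from a list of edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def compute_pair(adj, keyword_tuples, dmax):
    """Return Pair_i as sorted (tid_i, dis_i, TID) rows with dis_i <= dmax."""
    rows = []
    for tid in keyword_tuples:
        dist = {tid: 0}
        queue = deque([tid])
        while queue:
            u = queue.popleft()
            if dist[u] == dmax:
                continue  # do not expand beyond the radius Dmax
            for v in adj.get(u, ()):
                if v not in dist:  # first BFS visit gives the shortest distance
                    dist[v] = dist[u] + 1
                    queue.append(v)
        rows.extend((tid, d, node) for node, d in dist.items())
    return sorted(rows)

adj = build_adjacency(EDGES)
pair1 = compute_pair(adj, ["a3", "p1"], dmax=2)  # tuples assumed to contain "Michelle"
pair2 = compute_pair(adj, ["p2", "p3"], dmax=2)  # tuples assumed to contain "XML"
```

Each row of pair1 reads as (tid_1, dis_1, TID), matching the attribute order of Pair_i. The rdbms approach described in this section obtains the same kind of relation with selections, projections, and joins rather than an explicit graph traversal.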
In the second step, we join all temporal relations, Pair_i, for 1 ≤ i ≤ l, on attribute TID (center):

\[
S \leftarrow Pair_1 \bowtie_{Pair_1.TID = Pair_2.TID} Pair_2 \cdots Pair_{l-1} \bowtie_{Pair_{l-1}.TID = Pair_l.TID} Pair_l \tag{2.25}
\]

Here, S is a 2l + 1 attribute relation, S(TID, tid_1, dis_1, ···, tid_l, dis_l).

Over the temporal relation S, we can obtain the multi-center communities (distinct core) by grouping tuples on the l attributes tid_1, tid_2, ···, tid_l. Consider query Q = {Michelle, XML} and Dmax = 2, against the simple DBLP database in Figure 2.2. The rightmost community in Figure 2.16(b) is shown in Figure 2.17.

TID  tid_1  dis_1  tid_2  dis_2
p1   p1     0      p3     2
p2   p1     2      p3     2
p3   p1     2      p3     0
c2   p1     1      p3     1

Figure 2.17: A Multi-Center Community

Here, the distinct core consists of p1 and p3, where p1 contains keyword "Michelle" (k_1) and p3 contains keyword "XML" (k_2), and the four centers, {p1, p2, p3, c2}, are listed in the TID column. Any center can reach all the tuples in the core, {p1, p3}, within Dmax. The above does not explicitly include the two nodes c1 and c3 of the rightmost community in Figure 2.16(b); they can be maintained in an additional attribute by concatenating the TIDs, for example, p2.c1.p1 and p2.c3.p3.

In a similar fashion, over the same temporal relation S, we can also obtain the distinct root results by grouping tuples on the attribute TID. Consider the query Q = {Michelle, XML} and Dmax = 2; the rightmost result in Figure 2.16(a) is shown in Figure 2.18.

TID  tid_1  dis_1  tid_2  dis_2
p4   a3     2      p2     2
p4   a3     2      p3     2

Figure 2.18: A Distinct Root Result

The distinct root is represented by the TID, and the rightmost result in Figure 2.16(a) is the first of the two tuples, where a3 contains keyword "Michelle" (k_1) and p2 contains keyword "XML" (k_2). Note that a distinct root means a result is uniquely determined by the root. As shown above, there are two tuples with the same root p4. We select one of them using the aggregate function min.

The complete distinct core/root results are given in Figure 2.19, for the same 2-keyword query, Q = {Michelle, XML}, with Dmax = 2, against the DBLP database in Figure 2.2. Both tables have an attribute Gid for easy reference to the distinct core/root results. The right table shows the same content as the left table, regrouped on TID; its shadowed tuples are removed using the sql aggregate function min to ensure the distinct root semantics.

Gid  TID  tid_1  dis_1  tid_2  dis_2   |   Gid  TID  tid_1  dis_1  tid_2  dis_2
1    a3   a3     0      p2     2       |   1    w4   a3     1      p2     1
1    w4   a3     1      p2     1       |   2    w6   a3     1      p3     1
1    p2   a3     2      p2     0       |   3    c1   p1     1      p2     1
1    p3   a3     2      p2     2       |   4    c2   p1     1      p3     1
1    p4   a3     2      p2     2       |   5    a3   a3     0      p2     2
2    a3   a3     0      p3     2       |   5    a3   a3     0      p3     2
2    w6   a3     1      p3     1       |   6    p1   p1     0      p2     2
2    p2   a3     2      p3     2       |   6    p1   p1     0      p3     2
2    p3   a3     2      p3     0       |   7    p2   a3     2      p2     0
2    p4   a3     2      p3     2       |   7    p2   a3     2      p3     2
3    a1   p1     2      p2     2       |   7    p2   p1     2      p2     0
3    p1   p1     0      p2     2       |   7    p2   p1     2      p3     2
3    p2   p1     2      p2     0       |   8    p3   a3     2      p3     0
3    p3   p1     2      p2     2       |   8    p3   a3     2      p2     2
3    c1   p1     1      p2     1       |   8    p3   p1     2      p2     2
4    p1   p1     0      p3     2       |   8    p3   p1     2      p3     0
4    p2   p1     2      p3     2       |   9    a1   p1     2      p2     2
4    p3   p1     2      p3     0       |   10   p4   a3     2      p2     2
4    c2   p1     1      p3     1       |   10   p4   a3     2      p3     2

Figure 2.19: Distinct Core (left) and Distinct Root (right) (Q = {Michelle, XML}, Dmax = 2)

Naive Algorithms: Figure 2.20 outlines the two main steps for processing the distinct core/root 2-keyword query, Q = {Michelle, XML}, with Dmax = 2 against the simple DBLP database. Its schema graph, G_S, is in Figure 2.1, and the database is in Figure 2.2.
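Before the naive algorithms are detailed, the second step (Eq. 2.25) and the two groupings can be sketched end to end. The sketch below is illustrative only. The rows of PAIR1 and PAIR2 are read off Figure 2.19 and therefore contain only the tuples that survive the join, not the complete Pair_1 and Pair_2 relations. The function names are hypothetical, and the tie-breaking rule in distinct_root (smallest distance, then smallest tid) is an assumption about how the min aggregate and the predefined tuple order are applied.

```python
from collections import defaultdict

# Rows are (TID, tid_i, dis_i), restricted to tuples appearing in Figure 2.19.
PAIR1 = [  # keyword k1 = "Michelle"
    ("a3", "a3", 0), ("w4", "a3", 1), ("w6", "a3", 1), ("p2", "a3", 2),
    ("p3", "a3", 2), ("p4", "a3", 2), ("a1", "p1", 2), ("p1", "p1", 0),
    ("p2", "p1", 2), ("p3", "p1", 2), ("c1", "p1", 1), ("c2", "p1", 1),
]
PAIR2 = [  # keyword k2 = "XML"
    ("a3", "p2", 2), ("w4", "p2", 1), ("p2", "p2", 0), ("p3", "p2", 2),
    ("p4", "p2", 2), ("a1", "p2", 2), ("p1", "p2", 2), ("c1", "p2", 1),
    ("a3", "p3", 2), ("w6", "p3", 1), ("p2", "p3", 2), ("p3", "p3", 0),
    ("p4", "p3", 2), ("p1", "p3", 2), ("c2", "p3", 1),
]

def join_on_tid(pair1, pair2):
    """Equi-join on TID, producing S(TID, tid1, dis1, tid2, dis2)."""
    by_center = defaultdict(list)
    for center, tid2, dis2 in pair2:
        by_center[center].append((tid2, dis2))
    return [(center, tid1, dis1, tid2, dis2)
            for center, tid1, dis1 in pair1
            for tid2, dis2 in by_center.get(center, [])]

def distinct_core(s):
    """Group S on (tid1, tid2); each group lists the centers of one community."""
    cores = defaultdict(set)
    for center, tid1, _, tid2, _ in s:
        cores[(tid1, tid2)].add(center)
    return dict(cores)

def distinct_root(s):
    """Group S on TID; per keyword keep the nearest tuple (ties: smallest tid)."""
    best = {}
    for center, tid1, dis1, tid2, dis2 in s:
        k1, k2 = best.get(center, ((dis1, tid1), (dis2, tid2)))
        best[center] = (min(k1, (dis1, tid1)), min(k2, (dis2, tid2)))
    return {center: (k1[1], k2[1]) for center, (k1, k2) in best.items()}

S = join_on_tid(PAIR1, PAIR2)
cores = distinct_core(S)
roots = distinct_root(S)
```

Under these assumptions, cores has one entry per community of Figure 2.16(b), and roots has one entry per root of Figure 2.16(a). For the root p4 the tie between p2 and p3 is resolved in favor of p2, which agrees with the first tuple of Figure 2.18.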
In Figure 2.20, the left side computes the Pair_1 and Pair_2 temporal relations, for keywords k_1 = "Michelle" and k_2 = "XML", using projects, joins, unions, and group-by, and the right side joins Pair_1 and Pair_2 to compute the S relation (Eq. 2.25).

[Figure 2.20: An Overview (R_1, R_2, R_3, and R_4 represent the Author, Write, Paper, and Cite relations in Example 2.1). With Dmax = 2, the left side derives Pair_1 (for k_1) and Pair_2 (for k_2) from R_1, R_2, R_3, and R_4, and the right side joins them into S. Tuples shown in the figure:]

Pair_1:
TID  tid_1  dis_1
a1   p1     2
a3   a3     0
w1   p1     1

Pair_2:
TID  tid_2  dis_2
a1   p2     2
a2   p2     2
a3   p2     2

S:
TID  tid_1  dis_1  tid_2  dis_2
a1   p1     2      p2     2
a3   a3     0      p2     2

Let R_1, R_2, R_3, and R_4 represent the Author, Write, Paper, and Cite relations. The Pair_1 relation for the keyword k_1 is produced in the following steps.

\[
\begin{aligned}
P_{0,1} &\leftarrow \Pi_{TID \to tid_1,\; 0 \to dis_1,\; *}(\sigma_{contain(k_1)} R_1) \\
P_{0,2} &\leftarrow \Pi_{TID \to tid_1,\; 0 \to dis_1,\; *}(\sigma_{contain(k_1)} R_2) \\
P_{0,3} &\leftarrow \Pi_{TID \to tid_1,\; 0 \to dis_1,\; *}(\sigma_{contain(k_1)} R_3) \\
P_{0,4} &\leftarrow \Pi_{TID \to tid_1,\; 0 \to dis_1,\; *}(\sigma_{contain(k_1)} R_4)
\end{aligned}
\tag{2.26}
\]

Here σ_contain(k_1) R_j selects the tuples in R_j that contain the keyword k_1. Let R'_j ← σ_contain(k_1) R_j; then Π_{TID→tid_1, 0→dis_1, ∗}(R'_j) projects the tuples of R'_j with all attributes (∗) and adds two further attributes: tid_1, which is the attribute TID under a new name, and dis_1, a new attribute with an initial value of zero (this is supported in sql). For example, Π_{TID→tid_1, 0→dis_1, ∗}(σ_contain(k_1) R_1) is translated into the following sql.

    select TID as tid_1, 0 as dis_1, TID, Name
    from Author as R_1
    where contain(Name, 'Michelle')

The meaning of the temporal relation P_{0,1}(tid_1, dis_1, TID, Name) is a set of R_1 tuples (identified by TID) that are at distance dis_1 = 0 from the tuples (identified by tid_1) containing keyword k_1 = "Michelle". The same is true for the other P_{0,j} temporal relations.
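The sql translation of P_{0,1} can be tried on a small in-memory database. The following is a minimal sketch using Python's sqlite3 module, under several assumptions: SQLite has no built-in contain predicate, so a simple whole-word matching function is registered under that name as a stand-in for a full-text predicate; the Author rows are hypothetical and are not the tuples of Figure 2.2; and the selection is applied to the Name attribute of Author.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in for the full-text predicate contain(attribute, keyword):
# true when the keyword matches a whole word, ignoring case (an assumption).
def contain(text, keyword):
    return int(keyword.lower() in text.lower().split())

conn.create_function("contain", 2, contain)

# Hypothetical Author tuples; only the schema (TID, Name) follows the text.
conn.execute("CREATE TABLE Author (TID TEXT PRIMARY KEY, Name TEXT)")
conn.executemany(
    "INSERT INTO Author VALUES (?, ?)",
    [("a1", "Alice Example"), ("a2", "Bob Example"), ("a3", "Michelle Example")],
)

# P_{0,1}: rename TID to tid1, add dis1 = 0, and keep all attributes of R1.
p01 = conn.execute(
    "SELECT TID AS tid1, 0 AS dis1, TID, Name "
    "FROM Author AS R1 "
    "WHERE contain(Name, ?)",
    ("Michelle",),
).fetchall()
```

With these rows, p01 holds a single tuple for a3 with dis1 = 0, since a keyword tuple is at distance zero from itself. The relations P_{0,2}, P_{0,3}, and P_{0,4} would follow the same pattern over the Write, Paper, and Cite relations.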
After the P_{0,j} are computed, 1 ≤ j ≤ 4, we compute P_{1,j} followed by P_{2,j} to obtain the R_j tuples that are at distance 1 and distance 2 from the tuples containing keyword k_1 = "Michelle" (Dmax = 2). Note that relation P_{d,j} contains the set of tuples of R_j that are in distance d from a tuple containing a