Keyword Search in Databases - P9

2.4. OTHER KEYWORD SEARCH SEMANTICS

certain keyword, and its two attributes, $tid_l$ and $dis_l$, explicitly indicate that it is about keyword $k_l$. The details of computing $P_{1,j}$ for $R_j$, $1 \le j \le 4$, are given below.

$$
\begin{aligned}
P_{1,1} \leftarrow{} & \Pi_{P_{0,2}.TID \to tid_1,\; 1 \to dis_1,\; R_1.*}(P_{0,2} \bowtie_{P_{0,2}.AID = R_1.TID} R_1) \\
P_{1,2} \leftarrow{} & \Pi_{P_{0,1}.TID \to tid_1,\; 1 \to dis_1,\; R_2.*}(P_{0,1} \bowtie_{P_{0,1}.TID = R_2.AID} R_2) \;\cup \\
 & \Pi_{P_{0,3}.TID \to tid_1,\; 1 \to dis_1,\; R_2.*}(P_{0,3} \bowtie_{P_{0,3}.TID = R_2.PID} R_2) \\
P_{1,3} \leftarrow{} & \Pi_{P_{0,2}.TID \to tid_1,\; 1 \to dis_1,\; R_3.*}(P_{0,2} \bowtie_{P_{0,2}.PID = R_3.TID} R_3) \;\cup \\
 & \Pi_{P_{0,4}.TID \to tid_1,\; 1 \to dis_1,\; R_3.*}(P_{0,4} \bowtie_{P_{0,4}.PID1 = R_3.TID} R_3) \;\cup \\
 & \Pi_{P_{0,4}.TID \to tid_1,\; 1 \to dis_1,\; R_3.*}(P_{0,4} \bowtie_{P_{0,4}.PID2 = R_3.TID} R_3) \\
P_{1,4} \leftarrow{} & \Pi_{P_{0,3}.TID \to tid_1,\; 1 \to dis_1,\; R_4.*}(P_{0,3} \bowtie_{P_{0,3}.TID = R_4.PID1} R_4) \;\cup \\
 & \Pi_{P_{0,3}.TID \to tid_1,\; 1 \to dis_1,\; R_4.*}(P_{0,3} \bowtie_{P_{0,3}.TID = R_4.PID2} R_4)
\end{aligned}
\tag{2.27}
$$

Here, each join/project corresponds to a foreign key reference, that is, an edge in the schema graph $G_S$. The idea is to compute $P_{d,j}$ based on $P_{d-1,i}$ if there is an edge between $R_j$ and $R_i$ in $G_S$. Consider $P_{1,3}$ for $R_3$: it is computed as the union of three joins ($P_{0,2} \bowtie R_3 \cup P_{0,4} \bowtie R_3 \cup P_{0,4} \bowtie R_3$), because there is one foreign key reference between $R_3$ (Paper) and $R_2$ (Write), and two foreign key references between $R_3$ and $R_4$ (Cite). This ensures that all $R_j$ tuples at distance $d$ from a tuple containing a keyword $k_l$ can be computed. Continuing the example, to compute $P_{2,j}$ for $R_j$, $1 \le j \le 4$, for keyword $k_1$, we replace every $P_{d,j}$ in Eq. 2.27 with $P_{d+1,j}$ and replace "$1 \to dis_1$" with "$2 \to dis_1$". The process repeats Dmax times.

Suppose that we have computed $P_{d,j}$ for $0 \le d \le$ Dmax and $1 \le j \le 4$, for keyword $k_1$ = "Michelle". We further compute the shortest distance between an $R_j$ tuple and a tuple containing $k_1$ using union, group-by, and the SQL aggregate function min. First, we perform a projection, $P_{d,j} \leftarrow \Pi_{TID,\, tid_1,\, dis_1} P_{d,j}$.
Therefore, every $P_{d,j}$ relation has the same three attributes. Second, for $R_j$, we compute the shortest distance from an $R_j$ tuple to a tuple containing keyword $k_1$ using group-by ($\Gamma$) and the SQL aggregate function min.

$$
G_j \leftarrow {}_{TID,\, tid_1}\Gamma_{\min(dis_1)}(P_{0,j} \cup P_{1,j} \cup P_{2,j})
\tag{2.28}
$$

where the left side of the group-by ($\Gamma$) lists the group-by attributes, and the right side is the SQL aggregate function. Finally,

$$
Pair_1 \leftarrow G_1 \cup G_2 \cup G_3 \cup G_4
\tag{2.29}
$$

Here, $Pair_1$ records all tuples that are the shortest distance away from a tuple containing keyword $k_1$, within Dmax. Note that $G_i \cap G_j = \emptyset$ for $i \ne j$, because $G_i$ and $G_j$ are tuples identified with TIDs from the $R_i$ and $R_j$ relations, and TIDs are assumed to be unique in the database. We can compute $Pair_2$ for keyword $k_2$ = "XML" following the same procedure as indicated in Eq. 2.26 to Eq. 2.29. Once $Pair_1$ and $Pair_2$ are computed, we can easily compute distinct core/root results based on the relation $S \leftarrow Pair_1 \bowtie Pair_2$ (Eq. 2.25).

2. SCHEMA-BASED KEYWORD SEARCH ON RELATIONAL DATABASES

Algorithm 11 Pair($G_S$, $k_i$, Dmax, $R_1, \cdots, R_n$)
Input: Schema $G_S$, keyword $k_i$, Dmax, $n$ relations $R_1, \cdots, R_n$.
Output: $Pair_i$ with 3 attributes: TID, $tid_i$, $dis_i$.
1: for $j = 1$ to $n$ do
2:   $P_{0,j} \leftarrow \Pi_{R_j.TID \to tid_i,\; 0 \to dis_i,\; R_j.*}(\sigma_{contain(k_i)} R_j)$
3:   $G_j \leftarrow \Pi_{tid_i,\, dis_i,\, TID}(P_{0,j})$
4: for $d = 1$ to Dmax do
5:   for $j = 1$ to $n$ do
6:     $P_{d,j} \leftarrow \emptyset$
7:     for all $(R_j, R_l) \in E(G_S) \vee (R_l, R_j) \in E(G_S)$ do
8:       $\Delta \leftarrow \Pi_{P_{d-1,l}.TID \to tid_i,\; d \to dis_i,\; R_j.*}(P_{d-1,l} \bowtie R_j)$
9:       $\Delta \leftarrow \sigma_{(tid_i,\, TID) \notin \Pi_{tid_i,\, TID}(G_j)}(\Delta)$
10:      $P_{d,j} \leftarrow P_{d,j} \cup \Delta$
11:      $G_j \leftarrow G_j \cup \Pi_{tid_i,\, dis_i,\, TID}(\Delta)$
12: $Pair_i \leftarrow G_1 \cup G_2 \cup \cdots \cup G_n$
13: return $Pair_i$

Computing group-by ($\Gamma$) with the SQL aggregate function min: Consider Eq. 2.28. The group-by $\Gamma$ can be computed by virtually pushing $\Gamma$. Recall that all $P_{d,j}$ relations, for $1 \le d \le$ Dmax, have the same schema, and $P_{d,j}$ maintains the $R_j$ tuples that are at distance $d$ from a tuple containing a keyword.
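The expansion of Eq. 2.27, the min group-by of Eq. 2.28 and Eq. 2.29, and the join $S \leftarrow Pair_1 \bowtie Pair_2$ can be tried out on a toy database. The following is only a minimal sketch using Python's built-in `sqlite3` module, not the authors' implementation. The table layout (with $R_1$ taken to be an Author relation), the sample rows, the single working table `P`, the column name `kw_tid`, and the helpers `make_demo_db`, `pair` and `join_pairs` are all assumptions made for illustration. The sketch also carries the identifier of the keyword tuple forward through each expansion step and applies no pruning, so the final `GROUP BY ... MIN` does all the deduplication.

```python
import sqlite3

# Assumed foreign-key edges of the schema graph: (referencing table, FK column).
# Every FK value is the TID of the referenced tuple; TIDs are unique database-wide.
FK_EDGES = [("Write", "AID"), ("Write", "PID"), ("Cite", "PID1"), ("Cite", "PID2")]
# Assumed text columns that contain(k) is evaluated on.
TEXT_COLS = [("Author", "Name"), ("Paper", "Title")]


def make_demo_db():
    """Build a tiny in-memory Author/Write/Paper/Cite database (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Author (TID TEXT PRIMARY KEY, Name TEXT);
        CREATE TABLE Paper  (TID TEXT PRIMARY KEY, Title TEXT);
        CREATE TABLE Write  (TID TEXT PRIMARY KEY, AID TEXT, PID TEXT);
        CREATE TABLE Cite   (TID TEXT PRIMARY KEY, PID1 TEXT, PID2 TEXT);
        INSERT INTO Author VALUES ('a1', 'Michelle'), ('a2', 'John');
        INSERT INTO Paper  VALUES ('p1', 'XML search'), ('p2', 'Graph mining');
        INSERT INTO Write  VALUES ('w1', 'a1', 'p1'), ('w2', 'a2', 'p2');
        INSERT INTO Cite   VALUES ('c1', 'p1', 'p2');
    """)
    return conn


def pair(conn, keyword, dmax):
    """Return rows (TID, kw_tid, dis): shortest distance within dmax per pair."""
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS P")
    cur.execute("CREATE TEMP TABLE P (TID TEXT, kw_tid TEXT, dis INTEGER)")
    # Distance 0: the tuples that contain the keyword.
    for table, col in TEXT_COLS:
        cur.execute(
            f"INSERT INTO P SELECT TID, TID, 0 FROM {table} WHERE {col} LIKE ?",
            (f"%{keyword}%",),
        )
    # Distance d is derived from distance d-1 along every foreign key,
    # in both directions (one join per direction).
    for d in range(1, dmax + 1):
        for table, fk in FK_EDGES:
            cur.execute(
                f"INSERT INTO P SELECT c.TID, p.kw_tid, ? "
                f"FROM P p JOIN {table} c ON c.{fk} = p.TID WHERE p.dis = ?",
                (d, d - 1),
            )
            cur.execute(
                f"INSERT INTO P SELECT c.{fk}, p.kw_tid, ? "
                f"FROM P p JOIN {table} c ON c.TID = p.TID WHERE p.dis = ?",
                (d, d - 1),
            )
    # Group-by with min keeps only the shortest distance per (TID, kw_tid).
    return cur.execute(
        "SELECT TID, kw_tid, MIN(dis) FROM P GROUP BY TID, kw_tid "
        "ORDER BY TID, kw_tid"
    ).fetchall()


def join_pairs(pair1, pair2):
    """Join two Pair relations on TID, giving (TID, tid_1, dis_1, tid_2, dis_2)."""
    return sorted(
        (t, k1, d1, k2, d2)
        for t, k1, d1 in pair1
        for u, k2, d2 in pair2
        if t == u
    )


conn = make_demo_db()
print(pair(conn, "Michelle", 2))
# [('a1', 'a1', 0), ('p1', 'a1', 2), ('w1', 'a1', 1)]
```

In this toy data the unpruned working table holds `a1` twice for "Michelle" (at distance 0, and again at distance 2 via `w1`); the min group-by collapses the two rows into one.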
We use two pruning rules to reduce the number of temporary tuples computed.

Rule-1: If the same ($tid_i$, TID) value appears in two different relations $P_{d',j}$ and $P_{d,j}$ with $d' < d$, then the shortest distance between $tid_i$ and TID must be in $P_{d',j}$ and not in $P_{d,j}$. Therefore, Eq. 2.28 can be computed as follows.

$$
\begin{aligned}
G_j & \leftarrow P_{0,j} \\
G_j & \leftarrow G_j \cup (\sigma_{(tid_1,\, TID) \notin \Pi_{tid_1,\, TID}(G_j)} P_{1,j}) \\
G_j & \leftarrow G_j \cup (\sigma_{(tid_1,\, TID) \notin \Pi_{tid_1,\, TID}(G_j)} P_{2,j})
\end{aligned}
\tag{2.30}
$$

Here, $\sigma_{(tid_1,\, TID) \notin \Pi_{tid_1,\, TID}(G_j)} P_{2,j}$ selects the $P_{2,j}$ tuples whose ($tid_1$, TID) does not appear in $G_j$; in other words, no shortest path between $tid_1$ and TID has been found before.

Rule-2: If there exists a shortest path between a $tid_i$ and TID value pair, say $dis_i(tid_i, TID) = d'$, then there is no need to compute any further tuple connections between that pair, because all of them would be removed later by the group-by and the SQL aggregate function min. In Eq. 2.27, every $P_{1,j}$, $1 \le j \le 4$, can be further reduced as $P_{1,j} \leftarrow \sigma_{(tid_1,\, TID) \notin \Pi_{tid_1,\, TID}(P_{0,j})} P_{1,j}$.

The algorithm Pair() is given in Algorithm 11, which computes $Pair_i$ for keyword $k_i$. It first computes all the initial $P_{0,j}$ relations (refer to Eq. 2.26) and initializes the $G_j$ relations (refer to the first equation in Eq. 2.30) in lines 1-3. Second, it computes $P_{d,j}$ for every $1 \le d \le$ Dmax and every relation $R_j$, $1 \le j \le n$, in two "for loops" (lines 4-5). In lines 7-11, it computes $P_{d,j}$ based on the foreign key references in the schema graph $G_S$, following Eq. 2.27 and Eq. 2.30 and using the two rules, Rule-1 and Rule-2. In our example, to compute $Pair_1$, it calls Pair($G_S$, $k_1$, Dmax, $R_1$, $R_2$, $R_3$, $R_4$), where $k_1$ = "Michelle", Dmax = 2, and the 4 relations are $R_j$, $1 \le j \le 4$.

Algorithm 12 DC-Naive($R_1, \cdots, R_n$, $G_S$, $Q$, Dmax)
Input: $n$ relations $R_1, R_2, \cdots, R_n$, schema graph $G_S$, an $l$-keyword query $Q = \{k_1, k_2, \cdots, k_l\}$, and radius Dmax.
Output: Relation with $2l + 1$ attributes named TID, $tid_1$, $dis_1$, $\cdots$, $tid_l$, $dis_l$.
1: for $i = 1$ to $l$ do
2:   $Pair_i \leftarrow$ Pair($G_S$, $k_i$, Dmax, $R_1, \cdots, R_n$)
3: $S \leftarrow Pair_1 \bowtie Pair_2 \bowtie \cdots \bowtie Pair_l$
4: Sort $S$ by $tid_1, tid_2, \cdots, tid_l$
5: return $S$

Figure 2.21: Three-Phase Reduction. (a) From keywords to centers; (b) From centers to keywords; (c) Project Relations.

The naive algorithm DC-Naive() to compute distinct cores is outlined in Algorithm 12. DR-Naive(), which computes distinct roots, can be implemented in the same way as DC-Naive() by replacing line 4 in Algorithm 12 with two group-bys as follows: $X \leftarrow {}_{TID}\Gamma_{\min(dis_1) \to dis_1,\, \cdots,\, \min(dis_l) \to dis_l} S$, and $S \leftarrow {}_{TID,\, dis_1,\, \cdots,\, dis_l}\Gamma_{\min(tid_1) \to tid_1,\, \cdots,\, \min(tid_l) \to tid_l}(S \bowtie X)$.

Three-Phase Database Reduction: We now discuss a three-phase reduction approach that projects a relational database RDB' out of RDB, with which we compute multi-center communities (distinct core semantics). In other words, the three-phase reduction prunes from an RDB the tuples that do not participate in any community. We also show that distinct root results can be computed quickly using the same subroutine used in the three-phase reduction.

Algorithm 13 DC($R_1, R_2, \cdots, R_n$, $G_S$, $Q$, Dmax)
Input: $n$ relations $R_1, R_2, \cdots, R_n$, with schema graph $G_S$, an $l$-keyword query $Q = \{k_1, k_2, \cdots, k_l\}$, and radius Dmax.
Output: Relation with $2l + 1$ attributes named TID, $tid_1$, $dis_1$, $\cdots$, $tid_l$, $dis_l$.
1: for $i = 1$ to $l$ do
2:   $\{G_{1,i}, \cdots, G_{n,i}\} \leftarrow$ PairRoot($G_S$, $k_i$, Dmax, $R_1, \cdots, R_n$, $\sigma_{contain(k_i)} R_1, \cdots, \sigma_{contain(k_i)} R_n$)
3:   for $j = 1$ to $n$ do
4:     $R_{j,i} \leftarrow R_j \ltimes G_{j,i}$
5: for $j = 1$ to $n$ do
6:   $Y_j \leftarrow G_{j,1} \bowtie G_{j,2} \bowtie \cdots \bowtie G_{j,l}$
7:   $X_j \leftarrow R_j \ltimes Y_j$
8: for $i = 1$ to $l$ do
9:   $\{W_{1,i}, \cdots, W_{n,i}\} \leftarrow$ PairRoot($G_S$, $k_i$, Dmax, $R_{1,i}, \cdots, R_{n,i}$, $X_1, \cdots, X_n$)
10:  for $j = 1$ to $n$ do
11:    $Path_{j,i} \leftarrow G_{j,i} \bowtie_{G_{j,i}.TID = W_{j,i}.TID} W_{j,i}$
12:    $Path_{j,i} \leftarrow \Pi_{TID,\; G_{j,i}.dis_i \to d_{k_i},\; W_{j,i}.dis_i \to d_r}(Path_{j,i})$
13:    $Path_{j,i} \leftarrow \sigma_{d_{k_i} + d_r \le \mathrm{Dmax}}(Path_{j,i})$
14:    $R'_{j,i} \leftarrow R_{j,i} \ltimes Path_{j,i}$
15: for $i = 1$ to $l$ do
16:   $Pair_i \leftarrow$ Pair($G_S$, $k_i$, Dmax, $R'_{1,i}, R'_{2,i}, \cdots, R'_{n,i}$)
17: $S \leftarrow Pair_1 \bowtie Pair_2 \bowtie \cdots \bowtie Pair_l$
18: Sort $S$ by $tid_1, tid_2, \cdots, tid_l$
19: return $S$

Figure 2.21 outlines the main ideas for processing an $l$-keyword query, $Q = \{k_1, k_2, \cdots, k_l\}$, with a user-given Dmax, against an RDB with a schema graph $G_S$.

The first reduction phase (from keyword to center): We consider a keyword $k_i$ as a virtual node, called a keyword-node, and we take a keyword-node, $k_i$, as a center to compute all tuples in an RDB that are reachable from $k_i$ within Dmax. A tuple $t$ within Dmax from a virtual keyword-node $k_i$ means that tuple $t$ can reach at least one tuple containing $k_i$ within Dmax. Let $G_i$ be the set of tuples in RDB that can reach at least one tuple containing keyword $k_i$ within Dmax, for $1 \le i \le l$. Based on all $G_i$, we can compute $Y = G_1 \bowtie G_2 \bowtie \cdots \bowtie G_l$, which is the set of center-nodes that can reach every keyword-node $k_i$, $1 \le i \le l$, within Dmax. $Y$ is illustrated as the shaded area in Figure 2.21(a) for $l = 2$. Clearly, a center that appears in a multi-center community must appear in $Y$.

The second reduction phase (from center to keyword): In a similar fashion, we consider a virtual center-node. A tuple $t$ within Dmax from a virtual center-node means that $t$ is reachable from a tuple in $Y$ within Dmax. We compute all tuples that are reachable from $Y$ within Dmax. Let $W_i$ be the set of tuples in $G_i$ that can be reached from a center in $Y$ within Dmax, for $1 \le i \le l$. Note that $W_i \subseteq G_i$. When $l = 2$, $W_1$ and $W_2$ are illustrated as the shaded areas on the left and right in Figure 2.21(b), respectively. Clearly, only tuples that contain a keyword and lie within Dmax of a center can appear in the final result as keyword tuples.

Algorithm 14 PairRoot($G_S$, $k_i$, Dmax, $R_1, \cdots, R_n$, $I_1, \cdots, I_n$)
Input: Schema graph $G_S$, keyword $k_i$, Dmax, $n$ relations $R_1, R_2, \cdots, R_n$, and $n$ initial relations $I_1, I_2, \cdots, I_n$.
Output: $n$ relations $G_{1,i}, \cdots, G_{n,i}$, each with 3 attributes: TID, $tid_i$, $dis_i$.
1: for $j = 1$ to $n$ do
2:   $P_{0,j} \leftarrow \Pi_{I_j.TID \to tid_i,\; 0 \to dis_i,\; I_j.*}(I_j)$
3:   $G_{j,i} \leftarrow \Pi_{tid_i,\, dis_i,\, TID}(P_{0,j})$
4: for $d = 1$ to Dmax do
5:   for $j = 1$ to $n$ do
6:     $P_{d,j} \leftarrow \emptyset$
7:     for all $(R_j, R_l) \in E(G_S) \vee (R_l, R_j) \in E(G_S)$ do
8:       $\Delta \leftarrow \Pi_{P_{d-1,l}.TID \to tid_i,\; d \to dis_i,\; R_j.*}(P_{d-1,l} \bowtie R_j)$
9:       $\Delta \leftarrow {}_{R_j.*}\Gamma_{\min(tid_i),\, \min(dis_i)}(\Delta)$
10:      $\Delta \leftarrow \sigma_{TID \notin \Pi_{TID}(G_{j,i})}(\Delta)$
11:      $P_{d,j} \leftarrow P_{d,j} \cup \Delta$
12:      $G_{j,i} \leftarrow G_{j,i} \cup \Pi_{tid_i,\, dis_i,\, TID}(\Delta)$
13: return $\{G_{1,i}, \cdots, G_{n,i}\}$

Algorithm 15 DR($R_1, R_2, \cdots, R_n$, $G_S$, $Q$, Dmax)
Input: $n$ relations $R_1, R_2, \cdots, R_n$, with schema graph $G_S$, an $l$-keyword query $Q = \{k_1, k_2, \cdots, k_l\}$, and radius Dmax.
Output: Relation with $2l + 1$ attributes named TID, $tid_1$, $dis_1$, $\cdots$, $tid_l$, $dis_l$.
1: for $i = 1$ to $l$ do
2:   $\{G_{1,i}, \cdots, G_{n,i}\} \leftarrow$ PairRoot($G_S$, $k_i$, Dmax, $R_1, \cdots, R_n$, $\sigma_{contain(k_i)} R_1, \cdots, \sigma_{contain(k_i)} R_n$)
3: for $j = 1$ to $n$ do
4:   $S_j \leftarrow G_{j,1} \bowtie G_{j,2} \bowtie \cdots \bowtie G_{j,l}$
5: $S \leftarrow S_1 \cup S_2 \cup \cdots \cup S_n$
6: return $S$
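To make the behaviour of PairRoot (Algorithm 14) and DR (Algorithm 15) concrete, the sketch below works directly on a tuple-level graph instead of relational algebra. It is an illustration under assumptions, not the book's implementation: the adjacency dictionary `adj`, the tuple identifiers, and the function names `pair_root` and `distinct_roots` are hypothetical, and the sketch carries the identifier of the starting tuple forward as `tid`. For every tuple it keeps one entry, the shortest distance to a starting tuple together with the smallest starting identifier seen at that level, which mirrors the min group-by and the TID filter in lines 9-10 of Algorithm 14.

```python
def pair_root(adj, sources, dmax):
    """Map each tuple within dmax hops of a source to (nearest source, distance)."""
    best = {s: (s, 0) for s in sources}
    frontier = dict(best)
    for d in range(1, dmax + 1):
        candidates = {}
        for u, (origin, _) in frontier.items():
            for v in adj.get(u, ()):
                if v in best:
                    continue  # a shorter distance for v is already recorded
                # Among equally distant sources, keep the smallest identifier.
                if v not in candidates or origin < candidates[v][0]:
                    candidates[v] = (origin, d)
        best.update(candidates)
        frontier = candidates
    return best


def distinct_roots(adj, keyword_tuples, dmax):
    """Roots that reach every keyword within dmax, with one (tid, dis) per keyword."""
    per_keyword = [pair_root(adj, srcs, dmax) for srcs in keyword_tuples]
    roots = set(per_keyword[0]).intersection(*per_keyword[1:])
    return {r: [g[r] for g in per_keyword] for r in sorted(roots)}


# Made-up tuple graph: an edge links a tuple to each tuple it references.
adj = {
    "a1": ["w1"], "a2": ["w2"],
    "w1": ["a1", "p1"], "w2": ["a2", "p2"],
    "p1": ["w1", "c1"], "p2": ["w2", "c1"],
    "c1": ["p1", "p2"],
}
# Keyword 1 is contained in tuple a1, keyword 2 in tuple p1 (assumed).
print(distinct_roots(adj, [{"a1"}, {"p1"}], dmax=2))
# {'a1': [('a1', 0), ('p1', 2)], 'p1': [('a1', 2), ('p1', 0)], 'w1': [('a1', 1), ('p1', 1)]}
```

The key set of the returned dictionary also corresponds to the center set $Y$ of the first reduction phase, since it contains exactly the tuples that reach every keyword within Dmax.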
The third reduction phase (project DB): We project an RDB' out of the RDB, which is sufficient to compute all multi-center communities, by the join $G_i \bowtie W_i$, for $1 \le i \le l$. Consider a tuple in $G_i$, which contains a TID $t'$ with a distance to the virtual keyword-node $k_i$, denoted as $dis(t', k_i)$, and consider a tuple in $W_i$, which contains a TID $t'$ with a distance to the virtual center-node $c$.
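Lines 11-14 of Algorithm 13 keep a tuple for keyword $k_i$ only when its distance to the keyword-node plus its distance to the center-node is at most Dmax. The three phases can be sketched at the level of a tuple graph as follows. This is a simplified illustration, not the relational plan of Algorithm 13: the adjacency dictionary `adj`, the sample identifiers, and the helpers `hops` and `three_phase_reduce` are assumptions introduced here.

```python
def hops(adj, sources, dmax, allowed=None):
    """Breadth-first distances (at most dmax) from any source, optionally
    restricted to tuples in `allowed`."""
    dist = {s: 0 for s in sources}
    frontier = list(dist)
    for d in range(1, dmax + 1):
        nxt = []
        for u in frontier:
            for v in adj.get(u, ()):
                if v not in dist and (allowed is None or v in allowed):
                    dist[v] = d
                    nxt.append(v)
        frontier = nxt
    return dist


def three_phase_reduce(adj, keyword_tuples, dmax):
    """Return the centers Y and, per keyword, the tuples kept after reduction."""
    # Phase 1: G_i = tuples within dmax of a tuple containing keyword i.
    G = [hops(adj, srcs, dmax) for srcs in keyword_tuples]
    Y = set(G[0]).intersection(*G[1:])  # centers that reach every keyword
    reduced = []
    for g in G:
        # Phase 2: W_i = tuples of G_i within dmax of a center in Y.
        w = hops(adj, Y, dmax, allowed=set(g))
        # Phase 3: keep t only if d_k + d_r <= dmax.
        reduced.append({t for t in w if g[t] + w[t] <= dmax})
    return Y, reduced


# Made-up tuple graph: an edge links a tuple to each tuple it references.
adj = {
    "a1": ["w1"], "a2": ["w2"],
    "w1": ["a1", "p1"], "w2": ["a2", "p2"],
    "p1": ["w1", "c1"], "p2": ["w2", "c1"],
    "c1": ["p1", "p2"],
}
Y, reduced = three_phase_reduce(adj, [{"a1"}, {"p1"}], dmax=2)
print(sorted(Y))                        # ['a1', 'p1', 'w1']
print([sorted(r) for r in reduced])     # [['a1', 'p1', 'w1'], ['a1', 'c1', 'p1', 'w1']]
```

In this toy graph, `p2` lies within Dmax of the second keyword but its two distances sum to 4, so it is dropped, while `a2` and `w2` never reach either keyword within Dmax.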
