Keyword Search in Databases - P6

2. SCHEMA-BASED KEYWORD SEARCH ON RELATIONAL DATABASES
2.3. CANDIDATE NETWORK EVALUATION

[Figure 2.11: L-Lattice. (a) Structure; (b) Storage: L-Node relation (Vid, Rname, KSet, l) and L-Edge relation (Fid, Cid, Attr).]

[Figure 2.12: Join vs Semijoin/Join. (a) Semijoin; (b) Join.]

CN $C_i$, we attempt to find the largest subtrees in L that $C_i$ can share, using the index, and we link to the roots of such subtrees. Figure 2.11(a) illustrates a partial lattice. The entire lattice, L, is maintained in two relations: the L-Node relation and the L-Edge relation (Figure 2.11(b)). Let a bit-string represent a set of keywords, $\{k_1, k_2, \cdots, k_l\}$. For every node in L, the L-Node relation stores a unique Vid, the corresponding relation name (Rname) that appears in the given database schema, $G_S$, a bit-string (KSet) that indicates the keywords associated with the node, and the size of the bit-string (l). The L-Edge relation stores the parent/child relationships among all the nodes in L as a parent Vid and a child Vid (Fid/Cid), plus the join attribute, Attr (either a primary key or a foreign key). The two relations can be maintained in memory or on disk. Several indexes are built on the relations to quickly search for given nodes in L.

There are three main differences between the two execution graphs, the Mesh and the L-Lattice. (1) The maximum depth of a Mesh is Tmax − 1, and the maximum depth of an L-Lattice is $\lfloor Tmax/2 \rfloor + 1$. (2) In a Mesh, only the left part of two CNs can be shared (except for the leaf nodes), while in an L-Lattice multiple parts of two CNs can be shared. (3) The number of leaf nodes in a Mesh is $O((|V(G_S)| \cdot 2^l)^2)$, because there are $O(|V(G_S)| \cdot 2^l)$ clusters in a Mesh and each cluster may contain $O(|V(G_S)| \cdot 2^l)$ leaf nodes. The number of leaf nodes in an L-Lattice is $O(2^l)$.
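The two-relation storage of the lattice described above can be pictured with a small in-memory sketch. This is only an illustration under stated assumptions: the class names `LNode`, `LEdge`, and `Lattice`, the helper `encode_kset`, the two dictionary indexes, and the join attribute name `"PID"` are hypothetical and do not come from the text; only the column names (Vid, Rname, KSet, l, Fid, Cid, Attr) follow Figure 2.11(b).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LNode:
    vid: int     # unique node id in the lattice
    rname: str   # relation name in the schema graph G_S
    kset: str    # bit-string over the query keywords, e.g. "10"
    l: int       # length of the bit-string (number of query keywords)


@dataclass(frozen=True)
class LEdge:
    fid: int     # Vid of the parent node
    cid: int     # Vid of the child node
    attr: str    # join attribute (a primary key or a foreign key)


def encode_kset(keywords, query):
    """Position i is '1' when the i-th query keyword belongs to `keywords`."""
    return "".join("1" if k in keywords else "0" for k in query)


class Lattice:
    def __init__(self):
        self.nodes = {}                       # L-Node relation, keyed by Vid
        self.edges = []                       # L-Edge relation
        self.by_label = defaultdict(list)     # index: (Rname, KSet) -> Vids
        self.children_of = defaultdict(list)  # index: Fid -> child Vids

    def add_node(self, vid, rname, kset):
        self.nodes[vid] = LNode(vid, rname, kset, len(kset))
        self.by_label[(rname, kset)].append(vid)

    def add_edge(self, fid, cid, attr):
        self.edges.append(LEdge(fid, cid, attr))
        self.children_of[fid].append(cid)


QUERY = ["XML", "Michelle"]
lattice = Lattice()
lattice.add_node(1, "P", encode_kset({"XML"}, QUERY))   # P{XML}
lattice.add_node(2, "C", encode_kset(set(), QUERY))     # C{}
lattice.add_node(3, "P", encode_kset(set(), QUERY))     # P{}
lattice.add_edge(2, 1, "PID")   # C{} is the parent of P{XML} (hypothetical attribute)
lattice.add_edge(3, 2, "PID")   # P{} is the parent of C{} (hypothetical attribute)

print(lattice.by_label[("P", "10")])   # Vids of nodes labeled P{XML}
print(lattice.children_of[3])          # child Vids of node 3
```

The two dictionaries stand in for the "several indexes" mentioned in the text: one finds nodes by label, the other walks from a parent to its children.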
After sharing computational cost using either the Mesh or the L-Lattice, all CNs are evaluated using joins in DISCOVER or S-KWS. A join plan is shown in Figure 2.9(b) to process the CN in Figure 2.9(a) using 5 projections and 4 joins. The resulting relation, the output of the join ($j_4$), is a temporal relation with 5 TIDs from the 5 projected relations, where each resulting tuple represents an MTJNT. The rightmost two connected trees in Figure 2.3 are the two results of the operator tree in Figure 2.9(b), $(p_2, c_5, p_4, w_5, a_3)$ and $(p_3, c_4, p_4, w_5, a_3)$.

In KRDBMS [Qin et al., 2009a], the authors observe that evaluating all CNs using only joins may generate a large number of temporal tuples. They propose to use semijoin/join sequences to compute a CN. A semijoin between R and S is defined in Eq. 2.18; it projects ($\Pi$) the tuples from R that can join at least one tuple in S.

$$R \ltimes S = \Pi_R(R \bowtie S) \qquad (2.18)$$

Based on the semijoin, a join $R \bowtie S$ can be computed by a semijoin followed by a join, as given in Eq. 2.19.

$$R \bowtie S = (R \ltimes S) \bowtie S \qquad (2.19)$$

Recall that semijoin/joins were proposed to join relations in a distributed RDBMS, in order to reduce high communication cost at the expense of I/O cost and CPU cost. But there is no communication in a centralized RDBMS. In other words, there is no obvious reason to use $(R \ltimes S) \bowtie S$ to process a single join $R \bowtie S$, since the former needs to access the same relation S twice. Below, we address the significant cost saving of semijoin/joins over joins when the number of joins is large, in a centralized RDBMS.

When evaluating all CNs, the number of temporal tuples generated can be very large, and the majority of them do not appear in any MTJNT. When evaluating all CNs using the semijoin/join based strategy, computing $R \bowtie (S \bowtie T)$ is done as $S' \leftarrow S \ltimes T$, $R' \leftarrow R \ltimes S'$, with semijoins, in the reduction phase, followed by $T \bowtie (S' \bowtie R')$ in the join phase.
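The reduction phase and the join phase for $R \bowtie (S \bowtie T)$ can be illustrated with a minimal sketch. It assumes relations are lists of dictionaries joined on one equally named attribute; the functions `semijoin`, `join`, and `canon`, and the sample relations `R`, `S`, `T` with their attribute names, are hypothetical and chosen only for illustration.

```python
def semijoin(r, s, key):
    """R semijoin S: the tuples of R that join at least one tuple of S (Eq. 2.18)."""
    s_values = {row[key] for row in s}
    return [row for row in r if row[key] in s_values]


def join(r, s, key):
    """R join S on `key`, implemented as a simple hash join."""
    index = {}
    for row in s:
        index.setdefault(row[key], []).append(row)
    return [{**a, **b} for a in r for b in index.get(a[key], [])]


def canon(rows):
    """Order-independent form of a relation, used to compare results."""
    return {frozenset(row.items()) for row in rows}


R = [{"rid": 1, "b": 10}, {"rid": 2, "b": 20}, {"rid": 3, "b": 30}]
S = [{"b": 10, "c": 100}, {"b": 20, "c": 200}, {"b": 40, "c": 100}]
T = [{"c": 100, "tid": 7}]

# Reduction phase: S' <- S semijoin T, then R' <- R semijoin S'.
s_reduced = semijoin(S, T, "c")
r_reduced = semijoin(R, s_reduced, "b")

# Join phase: T join (S' join R').
via_semijoin = join(T, join(s_reduced, r_reduced, "b"), "c")

# Join-only plan for comparison: R join (S join T).
s_join_t = join(S, T, "c")
via_join = join(R, s_join_t, "b")

print(len(s_join_t), len(r_reduced), len(via_join))
```

In this toy instance the join-only plan materializes an intermediate tuple of S ⋈ T (the one with b = 40) that never reaches the final result, while the reduction phase has already discarded every tuple of R that cannot contribute.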
For the CN given in Figure 2.9(a), in the reduction phase (Figure 2.12(a)), $C' \leftarrow C\{\} \ltimes P\{\text{XML}\}$, $W' \leftarrow W\{\} \ltimes A\{\text{Michelle}\}$, $P' \leftarrow P\{\} \ltimes C'$, and $P'' \leftarrow P' \ltimes W'$, and in the join phase (Figure 2.12(b)), $P'' \bowtie C'$ is joined first because $P''$ is fully reduced, such that every tuple in $P''$ must appear in an MTJNT. The join order is shown in Figure 2.12(b).

Figure 2.13 shows the number of temporal tuples generated using a real database, DBLP, on IBM DB2. Five 3-keyword queries with different keyword selectivity (the probability that a tuple contains a keyword in DBLP) were randomly selected with Tmax = 5. The number of generated temporal tuples is shown in Figure 2.13(a). The semijoin/join approach generates significantly fewer tuples than the join approach. Similarly, the semijoin/join approach generates significantly fewer temporal tuples than the join approach when Tmax increases (Figure 2.13(b)) for a 3-keyword query.

[Figure 2.13: # of Temporal Tuples (Default Tmax = 5, l = 3). (a) Vary Keyword Selectivity; (b) Vary l. Vertical axis: # Temp Tuples; series: Join, SemiJoin-Join.]

When processing a large number of joins for keyword search on RDBMSs, it is best practice to process a large number of small joins, in order to avoid intermediate join results becoming very large and dominant, if it is difficult to find an optimal query processing plan or the cost of finding one is high.

Besides evaluating all CNs in a static environment, S-KWS and KDynamic focus on monitoring all MTJNTs in a relational data stream where tuples can be inserted/deleted frequently.
In this situation, it is necessary to find new MTJNTs or expire MTJNTs in order to monitor events that are implicitly interrelated over an open-ended relational data stream for a user-given l-keyword query. More precisely, the system reports new MTJNTs when new tuples are inserted and, in addition, reports the MTJNTs that become invalid when tuples are deleted. A sliding window (time interval), W, is specified. A tuple, t, has a lifespan from its insertion into the window at time t.start to W + t.start − 1, if t is not deleted before then. Two tuples can be joined if their lifespans overlap.

S-KWS processes a keyword query in a relational data stream using the mesh introduced above. The authors observe that in a data stream environment a join needs to be processed only when new tuples arrive from its inputs, not all the time, and they therefore propose a demand-driven operator execution. A join operator has two inputs and is associated with an output buffer. The output buffer of a join operator becomes input to the many other join operators that share it (as indicated in the mesh). A tuple newly output by a join operator into its output buffer is a new arrival input to the joins that share that operator. A join operator is in a running state if it has newly arrived tuples from both inputs. A join operator is in a sleeping state if either it has no newly arriving tuples from its inputs or all the join operators that share it are currently sleeping. The demand-driven operator execution noticeably reduces the query processing cost.

KDynamic processes a keyword query in a relational data stream using the L-Lattice. Although S-KWS can significantly reduce the computational cost, scalability remains a problem, especially when Tmax, $|G_S|$, l, W, or the stream speed is high.
This is because a large number of intermediate tuples that are computed by many join operators in the mesh with high processing cost will eventually not be output. S-KWS cannot avoid computing such a large number of unnecessary intermediate tuples because it is unknown beforehand whether an intermediate tuple will appear in an MTJNT. The probability of generating a large number of unnecessary intermediate results increases when either the size of the sliding window, W, is large or new data arrive at high speed. It is challenging to reduce the processing cost by reducing the number of intermediate results.

In KDynamic, an algorithm CNEvalDynamic is proposed, which works as follows. Owing to the lattice structure that is used, we can maintain $|V(G_S)|$ relations in total to process an l-keyword query $Q = \{k_1, k_2, \cdots, k_l\}$. A node, v, in lattice L is uniquely identified with a node id. The node v represents a sub-relation $R_i\{K'\}$. By utilizing the unique node id, it is easy to maintain all the $2^l$ sub-relations for a relation $R_i$ together. Let us denote such a relation as $\overline{R}_i$. The schema of $\overline{R}_i$ is the same as $R_i$ plus an additional attribute (Vid) to keep the node id in L. When we need to obtain a sub-relation $R_i\{K'\}$ for $K' \subseteq Q$ associated with a node, v, in the lattice, we use the node id to select and project $R_i\{K'\}$ from $\overline{R}_i$. Therefore, a relation $R_i\{K'\}$ can be maintained virtually. Below, we use $R_i\{K'\}$ to denote such a sub-relation. It is fast to obtain $R_i\{K'\}$ if an index is built on the additional attribute Vid on relation $\overline{R}_i$.

CNEvalDynamic is outlined in Algorithm 6. When a new update operation, op(t, $R_i$), arrives, it is processed in lines 3-9 if the operation is an insertion or in lines 11-14 if it is a deletion. The procedure EvalPath joins all the needed tuples in a top-down fashion.
EvalPath is implemented similarly to the semijoin/join based static evaluation discussed above, using an additional path that records where the join sequence comes from, to reduce join cost. The two procedures, insert and delete, maintain a list of tuples for each node in the lattice using only selections (lines 17-18, lines 26-27, and lines 34-35). The selected tuples can join at least one tuple from each list of its child nodes in the lattice. If the list of one node in the lattice changes, it triggers the father nodes to change their lists accordingly (lines 24-27 and lines 32-35). If the root node changes, the results should be updated. At this time, we use joins to report the updated MTJNTs. When we join, all the tuples that participate in joins contribute to the results. In this way, we achieve full reduction when joining.

As the number of results itself can be exponentially large, we analyze the extra cost for the algorithms to evaluate all CNs. The extra cost is defined as the number of tuples generated by the algorithm minus the number of tuples in the result. Suppose the number of tuples in every relation is n. Given a CN of size t, the extra cost of evaluating the CN with the left-deep tree proposed in S-KWS is $O(n^{t-1})$, and the extra cost of evaluating it with the CNEvalDynamic algorithm is $O(n \cdot t)$.

Finally, we discuss how to implement the event-driven evaluation. As shown in Figure 2.14, there are multiple nodes labeled with an identical $R_i\{K'\}$. For example, W{} appears in two different nodes in the lattice. For each $R_i\{K'\}$, we maintain 3 lists named Rlist (Ready), Wlist (Wait), and Slist (Suspend). The three lists together contain all the node ids in the lattice. A node in the lattice L labeled $R_i\{K'\}$ can appear in only one of the three lists for $R_i\{K'\}$.
Algorithm 6 CNEvalDynamic(L, Q, $\Delta$)
Input: An l-keyword query Q, a lattice L, and a set of MTJNTs denoted $\Delta$
 1: while a new update op(t, $R_i$) arrives do
 2:     let $K'$ be the set of all keywords appearing in tuple t
 3:     if op is to insert t into relation $R_i$ then
 4:         $\delta \leftarrow \emptyset$
 5:         for each v in L labeled $R_i\{K'\}$ do
 6:             path $\leftarrow \emptyset$
 7:             insert(v, t, $\delta$)
 8:         report new MTJNTs in $\delta$
 9:         $\Delta \leftarrow \Delta \cup \delta$
10:     else if op is to delete t from $R_i$ then
11:         for each v in L labeled $R_i\{K'\}$ do
12:             if $t \in R_i\{K'\}$ then
13:                 delete(v, t)
14:         delete MTJNTs in $\Delta$ that contain t, and report such deletions
15: Procedure insert(v, t, $\delta$)
16:     let the label of node v be $R_i\{K'\}$
17:     if $t \notin R_i\{K'\}$ and t can join at least one tuple in every relation represented by all v's children in L then
18:         insert tuple t into the sub-relation $R_i\{K'\}$
19:     if $t \in R_i\{K'\}$ then
20:         push (v, t) to path
21:         if v is a root node in L then
22:             $\delta \leftarrow \delta \cup$ EvalPath(v, t, path)
23:         else
24:             for each father node of v, u, in L do
25:                 let the label of node u be $R_j\{K''\}$
26:                 for each tuple $t'$ in $\pi_{K''}(r(R_j))$ that can join t do
27:                     insert(u, $t'$, $\delta$)
28:         pop (v, t) from path
29: Procedure delete(v, t)
30:     let the label of node v be $R_i\{K'\}$
31:     delete tuple t from the sub-relation $R_i\{K'\}$
32:     for each father node of v, u, in L do
33:         let the label of node u be $R_j\{K''\}$
34:         for each tuple $t'$ in $R_j\{K''\}$ that can join t only do
35:             delete(u, $t'$)

A node v in L appears in Wlist if the sub-relations represented by all child nodes of v in L are non-empty but the sub-relation represented by v is empty. A node v in L appears in Rlist if the sub-relations represented by all child nodes of v in L are non-empty and the sub-relation represented by v itself is non-empty too. Otherwise, v appears in Slist. When a new tuple t of relation $R_i$ with keyword set $K'$ is inserted, we only insert it into the relations of the nodes v in L that are on the Rlist and Wlist specified for $R_i\{K'\}$.
Each insertion may notify some father nodes of v to move from Wlist or Slist to Rlist. Node v may also be moved from Wlist to Rlist. When a tuple t of relation $R_i$ with keyword set $K'$ is about to be deleted, we only remove it from all relations associated with node ...
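The rule that places a lattice node on Rlist, Wlist, or Slist can be written down as a small sketch. It assumes the lattice is given as a mapping from node id to child node ids, together with the current size of each node's sub-relation; the function `classify` and the sample node ids are hypothetical. It also assumes that a leaf node, having no children, trivially satisfies "all child sub-relations are non-empty", which the text does not state explicitly.

```python
def classify(vid, children_of, size_of):
    """Return the list ("Rlist", "Wlist", or "Slist") on which node `vid` belongs."""
    kids = children_of.get(vid, [])
    if all(size_of[c] > 0 for c in kids):
        # All children are non-empty: Ready if the node itself holds tuples,
        # otherwise it Waits for a tuple of its own.
        return "Rlist" if size_of[vid] > 0 else "Wlist"
    # At least one child is empty, so no tuple of this node can be kept yet.
    return "Slist"


# Hypothetical lattice fragment: 1 = P{XML}, 2 = A{Michelle}, 3 = C{}, 4 = W{}, 5 = P{}.
children_of = {3: [1], 4: [2], 5: [3, 4]}
size_of = {1: 2, 2: 0, 3: 0, 4: 0, 5: 0}   # current sub-relation sizes

lists = {vid: classify(vid, children_of, size_of) for vid in size_of}
print(lists)
```

Re-running `classify` on the father nodes after an insertion or deletion changes a sub-relation's size is one simple way to model the list moves described above; an implementation would more likely update the lists incrementally.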
