Keyword Search in Databases- P24 doc

CHAPTER 5

Other Topics for Keyword Search on Databases

In this chapter, we discuss several research issues regarding keyword search on databases. In Section 5.1, we discuss approaches that select, among many relational databases (RDBs), those best suited to answer a keyword query. In Section 5.2, we discuss keyword search in a spatial database. In Section 5.3, we introduce ObjectRank, a PageRank-based approach for RDBs, and an approach that projects a database so that it contains only the tuples related to a keyword query.

5.1 KEYWORD SEARCH ACROSS DATABASES

There are two main issues to be considered in keyword search across multiple databases:

1. When the number of databases is large, a proper subset of the databases that are most suitable to answer a keyword query needs to be selected. This is the problem of keyword-based selection of the top-k databases, and it is studied in M-KS [Yu et al., 2007] and G-KS [Vu et al., 2008].

2. The keyword query needs to be executed across the databases that are selected. This problem is studied in Kite [Sayyadian et al., 2007].

5.1.1 SELECTION OF DATABASES

In order to rank a set of databases $\mathcal{D} = \{D_1, D_2, \cdots\}$ according to their suitability to answer a certain keyword query $Q$, a score function $score(D, Q)$ is defined for each database $D \in \mathcal{D}$. In the ideal case, if the keyword query is evaluated in each database individually, the best database to answer the query is the one that generates the highest-quality results. Suppose $\mathcal{T} = \{T_1, T_2, \cdots\}$ is the set of results (MTJNTs, see Chapter 2) for query $Q$ over database $D$. The following equation can be used to score database $D$:

\[ score(D, Q) = \sum_{T \in \mathcal{T}} score(T, Q) \tag{5.1} \]

where $score(T, Q)$ can be any scoring function for the MTJNT $T$ as discussed in Chapter 2. In practice, it is inefficient to evaluate $Q$ on every database $D \in \mathcal{D}$.
A straightforward way to solve the problem efficiently is to calculate the keyword statistics for each $k_i \in Q$ on each database $D \in \mathcal{D}$ and summarize the statistics as a score reflecting the relevance of $Q$ to $D$. There are two drawbacks to this solution. First, keyword statistics cannot reveal the importance of a keyword to a database. For example, a term in a primary key attribute of a table may be referred to by a large number of foreign keys. Such a term may be very important in answering the keyword query, but its frequency in the database can be very low. Second, two different keywords can be connected through a sequence of foreign key references in a relational database. The length and number of such connections largely reveal the capability of the database to answer a certain keyword query. The statistics of single keywords cannot capture such relationships between keywords, so a method based on them may choose a database that has high keyword frequencies but does not generate any MTJNT.

Suppose the keyword space is $K = \{w_1, w_2, \cdots, w_s\}$. For each database $D \in \mathcal{D}$, we can construct a keyword relationship matrix (KRM) $R = (R_{i,j})_{s \times s}$, which is an $s$ by $s$ matrix where each element is defined as follows:

\[ R_{i,j} = \sum_{d=0}^{\delta} \varphi_d \cdot \omega_d(w_i, w_j) \tag{5.2} \]

Here, $\omega_d(w_i, w_j)$ is the number of joining sequences of length $d$, $t_0 \bowtie t_1 \bowtie \cdots \bowtie t_d$, where each $t_i \in D$ ($0 \le i \le d$) is a tuple, $t_0$ contains keyword $w_i$, and $t_d$ contains keyword $w_j$. $\delta$ is a parameter that controls the maximum length of the joining sequences, because a connection is meaningless if the two tuples $t_0$ and $t_d$ are too far away from each other. $\varphi_d$ is a function of $d$ that measures the importance of joining sequences of length $d$; it can be specified based on different requirements.
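The entry $R_{i,j}$ of Equation 5.2 can be sketched in Python as follows. This is only an illustration under stated assumptions: the data model (a dictionary from tuple ids to keyword sets, plus a list of joinable tuple pairs), the function names `count_sequences` and `krm_entry`, and the decision to count only joining sequences that never repeat a tuple are not taken from the source. The weighting function $\varphi_d$ is passed in as a parameter.

```python
from collections import defaultdict


def count_sequences(tuples, joins, wi, wj, d):
    """omega_d(wi, wj): number of joining sequences t0 - t1 - ... - td
    where t0 contains wi and td contains wj.

    Assumption (not from the source): a sequence never repeats a tuple,
    and joins are treated as undirected tuple-to-tuple links.
    """
    adj = defaultdict(set)
    for a, b in joins:
        adj[a].add(b)
        adj[b].add(a)

    def extend(path):
        # A path with d + 1 tuples has length d.
        if len(path) == d + 1:
            return 1 if wj in tuples[path[-1]] else 0
        return sum(extend(path + [n]) for n in adj[path[-1]] if n not in path)

    return sum(extend([t]) for t, kws in tuples.items() if wi in kws)


def krm_entry(tuples, joins, wi, wj, delta, phi):
    """R_{i,j} = sum over d = 0..delta of phi(d) * omega_d(wi, wj)."""
    return sum(
        phi(d) * count_sequences(tuples, joins, wi, wj, d)
        for d in range(delta + 1)
    )


# Hypothetical toy database: an author, a paper, and a link tuple between them.
tuples = {
    "a1": {"smith"},          # author tuple
    "w1": set(),              # author-paper link tuple, no keywords
    "p1": {"xml"},            # paper tuple
    "p2": {"xml", "smith"},   # paper whose title mentions both keywords
}
joins = [("a1", "w1"), ("w1", "p1")]

# Weighting 1 / (d + 1), the choice attributed to M-KS in Equation 5.3.
r = krm_entry(tuples, joins, "smith", "xml", delta=2, phi=lambda d: 1 / (d + 1))
print(r)  # 1 * 1 + (1/2) * 0 + (1/3) * 1
```

Enumerating paths explicitly is exponential in $\delta$ and is used here only to keep the definition visible; it is not meant as an efficient way to build the matrix.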
For example, in M-KS [Yu et al., 2007], it is defined as

\[ \varphi_d = \frac{1}{d + 1} \tag{5.3} \]

The value $\omega_d(w_i, w_j)$ increases exponentially with respect to $d$, so another control parameter $M$ is set such that if $\sum_{d=0}^{\delta} \omega_d(w_i, w_j) > M$, the value of $R_{i,j}$ is changed to

\[ R_{i,j} = \sum_{d=0}^{\delta' - 1} \varphi_d \cdot \omega_d(w_i, w_j) + \varphi_{\delta'} \cdot \Big( M - \sum_{d=0}^{\delta' - 1} \omega_d(w_i, w_j) \Big) \tag{5.4} \]

where $\delta'$ is a value such that $\delta' \le \delta$, $\sum_{d=0}^{\delta'} \omega_d(w_i, w_j) \ge M$, and $\sum_{d=0}^{\delta' - 1} \omega_d(w_i, w_j) < M$, i.e., $\delta' = \min\{\delta_p \mid \sum_{d=0}^{\delta_p} \omega_d(w_i, w_j) \ge M\}$.

Given the KRM of database $D$ and a keyword query $Q$, $score(D, Q)$ can be calculated as follows:

\[ score(D, Q) = \sum_{w_i \in Q,\, w_j \in Q,\, i < j} R_{i,j} \tag{5.5} \]

In place of summation, it is possible to use the aggregate functions min, max, or product, according to different requirements.

A number of drawbacks of KRM have been identified [Vu et al., 2008]. First, KRM only considers the pairwise relationships between the keywords in a query, and this may generate many false positives, because each real result (MTJNT) connects all keywords in the shape of a tree rather than a pairwise graph. Second, considering only the connections between keywords in a relational database is not enough to rank databases; it is also important to integrate IR-style scores into the scoring function. These drawbacks can be addressed as follows.

Figure 5.1: (a) A KRG and (b) one of its JKTs for query $Q = \{w_1, w_2, w_3, w_4\}$ (figure not reproduced)

Suppose the keyword space is $K = \{w_1, w_2, \cdots\}$. For each database $D \in \mathcal{D}$, a keyword relationship graph (KRG) $G(V, E)$ can be constructed, where, for each keyword $w_i \in K$, there is a node $w_i \in V(G)$, and for every two keywords $w_i \in K$ and $w_j \in K$, if $w_i$ and $w_j$ can be connected through at least one joining sequence of tuples in $D$, then an edge $(w_i, w_j) \in E(G)$ is added. For each edge $(w_i, w_j) \in E(G)$, a set of weights is assigned.
More precisely, when there is a joining sequence of tuples of length $d$ that connects $w_i$ and $w_j$ at its two ends, a weight $d$ is added to the edge $(w_i, w_j)$ in $G$.

Given the KRG $G$ for database $D$ and a keyword query $Q = \{k_1, k_2, \cdots, k_l\}$, a Join Keyword Tree (JKT) is a tree that satisfies the following conditions.

• Each node in the tree contains at least one keyword.

• The tree contains all the keywords (total), and there exists no subtree that contains all the keywords (minimal).

• Each edge of the tree has a positive integer weight, and the total weight of all edges in the tree is smaller than $T_{max}$.¹

• For any two keywords $w_i$ and $w_j$ contained in nodes $v_1$ and $v_2$, respectively, suppose the distance (total weight of edges) between $v_1$ and $v_2$ in the tree is $d$; then there exists an edge $(w_i, w_j)$ in $G$ that has $d$ among its weights.

¹ $T_{max}$ is the maximum number of nodes allowed in a tree.

An example of a KRG is shown in Figure 5.1(a), where there are five keywords $w_1, w_2, \cdots, w_5$. For edge $(w_1, w_5)$, the two keywords are contained in a certain tuple in database $D$, so we add weight 0. There also exists a joining sequence of length 2 that connects $w_1$ and $w_5$ at its two ends, so we add weight 2. A JKT of the KRG is shown in Figure 5.1(b). For the two keywords $w_1$ and $w_2$ in the JKT, their distance in the tree is 0 because they are contained in the same node. The edge $(w_2, w_3)$ of the KRG has weight 0. The distance between the two keywords $w_2$ and $w_4$ is 3, and we can also find that edge $(w_2, w_4)$ has weight 3.

Given a database $D$ and its KRG $G$, we have the following theorem.

Theorem 5.1 Given a keyword query $Q$, for a database $D \in \mathcal{D}$, if its KRG does not contain a JKT for $Q$, the results (MTJNTs) for the keyword query $Q$ over the database $D$ will be empty.

Theorem 5.1 can be used to prune the databases whose KRG does not contain a JKT for the keyword query $Q$.
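The JKT conditions can be made concrete with a small validity check. The sketch below is an illustration under assumptions, not the procedure used in G-KS: the representations (a KRG as a dictionary from keyword pairs to weight sets, a candidate tree as node and edge dictionaries), the function names, and the toy data are all hypothetical. It assumes the candidate is already a connected tree in which each keyword occurs in exactly one node, and it does not test minimality. Pruning with Theorem 5.1 would additionally require searching for some valid JKT, which is not shown.

```python
from itertools import combinations


def tree_distances(nodes, edges):
    """All-pairs weighted distances in a tree given as {(u, v): weight}.

    Assumes the input really is a connected, acyclic tree.
    """
    adj = {n: [] for n in nodes}
    for (u, v), w in edges.items():
        adj[u].append((v, w))
        adj[v].append((u, w))
    dist = {}
    for src in nodes:
        dist[src, src] = 0
        stack = [(src, None, 0)]
        while stack:
            cur, parent, d = stack.pop()
            for nxt, w in adj[cur]:
                if nxt != parent:
                    dist[src, nxt] = d + w
                    stack.append((nxt, cur, d + w))
    return dist


def is_valid_jkt(nodes, edges, krg, query, t_max):
    """Check the node, coverage, weight, and distance conditions of a JKT.

    nodes: {node_id: set of keywords}
    edges: {(node_id, node_id): positive integer weight}
    krg:   {frozenset({wi, wj}): set of weights on KRG edge (wi, wj)}
    """
    if any(not kws for kws in nodes.values()):
        return False  # every node must contain at least one keyword
    if not set(query) <= set().union(*nodes.values()):
        return False  # the tree must contain all query keywords
    if any(w <= 0 for w in edges.values()) or sum(edges.values()) >= t_max:
        return False  # positive weights, total weight smaller than t_max
    dist = tree_distances(nodes, edges)
    where = {w: n for n, kws in nodes.items() for w in kws}
    for wi, wj in combinations(query, 2):
        d = dist[where[wi], where[wj]]
        if d not in krg.get(frozenset((wi, wj)), set()):
            return False  # the KRG has no edge (wi, wj) carrying weight d
    return True


# Hypothetical KRG and candidate tree (not the ones in Figure 5.1).
krg = {
    frozenset(("w1", "w2")): {0},
    frozenset(("w1", "w3")): {1, 2},
    frozenset(("w2", "w3")): {1},
}
nodes = {"n1": {"w1", "w2"}, "n2": {"w3"}}
edges = {("n1", "n2"): 1}
print(is_valid_jkt(nodes, edges, krg, ["w1", "w2", "w3"], t_max=5))
```

Keeping a set of weights per KRG edge mirrors the description in the text, where one keyword pair can be connected by joining sequences of several different lengths.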
For the other databases, a new scoring function is defined in order to rank them. The scoring function considers both the IR ranking score and the structural score (distances between keywords). The score consists of two parts, namely the node score and the edge score. For a database $D$ that is not pruned and for the keyword space $K$, the node score and the edge score are as follows:

• The node score: The score of each keyword $w_i \in K$ is

\[ score(D, w_i) = \frac{\sum_{t \in D \text{ and } t \text{ contains } w_i} score(t, D, w_i)}{N(D, w_i)} \tag{5.6} \]

where $N(D, w_i)$ is the number of tuples in $D$ that contain keyword $w_i$, and the score of each tuple $t$ with respect to $w_i$, $score(t, D, w_i)$, is defined as follows:

\[ score(t, D, w_i) = \frac{tf(t, w_i)}{\sum_{w \in t} tf(t, w)} \cdot \ln \frac{N(D)}{N(D, w_i) + 1} \tag{5.7} \]

where $tf(t, w_i)$ is the term frequency of $w_i$ in the tuple $t$, and $N(D)$ is the total number of tuples in $D$.

• The edge score: For any two keywords $w_i \in K$ and $w_j \in K$, the edge score $score(D, w_i, w_j)$ is defined as follows:

\[ score(D, w_i, w_j) = \sum_{d=1}^{\delta} score_d(D, w_i, w_j) \tag{5.8} \]

Here $\delta$ is a parameter that controls the maximum distance between two keywords, and

\[ score_d(D, w_i, w_j) = \frac{\sum_{(t, t') \in P_d(w_i, w_j, D)} tf(t, w_i) \cdot tf(t', w_j) \cdot \ln \frac{N_d(D)}{N_d(w_i, w_j, D) + 1}}{N_d(w_i, w_j, D)} \tag{5.9} \]
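A minimal sketch of the node score in Equations 5.6 and 5.7 follows. It assumes that the logarithm in Equation 5.7 is taken over $N(D) / (N(D, w_i) + 1)$, that a tuple is represented as a list of terms, and that a keyword occurring in no tuple scores 0; these choices and the function names are illustrative assumptions, not details confirmed by the source. The edge score is not sketched because it depends on $P_d$ and $N_d$, whose definitions are not given in this excerpt.

```python
import math
from collections import Counter


def tuple_score(tuple_terms, keyword, n_total, n_with_keyword):
    """Equation 5.7: normalized term frequency times a log-scaled rarity factor."""
    tf = Counter(tuple_terms)
    return tf[keyword] / sum(tf.values()) * math.log(n_total / (n_with_keyword + 1))


def node_score(database, keyword):
    """Equation 5.6: average tuple score over the tuples that contain the keyword.

    database: list of tuples, each given as a list of terms.
    Returns 0.0 when no tuple contains the keyword (an assumed convention).
    """
    containing = [t for t in database if keyword in t]
    if not containing:
        return 0.0
    n_total = len(database)
    n_with = len(containing)
    total = sum(tuple_score(t, keyword, n_total, n_with) for t in containing)
    return total / n_with


# Hypothetical four-tuple database; "xml" occurs in two of the tuples.
database = [
    ["xml", "search"],
    ["xml"],
    ["graph"],
    ["join", "tree"],
]
print(node_score(database, "xml"))  # (0.5 + 1.0) / 2 * ln(4 / 3)
```

Note that with this form of the logarithm the score becomes negative once a keyword occurs in nearly every tuple, since $N(D) / (N(D, w_i) + 1)$ then drops below 1.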
