5. OTHER TOPICS FOR KEYWORD SEARCH ON DATABASES

• Global ObjectRank: The global ObjectRank vector $r^G$ is defined as follows:

$$r^G = d A r^G + (1 - d)\,\frac{e}{n} \tag{5.13}$$

Here, $d$ is a constant in $(0, 1)$ and $r^G$ is a vector of size $n$, where $r^G[i]$ denotes the global ObjectRank of object $v_i$. The intuition behind the global ObjectRank is that an object is important if it is pointed to by many other important objects. The global ObjectRank of each object depends only on the structure of the graph and is keyword independent.

• Keyword-specific ObjectRank: For each keyword $k$, the keyword-specific ObjectRank vector $r^k$ is defined as follows:

$$r^k = d A r^k + (1 - d)\,\frac{s(k)}{n(k)} \tag{5.14}$$

Here, $n(k)$ is the number of objects that contain the keyword $k$ and $r^k$ is a vector of size $n$, where $r^k[i]$ denotes the ObjectRank of object $v_i$ with respect to keyword $k$. The intuition behind the keyword-specific ObjectRank is that an object is important if (1) it is pointed to by many other important objects, or (2) it contains the specific keyword $k$ or is pointed to by many other important objects that contain the specific keyword $k$.

• Multiple-Keywords ObjectRank: The keyword-specific ObjectRank deals with only one keyword. When there are multiple keywords, $Q = \{k_1, k_2, \ldots, k_l\}$, the keyword-specific scores are combined into the Multiple-Keywords ObjectRank score $r^Q$ according to the chosen semantics. More specifically, under the AND semantics, it becomes

$$r^Q[i] = \prod_{k \in Q} r^k[i] \tag{5.15}$$

and under the OR semantics, it becomes

$$r^Q[i] = \sum_{a} r^{k_a}[i] - \sum_{a < b} r^{k_a}[i] \cdot r^{k_b}[i] + \sum_{a < b < c} r^{k_a}[i] \cdot r^{k_b}[i] \cdot r^{k_c}[i] - \cdots \tag{5.16}$$

• The final ObjectRank: The final ObjectRank $r^{G,Q}[i]$ of each object $v_i$ with respect to the keyword query $Q$ is defined as follows:

$$r^{G,Q}[i] = r^Q[i] \cdot \left(r^G[i]\right)^g \tag{5.17}$$

where $g$ is a parameter denoting the importance of the global ObjectRank in the final rank.

Based on the final ObjectRank, the top-k objects are returned to answer a given keyword query.

Chakrabarti et al.
[2006] studied the binary (many-to-many) relationships between objects and documents. For each object $t$, suppose there are $n$ documents associated with it, denoted $D_t = \{d_1, d_2, \ldots, d_n\}$. For a keyword query $Q = \{k_1, k_2, \ldots, k_l\}$, let the score of keyword $k_i$ on document $d_j$ be $\mathrm{DocScore}(d_j, k_i)$. Two classes of score functions are given to rank objects:

• Row-Marginal Class: Score functions of this class first combine the scores over the keywords for each document, and then aggregate the combined scores over the documents.

$$\mathrm{score}(t, Q) = \mathrm{agg}_{d \in D_t}\left(\mathrm{comb}_{k \in Q}\left(\mathrm{DocScore}(d, k)\right)\right) \tag{5.18}$$

• Column-Marginal Class: Score functions of this class first aggregate the scores over the documents for each keyword, and then combine the aggregated scores over the keywords.

$$\mathrm{score}(t, Q) = \mathrm{comb}_{k \in Q}\left(\mathrm{agg}_{d \in D_t}\left(\mathrm{DocScore}(d, k)\right)\right) \tag{5.19}$$

Based on the two classes of score functions and the monotonicity of the aggregate and combine functions, several early-stop strategies can be designed to obtain the top-k objects.

5.3.2 SUBSPACES AS RESULTS

KDAP [Wu et al., 2007] focuses on keyword search in On-Line Analytical Processing (OLAP). Given a data warehouse whose schema contains both the fact table and the dimensions, a keyword query $Q = \{k_1, k_2, \ldots, k_l\}$ is processed in two steps.

1. In the first step, differentiation, a set of subspaces is found such that each subspace can be considered as a star that consists of a set of dimension-to-fact tuple paths. In each such path, at least one keyword is matched to the text attributes of the dimension tuple. Each star can be considered as an interpretation of the keyword query. The user needs to select one star, which is processed in the second step.

2. In the second step, exploration, the star (subspace) selected in the first step is further divided into smaller subspaces (dynamic facets) according to the set of dimensions that do not appear in the star.
The criterion for choosing the dimensions on which to divide is application dependent. For example, a dimension can be chosen such that the aggregated measure of a subspace differs greatly from that of the other subspaces at the same level. After such partitioning, roll-up and drill-down operations are also allowed for navigating the result space.

For example, consider a keyword query $Q = \{\text{IBM}, \text{notebook}\}$ against a fact table with three dimensions (brand, name, color) and a measure price. In the first step, subspaces such as (brand=IBM, name=notebook X61), (brand=IBM, name=notebook T60), etc., are returned. Suppose the user chooses the subspace (brand=IBM, name=notebook X61); in the second step, the chosen subspace is further divided into smaller subspaces, such as (brand=IBM, name=notebook X61, color=white), (brand=IBM, name=notebook X61, color=black), etc., and the price for each subspace is aggregated.

In Agg-Keyword-Query [Zhou and Pei, 2009], aggregate keyword search over a relation $R$ is studied. Suppose the attributes of relation $R$ consist of two parts: the dimension attributes $A_D = \{d_1, d_2, \ldots, d_n\}$ and the text attributes $A_T = \{a_1, a_2, \ldots, a_s\}$. The dimension attributes are used to represent subspaces, and the text attributes hold the texts over which keyword search is allowed. For a tuple $t$, let $t.d_i$ ($1 \le i \le n$) denote the value of the dimension attribute $d_i$ of $t$. A tuple $t$ is said to contain keyword $k$ if $k$ is contained in one of the text attributes of $t$. For a subspace $c = (v_1, v_2, \ldots, v_n)$ that consists of $n$ values, let $S(c)$ denote the set of tuples $t$ in $R$ such that $t.d_i = c.v_i$ for all $1 \le i \le n$, where a value $v_i = *$ ($1 \le i \le n$) serves as a wildcard that matches every value.
Then,

$$S(c) = \{\, t \mid t \in R \text{ and } (t.d_i = c.v_i \text{ or } c.v_i = *) \text{ for all } 1 \le i \le n \,\} \tag{5.20}$$

Given a keyword query $Q = \{k_1, k_2, \ldots, k_l\}$, a subspace $c$ is said to match $Q$ if every $k \in Q$ is contained in at least one tuple in $S(c)$. For any two subspaces $c_1$ and $c_2$, $c_1 \prec c_2$ if, for all $1 \le i \le n$, $c_2.v_i \ne *$ implies $c_1.v_i = c_2.v_i$. The answer to keyword query $Q$ consists of all subspaces $c$ such that (1) $c$ matches $Q$ (Total) and (2) there is no other subspace $c'$ such that $c'$ also matches $Q$ and $c' \prec c$ (Minimal). The answer can be generated by iteratively joining the list of tuples that contain keyword $k_i$ with the list of tuples that contain keyword $k_{i+1}$, for all $1 \le i \le l - 1$, in order to get a set of candidate subspaces, and pruning the non-minimal subspaces after each join.

As an example, suppose the relation $R$ contains three dimension attributes $A_D = \{\text{brand}, \text{name}, \text{color}\}$ and one text attribute $A_T = \{\text{complaint}\}$, and the keyword query is $Q = \{\text{monitor}, \text{keyboard}\}$. The result contains a list of subspaces such that every keyword in $Q$ is contained in the text attribute of at least one tuple in the returned subspace, such as (brand=IBM, name=notebook X61, color=*), (brand=IBM, name=notebook T60, color=black), etc. A subspace such as (brand=IBM, name=notebook T60, color=*) will not be returned if (brand=IBM, name=notebook T60, color=black) is already in the result, because the latter is more specific.

5.3.3 TERMS AS RESULTS

Frequent co-occurring term (FCT) [Tao and Yu, 2009] focuses on finding frequent co-occurring terms for keyword search in a relational database. Given a keyword query $Q = \{k_1, k_2, \ldots, k_l\}$, the traditional methods return a set of MTJNTs $\mathcal{T}(Q) = \{T_1, T_2, \ldots\}$ as the result of the keyword query. In FCT, rather than returning a set of MTJNTs, a set of terms is returned.
For example, in the DBLP database, for the keyword query $Q = \{\text{database}, \text{management}, \text{system}\}$, one of the co-occurring terms may be Jim Gray, which suggests that Jim Gray is an expert on database management systems. For an MTJNT $T$ and a term $w$, let $\mathrm{count}(T, w)$ denote the number of occurrences of $w$ in $T$. The frequency of a term $w$ with respect to keyword query $Q$ is defined as follows:

$$\mathrm{freq}(Q, w) = \sum_{T \in \mathcal{T}(Q)} \mathrm{count}(T, w) \tag{5.21}$$

Given a keyword query $Q$ and an integer $k$, FCT search retrieves the top-k terms $w \notin Q$ with the highest frequency $\mathrm{freq}(Q, w)$.

A naive approach to answering an FCT query is to first evaluate the keyword query in order to generate all MTJNTs. Then, for each term $w$, it counts the term frequency $\mathrm{count}(T, w)$ for all $T \in \mathcal{T}(Q)$ and sums the counts to get the top-k terms. An efficient approach is proposed in FCT that obtains the results without enumerating all MTJNTs. Suppose the set of CNs for keyword query $Q$ is $\mathcal{C}(Q) = \{C_1, C_2, \ldots\}$. For each $C \in \mathcal{C}(Q)$, let $\mathrm{MTJNT}(C)$ denote the set of MTJNTs generated by $C$, so that $\mathcal{T}(Q) = \bigcup_{C \in \mathcal{C}(Q)} \mathrm{MTJNT}(C)$. Then, all CNs $C \in \mathcal{C}(Q)$ can be evaluated individually, i.e.,

$$\mathrm{freq}(Q, w) = \sum_{C \in \mathcal{C}(Q)} \mathrm{freq}(C, w) \tag{5.22}$$

where $\mathrm{freq}(C, w)$ is the frequency with which $w$ occurs in the MTJNTs of $\mathrm{MTJNT}(C)$, i.e.,

$$\mathrm{freq}(C, w) = \sum_{T \in \mathrm{MTJNT}(C)} \mathrm{count}(T, w) \tag{5.23}$$

A CN $C$ is said to be a star CN if there is a root node $R$ in $C$ such that all the other nodes $\{R_1, R_2, \ldots, R_s\}$ in $C$ are connected to $R$. A two-step approach is used to evaluate a star CN. In the first step, it finds the tuple frequency for all tuples in the database that contain keywords in $\mathrm{MTJNT}(C)$. In the second step, it counts the term frequencies using the tuple frequencies calculated in the first step. A non-star CN $C$ can be made into a star CN by joining some tuples and considering the joined relation as a single node in the CN.
For example, the CN $C = A\{\text{Michelle}\} \bowtie W \bowtie P \bowtie C \bowtie P\{\text{XML}\}$ is not a star CN. It can be made into a star CN by combining $R' = C \bowtie P\{\text{XML}\}$ into a single node $R'$ to obtain $C' = A\{\text{Michelle}\} \bowtie W \bowtie R'$, which is a star CN.

DataClouds [Koutrika et al., 2009] finds a set of terms, called data clouds, that are most relevant to the keyword query. The set of terms can be used to guide users in refining their searches. In DataClouds, given a keyword query $Q = \{k_1, k_2, \ldots, k_l\}$ over a relational database $D$, each result of $Q$ is not an MTJNT but a subgraph of the database graph that is centered at a tuple $t$ and includes the nodes and edges within a certain distance of $t$. Each result can be uniquely identified by its center tuple $t$ and is denoted $G(t)$. The answer to keyword query $Q$ is $\mathrm{ans}(Q) = \{G(t) \mid t \in D \text{ and } G(t) \text{ contains all keywords in } Q\}$. Here, $G(t)$ contains a keyword $k$ if $k$ is contained in any tuple of $G(t)$. The score of each result $G(t)$ is defined as $\mathrm{score}(G(t), Q) = \sum_{k \in Q} \mathrm{tf}(G(t), k) \cdot \mathrm{idf}(k)$. Here, $\mathrm{tf}(G(t), k)$ is the IR ranking score obtained by considering each $G(t)$ as a virtual document. For each term $w \notin Q$, DataClouds considers three kinds of scoring functions to rank $w$ with respect to $Q$.

• Popularity-based: This score is similar to the term frequency score defined in FCT [Tao and Yu, 2009]:

$$\mathrm{score}(Q, w) = \sum_{G(t) \in \mathrm{ans}(Q)} \mathrm{freq}(G(t), w) \tag{5.24}$$

Here, $\mathrm{freq}(G(t), w)$ is the number of occurrences of $w$ in $G(t)$.

• Relevance-based: As different terms have different importance, the IR scores are used as term weights:

$$\mathrm{score}(Q, w) = \sum_{G(t) \in \mathrm{ans}(Q)} \mathrm{tf}(G(t), w) \cdot \mathrm{idf}(w) \tag{5.25}$$

• Query-dependence: The relevance-based score only considers the importance of a term itself, without considering which result the term comes from. In some situations, the more important a result $G(t)$ is, the higher the weights of the terms that come from $G(t)$ should be.
Thus,

$$\mathrm{score}(Q, w) = \sum_{G(t) \in \mathrm{ans}(Q)} \left(\mathrm{tf}(G(t), w) \cdot \mathrm{idf}(w)\right) \cdot \mathrm{score}(G(t), Q) \tag{5.26}$$

In FCT, each term is treated equally, and each result is also treated equally. Whenever a term appears in a result, it contributes a unit score to the final score. The popularity-based score is similar to the score defined in FCT. In the relevance-based score, each result is treated equally, but the terms are not: the contribution of a term in a result is proportional to its TF-IDF relevance score with respect to the result. In the query-dependence score, neither the results nor the terms are treated equally: the contribution of a term in a result is proportional to its TF-IDF relevance score as well as to the query's TF-IDF relevance score with respect to the result.

5.3.4 SQL QUERIES AS RESULTS

SQL queries enable users to query relational databases, but they also require users to have knowledge of the RDBMS as well as of the syntax of SQL. For a non-expert user, a practical and easy way to query a relational database is to use a keyword query, which is less expressive than SQL. Therefore, there is a trade-off between the expressive power of the user-given query and the ease of using it. Given a keyword query over a relational database, many systems (e.g., [Chu et al., 2009; Tata and Lohman, 2008; Tran et al., 2009]) attempt to interpret the keyword query as the top-k SQL queries that best explain it. Such SQL queries can be represented as forms [Chu et al., 2009] or conjunctive queries [Tran et al., 2009]. All such systems use similar ideas to generate SQL queries. We introduce SQAK [Tata and Lohman, 2008] as the representative approach.

In SQAK, the system enables users to post keyword queries that include aggregations. For example, for the DBLP database with the schema graph shown in Figure 2.1, a user can post the keyword query $Q = \{\text{author}, \text{number}, \text{paper}, \text{XML}\}$ to get the number of papers about “XML” for each author.
One of the possible results for such a query is the following SQL statement:

    select count(Paper.TID), Author.TID, Author.Name
    from Paper, Write, Author
    where Paper.TID=Write.PID and Write.AID=Author.TID and contain(Paper.Title,XML)
    group by Author.TID, Author.Name

Formally, a keyword query consists of two parts: the general keywords, such as “paper”, “author”, and “XML”, and the aggregate keywords, such as “number”. In the first step of the SQAK system, the keyword query is interpreted into a list of Candidate Interpretations (CIs), where each $CI = (C, a, F, w)$ has four parts: $C$ includes a set of attributes and predicates on some of the attributes (e.g., the keywords contained in each attribute), $a$ is an attribute in $C$ over which the aggregate function is performed, $F$ is the aggregate function, and $w$ is the group-by attribute in $C$.
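
To make the four parts of a candidate interpretation concrete, the following is a minimal Python sketch that stores $CI = (C, a, F, w)$ in a small record and renders it as a SQL string. It is an illustration only, not SQAK's actual procedure: the class `CandidateInterpretation`, its field names, and the function `render_sql` are hypothetical, the group-by part is stored as a tuple because the example statement groups by two columns, and the table list and join conditions are supplied by hand rather than derived from the schema graph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateInterpretation:
    """Hypothetical container for CI = (C, a, F, w)."""
    attributes: tuple     # C: the attributes involved in the interpretation
    predicates: tuple     # C: keyword predicates on some of the attributes
    agg_attribute: str    # a: attribute the aggregate function is applied to
    agg_function: str     # F: the aggregate function
    group_by: tuple       # w: the group-by attribute(s)


def render_sql(ci, tables, joins):
    """Render a CI as SQL; tables and joins are given by hand (an assumption)."""
    select_list = [f"{ci.agg_function}({ci.agg_attribute})", *ci.group_by]
    conditions = [*joins, *ci.predicates]
    return (
        f"select {', '.join(select_list)} "
        f"from {', '.join(tables)} "
        f"where {' and '.join(conditions)} "
        f"group by {', '.join(ci.group_by)}"
    )


# One interpretation of Q = {author, number, paper, XML} on the DBLP schema.
ci = CandidateInterpretation(
    attributes=("Paper.TID", "Paper.Title", "Author.TID", "Author.Name"),
    predicates=("contain(Paper.Title,XML)",),
    agg_attribute="Paper.TID",
    agg_function="count",
    group_by=("Author.TID", "Author.Name"),
)
sql = render_sql(
    ci,
    tables=("Paper", "Write", "Author"),
    joins=("Paper.TID=Write.PID", "Write.AID=Author.TID"),
)
print(sql)
```

With these inputs the rendered string is the example statement given above, written on a single line. Keeping the interpretation separate from the join information reflects the description in the text, where a CI only fixes the attributes, predicates, aggregate, and grouping.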