Advances in Database Technology - P10


432 A. Nash and B. Ludäscher

This domain enumeration approach has been used in various forms [DL97]. Note that in our setting of ANSWER* we can create a very dynamic handling of answers: if ANSWER* determines that the user may want to decide at that point whether he or she is satisfied with the answer or whether the possibly costly domain enumeration views should be used. Similarly, the relative answer completeness provided by ANSWER* can be used to guide the user and/or the system when introducing domain enumeration views.

5 Feasibility of Unions of Conjunctive Queries with Negation

We now establish the complexity of deciding feasibility for safe UCQ¬ queries.

5.1 Query Containment

We need to consider query containment for UCQ¬ queries. In general, query P is said to be contained in query Q (in symbols, P ⊑ Q) if for all instances D, the answer to P on D is a subset of the answer to Q on D. We write CONT(L) for the following decision problem: for a class of queries L, given P and Q in L, determine whether P ⊑ Q. For queries P and Q, a function σ from the variables of Q to the variables of P is a containment mapping if P and Q have the same free (distinguished) variables, σ is the identity on the free variables of Q, and, for every literal in Q, there is a literal in P that is its image under σ.

Some early results in database theory are:

Proposition [CM77] CONT(CQ) and CONT(UCQ) are NP-complete.

Proposition [SY80,LS93] CONT(CQ¬) and CONT(UCQ¬) are Π^P_2-complete.

For many important special cases, testing containment can be done efficiently. In particular, the algorithm given in [WL03] for containment of safe CQ¬ and UCQ¬ uses an algorithm for CONT(CQ) as a subroutine. Chekuri and Rajaraman [CR97] show that containment of acyclic CQs can be solved in polynomial time (they also consider wider classes of CQs), and Saraiya [Sar91] shows that containment of CQs, in the case where no relation appears more than twice in the body, can be solved in linear time. By the nature of the algorithm in [WL03], these gains in efficiency will be passed on directly to the test for containment of CQs and UCQs (so the check will be in NP) and will also improve the test for containment of CQ¬ and UCQ¬.

5.2 Feasibility

Definition (Feasibility
Problem) FEASIBLE(L) is the following decision problem: given Q in L, decide whether Q is feasible for the given access patterns.

Processing Unions of Conjunctive Queries with Negation 433

Before proving our main results, Theorems 16 and 18, we need to establish a number of auxiliary results. Recall that we assume queries to be safe; in particular, Theorems 12 and 13 hold only for safe queries.

Proposition 8 A query Q in CQ¬ is unsatisfiable iff there exist a relation R and terms x̄ so that both R(x̄) and ¬R(x̄) appear in Q.

PROOF Clearly, if there are such R and x̄, then Q is unsatisfiable. If not, then the frozen positive part of Q is a Herbrand model of Q, so Q is satisfiable.

Therefore, we can check whether Q is satisfiable in quadratic time: for every R(x̄) in Q, look for ¬R(x̄) in Q.

Proposition 9 If is Q-answerable, then it is

Proposition 10 If is Q-answerable, and for every literal in is P-answerable, then is P-answerable.

PROOF If is Q-answerable, it is by Proposition. By definition, there must be executable consisting of and literals from. Since every literal in is P-answerable, there must be executable consisting of and literals from P. Then the conjunction of all is executable and consists of and literals from P. That is, is P-answerable.

Proposition 11 If is a containment mapping and is Q-answerable, then is P-answerable.

PROOF If the hypotheses hold, there must be executable consisting of and literals from Q. Then consists of and literals from P. Since we can use the same adornments for as the ones we have for, is executable and therefore is P-answerable.

Given queries P and Q whose bodies are quantifier free (i.e., consist only of joins), we write P, Q to denote the query whose body is the conjunction of the two bodies. Recently, [WL03] gave the following theorems.

Theorem 12 [WL03, Theorem 2] If P and Q are in CQ¬, then P ⊑ Q iff P is unsatisfiable or there is a containment mapping σ from Q to P, witnessing containment of the positive parts, such that, for every negative literal ¬R(ȳ) in Q, R(σ(ȳ)) is not in P and P, R(σ(ȳ)) ⊑ Q.

Theorem 13 [WL03, Theorem 5] If P is in CQ¬ and Q = Q_1 ∨ ... ∨ Q_k is in UCQ¬, then P ⊑ Q iff P is unsatisfiable or there is an i with 1 ≤ i ≤ k and a containment mapping σ from Q_i to P, witnessing containment of the positive parts, such that, for every negative literal ¬R(ȳ) in Q_i, R(σ(ȳ)) is not in P and P, R(σ(ȳ)) ⊑ Q.

Therefore, if and with we have that iff there is a tree with root for some and where each node is of the form and represents a true containment, except when is unsatisfiable, in which case also the node has no children. Otherwise, for some containment mapping witnessing the containment, there is one child for every negative literal in. Each child is of the form where and appears in. We will need the following two facts about this tree, in the special case where with E executable, in the proof of Theorem 16.

Lemma 14 If answerable is it is

PROOF By induction. It is obvious for. Assume that the lemma holds for and that is. We have for some witnessed by a containment mapping and for some literal appearing in. Since is executable, by Propositions and 9, is. Therefore by Proposition 11, is and by the induction hypothesis,. Therefore, by Proposition 10 and the induction hypothesis,

Lemma 15 If the conjunction Q, is unsatisfiable, then the conjunction ans(Q), is also unsatisfiable.

PROOF If Q is satisfiable, but Q, is unsatisfiable, then by Proposition we must have some in Q; must have been added from some in and some containment map satisfying. Since is executable, by Propositions and 9, is answerable. Therefore by Proposition 11, is answerable and by Lemma 14, is. Therefore, we must have in ans(Q), so ans(Q), is also unsatisfiable.

We include here the proof of Proposition 4 and then prove our main results, Theorems 16 and 18.

PROOF (Proposition 4) For this is clear since ans(Q) contains only literals from Q and therefore the identity map is a containment mapping from ans(Q) to Q. If and Q is unsatisfiable, the result is obvious. Otherwise the identity is a containment mapping from to. If a negative literal appears in ans(Q), then since also appears in Q, we have that is unsatisfiable, and therefore by Theorem 12
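The two basic tests this section relies on can be sketched in code: the quadratic satisfiability check for a conjunctive query with negation (look for a relation and terms that occur both positively and negatively), and a brute-force search for a containment mapping as defined in Section 5.1. This is only an illustrative sketch, not the authors' implementation. The `Literal` type and the function names are hypothetical, all terms are assumed to be variables, and the search assumes a literal may only be mapped onto a literal of the same sign.

```python
from typing import Dict, NamedTuple, Optional, Sequence, Tuple


class Literal(NamedTuple):
    """Hypothetical encoding of one literal; terms are variable names only."""
    relation: str
    terms: Tuple[str, ...]
    positive: bool = True


def is_unsatisfiable(query: Sequence[Literal]) -> bool:
    """Quadratic scan: some atom occurs both positively and negatively."""
    positives = [lit for lit in query if lit.positive]
    negatives = [lit for lit in query if not lit.positive]
    return any(
        p.relation == n.relation and p.terms == n.terms
        for p in positives
        for n in negatives
    )


def containment_mapping(
    q: Sequence[Literal], p: Sequence[Literal], free: Sequence[str]
) -> Optional[Dict[str, str]]:
    """Backtracking search for sigma: vars(q) -> vars(p).

    sigma is the identity on the free variables and sends every literal
    of q onto some literal of p (same relation, same sign: an assumption).
    Worst-case exponential, in line with CONT(CQ) being NP-complete.
    """

    def extend(i: int, sigma: Dict[str, str]) -> Optional[Dict[str, str]]:
        if i == len(q):
            return sigma
        lit = q[i]
        for target in p:
            if (target.relation, target.positive, len(target.terms)) != (
                lit.relation, lit.positive, len(lit.terms)
            ):
                continue
            candidate = dict(sigma)
            # setdefault binds an unbound variable or returns its old image.
            if all(candidate.setdefault(x, y) == y
                   for x, y in zip(lit.terms, target.terms)):
                found = extend(i + 1, candidate)
                if found is not None:
                    return found
        return None

    return extend(0, {v: v for v in free})
```

For instance, with free variable x, the query body R(x,u), R(u,v) maps into R(x,y), R(y,x) by sending u to y and v to x, while no mapping exists in the opposite direction.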
Theorem 16 If E is executable and Q ⊑ E, then ans(Q) ⊑ E. That is, ans(Q) is a minimal feasible query containing Q.

PROOF We have from Proposition. Set. We know that for all. We will show that implies from which it follows that. If is unsatisfiable, then is also unsatisfiable, so holds trivially. Therefore assume, to get a contradiction, that is satisfiable, and. Since is satisfiable and by [WL03, Theorem 4.3] we must have a tree with root for some and where each node is of the form and represents a true containment, except when is unsatisfiable, in which case also the node has no children. Otherwise, for some containment mapping witnessing the containment there is one child for every negative literal in. Each child is of the form where and appears in. Since E, if in this tree we replace every by by Lemma 15 we must have some non-terminal node where the containment doesn't hold. Accordingly, assume that and. For this to hold, there must be a containment mapping which maps into some literal which appears in but not in. That is, there must be some so that appears in and. By Propositions and 9, since is executable, is. By Proposition 11, is and so, by Lemma 14,. Therefore, is in which is a contradiction.

Corollary 17 Q is feasible iff ans(Q) ⊑ Q.

Theorem 18 Determining whether a query is feasible is polynomial-time many-one equivalent to determining whether a query is contained in another query.

PROOF One direction follows from Corollary 17 and Proposition. For the other direction, consider two queries where. The query where is a variable not appearing in P or Q and B is a relation not appearing in P or Q with access pattern. We give relations R appearing in P or Q output access patterns (i.e., every argument is an output). As a result, P and Q are both executable, but If and is not feasible We set
Clearly, then so by Corollary 17, is feasible. If then since and we have so again by Corollary 17, is not feasible.

Since CONT(UCQ¬) is Π^P_2-complete, we have:

Corollary 19 FEASIBLE(UCQ¬) is Π^P_2-complete.

UCQ¬ includes the classes CQ, UCQ, and CQ¬. We have the following strict inclusions: CQ ⊂ UCQ ⊂ UCQ¬ and CQ ⊂ CQ¬ ⊂ UCQ¬. Algorithm FEASIBLE essentially consists of two steps: (i) compute ans(Q), and (ii) test ans(Q) ⊑ Q. Below we show that FEASIBLE provides optimal processing for all the above subclasses of UCQ¬. Also, we compare FEASIBLE to the algorithms given in [LC01].

5.3 Conjunctive Queries

Li and Chang [LC01] show that FEASIBLE(CQ) is NP-complete and provide two algorithms for testing feasibility of a conjunctive query Q:

1. Find a minimal M equivalent to Q, then check that ans(M) = M (they call this algorithm CQstable).
2. Compute ans(Q), then check that ans(Q) is equivalent to Q (they call this algorithm CQstable*).

The advantage of the latter approach is that ans(Q) may be equal to Q, eliminating the need for the equivalence check. For conjunctive queries, algorithm FEASIBLE is exactly the same as CQstable*.

Example 9 (CQ Processing) Consider access patterns and and the conjunctive query which is not orderable. Algorithm CQstable first finds the minimal M, then checks M for orderability (M is in fact executable). Algorithms CQstable* and FEASIBLE first find A := ans(Q), then check that A ⊑ Q holds (which is the case).

5.4 Conjunctive Queries with Union

Li and Chang [LC01] show that FEASIBLE(UCQ) is NP-complete and provide two algorithms for testing feasibility of with:

1. Find a minimal (with respect to union) so with, then check that every is feasible using either CQstable or CQstable* (they call this algorithm UCQstable).
2. Take the union P of all the feasible, then check that (they call this algorithm UCQstable*). Clearly, holds by construction.

For UCQs, algorithm FEASIBLE is different from both of these and thus provides an alternate algorithm. The advantage of UCQstable* and FEASIBLE over UCQstable is that P or ans(Q) may be equal to Q,
eliminating the need for the equivalence check.

Example 10 (UCQ Processing) Consider access patterns and the query and. Algorithm UCQstable first finds the minimal (with respect to union), then checks that M is feasible (it is). Algorithm UCQstable* first finds P, the union of the feasible rules in Q, then checks that holds (it does). Algorithm FEASIBLE finds A := ans(Q), the union of the answerable part of each rule in Q, then checks that A ⊑ Q holds (it does).

5.5 Conjunctive Queries with Negation

Proposition 20

PROOF Assume are given by where the and are not necessarily distinct and the are also not necessarily distinct. Then define and with access patterns. Then clearly and therefore iff iff ans(L) iff L is feasible. The second iff follows from the fact that every containment mapping corresponds to a unique containment mapping ans(L) and vice versa.

Since CONT(CQ¬) is Π^P_2-complete, we have:

Corollary 21 FEASIBLE(CQ¬) is Π^P_2-complete.

6 Discussion and Conclusions

We have studied the problem of producing and processing executable query plans for sources with limited access patterns. In particular, we have extended the results by Li et al. [LC01,Li03] to conjunctive queries with negation (CQ¬) and unions of conjunctive queries with negation (UCQ¬). Our main theorem (Theorem 18) shows that checking feasibility for CQ¬ and UCQ¬ is equivalent to checking containment for CQ¬ and UCQ¬, respectively, and thus is Π^P_2-complete. Moreover, we have shown that our treatment for UCQ¬ nicely unifies previous results and techniques for CQ and UCQ respectively and also works optimally for CQ¬. In particular, we have presented a new uniform algorithm which is optimal for all four classes. We have also shown how we can often avoid the theoretical worst-case complexity, both by approximations at compile-time and by a novel runtime processing strategy. The basic idea is to avoid performing the computationally hard containment checks and instead (i) use two efficiently computable approximate plans, which
produce tight underestimates and overestimates of the actual query answer for Q (algorithm PLAN*), and defer the containment check in the algorithm FEASIBLE if possible, and (ii) use a runtime algorithm ANSWER*, which may report complete answers even in the case of infeasible plans, and which can sometimes quantify the degree of completeness. [Li03, Sec. 7] employs a similar technique for the case of CQ. However, since union and negation are not handled, our notion of bounding the result from above and below is not applicable there (essentially, the underestimate is always empty when not considering union).

Although technical in nature, our work is driven by a number of practical engineering problems. In the Bioinformatics Research Network project [BIR03], we are developing a database mediator system for federating heterogeneous brain data [GLM03,LGM03]. The current prototype takes a query against a global-as-view definition and unfolds it into a plan. We have used ANSWERABLE and a simplified version (without containment check) of PLAN* and ANSWER* in the system. Similarly, in the SEEK and SciDAC projects [SEE03,SDM03] we are building distributed scientific workflow systems which can be seen as procedural variants of the declarative query plans which a mediator is processing.

We are interested in extending our techniques to larger classes of queries and to consider the addition of integrity constraints. Even though many questions become undecidable when moving to full first-order or Datalog queries, we are interested in finding analogous compile-time and runtime approximations as presented in this paper.

Acknowledgements Work supported by NSF-ACI 9619020 (NPACI), NIH 8P41 RR08605-08S1 (BIRN-CC), NSF-ITR 0225673 (GEON), NSF-ITR 0225676 (SEEK), and DOE DE-FC02-01ER25486 (SciDAC).

References

[BIR03] Biomedical Informatics Research Network Coordinating Center (BIRN-CC), University of California, San Diego. http://nbirn.net/, 2003.
[CM77] A. K. Chandra and P. M. Merlin. Optimal Implementation of Conjunctive Queries in Relational Data Bases. In ACM Symposium on Theory of Computing (STOC), pp. 77–90, 1977.
[CR97] C. Chekuri and A. Rajaraman. Conjunctive query containment revisited. In Intl. Conf. on Database Theory (ICDT), Delphi, Greece, 1997.
[DL97] O. M. Duschka and A. Y. Levy. Recursive plans for information gathering. In Proc. IJCAI, Nagoya, Japan, 1997.
[FLMS99] D. Florescu, A. Y. Levy, I. Manolescu, and D. Suciu. Query Optimization in the Presence of Limited Access Patterns. In SIGMOD, pp. 311–322, 1999.
[GLM03] A. Gupta, B. Ludäscher, and M. Martone. BIRN-M: A Semantic Mediator for Solving Real-World Neuroscience Problems. In ACM Intl. Conference on Management of Data (SIGMOD), 2003. System demonstration.
[LC01] C. Li and E. Y. Chang. On Answering Queries in the Presence of Limited Access Patterns. In Intl. Conference on Database Theory (ICDT), 2001.
[LGM03] B. Ludäscher, A. Gupta, and M. E. Martone. A Model-Based Mediator System for Scientific Data Management. In T. Critchlow and Z. Lacroix, editors, Bioinformatics: Managing Scientific Data. Morgan Kaufmann, 2003.
[Li03] C. Li. Computing Complete Answers to Queries in the Presence of Limited Access Patterns. Journal of VLDB, 12:211–227, 2003.
[LS93] A. Y. Levy and Y. Sagiv. Queries Independent of Updates. In Proc. VLDB, pp. 171–181, 1993.
[NL04] A. Nash and B. Ludäscher. Processing First-Order Queries under Limited Access Patterns. Submitted for publication, 2004.
[PGH98] Y. Papakonstantinou, A. Gupta, and L. M. Haas. Capabilities-Based Query Rewriting in Mediator Systems. Distributed and Parallel Databases, 6(1):73–110, 1998.
[Sar91] Y. Saraiya. Subtree elimination algorithms in deductive databases. PhD thesis, Computer Science Dept., Stanford University, 1991.
[SDM03] Scientific Data Management Center (SDM). http://sdm.lbl.gov/sdmcenter/ and http://www.er.doe.gov/scidac/, 2003.
[SEE03] Science Environment for Ecological Knowledge (SEEK). http://seek.ecoinformatics.org/, 2003.
[SY80] Y. Sagiv and M. Yannakakis. Equivalences Among Relational Expressions with the Union and Difference Operators. Journal of the ACM, 27(4):633–655, 1980.
[Ull88] J. Ullman. The Complexity of Ordering Subgoals. In ACM Symposium on Principles of Database Systems (PODS), 1988.
[WL03] F. Wei and G. Lausen. Containment of Conjunctive Queries with Safe Negation. In Intl. Conference on Database Theory (ICDT), 2003.
[WSD03] Web Services Description Language (WSDL) Version 1.2. http://www.w3.org/TR/wsdl12, June 2003.

Projection Pushing Revisited*

Benjamin J. McMahan (1), Guoqiang Pan (1), Patrick Porter (2), and Moshe Y. Vardi (1)

(1) Department of Computer Science, Rice University, Houston, TX 77005-1892, U.S.A. {mcmahanb,gqpan}@rice.edu, vardi@cs.rice.edu
(2) Scalable Software. Patrick.Porter@scalablesoftware.com

Abstract. The join operation, which combines tuples from multiple relations, is the most fundamental and, typically, the most expensive operation in database queries. The standard approach to join-query optimization is cost based, which requires developing a cost model, assigning an estimated cost to each query-processing plan, and searching in the space of all plans for a plan of minimal cost. Two other approaches can be found in the database-theory literature. The first approach, initially proposed by Chandra and Merlin, focused on minimizing the number of joins rather than on selecting an optimal join order. Unfortunately, this approach requires a homomorphism test, which itself is NP-complete, and has not been pursued in practical query processing. The second, more recent, approach focuses on structural properties of the query in order to find a project-join order that will minimize the size of intermediate results during query evaluation. For example, it is known that for Boolean
project-join queries a project-join order can be found such that the arity of intermediate results is the treewidth of the join graph plus one. In this paper we pursue the structural-optimization approach, motivated by its success in the context of constraint satisfaction. We chose a setup in which the cost-based approach is rather ineffective; we generate project-join queries with a large number of relations over databases with small relations. We show that a standard SQL planner (we use PostgreSQL) spends an exponential amount of time on generating plans for such queries, with rather dismal results in terms of performance. We then show how structural techniques, including projection pushing and join reordering, can yield exponential improvements in query execution time. Finally, we combine early projection and join reordering in an implementation of the bucket-elimination method from constraint satisfaction to obtain another exponential improvement.

1 Introduction

The join operation is the most fundamental and, typically, the most expensive operation in database queries. Indeed, most database queries can be expressed as select-project-join queries, combining joins with selections and projections. Choosing an optimal plan, i.e., the particular order in which to perform the select, project, and join operations

* Work supported in part by NSF grants CCR-9988322, CCR-0124077, CCR-0311326, IIS-9908435, IIS-9978135, and EIA-0086264.

E. Bertino et al. (Eds.): EDBT 2004, LNCS 2992, pp. 441–458, 2004. © Springer-Verlag Berlin Heidelberg 2004

On Containment of Conjunctive Queries with Arithmetic Comparisons 467

some elements, the corresponding tuple is in the answer to. Due to space limitations, we give the proofs of all theorems in [11].

Theorem Consider two CQAC queries and shown in Figure that may not be normalized. Suppose contains only and and (correspondingly do not imply restrictions. Then if and only if: where
are all the containment mappings from to.

Case 2: The following theorem shows that we do not need to normalize the queries if they have the homomorphism property.

Lemma Assume the comparisons in and do not imply equalities. If there is a containment mapping from to such that then there must be a containment mapping from to such that

Using the lemma above, we can prove:

Theorem Suppose the comparisons in and do not imply equalities. The homomorphism property holds between and iff it holds between and

4 Conditions for Homomorphism Property

Now we look for constraints in the form of syntactic conditions on queries and under which the homomorphism property holds. The conditions are sufficiently tight in that, if at least one of them is violated, then there exist queries and for which the homomorphism property does not hold. The conditions are syntactic and can be checked in polynomial time. We consider the case where the containing query (denoted by all through the section) is a conjunctive query with only arithmetic comparisons between a variable and a constant; i.e., all its comparisons are semi-interval (SI), which are in the forms of X > c, X < c, X ≥ c, X ≤ c, or X ≠ c. We call X ≠ c a point inequality (PI).

This section is structured as follows. Section 4.1 discusses technicalities on the containment implication, and in particular in what cases we do not need a disjunction. In Section 4.2 we consider the case where the containing query has only left-semi-interval (LSI) subgoals. We give a main result in Theorem 4. In Section 4.3, we extend Theorem 4 by considering the general case, where the containing query may use any semi-interval subgoals and point inequality subgoals. In Section 4.4, we discuss the case for more general inequalities than SI. Section 4.5 gives an algorithm for checking whether these conditions are met. In [11], we include many examples to show that the conditions in the main theorems are tight.

468 F. Afrati, C. Li, and P. Mitra

4.1 Containment Implication

In this subsection, we will focus on the implication in Theorem. We shall give some terminology and some basic technical observations. The left-hand side (lhs) is a conjunction of arithmetic comparisons (in Example 2, the lhs is:). The right-hand side (rhs) is a disjunction and each disjunct is a conjunction of arithmetic comparisons. For instance, in Example 2, the rhs is: which has two disjuncts, and each is the conjunction of two comparisons. Given an integer we shall call containment implication any implication of this form: i) the lhs is a conjunction of arithmetic comparisons, and ii) the rhs is a disjunction and each disjunct is a conjunction of arithmetic comparisons.

Observe that the rhs can be equivalently written as a conjunction of disjunctions (using the distributive law). Hence this implication is equivalent to a conjunction of implications, each implication keeping the same lhs as the original one, and the rhs is one of the conjuncts in the implication that results after applying the distributive law. We call each of these implications a partial containment implication (see footnote 1). In Example 2, we write equivalently the rhs as: Thus, the containment implication in Example can be equivalently written as Here we get four partial containment implications.

A partial containment implication is called a direct implication if there exists an such that if this implication is true, then is also true. Otherwise, it is called a coupling implication. For instance, is a direct implication, since it is logically equivalent to. On the contrary, is a coupling implication.

The following lemma is used as a basis for many of our results.

Lemma Consider a containment implication that is true, where each of the and is a conjunction of arithmetic comparisons. If all its partial containment implications are direct implications, then there exists a single disjunct in the rhs of the containment implication such that

Footnote 1: Notice that containment implications and their partial containment
implications are not necessarily related to mappings and query containment; only the names are borrowed.

We give conditions to guarantee direct implications in containment test.

Corollary Consider the normalized queries and in Theorem. Suppose all partial containment implications are direct. Then there is a mapping from to such that

4.2 Left Semi-interval Comparisons (LSI) for

We first consider the case where is a conjunctive query with left semi-interval arithmetic comparison subgoals only (i.e., one of the form or or both may appear in the same query). The following theorem is a main result describing the conditions for the homomorphism property to hold in this case.

Theorem 4 Let be a conjunctive query with left semi-interval arithmetic comparisons and a conjunctive query with any arithmetic comparisons. If they satisfy all the following conditions, then the homomorphism property holds:

- Condition (i)-lsi: There do not exist subgoals as follows which all share the same constant: an open-LSI subgoal in, a closed-LSI subgoal in closure of, and a subgoal in.
- Condition (ii)-lsi: Either has no shared variables or there do not exist subgoals as follows which all share the same constant: an open-LSI subgoal in, a closed-LSI subgoal in the closure of, and a subgoal in.
- Condition (iii)-lsi: Either has no shared variables or there do not exist subgoals as follows which all share the same constant: an open-LSI subgoal in and two closed-LSI subgoals in the closure of.

It is straightforward to construct corollaries of Theorem 4 with simpler conditions. The following is an example.

Corollary Let be a conjunctive query with left semi-interval arithmetic comparisons and a conjunctive query with any arithmetic comparisons. If the arithmetic comparisons in do not share a constant with the closure of the arithmetic comparisons in then the homomorphism property holds.

The
results in Theorem 4 can be symmetrically stated for RSI queries as containing queries. The symmetrical conditions of Theorem 4 for the RSI case will be referred to as conditions (i)-rsi, (ii)-rsi, and (iii)-rsi, respectively.

4.3 Semi-interval (SI) and Point-Inequalities (PI) Queries for

Now we extend the result of Theorem 4 to treat both LSI and RSI subgoals occurring in the same containing query. We further extend it to include point inequalities (of the form X ≠ c). The result is the following.

SI Queries for

We consider the case where has both LSI and RSI inequalities called "SI inequalities," i.e., any of the and. In this case we need one more condition, namely Condition (iv), in order to avoid coupling implications. Thus Theorem 4 is extended to the following theorem, which is the second main result of this section.

Theorem 5 Let be a conjunctive query with left semi-interval and right semi-interval arithmetic comparisons and a conjunctive query with SI arithmetic comparisons. If they satisfy all the following conditions, then the homomorphism property holds:

- Conditions (i)-lsi, (ii)-lsi, (iii)-lsi, (i)-rsi, (ii)-rsi, and (iii)-rsi.
- Condition (iv)-si: Any constant in an RSI subgoal of is strictly greater than any constant in an LSI subgoal of.

We refer to the last condition as (iv)-si.

PI Queries for

If the containing query has point inequalities, three more forms of coupling implications can occur. Thus Theorem 5 is further extended to Theorem 6, which is the third main result of this section.

Theorem 6 Let be a conjunctive query with left semi-interval and right semi-interval and point inequality arithmetic comparisons and a conjunctive query with SI arithmetic comparisons. If and satisfy all the following conditions, then the homomorphism property holds:

- Conditions (i)-lsi, (ii)-lsi, (iii)-lsi, (i)-rsi, (ii)-rsi, (iii)-rsi and (iv)-si.
- Condition (v)-pi: Either has no repeated variables,
or it does not have point inequalities.
- Condition (vi)-pi: does not have a constant that occurs in or or

4.4 Beyond Semi-interval Queries for

Our results have already captured subtle cases where the homomorphism property holds. There is not much hope beyond those cases, unless we restrict the number of subgoals of the contained query, which is known in the literature (e.g., [14]). Couplings due to the implication: indicate that if the containing query has closed comparisons, then the homomorphism property does not hold. The following is such an example: Clearly is contained in but the homomorphism property does not hold.

4.5 A Testing Algorithm

We summarize the results in this section in an algorithm shown in Figure. Given two CQAC queries and, the algorithm tests if the homomorphism property holds in checking. Queries may not satisfy these conditions but still the homomorphism property may hold. For instance, it could happen if they do not have self-joins, or if domain information yields that certain mappings are not possible (see Section 5). Hence, in the diagram, we can also put this additional check: whenever one of the conditions is not met, we also check whether there are mappings that would enable a coupling implication. We did not include the formal results for this last test for brevity, as they are a direct consequence of the discussion in the present section.

Fig. An algorithm for checking homomorphism property in testing

5 Improvements Using Domain Information

So far we have discussed in what cases we do not need to normalize queries in the containment test, and in what cases we can reduce the containment test to checking the existence of a single homomorphism. If a query does not satisfy these conditions, the above results become
inapplicable. For instance, often a query may have both < and comparisons, not satisfying the conditions in Theorem. In this section, we study how to relax these conditions by using domain knowledge of the relations and queries.

The intuition of our approach is the following. We partition relation attributes into different domains, such as "car models," "years," and "prices." We can safely assume that for realistic queries, their conditions respect these domains. In particular, for a comparison X θ A, where X is a variable and A is a variable or a constant, the domain of A should be the same as that of X. For example, it may be meaningless to have conditions such as "carYear = $6,000." Therefore, in the implication of testing query containment, it is possible to partition the implication into different domains. The domain information about the attributes is collected only once before queries are posed. For instance, given the following implication we do not need to consider implication between constants or variables in different domains, such as between "1998" and "$6,000," and between "year" and "price." As a consequence, this implication can be projected to the following implications in two domains: We can show that is true iff both and are true. In this section, we first formalize this domain idea, and then show how to partition an implication into implications of different domains.

5.1 Domains of Relation Attributes and Query Arguments

Assume each attribute in a relation has a domain. Consider two tables: house(seller, street, city, price) and crimerate(city, rate). Relation house has housing information, and relation crimerate has information about crime rates of cities. The following table shows the domains of different attributes in these relations. Notice that attributes house.city and crimerate.city share the same domain:

We equate
domains of variables and constants using the following rules:

(a) For each argument Xi (either a variable or a constant) in a subgoal R(X1, ..., Xk) in query Q, the domain of Xi, Dom(Xi), is the domain of the corresponding attribute in relation R.
(b) For each comparison X θ c between variable X and constant c, we set Dom(c) = Dom(X). Constants from different domains are always treated as different constants. For instance, in the two conditions carYear = 2000 and carPrice = $2000, the constants “2000” and “$2000” are different constants.

We perform this process on all subgoals and comparisons in the query. In this calculation we make the following realistic assumptions: (1) if X is a shared variable in two subgoals, then the corresponding attributes of the two arguments of X have the same domain; (2) if we have a comparison X θ Y, where X and Y are variables, then Dom(X) and Dom(Y) are always the same. Consider the following queries on the relations above:

[...]

The computed domains of the variables and constants are shown in the table below:

[...]

It is easy to see that the domain information as defined in this section can be obtained in polynomial time.

5.2 Partitioning Implication into Domains

According to Theorem 1, to test the containment Q2 ⊑ Q1 for two given queries Q1 and Q2, we need to test the containment implication in the theorem. We want to partition this implication into implications in different domains, since testing the implication in each domain is easier. Now we show that this partitioning idea is feasible. We say a comparison X θ A is in domain D if X and A are in domain D. The following are two important observations:

(a) If a mapping maps an argument X in query Q1 to an argument Y in query Q2, then, based on the calculation of argument domains, X and Y are clearly from the same domain.
(b) In query normalization, each newly introduced variable has the same domain as the replaced argument (variable or constant).

Definition. Consider the following implication: [...] For a domain D of
the arguments in the implication, its projection in D, denoted [...], is the following implication:

[...]

Here [...] includes all comparisons of [...] in domain D. Similarly, [...] includes all comparisons of [...] in domain D.

Suppose we want to test Q2 ⊑ Q1 for the two queries above. There is only one containment mapping from Q1 to Q2, and we need to test the implication in Theorem 1:

[...]

The projection of this implication on domain [...] (float numbers in dollars) is [...]. Similarly, [...] is [...].

Theorem 7. Let D1, ..., Dk be the domains of the arguments in the implication. Then the implication is true iff all of its projected implications are true.

In the example above, by Theorem 7, the implication is true iff its two projections are true. Since the latter two are true, the implication is true. Thus Q2 ⊑ Q1. In general, we can test the implication in Theorem 1 by testing the implications in the different domains, which is much cheaper than testing the whole implication. [11] gives formal results that relax the conditions in the theorems of the previous section so that they apply only to elements of the same domain.

6 Experiments

In this section we report on experiments to determine whether the homomorphism property holds for real queries. We checked queries from many introductory database courses available on the Web, some data-mining queries provided by Microsoft Research, and the TPC-H benchmark queries [20]. We have observed that, for certain applications (e.g., the data-mining queries), queries usually do not have self-joins; thus the homomorphism property holds. In addition, among the queries that use only semi-interval (SI) and point inequality (PI) comparisons, the majority have the homomorphism property. For a more detailed discussion, we focus on our evaluation results on the TPC-H benchmark queries [20], which represent typical queries in a wide range of decision-support applications. To the best of our knowledge, our results are the first evidence that containment is easy for those queries. The following is a summary of our experiments on the TPC-H benchmark queries:
(a) All of the 22 queries except two ([...] and [...]) use only semi-interval comparisons (SIs) and point inequalities (PIs).
(b) When the homomorphism property may not hold, it is always because of the following situation: a variable X (usually of “date” type) is bounded in an interval between two constants. In such a case, the property is guaranteed to hold if the contained query does not contain self-joins of the subgoal that uses a variable that X can map to.
(c) As a consequence, if the contained query is also one of the 22 queries, then, since these queries do not have self-joins of relations that share a variable with SI predicates, the homomorphism property holds.

The detailed experimental results are in [11]. Here we use the following query, adapted from TPC-H query [...], as an example (for simplicity we call this query [...]):

[...]

We show how to apply the results of the earlier sections to answer the following question: in testing whether [...] contains another CQAC query, does the homomorphism property hold?

Consider the case where we check the containment of any conjunctive query with semi-interval arithmetic comparisons in the above query. We shall apply Theorem [...]. Notice that the above query has shared variables (expressed by the equality [...] in the WHERE clause) and contains both LSI and RSI arithmetic comparisons. However, the variables o_orderdate (used in a comparison) and [...] (a shared variable) are clearly of different domains. Hence conditions (ii)-lsi, (ii)-rsi, (iii)-lsi, and (iii)-rsi are satisfied. Using domain information, we also see that (i)-lsi and (i)-rsi are satisfied. In general, condition (iv) in Theorem [...] may not be satisfied, but the scenario in which it is not satisfied uses a query with either a self-join on relation lineitem or a self-join on relation orders. Such a query (a) is not included in the benchmark, and (b) would ask for information that is not natural or is of very specific and narrow interest (e.g., it would ask for pairs of orders sharing a property). Consequently, to
test containment of any natural SI query in [...], we need only one containment mapping. Notice that without using the domain information, we could not derive this conclusion.

7 Conclusion

In this paper we considered the problem of testing containment between two conjunctive queries with arithmetic comparisons. We showed in what cases the normalization step in the algorithm [5,6] is not needed. We found various syntactic conditions on queries under which we can considerably reduce the number of mappings needed to test containment, down to a single mapping (the homomorphism property). These syntactic conditions can be easily checked in polynomial time. Our experiments using real queries showed that many of these queries pass this test, so they have the homomorphism property, making it possible to use more efficient algorithms for the test.

Acknowledgments. We thank Microsoft Research for providing their data-mining queries for our experiments.

References

1. Chaudhuri, S., Krishnamurthy, R., Potamianos, S., Shim, K.: Optimizing queries with materialized views. In: ICDE (1995) 190–200
2. Theodoratos, D., Sellis, T.: Data warehouse configuration. In: Proc. of VLDB (1997)
3. Ullman, J.D.: Information integration using logical views. In: ICDT (1997) 19–40
4. Halevy, A.: Answering queries using views: A survey. In: Very Large Database Journal (2001)
5. Gupta, A., Sagiv, Y., Ullman, J.D., Widom, J.: Constraint checking with partial information. In: PODS (1994) 45–55
6. Klug, A.: On conjunctive queries containing inequalities. Journal of the ACM 35 (1988) 146–160
7. Chandra, A.K., Merlin, P.M.: Optimal implementation of conjunctive queries in relational data bases. STOC (1977) 77–90
8. van der Meyden, R.: The complexity of querying indefinite data about linearly ordered domains. In: PODS (1992)
9. Levy, A., Mendelzon, A.O., Sagiv, Y., Srivastava, D.: Answering queries using views. In: PODS (1995) 95–104
10. Afrati, F., Li, C.,
Mitra, P.: Answering queries using views with arithmetic comparisons. In: PODS (2002)
11. Afrati, F., Li, C., Mitra, P.: On containment of conjunctive queries with arithmetic comparisons (extended version). Technical report, UC Irvine (2003)
12. Saraiya, Y.: Subtree elimination algorithms in deductive databases. Ph.D. Thesis, Computer Science Dept., Stanford Univ. (1991)
13. Qian, X.: Query folding. In: ICDE (1996) 48–55
14. Kolaitis, P.G., Martin, D.L., Thakur, M.N.: On the complexity of the containment problem for conjunctive queries with built-in predicates. In: PODS (1998) 197–204
15. Chandra, A., Lewis, H., Makowsky, J.: Embedded implication dependencies and their inference problem. In: STOC (1981) 342–354
16. Cosmadakis, S.S., Kanellakis, P.: Parallel evaluation of recursive queries. In: PODS (1986) 280–293
17. Chaudhuri, S., Vardi, M.Y.: On the equivalence of recursive and nonrecursive datalog programs. In: PODS (1992) 55–66
18. Shmueli, O.: Equivalence of datalog queries is undecidable. Journal of Logic Programming 15 (1993) 231–241
19. Wang, J., Maher, M., Topor, R.: Rewriting general conjunctive queries using views. In: 13th Australasian Database Conf. (ADC), Melbourne, Australia, ACS (2002)
20. TPC-H: http://www.tpc.org/tpch/ (2003)

XPath with Conditional Axis Relations

Maarten Marx*
Language and Inference Technology, ILLC, Universiteit van Amsterdam, The Netherlands
marx@science.uva.nl

Abstract. This paper is about the W3C standard node-addressing language for XML documents, called XPath. XPath is still under development. Version 2.0 appeared in 2001, while the theoretical foundations of Version 1.0 (dating from 1998) are still being widely studied. The paper aims at bringing XPath to a “stable fixed point” in its development: a version which is expressively complete, still manageable computationally, with a user-friendly syntax and a natural semantics. We focus on an important axis relation which is not
expressible in XPath 1.0 and is very useful in practice: the conditional axis. With it we can express paths specified by, for instance, “do a child step, while test is true at the resulting node”. We study the effect of adding conditional axis relations to XPath on its expressive power and on the complexity of the query evaluation and query equivalence problems. We define an XPath dialect which is expressively complete, has a linear time query evaluation algorithm, and for which query equivalence given a DTD can be decided in exponential time.

* Research supported by NWO grant 612.000.106. Thanks to Patrick Blackburn, Linh Nguyen and Petrucio Viana.
E. Bertino et al. (Eds.): EDBT 2004, LNCS 2992, pp. 477–494, 2004. © Springer-Verlag Berlin Heidelberg 2004

1 Introduction

XPath 1.0 [38] is a variable free language used for selecting nodes from XML documents. XPath plays a crucial role in other XML technologies such as XSLT [42], XQuery [41] and XML schema constraints, e.g., [40]. The latest version of XPath (version 2.0) [39] is much more expressive and is close to being a full fledged tree query language. Version 2.0 contains variables, which are used in if-then-else, for, and quantified expressions. The available axes are the same in both versions. The purpose of this paper is to show that useful and more expressive variants of XPath 1.0 can be created without introducing variables. We think that the lack of variables in XPath 1.0 is one of the key features behind its success, so we should not lightly give it up.

This paper uses the abstraction to the logical core of XPath 1.0 developed in [17, 16]. This means that we are discussing the expressive power of the language on XML document tree models. The central expression in XPath is axis::node_label[filter] (called a location path), which when evaluated at node n yields an answer set consisting of nodes n' such that the axis relation goes from n to n', the node tag of n' is node_label, and the expression filter evaluates to true at n'.
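The reading of a location path as an axis relation, a node label test and a filter can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the Node class, the helper functions child, descendant and step, and the library document are all hypothetical names invented here, not part of XPath or of the paper.

```python
class Node:
    """A node of a node-labeled tree (illustrative helper, not from the paper)."""

    def __init__(self, label, attrs=None, children=()):
        self.label = label
        self.attrs = dict(attrs or {})
        self.children = list(children)


def child(n):
    """Axis relation: all children of n, in document order."""
    return list(n.children)


def descendant(n):
    """Axis relation: all proper descendants of n, in document order."""
    found = []
    for c in n.children:
        found.append(c)
        found.extend(descendant(c))
    return found


def step(n, axis, node_label, filt=lambda m: True):
    """Answer set of a location path (axis, node_label, filter) evaluated at n.

    A node m is returned when the axis relation goes from n to m, the tag of m
    is node_label (or node_label is the wildcard "*"), and the filter holds at m.
    """
    return [m for m in axis(n)
            if node_label in ("*", m.label) and filt(m)]


# Hypothetical document: a library with two books and one journal.
old_book = Node("book", {"year": 1998})
new_book = Node("book", {"year": 2003}, [Node("chapter"), Node("chapter")])
journal = Node("journal", {"year": 2003})
library = Node("library", children=[old_book, new_book, journal])

recent_books = step(library, child, "book", lambda m: m.attrs["year"] > 2000)
all_chapters = step(library, descendant, "chapter")
everything_below = step(library, descendant, "*")
```

Passing the axis as a function keeps the three ingredients of a location path separate, so a different axis relation can be plugged in without touching the label test or the filter.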
We study the consequences of varying the set of available axis relations on the expressive power and the computational complexity of the resulting XPath dialect. All axis relations of XPath 1.0, like child, descendant, etc., are assumed as given. We single out an important axis relation which is not expressible in XPath 1.0 and is very useful in practice: the conditional axis. With it we can express paths specified by, for instance, “do a child step, while test is true at the resulting node”. Such conditional axes are widely used in programming languages and in temporal logic specification and verification languages. In temporal logic they are expressed by the Since and Until operators. The acceptance of temporal logic as a key verification language is based on two interrelated results. First, as shown by Kamp [21] and later generalized and given an accessible proof by Gabbay, Pnueli, Shelah and Stavi [14], temporal logic over linear structures can express every first order definable property. This means that the language is in a sense complete and finished. This is an important point in the development of a language: it means that no further expressivity (up to first order, of course) needs to be added. The second important result concerns the complexity of the model checking and validity problems. Both are complete for polynomial space, while model checking can be done in time [...] with [...] and [...] the size of the query and data, respectively [8].

The paper aims at bringing navigational XPath to a similar “fixed point” in its development: a version which is expressively complete, still manageable computationally, with a user-friendly syntax and a natural semantics. The key message of the paper is that all of this is obtainable by adding the conditional axes to the XPath syntax. Our main contributions are the following:
A motivation of conditional paths with natural examples (Section 2).
The definition of several very expressive
variable free XPath dialects, in particular the language [...] (Section 3).
A comparison of the expressive power of the different dialects (Section 4).
An investigation of the computational complexity of the discussed dialects. The main results are that the extra expressivity comes at no (theoretical) extra cost: query evaluation takes linear time in the size of the query and the data (Section 5), and query equivalence given a DTD is decidable in exponential time (Section 6).

For succinctness and readability all proofs are put in the Appendix.

We end this introduction with a few words on related literature. The observation that languages for XML like XPath, DTDs and XSchema can be viewed as propositional temporal, modal or description logics has been made by several authors. For instance, Miklau and Suciu [25] and Gottlob et al. [15] embed XPath into CTL. The group around Calvanese, de Giacomo and Lenzerini published a number of papers relating DTDs and XPath to description logic, thereby obtaining powerful complexity results; cf., e.g., [7]. Demri, de Rijke and Alechina reduce certain XML constraint inference problems to propositional dynamic logic [2]. The containment problem for XPath given a set of constraints (a DTD, for instance) has been studied in [36,9,25,26]. The complexity of XPath query evaluation was determined in [15,16]. We are convinced that this list is incomplete, if only for the sole reason that the connection between the formalisms and the structures in which they are interpreted is so obvious. Our work differs from most in that we consider node labeled trees, whereas (following [1]) most authors model semistructured data as edge labeled trees. This is more than just a different semantic viewpoint. It allows us to use the same syntax as standard XPath and to give simpler and more perspicuous embeddings into known formalisms. Taking this viewpoint it
turns out that many results on XPath follow easily from known results: linear time query evaluation follows directly from linear time model checking for propositional dynamic logic (PDL); the exponential time query equivalence algorithm can also be deduced (with quite a bit more work) from results on PDL; and finally Gabbay’s separation technique for temporal logic is directly applicable to node labeled sibling ordered trees.

2 Conditional Paths

This section motivates the addition of conditional paths to the XPath language.

Example 1. Consider an XML document containing medical data as in Figure 1. Each node describes a person, with an attribute assigning a name and an attribute stating whether or not the person has or had leukemia. The child-of relation in the tree models the real child-of relation between persons. Thus the root of the tree is person a, who has leukemia. Person a has a child a1 without the disease and a grandchild a12 who has it.

Fig. 1. XML document containing medical data.

Consider the following information need: given a person x, find descendants y of x without leukemia such that all descendants of x between x and y had/have leukemia. The answer set of this query in the example document when evaluated at the root is the set of nodes with name attribute a1 and a22. When evaluated at node a1, the answer set is {a11, a13}; when evaluated at a2, the answer set is {a22}; and at all other nodes, the answer set is empty.

The information need can be expressed in first order logic using a suitable signature. Let child and descendant be binary predicates, and let P and has_leukemia be unary predicates. The information need is expressed by a first order formula in two free variables:

[...]

This information need can be expressed in XPath 1.0 for arbitrarily deep documents by the infinite disjunction (for readability, we abbreviate leukemia to l)

[...]

Theorem [...] below states that it is not possible to express this information need in
Core XPath. What seems to be needed is the notion of a conditional path. Let [...] denote all pairs (x, y) such that y is a child of x and the test succeeds at y. Then the information need can be expressed by

[...]

The axis [...] describes the reflexive transitive closure of [...], that is, either the path self or a child-path [...] for [...] such that at all nodes [...] the leukemia attribute equals yes.

The next example discusses an information need which is not expressible using conditional paths. Admittedly the example is rather contrived. In fact it is very difficult to find natural information needs on the data in Figure 1 which are not expressible using transitive closure over conditional paths. Theorem [...] below states that every first order expressible set of nodes can be described as an XPath expression with transitive closure over conditional paths. Thus the difficulty in finding natural queries that are not expressible arises because such queries are genuinely second order. We come back to this point in Section 4 when we discuss the expressive power of XPath dialects.

Example 2. Consider the query “find all descendants which are an even number of steps away from the node of evaluation”, expressible as [...]. Note that the axis relation contains a transitive closure over a sequence of atomic paths. The proof of Theorem [...] shows that this information need is not expressible using just transitive closure over conditional paths. The information need really expresses a second order property.

3 A Brief Introduction to XPath

[16] proposes a fragment of XPath 1.0 which can be seen as its logical core, but which lacks much of the functionality that accounts for little expressive power. In effect it supports all of XPath’s axis relations except the attribute relation¹, and it allows sequencing and taking unions of path expressions and full booleans in the filter expressions. It is called Core XPath. A similar logical abstraction is made in [4]. As the focus of this paper is expressive power, we also restrict XPath to its logical core. We will define four XPath languages which only
differ in the axis relations allowed in their expressions. As in XPath 1.0, we distinguish a number of axis relations. Instead of the rather verbose notation of XPath 1.0, we use a self-explanatory graphical notation, together with the regular expression operators [...] and *. For the definition of the XPath languages, we follow the presentation of XPath in [16]. The expressions obey the standard W3C unabbreviated XPath 1.0 syntax, except for the different notation of the axis relations. The semantics is as in [4] and [15], which is in line with the standard XPath semantics from [34].

¹ This is without loss of generality: instead of modeling attributes as distinct axes, as in the standard XML model, we may assign multiple labels to each node, representing whether a certain attribute-value pair is true at that node.

Our simplest language [...] is slightly more expressive than Core XPath (cf. Remark 1). We view [...] as the baseline in expressive power for XPath languages. [...], for conditional XPath, simply extends [...] with conditional paths. The key difference between [...] and the three other languages is the use of filter expressions inside the axis relations, in particular as conditional paths. Regular expressions² with tests are well studied and often applied in languages for specifying paths (see the literature on Propositional Dynamic Logic (PDL) and Kleene algebras with tests [18,23]).

Definition. The syntax of the XPath languages [...] is defined by the grammar

[...]

where “locpath” (pronounced as location path) is the start production, “axis” denotes axis relations, and “ntst” denotes tags labeling document nodes or the star ‘*’ that matches all tags (these are called node tests). The “fexpr” will be called filter expressions after their use as filters in location paths. With an XPath expression we always mean a “locpath”.

The semantics of XPath expressions is given with respect to an XML document
modeled as a finite node labeled sibling ordered tree³ (tree for short). Each node in the tree is labeled with a name tag from some alphabet. Sibling ordered trees come with two binary relations, the child relation, denoted by R⇓, and the immediate_right_sibling relation, denoted by R⇒. Together with their inverses R⇑ and R⇐ they are used to interpret the axis relations. Each location path denotes a binary relation (a set of paths). The meaning of the filter expressions is given by the predicate [...], which assigns a boolean value. Thus a filter expression fexpr is most naturally viewed as denoting a set of nodes: all n such that [...] is true. For examples, we refer to Section 2 and to [16]. Given a tree [...] and an expression [...], the denotation or meaning of [...] in [...] is written as [...]. Table [...] contains the definition of [...].

² Regular expressions are strings generated by the grammar [...] with a a primitive symbol and [...] the empty string. The operations have their usual meaning.

³ A sibling ordered tree is a structure isomorphic to (N, R⇓, R⇒) where N is a set of finite sequences of natural numbers closed under taking initial segments, and for any sequence s, if s·k ∈ N, then either k = 0 or s·(k−1) ∈ N. For n, n' ∈ N, n R⇓ n' holds iff n' = n·k for k a natural number; n R⇒ n' holds iff n = s·k and n' = s·(k+1).
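To make the tree model concrete, here is a minimal sketch that represents a sibling ordered tree as a set of integer tuples (finite sequences of natural numbers), checks the two closure conditions, and then evaluates the leukemia information need from the medical-data example with a conditional child step. Everything named in the code is an assumption made for illustration: the function names are invented, and the tree (including node a21 and the attribute values of a2, a11, a13, a21 and a22) is a reconstruction chosen to agree with the answer sets stated in the text, since the original figure is not available here.

```python
def is_tree_domain(nodes):
    """Check that a set of integer tuples is a sibling ordered tree domain."""
    nodes = set(nodes)
    for s in nodes:
        if not s:
            continue  # the empty sequence is the root
        parent, k = s[:-1], s[-1]
        if parent not in nodes:  # closed under initial segments
            return False
        if k > 0 and parent + (k - 1,) not in nodes:  # left sibling exists
            return False
    return bool(nodes)


def is_child(n, m):
    """Child relation: m = n.k for some natural number k."""
    return len(m) == len(n) + 1 and m[:-1] == n


def is_right_sibling(n, m):
    """Immediate right sibling relation: n = s.k and m = s.(k+1)."""
    return bool(n) and bool(m) and n[:-1] == m[:-1] and m[-1] == n[-1] + 1


def children(nodes, n):
    """Children of n in sibling order."""
    return sorted(m for m in nodes if is_child(n, m))


def conditional_child_closure(nodes, start, test):
    """Reflexive transitive closure of 'child step to a node where test holds'."""
    reached, frontier = {start}, [start]
    while frontier:
        n = frontier.pop()
        for m in children(nodes, n):
            if test(m) and m not in reached:
                reached.add(m)
                frontier.append(m)
    return reached


def healthy_below_sick(nodes, leukemia, start):
    """Descendants without leukemia whose in-between ancestors all have it."""
    answer = set()
    for n in conditional_child_closure(nodes, start, lambda m: leukemia[m]):
        answer.update(m for m in children(nodes, n) if not leukemia[m])
    return answer


# Assumed reconstruction of the example document (not the original figure).
NAMES = {(): "a", (0,): "a1", (1,): "a2", (0, 0): "a11", (0, 1): "a12",
         (0, 2): "a13", (1, 0): "a21", (1, 1): "a22"}
LEUKEMIA = {(): True, (0,): False, (1,): True, (0, 0): False, (0, 1): True,
            (0, 2): False, (1, 0): True, (1, 1): False}
TREE = set(NAMES)


def answer_names(start):
    """Names of the nodes in the answer set when evaluating at start."""
    return sorted(NAMES[m] for m in healthy_below_sick(TREE, LEUKEMIA, start))
```

With this assumed tree, evaluating at the root, at a1 and at a2 gives the answer sets reported in the example, and every other node gives the empty set. The start node itself is never tested, which reflects the reflexive part of the closure.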

Date posted: 24/12/2013, 02:18
