The VLDB Journal, Volume 14, Issue 1, March 2005

VLDB Journal (2005) 14: 2–29 / Digital Object Identifier (DOI) 10.1007/s00778-003-0111-3 Join operations in temporal databases Dengfeng Gao1 , Christian S Jensen2 , Richard T Snodgrass1 , Michael D Soo3 Computer Science Department, P.O Box 210077, University of Arizona, Tucson, AZ 85721-0077, USA e-mail: {dgao,rts}@cs.arizona.edu Department of Computer Science, Aalborg University, Fredrik Bajers Vej 7E, 9220 Aalborg Ø, Denmark e-mail: csj@cs.auc.dk Amazon.com, Seattle; e-mail: soo@amazon.com Edited by T Sellis Received: October 17, 2002 / Accepted: July 26, 2003 Published online: October 28, 2003 – c Springer-Verlag 2003 Abstract Joins are arguably the most important relational operators Poor implementations are tantamount to computing the Cartesian product of the input relations In a temporal database, the problem is more acute for two reasons First, conventional techniques are designed for the evaluation of joins with equality predicates rather than the inequality predicates prevalent in valid-time queries Second, the presence of temporally varying data dramatically increases the size of a database These factors indicate that specialized techniques are needed to efficiently evaluate temporal joins We address this need for efficient join evaluation in temporal databases Our purpose is twofold We first survey all previously proposed temporal join operators While many temporal join operators have been defined in previous work, this work has been done largely in isolation from competing proposals, with little, if any, comparison of the various operators We then address evaluation algorithms, comparing the applicability of various algorithms to the temporal join operators and describing a performance study involving algorithms for one important operator, the temporal equijoin Our focus, with respect to implementation, is on non-index-based join algorithms Such algorithms not rely on auxiliary access paths but may exploit sort orderings to achieve efficiency Keywords: Attribute skew – Interval join – Partition join – Sort-merge join – Temporal Cartesian product – Temporal join – Timestamp skew Introduction Time is an attribute of all real-world phenomena Consequently, efforts to incorporate the temporal domain into database management systems (DBMSs) have been ongoing for more than a decade [39,55] The potential benefits of this research include enhanced data modeling capabilities and more conveniently expressed and efficiently processed queries over time Whereas most work in temporal databases has concentrated on conceptual issues such as data modeling and query languages, recent attention has been on implementationrelated issues, most notably indexing and query processing strategies In this paper, we consider an important subproblem of temporal query processing, the evaluation ad hoc temporal join operations, i.e., join operations for which indexing or secondary access paths are not available or appropriate Temporal indexing, which has been a prolific research area in its own right [44], and query evaluation algorithms that exploit such temporal indexes are beyond the scope of this paper Joins are arguably the most important relational operators This is so because efficient join processing is essential for the overall efficiency of a query processor Joins occur frequently due to database normalization and are potentially expensive to compute [35] Poor implementations are tantamount to computing the Cartesian product of the input relations In a temporal database, the problem is more acute 
Conventional techniques are aimed at the optimization of joins with equality predicates, rather than the inequality predicates prevalent in temporal queries [27] Moreover, the introduction of a time dimension may significantly increase the size of the database These factors indicate that new techniques are required to efficiently evaluate joins over temporal relations This paper aims to present a comprehensive and systematic study of join operations in temporal databases, including both semantics and implementation Many temporal join operators have been proposed in previous research, but little comparison has been performed with respect to the semantics of these operators Similarly, many evaluation algorithms supporting these operators have been proposed, but little analysis has appeared with respect to their relative performance, especially in terms of empirical study The main contributions of this paper are the following: • To provide a systematic classification of temporal join operators as natural extensions of conventional join operators • To provide a systematic classification of temporal join evaluation algorithms as extensions of common relational query evaluation paradigms • To empirically quantify the performance of the temporal join algorithms for one important, frequently occurring, and potentially expensive temporal operator D Gao et al.: Join operations in temporal databases Our intention is for DBMS vendors to use the contributions of this paper as part of a migration path toward incorporating temporal support into their products Specifically, we show that nearly all temporal query evaluation work to date has extended well-accepted conventional operators and evaluation algorithms In many cases, these operators and techniques can be implemented with small changes to an existing code base and with acceptable, though perhaps not optimal, performance Research has identified two orthogonal dimensions of time in databases – valid time, modeling changes in the real world, and transaction time, modeling the update activity of the database [23,51] A database may support none, one, or both of the given time dimensions In this paper, we consider only single-dimension temporal databases, so-called valid-time and transaction-time databases Databases supporting both time dimensions, so-called bitemporal databases, are beyond the scope of this paper, though many of the described techniques extend readily to bitemporal databases We will use the terms snapshot, relational, or conventional database to refer to databases that provide no integrated support for time The remainder of the paper is organized as follows We propose a taxonomy of temporal join operators in Sect This taxonomy extends well-established relational operators to the temporal context and classifies all previously defined temporal operators In Sect 3, we develop a corresponding taxonomy of temporal join evaluation algorithms, all of which are non-index-based algorithms The next section focuses on engineering the algorithms It turns out that getting the details right is essential for good performance In Sect 5, we empirically investigate the performance of the evaluation algorithms with respect to one particular, and important, valid-time join operator The algorithms are tested under a variety of resource constraints and database parameters Finally, conclusions and directions for future work are offered in Sect Temporal join operators In the past, temporal join operators were defined in different temporal data models; at times 
essentially the same operators were even given different names when defined in different models. Further, the existing join algorithms have also been constructed within the contexts of different data models. This section enables the comparison of join definitions and implementations across data models. We thus proceed to propose a taxonomy of temporal joins and then use this taxonomy to classify all previously defined temporal joins.

We take as our point of departure the core set of conventional relational joins that have long been accepted as "standard" [35]: Cartesian product (whose "join predicate" is the constant expression TRUE), theta join, equijoin, natural join, left and right outerjoin, and full outerjoin. For each of these, we define a temporal counterpart that is a natural, temporal generalization of it. This generalization hinges on the notion of snapshot equivalence [26], which states that two temporal relations are equivalent if they consist of the same sequence of time-indexed snapshots. We note that some other join operators exist, including semijoin, antisemijoin, and difference. Their temporal counterparts have been explored elsewhere [11] and are not considered here.

Having defined this set of temporal joins, we show how all previously defined operators are related to this taxonomy of temporal joins. The previous operators considered include Cartesian product, Θ-JOIN, EQUIJOIN, NATURAL JOIN, TIME JOIN [6,7], TE JOIN, TE OUTERJOIN, and EVENT JOIN [20,46,47,52] and those based on Allen's [1] interval relations ([27,28,36]). We show that many of these operators incorporate less restrictive predicates or use specialized attribute semantics and thus are variants of one of the taxonomic joins.

2.1 Temporal join definitions

To be specific, we base the definitions on a single data model. We choose the model that is used most widely in temporal data management implementations, namely, the one that timestamps each tuple with an interval. We assume that the timeline is partitioned into minimal-duration intervals, termed chronons [12], and we denote intervals by inclusive starting and ending chronons. We define two temporal relational schemas, R and S, as follows.

R = (A1, ..., An, Ts, Te)
S = (B1, ..., Bm, Ts, Te)

The Ai, 1 ≤ i ≤ n, and Bi, 1 ≤ i ≤ m, are the explicit attributes found in corresponding snapshot schemas, and Ts and Te are the timestamp start and end attributes, recording when the information recorded by the explicit attributes holds (or held or will hold) true. We will use T as shorthand for the interval [Ts, Te] and A and B as shorthand for {A1, ..., An} and {B1, ..., Bm}, respectively. Also, we define r and s to be instances of R and S, respectively.

Example. Consider the following two temporal relations. The relations show the canonical example of employees, the departments they work for, and the managers who supervise those departments.

Employee
EmpName  Dept  T
Ron      Ship  [1,5]
George   Ship  [5,9]
Ron      Mail  [6,10]

Manages
Dept  MgrName  T
Load  Ed       [3,8]
Ship  Jim      [7,15]

Tuples in the relations represent facts about the modeled reality. For example, the first tuple in the Employee relation represents the fact that Ron worked for the Shipping department from time 1 to time 5, inclusive. Notice that none of the attributes, including the timestamp attributes T, are set-valued – the relation schemas are in 1NF.

2.2 Cartesian product

The temporal Cartesian product is a conventional Cartesian product with a predicate on the timestamp attributes. To define it, we need two auxiliary definitions. First, intersect(U, V), where U and V are intervals, returns TRUE if there exists a chronon t such that t ∈ U ∧ t ∈ V. Second, overlap(U, V) returns the maximum interval contained in its two argument intervals; if no nonempty interval exists, the function returns ∅. To state this more precisely, let first and last return the smallest and largest of two argument chronons, respectively. Also, let Us and Ue denote, respectively, the starting and ending chronons of U, and similarly for V.

overlap(U, V) = [last(Us, Vs), first(Ue, Ve)]   if last(Us, Vs) ≤ first(Ue, Ve)
overlap(U, V) = ∅                               otherwise

Definition. The temporal Cartesian product, r ×T s, of two temporal relations r and s is defined as follows.

r ×T s = {z^(n+m+2) | ∃x ∈ r ∃y ∈ s (
    z[A] = x[A] ∧ z[B] = y[B] ∧
    z[T] = overlap(x[T], y[T]) ∧ z[T] ≠ ∅)}

The second line of the definition sets the explicit attribute values of the result tuple z to the concatenation of the explicit attribute values of x and y. The third line computes the timestamp of z and ensures that it is nonempty.

Example. Consider the query "Show the names of employees and managers where the employee worked for the company while the manager managed some department in the company." This can be satisfied using the temporal Cartesian product.

Employee ×T Manages
EmpName  Dept  Dept  MgrName  T
Ron      Ship  Load  Ed       [3,5]
George   Ship  Load  Ed       [5,8]
George   Ship  Ship  Jim      [7,9]
Ron      Mail  Load  Ed       [6,8]
Ron      Mail  Ship  Jim      [7,10]

The overlap function is necessary and sufficient to ensure snapshot reducibility, as will be discussed in detail in Sect. 2.7. Basically, we want the temporal Cartesian product to act as though it is a conventional Cartesian product applied independently at each point in time. When operating on interval-stamped data, this semantics corresponds to an intersection: the result will be valid during those times when contributing tuples from both input relations are valid.

The temporal Cartesian product was first defined by Segev and Gunadhi [20,47]. This operator was termed the time join, and the abbreviation T-join was used. Clifford and Croker [7] defined a Cartesian product operator that is a combination of the temporal Cartesian product and the temporal outerjoin, to be defined shortly. Interval join is a building block of the (spatial) rectangle join [2]. The interval join is a one-dimensional spatial join that can thus be used to implement the temporal Cartesian product.

2.3 Theta join

Like the conventional theta join, the temporal theta join supports an unrestricted predicate P on the explicit attributes of its input arguments. The temporal theta join, r ✶T_P s, of two relations r and s selects those tuples from r ×T s that satisfy predicate P(r[A], s[B]). Let σ denote the standard selection operator.

Definition. The temporal theta join, r ✶T_P s, of two temporal relations r and s is defined as follows.

r ✶T_P s = σ_P(r[A],s[B]) (r ×T s)

A form of this operator, the Θ-JOIN, was proposed by Clifford and Croker [6]. This operator was later extended to allow computations more general than overlap on the timestamps of result tuples [53].

2.4 Equijoin

Like the snapshot equijoin, the temporal equijoin operator enforces equality matching among specified subsets of the explicit attributes of the input relations.

Definition. The temporal equijoin on two temporal relations r and s on attributes A′ ⊆ A and B′ ⊆ B is defined as the theta join with predicate P ≡ r[A′] = s[B′]:

r ✶T_{r[A′]=s[B′]} s

Like the temporal theta join, the temporal equijoin was first defined by
Clifford and Croker [6] A specialized operator, the TE-join, was developed independently by Segev and Gunadhi [47] The TE-join required the explicit join attribute to be a surrogate attribute of both input relations Essentially, a surrogate attribute would be a key attribute of a corresponding nontemporal schema In a temporal context, a surrogate attribute value represents a time-invariant object identifier If we augment schemas R and S with surrogate attributes ID, then the TE-join can be expressed using the temporal equijoin as follows r TE-join s ≡ r ✶Tr[ID]=s[ID] s The temporal equijoin was also generalized by Zhang et al to yield the generalized TE-join, termed the GTE-join, which specifies that the joined tuples must have their keys in a specified range while their intervals should intersect a specified interval [56] The objective was to focus on tuples within interesting rectangles in the key-time space 2.5 Natural join The temporal natural join and the temporal equijoin bear the same relationship to one another as their snapshot counterparts That is, the temporal natural join is simply a temporal equijoin on identically named explicit attributes followed by a subsequent projection operation To define this join, we augment our relation schemas with explicit join attributes, Ci , ≤ i ≤ k, which we abbreviate by C R = (A1 , , An , C1 , , Ck , Ts , Te ) S = (B1 , , Bm , C1 , , Ck , Ts , Te ) Definition The temporal natural join of r and s, r ✶T s, is defined as follows D Gao et al.: Join operations in temporal databases r ✶T s = {z (n+m+k+2) | ∃x ∈ r ∃y ∈ s(x[C] = y[C]∧ z[A] = x[A] ∧ z[B] = x[B] ∧ z[C] = y[C]∧ z[T] = overlap(x[T], y[T]) ∧ z[T] = ∅)} The first two lines ensure that tuples x and y agree on the values of the join attributes C and set the explicit attributes of the result tuple z to the concatenation of the nonjoin attributes A and B and a single copy of the join attributes, C The third line computes the timestamp of z as the overlap of the timestamps of x and y and ensures that x[T] and y[T] actually overlap This operator was first defined by Clifford and Croker [6], who named it the natural time join We showed in earlier work that the temporal natural join plays the same important role in reconstructing normalized temporal relations as the snapshot natural join for normalized snapshot relations [25] Most previous work in temporal join evaluation has addressed, either implicitly or explicitly, the implementation of the temporal natural join or the closely related temporal equijoin 2.6 Outerjoins and outer Cartesian products Like the snapshot outerjoin, temporal outerjoins and Cartesian products retain dangling tuples, i.e., tuples that not participate in the join However, in a temporal database, a tuple may dangle over a portion of its time interval and be covered over others; this situation must be accounted for in a temporal outerjoin or Cartesian product We may define the temporal outerjoin as the union of two subjoins, like the snapshot outerjoin The two subjoins are the temporal left outerjoin and the temporal right outerjoin As the left and right outerjoins are symmetric, we define only the left outerjoin We need two auxiliary functions The coalesce function collapses value-equivalent tuples – tuples with mutually equal nontimestamp attribute values [23] – in a temporal relation into a single tuple with the same nontimestamp attribute values and a timestamp that is the finite union of intervals that precisely contains the chronons in the timestamps of the 
valueequivalent tuples (A finite union of time intervals is termed a temporal element [15], which we represent in this paper as a set of chronons.) The definition of coalesce uses the function chronons that returns the set of chronons contained in the argument interval coalesce(r) = {z (n+1) | ∃x∈ r(z[A] = x[A] ⇒ chronons(x[T]) ⊆ z[T]∧ ∀x ∈ r(x[A] = x [A] ⇒ (chronons(x [T]) ⊆ z[T]))) ∧ ∀t ∈ z[T] ∃x ∈ r(z[A] = x [A] ∧ t ∈ chronons(x [T]))} The second and third lines of the definition coalesce all valueequivalent tuples in relation r The last line ensures that no spurious chronons are generated We now define a function expand that returns the set of maximal intervals contained in an argument temporal element, T expand(T ) = {[ts , te ] | ts ∈ T ∧ te ∈ T ∧ ∀t ∈ chronons([ts , te ])(t ∈ T )∧ ¬∃ts ∈ T (ts < ts ∧ ∀t (ts < t < ts ⇒ t ∈ T )) ∧ ¬∃te ∈ T (te > te ∧ ∀t (te < t < te ⇒ t ∈ T ))} The second line ensures that a member of the result is an interval contained in T The last two lines ensure that the interval is indeed maximal We are now ready to define the temporal left outerjoin Let R and S be defined as for the temporal equijoin We use A ⊆ A and B ⊆ B as the explicit join attributes Definition The temporal left outerjoin, r ✶Tr[A ]=s[B ] s, of two temporal relations r and s is defined as follows r ✶Tr[A ]=s[B ] s = {z (n+m+2) | ∃x ∈ coalesce(r) ∃y ∈ coalesce(s) (x[A ] = y[B ] ∧ z[A] = x[A] ∧ z[T] = ∅ ∧ ((z[B] = y[B] ∧ z[T] ∈ expand(x[T] ∩ y[T])) ∨ (z[B] = null ∧ z[T] ∈ expand(x[T] − y[T])))) ∨ ∃x ∈ coalesce(r) ∀y ∈ coalesce(s) (x[A ] = y[B ] ⇒ z[A] = x[A] ∧ z[B] = null ∧ z[T] ∈ expand(x[T ]) ∧ z[T] = ∅)} The first five lines of the definition handle the case where, for a tuple x deriving from the left argument, a tuple y with matching explicit join attribute values is found For those time intervals of x that are not shared with y, we generate tuples with null values in the attributes of y The final three lines of the definition handle the case where no matching tuple y is found Tuples with null values in the attributes of y are generated The temporal outerjoin may be defined as simply the union of the temporal left and the temporal right outerjoins (the union operator eliminates the duplicate equijoin tuples) Similarly, a temporal outer Cartesian product is a temporal outerjoin without the equijoin condition (A = B = ∅) Gunadhi and Segev were the first researchers to investigate outerjoins over time They defined a specialized version of the temporal outerjoin called the EVENT JOIN [47] This operator, of which the temporal left and right outerjoins were components, used a surrogate attribute as its explicit join attribute This definition was later extended to allow any attributes to serve as the explicit join attributes [53] A specialized version of the left and right outerjoins called the TE-outerjoin was also defined The TE-outerjoin incorporated the TE-join, i.e., temporal equijoin, as a component Clifford and Croker [7] defined a temporal outer Cartesian product, which they termed simply Cartesian product 2.7 Reducibility We proceed to show how the temporal operators reduce to snapshot operators Reducibility guarantees that the semantics of the snapshot operator is preserved in its more complex temporal counterpart For example, the semantics of the temporal natural join reduces to the semantics of the snapshot natural join in that the result of first joining two temporal relations and then transforming the result to a snapshot relation yields a result that is the same as that obtained 
by first transforming the arguments to snapshot relations and then joining the snapshot relations This commutativity diagram is shown in Fig and stated formally in the first equality of the following theorem D Gao et al.: Join operations in temporal databases Temporal relations Snapshot relations τtT ✲ r, r τtT (r), τtT (r ) ✶ ✶T ❄ r ✶T r τtT ❄ ✲ τtT (r ✶T r ) = The timeslice operation τ T takes a temporal relation r as argument and a chronon t as parameter It returns the corresponding snapshot relation, i.e., with the schema of r but without the timestamp attributes, that contains (the nontime stamp portion of) all tuples x from r for which t belongs to x[T] It follows from the theorem below that the temporal joins defined here reduce to their snapshot counterparts Theorem Let t denote a chronon and let r and s be relation instances of the proper types for the operators they are applied to Then the following hold for all t τtT (r ✶T s) = τtT (r) ✶ τtT (s) τtT (r ×T s) = τtT (r) × τtT (s) τtT (r ✶TP s) = τtT (r) ✶P τtT (s) τtT (r ✶T s) = τtT (r) ✶ τtT (s) τtT (r ✶ T s) = τtT (r) ✶ τtT (s) Proof: An equivalence is shown by proving its two inclusions separately The nontimestamp attributes of r and s are AC and BC, respectively, where A, B, and C are sets of attributes and C denotes the join attribute(s) (cf the definition of temporal natural join) We prove one inclusion of the first equivalence, that is, τtT (r ✶T s) ⊆ τtT (r) ✶ τtT (s) The remaining proofs are similar in style Let x ∈ τtT (r ✶ s) (the left-hand side of the equivalence to be proved) Then there is a tuple x ∈ r ✶T s such that x [ABC] = x and t ∈ x [T] By the definition of ✶T , there exist tuples x1 ∈ r and x2 ∈ s such that x1 [C] = x2 [C] = x [C], x1 [A] = x [A], x2 [B] = x [B], x [T ] = overlap(x1 [T ], x2 [T ]) By the definition of τtT , there exist a tuple x1 ∈ τtT (r) such that x1 = x1 [AC] = x [AC] and a tuple x2 ∈ τtT (s) such that x2 = x2 [BC] = x [BC] Then there exists x12 ∈ τtT (r) ✶ τtT (s) (the right-hand side of the equivalence) such that x12 [AC] = x1 and x12 [B] = x2 [B] By construction, x12 = x This proves the ⊆ inclusion 2.8 Summary We have defined a taxonomy for temporal join operators The taxonomy was constructed as a natural extension of corresponding snapshot database operators We also briefly described how previously defined temporal operators are accommodated in the taxonomy Table summarizes how previous work is represented in the taxonomy For each operator defined in previous work, the table lists the defining publication, researchers, the corresponding taxonomy operator, and any τtT (r) ✶τtT (r ) Fig Reducibility of temporal natural join to snapshot natural join restrictions assumed by the original operators In early work, Clifford [8] indicated that an INTERSECTION JOIN should be defined that represents the categorized nonouter joins and Cartesian products, and he proposed that a UNION JOIN be defined for the outer variants Evaluation algorithms In the previous section, we described the semantics of all previously proposed temporal join operators We now turn our attention to implementation algorithms for these operators As before, our purpose is to enumerate the space of algorithms applicable to the temporal join operators, thereby providing a consistent framework within which existing temporal join evaluation algorithms can be placed Our approach is to extend well-understood paradigms from conventional query evaluation to temporal databases Algorithms for temporal join evaluation are necessarily more 
complex than their snapshot counterparts Whereas snapshot evaluation algorithms match input tuples based on their explicit join attributes, temporal join evaluation algorithms typically must additionally ensure that temporal restrictions are met Furthermore, this problem is exacerbated in two ways Timestamps are typically complex data types, e.g., intervals requiring inequality predicates, which conventional query processors are not optimized to handle Also, a temporal database is usually larger than a corresponding snapshot database due to the versioning of tuples We consider non-index-based algorithms Index-based algorithms use an auxiliary access path, i.e., a data structure that identifies tuples or their locations using a join attribute value Non-index-based algorithms not employ auxiliary access paths While some attention has been focused on index-based temporal join algorithms, the large number of temporal indexes that have been proposed in the literature [44] precludes a thorough investigation in this paper We first provide a taxonomy of temporal join algorithms This taxonomy, like the operator taxonomy of Table 1, is based on well-established relational concepts Sections 3.2 and 3.3 describe the algorithms in the taxonomy and place existing work within the given framework Finally, conclusions are offered in Sect 3.4 3.1 Evaluation taxonomy All binary relational query evaluation algorithms, including those computing conventional joins, are derived from four D Gao et al.: Join operations in temporal databases Table Temporal join operators Operator Initial citation Taxonomy operator Restrictions Cartesian product EQUIJOIN GTE-join INTERVAL JOIN NATURAL JOIN TIME JOIN T-join TE-JOIN TE-OUTERJOIN EVENT JOIN Θ-JOIN Valid-time theta join Valid-time left join [7] [6] [56] [2] [6] [6] [20] [47] [47] [47] [6] [53] [53] Outer Cartesian product Equijoin Equijoin Cartesian product Natural join Cartesian product Cartesian product Equijoin Left outerjoin Outerjoin Theta join Theta join Left outerjoin None None 2, None None None 2 None None None Restrictions: = restricts also the valid time of the result tuples = matching only on surrogate attributes = includes also intersection predicates with an argument surrogate range and a time range basic paradigms: nested-loop, partitioning, sort-merge, and index-based [18] Partition-based join evaluation divides the input tuples into buckets using the join attributes of the input relations as key values Corresponding buckets of the input relations contain all tuples that could possibly match with one another, and the buckets are constructed to best utilize the available main memory buffer space The result is produced by performing an in-memory join of each pair of corresponding buckets from the input relations Sort-merge join evaluation also divides the input relation but uses physical memory loads as the units of division The memory loads are sorted, producing sorted runs, and written to disk The result is produced by merging the sorted runs, where qualifying tuples are matched and output tuples generated Index-based join evaluation utilizes indexes defined on the join attributes of the input relations to locate joining tuples efficiently The index could be preexisting or built on the fly Elmasri et al presented a temporal join algorithm that utilizes a two-level time index, which used a B+ -tree to index the explicit attribute in the upper level, with the leaves referencing other B+ -trees indexing time points [13] Son and Elmasri revised the time index 
to require less space and used this modified index to determine the partitioning intervals in a partition-based timestamp algorithm [52] Bercken and Seeger proposed several temporal join algorithms based on a multiversion B+ -tree (MVBT) [4] Later Zhang et al described several algorithms based on B+ -trees, R∗ -trees [3], and the MVBT for the related GTE-join This operation requires that joined tuples have key values that belong to a specified range and have time intervals that intersect a specified interval [56] The MVBT assumes that updates arrive in increasing time order, which is not the case for valid-time data We focus on non-index-based join algorithms that apply to both valid-time and transaction-time relations, and we not discuss these index-based joins further We adapt the basic non-index-based algorithms (nested-loop, partitioning, and sort-merge) to support temporal joins To enumerate the space of temporal join algo- rithms, we exploit the duality of partitioning and sort-merge [19] In particular, the division step of partitioning, where tuples are separated based on key values, is analogous to the merging step of sort-merge, where tuples are matched based on key values In the following, we consider the characteristics of sort-merge algorithms and apply duality to derive corresponding characteristics of partition-based algorithms For a conventional relation, sort-based join algorithms order the input relation on the input relations’ explicit join attributes For a temporal relation, which includes timestamp attributes in addition to explicit attributes, there are four possibilities for ordering the relation First, the relation can be sorted by the explicit attributes exclusively Second, the relation can be ordered by time, using either the starting or ending timestamp [29,46] The choice of starting or ending timestamp dictates an ascending or descending sort order, respectively Third, the relation can be ordered primarily by the explicit attributes and secondarily by time [36] Finally, the relation can be ordered primarily by time and secondarily by the explicit attributes By duality, the division step of partition-based algorithms can partition using any of these options [29,46] Hence four choices exist for the dual steps of merging in sort-merge or partitioning in partition-based methods We use this distinction to categorize the different approaches to temporal join evaluation The first approach above, using the explicit attributes as the primary matching attributes, we term explicit algorithms Similarly, we term the second approach timestamp algorithms We retain the generic term temporal algorithm to mean any algorithm to evaluate a temporal operator Finally, it has been recognized that the choice of buffer allocation strategy, GRACE or hybrid [9], is independent of whether a sort-based or partition-based approach is used [18] Hybrid policies retain most of the last run of the outer relation in main memory and so minimize the flushing of intermediate buffers to disk, thereby potentially decreasing the I/O cost Figure lists the choices of sort-merge vs partitioning, the possible sorting/partitioning attributes, and the possible D Gao et al.: Join operations in temporal databases     Explicit Sort-merge Timestamp × Partitioning  Explicit/timestamp   Timestamp/explicit        × GRACE Hybrid Fig Space of possible evaluation algorithms buffer allocation strategies Combining all possibilities yields 16 possible evaluation algorithms Including the basic nestedloop 
algorithm and GRACE and hybrid variants of the sortbased interval join mentioned in Sect 2.2 results in a total of 19 possible algorithms The 19 algorithms are named and described in Table We noted previously that time intervals lack a natural order From this point of view spatial join is similar because there is no natural order preserving spatial closeness Previous work on spatial join may be categorized into three approaches Early work [37,38] used a transformation approach based on space-filling curves, performing a sort-merge join along the curve to solve the join problem Most of the work falls in the index-based approaches, utilizing spatial index structures such as the R-tree [21], R+ -tree [48], R∗ -tree [3], Quad-tree [45], or seeded tree [31] While some algorithms use preexisting indexes, others build the indexes on the fly In recent years, some work has focused on non-indexbased spatial join approaches Two partition-based spatial join algorithms have been proposed One of them [32] partitions the input relations into overlapping buckets and uses an indexed nested-loop join to perform the join within each bucket The other [40] partitions the input relations into disjoint partitions and uses a computational-geometry-based plane-sweep algorithm that can be thought of as the spatial equivalent of the sort-merge algorithm Arge et al [2] introduced a highly optimized implementation of the sweepingbased algorithm that first sorts the data along the vertical axis and then partitions the input into a number of vertical strips Data in each strip can then be joined by an internal planesweep algorithm All the above non-index-based spatial join algorithms use a sort- or partition-based approach or combine these two approaches in one algorithm, which is the approach we adopt in some of our temporal join algorithms (Sect 4.3.2) In the next two sections, we examine the space of explicit algorithms and timestamp algorithms, respectively, and classify existing approaches using the taxonomy developed in this section We will see that most previous work in temporal join evaluation has centered on timestamp algorithms However, for expository purposes, we first examine those algorithms based on manipulation of the nontimestamp columns, which we term “explicit” algorithms 3.2 Explicit algorithms Previous work has largely ignored the fact that conventional query evaluation algorithms can be easily modified to evaluate temporal joins In this section, we show how the three paradigms of query evaluation can support temporal join evaluation To make the discussion concrete, we develop an algorithm to evaluate the valid-time natural join, defined in Sect 2, for each of the three paradigms We begin with the simplest paradigm, nested-loop evaluation explicitNestedLoop(r, s): result ← ∅; for each block br ∈ r read(br ); for each block bs ∈ s read(bs ); for each tuple x ∈ br for each tuple y ∈ bs if x[C] = y[C] and overlap(x[T], y[T]) = ∅ z[A] ← x[A]; z[B] ← y[B]; z[C] ← x[C]; z[T] ← overlap(x[T], y[T]); result ← result ∪ {z}; return result; Fig Algorithm explicitNestedLoop 3.2.1 Nested-loop-based algorithms Nested-loop join algorithms match tuples by exhaustively comparing pairs of tuples from the input relations As an I/O optimization, blocks of the input relations are read into memory, with comparisons performed between all tuples in the input blocks The size of the input blocks is constrained by the available main memory buffer space The algorithm operates as follows One relation is designated the outer 
relation, the other the inner relation [35,18] The outer relation is scanned once For each block of the outer relation, the inner relation is scanned When a block of the inner relation is read into memory, the tuples in that “inner block” are joined with the tuples in the “outer block.” The temporal nested-loop join is easily constructed from this basic algorithm All that is required is that the timestamp predicate be evaluated at the same time as the predicate on the explicit attributes Figure shows the temporal algorithm (In the figure, r is the outer relation and s is the inner relation We assume their schemas are as defined in Sect 2.) While conceptually simple, nested-loop-based evaluation is often not competitive due to its quadratic cost We now describe temporal variants of the sort-merge and partitionbased algorithms, which usually exhibit better performance 3.2.2 Sort-merge-based algorithms Sort-merge join algorithms consist of two phases In the first phase, the input relations r and s are sorted by their join attributes In the second phase, the result is produced by simultaneously scanning r and s, merging tuples with identical values for their join attributes Complications arise if the join attributes are not key attributes of the input relations In this case, multiple tuples in r and in s may have identical join attribute values Hence a given r tuple may join with many s tuples, and vice versa (This is termed skew [30].) As before, we designate one relation as the outer relation and the other as the inner relation When consecutive tuples in D Gao et al.: Join operations in temporal databases Table Algorithm taxonomy Algorithm Name Description Explicit sort Hybrid explicit sort Timestamp sort Hybrid timestamp sort Explicit/timestamp sort Hybrid explicit/timestamp sort Timestamp/explicit sort Hybrid timestamp/explicit sort Interval join Hybrid interval join Explicit partitioning Hybrid explicit partitioning Timestamp partitioning Hybrid timestamp partitioning Explicit/timestamp partitioning Hybrid explicit/timestamp partitioning Timestamp/explicit partitioning Hybrid timestamp/explicit partitioning Nested-loop ES ES-H TS TS-H ETS ETS-H TES TES-H TSI TSI-H EP EP-H TP TP-H ETP ETP-H TEP TEP-H NL GRACE sort-merge by explicit attributes Hybrid sort-merge by explicit attributes GRACE sort-merge by timestamps Hybrid sort-merge by timestamps GRACE sort-merge by explicit attributes/time Hybrid sort-merge by explicit attributes/time GRACE sort-merge by time/explicit attributes Hybrid sort-merge by time/explicit attributes GRACE sort-merge by timestamps Hybrid sort-merge by timestamps GRACE partitioning by explicit attributes Hybrid partitioning by explicit attributes Range partition by time Hybrid range partitioning by time GRACE partitioning by explicit attributes/time Hybrid partitioning by explicit attributes/time GRACE partitioning by time/explicit attributes Hybrid partitioning by time/explicit attributes Exhaustive matching structure state integer current block; integer current tuple; integer f irst block; integer f irst tuple; block tuples; Fig State structure for merge scanning the outer relation have identical values for their explicit join attributes, i.e., their nontimestamp join attributes, the scan of the inner relation is “backed up” to ensure that all possible matches are found Prior to showing the explicitSortMerge algorithm, we define a suite of algorithms that manage the scans of the input relations For each scan, we maintain the state structure shown in Fig The fields 
current block and current tuple together indicate the current tuple in the scan by recording the number of the current block and the index of the current tuple within that block The fields first block and first tuple are used to record the state at the beginning of a scan of the inner relation in order to back up the scan later if needed Finally, tuples stores the block of the relation currently in memory For convenience, we treat the block as an array of tuples The initState algorithm shown in Fig initializes the state of a scan Essentially, counters are set to guarantee that the first block read and the first tuple scanned are the first block and first tuple within that block in the input relation We assume that a seek operation is available that repositions the file pointer associated with a relation to a given block number The advance algorithm advances the scan of the argument relation and state to the next tuple in the sorted relation If the current block has been exhausted, then the next block of the relation is read Otherwise, the state is updated to mark the next tuple in the current block as the next tuple in the scan The initState(relation, state): state.current block ← 1; state.current tuple ← 0; state.f irst block ←⊥; state.f irst tuple ←⊥; seek(relation, state.current block); state.tuples ← read block(relation); advance(relation, state): if (state.current tuple = MAX TUPLES) state.tuples ← read block(relation); state.current block ← state.current block + 1; state.current tuple ← 1; else state.current tuple ← state.current tuple + 1; currentTuple(state): return state.tuples[state.current tuple] backUp(relation, state): if (state.current block = state.f irst block) state.current block ← state.f irst block; seek(relation, state.current block); state.tuples ← read block(relation); state.current tuple ← state.f irst tuple; markScanStart(state): state.f irst block ← state.current block; state.f irst tuple ← state.current tuple; Fig Merge algorithms 10 D Gao et al.: Join operations in temporal databases explicitSortMerge(r, s, C): r ← sort(r, C); s ← sort(s, C); initState(r , outer state); initState(s , inner state); x [C] ←⊥; result ← ∅; advance(s , inner state); y ← currentTuple(inner state); for i ← to |r | advance(r , outer state); x ← currentTuple(outer state); if x[C] = x [C] backUp(s , inner state); y ← currentTuple(s , inner state); x [C] ← x[C]; while (x[C] > y[C]) advance(s , inner state); y ← currentTuple(inner state); markScanStart(inner state); while (x[C] = y[C]) if overlap(x[T], y[T]) = ∅) z[A] ← x[A]; z[B] ← y[B]; z[C] ← x[C]; z[T] ← overlap(x[T], y[T]); result ← result ∪ {z}; advance(s , inner state); y ← currentTuple(inner state); return result; Fig explicitSortMerge algorithm current tuple algorithm merely returns the next tuple in the scan, as indicated by the scan state Finally, the backUp and markScanStart algorithms manage the backing up of the inner relation scan The backUp algorithm reverts the current block and tuple counters to their last values These values are stored in the state at the beginning of a scan by the markScanStart algorithm We are now ready to exhibit the explicitSortMerge algorithm, shown in Fig The algorithm accepts three parameters, the input relations r and s and the join attributes C We assume that the schemas of r and s are as given in Sect Tuples from the outer relation are scanned in order For each outer tuple, if the tuple matches the previous outer tuple, the scan of the inner relation is backed up to the first matching inner tuple 
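To make the merge-with-backup control flow of the explicitSortMerge algorithm above concrete, here is a compact in-memory sketch in Python. It is an illustration rather than the paper's implementation: each tuple is assumed to be a dict with an explicit join attribute (named 'C' below), inclusive chronon bounds Ts and Te, and otherwise distinct attribute names, and blocks, buffering, and disk I/O are ignored so that only the backup logic and the overlap computation remain.

    def overlap(u, v):
        # maximal interval contained in both inclusive intervals u and v, or None
        s, e = max(u[0], v[0]), min(u[1], v[1])
        return (s, e) if s <= e else None

    def explicit_sort_merge(r, s, key='C'):
        r = sorted(r, key=lambda t: t[key])
        s = sorted(s, key=lambda t: t[key])
        result = []
        j = 0              # current position of the inner scan
        run_start = 0      # where the inner run for the previous outer key began
        prev = object()    # sentinel that matches no real key value
        for x in r:
            if x[key] == prev:
                j = run_start                      # back up the inner scan
            else:
                prev = x[key]
                while j < len(s) and s[j][key] < x[key]:
                    j += 1                         # skip smaller inner keys
                run_start = j                      # mark start of the matching run
            while j < len(s) and s[j][key] == x[key]:
                t = overlap((x['Ts'], x['Te']), (s[j]['Ts'], s[j]['Te']))
                if t is not None:
                    z = {k: v for k, v in x.items() if k not in ('Ts', 'Te')}
                    z.update({k: v for k, v in s[j].items() if k not in ('Ts', 'Te')})
                    z['Ts'], z['Te'] = t
                    result.append(z)
                j += 1
        return result

Run on the Employee and Manages relations of Sect. 2 with key='Dept', the sketch returns the single tuple (George, Ship, Jim, [7,9]), as the temporal natural join semantics prescribes.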
The starting location of the scan is recorded in case backing up is needed by the next outer tuple, and the scan proceeds forward as normal The complexity of the algorithm, as well as its performance degradation as compared with conventional sort-merge, is due largely to the bookkeeping required to back up the inner relation scan We consider this performance hit in more detail in Sect 4.2.2 Segev and Gunadhi developed three algorithms based on explicit sorting, differing primarily by the code in the inner loop and by whether backup is necessary Two of the algorithms, TEJ-1 and TEJ-2, support the temporal equijoin [46]; the remaining algorithm, EJ-1, evaluates the temporal outerjoin [46] TEJ-1 is applicable if the equijoin condition is on the surrogate attributes of the input relations The surrogate attributes are essentially key attributes of a corresponding snapshot schema TEJ-1 assumes that the input relations are sorted primarily by their surrogate attributes and secondarily by their starting timestamps The surrogate matching, sort-ordering, and 1TNF assumption described in Sect 3.3.1 allows the result to be produced with a single scan of both input relations, with no backup The second equijoin algorithm, TEJ-2, is applicable when the equijoin condition involves any explicit attributes, surrogate or not TEJ-2 assumes that the input relations are sorted primarily by their explicit join attribute(s) and secondarily by their starting timestamps Note that since the join attribute can be a nonsurrogate attribute, tuples sharing the same join attribute value may overlap in valid time Consequently, TEJ-2 requires the scan of the inner relation to be backed up in order to find all tuples with matching explicit attributes For the EVENT JOIN, Segev and Gunadhi described the sort-merge-based algorithm EJ-1 EJ-1 assumes that the input relations are sorted primarily by their surrogate attributes and secondarily by their starting timestamps Like TEJ-1, the result is produced by a single scan of both input relations 3.2.3 Partition-based algorithms As in sort-merge-based algorithms, partition-based algorithms have two distinct phases In the first phase, the input relations are partitioned based on their join attribute values The partitioning is performed so that a given bucket produced from one input relation contains tuples that can only match with tuples contained in the corresponding bucket of the other input relation Each produced bucket is also intended to fill the allotted main memory Typically, a hash function is used as the partitioning agent Both relations are filtered through the same hash function, producing two parallel sets of buckets In the second phase, the join is computed by comparing tuples in corresponding buckets of the input relations Partition-based algorithms have been shown to have superior performance when the relative sizes of the input relations differ [18] A partitioning algorithm for the temporal natural join is shown in Fig The algorithm accepts as input two relations r and s and the names of the explicit join attributes C We assume that the schemas of r and s are as given in Sect As can be seen, the explicit partition-based join algorithm is conceptually very simple One relation is designated the outer relation, the other the inner relation After partitioning, each bucket of the outer relation is read in turn For a given “outer bucket,” each page of the corresponding “inner bucket” is read, and tuples in the buffers are joined The partitioning step in Fig is performed by 
the partition algorithm This algorithm takes as its first argument an input relation The resulting n partitions are returned in the remaining parameters Algorithm partition assumes that a hash function hash is available that accepts the join attribute values x[C] as input and returns an integer, the index of the target bucket, as its result D Gao et al.: Join operations in temporal databases explicitPartitionJoin(r, s, C): result ← ∅; partition(r, r1 , , rn ); partition(s, s1 , , sn ); for i ← to n outer bucket ← read partition(ri ); for each page p ∈ si p ← read page(si ); for each tuple x ∈ outer bucket for each tuple y ∈ p if (x[C] = y[C] and overlap(x[T], y[T]) = ∅) z[A] ← x[A]; z[B] ← y[B]; z[C] ← x[C]; z[T] ← overlap(x[T], y[T]); result ← result ∪ {z}; return result; partition(r, r1 , , rn ): for i ← to p ri ← ∅; for each block b ∈ r read block(b); for each tuple x ∈ b i ← hash(x[C]); ri ← ri ∪ {x}; 11 surrogate attribute and that the input relations are in Temporal First Normal Form (1TNF) Essentially, 1TNF ensures that tuples within a single relation that have the same surrogate value may not overlap in time EJ-2 simultaneously produces the natural join and left outerjoin in an initial phase and then computes the right outerjoin in a subsequent phase For the first phase, the inner relation is scanned once from front to back for each outer relation tuple For a given outer relation tuple, the scan of the inner relation is terminated when the inner relation is exhausted or the outer tuple’s timestamp has been completely overlapped by matching inner tuples The outer tuple’s natural join is produced as the scan progresses The outer tuple’s left outerjoin is produced by tracking the subintervals of the outer tuple’s timestamp that are not overlapped by any inner tuples An output tuple is produced for each subinterval remaining at the end of the scan Note that the main memory buffer space must be allocated to contain the nonoverlapped subintervals of the outer tuple In the second phase, the roles of the inner and outer relations are reversed Now, since the natural join was produced during the first phase, only the right outerjoin needs to be computed The right outerjoin tuples are produced in the same manner as above, with one small optimization If it is known that a tuple of the (current) outer relation did not join with any tuples during the first phase, then no scanning of the inner relation is required and the corresponding outerjoin tuple is produced immediately Incidentally, Zurek proposed several algorithms for evaluating temporal Cartesian product on multiprocessors based on nested loops [57] Fig Algorithms explicitPartitionJoin and partition 3.3.2 Sort-merge-based timestamp algorithms 3.3 Timestamp algorithms In contrast to the algorithms of the previous section, timestamp algorithms perform their primary matching on the timestamps associated with tuples In this section, we enumerate, to the best of our knowledge, all existing timestamp-based evaluation algorithms for the temporal join operators described in Sect Many of these algorithms assume sort ordering of the input by either their starting or ending timestamps While such assumptions are valid for many applications, they are not valid in the general case, as valid-time semantics allows correction and deletion of previously stored data (Of course, in such cases one could resort within the join.) 
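As a companion to the explicitPartitionJoin and partition algorithms given above, the following Python sketch shows the same two-phase structure in memory; it reuses the dict-based tuple representation and the overlap helper assumed in the sort-merge sketch earlier. It is illustrative only: the buckets are plain lists rather than disk-resident partitions sized to the available buffer, and the bucket count is a free parameter.

    def partition(rel, key, n_buckets):
        # phase 1: hash each tuple into a bucket on its explicit join attribute
        buckets = [[] for _ in range(n_buckets)]
        for x in rel:
            buckets[hash(x[key]) % n_buckets].append(x)
        return buckets

    def explicit_partition_join(r, s, key='C', n_buckets=16):
        result = []
        r_buckets = partition(r, key, n_buckets)
        s_buckets = partition(s, key, n_buckets)
        # phase 2: join corresponding buckets of the two inputs in memory
        for rb, sb in zip(r_buckets, s_buckets):
            for x in rb:
                for y in sb:
                    if x[key] == y[key]:
                        t = overlap((x['Ts'], x['Te']), (y['Ts'], y['Te']))
                        if t is not None:
                            z = {k: v for k, v in x.items() if k not in ('Ts', 'Te')}
                            z.update({k: v for k, v in y.items() if k not in ('Ts', 'Te')})
                            z['Ts'], z['Te'] = t
                            result.append(z)
        return result

Because both inputs are filtered through the same hash function on the explicit join attribute, a tuple can only match tuples in the corresponding bucket of the other relation; the timestamp predicate is then checked tuple by tuple within each bucket.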
As before, all of the algorithms described here are derived from nested loop, sort-merge, or partitioning; we not consider index-based temporal joins 3.3.1 Nested-loop-based timestamp algorithms One timestamp nested-loop-based algorithm has been proposed for temporal join evaluation Like the EJ-1 algorithm described in the previous section, Segev and Gunadhi developed their algorithm, EJ-2, for the EVENT JOIN [47,20] (Table 1) EJ-2 does not assume any ordering of the input relations It does assume that the explicit join attribute is a distinguished To date, four sets of researchers – Segev and Gunadhi, Leung and Muntz, Pfoser and Jensen, and Rana and Fotouhi – have developed timestamp sort-merge algorithms Additionally, a one-dimensional spatial join algorithm proposed by Arge et al can be used to implement a temporal Cartesian product Segev and Gunadhi modified the traditional merge-join algorithm to support the T-join and the temporal equijoin [47, 20] We describe the algorithms for each of these operators in turn For the T-join, the relations are sorted in ascending order of starting timestamp The result is produced by a single scan of the input relations For the temporal equijoin, two timestamp sorting algorithms, named TEJ-3 and TEJ-4, are presented Both TEJ-3 and TEJ-4 assume that their input relations are sorted by starting timestamp only TEJ-4 is applicable only if the equijoin condition is on the surrogate attribute In addition to assuming that the input relations are sorted by their starting timestamps, TEJ-4 assumes that all tuples with the same surrogate value are linked, thereby allowing all tuples with the same surrogate to be retrieved when the first is found The result is performed with a linear scan of both relations, with random access needed to traverse surrogate chains Like TEJ-2, TEJ-3 is applicable for temporal equijoins on both the surrogate and explicit attribute values TEJ-3 assumes that the input relations are sorted in ascending order of Y Tzitzikas et al.: Mediators over taxonomy-based information sources can be evaluated as follows: Il− (a1 ) ql1 (a1 ) ql2 (a1 ) = I1 ( ql1 (a1 ) ) ∪ I2 ( ql2 (a1 ) ), where = b2 ∨ (b3 ∨ b4 ) = c3 , while the answer Iu− can be evaluated as follows: Iu− (a1 ) = I1 ( qu1 (a1 ) ) ∪ I2 ( qu2 (a1 ) ), where qu1 (a1 ) = b1 ∨ b2 qu2 (a1 ) = c1 ∨ (c2 ∧ c3 ) If the mediator knows that a source Si is compatible, then the mediator can set qli (t) = max {taili (s) | s t} Note that if the entire articulation is stored (including the transitive links), then the computation of til can be done in O(|ai |) time The same holds for tiu Thus the computation of qli (t) can be done in O(|T | ∗ |ai |) time The same holds for qui (t) This means that the computation of all qli (t), or qui (t), for i = k can be done in O(|T | ∗ |a|), where a denotes the union of all articulations, i.e., a = a1 ∪ ∪ak Now, the set operations over the answers returned by the sources that are needed for computing Il− (t) can be performed in O(k ∗ U ) time Thus the total computation needed by the mediator can be done in O(|T | ∗ |a| + k ∗ U ) time • Disjunctive queries If the query is a disjunction of terms, i.e., q = t1 ∨ ∨ tn , then Il− (t1 ∨ ∨ tn ) = Iu− (t1 Ii (qli (t1 ) ∨ ∨ qli (tn )) i=1 k Ii (qui (t1 ) ∨ ∨ qui (tn )) ∨ ∨ tn ) = i=1 k Again, the mediator can evaluate the query by sending at most one query to each source If, furthermore, a source Si is compatible, then the mediator can send to Si the query   max  (∪{taili (s) | s tj }) 123 Thus for evaluating the 
• Conjunctive queries. If the query is a conjunction of terms, i.e., $q = t_1 \wedge \dots \wedge t_n$, then
$$I_{l^-}(t_1 \wedge \dots \wedge t_n) = \bigcap_{j=1}^{n} \bigcup_{i=1}^{k} I_i\bigl(q_l^i(t_j)\bigr)$$
$$I_{u^-}(t_1 \wedge \dots \wedge t_n) = \bigcap_{j=1}^{n} \bigcup_{i=1}^{k} I_i\bigl(q_u^i(t_j)\bigr).$$
Thus, for evaluating the query, the mediator has to send at most one query to each source for each term that appears in the conjunction. This means that the mediator will send at most k · n queries. The computation of all $q_l^i(t_j)$ for $j = 1, \dots, n$ can be done in O(|T| · |a| · n) time. The set operations for computing $I_{l^-}(t)$ can be performed in O(k · n · U) time. Thus the total computation needed by the mediator can be done in O(|T| · |a| · n + k · U · n) time.

• Conjunctive normal form queries (CNF queries). A CNF query is a conjunction of maxterms, where each maxterm is either a single term or a disjunction of distinct terms ([29]), i.e., $q = d_1 \wedge \dots \wedge d_m$, where $d_j = t_{j1} \vee \dots \vee t_{jn_j}$, $j = 1, \dots, m$, $n_j \leq |T|$. In this case,
$$I_{l^-}(q) = \bigcap_{j=1}^{m} \bigcup_{i=1}^{k} I_i\bigl(q_l^i(t_{j1}) \vee \dots \vee q_l^i(t_{jn_j})\bigr)$$
$$I_{u^-}(q) = \bigcap_{j=1}^{m} \bigcup_{i=1}^{k} I_i\bigl(q_u^i(t_{j1}) \vee \dots \vee q_u^i(t_{jn_j})\bigr).$$
The mediator first evaluates each maxterm (disjunction) by sending at most one query to each source and then takes the intersection of the returned results. This means that the mediator will send at most k · m queries, where m is the number of maxterms. Let l be the length of the query, that is, the number of term appearances in the query, i.e., $l = \sum_{j=1}^{m} n_j$. The computation of $q_l^i(t)$, $i = 1, \dots, k$, for all t that appear in q can be done in O(|T| · |a| · l) time. The set operations for computing $I_{l^-}(t)$ can be performed in O(k · m · U) time. Thus the total computation needed by the mediator can be done in O(|T| · |a| · l + k · m · U) time.

• Disjunctive normal form queries (DNF queries). A DNF query is a disjunction of minterms, where a minterm is either a single term or a conjunction of distinct terms, i.e., $q = c_1 \vee \dots \vee c_m$, where $c_j = t_{j1} \wedge \dots \wedge t_{jn_j}$, $j = 1, \dots, m$, $n_j \leq |T|$. In this case,
$$I_{l^-}(q) = \bigcup_{j=1}^{m} \bigcap_{h=1}^{n_j} \bigcup_{i=1}^{k} I_i\bigl(q_l^i(t_{jh})\bigr)$$
$$I_{u^-}(q) = \bigcup_{j=1}^{m} \bigcap_{h=1}^{n_j} \bigcup_{i=1}^{k} I_i\bigl(q_u^i(t_{jh})\bigr).$$
Thus M will send at most k · l queries, where l is the length of the query. The computation of all $q_l^i(t)$ for $i = 1, \dots, k$ for all t that appear in q can be done in O(|T| · |a| · l) time. The set operations for computing $I_{l^-}(t)$ can be performed in O(k · l · U) time. Thus the total computation needed by the mediator can be done in O(|T| · |a| · l + k · l · U) time.

Table 2 summarizes the number-of-calls complexity and Table 3 the time complexity. Note that any query that contains the logical connectives ∧ and ∨ can be converted to DNF or CNF by using one of the existing algorithms (e.g., see [29]). In our case, CNF is preferred to DNF since the evaluation of a query in CNF requires sending a smaller number of queries to the sources. For this reason, the mediator first converts the user query to CNF and then evaluates the CNF query by sending queries to the sources.

Table 2 Number-of-calls complexity of query evaluation at the mediator (for the sure model), assuming k sources

  Query form                                                        Max no. of calls
  Single term    t                                                  k
  Disjunction    t_1 ∨ … ∨ t_n                                      k
  Conjunction    t_1 ∧ … ∧ t_n                                      k · n
  CNF            d_1 ∧ … ∧ d_m, where d_j = t_{j1} ∨ … ∨ t_{jn_j}   k · m
  DNF            c_1 ∨ … ∨ c_m, where c_j = t_{j1} ∧ … ∧ t_{jn_j}   k · Σ_{j=1}^{m} n_j

Table 3 Time complexity of query evaluation at the mediator (for the sure model)

  Query form                                                        Time complexity (w.r.t. |T|, |a|, k, U)
  Single term    t                                                  O(|T| · |a| + k · U)
  Disjunction    t_1 ∨ … ∨ t_n                                      O(|T| · |a| · n + k · U)
  Conjunction    t_1 ∧ … ∧ t_n                                      O(|T| · |a| · n + k · U · n)
  CNF            d_1 ∧ … ∧ d_m, where d_j = t_{j1} ∨ … ∨ t_{jn_j}   O(|T| · |a| · l + k · m · U)
  DNF            c_1 ∨ … ∨ c_m, where c_j = t_{j1} ∧ … ∧ t_{jn_j}   O(|T| · |a| · l + k · l · U)

We conclude this section by describing the evaluation of queries in the possible models of the mediator, i.e., in the models $I_{l^+}$ and $I_{u^+}$. The evaluation of a single-term query in $I_{l^+}$ or $I_{u^+}$ is done by evaluating a conjunction of terms in $I_{l^-}$ or $I_{u^-}$, respectively:
$$I^+(t) = \bigcap\{I^-(u) \mid u \in head(t) \text{ and } u \not\sim t\} = I^-\Bigl(\bigwedge\{u \mid u \in head(t) \text{ and } u \not\sim t\}\Bigr),$$
where $I^+(t)$ stands for $I_{l^+}(t)$ or $I_{u^+}(t)$, and $I^-$ stands for $I_{l^-}$ or $I_{u^-}$, respectively. Therefore, the complexity analysis of evaluating $I^+(t)$ can be done using Tables 2 and 3. Finally, the evaluation of a disjunction in $I^+$ is done by evaluating a DNF query in $I^-$, and the evaluation of a conjunction in $I^+$ is done by evaluating a conjunction in $I^-$.

Next, assume that S receives the query q = Cameras and is asked to return both the sure and the possible answer to that query. Clearly, in both cases S will return the set {1, 2}. However, instead of just returning the set {1, 2}, the source could return the following set:
{(1, {Cameras, Underwater}), (2, {Cameras, Miniature})}
In this set, each object is accompanied by the set of all terms under which the object is indexed. This information could provide valuable help to the user. Indeed, the user of our example may have actually been looking for miniature cameras, but he only used the term Cameras in his query for one of several reasons. For example,
• The user may have forgotten to use the term Miniature;
• Or the user did not know that the term Miniature was included in the terminology of the source;
• Or the user did not know that the objects of the source were indexed in such specificity.
We believe that including in the answer all terms under which each returned object is indexed might aid the user in selecting the objects that are most relevant to his information need. In addition, such terms could aid the user in getting better acquainted with the taxonomy of the source. Indeed, more often than not users are not familiar with the source taxonomy and know little about its specificity and coverage (see [50]). As a result, user queries are often imprecise and do not reflect real user needs. We believe that familiarity with the source taxonomy is essential for a precise formulation of user queries. Therefore, we extend the notion of answer to be a set of objects, each accompanied by its index, i.e., by the set of all terms under which the object is indexed.

Definition 13 The index of an object o with respect to an interpretation I, denoted by $D_I(o)$, is the set of all terms that contain o in their interpretation, i.e., $D_I(o) = \{t \in T \mid o \in I(t)\}$.

For brevity, henceforth we shall sometimes write D(o) instead of $D_I(o)$, $D^-(o)$ instead of $D_{I^-}(o)$, and $D^+(o)$ instead of $D_{I^+}(o)$ when the interpretation I is clear from the context. Clearly, the index of an object depends on the interpretation I, so the same object can have different indexes under different interpretations. Here are some examples of indexes in the source shown in Fig. 6:
D(1) = {StillCams}
D−(1) = {StillCams, Cameras}
D+(1) = {StillCams, Cameras, Reflex, Miniature,

5 Enhancing the quality of answers with object descriptions

We have just seen how to compute several kinds of answers at the
mediator We shall now see how to improve the “quality” of the answers by providing additional information on the objects returned First, let us see an example in the context of a single source Consider a source S that contains an object indexed under two terms, Cameras and Underwater, and an object also indexed under two terms, Cameras and Miniature Next, MovingPictureCams, UnderwaterDevices} D(2) = {Cameras} D− (2) = {Cameras} D+ (2) = {Cameras, StillCams, MovingPictureCams, UnderwaterDevices} We have seen that the user of a source can submit a query and ask for a sure or a possible answer Following our discussion on indexes, the user can now also ask for the sure or possible index for each object in the answer Y Tzitzikas et al.: Mediators over taxonomy-based information sources 125 S1 Table Answers to a query Object set Sure Sure Possible Possible Object index Sure Possible Sure Possible Animal Answer returned { (o, D− (o)) | o ∈ I − (q)} { (o, D+ (o)) | o ∈ I − (q)} { (o, D− (o)) | o ∈ I + (q)} { (o, D+ (o)) | o ∈ I + (q)} This means that the answer returned by the source to a given query q can have one of the forms shown in Table It is up to the user to specify the desired form of the answer Note that if I − is stored at the source, then the evaluation of D− for an object o is straightforward If, however, only the interpretation I is stored at the source, then we can compute D− (o) as follows: Proposition D− (o) = {head(t) | o ∈ I(t)}, alently, D− (o) = {head(t) | t ∈ D(o)} or equiv- If we have computed D− (o), then we can compute D+ (o) as follows: Proposition D+ (o) = {t | head(t) \ {t |t ∼ t} ⊆ D− (o)} By analogy to the single-source case, a mediator can return answers consisting of objects that are accompanied by their indexes In other words, a mediator can return a set of pairs (o, DI (o)), where I is the model used by the mediator for answering queries For example, consider two sources, S1 and S2 , providing information about animals (e.g., photos), as shown in Fig 15 The terms of source S1 are in English, while the terms of source S2 are in French Moreover, a mediator M integrates the information of the two sources and provides a unified access through a taxonomy with English terms Assume now that the mediator receives the query q = Animal, in which case the mediator sends the query q1 = Animal ∨ Dog to source S1 and the query q2 = Mammif` ere ∨ Chat to source S2 Moreover, assume that the sources S1 and S2 return objects accompanied by their sure indexes Then, the source S1 will return the answer { (1, {Dog, Animal}), (2, {Canis, Dog, Animal}) } and the source S2 the answer { (1, {Mammif` ere}), (3, {Chat, Mammif` ere}) } Next, assume that the mediator operates under operation − mode (Table 1), that is, the mediator uses the model Il− for answering queries Moreover, assume that the mediator − returns objects accompanied by their sure indexes (in Il− ) In this case, the mediator will return the following answer: { (1, {Dog, Mammal, Animal}), (2, {Dog, Mammal, Animal}), (3, {Mammal, Animal) } Let I denote any of the four interpretations Il− , Il+ , Iu− and Iu+ of the mediator, and assume that we want to compute D− (o), i.e the sure index of some object o, at the mediator Since the interpretation I is not stored at the mediator, we cannot compute D− (o) like we for a source (Proposition above) Instead, we must exploit the articulations and the indexes Di (o) returned by the sources Specifically, the mediator can compute Dl (o) (i.e., the index of o with respect to Il ) a1 M a2 S2 
and $D_u(o)$ (i.e., the index of o with respect to $I_u$), as stated by the following proposition. Note that again we write $I_l$ instead of $I_{l^-}$ or $I_{l^+}$, and $I_u$ instead of $I_{u^-}$ or $I_{u^+}$, since the computation of the object indexes at the mediator does not depend on the evaluation of queries at the underlying sources.

Fig. 15 A mediator over two sources (mediator taxonomy: Animal, Mammal, Dog; source S1: Animal, Dog, Canis; source S2: Mammifère, Chat; a1 and a2 denote the articulations)

Proposition
$$D_l(o) = \bigcup_{i=1}^{k} D_l^i(o), \quad \text{where } D_l^i(o) = \{t \in T \mid \exists\, t_i \in D_i(o) \text{ such that } t_i \preceq t\}$$
$$D_u(o) = \bigcup_{i=1}^{k} D_u^i(o), \quad \text{where } D_u^i(o) = \{t \in T \mid (head_i(t) \neq \emptyset \text{ and } head_i(t) \subseteq D_i(o)) \text{ or } (head_i(t) = \emptyset \text{ and } \exists\, t_i \in D_i(o) \text{ such that } t_i \preceq t)\}.$$

Now, $D_l^-(o)$ and $D_u^-(o)$ can be computed by applying the proposition for $D^-(o)$ to $D_l(o)$ and $D_u(o)$, respectively. Similarly, $D_l^+(o)$ and $D_u^+(o)$ can be computed by applying the proposition for $D^+(o)$ to $D_l^-(o)$ and $D_u^-(o)$, respectively.

6 Extending our model

In this section we discuss various extensions of our model. Specifically, in Sect. 6.1 we extend the form of our articulations, in Sect. 6.2 we discuss mediators that also have a stored interpretation of their terminology, in Sect. 6.3 we describe how our mediators can be combined with information retrieval systems, and in Sect. 6.4 we discuss how our approach can lead to a network of articulated sources.

6.1 Extending the form of articulations

According to Sect. 3, an articulation consists of subsumption relationships between terms only. However, we can extend the definition of an articulation to include subsumption relationships between terms and queries as well. This extension is useful because the designer of the mediator can now define articulations containing more complex relationships, as in the following examples (an illustrative encoding follows the list):
• Electronics_M ⪰ (TV_i ∨ Mobiles_i ∨ Radios_i),
• DBArticles_M ∼ (Databases_i ∧ Articles_i).
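Purely as an illustration of how a mediator might store such term-to-query articulations, here is a small Python sketch; the paper does not prescribe any encoding, so the class names and the literal term names below are assumptions only.

```python
from dataclasses import dataclass
from typing import FrozenSet, Union

@dataclass(frozen=True)
class Disjunction:
    """A query s1 v ... v sn over source terms."""
    terms: FrozenSet[str]

@dataclass(frozen=True)
class Conjunction:
    """A query s1 ^ ... ^ sn over source terms."""
    terms: FrozenSet[str]

SourceQuery = Union[str, Disjunction, Conjunction]

# One extended articulation a_i: each entry relates a mediator term to a source
# query, either by subsumption (the query is narrower than the term) or by
# equivalence, mirroring the two examples listed above.
articulation_i = [
    ("Electronics", Disjunction(frozenset({"TV", "Mobiles", "Radios"})), "subsumes"),
    ("DBArticles",  Conjunction(frozenset({"Databases", "Articles"})),   "equivalent"),
]

def tail_i(t: str, articulation) -> set:
    """The source queries that are subsumed by (or equivalent to) mediator term t."""
    return {query for (term, query, _) in articulation if term == t}

# Example: the tail of "Electronics" now contains one disjunctive source query,
# which the mediator can forward verbatim instead of a single source term.
print(tail_i("Electronics", articulation_i))
```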
the third case it again operates like the mediators described earlier but with one difference: the interpretations Il− , Il+ , Iu− , Iu+ are now defined by taking the union of IM and the interpretations of the sources For instance, the interpretation Il− is now defined as k Definition 16 A source Si is compatible with the mediator M if for any queries s, t in QTi , if sai t, then s i t Il− (t) = IM (t) ∪ As mentioned earlier, maintaining compatibility is not an easy task The mediator should (periodically) check the compatibility of its sources, e.g., by submitting to them queries that allow one to check whether t i t However, now t and t are queries; thus the sources should support subsumption checking over queries In general, an articulation may contain relationships between terms and arbitrary queries For example, consider a source Si implemented in the relational model (as described in Sect 2) and suppose that this source can answer only pure SQL queries In this case, the articulation may contain relationships of the form Cameras qi , where In other words, in the third case, the mediator operates as usual, except that now, in addition to the k external sources S1 , , Sk , we have the mediator’s own source SM = (T, ), IM acting as a (k + 1)-th source In the case where the mediator also stores an interpretation IM of T , the mediator’s ability to “translate” the descriptions of the objects returned by the underlying sources leads to an interesting scenario for the Web Consider a user who has submitted a query to the mediator and assume that the mediator has returned a set of objects to the user If some of these objects are of real interest to the user (e.g., a set of beautiful images, good papers, etc.), he user can store these objects in the database of the mediator These objects will be stored under terms of the mediator’s taxonomy, i.e., in the interpretation IM of T For example, consider the mediator shown in Fig 15 The mediator can store objects and under the terms Dog, Mammal, and Animal and object under the terms Mammal and Animal However, one can easily see that it suffices to store objects and under the term Dog and object under the term Mammal More formally, to store an object o in IM , the mediator associates this object with the following terms of T : qi = Πobject (σterm−name=”Cameras” (IN T ERP RET AT ION T ERM IN OLOGY )) 6.2 Mediators with stored interpretations We can easily extend a mediator so as to also store an interpretation of its terminology T Figure 16 shows graphically the architecture of a mediator of this kind Such an extension can prove quite useful in the context of the Web: a Web user can define his own mediator consisting of a taxonomy that is familiar to him, a set of articulations to other Web catalogs, and a stored interpretation of the mediator’s taxonomy Note that the taxonomy of the mediator and its stored interpretation resemble the bookmark facility of Web browsers However, the addition of articulations now allows the user to browse and query remote catalogs Let IM denote the stored interpretation of T When a user sends a query to the mediator, he has three choices: i=1 M Dl (o) or Ii− (til ) M Du (o) 6.3 Mediators over hybrid sources Let us use the term free retrieval source to refer to a source that indexes objects of interest using an uncontrolled vocabulary Y Tzitzikas et al.: Mediators over taxonomy-based information sources 127 Mediator User / Application a1 t a2 object description translation query translation Mediator answer term query NL 
query nq T a3 fusion result Web Portal Servers translated term t iu queries wrapper Catalog taxonomy I1− + I1 query object db nq nq Source IRS Source Catalog t 1u IRS wrapper Catalog Catalog taxonomy I2− + I2 Query answer User / Application Evaluator object db t 2u wrapper Catalog taxonomy I3− + I3 object db Fig 18 An architecture for implementing our mediators over the catalogs of the Web Fig 17 Building mediators over hybrid sources In this case, the objects of the domain have textual content and the vocabulary used for indexing them consists of those words that appear in the objects These sources usually accept natural language queries and return a set of objects ordered according to their relevance to the query Text retrieval systems (the typical case of “information retrieval systems”), as well as the search engines of the Web, fall into this category We can now use the term hybrid source to refer to a source that is both taxonomy-based and a free retrieval source A hybrid source accepts two kinds of queries: queries over a controlled vocabulary and natural language queries A source whose functionality is moving in this direction is Google Using Google, one can first select a category, e.g., Sciences/CS/DataStructures, from the taxonomy of Open Directory and then submit a natural language query, e.g., “Tree” The search engine will compute the degree of relevance with respect to the natural language query, “Tree”, only of those pages that fall in the category Sciences/CS/DataStructures in the catalog of Open Directory Clearly, this enhances the precision of the retrieval and is computationally more economical Our approach can be used for building mediators over hybrid sources whose functionality extends the functionality offered by the existing metasearchers of the Web (e.g., MetaCrawler [63], SavvySearch [38], Profusion [27]) The user of a hybrid mediator can use the taxonomy of the mediator to browse or query those parts of the sources that are of interest to him Moreover, he is able to query the databases of these sources using natural language queries This implies that the mediator will send two kinds of queries to the sources: (a) queries that are evaluated based on indexing of the objects with respect to the taxonomy of the source and (b) queries that are evaluated based on the contents of the objects (pages) Figure 17 describes this architecture graphically The functionality of the mediators described in this paper presumes that each source can provide a sure answer and a possible answer However, the taxonomy-based sources that can be found on the Web, e.g.,Yahoo! 
or ODP, not currently provide such answers This means that the functionality of our mediators cannot be implemented straightforwardly Nevertheless, we can implement the functionality of our mediators over such sources by employing appropriate wrappers First note that the taxonomy and the interpretation of a Web catalog are published as a set of Web pages For each term t of the taxonomy there is a separate Web page This page contains the name of the term and links pointing to pages that correspond to the terms subsumed by t In addition, the page contains links pointing to the objects, here Web pages, that have been indexed under the term t However, we can employ a wrapper to parse each such page and extract the name of the term, the subsumed terms, and the indexed objects Now, the architecture for implementing our mediators over Web catalogs is shown in Fig 18 The key point is that the interpretation of a term t of a source Si in the sure model Ii− and in the possible model Ii+ can be computed on the mediator side This can be achieved by building an appropriate wrapper for that source In particular, to compute Ii+ (t), the wrapper will fetch the pages of all terms t such that t i t , and then it will derive Ii+ (t) by computing the intersection ∩{Ii− (t )|t t } According to this architecture, our mediators can be implemented by using the standard HTTP protocol A prototype version of our mediators over the Web has already been implemented at Universit´e de Paris-Sud 6.4 Networks of articulated sources One can easily see how our approach can be used for creating a complex information network, comprising sources and mediators, in a natural and straightforward manner Indeed, • To add a mediator to such a network one has to (a) design the mediator taxonomy (T, ) based on the domain of interest, (b) select the sources to be mediated, and (c) design the articulations based on the known/observed relations between terms of the mediator and terms of the selected sources; • To remove a mediator from the network, one just has to disconnect the mediator from the network; 128 Y Tzitzikas et al.: Mediators over taxonomy-based information sources M1 query M2 answer Mediator Mediator Source Selection global/mediator view Query Translation Result Fusion M3 description of Source S1 S2 description of Source k Wrapper Wrapper K Source Source k exchange data model Source S4 S3 Fig 19 A network consisting of primary and secondary taxonomybased sources a1,2 a2,1 S1 S2 Fig 20 Mutually articulated sources • Moreover, to add a source to the information network, all one has to is (a) select one or more mediators in the network and (b) design an articulation between the source and each mediator; • Finally, to remove a source from the network, one simply has to remove the corresponding articulation from the mediator(s) to which the source is connected and disconnect the source Note that as each mediator has one articulation for each underlying source, the deletion of an articulation does not affect the rest of the articulations A significant consequence of this approach is that network evolution can be incremental Indeed, new relationships between terms of the mediator and terms of the sources can be added with minimum effort as soon as they are observed, and relationships that are seen to be no more valid can be removed just as easily: simply add/remove the relationships at the appropriate articulation in the mediator database storing the articulation For example, Fig 19 shows a network consisting of four primary 
sources S1 , S2 , S3 , S4 and three mediators M1 , M2 , and M3 A line segment connecting a mediator M to a source or to a mediator S means that M is a mediator over S, and circles denote articulations For example, M2 is a mediator over the sources S1 and S2 and the mediator M3 Note that the mediator M can also be considered as a primary source because it has a stored interpretation Also note that our approach allows mutually articulated sources, as shown in Fig 20 In this case, we can no longer divide sources into primary and secondary Query evaluation and updating in a network of articulated sources raises several interesting questions For example, a query to a source may trigger an infinite number of calls between the sources if the network is cyclic (e.g., Fig 20) This and other related problems go beyond the scope of this paper and are treated in [69,72] Source k Fig 21 Architecture and functional overview of the mediator Related work The need for integrated and unified access to multiple information sources has stimulated research on mediators The concept of mediator was initially proposed by Wiederhold [78] Since then it has been applied and specialized for several kinds of source and application needs Nevertheless, in every instance we can identify a number of basic architectural components Specifically, in most cases the mediator architecture consists of a mediator’s view (usually in the form of a conceptual model), source descriptions, and wrappers that describe the contents and/or the querying capabilities of each source with respect to the mediator’s view, and an exchange data model, which is used to convey information between the mediator and the sources The mediator accepts queries expressed in the mediator view Upon receiving a user query, the mediator selects the sources to be queried and formulates the query to be sent to each of them These tasks are accomplished based on the source descriptions that encode what the mediator “knows” about the underlying sources Finally, the mediator appropriately combines the returned results and delivers the final answer to the user Figure 21 shows the general architecture and the functional overview of the mediator In this section we compare our approach with other existing mediator approaches Our objective is to identify the basic differences and analogies and identify issues that are worthy of further research, rather than presenting a complete survey of this very broad area Figure 22 shows a rough taxonomy of the mediator approaches according to two criteria: (a) the kind of underlying sources and (b) the kind of mediator view According to the first, we can divide sources into two broad categories: information retrieval systems and information bases The former provide content-based access to a set of (text) documents, while the latter store structured data We will use this dichotomy in our subsequent discussion 7.1 Mediators over information retrieval systems Information retrieval systems (IRS) provide content-based access to a set of (text) documents The content of the documents (as well as the user queries) is described using an “indexing language” that can be either (a) a “free” vocabulary consisting of the words that appear in the documents of the collection, excluding those words that carry no information (such as articles) and reducing words to their grammatical root (a task called “stemming”), or Y Tzitzikas et al.: Mediators over taxonomy-based information sources Underlying Sources Information Retrieval Systems Controlled 
Vocabulary Information Bases Free Vocabulary Relational Web−based Semistructured Taxonomy−based Boolean Model Probabilistic Model Logic−based Vector Space Model Mediator’s View Unstructured Relational Semistructured Query Templates Taxonomy Logic−based Fig 22 Rough taxonomy of the mediator approaches (b) a “controlled” vocabulary that may be different from the set of words that appear in the documents This vocabulary may be structured by a small set of relations like hyponymy and synonymy For the relative merits of each of these approaches, see [21, 62] One could say that taxonomy-based sources resemble those IRSs that employ the boolean retrieval model (see [5] for a review) and exploit lexical ontologies or word-based thesauri [40] (like WordNet [57] and Roget’s thesaurus) for query expansion, i.e., for expanding queries with synonyms, hyponyms, and related terms to improve recall (e.g., see [35, 49,50,55]) However, note that the IRS techniques are applicable only if the objects of the domain have a textual content (not a prerequisite of our approach) Another remarkable difference with our sources is that the taxonomies employed by IRSs usually not accept the semantic interpretation described in this paper Lexical ontologies like WordNet [57] are structured using lexical relations (synonymy, hyponymy, antonymy) that are not semantic relations For instance, according to Wordnet, window is subsumed by opening and by panel However, every window is not a panel and an opening; thus extensional subsumption does not hold here The justification of the possible answer (in sources) and the eight answer models (in mediators) does not apply to such ontologies (for more about this problem, see [34]) Instead, techniques like spreading activation [55] are more appropriate if lexical ontologies are employed By consequence, our approach is quite different from the mediator approaches that have emerged within the IR community Specifically, a mediator over IRSs that employ free vocabularies does not have to translate the mediator queries, as each source accepts the same set of queries, i.e., natural language queries As a consequence, mediators over such sources mainly focus on issues like source selection and result fusion (metaranking) (e.g., see [6,11,20,28,33,67,76,77]) On the other hand, mediators over systems that employ controlled and structured vocabularies have not received adequate attention until now To the best of our knowledge, all of the existing approaches focus on ontology merging and not on ontology articulation Moreover, as they mainly employ lexical ontologies, the mappings between two ontologies consist of lexical 129 relationships, too (in many cases, one term is associated with a set of terms of the other ontology [3]) Although the controlled indexing languages that are used for information retrieval usually consist of a set of terms structured by a small number of relations (such as hyponymy and synonymy), there are cases where the indexing of the objects is done (especially in the case of a manual indexing process) with respect to more expressive conceptual models representing domain knowledge in a more detailed and precise manner Such conceptual models can be represented using logic-based languages, and the corresponding reasoning mechanisms can be exploited for retrieving objects There are several studies that take this conceptual modeling and reasoning approach to information retrieval (e.g., relevance terminological logics [51], four-valued logics [59]) This conceptual modeling 
approach is useful and effective if the domain is narrow If the domain is too wide (e.g., the set of all Web pages), then the problem is that it is hard to conceptualize the domain; actually, there are many different ways to conceptualize it, so it is hard to reach a conceptual model of wide acceptance Thus a mediator over such sources has to tackle complex structural differences (recall the example of Sect 2) For this purpose, even today ontologies that have a simple structure, like the one we consider, are usually employed for retrieving objects from large collections of objects ([61]) 7.2 Mediators over information bases We use the term information bases to refer to sources that store structured data, not documents Relational, semistructured, logic-based, and Web-based sources belong to this category Indeed, there are several approaches to building mediators over relational databases (e.g., see [30,31,47,79]), SGML documents (e.g., see [17]), and Web-based sources (e.g., see [4, 14,15]) We include this discussion here because our sources can be considered as information bases as we not presuppose that the domain objects have a textual content Concerning the kind of mediator view, several approaches have been proposed (as the rightmost taxonomy of Fig 22 illustrates) Indeed, we have seen mediators whose unified view has the form of a relational schema (e.g., Infomaster [25, 32]), a semantic network (e.g., SIMS [43]), an F-logic schema (e.g., OntoBroker [7,22]), a description logics schema (e.g., Information Manifold [47], OBSERVER [41,42,52], PICSEL [45]), a set of query templates (e.g., TSIMMIS [16,30], HERMES [66]) Furthermore, several data models have been used for conveying information between the mediator and sources (as shown in Fig 21) including relational tuples (e.g., in Infomaster, SIMS, Information Manifold, OBSERVER), tuples that encode graph data structures (like the OEM in TSIMMIS, or the YAT in [17]), and HTML pages (in Web-oriented approaches) To identify similarities, differences, and analogies between the above approaches and the one presented in this paper, we first describe a set of layers from which we can view an information base, then we use these layers to discuss the kinds of heterogeneity that may exist between two information bases, and, finally, we provide several observations 130 Y Tzitzikas et al.: Mediators over taxonomy-based information sources QL Query Language Data Model Conceptual Model Conceptualization Domain DM CM C Int id String name String postalAddress Person String name name hasPart Classroom Building uses D Student Teacher teaches attends Fig 23 Layers of a source Course code isA String attribute (a) A layered view of an information base We could view a source at five different layers: the domain, the conceptualization, the conceptual model, the data model, and the query language We are aware that these distinctions are not crystal clear or widely accepted; however, they enable us to discuss systematically a number of issues and draw analogies There are dependencies among these layers, as shown in Fig 23, e.g., the query language layer of a source depends on the data model layer of the source, and so on Each source stores information about a part of the real world that we call the domain layer of the source For example, the domain of a source can be the set of all URLs, or the set of all universities, or the set of Greek universities, or the Computer Science Department of the University of Crete (which we call CSD domain in the discussion 
below) The conceptualization of a domain is the intellectual lens through which the domain is viewed For example, one conceptualization of the CSD domain may describe its static aspects, i.e., what entities or things exist (e.g., persons, buildings, classrooms, computers), their attributes, and their interrelationships Another conceptualization may describe its dynamic aspects in terms of states, state transitions, and processes (e.g., enrollments, graduations, attendances, teaching) A conceptual model is used to describe a particular conceptualization of a domain in terms of a set of (widely accepted) structuring mechanisms that are appropriate for the conceptualization For example, a conceptual model that describes the static aspects of the CSD domain, using generalization and attribution, is shown in Fig 24a, while a conceptual model that describes the dynamic aspects of the CSD domain, using states and state transitions, is shown in Fig 24b The representation of a conceptual model in a computer is done according to a specific data model (e.g., relational, object-oriented, semantic network-based, semistructured) For example, the class Person of the conceptual model of Fig 24 can be represented in the relational model by a relation scheme as follows: Person(id:Int, name:Str, postalAddress:Str) Alternatively, in a different source, it could also be represented using two relation schemes: PERSON(id:Int, name:Str, addressId:Int) and POSTALADDRESS(id:Int, address:Str) However, there are also some data models that allow a straightforward representation of the conceptual model, e.g., the semantic network-based data model of SIS-Telos [19,39] Finally, each source can answer queries expressed in a particular query language For example, a source may respond to Datalog queries, while another may respond only to SQL queries In this case, we say that the query language layers of these sources are different terminal state initial state Enrollment Attendance Examination success Graduation failure (b) Fig 24 Two conceptual models of the CSD domain: one for the static and one for the dynamic aspects Kinds of heterogeneity Given a source Si , we will use Di to denote the domain, Ci the conceptualization, CMi the conceptual model, DMi the data model, and QLi the query language layer of Si Consider now two sources S1 and S2 We may have several forms of heterogeneity between these sources, specifically, there are 25 = 32 different cases (due to the five layers) For example, the case D1 = D2 , C1 = C2 , CM1 = CM2 means that S1 and S2 have the same conceptualization of the (same) domain, but they employ different conceptual models Even if the conceptual models are expressed using the same structuring mechanisms (e.g., generalization, attribution), they may differ due to: • Different naming conventions (also called naming conflicts) A frequent phenomenon is the presence of homonyms and synonyms • Different scaling schemes These occur when different reference systems are used to measure a value, e.g., ft vs 0.304 m, 23◦ C vs 73◦ F • Different levels of granularity For example, CM1 may contain only a class Cameras, while CM2 may contain the classes StillCameras and MovingPictureCameras • Structural differences For example, CM1 may contain a class Person having an attribute owns whose range is a class ArtificialObject, and a class Car defined as a specialization of ArtificalObject, while CM2 may contain a class Car having an attribute owner whose range is the class Person As another example, the case D1 = D2 , C1 
= C2 , CM1 = CM2 , DM1 = DM2 means that S1 and S2 have same conceptual model, but these models are represented differently in the data model layer Note that, even if S1 and S2 employ the same data model, e.g., the relational, DM1 and DM2 may differ in that they may represent the conceptual model differently Y Tzitzikas et al.: Mediators over taxonomy-based information sources String name 1:1 Person String String name worksAt 1:1 String (a) 1:1 Person 131 name 1:1 worksAt 1:1 1:n Department (b) Fig 25 Two conceptual models of the CSD domain Remarks and analogies Let us now give some remarks and discuss some analogies between our approach and other mediator approaches that have emerged • An important remark is that, given an existing source, we usually have at our disposal only its data model and query language layer, and more often than not from these two layers we cannot infer the conceptual model or the conceptualization layer of the source For example, consider the following relation scheme: PERSON(name:Str, worksAt:Str) The underlying conceptual model could be any of those shown in Fig 25, as the translations of both a and b to the relational model (by using an algorithm such as the one described in [9]) are identical Note that, according to a, the domain consists of entities of one kind, i.e., persons, while according to b, the domain consists of two kinds of entities: persons and departments Moreover, although two sources may have the same conceptual model, e.g., conceptual model a, their representation in the data model may differ For example, the conceptual model a could be represented in the relational model by one relation scheme (as we saw before) or by the following two relational schemes: PERSON(name:Str, worksAt:Int) and DEP(depId:Int, name:Str) We believe that this is the basic reason why information integration is a difficult and laborious task • According to the layered view described above, the sources of our mediators (a) may have different domains (i.e., may index different sets of objects), (b) conceptualize their domains similarly (i.e., all Ci are denumerable sets of objects), (c) may have different conceptual models (i.e., different taxonomies), (d) may have different query languages (recall the remark at the end of Sect 6.1) • In relational mediators (see [31] for a review), the mediator view is represented as a relational database schema Relational mediators have some critical differences with our mediators Relational mediators and their sources are schema-based, while our mediators and their sources are taxonomy-based Also, recall that the relational model is value-based, not object-based This implies that the conceptualization and the conceptual model of a relational source is hidden, or unclear Therefore, mediators over such sources “work” on the data model layer Instead, we propose a totally different conceptual modeling approach for both sources and mediators Concerning source descriptions, we can distinguish the local-as-view (LAV) and the global-as-view (GAV) approaches (see [13,46] for a comparison) In the LAV approach, the source relations are defined as relational views over the mediator’s relations, while in the GAV approach • • • • the mediator relations are defined as views of the source relations The former approach offers flexibility in representing the contents of the sources, but query answering is “hard” because this requires answering queries using views [24,36,74] On the other hand, the GAV approach offers easy query answering (expansion of 
queries until arriving at the source relations), but the addition/deletion of a source implies updating the mediator view, i.e., the definition of the mediator relations It is worth mentioning here that as our articulations contain relationships between single terms, these kinds of mappings enjoy the benefits of both GAV and LAV approaches, i.e., they have (a) the query processing simplicity of the GAV approach, as query processing basically reduces to unfolding the query using the definitions specified in the mapping so as to translate the query in terms of accesses (i.e., queries) to the sources, and (b) the modeling scalability of the LAV approach, i.e., the addition of a new underlying source does not require changing the previous mappings On the other hand, termto-query articulations (presented in Sect 6.1) resemble the GAV approach Concerning the translation facilities, relational mediators attempt to construct exact translations of SQL queries, while our mediators allow approximate translations of boolean expressions through their articulations We might say that the answers returned by a relational mediator correspond to the answers returned by a taxonomy-based me− diator in the Il− model Moreover, in several approaches (e.g., in Infomaster) a predicate corresponding to a source relation can appear only in the head or in the tail of a rule This means that granularity heterogeneities cannot be tackled easily A different approach to mediators can be found in [12], which presents the fundamental features of a declarative approach to information integration based on description logics The authors describe a methodology for integrating relational sources, and they resort to very expressive logics to bridge the heterogeneities between the unified view of the mediator and the source views However, the reasoning services for supporting translations have exponential complexity, as opposed to the complexity of our mediators, which is polynomial In addition, the eight possible answers of our approach allow one to provide a novel query relaxation facility One difference between our approach and the system OBSERVER [41,42,52] is that OBSERVER requires merging the ontologies of all underlying sources Instead, we just articulate the taxonomies of the sources with the taxonomy of the mediator Moreover, the compatibility condition introduced here allows the mediator to draw conclusions about the structure of a source taxonomy without having to store that taxonomy In the approximate query mapping approach of [14,15], the translated queries minimally subsume the original ones However, the functionality offered by our mediators is different, first because we support negation while they not, and second because our mediators support multiple operation modes, one of which is the case where the translated queries subsume the original ones An alternative solution to the problem of query relaxation in mediators is query repairing described in [8] If the sub- 132 Y Tzitzikas et al.: Mediators over taxonomy-based information sources mitted query yields no answer, then the mediator provides to the user an answer to a “similar” query The selection of this query is based on a measure of similarity between the concepts and the predicates, which is based on the taxonomic structure of the mediator’s ontology In our opinion, the eight answer models of our mediators offer a better founded approach to query relaxation Concluding remarks We have presented an approach to providing uniform access to multiple 
taxonomy-based sources through mediators that render the heterogeneities (naming, contextual, granularity) of the sources transparent to users This paper integrates and extends the work presented in [70] and [73] and was inspired by the approach presented in [65] A user of the mediator, apart from being able to pose queries in terms of a taxonomy that was not used to index the objects of the sources being searched, gets an answer comprised of objects that are accompanied by descriptions over the mediator’s taxonomy A mediator is seen as just another source, but without a stored interpretation An interpretation for the mediator is defined based on the interpretations stored at the sources and on the articulations between the mediator and the sources; and in fact, we have seen eight different ways of defining a mediator interpretation depending on the nature of the answers that the mediator provides to its users (Table 1) Since the resulting mediator models are ordered, they can be used to support a form of query relaxation Articulations can be defined by humans, but they can also be constructed automatically or semiautomatically in some specific cases, following a model-driven approach (e.g., [2, 53,64]) or a data-driven approach (e.g., [3,23,37,44,60,68]) The distinctive features of our approach are the following: • We assume that all sources have the same domain and the same conceptualization of that domain The intended domain is the Web, and each source views the Web as a set of objects Obj (URLs) and stores information about a subset of it (i.e., Oi ⊆ Obj) This means that each object has a unique identity over all sources From this point of view, our mediators may be called object-oriented, as opposed to mediators over relational sources, which may be called value-oriented • We consider that the conceptual layer of each source is a triple (T, , I) This conceptual modeling approach has two main advantages: (a) It is easy to create the conceptual model of a source or a mediator, and (b) the integration of information from multiple sources can be done easily Indeed, articulations offer a uniform method of bridging naming, contextual, and granularity heterogeneities between the conceptual models of the sources Given this conceptual modeling approach, the mediator does not have to tackle complex structural differences between the sources (as happens with relational mediators) Moreover, it allows the integration of schema and data in a uniform manner For example, consider a source S having the conceptual model shown in Fig 3a and a source S having the conceptual model shown in Fig 3b, and suppose that both sources are implemented in the relational model In source S, the concept wood would be represented at the data level (it would be an element of the domain of an attribute), while in S , it would be a relation Furthermore, this approach makes the automatic construction of articulations feasible [68] Summarizing, the taxonomy-based mediation approach presented here offers the following advantages: • Easy construction of mediators A mediator can be easily constructed even by ordinary Web users Indeed, the simple conceptual modeling approach that we adopt makes the definition of the mediator’s taxonomy and articulations very easy • Query relaxation Often a query to a mediator yields no answer The sure and the possible answers of sources, as well as a mediator’s several modes of operation, offer a solution to this problem • Efficient query evaluation The time complexity of query translation at 
the mediator is linear with respect to the size of the subsumption relations of the mediator • Scalability Articulation (instead of merging) enables a natural, incremental evolution of a network of sources The taxonomies employed by Web catalogs contain very large numbers of terms (e.g., the taxonomy of Open Directory contains 450,000 terms) Therefore, the articulation of taxonomies has several advantages compared to taxonomy merging First, merging would introduce storage and performance overheads Second, full merging is a laborious task that in many cases does not pay off because the integrated taxonomy becomes obsolete when the taxonomies involved change Another problem with full merging is that it usually requires full consistency, which may be hard to achieve in practice, while articulation can work on locally consistent parts of the taxonomies involved However, note that the taxonomies considered here present no consistency problems There may only be long cycles of subsumption relationships that induce big classes of equivalent terms • Applicability The taxonomy-based approach presented provides a flexible and formal framework for integrating data from several sources and/or for personalizing the contents of one or more sources The taxonomies considered fit quite well with the content-based organizational structure of Web catalogs (e.g., Yahoo!, Open Directory), keyword hierarchies (e.g., ACM’s thesaurus), XFML [1] taxonomies, and personal bookmarks By defining a mediator, the user can employ his own terminology to access and query several Web catalogs, specifically those parts of the catalogs that are of interest to him Moreover, as a mediator can also have a stored interpretation, our approach can lead to a network of articulated sources Recall that a mediator can translate the descriptions of the objects returned by the underlying sources This implies that all (or some) of these objects can be straightforwardly stored in the mediator base (under terms of the mediator taxonomy) An interesting line of research would be to investigate query evaluation and updating in a network of articulated sources Another interesting line would be to investigate how Y Tzitzikas et al.: Mediators over taxonomy-based information sources the mediator can exploit the object indexes that are returned by a compatible source in order to check whether that source remains compatible If we consider sources that answer queries by returning an ordered set of objects, the mediator should also return ordered sets of objects It would then be interesting to investigate whether the work presented in this paper can be integrated with the work presented in [67] and [26] 133 Proposition If all sources are compatible with the mediator, then − (1) Il− − Iu− − (2) Il+ − Iu+ + (3) Il− + Iu− + (4) Il+ + Iu+ Appendix: Proofs Proposition If I is an interpretation of T , then I − is the unique minimal model of T that is greater than or equal to I Proof (I − is a model of T ) t t ⇒ tail(t) ⊆ tail(t ) ⇒ {I(s) | s ∈ tail(t)} ⊆ {I(s) | s ∈ tail(t )} ⇒ I − (t) ⊆ I − (t ) Thus I − is a model of T (I − is the unique minimal model of T which is greater than I.) 
Let $I'$ be a model of T that is larger than I. Below we prove that $I^- \sqsubseteq I'$. By the definition of $I^-(t)$, if $o \in I^-(t)$, then either $o \in I(t)$ or $o \in I(s)$ for a term s such that $s \preceq t$. However, if $o \in I(t)$, then $o \in I'(t)$ too because $I'$ is larger than I, and if $o \in I(s)$ for a term s such that $s \preceq t$, then $o \in I'(t)$ too because $I'$ is a model of T. We conclude that for every $o \in I^-(t)$ it holds that $o \in I'(t)$. Thus $I^-$ is the unique minimal model of T that is larger than I.

Proposition If I is an interpretation of T, then $I^+$ is a model of T and $I \sqsubseteq I^- \sqsubseteq I^+$.

Proof ($I^+$ is a model of T.)
$$t \preceq t' \Rightarrow \{u \mid u \in head(t)\} \supseteq \{u \mid u \in head(t')\} \Rightarrow \{u \mid u \in head(t) \text{ and } u \not\sim t\} \supseteq \{u \mid u \in head(t') \text{ and } u \not\sim t'\}$$
$$\Rightarrow \bigcap\{I(u) \mid u \in head(t) \text{ and } u \not\sim t\} \subseteq \bigcap\{I(u) \mid u \in head(t') \text{ and } u \not\sim t'\} \Rightarrow I^+(t) \subseteq I^+(t').$$
($I^- \sqsubseteq I^+$.) Clearly, if $t \in T$, $u \in head(t)$, and $u \not\sim t$, then in every model I of T we have $I(t) \subseteq I(u)$. Thus this also holds in the model $I^-$, i.e., $I^-(t) \subseteq I^-(u)$. From this we conclude that for every $t \in T$: $I^-(t) \subseteq \bigcap\{I^-(u) \mid u \in head(t) \text{ and } u \not\sim t\} = I^+(t)$. Thus $I^- \sqsubseteq I^+$.

Proof (of the compatibility proposition) Let t be a term of T. Clearly, for every $s \in tail_i(t)$ and $u \in head_i(t)$ it holds that $s\,a_i\,u$ (because $s\,a_i\,t$ and $t\,a_i\,u$). Since the source $S_i$ is compatible, we know that $s\,a_i\,u \Rightarrow s \preceq_i u$. This implies that in every model $I_i$ of $T_i$ it holds that $\bigcup\{I_i(s) \mid s\,a_i\,t\} \subseteq I_i(t_{i_l}) \subseteq I_i(t_{i_u})$. From $I_i(t_{i_l}) \subseteq I_i(t_{i_u})$ we infer that $I_i^-(t_{i_l}) \subseteq I_i^-(t_{i_u})$ and $I_i^+(t_{i_l}) \subseteq I_i^+(t_{i_u})$. From this we obtain propositions (1)–(4). For example, the proof of proposition (1), i.e., $I^-_{l^-} \sqsubseteq I^-_{u^-}$, and proposition (3), i.e., $I^+_{l^-} \sqsubseteq I^+_{u^-}$, is derived as follows: since for all $t \in T$ and all $i = 1, \dots, k$ it holds that $I_i^-(t_{i_l}) \subseteq I_i^-(t_{i_u})$, we conclude that $\bigcup_{i=1}^{k} I_i^-(t_{i_l}) \subseteq \bigcup_{i=1}^{k} I_i^-(t_{i_u})$, i.e., $I_{l^-}(t) \subseteq I_{u^-}(t)$, and hence $I^-_{l^-} \sqsubseteq I^-_{u^-}$ ($I_1 \sqsubseteq I_3$) and $I^+_{l^-} \sqsubseteq I^+_{u^-}$ ($I_5 \sqsubseteq I_7$).

Proposition The answer models of the mediator are ordered as follows:
(a) $I^-_{l^-} \sqsubseteq I^-_{l^+}$
(b) $I^-_{u^-} \sqsubseteq I^-_{u^+}$
(c) $I^+_{l^-} \sqsubseteq I^+_{l^+}$
(d) $I^+_{u^-} \sqsubseteq I^+_{u^+}$
(e) $I^-_{l^-} \sqsubseteq I^+_{l^-}$
(f) $I^-_{l^+} \sqsubseteq I^+_{l^+}$
(g) $I^-_{u^-} \sqsubseteq I^+_{u^-}$
(h) $I^-_{u^+} \sqsubseteq I^+_{u^+}$

Proof The proofs of propositions (a)–(d) derive easily from the fact that in every model $I_i$ of a source $S_i$ it holds that $I_i^- \sqsubseteq I_i^+$. The proofs of propositions (e)–(h) derive easily from the fact that in every model I of the mediator it holds that $I^- \sqsubseteq I^+$.

Proposition If $q = t \in T$, then $I_{l^-}(t)$ and $I_{u^-}(t)$ can be evaluated as follows:
$$I_{l^-}(t) = \bigcup_{i=1}^{k} I_i\bigl(q_l^i(t)\bigr), \quad \text{where } q_l^i(t) = \bigvee\{s_{i_l} \mid s \preceq t\}$$
$$I_{u^-}(t) = \bigcup_{i=1}^{k} I_i\bigl(q_u^i(t)\bigr), \quad \text{where } q_u^i(t) = \bigvee\{s_{i_u} \mid s \preceq t\}.$$

Proof
$$I_{l^-}(t) = \bigcup\{I_l(s) \mid s \in tail(t)\} = \bigcup\{I_l(s) \mid s \preceq t\} = \bigcup\Bigl\{\bigcup_{i=1}^{k} I_i(s_{i_l}) \;\Big|\; s \preceq t\Bigr\} = \bigcup_{i=1}^{k} I_i\Bigl(\bigvee\{s_{i_l} \mid s \preceq t\}\Bigr) = \bigcup_{i=1}^{k} I_i\bigl(q_l^i(t)\bigr).$$
Analogously, we prove that $I_{u^-}(t) = \bigcup_{i=1}^{k} I_i\bigl(q_u^i(t)\bigr)$.

Proposition $D^-(o) = \bigcup\{head(t) \mid o \in I(t)\}$.

Proof
$$D^-(o) = \{t \in T \mid o \in I^-(t)\} = \{t \in T \mid o \in \bigcup\{I(s) \mid s \preceq t\}\} = \{t \in T \mid o \in I(s) \text{ and } s \preceq t \text{ for some } s\} = \bigcup\{head(s) \mid o \in I(s)\}.$$

Proposition $D^+(o) = \{t \mid head(t) \setminus \{t' \mid t' \sim t\} \subseteq D^-(o)\}$.

Proof
$$t \in D^+(o) \Leftrightarrow o \in I^+(t) \Leftrightarrow o \in \bigcap\{I^-(u) \mid u \in head(t) \text{ and } u \not\sim t\} \Leftrightarrow o \in I^-(u) \text{ for all } u \in head(t) \text{ s.t. } u \not\sim t$$
$$\Leftrightarrow u \in D^-(o) \text{ for all } u \in head(t) \text{ s.t. } u \not\sim t \Leftrightarrow head(t) \setminus \{t' \mid t' \sim t\} \subseteq D^-(o).$$

Proposition
$$D_l(o) = \bigcup_{i=1}^{k} D_l^i(o), \quad \text{where } D_l^i(o) = \{t \in T \mid \exists\, t_i \in D_i(o) \text{ such that } t_i \preceq t\}$$
$$D_u(o) = \bigcup_{i=1}^{k} D_u^i(o), \quad \text{where } D_u^i(o) = \{t \in T \mid (head_i(t) \neq \emptyset \text{ and } head_i(t) \subseteq D_i(o)) \text{ or } (head_i(t) = \emptyset \text{ and } \exists\, t_i \in D_i(o) \text{ such that } t_i \preceq t)\}.$$

Proof Consider a mediator over a single source $S_i$ and let o be an object stored at that source. Let $t_i \in D_i(o)$, where $I_i$ is the answer model of the source $S_i$ that is used by the mediator. If there exists $t \in T$ such that $t_i \preceq t$, then certainly $o \in I_l(t)$ (since $I_l(t) = \bigcup\{I_i(t_i) \mid t_i \preceq t\}$), and thus $t \in D_l(o)$. Hence $D_l(o) = \{t \in T \mid \exists\, t_i \in D_i(o) \text{ such that } t_i \preceq t\}$. However, since there are many sources, we denote the right part of the above formula by $D_l^i(o)$, and since an object may belong to more than one source, we arrive at the following: $D_l(o) = \bigcup_{i=1}^{k} D_l^i(o)$.

Consider again a mediator over a single source $S_i$ and let o be an object stored at that source. If there exists $t \in T$ such that $head_i(t) \neq \emptyset$ and $head_i(t) \subseteq D_i(o)$, then certainly $o \in I_u(t)$ (since $I_u(t) = \bigcap\{I_i(t_i) \mid t \preceq t_i\}$), and thus certainly $t \in D_u(o)$. If there exists $t \in T$ such that $head_i(t) = \emptyset$ and there is a $t_i \in D_i(o)$ with $t_i \preceq t$, then certainly $o \in I_u(t)$ (since in this case $I_u(t) = \bigcup\{I_i(t_i) \mid t_i \preceq t\}$). Hence $D_u(o) = \{t \in T \mid (head_i(t) \neq \emptyset \text{ and } head_i(t) \subseteq D_i(o)) \text{ or } (head_i(t) = \emptyset \text{ and } \exists\, t_i \in D_i(o) \text{ such that } t_i \preceq t)\}$. However, since there are many sources, we denote the right part of the above formula by $D_u^i(o)$, and since an object may belong to more than one source, we arrive at the following: $D_u(o) = \bigcup_{i=1}^{k} D_u^i(o)$.

Acknowledgements Part of this work was conducted while the first two authors were visiting the Meme Media Laboratory, Hokkaido University, Sapporo, Japan. Many thanks to Professor Tanaka, director of the Meme Media Laboratory, and to the University of Hokkaido for their hospitality. We also want to thank the anonymous referees for their comments, which improved this paper.

References

1. XFML: eXchangeable Faceted Metadata Language. http://www.xfml.org
2. Amann B, Fundulaki I (1999) Integrating ontologies and thesauri to build RDF schemas. In: Proceedings of the 3rd European conference for digital libraries (ECDL'99), Paris, France, 22 September 1999, pp 234–253
3. Amba S (1996) Automatic linking of thesauri. In: Proceedings of SIGIR'96, Zurich, Switzerland, 18–22 August 1996, pp 181–186. ACM Press, New York
4. Ambite JL, Ashish N, Barish G, Knoblock CA, Minton S, Modi PJ, Muslea I, Philpot A, Tejada S (1998) Ariadne: a system for constructing mediators for Internet sources. In: Proceedings of the ACM SIGMOD international conference on management of data, Seattle, 2–4 June 1998, pp 561–563
5. Baeza-Yates R, Ribeiro-Neto B (1999) Modern information retrieval. ACM Press/Addison-Wesley, Reading, MA
6. Baumgarten C (1999) Probabilistic information retrieval in a distributed heterogeneous environment. PhD thesis, Technical University of Dresden, Dresden, Germany
7. Benjamins VR, Fensel D (1998) Community is knowledge!
in (KA)2 In: Proceedings of KAW’98, Alberta, Canada, 18–23 April, 1998 Bidault A, Froidevaux C, Safar B (2000) Repairing queries in a mediator approach In: Proceedings of the ECAI’00, Berlin, 20–25 August 2000, pp 406-410 Boman M, Bubenko JA, Johannesson P, Wangler B (1997) Conceptual modelling Prentice-Hall, Upper Saddle River, NJ 10 Boolos G (1998) Logic, logic and logic Harvard University Press, Cambridge, MA 11 Callan JP, Lu Z, Croft WB (1995) Searching distributed collections with inference networks In: Proceedings of the 18th international conference on research and development in information retrieval, Seattle, 9–13 July, 1995, pp 21–18 12 Calvanese D, de Giacomo G, Lenzerini M, Nardi D, Rosati R (1998) Description logic framework for information integration In: Proceedings of the 6th international conference on the principles of knowledge representation and reasoning (KR-98), Trento, Italy, 2–5 June 1998, pp 2–13 13 Calvanese D, de Giacomo G, Lenzerini M (2001) A framework for ontology integration In: Proceedings of the 2001 international Semantic Web working symposium (SWWS 2001), Stanford, CA, 30 July–1 August 2001, pp 303–316 14 Chang C-CK, Garc´ıa-Molina H (1999) Mind your vocabulary: query mapping across heterogeneous information sources In: Proceedings of the ACM SIGMOD, Philadelphia, 1–3 June 1999, pp 335–346 15 Chang C-CK, Garc´ıa-Molina H (2001) Approximate query mapping: accounting for translation closeness J Very Large Databases 10(2–3):155–181 16 Chawathe S, Garcia-Molina H, Hammer J, Ireland K, Papakonstantinou Y, Ullman J, Widom J (1994) The TSIMMIS project: integration of heterogeneous information sources In: Proceedings of IPSJ, Tokyo, October 1994, pp 7–18 17 Cluet S, Delobel C, Sim´eon J, Smaga K (1998) Your mediators need data conversion! 
The VLDB Journal (2005) 14 / Digital Object Identifier (DOI) 10.1007/s00778-004-0143-3

Editorial

M. Tamer Özsu

Published online: March 2005 – © Springer-Verlag 2005

At the start of a new year and a new volume of The VLDB Journal (VLDBJ), I would like to give a report to our readers on our activities and on how we are doing. My hope is that these reports will become a tradition that will enable us to track the progress of VLDBJ over the years.

Our objective is to make VLDBJ the preferred publication venue for database researchers. This requires attention to paper quality, timely reviews, review quality, and speedy publication of accepted papers. Here is how we have done on these fronts in 2004.

We published 22 papers over the year's issues. Two of these issues were special issues: one was devoted to selected papers from the 2003 VLDB Conference (6 papers) and the other was a thematic special issue devoted to data stream processing (5 papers). Submissions to VLDBJ continue to increase; 2004 submissions were 9% higher than 2003 submissions, and since 1999 this number has doubled. Our annual acceptance rate is holding steady at around 30% (with respect to papers accepted in a year regardless of their submission year). However, acceptance rates for papers submitted within the same year are lower: for 2003, the last year for which we have completed reviewing all the submitted papers, the rate is 20%. Judging by where we are with the 2004 submissions, the rate will likely be slightly lower.

We have been putting considerable effort into reducing our review times. I am happy to report that the average submission-to-acceptance time for papers submitted in 2004 was 76 days, while the submission-to-publication time was 141 days. These compare with 99 days and 134 days, respectively, in 2003. The slight increase in submission-to-publication time in 2004 is a direct result of the increasing submission volume while the annual page count remains the same (thus causing a slight publication backlog). By comparison, the same numbers in 2002 were 184 and 215, respectively. We have indeed made significant strides in reducing our review times, and we continue to pay attention to this issue.

VLDBJ is now reaching an increasing number of readers. The agreement reached in 2003 between the VLDB Endowment, Springer, and ACM that ensures the inclusion of VLDBJ in the ACM Digital Library has had a very positive impact. In the Internet era, with online journal publication, the way to measure impact has shifted from subscriptions to article downloads. In 2004, there were 55,344 full paper downloads: 47,311 from the ACM Digital Library and 8,033 from Springer LINK. The comparable numbers for 2003 were 20,282 and 7,483, respectively, but it should be remembered that VLDBJ was included in the ACM Digital Library starting only in April/May 2003. Nevertheless, even if the ACM numbers are extrapolated to the entire 2003 year, we experienced a 46% increase in downloads from 2003 to 2004.

These statistics indicate that we are making significant progress toward our objective of making VLDBJ the preferred publication venue for database papers. We are always open to ideas and suggestions and look forward to hearing from you.

M. Tamer Özsu
Coordinating Editor-in-Chief
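The reported 46% growth can be checked with a quick back-of-the-envelope calculation. The sketch below is a minimal illustration, not part of the editorial itself; it assumes the 2003 ACM Digital Library count covers roughly the eight months from May through December 2003 (since inclusion began in April/May 2003), extrapolates that count to a full year, and compares the yearly totals.

```python
# Back-of-the-envelope check of the reported 46% growth in downloads.
# Assumption (not stated explicitly in the editorial): the 2003 ACM figure
# covers roughly the eight months May-December 2003.

acm_2004, springer_2004 = 47_311, 8_033   # full-year 2004 downloads
acm_2003, springer_2003 = 20_282, 7_483   # partial-year ACM, full-year Springer

months_covered_2003 = 8                   # assumed coverage of the 2003 ACM count
acm_2003_full_year = acm_2003 * 12 / months_covered_2003   # about 30,423

total_2004 = acm_2004 + springer_2004                      # 55,344
total_2003_extrapolated = acm_2003_full_year + springer_2003  # about 37,906

growth = total_2004 / total_2003_extrapolated - 1
print(f"Estimated growth: {growth:.0%}")  # prints roughly 46%
```

Under this assumed coverage period the computed growth matches the 46% quoted above; assuming a longer or shorter coverage period for the 2003 ACM count would shift the estimate accordingly.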
