Advances in Database Technology - P9
382 J. Zhang et al.

Fig. 21. Cost vs. …   Fig. 22. Cost vs. …

… made in [CMTV00]). On the other hand, the density of S does not significantly affect the accesses to the obstacle R-tree, because high density leads to closer distances between the Euclidean pairs. The CPU time of the algorithm (shown in Fig. 21b) grows fast, because the dominant factor is the computation required for obtaining the Euclidean closest pairs (as opposed to obstructed distances).

Fig. 22 shows the cost of the algorithm for different values of k. The page accesses for the entity R-trees (caused by the Euclidean CP algorithm) remain almost constant, since the major cost occurs before the first pair is output (i.e., the k closest pairs are likely to be in the heap after the first Euclidean NN is found, and are returned without extra I/Os). The accesses to the obstacle R-tree and the CPU time, however, increase with k, because more obstacles must be taken into account during the construction of the visibility graphs.

Conclusion

This paper tackles spatial query processing in the presence of obstacles. Given a set of entities P and a set of polygonal obstacles O, our aim is to answer spatial queries with respect to the obstructed distance metric, which corresponds to the length of the shortest path that connects two points without passing through obstacles. This problem has numerous important real-life applications, and several main-memory algorithms have been proposed in Computational Geometry. Surprisingly, there is no previous work for disk-resident datasets in the area of Spatial Databases. Combining techniques and algorithms from both aforementioned fields, we propose an integrated framework that efficiently answers most types of spatial queries (i.e., range search, nearest neighbors, e-distance joins and closest pairs) subject to obstacle avoidance. Making use of local visibility graphs and effective R-tree algorithms, we present and evaluate a number of solutions. Being the first thorough study of this problem in the context of massive datasets, this paper opens a door to several interesting directions for future work. For instance, as objects move in practice, it would be interesting to study obstacle queries for moving entities and/or moving obstacles.

References

[AGHI86] Asano, T., Guibas, L., Hershberger, J., Imai, H.: Visibility of Disjoint Polygons. Algorithmica 1, 49-63, 1986.
[BKOS97] de Berg, M., van Kreveld, M., Overmars, M., Schwarzkopf, O.: Computational Geometry, pp. 305-315. Springer, 1997.
[BKS93] Brinkhoff, T., Kriegel, H., Seeger, B.: Efficient Processing of Spatial Joins Using R-trees. SIGMOD, 1993.
[BKSS90] Beckmann, N., Kriegel, H., Schneider, R., Seeger, B.: The R*-tree: An Efficient and Robust Access Method. SIGMOD, 1990.
[CMTV00] Corral, A., Manolopoulos, Y., Theodoridis, Y., Vassilakopoulos, M.: Closest Pair Queries in Spatial Databases. SIGMOD, 2000.
[D59] Dijkstra, E.: A Note on Two Problems in Connexion with Graphs. Numerische Mathematik 1, 269-271, 1959.
[EL01] Estivill-Castro, V., Lee, I.: Fast Spatial Clustering with Different Metrics in the Presence of Obstacles. ACM GIS, 2001.
[G84] Guttman, A.: R-trees: A Dynamic Index Structure for Spatial Searching. SIGMOD, 1984.
[GM87] Ghosh, S., Mount, D.: An Output Sensitive Algorithm for Computing Visibility Graphs. FOCS, 1987.
[HS98] Hjaltason, G., Samet, H.: Incremental Distance Join Algorithms for Spatial Databases. SIGMOD, 1998.
[HS99] Hjaltason, G., Samet, H.: Distance Browsing in Spatial Databases. TODS 24(2), 265-318, 1999.
[KHI+86] Kung, R., Hanson, E., Ioannidis, Y., Sellis, T., Shapiro, L., Stonebraker, M.: Heuristic Search in Data Base Systems. Expert Database Systems, 1986.
[LW79] Lozano-Pérez, T., Wesley, M.: An Algorithm for Planning Collision-free Paths among Polyhedral Obstacles. CACM 22(10), 560-570, 1979.
[PV95] Pocchiola, M., Vegter, G.: Minimal Tangent Visibility Graph. Computational Geometry: Theory and Applications, 1995.
[PV96] Pocchiola, M., Vegter, G.: Topologically Sweeping Visibility Complexes via Pseudo-triangulations. Discrete Computational Geometry, 1996.
[PZMT03] Papadias, D., Zhang, J., Mamoulis, N., Tao, Y.: Query Processing in Spatial Network Databases. VLDB, 2003.
[R95] Rivière, S.: Topologically Sweeping the Visibility Complex of Polygonal Scenes. Symposium on Computational Geometry, 1995.
[SRF87] Sellis, T., Roussopoulos, N., Faloutsos, C.: The R+-tree: A Dynamic Index for Multi-Dimensional Objects. VLDB, 1987.
[SS84] Sharir, M., Schorr, A.: On Shortest Paths in Polyhedral Spaces. STOC, 1984.
[THH01] Tung, A., Hou, J., Han, J.: Spatial Clustering in the Presence of Obstacles. ICDE, 2001.
[W85] Welzl, E.: Constructing the Visibility Graph for n Line Segments in O(n²) Time. Information Processing Letters 20, 167-171, 1985.
[Web] http://www.maproom.psu.edu/dcw

NNH: Improving Performance of Nearest-Neighbor Searches Using Histograms

Liang Jin (1*), Nick Koudas (2), and Chen Li (1*)
(1) Information and Computer Science, University of California, Irvine, CA 92697, USA, {liangj,chenli}@ics.uci.edu
(2) AT&T Labs Research, 180 Park Avenue, Florham Park, NJ 07932, USA, koudas@research.att.com

Abstract. Efficient search for nearest neighbors (NN) is a fundamental problem arising in a large variety of applications of vast practical interest. In this paper we propose a novel technique, called NNH ("Nearest Neighbor Histograms"), which uses specific histogram structures to improve the performance of NN search algorithms. A primary feature of our proposal is that such histogram structures can co-exist with a plethora of NN search algorithms without the need to substantially modify them. The main idea behind our proposal is to choose a small number of pivot objects in the space, and pre-calculate the distances to their nearest neighbors. We provide a
complete specification of such histogram structures and show how to use the information they provide towards more effective searching. In particular, we show how to construct them, how to decide the number of pivots, how to choose pivot objects, how to incrementally maintain them under dynamic updates, and how to utilize them in conjunction with a variety of NN search algorithms to improve the performance of NN searches. Our intensive experiments show that nearest-neighbor histograms can be efficiently constructed and maintained, and when used in conjunction with a variety of algorithms for NN search, they can improve the performance dramatically.

Introduction

Nearest-neighbor (NN) searches arise in a large variety of applications such as image and video databases [1], CAD, information retrieval (IR) [2], data compression [3], and string matching/searching [4]. The basic version of the problem is to find the nearest neighbors of a query object in a database, according to a distance measurement. In these applications, objects are often characterized by features and represented as points in a multi-dimensional space. For instance, we often represent an image as a multi-dimensional vector using features such as histograms of colors and textures. A typical query in an image database is to find images most similar to a given query image utilizing such features. As another example, in information retrieval, one often wishes to locate documents that are most similar to a given query document, considering a set of features extracted from the documents [2].

* These two authors were supported by NSF CAREER award No. IIS-0238586 and a UCI CORCLR grant.

E. Bertino et al. (Eds.): EDBT 2004, LNCS 2992, pp. 385-402, 2004.
© Springer-Verlag Berlin Heidelberg 2004

386 L. Jin, N. Koudas, and C. Li

Variations of the basic problem include high-dimensional joins between point sets. For instance, an all-pair neighbor join between
two point sets seeks to identify the closest pairs among all pairs from the two sets [5,6]. An all-pair neighbor semi-join between two point sets reports, for each object in one data set, its nearest neighbors in the second set [5].

Many algorithms have been proposed to support nearest-neighbor queries. Most of them use a high-dimensional indexing structure, such as an R-tree [7] or one of its variations. For instance, in the case of an R-tree, these algorithms use a branch-and-bound approach to traverse the tree top down, and use distance bounds between objects and minimum bounding rectangles (MBR's) to prune branches that do not need to be considered [8,9]. A priority queue of interior nodes is maintained based on their distances to the query object. In the various forms of high-dimensional joins between point sets, a queue is maintained to keep track of pairs of objects or nodes in the two data sets.

One of the main challenges in these algorithms is to perform effective pruning of the search space, and subsequently achieve good search performance. The performance of such an algorithm depends heavily on the number of disk accesses (often determined by the number of branches visited in the traversal) and on its runtime memory requirements, i.e., the memory (for priority-queue storage) and processor resources needed to maintain and manipulate the queue. Performance can deteriorate if too many branches are visited and/or too many entries are maintained in the priority queue, especially in a high-dimensional space, due to the well-known "curse of dimensionality" problem [1].

In this paper we develop a novel technique to improve the performance of these algorithms by keeping histogram structures (called "NNH"). Such structures record the nearest-neighbor distances for a preselected collection of objects ("pivots"). These distances can be utilized to estimate the distance at which the neighbors for each query object can be identified. They can subsequently be used to improve the
performance of a variety of nearest-neighbor search and related algorithms via more effective pruning. The proposed histogram structures can co-exist with a plethora of NN algorithms without the need to substantially modify these algorithms.

There are several challenges associated with the construction and use of such structures: (1) the construction time should be small, their storage requirements should be minimal, and the estimates derived from them should be precise; (2) they should be easy to use towards improving the performance of a variety of nearest-neighbor algorithms; (3) such structures should support efficient incremental maintenance under dynamic updates. In this paper we provide a complete specification of such histogram structures, showing how to efficiently and accurately construct them, how to choose pivots effectively, how to incrementally maintain them under dynamic updates, and how to utilize them in conjunction with a variety of NN algorithms to improve the performance of searches.

The rest of the paper is organized as follows. We first give the formal definition of a nearest-neighbor histogram (NNH) structure. We then show how to use such histograms to improve the performance of a variety of NN algorithms, and discuss how to choose pivots in such a structure. Next, we discuss how to incrementally maintain an NN histogram structure in the presence of dynamic updates. Finally, we report our extensive experimental results, evaluating the construction time, the maintenance algorithms, and the efficiency of our proposed histograms when used in conjunction with a variety of algorithms for NN search to improve their performance. Due to space limitations, we provide more results in [10].

Related Work. Summary structures in the form of histograms have been utilized extensively in databases in a
variety of important problems, such as selectivity estimation [11,12] and approximate query answering [13,14]. In these problems, the main objective is to approximate the distribution of frequency values using specific functions and a limited amount of space.

Many algorithms exist for efficiently identifying the nearest neighbors of low- and high-dimensional data points for main-memory data collections in the field of computational geometry [15]. In databases, many different families of high-dimensional indexing structures are available [16], and various techniques are known for performing NN searches tailored to the specifics of each family of indexing structures. Such techniques include NN searches for the entity-grouping family of indexing structures (e.g., R-trees [7]) and NN searches for the space-partitioning family (e.g., Quad-trees [17]).

In addition to the importance of NN queries as stand-alone query types, a variety of other query types make use of NN searches. Spatial or multidimensional joins [18] are a representative example of such query types. Different algorithms have been proposed for spatial NN semi-joins and all-pair NN joins [5]. NN search algorithms can benefit from the histogram structures proposed in this paper, enabling them to perform more effective pruning. For example, utilizing a good estimate of the distance to the nearest neighbor of a query point, one can essentially form a range query to identify nearest neighbors, using the query object and the estimated distance and treating the search algorithm as a "black box" without modifying its code. Various studies [19,20,21,22] use notions of pivots or anchors or foci for efficient indexing and query processing. We will compare our approach with these methods in Section 6.4.

NNH: Nearest-Neighbor Histograms

Consider a data set D of objects in a Euclidean space. Formally, there is a distance function Δ such that, given two objects o1 and o2, their distance is defined as

Δ(o1, o2) = sqrt( Σ_i (o1[i] − o2[i])² ),

where o1[i] and o2[i] are the coordinates of objects o1 and o2, respectively.

Definition (NN Search). Given a query object q, a k-nearest-neighbor (k-NN) search returns the k points in the database that are closest to q.

Given an object in the space, its NN distance vector of size T is a vector in which the i-th element is the distance to the object's i-th nearest neighbor in the database D. A nearest-neighbor histogram (NNH) of the data set, denoted H, is a collection of objects (called "pivots") together with their NN vectors. In principle, these pivot points may or may not correspond to points in the database. In the rest of the paper, we assume that the pivots are not part of the data set, for the purpose of easy dynamic maintenance, as discussed later. Initially all the vectors have the same length, denoted T, which is a design parameter and forms an upper bound on the number of neighbors that NN queries may specify as their desired result. We choose to fix the value of T in order to control the storage requirement for the NNH structure [23]. For any pivot, the histogram H thus records the distance to each of its T nearest neighbors.

Fig. An NN histogram structure H

The figure shows such a structure: a set of pivots, each of which has an NN vector of size T. Later we discuss how to choose pivots to construct a histogram. Once we have chosen the pivots, for each of them we run a T-NN search to find all its T nearest neighbors. Then, for each pivot, we calculate the distance from each of these nearest neighbors to the pivot, and use these T distances to construct the pivot's NN vector.

Improving Query Performance

In this section we discuss how to utilize the information captured by nearest-neighbor histograms to improve query performance. A number of important queries could benefit from such histograms, including neighbor search and various forms of high-dimensional joins between point sets, such as all-pair neighbor joins (i.e., finding the closest pairs among all the
pairs between two data sets) and neighbor semi-joins (i.e., finding NN's in the second data set for each object in the first set).

The improvements to such algorithms are twofold: (a) the processor time can be reduced substantially due to advanced pruning; this reduction is important since, for data sets of high dimensionality, distance computations are expensive and processor time becomes a significant fraction of the overall query-processing time [24]; and (b) the memory requirement can be reduced significantly due to the much smaller queue size.

Most NN algorithms assume a high-dimensional indexing structure on the data set. To simplify our presentation, we choose to highlight how to utilize our histogram structures to improve the performance of common queries involving R-trees [7]. Similar concepts carry over easily to other structures (e.g., SS-tree [25], SR-tree [26], etc.)
and algorithms as well.

3.1 Utilizing Histograms in Queries

A typical search involving R-trees follows a branch-and-bound strategy, traversing the index top down. It maintains a priority queue of active minimum bounding rectangles (MBR's) [8,9]. At each step, it maintains a bound on the distance from the query point; this bound is initialized to infinity when the search starts. Using the geometry of an MBR mbr and the coordinates of a query object q, an upper-bound distance from q to the nearest point in mbr, namely MINMAXDIST, can be derived [27]. In a similar fashion, a lower bound, namely MINDIST, is derived. Using these estimates, we can prune MBR's from the queue as follows: (1) for an MBR mbr, if its MINDIST is greater than the MINMAXDIST of another MBR, then mbr can be pruned; (2) if the MINDIST of an MBR mbr is greater than the current distance bound, then this mbr can be pruned.

In the presence of NN histograms, we can utilize the recorded distances to estimate an upper bound on the k-NN distance of the query object q. Recall that the NN histogram includes the NN distances for selected pivots only, and it does not convey immediate information about the NN distances of objects not encoded in the histogram structure. We can estimate the distance from q to its k-th nearest neighbor using the triangle inequality between q and any pivot in the histogram: the distance from q to a pivot plus that pivot's recorded k-NN distance is an upper bound, and taking the minimum of this bound over all pivots yields the estimate. The complexity of this estimation step is linear in the number of pivots. Since the histogram structure is small enough to fit in memory, and the number of pivots is small, the above estimation can be conducted efficiently. (Note that, as assumed in this paper, the pivots are not part of the database.) After computing the initial estimate, we can use this distance to help the search process prune MBR's. That is, the search progresses by evaluating MINMAXDIST and MINDIST between q and an R-tree MBR mbr as before. Besides the standard pruning steps discussed above, the algorithm also checks whether the MINDIST of an MBR exceeds the estimate. If so,
then this mbr can be pruned; thus we do not need to insert it into the queue, reducing the memory requirement (queue size) and the later operations on this MBR in the queue.

Notice that the algorithm in [8,9] is shown to be I/O optimal [28]. Utilizing our NNH structure may not reduce the number of I/Os during the R-tree traversal. However, our structure can help reduce the size of the priority queue and the number of queue operations. This reduction can help reduce the running time of the algorithm, as shown in our experiments. In addition, if the queue becomes too large to fit into memory, this reduction could even help us reduce the I/O cost, since part of the queue would otherwise need to be paged to disk.

3.2 Utilizing Histograms in Joins

A related methodology can also be applied in the case of all-pairs join queries using R-trees. The bulk of the algorithms for this purpose progress by inserting pairs of MBR's between index nodes from the corresponding trees into a priority queue and recursively (top-down) refining the search. In this case, using histograms, one can perform, in addition to the type of pruning highlighted above, even more powerful pruning. More specifically, let us see how to utilize such a histogram for pruning in a semi-join search [5]. This problem finds, for each object in one data set, its nearest neighbors in another data set. (This pruning technique can be easily generalized to other join algorithms.)
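As a concrete illustration of the estimate-based pruning described above, the following sketch computes the triangle-inequality upper bound on a query point's k-NN distance from a nearest-neighbor histogram, and uses it to discard an MBR whose MINDIST already exceeds the bound. The function and variable names are ours, not the paper's, and the histogram is represented simply as a list of (pivot, NN-vector) pairs:

```python
import math

def knn_upper_bound(q, pivots, k):
    """Upper-bound the distance from query point q to its k-th nearest
    neighbor.  For any pivot p with recorded k-NN distance r, the
    triangle inequality gives d(q, kNN(q)) <= d(q, p) + r; we take the
    minimum of this bound over all pivots."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # pivots: list of (point, nn_vector) pairs, where nn_vector[k-1]
    # is the distance from the pivot to its k-th nearest neighbor.
    return min(dist(q, p) + nn_vec[k - 1] for p, nn_vec in pivots)

def prune_mbr(mindist_q_mbr, estimate):
    """An MBR can be discarded (never enqueued) if even its closest
    point lies beyond the estimated k-NN radius of the query."""
    return mindist_q_mbr > estimate
```

For example, with a single pivot at the origin whose 1-NN distance is 1.0, a query point at distance 5.0 from the pivot gets the upper bound 6.0, so any MBR with MINDIST above 6.0 can be pruned without being inserted into the priority queue.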
If the two data sets are the same, the join becomes a self semi-join, i.e., finding the nearest neighbors of every object in the data set. Assume the two data sets are indexed by two R-trees. A preliminary algorithm described in [5] keeps a priority queue of MBR pairs between index nodes from the two trees. In addition, for each object in the first data set, we keep track of how many of its nearest neighbors in the second set have already been reported: we can output nearest neighbors of an object while traversing the trees, so we only need to keep a counter for the object, and we stop searching for its neighbors when this counter reaches the desired number.

Suppose we have constructed a histogram for the second data set. We can utilize it to prune portions of the search space in bulk, by pruning pairs from the priority queue as follows. For each object in the first data set, besides the information kept by the original search algorithm, we also keep an estimated radius of its NN distance in the second set. We can obtain an initial estimate as before (Section 3.1). Similarly, we will not insert an object-MBR pair into the queue if the MINDIST between the object and the MBR exceeds this estimate. In this way we can prune more object-MBR pairs.

Now we show that this histogram is even capable of pruning MBR-MBR pairs. For each MBR in the first tree, consider a pivot in the NN histogram of the second data set. For any object inside the MBR, by the triangle inequality, the maximum distance from the MBR to the pivot plus the pivot's recorded NN distance is an upper bound on the distance from that object to its nearest neighbors in the second set. This bound can therefore be used to prune any MBR-MBR pair from the queue whose MINDIST (the lower bound of the distance between any pair of objects from the two MBR's) exceeds it.

In order to use this technique for pruning, in addition to keeping an estimate of the NN distance for each object in the first data set, we also keep an estimate of the distance for each MBR of its tree. This number
tends to be much smaller than the number of objects. These MBR-distance estimates can be used to prune many MBR-MBR pairs, reducing the queue size and the number of disk I/Os.

Pruning in all-pair neighbor joins. The pruning techniques described above can be adapted to perform more effective pruning when finding the closest pairs among all pairs of objects from two data sets [5,6]. Notice that so far we have assumed that only the second data set has a histogram. If the first data set also has a histogram, similar pruning steps can be applied using it, since the two sets are symmetric in this join problem.

Constructing NNH Using Good Pivots

Pivot points are vital to NNH for obtaining distance estimates when answering NN queries. We now turn to the problems associated with the choice of pivot points and the number of pivots. Assume we decide to choose a certain number of pivot points; the storage requirement for NNH is proportional to this number, since for each pivot point we associate a vector of distances to its T nearest neighbors. Given a query point, we can obtain an estimate of its k-NN distance by returning the minimum, over all pivots, of the distance from the query point to the pivot plus the pivot's recorded k-NN distance. This estimate is an upper bound of the real distance, obtained by utilizing the triangle inequality between the query point and a pivot point. The estimation error is the difference between this upper bound and the true k-NN distance of the query point.

Clustering Multidimensional Extended Objects 417

… page sizes would trigger the creation of too many tree nodes, resulting in high overheads due to numerous node accesses, both in memory and on disk.

Execution Parameters and Performance Indicators. The following parameters are varied in our tests: the number of database objects (up to 2,000,000), the number of dimensions (from 16 to 40), and the query selectivity (between 0.00005% and 50%). In each experiment, a large number of spatial queries is addressed to the indexing structure, and average values are collected for the following performance indicators: query execution time
(combining all costs), number of accessed clusters/nodes (relevant for the cost of disk access operations), and size of verified data (relevant for data transfer and checking costs).

Experimental Process. For Sequential Scan, the database objects are loaded and stored in a single cluster; queries are launched, and the performance indicators are collected. For the R*-tree, the objects are first inserted into the indexing structure, then query performance is evaluated. For Adaptive Clustering, the database objects are inserted into the root cluster, then a number of queries is launched to trigger the organization of objects into clusters. A database reorganization is triggered every 100 spatial queries. If the query distribution does not change, the clustering process reaches a stable state (in fewer than 10 reorganization steps). We then evaluate the average query response time. The reported time also includes the time spent updating the query statistics associated with accessed clusters.

Fig. Query Performance when Varying Query Selectivity (Uniform Workload)

418 C.-A. Saita and F. Llirbat

7.2 Experiments

Uniform Workload and Varying Query Selectivity (2,000,000 objects). In the first experiment we examine the impact of query selectivity on query performance. We consider 2,000,000 database objects uniformly distributed in a 16-dimensional data space (251 MBytes of data) and evaluate the query response time for intersection queries with selectivities varying from 0.00005% to 50%. Each database object defines intervals whose sizes and positions are randomly distributed in each dimension. The intervals of the query objects are also uniformly generated in each dimension, but minimal/maximal interval sizes are enforced in order to control the query selectivity. Performance results are presented in the figure for both storage scenarios: in-memory and disk-based. Charts A and B illustrate average query execution times for the three considered
methods: Sequential Scan (SS), Adaptive Clustering (AC), and R*-tree (RS). The tables compare AC and RS in terms of the total number of clusters/nodes, the average ratio of explored clusters/nodes, and the average ratio of verified objects.

Unlike RS, for which the number of nodes is constant, AC adapts the object clustering to the actual data and query distribution. When the queries are very selective, many clusters are formed, because few of them are expected to be explored. In contrast, when the queries are not selective, fewer clusters are created; otherwise, their frequent exploration would trigger a significant cost overhead. The cost model supporting the adaptive clustering always ensures better performance for AC compared to SS(4). RS is much more expensive than SS on disk, but also in memory for queries with selectivities over 0.5%. The bad performance of RS confirms our expectations: RS cannot deal with high dimensionality (16 in this case), because the MBB overlap within nodes forces the exploration of many tree nodes(5).

AC systematically outperforms RS, exploring fewer clusters and verifying fewer objects, both in memory and on disk. Our object grouping is clearly more efficient: in memory, for instance, we verify three times fewer objects than RS in most cases. Even for queries with selectivities as low as 50%, when RS practically checks the entire database, only 71% of the objects are verified by AC. The difference in the number of verified objects is not so substantial on disk, but the cost overhead due to expensive random I/O accesses is markedly lower(6). This happens because the number of AC clusters is much smaller than the number of RS nodes. Compared to the memory storage scenario, the small number of clusters formed on disk is due to the cost model, which takes into consideration the negative impact of expensive random I/O accesses. This demonstrates the flexibility of our adaptive cost-based clustering strategy; thanks to it, AC succeeds in outperforming SS on disk in all cases.
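The per-object verification step that both SS and AC ultimately perform can be sketched as a per-dimension interval-overlap test with early exit. This is our own illustrative code, not the paper's implementation; the names are assumptions, and the cluster-level "signature" test of the adaptive scheme is only hinted at in a comment:

```python
def intersects(obj, query):
    """True if two multidimensional extended objects overlap in every
    dimension.  Each object is a list of (low, high) intervals, one per
    dimension.  The loop exits on the first failing dimension, which is
    why highly selective queries reject most objects cheaply."""
    for (lo1, hi1), (lo2, hi2) in zip(obj, query):
        if hi1 < lo2 or hi2 < lo1:
            return False
    return True

def verify_cluster(cluster, query):
    """Sequentially verify all objects in one cluster.  An adaptive
    clustering scheme would first test a cluster-level signature so
    that most clusters are skipped without touching their objects."""
    return [obj for obj in cluster if intersects(obj, query)]
```

A point-enclosing query is the degenerate case of this test in which every query interval is a single point (low = high), which explains its good selectivity.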
(4) The cost of SS in memory increases significantly (up to 3x) for lower query selectivities. This happens because an object is rejected as soon as one of its dimensions does not satisfy the intersection condition; when the query selectivity is low, more attributes have to be verified on average.
(5) See the ratio of explored nodes in the tables.
(6) Note the logarithmic time scale of Chart 7-B.

Fig. Query Performance when Varying Space Dimensionality (Skewed Data)

Skewed Workload and Varying Space Dimensionality (1,000,000 objects). With this experiment we intend to demonstrate both good behavior with increasing dimensionality and good performance under skewed data. Skewed data is closer to reality, where different dimensions exhibit different characteristics. For this test, we adopted the following skewed scenario: we generate uniformly distributed query objects with no interval constraints, but consider 1,000,000 database objects with different size constraints over the dimensions. We vary the number of dimensions between 16 and 40. For each database object, we randomly choose a quarter of the dimensions to be two times more selective than the rest. We still control the global query selectivity because the query objects are uniformly distributed; for this experiment we ensure an average query selectivity of 0.05%. Performance results are illustrated in the figure.

We first notice that the query time increases with the dimensionality. This is normal, because the size of the dataset also increases, from 126 MBytes (16d) to 309 MBytes (40d). Compared to SS, AC again exhibits good performance, scaling well with the number of dimensions, both in memory and on disk. AC resists increasing dimensionality better than RS; RS fails to outperform SS due to the large number of accessed nodes (> 72%). AC takes better advantage of the skewed data distribution, and groups objects in
clusters whose signatures are based on the most selective similar intervals and dimensions of the regrouped objects. In contrast, RS does not benefit from the skewed data distribution, probably due to the minimum bounding constraint, which increases the general overlap. In memory, for instance, RS verifies four times more objects than AC.

Point-Enclosing Queries. Queries like "find the database objects containing a given point" can also occur in practice (for instance, in a publish-subscribe application where subscriptions define interval ranges as attributes, and events can be points in these ranges), so we also evaluated point-enclosing queries considering different workloads and storage scenarios. We do not show the experimental details here, but we report very good performance: up to 16 times faster than SS in memory, and up to … times faster on disk, mostly due to the good selectivity. Compared to spatial range queries (i.e., intersections with spatially extended objects), point-enclosing queries are best cases for our indexing method, thanks to their good selectivity.

Conclusion on Experiments. While the R*-tree fails to outperform Sequential Scan in many cases, our cost-based clustering follows the data and query distribution and always exhibits better performance in both storage scenarios: in-memory and disk-based. Experimental results show that our method is scalable with the number of objects and behaves well with increasing dimensionality (16 to 40 in our tests), especially when dealing with skewed data or skewed queries. For intersection queries, performance is up to … times better in memory, and up to … times better on disk. Better gains are obtained when the query selectivity is high. For point-enclosing queries on skewed data, the gain can reach a factor of 16 in memory.

Conclusions

The emergence of new applications (such as SDI applications) brings out new challenging performance
An advanced subscription system should support spatial range queries over large collections of multidimensional extended objects with many dimensions (millions of subscriptions and tens to hundreds of attributes). Moreover, such a system should cope with workloads that are skewed and vary over time. Existing structures are not well suited for these new requirements. In this paper we presented a simple clustering solution suitable for such application contexts. Our clustering method uses an original grouping criterion, more efficient than traditional approaches. The cost-based clustering allows us to scale with a large number of dimensions and to take advantage of skewed data distributions. Our method exhibits better performance than competitive solutions like Sequential Scan or the R*-tree, both in memory and on disk.

References

1. M. Altinel and M. J. Franklin. Efficient filtering of XML documents for selective dissemination of information. In Proc. 26th VLDB Conf., Cairo, Egypt, 2000.
2. N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: An efficient and robust access method for points and rectangles. In Proc. ACM SIGMOD Conf., Atlantic City, NJ, 1990.
3. S. Berchtold, C. Böhm, and H.-P. Kriegel. The Pyramid-technique: Towards breaking the curse of dimensionality. In Proc. ACM SIGMOD Conf., Seattle, Washington, 1998.
4. S. Berchtold, D. A. Keim, and H.-P. Kriegel. The X-tree: An index structure for high-dimensional data. In Proc. 22nd VLDB Conf., Bombay, India, 1996.
5. K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is "nearest neighbor" meaningful? Lecture Notes in Computer Science, 1540:217–235, 1999.
6. C. Böhm, S. Berchtold, and D. A. Keim. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Computing Surveys, 33(3):322–373, 2001.
7. C. Böhm and H.-P. Kriegel. Dynamically optimizing high-dimensional index structures. In Proc. 7th EDBT Conf., Konstanz, Germany, 2000.
8. F. Fabret, H. Jacobsen, F. Llirbat, J. Pereira, K. Ross, and D. Shasha. Filtering algorithms and implementation for very fast publish/subscribe systems. In Proc. ACM SIGMOD Conf., Santa Barbara, California, USA, 2001.
9. C. Faloutsos and P. Bhagwat. Declustering using fractals. PDIS Journal of Parallel and Distributed Information Systems, pages 18–25, 1993.
10. V. Gaede and O. Günther. Multidimensional access methods. ACM Computing Surveys, 30(2):170–231, 1998.
11. A. Guttman. R-trees: A dynamic index structure for spatial searching. In Proc. ACM SIGMOD Conf., pages 47–57, 1984.
12. H. Liu and H. A. Jacobsen. Modelling uncertainties in publish/subscribe systems. In Proc. 20th ICDE Conf., Boston, USA, 2004.
13. B.-U. Pagel, H.-W. Six, and M. Winter. Window query-optimal clustering of spatial objects. In Proc. ACM PODS Conf., San Jose, 1995.
14. T. Sellis, N. Roussopoulos, and C. Faloutsos. The R+-tree: A dynamic index for multi-dimensional objects. In Proc. VLDB Conf., Brighton, England, 1987.
15. Y. Tao and D. Papadias. Adaptive index structures. In Proc. 28th VLDB Conf., Hong Kong, China, 2002.
16. R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proc. 24th VLDB Conf., New York, USA, 1998.
17. C. Yu. High-Dimensional Indexing: Transformational Approaches to High-Dimensional Range and Similarity Searches. LNCS 2341, Springer-Verlag, 2002.

Processing Unions of Conjunctive Queries with Negation under Limited Access Patterns

Alan Nash (Department of Mathematics, anash@math.ucsd.edu) and Bertram Ludäscher (San Diego Supercomputer Center, ludaesch@sdsc.edu), University of California, San Diego
Abstract. We study the problem of answering queries over sources with limited access patterns. The problem is to decide whether a given query Q is feasible, i.e., equivalent to an executable query that observes the limited access patterns given by the sources. We characterize the complexity of deciding feasibility for the classes CQ¬ (conjunctive queries with negation) and UCQ¬ (unions of CQ¬ queries): testing feasibility is just as hard as testing containment, and therefore Π₂^p-complete. We also provide a uniform treatment for CQ, UCQ, CQ¬, and UCQ¬ by devising a single algorithm which is optimal for each of these classes. In addition, we show how one can often avoid the worst-case complexity by certain approximations: at compile-time, even if a query Q is not feasible, we can efficiently find the minimal executable query containing Q. For query answering at runtime, we devise an algorithm which may report complete answers even in the case of infeasible plans, and which can indicate to the user the degree of completeness for certain incomplete answers.

1 Introduction

We study the problem of answering queries over sources with limited query capabilities. The problem arises naturally in the context of database integration and query optimization in the presence of limited source capabilities (e.g., see [PGH98,FLMS99]). In particular, for any database mediator system that supports not only conventional SQL databases but also sources with access pattern restrictions [LC01,Li03], it is important to come up with query plans which observe those restrictions. Most notably, the latter occurs for sources which are modeled as web services [WSD03]. For the purposes of query planning, a web service operation can be seen as a remote procedure call, corresponding to a limited query capability which requires certain arguments of the query to be bound (the input arguments), while others may be free (the output arguments).

Web Services as Relations with Access Patterns.
A web service operation can be seen as a function having an input message (request) with arguments (parts) and an output message (response) with parts [WSD03, Part 2, Sec. 2.2]. For example, an operation may implement a book search service, returning for a given author A the list of books authored by A. We model such operations as relations with access patterns; here, the access pattern 'oio' indicates that a value for the second attribute must be given as input, while the other attribute values can be retrieved as output. In this way, a family of web service operations over the same attributes can be concisely described as a relation with an associated set of access patterns. Thus, queries become declarative specifications for web service composition. An important problem of query planning over sources with access pattern restrictions is to determine whether a query Q is feasible, i.e., equivalent to an executable query plan that observes the access patterns.

E. Bertino et al. (Eds.): EDBT 2004, LNCS 2992, pp. 422–440, 2004. © Springer-Verlag Berlin Heidelberg 2004.

Example 1. The following conjunctive query with negation asks for books available through a store B which are contained in a catalog C, but not in the local library L. Let the only access patterns for B require either an ISBN or an author as input. If we try to execute Q from left to right, neither pattern for B works, since we either lack an ISBN or an author. However, Q is feasible, since we can execute it by first calling the catalog C, which binds both the author and the ISBN; after that, calling B with either pattern works, resulting in an executable plan. In contrast, calling the negated library literal first and then B does not work: a negated call can only filter out answers, but cannot produce any new variable bindings.

This example shows that for some queries which are not executable, simple reordering can yield an executable plan. However, there are queries which cannot be reordered yet are feasible.
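Example 1's reasoning — left-to-right execution fails, but calling the catalog first binds the variables the remaining calls need — can be sketched as a greedy reordering procedure. The tuple encoding, relation names, and concrete access patterns below are illustrative assumptions, not the paper's exact notation:

```python
def reorder(literals, patterns):
    """Try to find an executable order: every input slot must be bound before
    use, and a negated literal is called only once all its variables are bound.
    literals: (negated, relation, vars) triples; patterns: relation -> set of
    access-pattern words over {'i', 'o'}."""
    bound, plan = set(), []
    remaining = list(literals)
    while remaining:
        for lit in remaining:
            negated, rel, args = lit
            callable_now = any(
                all(v in bound for v, slot in zip(args, p) if slot == 'i')
                for p in patterns[rel])
            if callable_now and (not negated or set(args) <= bound):
                plan.append(lit)
                if not negated:
                    bound.update(args)  # outputs of a positive call bind variables
                remaining.remove(lit)
                break
        else:
            return None                 # no literal can be executed next
    return plan
```

The greedy strategy suffices here because the set of bound variables only ever grows, so picking any currently executable literal never blocks a later choice.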
This raises the question of how to determine whether a query is feasible, and how to obtain "good approximations" in case the query is not feasible. Clearly, these questions depend on the class of queries under consideration. For example, feasibility is undecidable for Datalog queries [LC01] and for first-order queries [NL04]. On the other hand, feasibility is decidable for subclasses such as conjunctive queries (CQ) and unions of conjunctive queries (UCQ) [LC01].

Contributions. We show that deciding feasibility for conjunctive queries with negation (CQ¬) and unions of conjunctive queries with negation (UCQ¬) is Π₂^p-complete, and we present a corresponding algorithm, FEASIBLE. Feasibility of CQ and UCQ was studied in [Li03]; we show that our uniform algorithm performs optimally on all four of these query classes. We also present a number of practical improvements and approximations for developers of database mediator systems. PLAN* is an efficient polynomial-time algorithm for computing two plans which at runtime produce underestimates and overestimates, respectively, of the answers to Q. Whenever PLAN* outputs two identical plans, we know at compile-time that Q is feasible without actually incurring the cost of the feasibility test. In addition, we present an efficient runtime algorithm ANSWER* which, given a database instance D, computes underestimates and overestimates of the exact answer. If Q is not feasible, ANSWER* may still compute a complete answer and signal the completeness of the answer to the user at runtime. In case the answer is incomplete (or not known to be complete), ANSWER* can often give a lower bound on the relative completeness of the answer.

(Footnotes: We write variables in lowercase. Li and Chang call the notion of feasibility "stable" [LC01,Li03].)

Outline. The paper is organized as follows: Section 2 contains the preliminaries. In Section 3 we introduce our basic notions, such as executable, orderable, and feasible. In Section 4 we present our main algorithms for computing execution plans, determining the feasibility of a query, and runtime processing of answers.
In Section 5 we present the main theoretical results, in particular a characterization of the complexity of deciding feasibility of UCQ¬ queries; we also show how related algorithms can be obtained as special cases of our uniform approach. We summarize and conclude in Section 6.

2 Preliminaries

A term is a variable or a constant. We use lowercase letters to denote terms, and we write x̄ for a finite sequence of terms. A literal is an atom or its negation. A conjunctive query Q is an existentially quantified conjunction of atoms; it can be written as a Datalog rule. The existentially quantified variables are among the variables of the body, and the distinguished (answer) variables in the head of Q are the remaining free variables of Q, denoted free(Q); vars(Q) denotes all variables of Q. Conjunctive queries (CQ) are also known as SPJ (select-project-join) queries. A union of conjunctive queries (UCQ) is a disjunction of conjunctive queries; in rule form it consists of one rule per disjunct, all with the same head. A conjunctive query with negation (CQ¬) is defined like a conjunctive query, but with literals instead of atoms; hence such a query is an existentially quantified conjunction of positive or negated atoms. A union of conjunctive queries with negation (UCQ¬) is a disjunction of CQ¬ queries; its rule form consists of rules having the same head. For a CQ¬ query Q, we write Q⁺ for the conjunction of the positive literals of Q and Q⁻ for the conjunction of the negative literals of Q, each in the order in which they appear in Q. A CQ or CQ¬ query is safe if every variable of the query appears in a positive literal in the body. A UCQ or UCQ¬ query is safe if each of its CQ or CQ¬ parts is safe and all of them have the same free variables. In this paper we only consider safe queries.
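The safety condition just stated — every variable occurs in a positive body literal — is easy to check mechanically. Here is a small sketch under an assumed list-of-literals encoding (not the paper's notation):

```python
def is_safe(head_vars, body):
    """A CQ/CQ¬ query is safe if every variable (in the head or the body)
    occurs in some positive body literal.
    body: list of (negated, relation, vars) triples."""
    positive = {v for negated, _, vs in body if not negated for v in vs}
    everything = set(head_vars) | {v for _, _, vs in body for v in vs}
    return everything <= positive
```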
3 Limited Access Patterns and Feasibility

Here we present the basic definitions for source queries with limited access patterns. In particular, we define the notions executable, orderable, and feasible. While the former two notions are syntactic, in the sense that they can be decided by a simple inspection of a query, the latter notion is semantic, since feasibility is defined up to logical equivalence. An executable query can be seen as a query plan, prescribing how to execute the query. An orderable query can be seen as an "almost executable" plan (it just needs to be reordered to yield a plan). A feasible query, however, does not directly provide an execution plan. The problem we are interested in is how to determine whether such an executable plan exists and how to find it; these are two different, but related, problems.

Definition (Access Pattern). An access pattern for a relation R is an expression of the form R^P, where P is a word over the alphabet {i, o} whose length is the arity of R. We call a position of P an input slot if it carries an i, and an output slot if it carries an o. At runtime, we must provide values for input slots, while for output slots such values are not required, i.e., "bound is easier" [Ull88]. In general, with an access pattern for R we may retrieve a set of tuples of R as long as we supply values corresponding to all input slots of R.

Example 2 (Access Patterns). Given the access patterns of the book relation in Example 1, we can obtain, e.g., the set of authors and titles given an ISBN, and the set of titles given an author, but we cannot obtain the set of titles given no input.

Definition (Adornment). Given a set of access patterns, an adornment on a query Q is an assignment of access patterns from that set to the relations occurring in Q.

Definition (Executable). A CQ¬ query Q is executable if adornments can be added to Q so that every variable of Q appears first in an output slot of a nonnegated literal; a UCQ¬ query is executable if every one of its CQ¬ parts is. We consider the query which returns no tuples, which we write false, to be (vacuously) executable. In contrast, we consider the query with an empty body, which we write true, to be non-executable.
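The retrieval semantics of the access-pattern definition — supply a value for every input slot, read the remaining attributes — can be simulated over an in-memory relation. The book relation's attribute order, its sample tuples, and the 'oio' pattern assignment below are assumptions for illustration:

```python
def call(relation, pattern, inputs):
    """Retrieve the tuples of `relation` consistent with the bindings given
    for the pattern's input slots. `inputs` maps slot position -> value and
    must cover every 'i' slot ("bound is easier": extra bindings are fine)."""
    required = {j for j, slot in enumerate(pattern) if slot == 'i'}
    if not required <= inputs.keys():
        raise ValueError("missing a value for an input slot")
    return [t for t in relation if all(t[j] == v for j, v in inputs.items())]

# Hypothetical book relation with attributes (author, isbn, title) and access
# pattern 'oio': an ISBN must be supplied; author and title come back as output.
books = [('Knuth', '0-201-89683-4', 'TAOCP Vol. 1'),
         ('Date', '0-321-19784-4', 'An Introduction to Database Systems')]
```

For instance, `call(books, 'oio', {1: '0-321-19784-4'})` succeeds, while calling with no bindings raises an error — mirroring Example 2's observation that the title set cannot be obtained without any input.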
We may have both kinds of queries in ans(Q), defined below. From the definitions it follows that executable queries are safe; the converse is false. An executable query provides a query plan: execute each rule separately (possibly in parallel) from left to right.

(Footnotes: Other authors use 'b' and 'f' for bound and free, but we prefer to reserve these notions for variables under, or not under, the scope of a quantifier, respectively. If a source does not accept a value for an input, one can ignore the binding, call the source with that argument unbound, and afterwards execute the join locally.)

Definition (Orderable). A CQ¬ query Q is orderable if there is a permutation of the literals of Q which is executable; a UCQ¬ query is orderable if each of its CQ¬ parts is. Clearly, if Q is executable, then Q is orderable, but not conversely.

Definition (Feasible). A query is feasible if it is equivalent to an executable query. Clearly, if Q is orderable, then Q is feasible, but not conversely.

Example (Feasible, Not Orderable). Under suitable access patterns, a query can be non-orderable — no permutation of its literals is executable, because some variables can never be bound — and yet feasible, because it is equivalent to an executable query.

Usually we have in mind a fixed set of access patterns, and then we simply say executable, orderable, and feasible. The following two definitions and the algorithm in Figure 1 are small modifications of those presented in [LC01].

Definition (Answerable Literal). Given a CQ¬ query Q, we say that a literal (not necessarily in Q) is Q-answerable if there is an executable query consisting of that literal and literals in Q.

Definition (Answerable Part ans(Q)). If Q is unsatisfiable, then ans(Q) = false. If Q is satisfiable, ans(Q) is the query given by the Q-answerable literals in Q, in the order given by the algorithm ANSWERABLE (see Figure 1); for a UCQ¬ query, ans is taken disjunct-wise. Notice that the answerable part ans(Q) of Q is executable whenever it is safe.

Proposition. A query Q is orderable iff every literal in Q is Q-answerable.

Proposition. There is a quadratic-time algorithm for computing ans(Q). The algorithm is given in Figure 1.
Corollary. There is a quadratic-time algorithm for checking whether a query is orderable.

In Section 5.1 we define and discuss containment of queries, and in Section 5.2 we prove the following proposition. Query P is said to be contained in query Q if, for every instance D, the answer to P on D is contained in the answer to Q on D.

Proposition. If Q is contained in ans(Q), then ans(Q) is safe and equivalent to Q.

Corollary. If Q is contained in ans(Q), then Q is feasible.

Proof. If Q is contained in ans(Q), then ans(Q) is safe and equivalent to Q; since a safe answerable part is executable, Q is equivalent to an executable query and is therefore feasible.

We show in Section 5 that the converse also holds; this is one of our main results.

Fig. 1. Algorithm ANSWERABLE for CQ¬ queries.

4 Computing Plans and Answering Queries

Given a query Q over a relational schema with access pattern restrictions, our goal is to find executable plans for Q which satisfy the restrictions. As we shall see, such plans may not always exist, and deciding whether Q is feasible, i.e., equivalent to some executable query, is a hard problem. On the other hand, we will be able to obtain efficient approximations, both at compile-time and at runtime. By compile-time we mean the time during which the query is being processed, before any specific database instance D is considered or available; by runtime we mean the time during which the query is executed against a specific database instance D. For example, feasibility is a compile-time notion, while completeness (of an answer) is a runtime notion.

4.1 Compile-Time Processing

Let us first consider the case of an individual CQ¬ query, where each conjunct is a literal. Figure 1 depicts a simple and efficient algorithm, ANSWERABLE, which computes ans(Q), the answerable part of Q. First we handle the special case that Q is unsatisfiable; in this case we return false. Otherwise, at every stage we maintain a set B of input variables (i.e., variables with bindings) and an executable sub-plan A, both initially empty. We then iterate, each time looking for at least one more answerable literal that can be handled with the bindings B we have so far. If we find such an answerable literal, we add it to A and update the variable bindings B; when no such literal is found, we exit the outer loop. Obviously, ANSWERABLE takes polynomial (quadratic) time in the size of Q.
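The loop just described — grow a set B of bound variables and an executable prefix A until no further literal can be handled — can be sketched directly. The encodings are assumptions, and the satisfiability pre-check of the real algorithm is omitted:

```python
def answerable_part(literals, patterns):
    """Greedy sketch of ANSWERABLE: returns (A, rest), where A is an
    executable ordering of the answerable literals and rest is the
    unanswerable remainder. literals: (negated, relation, vars) triples;
    patterns: relation -> set of access-pattern words over {'i', 'o'}."""
    A, B = [], set()
    rest = list(literals)
    changed = True
    while changed:                      # outer loop: repeat while we make progress
        changed = False
        for lit in list(rest):
            negated, rel, args = lit
            callable_now = any(
                all(v in B for v, slot in zip(args, p) if slot == 'i')
                for p in patterns[rel])
            if callable_now and (not negated or set(args) <= B):
                A.append(lit)           # extend the executable sub-plan
                rest.remove(lit)
                if not negated:
                    B.update(args)      # outputs of a positive call bind variables
                changed = True
    return A, rest
```

Each outer pass scans all remaining literals and adds at least one, so at most |Q| passes over at most |Q| literals are needed — the quadratic bound stated above.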
We are now ready to consider the general case of computing execution plans for a UCQ¬ query Q (Figure 2). For each disjunct of Q, we compute its answerable part and its unanswerable part. If the unanswerable part is empty, the disjunct's answerable part goes into the underestimate plan; otherwise we dismiss the disjunct altogether for the underestimate. Either way, we ensure that the underestimate is contained in Q.

Fig. 2. Algorithm PLAN* for UCQ¬ queries.

For the overestimate we give the unanswerable part the "benefit of the doubt" and consider that it could be true. However, not all variables in the head of the query need occur in the answerable part — some may appear only in the unanswerable part — so we cannot return values for them. Hence we set the head variables which do not occur in the answerable part to null. This way we ensure that Q is contained in the overestimate, except where the overestimate contains null values; tuples with nulls have to be interpreted carefully (see Section 4.2). Clearly, if all unanswerable parts are empty, then the two plans coincide and all disjuncts can be executed in the order given by ANSWERABLE, so Q is orderable and thus feasible. Also note that PLAN* is efficient, requiring at most quadratic time.

Example (Underestimate and Overestimate Plans). Consider a query, under given access patterns, in which a positive literal can produce bindings for a variable but its negation cannot. By moving the positive literal to the front of the first disjunct, we can first bind the variable and then test it against the negated filter. Suppose, however, that no access pattern for a relation B can be satisfied. The unanswerable part then makes the underestimate of that disjunct equivalent to false, so the disjunct can be dropped from the underestimate (the unanswerable part is also responsible for the infeasibility of this plan). In the overestimate, the answerable literal is moved to the front, and the unanswerable literal is replaced by a special condition equating its unknown value with null.
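PLAN*'s split — drop any disjunct with a nonempty unanswerable part from the underestimate; keep every disjunct's answerable part in the overestimate, padding head variables seen only in the unanswerable part with null — can be sketched as follows. The query encoding and the use of Python's `None` for the paper's null are assumptions:

```python
def _answerable(body, patterns):
    # Greedy ANSWERABLE sketch: split a body into (answerable, unanswerable).
    A, B, rest, changed = [], set(), list(body), True
    while changed:
        changed = False
        for lit in list(rest):
            neg, rel, args = lit
            ok = any(all(v in B for v, s in zip(args, p) if s == 'i')
                     for p in patterns[rel])
            if ok and (not neg or set(args) <= B):
                A.append(lit)
                rest.remove(lit)
                if not neg:
                    B.update(args)
                changed = True
    return A, rest

def plan_star(query, patterns):
    """query: list of disjuncts (head_vars, body). Returns (under, over)."""
    under, over = [], []
    for head, body in query:
        ans, unans = _answerable(body, patterns)
        if not unans:
            under.append((head, ans))   # fully answerable: keep in both plans
            over.append((head, ans))
        else:
            bound = {v for neg, _, vs in ans if not neg for v in vs}
            # benefit of the doubt: keep the answerable part, null-pad the
            # head variables that occur only in the unanswerable part
            padded = tuple(v if v in bound else None for v in head)
            over.append((padded, ans))
    return under, over
```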
Fig. 3. Algorithm FEASIBLE for UCQ¬ queries.

Feasibility Test. While PLAN* is an efficient way to compute plans for a query Q, if it returns two different plans we do not know whether Q is feasible. One option, discussed below, is to perform no static analysis beyond PLAN* and just "wait and see" what results the two plans produce at runtime; this approach is particularly useful for ad-hoc, one-time queries. On the other hand, when designing integrated views of a mediator system over distributed sources and web services, it is desirable to establish at view definition time that certain queries or views are feasible and have an equivalent executable plan for all database instances. For such "view design" and "view debugging" scenarios, a full static analysis using algorithm FEASIBLE (Figure 3) is desirable. First, FEASIBLE calls PLAN* to compute the two plans; if they coincide, then Q is feasible. Conversely, if the overestimate contains some sub-query in which a null value occurs, we know that Q cannot be feasible (since then ans(Q) is unsafe). Otherwise, Q may still be feasible, namely if ans(Q) (which equals the overestimate in this case) is contained in Q. The complexity of FEASIBLE is dominated by this containment check.

4.2 Runtime Processing

The worst-case complexity of FEASIBLE seems to indicate that, in practice and for large queries, there is no hope of obtaining plans with complete answers. Fortunately, the situation is not that bad after all. First, as indicated above, we may use the outcome of the efficient PLAN* algorithm to decide feasibility at compile-time, at least in some cases (the first part of FEASIBLE, up to the containment test). Perhaps even more important from a practical point of view is the ability to decide completeness of answers dynamically, i.e., at runtime. Consider algorithm ANSWER* in Figure 4.
We first let PLAN* compute the two plans and evaluate them on the given database instance D, obtaining the underestimate and the overestimate of the answer, respectively. If the difference between them is empty, then we know the answer is complete, even though the query may not be feasible. Intuitively, the reason is that an unanswerable part which causes infeasibility may in fact be irrelevant for a specific query.

Fig. 4. Algorithm ANSWER* for runtime handling of plans.

Example (Not Feasible, Runtime Complete). Consider the plans created for the query in the previous example (with the unsatisfiable disjunct dropped). Given the available access pattern for B, the query is not feasible, since we cannot create the required bindings. However, for a given database instance D, it may happen that the answerable part does not produce any results. In that case, the unanswerable part is irrelevant and the answer obtained is still complete.

Sometimes it is not accidental that certain disjuncts evaluate to false; rather, it follows from underlying semantic constraints, in which case the omitted unanswerable parts do not compromise the completeness of the answer.

Example (Dependencies). In the previous example, if one attribute is a foreign key referencing another relation, the corresponding disjunct is always empty. Therefore, the first disjunct can be discarded at compile-time by a semantic optimizer. However, even in the absence of such checks, our runtime processing can still recognize this situation and report a complete answer for the infeasible query. In the BIRN mediator [GLM03], when unfolding queries against global-as-view defined integrated views into plans, we have indeed experienced query plans with a number of bodies that are unsatisfiable with respect to underlying, implicit integrity constraints. In such cases, when plans are redundant or partially unsatisfiable, our runtime handling of answers allows us to report complete answers even when the feasibility check fails or when semantic optimization cannot eliminate the unanswerable part.
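The runtime bookkeeping described here — report completeness when the overestimate adds nothing, and a numeric lower bound on completeness when it adds only null-free tuples — can be sketched over materialized result sets. This is a sketch of the reporting logic only, not of plan evaluation itself:

```python
def report(under_result, over_result):
    """ANSWER*-style reporting over two sets of answer tuples.
    Returns (answer, complete, completeness), where completeness is a lower
    bound on the fraction of the exact answer retrieved, or None when
    null-carrying tuples make it unquantifiable."""
    extra = over_result - under_result
    if not extra:
        return under_result, True, 1.0       # complete even if Q is infeasible
    if any(None in t for t in extra):
        return under_result, False, None     # nulls: no numeric bound possible
    # under ⊆ exact ⊆ over, hence |under| / |over| bounds completeness below
    return under_result, False, len(under_result) / len(over_result)
```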
In Figure 4, we know that the answer is complete if the difference between the overestimate and the underestimate is empty, i.e., the overestimate plan has not contributed new answers. Otherwise we cannot know whether the answer is complete; however, if the overestimate does not contain null values, we can quantify the completeness of the underestimate relative to the overestimate. Tuples with nulls in the overestimate must be interpreted with care.

Example (Nulls). Suppose some variable binding obtained from the answerable part gives rise to an overestimate tuple of the form (a, null). How should we interpret such a tuple? The binding gives rise to a partially instantiated query, and given the access pattern for B, we cannot know the contents of B. Our special null value in the answer thus means that there may be one or more values completing (a, null) to a tuple in the answer to Q; on the other hand, there may be no tuple in B with a as its first component. So when (a, null) is in the answer, we can only infer that the answerable literals are satisfied for some witness value; we do not know whether a matching tuple of B indeed exists. The incomplete information on B due to the null value also explains why, in this case, we cannot give a numerical value for the completeness information in ANSWER*. From Theorem 16 below it follows that the overestimates computed via PLAN* cannot be improved, i.e., the construction is optimal. This is not the case for the underestimates as presented here.

Improving the Underestimate. The ANSWER* algorithm computes under- and overestimates for queries at runtime. If a query is feasible, then the two estimates always coincide, which is detected by ANSWER*. However, in the case of infeasible queries, additional improvements can still be made. Consider the algorithm PLAN* in Figure 2: it divides a query into two parts, the answerable part and the unanswerable part.
For each variable which requires input bindings in the unanswerable part that are not provided by the answerable part, we can create a domain enumeration view over the relations of the given schema and provide the bindings obtained in this way as partial domain enumerations to the unanswerable part.

Example (Domain Enumeration). For our running example from above, instead of an underestimate equivalent to false, we obtain an improved underestimate by joining with a domain view, which could be defined, e.g., as the union of the projections of various columns from other relations for which we have access patterns with output slots.
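Such a domain enumeration view — a union of projections of columns reachable through output slots — might be sketched as follows. The source encoding, and the simplifying restriction to patterns callable without any input at all, are assumptions:

```python
def domain_enumeration(sources):
    """sources: list of (tuples, pattern). Collect every value exposed in an
    output slot of a pattern that needs no input bindings; the resulting set
    can seed the input slots of otherwise unanswerable literals."""
    domain = set()
    for tuples, pattern in sources:
        if 'i' in pattern:
            continue                  # would itself need bindings; skip here
        for t in tuples:
            for value, slot in zip(t, pattern):
                if slot == 'o':
                    domain.add(value)
    return domain
```

The enumeration is only partial — sources whose patterns all require input contribute nothing — which is why the improved underestimate remains an underestimate.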

Ngày đăng: 24/12/2013, 02:18

Từ khóa liên quan

Tài liệu cùng người dùng

Tài liệu liên quan