MULTI-DIMENSIONAL RANGE QUERY EVALUATION FOR DISTRIBUTED HASH TABLE BASED PEER-TO-PEER SYSTEMS
ZHANG GONG
NATIONAL UNIVERSITY OF SINGAPORE
2004
MULTI-DIMENSIONAL RANGE QUERY EVALUATION FOR DISTRIBUTED HASH TABLE BASED PEER-TO-PEER SYSTEMS
ZHANG GONG
(B.Sc., Xi'an Jiaotong University, China)
A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE
DEPARTMENT OF COMPUTER SCIENCE
NATIONAL UNIVERSITY OF SINGAPORE
2004
Acknowledgement
First, I would like to express my heartfelt thanks to my supervisor, Dr Gary S H Tan, for his supervision throughout my master's study. My sincere gratitude also goes to Associate Professor Kian-Lee Tan for his advice and constant guidance during all phases of this thesis. They have conscientiously provided me with careful guidance at every stage of my research, offered ideas whenever I ran into difficulties, and constructively corrected mistakes in the course of my work. Participating in their projects gave me many opportunities to develop my research and analytical abilities. Their support enabled me to both learn and write what is presented in this thesis. In addition, they have given me constructive suggestions on my attitude to work, which is helpful to my career development.
I would also like to thank Ng Yew Kwong, Hu Yu and Gozali Johan Prawira, with whom I enjoyed discussing P2P systems and programming questions. My sincere appreciation also goes to my lab fellows, Ameya Virkar, Liu Ming and Liu Peng, for their generous help both in my research and in my life, and for the pleasant and friendly environment of the computer system lab.
Last but not least, I would like to convey my gratitude to the thesis examiners for taking time from their busy schedules to assess my research work.
Table of Contents
Chapter 1 Introduction
1.1 DHT-based P2P Systems
1.2 Complex Query over DHT-based P2P Systems
1.2.1 Marrying Database and P2P Paradigm
1.2.2 Complex Query over DHT-based P2P Systems
1.3 Multi-dimensional Range Query Evaluation
1.3.1 Motivations: Three Complex Queries
1.3.2 Multi-dimensional Range Query Processing
1.4 Research Contributions
1.5 Organization of the Thesis
Chapter 2 Literature Review
2.1 Related Work
2.2 One-dimensional Indexing for P2P Designs
2.3 Multi-dimensional Indexing for P2P Systems
2.4 Multi-dimensional Indexing Using Hilbert Space Filling Curve
Chapter 3 System Model
3.1 Problem Formulation
3.1.1 Data Management Process in DHT-based P2P Systems
3.1.2 Problem Formulation
3.2 Design Principles
3.2.1 General Principles
3.2.2 Design Goals
3.3 System Model
3.3.1 Three-tier Architecture
3.3.2 Application Layer
3.3.3 Multi-dimensional Indexing Layer
3.3.4 Two-fold Property of Partitioning Manner
3.3.5 File Systems
Chapter 4 Multi-dimensional Range Query Evaluation
4.1 Multi-dimensional Single Point Query Evaluation
4.2 Multi-dimensional Range Query Evaluation
4.3 Zone Maintenance
4.4 System Performance Evaluation
4.4.1 Multi-dimensional Single Point Query
4.4.2 Multi-dimensional Range Query Evaluation
4.5 Selectivity Factor
4.6 Design Improvements
4.6.1 Parallelism Strategy
4.7 Comparison with Naive Flooding Method
Chapter 5 Conclusions
References
Appendix A Sample Code for Mapping a Multi-dimensional Point into a Hilbert Sequence Number
Appendix B Join Time Load Balance Code
Appendix C Overlay Network's Transit-Stub Topologies Building
Summary
In this thesis, we investigate how to enable current DHT-based P2P systems to support multi-dimensional range queries, towards the long-term goal of providing complex query facilities in P2P systems. We adopt a multi-dimensional coordinate space model whose points are ordered by the Hilbert space filling curve. This ordering makes range query processing in a multi-dimensional coordinate space possible. The space is partitioned in two ways at once: into zones, and into consecutive segments of a single-direction sequence. This helps extend the DHT layer's efficient exact-match lookup into higher dimensions. We propose a relay-race-like query scheme and introduce strategies to improve its performance, such as parallelism and lookup during processing.
The performance of the system is evaluated via simulation. Several metrics are explored, such as the number of nodes visited, query latency and per-hop latency. The evaluation shows that the proposed model not only retains scalability but also processes multi-dimensional range queries at bounded cost. The system can be incorporated into computational grids to enhance their information discovery capability.
Compared with current methods of processing range queries in P2P systems, our method, as a general method, is superior in the following aspects:
1. Existing methods are all one-dimensional. Our method provides a general way to answer multi-dimensional range queries, of which a one-dimensional query is a special case. To our knowledge, this is the first approach that supports multi-dimensional range query processing.
2. Because our method is oriented to multi-dimensional range queries, it avoids expensive join operations. With current one-dimensional methods, if we want to find a resource specified by several attributes (dimensions), an independent index infrastructure must first be built for each attribute. The information infrastructure then queries the appropriate indexing infrastructure for each attribute present in the query and concatenates the results in a database-like “join” operation [13]. Our method handles multiple attributes with a single DHT-based system.
3. Existing methods process range queries by flooding. Our method processes multi-dimensional range queries by lookup. Hence, besides extending the fine properties of DHT-based P2P systems, it provides a deterministic structure and performance guarantees in terms of logical hops for processing multi-dimensional queries. It also reduces the high network overhead.
4. Our method answers multi-dimensional range queries in a deterministic and complete manner.
List of Figures
Figure 1.1 Napster: a centralized P2P system
Figure 1.2 Routing in CAN
Figure 2.1 Hilbert curve in two-dimensional space
Figure 3.1 Three-tier architecture
Figure 3.2 Two-dimensional space and range query region
Figure 3.3 System model
Figure 3.4 P2P system with 7 peers
Figure 3.5 Zone partitioning
Figure 4.1 Query evaluation
Figure 4.2 Space partitioning
Figure 4.3 Multi-dimensional single point query performance
Figure 4.4 Average path length
Figure 4.5 Query response time
Figure 4.6 Effect of parallelism strategy on response time
Figure 4.7 Extra communication overheads introduced by parallelism strategy
Figure 4.8 Comparison of the two schemes on the number of visited nodes
Figure 4.9 Comparison of the two schemes on the user-perceived response time
List of Tables
Table 1.1 Relational table distributed into the P2P system
Table 1.2 The second table stored in the P2P system
List of Queries
Query 1.1
Query 1.2
Query 1.3
Query 1.4
Query 1.5
Query 3.1
Query 3.2
Query 3.3
Query 4.1
Query 4.2
Chapter 1
Introduction
Peer-to-peer (P2P) systems are one of the fastest growing technologies in computing today. In such systems, content is stored at peer nodes, without centralized control or planning. Data can thus be exchanged directly between peers, in contrast to the traditional client-server model. The properties of self-organization, fault tolerance and scalability have driven the rapid development of P2P systems in recent years.
However, current P2P systems have two serious limitations: poor scalability and weak semantics. Scalability has been a concern throughout the unstructured P2P designs, from the initial centralized design of Napster [1], to the completely decentralized design of Gnutella [2], to the hierarchical design of FastTrack [3], until structured P2P systems were proposed. Structured P2P designs provide a Distributed Hash Table (DHT), so they are also called DHT-based P2P systems. Scalability is largely solved in DHT-based P2P networks, because a lookup can be resolved in log n (or n^α for small α) overlay routing hops for an overlay network with n hosts. However, the query facility is further impoverished in DHT-based P2P systems, because a hash table efficiently supports only the lookup operation, which is one of the most fundamental and simplistic query forms.
Traditional database research prides itself on its most notable features: powerful relational query facilities, a strict relational model and reliable data management. These features are missing in the distributed environment of P2P systems. How to bring the rich query processing facilities of databases into the widely distributed environment of P2P networks is an important challenge for both the database community and the P2P community.
In this thesis, we present one important step towards the long-term plan of marrying the powerful query facilities of traditional databases with P2P networks: multi-dimensional range query evaluation in DHT-based P2P systems.
This chapter is organized as follows. First, we briefly introduce DHT-based P2P designs. Next, we review the need for building a complex query facility over DHT-based P2P systems. Then, we explore the motivation for processing multi-dimensional range queries in P2P systems and give an overview of the proposed method. Finally, we summarize the research contributions.
1.1 DHT-based P2P Systems
Unstructured P2P systems were the first to appear on the Internet and produced many hugely successful, popular deployments such as Napster [1], Gnutella [2] and BestPeer [4]. Napster was introduced in mid-1999 and, as of December 2000, the software had been downloaded by 50 million users, making it the fastest growing application on the Internet. It is envisioned that P2P will transform the Internet from a shared bandwidth infrastructure into a combined bandwidth infrastructure, and that P2P systems may lead to new content distribution models for applications such as software distribution and file sharing.
However, current P2P systems have two serious limitations: poor scalability and weak semantics. For example, in Napster a central server stores the index of all the files available within the user community. Although the actual file transmission occurs between user machines, every transmission must start with the centralized index server. It is expensive to scale the central directory. Also, because the file directory is concentrated in a few central machines, the system has a single point of failure. Gnutella [2] is a completely decentralized P2P design that processes queries by flooding. Although it reduces the risk of a single point of failure, queries are processed inefficiently and incompletely. Flooding on every request is clearly not scalable, and because the flooding has to be curtailed at some point in practice, the system may fail to find content that is actually in the system.
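The incompleteness of curtailed flooding can be illustrated with a minimal sketch. The five-peer chain, the file name and the hop limit `ttl` below are hypothetical and chosen only for illustration; this is not the Gnutella protocol itself, merely a breadth-first flood that stops forwarding after a fixed number of hops.

```python
from collections import deque

def flood_search(graph, start, wanted, ttl):
    """Breadth-first flood from `start`, forwarding at most `ttl` hops.

    Returns (peers holding the item, number of peers contacted).
    """
    visited = {start}
    frontier = deque([(start, 0)])
    hits = []
    while frontier:
        peer, hops = frontier.popleft()
        if wanted in graph[peer]["files"]:
            hits.append(peer)
        if hops == ttl:
            continue  # the flood is curtailed here
        for neighbour in graph[peer]["links"]:
            if neighbour not in visited:
                visited.add(neighbour)
                frontier.append((neighbour, hops + 1))
    return hits, len(visited)

# Hypothetical overlay: a chain A-B-C-D-E in which only E stores the file.
overlay = {
    "A": {"links": ["B"], "files": set()},
    "B": {"links": ["A", "C"], "files": set()},
    "C": {"links": ["B", "D"], "files": set()},
    "D": {"links": ["C", "E"], "files": set()},
    "E": {"links": ["D"], "files": {"fidelio.mp3"}},
}
```

With a hop limit of 2 the flood started at A reaches only A, B and C and reports no hit, although the file exists at E; with a hop limit of 4 every peer is contacted before the file is found.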
Figure 1.1 Napster: a centralized P2P system
The emergence of structured P2P designs largely resolves the scaling problem. Structured P2P systems build on the idea of the Distributed Hash Table (DHT). The underlying networking infrastructure is a logical network, called an overlay network, and pairs of the form (key, value) are stored among the peers. Structured P2P systems have different implementations; however, most of them support the basic operations put(key, value) and get(key). The main operation that a DHT-based P2P system offers is get(key): given a keyword, the system looks up the files stored under this key. In general, for an overlay network with n peers, this lookup can be resolved in log n (or n^α for small α) overlay routing hops.
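The put/get interface can be sketched in a few lines. The class below is a single-process stand-in written for illustration only: the class name, the peer names and the use of SHA-1 modulo the number of peers are our assumptions, and overlay routing (the source of the log n hop bound) is deliberately left out.

```python
import hashlib

class ToyDHT:
    """Single-process stand-in for a DHT: each key is hashed onto one peer."""

    def __init__(self, peer_names):
        self.names = sorted(peer_names)
        self.peers = {name: {} for name in self.names}  # peer -> local store

    def _owner(self, key):
        # Assumption: SHA-1 modulo the peer count decides the responsible peer.
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return self.names[int.from_bytes(digest, "big") % len(self.names)]

    def put(self, key, value):
        self.peers[self._owner(key)].setdefault(key, []).append(value)

    def get(self, key):
        # Exact-match only: a different key hashes to an unrelated peer.
        return self.peers[self._owner(key)].get(key, [])

dht = ToyDHT(["p1", "p2", "p3"])
dht.put("Fidelio", "10.0.0.7")   # hypothetical address of the peer sharing the file
```

Here `dht.get("Fidelio")` returns the stored address, while a key that was never inserted returns an empty list. The interface offers no way to ask for all keys within an interval.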
Let us illustrate structured P2P designs with one representative, the Content Addressable Network (CAN) [5]. As illustrated in Figure 1.2, CAN has a d-dimensional virtual coordinate space. At any instant, the entire coordinate space is partitioned among all the nodes in the system, and each CAN node owns one zone of the virtual space. In addition, each node holds information about a small number of “adjacent” zones in its routing table. A uniform hash function maps keys to points in the d-dimensional space, and all the points that fall into a node's zone are maintained by that node. The basic operations performed on a CAN are insertion, lookup and deletion of (key, value) pairs. A request (insert, lookup or delete) for a particular key is routed by intermediate CAN nodes towards the target node whose zone contains that key. The design of CAN is completely distributed, scalable and fault-tolerant. Figure 1.2 illustrates the typical routing mechanism of CAN.
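The mapping from a key to the zone that stores it can be sketched for d = 2. The `Zone` class, the four-node partition of the unit square and the way the SHA-1 digest is split into two coordinates are assumptions made for this illustration; they are not taken from the CAN specification, and routing between neighbouring zones is omitted.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Zone:
    owner: str
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, point):
        x, y = point
        return self.x0 <= x < self.x1 and self.y0 <= y < self.y1

def key_to_point(key):
    """Hash a key to a point in the unit square [0, 1) x [0, 1)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    x = int.from_bytes(digest[:4], "big") / 2**32
    y = int.from_bytes(digest[4:8], "big") / 2**32
    return (x, y)

def owner_of(zones, key):
    """Return the node whose zone encloses the point that `key` hashes to."""
    point = key_to_point(key)
    for zone in zones:
        if zone.contains(point):
            return zone.owner
    raise LookupError("zones do not cover the coordinate space")

# Hypothetical partition of the unit square among four nodes.
zones = [
    Zone("A", 0.0, 0.0, 0.5, 0.5),
    Zone("B", 0.5, 0.0, 1.0, 0.5),
    Zone("C", 0.0, 0.5, 0.5, 1.0),
    Zone("D", 0.5, 0.5, 1.0, 1.0),
]
```

Because the zones tile the space without overlap, every key has exactly one owner, and repeated calls with the same key always name the same node.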
Figure 1.2 Routing in CAN
1.2 Complex Query Over DHT-based P2P systems
1.2.1 Marrying Database and P2P paradigm
Let us examine the direction of this thesis from a broader perspective: how can database technology be applied to the P2P paradigm?
The semantics provided by today's P2P technology are typically quite weak. In theory, the data within a P2P system should be accessible at many degrees of granularity. However, today's P2P systems only support the atomic granularity level: data consists of a collection of indivisible objects, e.g., complete MP3 files, and we can either place an entire object at a peer or not at all. Forming hierarchical granularity in P2P systems is therefore a challenge. In most cases, current P2P file-sharing systems are limited to applications in which objects are large and opaque, and whose content is already described precisely by their name.
The database community has many strong data management tools, such as queries and views, to express relationships between objects and to define new objects. Complex queries can be posed across multiple sources, and the results of one query can be materialized to answer other queries. If these data management techniques can be used to develop better solutions to the weak semantics problem, P2P systems will offer not only their inherent popular properties but also powerful semantic support. This is the track that this thesis follows.
1.2.2 Complex Query over DHT-based P2P systems
As one important component of applying database technology to the P2P community, we are engaged in building a complex query facility over P2P systems, in particular DHT-based P2P systems.
Current P2P designs support a simplistic query form: “search”. This tool can find all the files whose names contain a given string. However, “search” is a limited form of querying, intended for identifying (“finding”) individual items. Rich query languages should do more than “find” things: they should be able to involve multiple attributes, and they should allow queries for combinations and correlations among the attributes.
As an example, it is possible to search Gnutella for music by Beethoven, but it is not possible to ask specifically for all of Beethoven's overtures, since the files do not typically contain the word “overture” in their name. It is difficult to answer a complex query such as the one shown below:
Query 1.1
Select peer_name
From music P2P system
Where author_name = 'Beethoven' AND music_format = 'Overture' AND year > 1999
As the new generation of P2P systems, structured P2P designs largely resolve the scaling problem. In another respect, however, the core idea of the Distributed Hash Table impoverishes the query facility. A DHT is essentially a decentralized hash table, and the most notable functionality of a hash table is fast exact-match lookup. An index based on a hash table can hardly support range queries other than by a full scan. Hence, DHT-based P2P systems support only exact-match lookups and do not efficiently support range queries or multi-attribute queries. This inherent deficiency aggravates the poor query facilities of P2P systems.
Hence arises the need to provide a complex query facility over P2P systems. If we can process complex relational queries in a P2P network over DHTs, we can certainly execute traditional exact-match lookups as a special case. As DHT-based P2P systems have largely solved the scaling problem, there is a critical need to provide complex query facilities over them while still maintaining the scalability of the DHT infrastructure.
1.3 Multi-dimensional Range Query Evaluation
1.3.1 Motivations: Three Complex Queries
Before describing our approach, we first discuss some general issues in processing complex queries over P2P systems, and then explain why we propose multi-dimensional range query evaluation in DHT-based P2P systems.
As an example, consider a P2P music file sharing system. Putting aside the protocol and the architecture of the underlying network, we assume that each resource is described by a set of attributes with globally known types. The collection of these attributes and their values forms a relational table. This table is distributed among the peers, so that each peer stores one horizontal partition of the table. In our example, a music file is described by the schema (Music_id, Music_name, Author, Orchestra, Year), as shown in Table 1.1. In DHT-based P2P systems, such a relational table is distributed into the system based on one attribute. CAN [5] can be used to construct the overlay network. We choose Music_name as the hash key, whose hashed value decides where a given tuple is stored. Given a pair (music_name, ip), the music_name is deterministically mapped to a point P in the coordinate space by a uniform hash function. The corresponding (key, value) pair is then stored at the node whose zone encloses the point P.
If a user issues a query for music_name="Carmen", the same hash function is applied to this key, and the query is converted into a search for the point corresponding to the key. If the point is not stored at the local peer, the search request is forwarded and routed by the CAN overlay infrastructure. Finally, if the search target is found, the tuple is returned to the peer that issued the query.
Music (R)
Music_id (key)   Music_name   Author      Orchestra                       Year
12               Carmen       Bizet       London Symphony Orchestra       2000
11               Fidelio      Beethoven   Vienna Philharmonic Orchestra   1999
…                …            …           …                               …
Table 1.1 Relational table distributed into the P2P system
A common problem arises: how can we answer a query that involves a non-key attribute? For example, if a user asks for all the music played by the “London Symphony Orchestra”, how can we process the SQL query below?
Query 1.2
Select Music_name
From Music
Where Orchestra = 'London Symphony Orchestra'
Unfortunately, current query processing mechanisms built on top of the DHT layer select only one attribute as the hash key. Such a one-dimensional query system is unable to process this kind of query directly. The available workaround is to build several DHT layers based on different keys and to choose the DHT index that matches the selected attribute. Naturally, this introduces the expensive cost of building extra DHT layers, and the communication cost increases as a result.
Thus a new indexing facility that supports multiple attributes is called for. With such a multi-dimensional indexing facility over the DHT layer, we can even answer more complex multi-attribute queries like the one below:
Query 1.3:
Select Music_name
From Music
Where Orchestra = 'London Symphony Orchestra' AND Author = 'Bizet'
Another critical kind of query that poses a challenge for DHT-based P2P designs is the range query, inherently because a hash table supports only exact-match lookup. In this setting, the DHT-based P2P infrastructure offers no help in answering the query below:
Query 1.4:
Select Music_name
From Music
Where Year > 1998 AND Year < 2003
The only approach is to flood the query into the system and check each peer. This has the same drawbacks as Gnutella. Flooding on every request is not scalable. Further, because users in a Gnutella network self-organize into an application-level mesh on which requests for a file are flooded within a certain scope, the flooding has to be curtailed at some point. This leads to the possibility of failing to find content that is actually in the community.
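The following sketch shows why a hash-based placement gives no help with Query 1.4. It assumes, hypothetically, that tuples are placed by hashing the Year attribute with SHA-1 modulo the number of peers. Even then, the hash scatters nearby years over unrelated peers, so the predicate 1998 < Year < 2003 is answered here by asking every peer. The function names and the two sample tuples (taken from Table 1.1) are for illustration only.

```python
import hashlib

def peer_for(key, n_peers):
    """Assumed placement rule: SHA-1 of the key, modulo the number of peers."""
    digest = hashlib.sha1(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % n_peers

def place(records, n_peers):
    """Distribute (music_name, year) tuples by hashing the year."""
    peers = [[] for _ in range(n_peers)]
    for name, year in records:
        peers[peer_for(year, n_peers)].append((name, year))
    return peers

def range_by_scan(peers, low, high):
    """Answer low < Year < high by contacting every peer (the flooding fallback)."""
    contacted = 0
    names = []
    for store in peers:
        contacted += 1
        names.extend(name for name, year in store if low < year < high)
    return sorted(names), contacted

records = [("Carmen", 2000), ("Fidelio", 1999)]
peers = place(records, 4)
```

Calling `range_by_scan(peers, 1998, 2003)` returns both titles, but only after all four peers have been contacted, whatever the placement of the two tuples happens to be.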
More importantly, range lookup (range query) is one of the fundamental functionalities needed to support general-purpose database query processing. In practice, the range selection operation is typically implemented at the leaves of a query plan. Hence, supporting range queries is the foundation for an indexing facility that is to support complex queries in P2P systems. However, current methods, including the research agenda of Harren et al. [6], do not provide an efficient way for P2P systems to support general range lookup. One of the goals that constrain the design of Harren et al.'s work is “minimal extension to DHT APIs”. It is true that this consideration keeps the DHT APIs as thin and general-purpose as possible. But we contend that, in order to achieve the larger goal of supporting more general and complex queries involving ranges in DHT-based systems, it is necessary to extend current P2P designs from exact-match lookup to range lookup. Otherwise, without support for range queries, it is difficult for a DHT-based indexing facility to support general or complex queries. This is the second goal of our design: extending P2P systems to support range lookup.
Two design goals have now emerged, but this is not the whole story. Consider the third complex query described below.
Suppose that, in the above P2P music file sharing system, material about the authors of the music is also of interest to users. With the existing one-attribute indexing infrastructure, we have to build another relational table to store this independent theme, in addition to the hash indexing infrastructure built on music_name. This table is distributed into the P2P system according to the hash value of the attribute author_name. As a result, the P2P music sharing system now has two tables distributed into it. It is easy to find information confined to an individual table. For instance, given the name of an author, the author's country can be found by the usual DHT exact-match lookup on the value of author_name.
Author (T)
Author_id (key)   Author_name   Author_birth   Country   Representative_works
11                Beethoven     1770           Germany   Fidelio
Table 1.2 The second table stored in the P2P system
But how do we answer a query that involves two relational tables? For instance, how can we find the music files that were produced earlier than 1999 and whose author was born in Germany?
Query 1.5:
Select Music_name
From Music AS R, Author AS T
Where R.author = T.author_name AND R.year < 1999 AND T.country = 'Germany'
Clearly, this query covers both the Music table and the Author table. To answer it, we must search both tables, and the natural approach is to join them. Provided that one independent indexing infrastructure is built for each attribute involved, a range query specified by several attributes (dimensions) must be answered by querying the indexing infrastructure of each attribute present in the query and then performing an expensive join operation. This approach is adopted in the agenda of Harren et al. [6]. They implement the join operation over multiple tables and propose some reasonable algorithms such as “hash join”.
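To make the cost of this approach concrete, the sketch below evaluates a query of the form of Query 1.5 with a classic in-memory hash join. It is a centralized illustration under our own assumptions (dictionary-shaped rows, the function name `hash_join`), not the distributed algorithm of Harren et al. [6]; in a P2P setting both the build side and the probe side would first have to be shipped between peers.

```python
def hash_join(music, authors, max_year, country):
    """Join Music and Author on the author name after applying both selections."""
    # Build phase: hash the qualifying Author rows on the join attribute.
    by_name = {}
    for row in authors:
        if row["country"] == country:
            by_name.setdefault(row["author_name"], []).append(row)
    # Probe phase: stream the Music rows and look up matching authors.
    result = []
    for row in music:
        if row["year"] < max_year:
            for _match in by_name.get(row["author"], []):
                result.append(row["music_name"])
    return result

# Rows taken from Table 1.1 and Table 1.2.
music = [
    {"music_name": "Carmen", "author": "Bizet", "year": 2000},
    {"music_name": "Fidelio", "author": "Beethoven", "year": 1999},
]
authors = [
    {"author_name": "Beethoven", "author_birth": 1770, "country": "Germany"},
]
```

With the two sample tables, the bound Year < 1999 of Query 1.5 selects nothing, because Fidelio is dated 1999; raising the bound to 2000 returns Fidelio.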
However, even in traditional centralized database systems, join is an expensive operation. In a distributed environment, it unavoidably introduces excessive communication cost and causes maintenance problems. Further, as more tables are inserted into the system, the cost of joins grows. P2P users are impatient and expect quick responses. Hence, an indexing facility that avoids joins is preferred. Avoiding expensive join operations becomes the third theme of our design.
In general, given the current query processing functionality of P2P systems, a new indexing functionality is called for that
1) supports multiple attributes,
2) supports range queries, and
3) avoids expensive join operations.
Our method of multi-dimensional range query processing aims to provide these key functionalities and to move towards supporting more general complex queries.
1.3.2 Multi-dimensional Range Query Processing
In this section, we briefly overview the proposed multi-dimensional range query processing model for P2P networks.
The work reported in this thesis is similar in spirit to that of Harren et al. [6], in that we are interested in supporting database query processing over P2P networks. Our contention is that, in order to support complex queries in the distributed context of peer-to-peer systems, we need to extend the current P2P exact-name lookups to range searches. In this thesis, we propose a method for efficiently answering multi-dimensional range queries on a peer-to-peer data sharing system. Our long-term goal is to support the various types of complex queries in P2P data sharing systems, and we propose to extend P2P systems to support more general queries on potentially more complex and more structured datasets. The query scheme keeps the efficient exact-match lookup of previous DHT-based P2P designs while extending it to multi-attribute queries. For instance, a multi-dimensional single point query can be evaluated as efficiently as a single-keyword exact-match lookup. This avoids the dilemma described in the first query example in the section above.
Previous DHT-based P2P designs process one-dimensional range queries inefficiently by flooding and do not support multi-dimensional range queries. In contrast, the proposed method aims to exploit the range query processing power of DHT-based P2P architectures in a multi-dimensional setting. In the proposed mechanism, a multi-dimensional range query is evaluated in a deterministic and complete manner. This is important for supporting more general and complex queries in P2P systems.
As shown for Query 1.5, in existing DHT-based P2P query methods such as the work of Harren et al., a range query that is specified by several attributes (dimensions) and covers multiple tables must be answered by querying the indexing infrastructure of each attribute present in the query and then performing an expensive join operation, provided that one independent indexing infrastructure is built for each attribute involved. From this point of view, the work here extends the DHT to be built upon multiple attributes, which forms a multi-dimensional index facility. With such an index, we can avoid building an independent indexing infrastructure for each individual attribute. This makes the DHT layer more portable and more general-purpose. In such a setting, complex query types can be answered through a single indexing infrastructure.
The overall architecture of the proposed system is a three-tier model consisting of an application layer, a multi-dimensional indexing layer and local file systems. The key component is the multi-dimensional indexing layer. This part is a distributed hash table (DHT), similar to typical data lookup systems [6, 7]. Unlike the existing DHT exact-match lookup facility, which is based on only one attribute, multiple attributes can be mapped to one point of the virtual space. Meanwhile, our approach attempts to preserve locality while mapping the data elements to the index space, and to allow complex queries. This is achieved through a locality-preserving, dimension-reducing mapping called a Space Filling Curve (SFC) [8]. In the current implementation, we use the Hilbert SFC [9] for the mapping and CAN [5] for the overlay network. The query scheme supports flexible queries (keyword lookup, wildcards and range queries).
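As a concrete illustration of such a mapping, the sketch below computes the position of a two-dimensional grid cell along a Hilbert curve, using the widely known bit-manipulation formulation. The function name `hilbert_index` and the restriction to two dimensions are our choices for this example; it is not the code of Appendix A, which handles the general multi-dimensional case.

```python
def hilbert_index(order, x, y):
    """Position of cell (x, y) along a Hilbert curve over a 2**order x 2**order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        # Each quadrant contributes a block of s*s consecutive positions.
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or reflect so that the sub-curve has the canonical orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d

# Visit order of the four cells of a 2 x 2 grid: (0,0), (0,1), (1,1), (1,0).
order_1_walk = sorted(((x, y) for x in range(2) for y in range(2)),
                      key=lambda cell: hilbert_index(1, *cell))
```

The property that matters for range queries is locality: cells with consecutive indices are always neighbours in the grid, so a compact query region tends to map to a small number of contiguous index intervals.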
This design can be used to index content in P2P file-sharing systems, to complement current resource discovery mechanisms in computational Grids by enhancing them with range queries, and for Web service discovery.
1.4 Research Contributions
The primary technical contributions of this work are as follows:
1. To our knowledge, this is the first approach that supports multi-dimensional range queries. Existing methods that support query processing in P2P systems are all one-dimensional. With multi-dimensional indexing, uniqueness is required only for the set of attributes as a whole, and sets of records can be retrieved on partially specified keys. This is especially meaningful for practical P2P systems, in which users usually issue multi-attribute queries. It provides a foundation for building complex query facilities over DHT-based P2P systems.
2. In particular, our approach to processing multi-dimensional range queries does not require querying several indexing infrastructures (i.e., DHTs). With current one-dimensional methods, if we want to find a resource specified by several attributes (dimensions), an independent index infrastructure must first be built for each attribute. The information infrastructure then queries the appropriate indexing infrastructure for each attribute present in the query and concatenates the results in a database-like “join” operation [6]. Our method is oriented to multi-dimensional range queries and avoids the expensive join operation. It handles multiple attributes with one single DHT-based system, while keeping all the functionality of one-dimensional indexing.
3. For one-dimensional ranges, the usual convenient approach is to sort first and then query. However, without a specific organization, the data in a multi-dimensional space is not ordered. The lack of order prevents multi-dimensional range queries from being processed as conveniently as one-dimensional range queries in a sorted setting. This work sorts the multi-dimensional data in the space by assigning each point a unique sequence number. With the Hilbert space filling curve, we not only sort the multi-dimensional space but also preserve spatial locality. The sorted space gives multi-dimensional queries a similar sorted setting. As a consequence, multi-dimensional range queries can be processed in a similarly efficient way to one-dimensional range queries in the sorted setting.
4. Existing methods process one-dimensional range queries by flooding. Our method processes multi-dimensional range queries by lookup. Hence, besides keeping the fine properties of DHT-based P2P systems, it provides a deterministic structure and performance guarantees in terms of logical hops for processing multi-dimensional queries. It also reduces the high network overhead.
5. An ideal information discovery system has to be efficient and self-organizing and has to guarantee its results. Our approach not only supports flexible searches (keywords, wildcards and range queries) but also guarantees that all existing data elements that match a query will be found at bounded cost.
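The benefit of the sorted setting described in contribution 3 can be sketched as follows. Assume, hypothetically, that the one-dimensional sequence of index values is cut into contiguous intervals, one per peer, and that `boundaries[i]` is the first index owned by peer i. An interval query then needs one lookup to find the first responsible peer and is handed on to successive peers until the interval is covered. The function name and the four-peer layout are illustrative assumptions; the actual query scheme is described in Chapter 4.

```python
from bisect import bisect_right

def peers_for_range(boundaries, low, high):
    """Peers whose contiguous index interval overlaps [low, high].

    boundaries[i] is the first index owned by peer i, so peer i owns
    boundaries[i] .. boundaries[i + 1] - 1. Requires low >= boundaries[0].
    """
    first = bisect_right(boundaries, low) - 1   # lookup of the starting peer
    last = bisect_right(boundaries, high) - 1   # peer at which the relay stops
    return list(range(first, last + 1))

# Hypothetical layout: 64 index values shared by four peers.
boundaries = [0, 16, 32, 48]
```

For example, the interval [20, 40] touches only peers 1 and 2, so peers 0 and 3 are never contacted. A multi-dimensional query region generally maps to several such intervals, each of which can be handled in the same way.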
The rest of the thesis gives a detailed description of the above contributions.
1.5 Organization of the Thesis
The thesis is organized as follows:
In chapter 2, we survey the related literature to give an overview of existing methods of query processing in P2P systems and to reveal the critical need for an index facility that supports multi-dimensional range queries. Further, we introduce the necessary theoretical background on the Hilbert space filling curve.
In chapter 3, after illustrating general methods for building multi-dimensional indexes with the Hilbert space filling curve, we formalize the problem and highlight the system model of our multi-dimensional range query processing.
In chapter 4, we discuss and exemplify the query scheme adopted and illustrate the query enhancement strategy and maintenance strategy used in the current model. Further, we examine the performance of our method through simulation and comparison with existing methods.
In chapter 5, we summarize our contributions and envision future work.
As one kind of unstructured P2P system, Napster [1] provides a centralized directory to index the shared files in the system. A central server maintains the index for all the shared files. New peers joining the system register themselves with the server. Every peer in the system knows the identity of the central server, while the server keeps information about all the nodes and their shared resources. Whenever a peer wants to look up an object, it sends the name of the object to the central server. The server returns the IP addresses of the peers storing that object. The requesting peer then uses IP routing to pass the request to one of the returned peers and downloads the object directly from that peer. The centralized design of Napster has two important disadvantages. First, it is not scalable, since the central server needs to store information about all the peers and objects in the system. Second, it is not fault tolerant, because the central server is a single point of failure.
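The register-then-lookup interaction described above can be sketched in a few lines of Python. This is only an illustrative sketch: the class name `CentralDirectory`, its methods and the peer addresses are hypothetical and are not taken from Napster's actual protocol.

```python
from collections import defaultdict


class CentralDirectory:
    """Hypothetical central index server: object name -> addresses of peers holding it."""

    def __init__(self):
        self.index = defaultdict(set)

    def register(self, peer_addr, object_names):
        # A joining peer announces every object it shares.
        for name in object_names:
            self.index[name].add(peer_addr)

    def unregister(self, peer_addr):
        # A leaving peer is removed from every entry.
        for holders in self.index.values():
            holders.discard(peer_addr)

    def lookup(self, name):
        # Return the addresses of all peers that store the named object.
        return sorted(self.index.get(name, ()))


# Made-up addresses and file names, for illustration only.
directory = CentralDirectory()
directory.register("10.0.0.1", ["song_a.mp3", "song_b.mp3"])
directory.register("10.0.0.2", ["song_b.mp3"])
holders = directory.lookup("song_b.mp3")
```

Every lookup and every registration touches the single `directory` object, which is why the design suffers from the scalability and single-point-of-failure problems noted above.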
Gnutella [2] adopts a “flooding” strategy to solve the problems of the centralized design. There is no central server in the system. Each peer in the Gnutella network knows only about its neighbors. The flooding model is used for both object location and request routing throughout the P2P network. Peers flood their requests to their neighbors, which propagate them further into the whole system. One problem with this design is the high communication overhead that flooding places on the network. Also, because in practice a stop point is set on flooded requests to limit this overhead, some objects may not be found even if they are actually stored in the system.
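The effect of the stop point can be illustrated with a small simulation. The sketch below assumes the stop point is a hop limit (called `ttl` here) and uses a made-up five-peer chain; the function and variable names are hypothetical and do not reproduce the Gnutella wire protocol.

```python
from collections import deque


def flood_search(graph, start, wanted, ttl):
    """Flood a query from `start`; each hop uses up one unit of `ttl`.

    Returns (peers reached that hold `wanted`, number of messages sent).
    """
    seen = {start}
    frontier = deque([(start, ttl)])
    hits, messages = [], 0
    while frontier:
        peer, remaining = frontier.popleft()
        if wanted in graph[peer]["objects"]:
            hits.append(peer)
        if remaining == 0:
            continue  # stop point reached: do not forward any further
        for neighbor in graph[peer]["neighbors"]:
            messages += 1  # a message is sent even if the neighbor already saw the query
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, remaining - 1))
    return hits, messages


# Hypothetical chain A-B-C-D-E in which only E stores the object.
graph = {
    "A": {"neighbors": ["B"], "objects": set()},
    "B": {"neighbors": ["A", "C"], "objects": set()},
    "C": {"neighbors": ["B", "D"], "objects": set()},
    "D": {"neighbors": ["C", "E"], "objects": set()},
    "E": {"neighbors": ["D"], "objects": {"song.mp3"}},
}
short_hits, short_msgs = flood_search(graph, "A", "song.mp3", ttl=2)
long_hits, long_msgs = flood_search(graph, "A", "song.mp3", ttl=4)
```

With `ttl=2` the query stops at peer C and the object at E is missed although it exists; with `ttl=4` it is found, at the price of more messages.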
In the P2P research area, several research initiatives are underway to solve the data management problem. For example, Gribble et al. [10], in their position paper titled “What can peer-to-peer do for databases, and vice versa?”, outline some of the complexities that need to be addressed before P2P designs can be applied to improve database query processing. They examine some important dimensions of data placement in P2P settings, such as the scope of decision-making, the extent of knowledge sharing, the heterogeneity of information sources, and the dynamics of participants. They try to define a simplified form of the P2P data placement problem and assert that the data placement problem is intractable at the extreme points of each of the dimensions. They show that even the simplest form of the problem is NP-complete.
In a recent paper, Harren et al. [6] explore the issue of implementing database operators over DHT-based P2P systems. To satisfy the requirements of broad applicability and minimal extension to DHT APIs, they propose algorithms for implementing Join and Selection operators in DHT-based P2P systems composed of a data storage system, an enhanced DHT layer and a query processor. The proposed basic Join algorithm on two relations is based on the “symmetric” hash join. The DHT infrastructure is used here for routing and storing tuples. Their work tries to improve the semantic aspect of the P2P paradigm from an interesting perspective, but rests on one assumption: shared information is organized into different relational tables and stored in the system. The underlying technique basically exploits the exact-match lookup functionality of DHT-based P2P systems. Each peer holds one section (or a few tuples) of the relational table and forms one mini-database. In their work, if the query is confined to one individual table and focuses on the attribute that is currently hashed, the answer can be obtained directly through the DHT's efficient lookup functionality. Otherwise, flooding must be used to search for the qualifying tuples. Further, if the query covers multiple tables, a join operation between different tables is required. However, the join operation is expensive even in a centralized scenario, let alone in a distributed environment. Thus, a mechanism that avoids the prohibitive “join” operation between two tables in the distributed environment is desirable. This is also one of the motivations for this thesis, and we will see that our design has this virtue.
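For readers unfamiliar with the symmetric hash join mentioned above, the following single-process Python sketch shows the core idea: each relation has its own hash table, and every arriving tuple is inserted into its own table and then probes the other. It is not the distributed algorithm of Harren et al. [6]; the function name, the stream format and the sample tuples are assumptions made for illustration.

```python
from collections import defaultdict


def symmetric_hash_join(stream, key_r, key_s):
    """Join tuples of R and S as they arrive.

    `stream` yields ("R", row) or ("S", row) in arrival order;
    `key_r` and `key_s` are the positions of the join attribute in each row.
    """
    table_r, table_s = defaultdict(list), defaultdict(list)
    results = []
    for side, row in stream:
        if side == "R":
            k = row[key_r]
            table_r[k].append(row)  # insert into own table
            results.extend((row, other) for other in table_s.get(k, []))  # probe the other
        else:
            k = row[key_s]
            table_s[k].append(row)
            results.extend((other, row) for other in table_r.get(k, []))
    return results


# Hypothetical tuples: R(name, id) joined with S(id, tag) on id.
arrivals = [
    ("R", ("a", 1)),
    ("S", (1, "x")),
    ("R", ("b", 1)),
    ("S", (2, "y")),
]
joined = symmetric_hash_join(arrivals, key_r=1, key_s=0)
```

In a DHT setting the two hash tables would be spread over the peers, so every tuple has to be shipped to the peer responsible for its join key, which hints at why the join is costly in a distributed environment.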
One of the important open problems revealed by the work of Harren et al. [6] is how to support range queries over the hash index facility of a DHT. Without range query processing functionality, it is difficult to support general relational queries over P2P systems. Gupta et al. [11] take an initial step towards enabling general query processing over the P2P data sharing paradigm. They try to solve the range query problem in an approximate way. Approximate answers to a given range query are provided by using Locality Sensitive Hashing functions to find relevant ranges based on similarity. Their system tries to locate peers that have the most relevant partitions for the submitted query. Although the result is neither precise nor guaranteed, their work constitutes an important step towards a final solution for rich query facilities over P2P systems.
2.2 One-dimensional Indexing for P2P designs
In a DBMS with a one-dimensional index, the value of one field in the relational table is termed the “primary key” and is required to be distinct for all records. The B-tree index and the hashing index are two representative approaches to indexing data sets based on the primary key.
In the P2P world, recent DHT-based peer-to-peer systems such as CAN [5], Chord [7] or Tapestry [12] essentially provide a one-dimensional indexing mechanism to locate data present in the system by implementing a Distributed Hash Table. A DHT is a data structure for storing (key, data) pairs in a distributed manner, which allows data to be located quickly when a key is given. As an example, it is possible to search in Gnutella for music by Celine Dion. In such a one-dimensional distributed indexing infrastructure, given a specific value of the key, a list of peers that store the requested data can be located with performance guarantees in terms of logical hops. While one-dimensional indexing provides efficient retrieval on the primary key, retrieval by a combination or correlation of several attributes requires querying several independent DHTs, one for each attribute, and then concatenating the results in a database-like “join” operation.
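The need for one DHT per attribute can be made concrete with a toy model. The sketch below assumes a consistent-hashing ring in which a key is stored on the first peer whose identifier is not smaller than the key's hash; the class `ToyDHT`, the peer names and the song records are all hypothetical, and no network communication is modelled.

```python
import hashlib
from bisect import bisect_left


class ToyDHT:
    """In-memory stand-in for a DHT offering exact-match put/get only."""

    def __init__(self, peer_names, bits=16):
        self.space = 1 << bits
        self.ring = sorted((self._hash(p), p) for p in peer_names)
        self.store = {p: {} for p in peer_names}

    def _hash(self, text):
        digest = hashlib.sha1(text.encode()).digest()
        return int.from_bytes(digest, "big") % self.space

    def owner(self, key):
        # First peer clockwise from the key's position on the ring.
        ids = [pid for pid, _ in self.ring]
        i = bisect_left(ids, self._hash(key)) % len(self.ring)
        return self.ring[i][1]

    def put(self, key, value):
        self.store[self.owner(key)].setdefault(key, set()).add(value)

    def get(self, key):
        return self.store[self.owner(key)].get(key, set())


peers = ["p1", "p2", "p3", "p4"]
by_author = ToyDHT(peers)  # one index per attribute
by_year = ToyDHT(peers)
songs = [("s1", "Celine Dion", 1997), ("s2", "Celine Dion", 2002), ("s3", "Artist X", 1997)]
for song_id, author, year in songs:
    by_author.put(author, song_id)
    by_year.put(str(year), song_id)

# A two-attribute query needs two lookups plus a join-like intersection.
matches = by_author.get("Celine Dion") & by_year.get("1997")
```

The final intersection plays the role of the “join” described above; with more attributes, more indexes and more intermediate results have to be combined.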
2.3 Multi-dimensional indexing for P2P system
A multi-dimensional index is based on the notion that more than one field or attribute of a record type is specified as constituting the primary key set for records of that type, and that each key in the key set plays an equal role in determining the physical placement of
A review of these methods is given by Gaede and Günther [14]. However, no multi-dimensional indexing infrastructure over P2P systems exists yet.
2.4 Multi-dimensional Indexing using Hilbert Space Filling Curve
Space-filling curves emerged in the 19th century: the space-filling curve was originally proposed by Peano [15], and David Hilbert [16] gave its first geometrical representation. Fig. 2.1 reproduces the figures in Hilbert's paper, which show the first three orders of the Hilbert curve in two-dimensional space, except that the sequence numbers of the points start from 0 instead of 1. For more recent developments and applications of the Hilbert space filling curve, see Lawder [17].
Figure 2.1 Hilbert Curve in Two-dimensional Space: (a) first order, (b) second order, (c) third order
The Hilbert curve is produced recursively. Fig. 2.1 (a) shows the first order curve in two-dimensional space, consisting of four points. The square is divided into four sub-squares, and the curve is also divided into four segments. The one-to-one mapping between sub-squares and curve segments ensures that adjacent line intervals always correspond to adjacent sub-squares. The line passing through the centers of the sub-squares assigns the ordering of the sub-squares. Fig. 2.1 (b) shows the next recursive step, the second order curve. Comparing Fig. 2.1 (b) with Fig. 2.1 (a), we find that each sub-square and its corresponding line interval have been further divided. Fig. 2.1 (c) shows the third step in the sequence.
Because a point on the line can be viewed as the limit of an infinite sequence of nested line intervals, and a point in the square as the limit of an infinite sequence of nested sub-squares, continuing the recursion indefinitely establishes a mapping between the points on the line and the points in the square. Further, because each point on the line has a fixed order, its corresponding point in the square is assigned an order. This is known as the Hilbert curve. At any given order, the curve passes through every grid point in the space exactly once and gives each point a unique sequence number; it is therefore called a space-filling curve. For details about space filling curves, see Sagan [18] and Lawder [17].
One important application of the Hilbert curve lies in embedding multi-dimensional points into one dimension while preserving spatial proximity to the largest possible extent, so that similar points are clustered together.
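The embedding of two-dimensional grid points into one dimension can be sketched with the widely used iterative coordinate-to-index conversion for the Hilbert curve. The sketch is illustrative only: the function names are hypothetical, the orientation of the curve may differ from the one drawn in Fig. 2.1, and the brute-force helper `range_to_intervals` is an assumption made for demonstration, not the query scheme of this thesis.

```python
def hilbert_index(order, x, y):
    """Sequence number of grid cell (x, y) on a Hilbert curve of the given order.

    The grid has 2**order cells per side and sequence numbers start from 0.
    """
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)  # which quadrant, in curve order
        if ry == 0:  # rotate/flip so the sub-curve has standard orientation
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def range_to_intervals(order, x_lo, x_hi, y_lo, y_hi):
    """Brute force: map a rectangle to maximal runs of consecutive sequence numbers."""
    indices = sorted(
        hilbert_index(order, x, y)
        for x in range(x_lo, x_hi + 1)
        for y in range(y_lo, y_hi + 1)
    )
    intervals = []
    for d in indices:
        if intervals and d == intervals[-1][1] + 1:
            intervals[-1] = (intervals[-1][0], d)  # extend the current run
        else:
            intervals.append((d, d))  # start a new run
    return intervals


first_order = [hilbert_index(1, x, y) for (x, y) in [(0, 0), (0, 1), (1, 1), (1, 0)]]
box = range_to_intervals(2, 1, 2, 2, 3)
```

Because neighbouring cells tend to receive nearby sequence numbers, a rectangular range usually collapses into a few one-dimensional intervals, each of which can be handled like a one-dimensional range query in a sorted setting.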
Chapter 3
System Model
3.1 Problem formulation
3.1.1 Data Management Process in DHT-based P2P Systems
The overall data management process for a DHT-based P2P system works as follows. Assume we are given a set of nodes that are inter-connected into a P2P application network. Each link between two nodes consumes some bandwidth. At any instant, each peer can either act as a server to answer query requests or play the role of a client to send requests to other peers. That is, they adopt the peer-to-peer communication model. There is no centralized index of the shared resources offered by each peer. Unlike unstructured P2P designs like Napster [1], where the information about the shared resources is stored in a centralized index server, such information in DHT-based P2P designs is distributed among the peers. As no central index server exists, a given query must be forwarded into the system and follow the routing mechanism of the P2P overlay network in order to reach the peer that contains the answers to the query. When a peer receives such a query request, it must first check whether the request can be answered from its own share of information. If not, this peer determines the next peer to forward the query to. After the node that contains the answers is found, the answer is forwarded back to the node that issued the query. Finally, according to the returned information, the requesting peer connects to the target peer to transfer the resource of interest. This forms one typical data management process in DHT-based P2P systems. Query forwarding, query evaluation at each peer and query backtracking constitute the major query evaluation costs.
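The check-locally-then-forward loop described above can be sketched as follows. The sketch assumes a Chord-style ring [7] with finger tables, a 6-bit identifier space and an illustrative set of ten peer identifiers; the text above does not prescribe a particular overlay, so all names and parameters here are hypothetical, and joins, failures and the reply path are not modelled.

```python
M = 6  # identifier bits (assumption for this sketch)
SPACE = 1 << M


def between(x, a, b):
    """True if x lies in the ring interval (a, b]."""
    if a == b:
        return True  # the interval covers the whole ring
    return 0 < (x - a) % SPACE <= (b - a) % SPACE


class Peer:
    def __init__(self, ident):
        self.id = ident
        self.data = {}  # key identifier -> value stored at this peer
        self.successor = None
        self.fingers = []  # peers responsible for id + 2**k, k = 0..M-1


def build_ring(ids):
    ordered = sorted(ids)
    peers = {i: Peer(i) for i in ordered}

    def successor_of(point):
        # First peer at or after `point`, wrapping around the ring.
        for i in ordered:
            if i >= point:
                return peers[i]
        return peers[ordered[0]]

    for p in peers.values():
        p.successor = successor_of((p.id + 1) % SPACE)
        p.fingers = [successor_of((p.id + (1 << k)) % SPACE) for k in range(M)]
    return peers, successor_of


def lookup(start, key):
    """Forward a query for `key`; return (value, identifiers of the peers visited)."""
    peer, path = start, [start.id]
    while key not in peer.data:  # query evaluation at the current peer
        if between(key, peer.id, peer.successor.id):
            peer = peer.successor  # the successor is responsible for the key
            path.append(peer.id)
            break
        # Otherwise jump to the farthest finger that does not overshoot the key.
        peer = next(f for f in reversed(peer.fingers) if between(f.id, peer.id, key))
        path.append(peer.id)
    return peer.data.get(key), path


peers, successor_of = build_ring([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])
successor_of(54).data[54] = "song.mp3"  # key 54 is stored at its successor, peer 56
value, path = lookup(peers[8], 54)
```

The length of `path` minus one is the number of forwarding hops; the answer then travels back to the requesting peer, which accounts for the backtracking cost mentioned above.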
The data management problem varies across different P2P overlay networks. An examination of the data management problem in different P2P implementations reveals that an efficient indexing mechanism is the core of query processing in P2P systems.
P2P networks with a centralized index server, like Napster [1], need not worry about forwarding queries: all queries are sent to the centralized directory server. As a result, such systems face single-point-of-failure and scalability problems, and hence a robust and distributed index is demanded. At the other extreme, Gnutella [2] does not have any centralized index. The only way a query can be answered is to flood it into the whole system. This approach leads directly to high communication costs and a high query workload for the system, and has made inefficient query processing Gnutella's notorious feature. In DHT-based P2P systems, although the DHT is essentially a distributed hashing index, its inability to support range lookups has hampered its wide application in supporting more general queries. Thus, if we can adapt the DHT into a more general indexing layer that supports range predicates, more complex queries can be expected to be supported in P2P systems. This is also one of the important motivations for this thesis.
For instance, in music file-sharing DHT-based P2P systems, every peer uses the global schema: R (music_name, rank, author, album, year) to express their music collections