MULTI-DIMENSIONAL RANGE QUERY EVALUATION FOR DISTRIBUTED HASH TABLE BASED PEER-TO-PEER SYSTEMS

ZHANG GONG
(B.Sci., Xi'an Jiaotong University, China)

A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE
DEPARTMENT OF COMPUTER SCIENCE
NATIONAL UNIVERSITY OF SINGAPORE
2004

Acknowledgement

First, I would like to express my heartfelt thanks to my supervisor, Dr Gary S H Tan, for his supervision throughout my master's study. My sincere gratitude also goes to Associate Professor Kian-Lee Tan for all his advice and constant guidance during all phases of this thesis. They have conscientiously provided me with careful guidance at every stage of my research, offered ideas whenever I ran into difficulties, and constructively corrected my mistakes in the course of my work. Participating in their projects has given me many opportunities to develop my research and analytical abilities. Their support enabled me both to learn and to write what is presented in this thesis. In addition, they have given me constructive suggestions on my attitude to work, which is helpful to my career development.

Others I would like to thank include Ng Yew Kwong, Hu Yu and Gozali Johan Prawira, with whom I enjoyed discussing P2P systems and programming questions. My sincere appreciation also goes to my lab fellows Ameya Virkar, Liu Ming and Liu Peng for their generous help both in my research and in my life, and for the pleasant and friendly environment of the computer system lab.

Last but not least, I would like to convey my gratitude to the thesis examiners for taking time from their busy schedules to assess my research work.

Table of Contents

Chapter 1 Introduction
1.1 DHT-based P2P Systems
1.2 Complex Query over DHT-based P2P Systems
1.2.1 Marrying Database and P2P Paradigm
1.2.2 Complex Query over DHT-based P2P Systems
1.3 Multi-dimensional Range Query Evaluation
1.3.1 Motivations: Three Complex Queries
1.3.2 Multi-dimensional Range Query Processing
1.4 Research Contributions
1.5 Organization of the Thesis
Chapter 2 Literature Review
2.1 Related Work
2.2 One-dimensional Indexing for P2P Designs
2.3 Multi-dimensional Indexing for P2P Systems
2.4 Multi-dimensional Indexing Using Hilbert Space Filling Curve
Chapter 3 System Model
3.1 Problem Formulation
3.1.1 Data Management Process in DHT-based P2P Systems
3.1.2 Problem Formulation
3.2 Design Principles
3.2.1 General Principles
3.2.2 Design Goals
3.3 System Model
3.3.1 Three-tier Architecture
3.3.2 Application Layer
3.3.3 Multi-dimensional Indexing Layer
3.3.4 Two-fold Property of Partitioning Manner
3.3.5 File Systems
Chapter 4 Multi-dimensional Range Query Evaluation
4.1 Multi-dimensional Single Point Query Evaluation
4.2 Multi-dimensional Range Query Evaluation
4.3 Zone Maintenance
4.4 System Performance Evaluation
4.4.1 Multi-dimensional Single Point Query
4.4.2 Multi-dimensional Range Query Evaluation
4.5 Selectivity Factor
4.6 Design Improvements
4.6.1 Parallelism Strategy
4.7 Comparison with Naïve Flooding Method
Chapter 5 Conclusions
References
Appendix A Sample Code for Mapping Multi-dimensional Point into Hilbert Sequence Number
Appendix B Join Time Load Balance Code
Appendix C Overlay Network's Transit-Stub Topologies Building

Summary

In this thesis, we investigate the issue of enabling current DHT-based P2P systems to support multi-dimensional range queries, towards the long-term goal of providing complex query facilities in P2P systems. We adopt a multi-dimensional coordinate space model, which is sorted by a Hilbert space filling curve. Sorting makes range query processing in a multi-dimensional coordinate space possible. The space is partitioned both by zones and by dividing the sequence along a single direction. This helps extend the DHT functionality layer's desirable property of efficient exact-match lookup into higher dimensions. We propose a relay-race-like query scheme and introduce some strategies to improve performance, such as parallelism and lookup during processing.

The performance of the system is evaluated via simulation. Several metrics are explored, such as the number of nodes visited, query latency and per-hop latency. The evaluation shows that the proposed model not only keeps the scalability of the system but also processes multi-dimensional range queries at bounded cost. This system can be incorporated into computational grids to enhance their information discovery capability.

Compared with current methods of processing range queries in P2P systems, our method, as a general method, is better in these aspects:

1. Existing methods are all one-dimensional. Our method provides a general way to answer multi-dimensional range queries, and a one-dimensional query is one special form of a multi-dimensional query. To our knowledge, this is the first approach that supports multi-dimensional range query processing.

2. Because our method is oriented to multi-dimensional range queries, it avoids the expensive join operation. With current one-dimensional methods, if we want to find a resource specified by several attributes (dimensions), an independent index infrastructure must first be built for each attribute. The information infrastructure then queries the appropriate indexing infrastructure for each attribute present in the query and concatenates the results in a database-like “join” operation [13]. Our method handles multiple attributes with a single DHT-based system.

3. Existing methods process range queries in a “flooding” manner. Our method processes multi-dimensional range queries by lookup. Hence, besides extending the desirable property of DHT-based P2P systems, it provides a deterministic structure and performance guarantees, in terms of logical hops, for processing multi-dimensional queries. The high network overhead is also reduced. Our method answers multi-dimensional range queries in a deterministic and complete manner.

List of Figures

Figure 1.1 Napster: Centralized P2P systems
Figure 1.2 Routing in CAN
Figure 2.1 Hilbert Curve in Two-dimensional Space
Figure 3.1 Three-tier Architecture
Figure 3.2 Two-dimensional space and range query region
Figure 3.3 System Model
Figure 3.4 P2P system with peers
Figure 3.5 Zone partitioning
Figure 4.1 Query evaluation
Figure 4.2 Space partitioning
Figure 4.3 Multi-dimensional single point query performance
Figure 4.4 Average path length
Figure 4.5 Query response time
Figure 4.6 Effect of parallelism strategy on response time
Figure 4.7 Extra communication overheads introduced by parallelism strategy
Figure 4.8 Comparison of the two schemes on the aspect of the number of visited nodes
Figure 4.9 Comparison of the two schemes on the user perceived response time

List of Tables

Table 1.1 Relational table distributed into the P2P system
Table 1.2 The second table stored in the P2P system

List of Queries

Query 1.1
Query 1.2
Query 1.3
Query 1.4
Query 1.5
Query 3.1
Query 3.2
Query 3.3
Query 4.1
Query 4.2

[Plot: Number of Visited Nodes, Comparison with Flooding. Series: MD, Flooding. X-axis: System Size.]

Figure 4.9 Comparison of the two schemes on the aspect of the number of visited nodes. The X-axis is system size, and the Y-axis is the number of nodes visited to answer one general 3-dimensional range query. This figure is for uniform data distribution.

[Plot: Response Time. Series: MD, Flooding. X-axis: System Size. Y-axis: Response Time (s).]

Figure 4.10 Comparison of the two schemes on the user perceived response time. The X-axis is system size, and the Y-axis is the user perceived response time to answer one general 3-dimensional range query. This figure is for uniform data distribution.

Our query scheme is more general: it supports general multi-dimensional range query processing and is not limited to one-dimensional range lookup. Also, in our scheme, the attribute domain is not limited to integers. In general, our scheme differs from the existing method in two important aspects:

1. Unlike the existing method, which aims at processing simple range lookups in P2P, our model is a general one. The one-dimensional range query is the most basic query form that our query scheme supports. More generally, the scheme supports multi-dimensional range queries.

2. Our query scheme keeps the desirable property of DHT-based P2P systems. It keeps the forward routing manner.
This avoids flooding, which consumes much bandwidth. As we know, flooding makes Gnutella both notoriously inefficient and costly in communication overhead. A request must go through nearly all the nodes in the system to obtain a complete answer. This works when the system is not very large. However, as the system grows and more and more queries are issued into it, flooding inevitably leads to a particularly high workload and poor query performance. In practice, to limit the communication workload, a stop point is usually set in query processing, so that the query is not flooded to every node in the system. This does reduce the communication workload, but it poses another problem: the query results become incomplete. An answer may not be found even if it is actually stored in the system.

Chapter 5 Conclusions

In this thesis, we investigate the issue of enabling current DHT-based P2P systems to support multi-dimensional range queries, towards the long-term goal of providing complex query facilities in P2P systems. We adopted a multi-dimensional coordinate space model, which is sorted by a Hilbert space filling curve. The sorting makes range query processing in a multi-dimensional coordinate space possible. The space is partitioned both by zones and by dividing the sequence along a single direction. This helps to keep the DHT functionality layer's desirable property of efficient exact-match lookup. We propose a relay-race-like query scheme and introduce some strategies to improve performance, such as parallelism.

The performance of the system is evaluated via simulation. Several metrics are explored, such as the number of nodes visited and the query response time. The evaluation shows that the proposed model not only keeps the scalability of the system but also processes multi-dimensional range queries at bounded cost. This system can be incorporated into computational grids to enhance their information discovery capability.

Compared with current methods of processing range queries in DHT-based P2P systems, our method, as a general method, is better in these aspects:

1. Existing methods are all one-dimensional. Our method provides a general way to answer multi-dimensional range queries, and a one-dimensional query is one special form of a multi-dimensional query. To our knowledge, our method is the first approach that supports multi-dimensional range query processing.

2. Because our method is oriented to multi-dimensional range queries, it avoids the expensive join operation. With current one-dimensional methods, if we want to find a resource specified by several attributes (dimensions), an independent index infrastructure must first be built for each attribute. The information infrastructure then queries the appropriate indexing infrastructure for each attribute present in the query and concatenates the results in a database-like “join” operation [6]. Our method handles multiple attributes with a single DHT-based system.

3. Existing methods process range queries in a “flooding” manner. Our method processes multi-dimensional range queries by lookup. Hence, besides keeping the desirable property of DHT-based P2P systems, it provides a deterministic structure and performance guarantees, in terms of logical hops, for processing multi-dimensional queries. The high network overhead is also reduced. Our method answers multi-dimensional range queries in a deterministic and complete manner.

This thesis makes one important step towards the long-term goal of providing complex query facilities for P2P systems. Although this model is based on CAN, in the future we will try to evaluate it on other DHT-based P2P designs such as Chord and Tapestry. We also intend to implement this method in computational grids to complement their current information service. We are also exploring support for more complex predicates, such as regular expressions and near-neighbor queries. It would also be interesting to address related issues, such as using caching to improve query processing power in P2P systems without sacrificing correctness, and applying adaptive query processing techniques to P2P systems.

References

[1] Napster, http://www.napster.com
[2] Gnutella, http://www.gnutella.wego.com
[3] FastTrack, http://www.fasttrack.nu
[4] W.S. Ng, B.C. Ooi, K.L. Tan, A. Zhou. PeerDB: A P2P-based System for Distributed Data Sharing. International Conference on Data Engineering (ICDE'2003), Bangalore, 2003.
[5] Ratnasamy, S., Francis, P., Handley, M., Karp, R., and Shenker, S. A scalable content-addressable network. In Proc. ACM SIGCOMM (San Diego, CA, August 2001), pp. 161-172.
[6] Matthew Harren, Joseph M. Hellerstein, Ryan Huebsch, Boon Thau Loo, Scott Shenker, and Ion Stoica. Complex queries in DHT-based peer-to-peer networks. In Proceedings of the first International Workshop on Peer-to-Peer Systems, 2002.
[7] I. Stoica, R. Morris, D. Karger, M.F. Kaashoek, and H. Balakrishnan. “Chord: A scalable peer-to-peer lookup service for Internet applications”, ACM SIGCOMM '01, San Diego, CA, USA, 2001.
[8] David Hilbert. Ueber die stetige Abbildung einer Linie auf ein Flächenstück. Mathematische Annalen, 38:459-460, 1891.
[9] Theodore Bially. Space-Filling Curves: Their Generation and Their Application to Bandwidth Reduction. IEEE Transactions on Information Theory, IT-15(6):658-664, Nov. 1969.
[10] Gribble. What can databases do for peer-to-peer?
[11] Abhishek Gupta, Divyakant Agrawal, and Amr El Abbadi. Approximate range selection queries in peer-to-peer systems. Technical Report UCSB/CSD-2002-23, University of California at Santa Barbara, 2002.
[12] Zhao, B. Y., Kubiatowicz, J., and Joseph, A. Tapestry: An infrastructure for fault-tolerant wide-area location and routing. Tech. Rep. UCB/CSD-01-1141, University of California at Berkeley, Computer Science Department, 2001.
[13] I. Stoica, R. Morris, D. Karger, M.F. Kaashoek, and H. Balakrishnan. “Chord: A scalable peer-to-peer lookup service for Internet applications”, ACM SIGCOMM '01, San Diego, CA, USA, 2001.
[14] Volker Gaede, Oliver Günther. Multidimensional Access Methods. ACM Computing Surveys, 30(2):170-231, June 1998.
[15] Giuseppe Peano. Sur une courbe, qui remplit toute une aire plane. Mathematische Annalen, 36:157, 1890.
[16] D. Hilbert. Über die stetige Abbildung einer Linie auf ein Flächenstück. Math. Annln., 38:459-460, 1891.
[17] J.K. Lawder. The Application of Space-Filling Curves to the Storage and Retrieval of Multi-dimensional Data. PhD thesis, Birkbeck College, University of London, 2000.
[18] Hans Sagan. Space-Filling Curves. Springer-Verlag, 1994.
[19] Theodore Bially. Space-Filling Curves: Their Generation and Their Application to Bandwidth Reduction. IEEE Transactions on Information Theory, IT-15(6):658-664, Nov. 1969.
[20] J. Nievergelt, Hans Hinterberger, Kenneth C. Sevcik. The Grid File: An Adaptable, Symmetric Multikey File Structure. ACM Transactions on Database Systems (TODS), 9(1):38-71, 1984.
[21] E. Zegura, K. Calvert, and S. Bhattacharjee. How to Model an Internetwork. In Proceedings of IEEE Infocom '96, San Francisco, CA, May 1996.
[22] Artur Andrzejak, Zhichen Xu. Scalable, Efficient Range Queries for Grid Information Services. Technical Report.

Appendix A: Join time maintenance code

/* Pick the node that a joining peer should share or split a zone with. */
mdNode *load_balance_strategy(mdNode *node, int rlty)
{
    mdNode *nbrnode, *maxnode, *minnode;
    nbrEntry *nbr;
    float lo[DIMN], hi[DIMN];
    float vol, maxvol, myvol;
    int peers, minpeers;

    /* first try to balance load based on the number of nodes sharing one zone */
    minpeers = number_peers(node, rlty);
    minnode = node;
    nbr = node->nbrs[rlty];
    while (nbr) {
        nbrnode = searchNodeList(nbr->id);
        peers = number_peers(nbrnode, rlty);
        if (peers < minpeers) {
            minpeers = peers;
            minnode = nbrnode;
        }
        nbr = nbr->next;
    }
    if (minpeers < (MAXPEERS - 1)) {
        return minnode;
    }

    /* the second check strategy is to balance based on the size of the zone owned by one node */
    convert_coorstr(node->c[rlty], lo, hi);
    myvol = calc_volume(lo, hi);
    maxvol = -1;
    maxnode = node;   /* defined value even when there are no neighbours */
    nbr = node->nbrs[rlty];
    while (nbr) {
        nbrnode = searchNodeList(nbr->id);
        convert_coorstr(nbrnode->c[rlty], lo, hi);
        vol = calc_volume(lo, hi);
        if (vol > maxvol) {
            maxvol = vol;
            maxnode = nbrnode;
        }
        nbr = nbr->next;
    }
    if (maxvol > myvol) {
        return maxnode;
    } else {
        return node;
    }
}

Appendix B: Sample code to build overlay network’s Transit-Stub topologies

#include "md.h"

void initTopoList()
{
    int i;
    for (i = 0; i [...]

        [...]->node = node;
        topoList[hashkey]->next = NULL;
    } else {
        /* walk the bucket's chain; duplicate keys are an error */
        while ((ptr->next != NULL) && ((ptr->node)->key != key))
            ptr = ptr->next;
        if (((ptr->node)->key) == key) {
            printf("Error in topo list tried to add existent entry\n");
            exit(1);
        }
        ptr->next = (topoListEntry *)malloc(sizeof(topoListEntry));
        ptr = ptr->next;
        ptr->node = node;
        ptr->next = NULL;
    }
}

/* Return the topology node stored under key, or NULL if it is absent. */
topoNode *searchTopoList(int key)
{
    int hashkey;
    topoListEntry *ptr;

    hashkey = hash(key);
    ptr = topoList[hashkey];
    if (ptr == NULL) {
        return (topoNode *)NULL;
    }
    while ((ptr->next != NULL) && ((ptr->node)->key != key))
        ptr = ptr->next;
    if ((ptr->node)->key == key) {
        return (ptr->node);
    } else
        return ((topoNode *)NULL);
}

/* Unlink and free the entry stored under key; a missing key is fatal. */
void deleteTopoListEntry(int key)
{
    int hashkey;
    topoListEntry *ptr;
    topoListEntry *temptr;

    hashkey = hash(key);
    ptr = topoList[hashkey];
    if (ptr == NULL) {
        printf("Node list : Tried to delete nonexistent topo entry \n");
        exit(1);
    }
    if ((ptr->node)->key == key) {
        /* the entry is the head of its bucket */
        topoList[hashkey] = topoList[hashkey]->next;
        free(ptr->node);
        free(ptr);
        return;
    }
    while ((ptr->next != NULL) && (((ptr->next)->node)->key != key))
        ptr = ptr->next;
    if (ptr->next == NULL) {
        printf("Topo list : Tried to delete nonexistent entry \n");
        exit(1);
    }
    if ((((ptr->next)->node)->key) == key) {
        /**** Found it ******/
        temptr = ptr->next;
        ptr->next = temptr->next;
        free(temptr->node);
        free(temptr);
    }
    return;
}

/* Allocate a topology node with no edges and register it in the list. */
int create_toponode(int keyval)
{
    int i, nodeId;
    topoNode *node;

    node = (topoNode *)malloc(sizeof(topoNode));
    node->key = keyval;
    node->edgelist = NULL;
    addTopoListEntry(node->key, node);
    return node->key;
}

/* Prepend a directed edge from -> to, weighted by round-trip time. */
void add_edge(int from, int to, float rtt)
{
    topoNode *tnode;
    edge *new;

    tnode = searchTopoList(from);
    new = (edge *)malloc(sizeof(edge));
    new->key = to;
    new->rtt = rtt;
    new->next = tnode->edgelist;
    tnode->edgelist = new;
}

/* A duplex link is a pair of directed edges with the same RTT. */
void duplex_link(int from, int to, float rtt)
{
    add_edge(from, to, rtt);
    add_edge(to, from, rtt);
}

/* Look up the RTT of the edge from -> to; a missing node or edge is fatal. */
float return_rtt(int from, int to)
{
    topoNode *node;
    edge *eptr;

    node = searchTopoList(from);
    if (node == NULL) {
        printf("node not found in topology\n");
        exit(1);
    }
    eptr = node->edgelist;
    while (eptr) {
        if (eptr->key == to) {
            return eptr->rtt;
        }
        eptr = eptr->next;
    }
    printf("return_rtt:edge from %d - %d not found\n", from, to);
    exit(1);
}

/* Read a topology description from the file fname. */
int create_topo(char *fname)
{
    FILE *fp;
    int i, val, from, to, rtt;
    int numnodes;
    char buffer[100];

    if (!(fp = fopen(fname, "r"))) {
        fprintf(stderr, "error:opening file %s", fname);
        exit(1);
    }
    fscanf(fp, "%d", &numnodes);
    for (i = 0; i [...]

... decentralized and distributed hash table. The most notable functionality of a hash table is quick exact-match lookup. An index based on a hash table can hardly support range queries other than by a full scan. Hence, DHT-based P2P systems support only exact-match lookups and do not support range queries or multi-attribute queries efficiently. This inherent deficiency aggravates the poor query facilities in P2P systems ...
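The point about hash indexes and range queries can be made concrete with a small sketch. It is an illustration only, not code from the thesis: the peer count, the key domain, and the two placement functions (`hashed_owner`, `ordered_owner`) are hypothetical, and the modulo function stands in for a real DHT hash.

```c
#define NODES 8          /* hypothetical number of peers */
#define KEYSPACE 8000    /* hypothetical key domain: 0 .. 7999 */

/* Toy hash placement (assumption): consecutive keys land on different peers. */
int hashed_owner(int key) { return key % NODES; }

/* Order-preserving placement (assumption): each peer owns one contiguous slice. */
int ordered_owner(int key) { return key / (KEYSPACE / NODES); }

/* Count the distinct peers that hold at least one key in [lo, hi]. */
int peers_touched(int lo, int hi, int (*owner)(int))
{
    int seen[NODES] = {0};
    int count = 0;
    for (int key = lo; key <= hi; key++) {
        int n = owner(key);
        if (!seen[n]) {
            seen[n] = 1;
            count++;
        }
    }
    return count;
}
```

With these toy settings, the ten keys 1990 to 1999 are spread over all 8 peers under the hashed placement but sit on a single peer under the ordered placement, which is why a range query over a hash index degenerates into a scan of every peer.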
... have to build another relational table to store this kind of independent theme, in addition to the hash indexing infrastructure built on music_name. This table is distributed into the P2P system according to the hash value of the attribute author_name. As a result, the current P2P music sharing system has two tables distributed into it. It is easy to find information bound to an individual table. For instance, ...

... functionality of one-dimensional indexing.

3. For a one-dimensional range, the usual convenient way is to sort first and then query. However, without specific organization, the data in a multi-dimensional space is not ordered. The lack of order prevents a multi-dimensional range query from being processed as conveniently as a one-dimensional range query in a sorted setting. This work attempts to sort the multi-dimensional ... the multi-dimensional space, but also keep the space locality. The sorted space gives multi-dimensional queries a similar sorted setting. As a consequence, a multi-dimensional range query can be processed as efficiently as a one-dimensional range query in the sorted setting.

4. Existing methods process one-dimensional range queries in a “flooding” manner. Our method processes multi-dimensional ...

... proposed multi-dimensional range query processing model for P2P networks. The work reported in this thesis is similar in spirit to that of Harren et al. [6], in that we are interested in supporting database query processing over P2P networks. Our contention is that, in order to support complex queries in the distributed context of peer-to-peer systems, we need to extend the current P2P exact name lookups to range ...
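The idea in point 3, sorting a multi-dimensional space while keeping locality, can be sketched with the standard two-dimensional Hilbert index computation. This is the widely known textbook algorithm, not the thesis's own mapping code, which is not part of the text above; the names `hilbert_rotate` and `hilbert_xy2d` are chosen here for illustration, and the sketch is limited to two dimensions although the thesis works with more.

```c
#include <assert.h>

/* Rotate or flip a quadrant so that sub-curves join up end to end. */
void hilbert_rotate(int n, int *x, int *y, int rx, int ry)
{
    if (ry == 0) {
        if (rx == 1) {
            *x = n - 1 - *x;
            *y = n - 1 - *y;
        }
        int t = *x;   /* swap x and y */
        *x = *y;
        *y = t;
    }
}

/* Map cell (x, y) of an n-by-n grid, n a power of two, to its position
   along the Hilbert curve (0 .. n*n - 1). */
int hilbert_xy2d(int n, int x, int y)
{
    int d = 0;
    assert(n > 0 && (n & (n - 1)) == 0);
    assert(x >= 0 && x < n && y >= 0 && y < n);
    for (int s = n / 2; s > 0; s /= 2) {
        int rx = (x & s) > 0;          /* which half in x at this level */
        int ry = (y & s) > 0;          /* which half in y at this level */
        d += s * s * ((3 * rx) ^ ry);  /* offset of the quadrant entered */
        hilbert_rotate(n, &x, &y, rx, ry);
    }
    return d;
}
```

On a 4-by-4 grid the cells (0,0), (1,0), (1,1), (0,1) receive the indices 0, 1, 2, 3, so cells that are consecutive in the sequence are also neighbours in space. That locality is what lets a sorted one-dimensional sequence stand in for the multi-dimensional space.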
... and their values form one relational table. Such a relational table is distributed across the peers; that is, each peer stores one horizontal partition of the table. In our example, the music file is described by the schema (Music_id, Music_name, Author, Orchestra, Year), shown in Table 1.1. In DHT-based P2P systems, such a relational table will be distributed into the system based on one attribute ...

... current query processing functionality of P2P systems, new indexing functionality is needed that 1) supports multiple attributes, 2) supports range queries, and 3) avoids the expensive join operation. Our method of multi-dimensional range query processing works towards providing these key functionalities and towards supporting more general complex queries.

1.3.2 Multi-dimensional range query ...

... Without range query processing functionality, it is difficult to support general relational queries over P2P systems. Gupta et al. [11] take an initial step towards enabling general query processing over the P2P data sharing paradigm. They try to solve the range query problem approximately: approximate answers to a given range query are provided by using Locality Sensitive Hashing functions for relevant range ...

... to exploit the processing power of range queries in DHT-based P2P architectures in a multi-dimensional setting. In the proposed mechanism, a multi-dimensional range query is evaluated in a deterministic and complete manner. This is important for supporting more general and complex queries in P2P. As shown in Query 1.5, in existing DHT P2P query methods such as the work of Harren et al., a common range query ...
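The excerpts above do not spell out how a range is evaluated deterministically and completely, so the following is only one plausible reading, stated under explicit assumptions: each peer owns a contiguous interval of the sorted sequence, the owner of the lower bound is found by an ordinary lookup, and the query is then handed from one owner to the next until the upper bound is covered. The array `zone_start` and the functions `find_owner` and `relay_range` are hypothetical names. A multi-dimensional box generally maps to several such intervals; the sketch handles a single interval.

```c
#include <assert.h>

/* zone_start[i] is the first sequence number owned by peer i (assumption);
   peer i owns everything up to zone_start[i + 1] - 1, and the last peer
   owns the rest. Returns the index of the peer that owns key. */
int find_owner(const long *zone_start, int nzones, long key)
{
    int lo = 0, hi = nzones - 1;
    while (lo < hi) {
        int mid = lo + (hi - lo + 1) / 2;   /* upper middle avoids a stall */
        if (zone_start[mid] <= key)
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo;
}

/* Record, in order, the peers that must see the interval [lo, hi].
   visited must have room for nzones entries; returns how many were used. */
int relay_range(const long *zone_start, int nzones, long lo, long hi,
                int *visited)
{
    int count = 0;
    assert(lo <= hi && zone_start[0] <= lo);
    int peer = find_owner(zone_start, nzones, lo);   /* exact-match lookup */
    visited[count++] = peer;
    while (peer + 1 < nzones && zone_start[peer + 1] <= hi) {
        peer++;                                      /* hand over to successor */
        visited[count++] = peer;
    }
    return count;
}
```

With four peers whose zones start at 0, 16, 32 and 48, the interval [20, 40] is answered by peers 1 and 2 only. Every peer that can hold a matching item is visited and no other peer is contacted, which is one way to obtain a complete answer without flooding.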
... step towards the long-term plan of marrying the powerful query facilities of traditional databases with P2P networks: multi-dimensional range query evaluation in DHT-based P2P systems. This chapter is organized as follows: first we briefly introduce DHT-based P2P designs; next, we briefly overview the need for building complex query facilities over DHT-based P2P systems; then, we explore the motivation to ...

... idea of Distributed Hash Table impoverishes the query facility. A distributed hash table is essentially a decentralized and distributed hash table. The most notable functionality of a hash table is ...