An Improvement Solution For Multiple Attribute Information Searching Based On Structured P2P Networks.pdf

DOCUMENT INFORMATION

Basic information

Format
Pages	44
Size	27.36 MB

Content


An Improvement Solution for Multiple Attribute Information Searching Based On Structured P2P Networks

Nguyen Thanh Dat
Faculty of Information Technology
Hanoi University of Engineering and Technology
Vietnam National University, Hanoi

Supervised by Doctor Nguyen Hoai Son

A thesis submitted in fulfillment of the requirements for the degree of Master of Information Technology

December, 2009

Table of Contents

[...]
1.3 P2P network models
    1.3.1 P2P Network
    1.3.2 P2P Network Models
        1.3.2.1 Unstructured P2P Network
        1.3.2.2 Hybrid P2P Network
        1.3.2.3 Structured P2P Network
1.4 DHT-based Protocol
    1.4.1 Distributed Hash Table - DHT
    1.4.2 CHORD Protocol
        1.4.2.1 Topology
        1.4.2.2 Lookup and Insert
        1.4.2.3 Join and Leave
        1.4.2.4 Stabilization and Failure
1.5 Summary

2 Related Works
2.1 INS/Twine: Information distribution based on Attribute-Value trees
    2.1.1 Solution
    2.1.2 System architecture
    2.1.3 System architecture
    2.1.4 Summary
2.2 CDS: Information Distribution Based On Load Balancing Matrix
    2.2.1 Solution
    2.2.2 System architecture
        2.2.2.1 Registering a content name
    2.2.3 System architecture
        2.2.3.1 Query resolution
        2.2.3.2 Summary
    2.2.4 Load Balancing Matrix (LBM)
        2.2.4.1 The structure of LBM
    2.2.5 System architecture
    2.2.5 LBM management mechanism
    2.2.6 Summary
2.3 Data Indexing
    2.3.1 Solution
    2.3.2 Insert a file
    2.3.3 Lookups
    2.3.4 Summary
2.4 SMAV: Searching - Multiple-attribute Value
    2.4.1 Solution
    2.4.2 Distribution of information content
    2.4.3 Information content name query
    2.4.4 Summary

3 An Improvement Solution for Multiple-attribute Information Searching on Structured P2P Network
3.1 Idea
3.2 Three Levels Mapping Model
    3.2.1 Overview
    3.2.2 Three-Levels Sub-key Mapping Storing
    3.2.3 Distribution of information content
    3.2.4 Information query
    3.2.5 Summary
3.3 The Dynamic Threshold Values
    3.3.1 A formula of threshold values
    3.3.2 Adjusted Distribution Algorithm
    3.3.3 Updating Threshold Value Periodically
    3.3.4 Adjusted Lookup Algorithm
3.4 Summary

4 Simulations and Evaluations
4.1 Qualitative Evaluations
4.2 Simulation Description
4.3 Evaluation Based On Simulations
    4.3.1 Load balancing
    4.3.2 Distribution of information content
    4.3.3 Routing Performance

5 Conclusions and Future Work

List of Figures

1.1 Client/Server network model
1.2 Peer-To-Peer network model with peers
1.3 Locating resources in a Gnutella-like P2P environment
1.4 An example of Hybrid P2P Model
1.5 Distribution data progress based on DHT
1.6 Chord's key space with 23 points
1.7 Chord's components
1.8 Lookup progress of Chord's protocol with key
1.9 Joining phase of a node in CHORD protocol
2.1 Meta string and AVTree
2.2 Architecture of INS/Twine System
2.3 Splitting a resource description into strands
2.4 The architecture of CDS system
2.5 An example of distribution of AVs in nodes
2.6 The structure of Load Balancing Matrix for {a_i, v_i}
2.7 An example of described data
2.8 Sample File Queries
2.9 Mappings between queries
2.10 Query mapping for three descriptors
2.11 An example of mappings tree
2.12 An example of a path of queries
2.13 Key - sub-key mappings
3.1 The number of hop levels of a content name in pure SMAV
3.2 Mappings are created from key k_i
3.3 The generation of distributed keys from a content name
3.4 Block diagram of query progress of improved SMAV
3.5 An example of query progress with common keys
3.6 Combining of common keys and an uncommon key
4.1 The distribution of AV pairs in content names
4.2 Number of information contents stored in each of 5000 nodes
4.3 The number of queries processed by each of 5000 nodes
4.4 Load balancing among nodes
4.5 Mappings stored in every node
4.6 Mappings created by CNs in (a) DSMAV and (b) SMAV
4.7 The number of keys stored in every node
4.8 Level-k sub-keys created by three solutions
4.9 Logical hop count required for each query
4.10 The maximum number of hop levels of three solutions
4.11 The number of successful queries

List of Tables

1.1 Comparison: Client/Server vs P2P
1.2 Definition of variables for node n using m-bit identifiers
2.1 Mapping table between distributed key and content names
2.2 Mapping table between distributed key and sub-keys
2.3 Mapping table between distributed key and uncommon keys
3.1 Mapping table between distributed key and content names
3.2 Mapping table between level-2 sub-keys and distributed keys
3.3 Mapping table between keys and contents

Abstract

Conventional information search engines such as Google, Yahoo, and Wikipedia support only keyword-based searching on websites. They cannot search for information in other kinds of resources, such as personal devices (laptops, PDAs, cell phones) or files shared in a P2P network. Meanwhile, DHT-based P2P networks such as Chord, CAN, and Pastry can perform exact queries (i.e., queries of an exact key) with scalability, efficiency, and fault tolerance. However, for complex queries such as range queries or multiple-attribute queries, a pure DHT is not efficient, since many query messages must be sent. In this thesis, we focus on multiple-attribute queries on DHT-based P2P networks. The main problem here is the load imbalance among nodes due to the appearance
of common attribute/value pairs (AV pairs) in content names. The main idea of our method is to limit the number of content items assigned to an ID to the threshold value of each node, by creating sub-IDs from multiple AV pairs when those AV pairs appear in many content names. To reduce query cost, our system also keeps the mapping between an ID and its sub-IDs, if any exist, in the node responsible for the ID. Moreover, we store on nodes only the mappings that are created during the distribution process. Our method can achieve both efficiency and a good degree of load balancing even when the distribution of AV pairs is skewed. Our simulation results show the efficiency of our solution with respect to lookup time and the degree of load balancing.

Chapter 1

Introduction

1.1 Overview and Motivation

With the unprecedented growth of information technology, information now appears everywhere. It can be found in many kinds of resources, such as personal devices (laptops, PDAs, cell phones), websites on the Internet, and files shared in P2P networks. With this explosion of information, the demand for information searching keeps growing. Every day we need a lot of information to communicate and work efficiently and easily. For instance, we search for weather forecasts before a trip or a picnic. We also search for the latest news of the day, references for a product to buy, market prices, and so on. In many cases, if we obtain the desired information quickly and accurately, we have a better chance of success in communication and work. Therefore, information searching is a necessary demand in today's information age. The emergence of new applications and services will require an efficient information searching system that can realize complex queries on content names in a scalable manner (W. Adjie-Winoto & Lilley, 1999; Carzaniga & Wolf, 2001; Foster & Tuecke, 2002).

Many large systems allow searching
information, such as the conventional search engines and sites Google, Yahoo, Amazon, eBay, and Wikipedia. The Google engine allows users to search for information on the Internet based on keywords. It can link to billions of websites. The information of each website is described by keywords, which are then processed and stored on Google's servers.

Conventional search systems often use the Client/Server model, in which servers provide search services to clients. However, the Client/Server model has some disadvantages. Firstly, its scalability is limited. Servers are expensive because they need very large processing and storage capacity. Secondly, each server may be a single point of failure: when a server goes down, operations cease. Moreover, as the number of simultaneous client requests to a given server increases, the server can become overloaded. When a large number of clients join the network, traffic congestion on the network also becomes an issue.

Recently, the Peer-to-Peer (P2P) network model has attracted a lot of interest. P2P systems, with their decentralized control, self-organization, and adaptation, have emerged as a significant social and technical phenomenon over the last few years. Unlike the Client/Server model, P2P networks aim to aggregate large numbers of computers that join and leave the network frequently. In pure P2P systems, individual computers communicate directly with each other and share information and resources without using dedicated servers. For example, they provide infrastructure for communities that share CPU cycles (e.g., SETI@Home, Entropia) and/or storage space (e.g., Napster (Idit Keidar, 2006; Napster, 1999), FreeNet, Gnutella (Gnutella, 1999)), or that support collaborative environments (Groove). In P2P networks, all clients provide resources, including bandwidth, storage space, and computing power. As more and more nodes join the system, the total capacity of
the system also increases. This is not true of the Client/Server network model with a fixed set of servers, in which adding more clients could mean slower data transfer for all users. The distributed nature of P2P networks also increases robustness in case of failures, by replicating data over multiple peers and by enabling peers to find the data without relying on a centralized index server. In the latter case, there is no single point of failure in the system.

Information searching on P2P networks has received attention in recent years. The advantages of the P2P network model allow us to construct information searching systems that are scalable and fault-tolerant, because the data of the whole system is distributed over all nodes; each node is responsible for a portion of the data and takes part in the search process. The Gnutella network (Gnutella, 1999) supports sharing and searching files. It searches for data by flooding messages to the whole network. Nevertheless, the Gnutella network incurs high overhead, and a search may fail because a query may not be routed to the node responsible for the desired information. Hence, it searches for information inefficiently. eDonkey (Weikum, 2002) network [...]

1.5 Summary

[...] based P2P network, the system can guarantee efficiency, fault tolerance, and load balancing. However, with the pure DHT-based P2P model, a distributed information searching system needs to tackle some problems caused by popular data, that is, data that appears with high frequency. Popular data causes load imbalance among the nodes of the network. Moreover, pure DHT-based protocols only support exact searching. This means that to look up an information content item described by a key, we must query with the whole description of the data to obtain the desired data.

Some proposed solutions, which support multiple-attribute information searching with partial queries, may tackle the above problems. A few of them allow efficient searching while others
support load balancing of the network. These solutions use DHT-based P2P protocols such as CAN, PASTRY, and CHORD. In the next sections, we present some proposed solutions for multiple-attribute information searching on Structured P2P networks.

Chapter 2

Related Works

2.1 INS/Twine: Information distribution based on Attribute-Value trees

This section presents the INS/Twine solution for multiple-attribute information searching on DHT-based P2P networks. The solution shows data organization and distribution based on attribute-value trees. The subsections describe the components whose interaction implements the proposed solution. Finally, we present a summary evaluation of the INS/Twine system.

2.1.1 Solution

INS/Twine (M. Balazinska, 2002) proposed an approach to resource discovery that achieves scalability via hash-based partitioning of resource descriptions amongst a set of symmetric resolvers. The system works with arbitrary attribute sets. It handles queries based on orthogonal and hierarchical attributes, with no content or location constraints. It also handles partial queries, that is, queries that contain only a subset of the attributes originally advertised by resources. The INS system maps resources to resolvers by transforming descriptions into numeric keys in a manner that maintains their expressiveness, facilitates even data distribution, and enables efficient query resolution. Additionally, INS/Twine handles resource and resolver dynamism by treating all data as soft state.

By using the efficient distributed hash table process of protocols such as PASTRY, CAN, and CHORD, the INS/Twine system distributes all resources available to all users independent of location (IP address, application protocol, and port number). It transforms each resource description, which includes hierarchies of attribute-value pairs, into a set
of numeric keys. To do so, each unique subsequence of attributes and values, called a strand, is extracted from the description for use in querying resources. Twine then computes a hash value for each strand, which yields the numeric keys. The goal of INS/Twine is to describe resources and queries in a canonical form: an attribute-value tree (AVTree). A query description can therefore be compared with the original description, with zero or more attribute-value pairs truncated. Figure 2.1 shows an example of a resource description in the INS/Twine system, using an AVTree, which represents resources that can be annotated with meta-data descriptions.

[Figure 2.1: Meta string and AVTree]

In Figure 2.1, the resource r is described as a Meta string, which corresponds to an AVTree. The resource description would then match the queries:

q1: <res>camera<man>ACompany</man></res>
q2: <res>camera</res>

By splitting a resource description into subsequences of attributes and values, many queries can be matched with a resource by comparing AVTrees. This approach supports partial queries. It therefore also supports approximate queries, instead of only complete queries that specify the exact resource descriptions. Furthermore, this model allows more flexible queries by separating string values into several attribute-value pairs. For example, <model>CompaqPresarioCQ40</model> could be divided into <modelw>Compaq</modelw>, <modelw>Presario</modelw>, and <modelw>CQ40</modelw>, allowing queries of the type <modelw>CQ40</modelw>.

2.1.2 System architecture

[Figure 2.2: Architecture of INS/Twine System]
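
The strand extraction and hashing described in Section 2.1.1 can be illustrated with a minimal Python sketch. It is not the INS/Twine implementation: the tuple-based tree, the helper names strands and strand_key, the rule that a strand is a root-to-node prefix ending at a value, the "-" separator, SHA-1, and the 32-bit key size are all assumptions made for illustration, and the example values only loosely follow the camera resource used in the queries above.

```python
import hashlib

# Hypothetical AVTree encoding: (attribute, value, list of child nodes).
tree = ("res", "camera", [
    ("man", "ACompany", []),
    ("model", "AModel", []),
    ("subject", "traffic", []),
])


def strands(node, prefix=()):
    """Return one strand per tree node: the attribute/value path from the root."""
    attribute, value, children = node
    path = prefix + (attribute, value)
    result = ["-".join(path)]
    for child in children:
        result.extend(strands(child, path))
    return result


def strand_key(strand, bits=32):
    """Hash a strand to a numeric key (SHA-1 and 32 bits are assumed choices)."""
    digest = hashlib.sha1(strand.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << bits)


resource_strands = strands(tree)
resource_keys = {strand_key(s) for s in resource_strands}
```

Under these assumptions, the partial query q2 (<res>camera</res>) yields the single strand res-camera, whose key is also one of the keys of the full description, so the query would reach a resolver that knows the resource.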
[...] distributed to a set of nodes, which are responsible for the AV pairs of a content name. This scheme achieves a degree of load balancing through the participation of many nodes. Furthermore, the node responsible for each AV pair is determined faster than in the INS/Twine solution, which spends more time extracting strands from an AV tree. However, some disadvantages remain that should be improved. The CDS solution creates more redundant information, in both data and queries. Several nodes store the same set of content names; these are the nodes that belong to the same partition. Sometimes this storage is unnecessary, because many queries are processed with only a subset of a partition. The size of the LBM increases rapidly if it is doubled on each expansion. CDS has to process more queries to obtain matching results. Some queries need to be processed by all the nodes in the matrix before results are received. So processing and management cost and search time may be problems for the system.

2.3 Data Indexing

Data indexing (Garces-Erice & Ross, 2004) works on top of an underlying DHT-based P2P data storage system and supports multiple-attribute information searching by creating multiple indexes, organized hierarchically, which permit users to locate data even using scarce information, although at the price of a higher lookup cost. The data itself is stored on only one (or a few) of the nodes and discovered based on the user's queries.

2.3.1 Solution

Data indexing was proposed as a distributed information discovery solution. It is also based on a Structured P2P architecture, with protocols such as Chord and CAN. This means that information contents are transformed into a set of keys by using a hash function. Each key is distributed to, or queried from, the node that takes responsibility for that key. Each
node manages a key space and determines other keys based on its neighbor nodes. Information distribution and searching usually do not exceed a limited number of hops. Content names in the Data indexing system are files, which carry useful attributes such as author, title, conf, year, size, etc. The Data indexing solution overcomes the limitation of DHT-based systems by creating multiple indexes, organized hierarchically, which permit users to locate data even using scarce information, although at the price of a higher lookup cost. The data itself is stored on only one (or a few) of the nodes. It also allows incomplete queries, and supports users with intermediate queries returned by the system.

In the Data Indexing solution, content names (data items) are files, which are identified by descriptors, which are semi-structured XML data (Figure 2.7). Descriptors are indexed, stored, and distributed among nodes. The solution also treats XML documents as a tree of nodes and offers an expressive way to specify and select parts of this tree. Moreover, it allows the use of wildcard (*) and ancestor/descendant (//) operators in queries (Figure 2.8).

[Figure 2.7: An example of described data. Three XML <article> descriptors, each with author (first, last), title, conf, year, and size fields.]

Figure 2.8: Sample File Queries
q1 = /article[author[first/John][last/Smith]][title/TCP][conf/SIGCOMM][year/...][size/315635]
q2 = /article[author[first/John][last/Smith]][conf/INFOCOM]
q3 = /article/author[first/John][last/Smith]
q4 = /article/title/TCP
q5 = /article/conf/INFOCOM
q6 = /article/author/last/Smith

From a descriptor d, the system creates a query q that contains all the AV pairs of d and nothing else. Then q is called d's most specific query. Data indexing defines the following concepts: [...]

If query q0 is the most specific query for a file f, the file f is returned from node n. If not, node n returns a list of queries {q1, q2, ..., qn} for which mappings (q0; qi) exist, where q0 contains qi. The user can then choose one or more queries qi and repeat this process recursively until the desired files have been found. A path of queries is created in the search process from [...]
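
The recursive lookup described above can be sketched in Python as follows. This is only an illustration under stated assumptions: a pair of dictionaries stands in for the DHT nodes, SHA-1 is an assumed hash function, ToyIndex, insert_file, and lookup are hypothetical names, the file name tcp.pdf is invented, the most specific query is abbreviated (the year field is left out), and the sketch follows every returned query automatically, whereas in the text the user chooses which queries qi to follow.

```python
import hashlib


def key(query: str) -> str:
    """Map a query string to a DHT key (SHA-1 is an assumed choice)."""
    return hashlib.sha1(query.encode("utf-8")).hexdigest()


class ToyIndex:
    """Single-process stand-in for the DHT: dictionaries replace the nodes."""

    def __init__(self):
        self.files = {}     # key(most specific query) -> file name
        self.mappings = {}  # key(q0) -> set of more specific queries qi

    def insert_file(self, file_name, query_path):
        """Index a file along a path of queries.

        query_path runs from the most general query to the file's most
        specific query, which must be its last element.
        """
        self.files[key(query_path[-1])] = file_name
        for general, specific in zip(query_path, query_path[1:]):
            self.mappings.setdefault(key(general), set()).add(specific)

    def lookup(self, query):
        """Return the sorted names of all files reachable from the query.

        Assumes the stored mappings contain no cycles.
        """
        k = key(query)
        if k in self.files:  # the query is some file's most specific query
            return [self.files[k]]
        found = set()
        for more_specific in self.mappings.get(k, ()):
            found.update(self.lookup(more_specific))  # repeat recursively
        return sorted(found)


q6 = "/article/author/last/Smith"
q3 = "/article/author[first/John][last/Smith]"
q1 = ("/article[author[first/John][last/Smith]]"
      "[title/TCP][conf/SIGCOMM][size/315635]")

index = ToyIndex()
index.insert_file("tcp.pdf", [q6, q3, q1])
results = index.lookup(q6)  # follows q6 -> q3 -> q1
```

Here a lookup that starts from the general query q6 passes through q3 before reaching the most specific query q1, which is the kind of query path the text refers to; a query with no stored mapping simply returns an empty list.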

Date posted: 03/02/2023, 19:38