Research on node ranking in peer-to-peer networks


Hoàng Cường


UNDERGRADUATE GRADUATION THESIS (FULL-TIME PROGRAM)

Major: Information Technology

HANOI - 2010


ABSTRACT

This paper defines and describes a fully distributed node ranking algorithm for peer-to-peer (P2P) systems. The research puts forward a new approach for ranking nodes over peer-to-peer networks, synthesizing existing foundations and proposing a new method that is feasible for such networks. Integrating this algorithm into P2P keyword search can produce a substantial benefit, both in effectiveness for users and in reduced network traffic. The incremental search algorithm provided approximately a ten-fold reduction in network traffic for two-word and three-word queries.


1.1.1 Peer to Peer overview
1.1.2 Architecture of Peer to Peer Systems
1.1.3 Distributed hash tables
1.2 Ranking in Peer to Peer networks
1.2.1 Introduction
1.2.2 Ranking Roles
1.2.3 Research’s important objects
Chapter 2: Ranking on DHT Peer to Peer Networks
3.2.1 Major problems to exploit
3.2.2 Ranking Idea
Chapter 4: Ranking on Details
4.1 Ranking algorithm
4.2 Ranking’s features
Chapter 5: Evaluation
Chapter 6: Related Work
Chapter 7: Contributions and future work
References


List of Images

Image 1.1.1: Peer to Peer means connected together
Image 1.1.3: Distributed hash tables example
Image 1.2.1: System must have the ranking engine to find the one
Image 2.1: A 16-node Chord network example
Image 2.2.2: How PageRank works
Image 2.3: Distributed Nodes Graph example
Image 3.2.1: Google almost is not exact
Image 3.2.2: Intersect Idea
Image 3.4: Factor Percent
Image 4.1: Bandwidth is the key of ranking trusted
Image 4.1.2: Example of sub-graph semantic rank
Fig 4: A global graph of both local nodes and external nodes
Fig 5: An external local graph without a strategy
Fig 6: An external local graph
Image 4.2: Eigenvalue
Image 4.2.3: Random walk
Image 4.2.4: (n+1) graph nodes
Image 4.2.5: Graph example - 6 nodes
Image 4.2.6: Multiplication result example
Image 4.2.7: Multiplication result example at iterations

Table 3.2.4: HITS convergence (takes more time than PageRank)
Table 5.1: Number of iterations until convergence


Peer-to-peer was popularized by file sharing systems like Napster. File sharing is the practice of distributing or providing access to digitally stored information, such as computer programs, multimedia (audio, video), documents, or electronic books. It may be implemented through a variety of storage, transmission, and distribution models. Common methods of file sharing include manual sharing using removable media, centralized file server installations on computer networks, World Wide Web-based hyperlinked documents, and distributed peer-to-peer networking.

1.1.1 Peer to Peer overview

In its simplest form, a peer-to-peer (P2P) network is created when two or more PCs are connected and share resources without going through a separate server. A P2P network can be an ad hoc connection, such as a couple of computers connected via a Universal Serial Bus to transfer files. A P2P network can also be a permanent infrastructure that links a half-dozen computers in a small office over copper wires. Or a P2P network can be a network on a much grander scale, in which special protocols and applications set up direct relationships among users over the Internet.

The initial use of P2P networks in business followed the deployment of free-standing PCs in the early 1980s. The mini-mainframes of the day (e.g. Fujitsu/ICL, IBM AS/400, IBM Mainframe, Unisys), which were used by over 16,000 organizations, served up word processing and other applications to dumb terminals from a central computer and stored files on a central hard drive. In contrast, the then-new PCs had self-contained hard drives and built-in CPUs. These smart boxes also had onboard applications, which meant they could be deployed to desktops and be useful without an umbilical cord linking them to a mainframe.

Shared file and printer access within a local area network may be based either on a centralized file server or print server, sometimes called the client-server paradigm, or on a decentralized model, called a peer-to-peer network topology or workgroup. In client-server communication, a client process on the local user's computer takes the initiative to start the communication, while a server process on the remote file server or print server passively waits for requests to start a communication session. In a peer-to-peer network, any computer can be a server as well as a client, which makes the model flexible and efficient.

In effect, every connected PC is at once a server and a client. There is no special network operating system residing on a robust machine that supports special server-side applications such as directory services (specialized databases that control who has access to what).

Image 1.1.1 : “Peer to Peer” means “connected together”


Unlike client-server networks, where network information is stored on a centralized file server PC and made available to tens, hundreds, or thousands of client PCs, the information stored across peer-to-peer networks is decentralized. Because peer-to-peer PCs have their own hard disk drives that are accessible by all computers, each PC acts as both a client (information requestor) and a server (information provider). In the diagram, three peer-to-peer workstations are shown. Although not capable of handling the same amount of information flow that a client-server network might, all three computers can communicate directly with each other and share one another's resources.

1.1.2 Architecture of P2P systems

Peer-to-Peer architecture distinguishes itself by its distribution of power and function. Rather than concentrating power in the server, Peer-to-Peer models rely on the power and bandwidth of participants. They form ad hoc connections between nodes for sharing all kinds of information and files. Peer-to-Peer discards the hierarchical notion of clients and servers (clients at the top, servers on the bottom) and replaces it with equal peer nodes that function simultaneously as clients and servers. This also discards the idea of a central server, which exists in Client-Server architecture.

There are several classifications of Peer-to-Peer networks, including pure/hybrid and structured/unstructured. Pure P2P networks merge the roles of clients and servers as equals and do not provide a central server for managing the network or a central router that forwards requests to other networks.

Hybrid P2P models, on the other hand, do contain a central server that stores information about peers and responds to requests for that information. In this configuration, peers host the available resources, since the central server does not provide this function. Peers also make the central server aware of what resources they want to share and make those resources available to peers that request them. Route terminals are used as addresses and are referenced through a set of indices to obtain an absolute address.

The structure of a P2P network is determined by the nature of its overlay network, which consists of all participating peers as equal nodes. Nodes in an overlay network are connected through virtual or logical links, each of which corresponds to a path in the underlying network.

Essentially, overlay networks are networks built on top of other networks. Peer-to-Peer networks are considered overlay networks because they are usually built on top of the Internet. Structured P2P networks use a global protocol so that searches can be routed to any peer by any peer on the network.

To retrieve rare files, more structured overlay links are required. The most common structured P2P network is the distributed hash table (DHT). DHTs are decentralized distributed systems that store names and values. Any participating node in the network can look up and retrieve values. Maintenance of the DHT mapping is distributed among the nodes. The ownership of each file is assigned to a peer, but the addition or deletion of peers or files does not cause major disruption. This makes DHTs very scalable.

Unstructured P2P networks establish links more arbitrarily. To join, a peer only has to copy the links of an existing node and then add its own links as it develops. To find a desired file, however, the request must be flooded through the network. Flooding does not always return the desired result if the requested file is rare, because there is no correlation between a peer and the content it holds. Flooding also increases network traffic, slowing down responses and file sharing.
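The cost of flooding can be illustrated with a minimal sketch. The five-peer topology, the file name `rare.mp3`, and the hop limit `ttl` below are assumptions made for illustration only; real unstructured networks differ in how they bound and deduplicate queries.

```python
from collections import deque

def flood_search(links, start, wanted, holders, ttl):
    """Flood a query from `start`; return (peers found holding `wanted`, messages sent)."""
    visited = {start}
    frontier = deque([(start, ttl)])
    found, messages = set(), 0
    while frontier:
        peer, hops_left = frontier.popleft()
        if wanted in holders.get(peer, set()):
            found.add(peer)
        if hops_left == 0:
            continue  # the query is not forwarded any further
        for neighbour in links[peer]:
            messages += 1  # every forwarded copy costs one message
            if neighbour not in visited:
                visited.add(neighbour)
                frontier.append((neighbour, hops_left - 1))
    return found, messages

# Hypothetical overlay: only peer "e" holds the rare file.
links = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}
holders = {"e": {"rare.mp3"}}

print(flood_search(links, "a", "rare.mp3", holders, ttl=2))
print(flood_search(links, "a", "rare.mp3", holders, ttl=3))
```

With a hop limit of 2 the query never reaches the only holder, while raising the limit to 3 finds it at the price of more messages, which is the trade-off described above.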

The primary advantage of P2P networks is that all clients contribute their resources. These resources include computing power, bandwidth, and storage space. In traditional Client-Server models there is a fixed number of servers, so adding clients slows down network processing. In Peer-to-Peer models, as nodes are added, system resources increase (contributed by the added nodes) to accommodate demand.

In P2P models, updates must be applied and copied to peers across the network, which requires a lot of labor and is prone to errors. However, Client-Server paradigms often suffer from network traffic congestion. This is less of a problem for P2P, since network resources grow in proportion to the number of peers in the network. Client-Server paradigms also lack the robustness of P2P networks. Robustness refers to a network's ability to bounce back or continue functioning if one of its components fails. If a server fails in a Client-Server model, the request cannot be completed. In P2P, a node can fail or abandon the request, and other nodes still have access to the resources needed to complete the download.


1.1.3 Distributed hash tables

Image 1.1.3: Distributed hash tables example

Distributed hash tables (DHTs) are a class of decentralized distributed systems that provide a lookup service similar to a hash table: (key, value) pairs are stored in the DHT, and any participating node can efficiently retrieve the value associated with a given key. Responsibility for maintaining the mapping from keys to values is distributed among the nodes in such a way that a change in the set of participants causes a minimal amount of disruption. This allows DHTs to scale to extremely large numbers of nodes and to handle continual node arrivals, departures, and failures.

DHTs form an infrastructure that can be used to build peer-to-peer networks. Notable distributed networks that use DHTs include BitTorrent's distributed tracker, the Kad network, the Storm botnet, YaCy, and the Coral Content Distribution Network.
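As a rough illustration of the (key, value) lookup service, the following single-process sketch places node names and keys on one identifier ring and stores each key at the first node at or after its identifier. The class `ToyDHT`, the 32-bit identifier size, and the node names are hypothetical; a real DHT also needs routing, replication, and failure handling.

```python
import hashlib
from bisect import bisect_left

def ident(text, bits=32):
    """Hash a string to a `bits`-bit identifier using SHA-1."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)

class ToyDHT:
    def __init__(self, node_names):
        self.ring = []    # sorted (identifier, node name) pairs
        self.store = {}   # node name -> local key/value table
        for name in node_names:
            self.join(name)

    def owner(self, key):
        """Node responsible for `key`: first node at or after the key's identifier."""
        ids = [node_id for node_id, _ in self.ring]
        pos = bisect_left(ids, ident(key)) % len(self.ring)
        return self.ring[pos][1]

    def join(self, name):
        """Add a node and hand over only the keys it is now responsible for."""
        self.ring = sorted(self.ring + [(ident(name), name)])
        self.store[name] = {}
        moved = 0
        for other, table in self.store.items():
            if other == name:
                continue
            for key in [k for k in table if self.owner(k) == name]:
                self.store[name][key] = table.pop(key)
                moved += 1
        return moved

    def put(self, key, value):
        self.store[self.owner(key)][key] = value

    def get(self, key):
        return self.store[self.owner(key)].get(key)

dht = ToyDHT(["node-a", "node-b", "node-c"])
for i in range(200):
    dht.put(f"file-{i}", i)
moved = dht.join("node-d")  # only keys in node-d's arc change owner
```

After the join, every key is still retrievable and only the keys that now hash into the new node's arc have moved, which is the "minimal disruption" property described above.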


Image 1.2.1: The system must have a ranking engine to find the right one

We then propose several distributed page ranking algorithms, partially prove their convergence, and discuss some of their interesting properties. Indirect transmission is introduced to reduce the communication overhead between page rankers and to make communication scalable. The relationship between convergence time and consumed bandwidth is also discussed. Finally, we verify some of these discussions through experiments on a real data set.

1.2.2 Ranking Roles

Link-structure-based page ranking for determining the "importance" of web pages has become an important technique in search engines. In particular, the HITS algorithm maintains a hub score and an authority score for each page, both computed from the linkage relationships between pages in the hyperlinked environment. The PageRank algorithm used by Google determines the "scores" of web pages by iteratively computing the eigenvector of a matrix.

As the size of the web grows, it becomes harder and harder for existing search engines to cover the entire web. We need distributed search engines that scale with the number of pages and the number of users. In a distributed search engine, page ranking is needed not only to improve query results, as in its centralized counterpart, but should also be performed in a distributed manner for scalability and availability. A straightforward way to achieve distributed page ranking is to apply HITS or PageRank directly in the distributed environment, but this is not a trivial thing to do. Both HITS and PageRank are iterative algorithms. Because each iteration step needs the results computed in the previous step, a synchronization operation is needed. However, synchronous communication is hard to achieve in a widely spread distributed environment. Moreover, page partitioning and communication overhead must be considered carefully when performing distributed page ranking.

Structured peer-to-peer overlay networks have recently gained popularity as a platform for constructing self-organized, resilient, large-scale distributed systems. In this paper, we try to perform effective page ranking on top of structured peer-to-peer networks. We first propose some distributed page ranking algorithms based on Google's PageRank and present some of their interesting properties and results. Because communication overhead matters more than CPU and memory usage in distributed page ranking, we then discuss page partitioning strategies and ideas for alleviating communication overhead. Through this work, our paper makes the following contributions:

• We provide two distributed page ranking algorithms, partially prove their convergence, and verify their properties using a real data set.

• We identify the major issues and problems related to distributed page ranking on top of structured P2P networks.

IDs and keys are assigned an m-bit identifier using consistent hashing. The SHA-1 algorithm is the base hash function for consistent hashing. Consistent hashing is integral to the robustness and performance of Chord because both keys and IDs (IP addresses) are uniformly distributed in the same identifier space. Consistent hashing is also necessary to let nodes join and leave the network without disruption.

Each node has a successor and a predecessor. The successor of a node (or key) is the next node in the identifier circle in the clockwise direction. The predecessor is the next node in the counter-clockwise direction. If there is a node for each possible ID, the successor of node 2 is node 3, and the predecessor of node 1 is node 0; normally, however, there are holes in the sequence. For example, the successor of node 153 may be node 167 (and nodes from 154 to 166 will not exist); in this case, the predecessor of node 167 will be node 153.

Since the successor (or predecessor) node may disappear from the network (because of failure or departure), each node records a whole segment of the circle adjacent to it, i.e. the r nodes preceding it and the r nodes following it. This list gives a high probability that a node can correctly locate its successor or predecessor, even if the network suffers from a high failure rate.
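The successor and predecessor relations on a ring with holes can be sketched as follows. The ring of five node identifiers and the helper names are assumptions for illustration; this is a global-view sketch, not Chord's distributed lookup protocol, in which each node knows only part of the ring.

```python
from bisect import bisect_left, bisect_right

def node_successor(node_ids, node):
    """Next live node strictly after `node`, clockwise (wraps around the circle)."""
    ring = sorted(node_ids)
    return ring[bisect_right(ring, node) % len(ring)]

def node_predecessor(node_ids, node):
    """Next live node strictly before `node`, counter-clockwise (wraps around)."""
    ring = sorted(node_ids)
    return ring[bisect_left(ring, node) - 1]  # index -1 wraps to the last node

def key_successor(node_ids, key):
    """Node responsible for `key`: first live node at or after the key."""
    ring = sorted(node_ids)
    return ring[bisect_left(ring, key) % len(ring)]

def successor_list(node_ids, node, r):
    """The r nodes following `node`, kept so lookups survive failures."""
    ring = sorted(node_ids)
    start = bisect_right(ring, node)
    return [ring[(start + k) % len(ring)] for k in range(r)]

ring = [0, 42, 153, 167, 240]   # hypothetical live nodes; 154..166 do not exist
print(node_successor(ring, 153))      # 167
print(node_predecessor(ring, 167))    # 153
print(key_successor(ring, 160))       # 167
print(successor_list(ring, 167, 2))   # [240, 0]
```

If node 240 fails, node 167 can fall back to the next entry in its successor list, which is the purpose of recording r neighbours rather than one.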

Image 2.1: A 16-node Chord network example

The Chord protocol is one solution for connecting the peers of a P2P network. Chord consistently maps a key onto a node. Both keys and nodes are assigned an m-bit identifier. For nodes, this identifier is a hash of the node's IP address. For keys, this identifier is a hash of a keyword, such as a file name. It is not uncommon to use the words "nodes" and "keys" to refer to these identifiers rather than to the actual nodes or keys. There are many other algorithms in use by P2P systems, but this is a simple and common approach.

2.2 PageRank

2.2.1 Description

PageRank is a link analysis algorithm, named after Google co-founder Larry Page and used by the Google Internet search engine, that assigns a numerical weight to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its relative importance within the set. The algorithm may be applied to any collection of entities with reciprocal quotations and references. The numerical weight that it assigns to any given element E is also called the PageRank of E and is denoted by PR(E).

2.2.2 Algorithm

PageRank is a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page. PageRank can be calculated for collections of documents of any size. Several research papers assume that the distribution is evenly divided among all documents in the collection at the beginning of the computational process. The PageRank computation requires several passes through the collection, called "iterations", to adjust the approximate PageRank values so that they more closely reflect the theoretical true value.

A probability is expressed as a numeric value between 0 and 1. A probability of 0.5 is commonly expressed as a "50% chance" of something happening. Hence, a PageRank of 0.5 means there is a 50% chance that a person clicking on a random link will be directed to the document with the 0.5 PageRank.

Simplified algorithm


Image 2.2.2: How PageRank Works

Assume a small universe of four web pages: A, B, C and D. The initial approximation of PageRank is divided evenly among these four documents. Hence, each document begins with an estimated PageRank of 0.25.

In the original form of PageRank, initial values were simply 1. This meant that the sum over all pages was the total number of pages on the web. Later versions of PageRank (see the formulas below) assume a probability distribution between 0 and 1. Here a simple probability distribution is used, hence the initial value of 0.25.

If pages B, C and D each link only to A, they each confer 0.25 PageRank to A. All PageRank PR() in this simplistic system therefore gathers at A, because all links point to A:

\[ PR(A) = PR(B) + PR(C) + PR(D) \]

This is 0.75.

Suppose instead that page B has a link to page C as well as to page A, while page D has links to all three pages. The value of a page's link-votes is divided among all of its outbound links. Thus, page B gives a vote worth 0.125 to page A and a vote worth 0.125 to page C. Only one third of D's PageRank is counted towards A's PageRank (approximately 0.083):

\[ PR(A) = \frac{PR(B)}{2} + \frac{PR(C)}{1} + \frac{PR(D)}{3} \]

In other words, the PageRank conferred by an outbound link is equal to the document's own PageRank score divided by the normalized number of outbound links L() (it is assumed that links to a specific URL count only once per document):

\[ PR(A) = \frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \frac{PR(D)}{L(D)} \]

In the general case, the PageRank value for any page u can be expressed as:

\[ PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)} \]

i.e. the PageRank value for a page u depends on the PageRank values of each page v in the set B_u (the set of all pages linking to page u), divided by the number L(v) of links from page v.
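The simplified rule can be checked numerically on the four-page example. The sketch below applies one pass of the formula; the link table is an assumption reconstructed from the example (B links to A and C, C links to A, D links to A, B and C, and A is taken to have no outbound links).

```python
def simplified_step(links, rank):
    """One pass of PR(u) = sum over v in B_u of PR(v) / L(v), without damping."""
    new_rank = {page: 0.0 for page in rank}
    for v, targets in links.items():
        for u in targets:
            new_rank[u] += rank[v] / len(targets)  # v splits its vote evenly
    return new_rank

# Assumed link structure for the four-page example.
links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
rank = {page: 0.25 for page in links}   # initial estimate: 1/4 each

after = simplified_step(links, rank)
print(round(after["A"], 3))   # 0.125 + 0.25 + 0.083 = 0.458
```

Page A collects 0.125 from B, 0.25 from C and about 0.083 from D, matching the vote values given in the text.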

Damping factor

The PageRank theory holds that even an imaginary surfer who is randomly clicking on links will eventually stop clicking. The probability, at any step, that the person will continue is the damping factor d. Various studies have tested different damping factors, but it is generally assumed that the damping factor is set to around 0.85.

The damping factor is subtracted from 1 (and in some variations of the algorithm, the result is divided by the number of documents N in the collection), and this term is then added to the product of the damping factor and the sum of the incoming PageRank scores. That is,

\[ PR(A) = \frac{1-d}{N} + d \left( \frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \frac{PR(D)}{L(D)} + \cdots \right) \]

So any page's PageRank is derived in large part from the PageRanks of other pages. The damping factor adjusts the derived value downward. The original paper, however, gave the following formula, which has led to some confusion:

\[ PR(A) = 1 - d + d \left( \frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \frac{PR(D)}{L(D)} + \cdots \right) \]

The difference between them is that the PageRank values in the first formula sum to one, while in the second formula each PageRank is multiplied by N and the sum becomes N. A statement in Page and Brin's paper that "the sum of all PageRanks is one", together with claims by other Google employees, supports the first variant of the formula above.

Google recalculates PageRank scores each time it crawls the Web and rebuilds its index. As Google increases the number of documents in its collection, the initial approximation of PageRank decreases for all documents.

The formula uses a model of a random surfer who gets bored after several clicks and switches to a random page. The PageRank value of a page reflects the chance that the random surfer will land on that page by clicking on a link. It can be understood as a Markov chain in which the states are pages and the transitions, all equally probable, are the links between pages.

If a page has no links to other pages, it becomes a sink and therefore terminates the random surfing process. However, the solution is quite simple. If the random surfer arrives at a sink page, it picks another URL at random and continues surfing again.

When calculating PageRank, pages with no outbound links are assumed to link out to all other pages in the collection. Their PageRank scores are therefore divided evenly among all other pages. In other words, to be fair to pages that are not sinks, these random transitions are added to all nodes in the Web, with a residual probability of usually d = 0.85, estimated from the frequency with which an average surfer uses his or her browser's bookmark feature.

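Therefore, the equation is as follows:

\[ PR(p_i) = \frac{1-d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)} \]

where p_1, p_2, ..., p_N are the pages under consideration, M(p_i) is the set of pages that link to p_i, L(p_j) is the number of outbound links on page p_j, and N is the total number of pages.

The PageRank values are the entries of the dominant eigenvector of the modified adjacency matrix. This makes PageRank a particularly elegant metric: the eigenvector is

\[ \mathbf{R} = \begin{bmatrix} PR(p_1) \\ PR(p_2) \\ \vdots \\ PR(p_N) \end{bmatrix} \]

where R is the solution of the equation

\[ \mathbf{R} = \begin{bmatrix} (1-d)/N \\ (1-d)/N \\ \vdots \\ (1-d)/N \end{bmatrix} + d \begin{bmatrix} \ell(p_1,p_1) & \ell(p_1,p_2) & \cdots & \ell(p_1,p_N) \\ \ell(p_2,p_1) & \ddots & & \vdots \\ \vdots & & \ell(p_i,p_j) & \\ \ell(p_N,p_1) & \cdots & & \ell(p_N,p_N) \end{bmatrix} \mathbf{R} \]

where the adjacency function \ell(p_i, p_j) is 0 if page p_j does not link to p_i, and is normalized such that, for each j,

\[ \sum_{i=1}^{N} \ell(p_i, p_j) = 1, \]

i.e. the elements of each column sum up to 1 (for more details see the computation section below). This is a variant of the eigenvector centrality measure used commonly in network analysis.

Because of the large eigengap of the modified adjacency matrix above, the values of the PageRank eigenvector are fast to approximate (only a few iterations are needed).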
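The damped formula can be iterated directly until the values stop changing. The following is a minimal sketch, not the thesis's own implementation: the function name, the tolerance, and the four-page link table are assumptions, and a sink's rank is spread over all N pages (including the sink itself), a common simplification of the "link to all other pages" rule.

```python
def pagerank(links, d=0.85, tol=1e-10, max_iter=200):
    """Iterate PR(p_i) = (1-d)/N + d * sum_{p_j in M(p_i)} PR(p_j)/L(p_j)."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # uniform initial estimate
    iterations = 0
    for iterations in range(1, max_iter + 1):
        # Assumption: rank held by sinks is redistributed evenly to all pages.
        sink_mass = sum(rank[p] for p in pages if not links[p])
        new_rank = {p: (1.0 - d) / n + d * sink_mass / n for p in pages}
        for v in pages:
            for u in links[v]:
                new_rank[u] += d * rank[v] / len(links[v])
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            break
    return rank, iterations

# Hypothetical four-page graph; A has no outbound links (a sink).
links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
ranks, iterations = pagerank(links)
print(ranks, iterations)
```

The values sum to one, as in the first variant of the formula, and page A, which every other page links to, ends up with the highest score.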


As a result of Markov theory, it can be shown that the PageRank of a page is the probability of being at that page after many clicks. This happens to equal t^{-1}, where t is the expected number of clicks (or random jumps) required to get from the page back to itself.

The main disadvantage is that it favors older pages, because a new page, even a very good one, will not have many links unless it is part of an existing site (a site being a densely connected set of pages, such as Wikipedia). The Google Directory (itself a derivative of the Open Directory Project) allows users to see results sorted by PageRank within categories. The Google Directory is the only service offered by Google in which PageRank directly determines display order. In Google's other search services (e.g. its main Web search), PageRank is used to weight the relevance scores of the pages shown in search results. Several strategies have been proposed to accelerate the computation of PageRank. Various strategies to manipulate PageRank have been employed in concerted efforts to improve search result rankings and to monetize advertising links.

These strategies have severely affected the reliability of the PageRank concept, which seeks to determine which documents are actually highly valued by the Web community. Google is known to penalize link farms and other schemes designed to inflate PageRank artificially. In December 2007 Google started actively penalizing sites selling paid text links. How Google identifies link farms and other PageRank manipulation tools is among Google's trade secrets.

2.3 Distributed Computing

Distributed computing is the field of computer science that studies distributed systems. A distributed system consists of multiple autonomous computers that communicate through a computer network. The computers interact with each other in order to achieve a common goal. A computer program that runs in a distributed system is called a distributed program, and distributed programming is the process of writing such programs. Distributed computing also refers to the use of distributed systems to solve computational problems.

In distributed computing, a problem is divided into many tasks, each of which is solved by one computer.


Image 2.3: Distributed Nodes Graph example

Many tasks that we would like to automate by using a computer are of the question-answer type: we would like to ask a question, and the computer should produce an answer. In theoretical computer science, such tasks are called computational problems. Formally, a computational problem consists of instances together with a solution for each instance. Instances are the questions that we can ask, and solutions are the desired answers to these questions.

Theoretical computer science seeks to understand which computational problems can be solved by using a computer (computability theory) and how efficiently (computational complexity theory). Traditionally, a problem is said to be solvable by a computer if we can design an algorithm that produces a correct solution for any given instance. Such an algorithm can be implemented as a computer program that runs on a general-purpose computer: the program reads a problem instance from the input, performs some computation, and produces the solution as output.

Formalisms such as random access machines or universal Turing machines can be used as abstract models of a sequential general-purpose computer executing such an algorithm. The field of concurrent and distributed computing studies similar questions for multiple computers, or for a computer that executes a network of interacting processes: which computational problems can be solved in such a network, and how efficiently? However, it is not at all obvious what "solving a problem" means in the case of a concurrent or distributed system.

2.4 Computing PageRank in a distributed system


Recent research on distributed systems discusses the case in which the web graph is partitioned into disjoint websites or domains. The web is modeled as many disjoint web servers. Hyperlinks in the web are divided into two categories: intra-server links and inter-server links. Intra-server links are links between pages on the same server, and they are used to compute a local PageRank vector on each server. Inter-server links are links between pages on different servers, and they are used to compute ServerRank, which measures the relative importance of the different web servers. Finally, the results from the many web servers are merged at the server that submitted the query to produce a ranked list of hyperlinks.

A ranking algebra has been proposed to deal with rankings at different levels of granularity; it can also be used to aggregate local rankings and site rankings into a global ranking. In work on approximating PageRank in a fully decentralized system, each peer is autonomous and peers may overlap. In the proposed JXP algorithm, each peer computes local PageRank scores, then meets other peers at random and gradually increases its knowledge of the global web graph by exchanging information, and then recomputes its local PageRank scores.

This process of meeting and recomputation is repeated until the peer has collected enough information. If peers meet and exchange information a sufficient number of times, the JXP scores converge to the true global PageRank scores. The algorithm assumes that the out-degree of each page in the global graph is known. However, all of these works focus on approximating scores over the global graph, whether in a centralized or a distributed system.

2.5 tf-idf

The tf-idf weight (term frequency-inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.

One of the simplest ranking functions is computed by summing the tf-idf of each query term; many more sophisticated ranking functions are variants of this simple model.

Motivation

Suppose we have a set of English text documents and wish to determine which document is most relevant to the query "the brown cow". A simple way to start is to eliminate documents that do not contain all three words "the", "brown", and "cow", but this still leaves many documents. To distinguish them further, we might count the number of times each term occurs in each document and sum the counts; the number of times a term occurs in a document is called its term frequency.

However, because the term "the" is so common, this tends to incorrectly emphasize documents that happen to use the word "the" more often, without giving enough weight to the more meaningful terms "brown" and "cow". The term "the" is not a good keyword for distinguishing relevant from non-relevant documents, whereas rarely occurring terms like "brown" and "cow" are. Hence an inverse document frequency factor is incorporated, which diminishes the weight of terms that occur very frequently in the collection and increases the weight of terms that occur rarely.

Mathematical details

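The term count in the given document is simply the number of times a given term appears in that document. This count is usually normalized to prevent a bias towards longer documents (which may have a higher term count regardless of the actual importance of that term in the document) to give a measure of the importance of the term $t_i$ within the particular document $d_j$. Thus we have the term frequency,

$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

where $n_{i,j}$ is the number of occurrences of the term $t_i$ in document $d_j$, and the denominator is the total number of occurrences of all terms in $d_j$.

The inverse document frequency measures the general importance of the term across the collection: divide the total number of documents by the number of documents containing the term, and take the logarithm of the quotient,

$$\mathrm{idf}_i = \log \frac{|D|}{|\{d : t_i \in d\}|}$$

with

• $|D|$: total number of documents in the corpus

• $|\{d : t_i \in d\}|$: number of documents where the term $t_i$ appears (that is, $n_{i,j} \neq 0$). If the term is not in the corpus, this will lead to a division by zero. It is therefore common to use $1 + |\{d : t_i \in d\}|$ instead.

Then

$$(\mathrm{tf\text{-}idf})_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$$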

A high tf–idf weight is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. The tf–idf value for a term is always greater than or equal to zero, provided the idf is computed from the plain document count, which can never exceed |D|.
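The simple ranking function mentioned earlier (summing the tf–idf of each query term) can be sketched as follows. This is only an illustration under stated assumptions: the four-document corpus, the function names, and the choice to score a term that is absent from the corpus as zero are all hypothetical and are not taken from the thesis. The sketch uses a base-10 logarithm and whitespace tokenization.

```python
import math
from collections import Counter


def term_frequency(term, tokens):
    """Occurrences of term divided by the number of tokens in the document."""
    return Counter(tokens)[term] / len(tokens) if tokens else 0.0


def inverse_document_frequency(term, corpus):
    """log10(|D| / number of documents containing term)."""
    containing = sum(1 for tokens in corpus if term in tokens)
    if containing == 0:
        return 0.0  # assumption: a term absent from the corpus adds nothing
    return math.log10(len(corpus) / containing)


def tf_idf_score(query_terms, tokens, corpus):
    """Simple ranking function: sum of tf-idf over the query terms."""
    return sum(term_frequency(t, tokens) * inverse_document_frequency(t, corpus)
               for t in query_terms)


def rank_documents(query, corpus):
    """Document indices ordered from highest to lowest summed tf-idf."""
    terms = query.lower().split()
    return sorted(range(len(corpus)),
                  key=lambda i: (-tf_idf_score(terms, corpus[i], corpus), i))


# Hypothetical four-document corpus, already lower-cased and tokenized.
corpus = [
    "the the the the the cat".split(),
    "the brown cow eats grass".split(),
    "the dog sleeps".split(),
    "a brown cow and a brown calf".split(),
]
query = "the brown cow"

# Ranking by summed raw term frequency, for comparison.
by_raw_tf = sorted(range(len(corpus)),
                   key=lambda i: (-sum(term_frequency(t, corpus[i])
                                       for t in query.split()), i))
by_tf_idf = rank_documents(query, corpus)
```

On this toy corpus, summing raw term frequencies puts the document that merely repeats "the" first, while the tf–idf sum puts the two documents about a brown cow first, which is the effect described in the Motivation paragraphs.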

Example

Consider a document containing 100 words wherein the word cow appears 3 times. Following the previously defined formulas, the term frequency (TF) for cow is then 0.03 (3 / 100). Now, assume we have 10 million documents and cow appears in one thousand of these. Then the inverse document frequency, using a base-10 logarithm, is log(10 000 000 / 1 000) = 4. The TF-IDF score is the product of these quantities: 0.03 × 4 = 0.12.
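The arithmetic of this example can be checked with a few lines of code. The function names and the smooth flag are assumptions introduced for illustration; the flag switches to the common "1 + document count" denominator so that its small effect on this example can be seen.

```python
import math


def term_frequency(term_count, total_terms):
    """Share of the document's words taken up by the term."""
    return term_count / total_terms


def inverse_document_frequency(num_documents, documents_with_term, smooth=False):
    """Base-10 idf; smooth=True uses 1 + document count to avoid division by zero."""
    denominator = documents_with_term + 1 if smooth else documents_with_term
    return math.log10(num_documents / denominator)


tf = term_frequency(3, 100)
idf = inverse_document_frequency(10_000_000, 1_000)
idf_smoothed = inverse_document_frequency(10_000_000, 1_000, smooth=True)
tf_idf = tf * idf
print(f"tf={tf:.2f} idf={idf:.2f} tf-idf={tf_idf:.2f} smoothed idf={idf_smoothed:.4f}")
```

With one thousand matching documents, the smoothed idf stays just below 4, so the adjustment barely changes the score in this case.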

Chapter 3: Building a New Algorithm for Ranking Nodes in Chord Networks

3.1 Targets and Missions of Research

P2P deployments are a natural basis for building distributed search networks. Proposed systems support discovering and retrieving all results, but lack the information needed to rank them. Users, however, are mainly interested in the most relevant results, not in every possible result. Using random sampling, we extend a well-known class of information retrieval ranking algorithms so that they can be applied in this distributed setting. We analyze the overhead of our method and quantify how the system scales with the number of documents, the system size, the document-to-node mapping (uniform versus non-uniform), and the type of query (rare versus common terms). Our analysis and simulations indicate that (a) these extensions are efficient and scale to large systems with little overhead, and (b) the accuracy of the results obtained by distributed ranking is comparable to that of a centralized implementation.

3.2 Idea:

Table 3.2.1: Convergence of PageRank and HITS

I wrote a small program to compare the convergence of the two algorithms, PageRank and HITS. HITS converges better than PageRank, as the figure from the experiment shows.

The experiment was run on a graph with 1000 nodes.

The blue line is the number of iterations PageRank needs to converge. The red line is the number of iterations HITS needs to converge.

Table 3.2.2: PageRank convergence time increases too fast

I also examined the convergence of the PageRank algorithm on its own.

As the figure from the experiment shows, the time PageRank needs to converge grows very quickly. It cannot be computed over a whole peer-to-peer network, because the computation time grows much faster than the size of the graph.

The figure is taken from the experimental results: convergence of the PageRank algorithm in simulated networks of 1000, 2000 and 4000 nodes (execution time).

The light blue line shows the time PageRank takes to converge on the 4000-node network. The dark blue line shows the time on the 2000-node network. The red dots show the time on the 1000-node network.

Computing time grows exponentially as the number of network nodes increases. Computing the rank of every node in the network is therefore expensive: the larger the network, the greater the loss of efficiency, and this loss is larger than the performance gained from the added nodes.

Table 3.2.3: PageRank convergence is not steady when Epsilon is small

The figure is taken from the experimental results: convergence of the PageRank algorithm in a simulated network of 1000 nodes (execution time).

The dark blue line with the red dots shows the time PageRank takes to converge on the 1000-node network under different random conditions.

With a small error threshold Epsilon, the convergence of PageRank is not completely stable (the gap between runs varies).
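The experimental code is not reproduced in this document, so the following is only a minimal sketch of how such a convergence measurement could be set up. The random graph generator, the damping factor of 0.85, the use of the summed absolute change as the stopping test against Epsilon, and all function names are assumptions made for illustration.

```python
import random


def random_graph(num_nodes, out_degree, seed=0):
    """Hypothetical test graph: every node links to out_degree random other nodes."""
    rng = random.Random(seed)
    graph = {}
    for node in range(num_nodes):
        # Draw one spare candidate so the node itself can be dropped if chosen.
        candidates = rng.sample(range(num_nodes), out_degree + 1)
        graph[node] = [t for t in candidates if t != node][:out_degree]
    return graph


def pagerank(graph, damping=0.85, epsilon=1e-6, max_iterations=1000):
    """Power iteration; returns (ranks, iterations used).

    Stops when the summed absolute change between two successive
    rank vectors falls below epsilon.
    """
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for iteration in range(1, max_iterations + 1):
        # Rank held by nodes without out-links is shared among all nodes.
        dangling = sum(rank[node] for node in graph if not graph[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in graph}
        for node, targets in graph.items():
            for target in targets:
                new_rank[target] += damping * rank[node] / len(targets)
        change = sum(abs(new_rank[node] - rank[node]) for node in graph)
        rank = new_rank
        if change < epsilon:
            return rank, iteration
    return rank, max_iterations


def iterations_by_size(sizes, out_degree=5, epsilon=1e-6):
    """Iterations PageRank needs on one random graph of each size."""
    return {size: pagerank(random_graph(size, out_degree), epsilon=epsilon)[1]
            for size in sizes}
```

Calling iterations_by_size([1000, 2000, 4000]) with different Epsilon values or different seeds gives iteration counts that can be compared across network sizes; wrapping the call in a timer would give execution times instead.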

Table 3.2.4: HITS convergence (steady, but takes much more time than PageRank)

The figure is taken from the experimental results: convergence of the HITS algorithm in a simulated network of 1000 nodes (number of iterations needed to converge). The dark blue line with the red dots shows the time HITS takes to converge on the 1000-node network under different random conditions.

With a small error threshold Epsilon, the convergence of HITS is more stable than that of PageRank.

Idea:

Combine HITS and PageRank. Use HITS to filter out the n nodes with the best content authority results, then compute PageRank on those n nodes only (n is very small compared with the total number of network nodes).

Advantages of this new approach:

• Exact results

• Accurate

• Easier to compute

• Easily feasible

• Faster
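The combined idea (filter with HITS, then run PageRank on the few nodes that remain) can be sketched as below. This is an illustration under assumptions, not the implementation developed in this thesis: the graph is a plain dictionary of out-links, the filter keeps the n nodes with the highest HITS authority score, scores are normalized to sum to one, and the function names and the damping factor of 0.85 are hypothetical.

```python
def hits_authority(graph, epsilon=1e-8, max_iterations=1000):
    """Authority scores from HITS power iteration on a dict of out-links."""
    nodes = list(graph)
    auth = {v: 1.0 for v in nodes}
    hub = {v: 1.0 for v in nodes}
    for _ in range(max_iterations):
        # Authority update: sum of the hub scores of the nodes linking in.
        new_auth = {v: 0.0 for v in nodes}
        for u, targets in graph.items():
            for v in targets:
                new_auth[v] += hub[u]
        # Hub update: sum of the authority scores of the nodes linked to.
        new_hub = {u: sum(new_auth[v] for v in graph[u]) for u in nodes}
        # Normalize so that each score vector sums to one.
        a_norm = sum(new_auth.values()) or 1.0
        h_norm = sum(new_hub.values()) or 1.0
        new_auth = {v: s / a_norm for v, s in new_auth.items()}
        new_hub = {v: s / h_norm for v, s in new_hub.items()}
        change = sum(abs(new_auth[v] - auth[v]) for v in nodes)
        auth, hub = new_auth, new_hub
        if change < epsilon:
            break
    return auth


def pagerank(graph, damping=0.85, epsilon=1e-8, max_iterations=1000):
    """PageRank by power iteration on a dict of out-links."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(max_iterations):
        # Rank held by nodes without out-links is shared among all nodes.
        dangling = sum(rank[v] for v in graph if not graph[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in graph}
        for u, targets in graph.items():
            for v in targets:
                new_rank[v] += damping * rank[u] / len(targets)
        change = sum(abs(new_rank[v] - rank[v]) for v in graph)
        rank = new_rank
        if change < epsilon:
            break
    return rank


def hits_then_pagerank(graph, n):
    """Keep the n best HITS authorities, then rank only those with PageRank."""
    auth = hits_authority(graph)
    top = sorted(graph, key=lambda v: (-auth[v], v))[:n]
    keep = set(top)
    # Induced subgraph: only links between the kept nodes survive.
    subgraph = {u: [v for v in graph[u] if v in keep] for u in top}
    return pagerank(subgraph)
```

For example, on the hypothetical graph {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c", "a"], "e": ["c"], "f": []}, calling hits_then_pagerank(graph, 3) keeps only a, b and c, so PageRank runs on three nodes instead of six. In a large network the saving comes from n being much smaller than the total number of nodes.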

We will look later, in more depth, at what it takes to build a system with these features. To begin, let us analyze a simple example from the Google search engine:
